VLSI
Memory-Efficient Hardware Architecture of 2-D Dual-Mode Lifting-Based Discrete Wavelet Transform for JPEG2000

Fig. 10. The detailed operations of the first stage 1-D DWT.

Fig. 11. The detailed operations of the second stage 1-D DWT. (a) The HF (HH and HL) part operations. (b) The LF (LH and LL) part operations.

5. VLSI architecture and implementation for the 2-D dual-mode LDWT

The previous section discussed the IRSA approach; this section describes its architecture. The control unit manages the reads from off-chip memory. In IRSA, two pixels are scanned concurrently, so the system needs two processing units. For 2-D LDWT processing, the pixels first pass through the first stage 1-D DWT. Its outputs are then fed to the second stage 1-D DWT, which produces the four subband coefficients HH, HL, LH, and LL. The architecture therefore has two parts, the first stage 1-D DWT and the second stage 1-D DWT. Here we concentrate on the 2-D 5/3 mode LDWT.

5.1 The first stage 1-D LDWT

The first stage 1-D LDWT architecture consists of the following units: signal arrangement unit, multiplication and accumulation cell (MAC), multiplexer (MUX), and FIFO register. The block diagram is shown in Fig. 12. The signal arrangement unit consists of three registers, R1, R2, and R3. Pixels enter R1 first; the content of R1 is then transferred to R2 and on to R3 while R1 keeps reading the following pixels, so the unit behaves like a shift register. As soon as R1, R2, and R3 all hold signal data, the MAC starts operating. The operation of the signal arrangement unit is shown in Fig. 13, where the MAC operates on the clock cycles marked with gray circles.
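The shift-register behavior of the signal arrangement unit can be modeled in a few lines of Python. This is only a behavioral sketch, not the authors' hardware description: the function name `signal_arrangement` is hypothetical, and the rule that the MAC fires on every second clock once the three registers are full is an assumption made here to mimic the gray-circle clocks of Fig. 13.

```python
from collections import deque


def signal_arrangement(pixels):
    """Yield (R3, R2, R1) for every clock on which the MAC is assumed to fire."""
    # regs[0] plays the role of R3 (oldest sample), regs[-1] of R1 (newest).
    regs = deque(maxlen=3)
    for clock, pixel in enumerate(pixels, start=1):
        # Appending shifts R2 -> R3 and R1 -> R2, then loads the new pixel into R1.
        regs.append(pixel)
        # Assumption: the MAC fires once all three registers are full,
        # and then on every second clock (one odd sample centered in R2).
        if len(regs) == 3 and clock % 2 == 1:
            r3, r2, r1 = regs
            yield (r3, r2, r1)


windows = list(signal_arrangement([10, 20, 30, 40, 50]))
```

With five input pixels, the sketch produces two windows, each centered on an odd-indexed sample with its two even neighbors on either side.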
Fig. 12. The architecture of the first stage 1-D DWT.

Fig. 13. The operation of the signal arrangement unit (for example, the IRSA signal in N1).

Fig. 14. The block diagram of MAC.

Calculating a low-frequency coefficient requires two high-frequency coefficients and an original pixel. Internal register R4 stores the original even pixel (N1) and internal register R9 stores the original odd pixel (N2). After the MAC operation, the content of R3 is simply shifted to R4. The FIFO stores the high-frequency coefficients needed to calculate the low-frequency coefficients.
Register R5 has two functions: 1) it stores the high-frequency coefficients for the low-frequency coefficient calculation, and 2) it serves as a signal buffer for the MAC. The MAC needs time to compute its result, so its output cannot be fed directly to the output or to the following operation without risking synchronization errors. R5 therefore acts as an output buffer for the MAC and prevents errors in the following operations.

In the 5/3 integer lifting-based operations, the MAC computes the high-frequency output, (a1+a3)/2 + a2, and the low-frequency output, (a1+a3)/4 + a2. There are two multiplication coefficients, 1/2 and 1/4. To save hardware, both multiplications are implemented with shifters, so the MAC needs only adders, a complementer, and shifters. The MAC block diagram is shown in Fig. 14, where a1, a2, and a3 are the inputs, the complement block is the 2's complement converter, and ">>" is the right shifter.

5.2 The second stage 1-D LDWT

Similar to the first stage 1-D DWT, the second stage 1-D DWT consists of the following units: signal arrangement unit, MAC, and MUX, as shown in Fig. 15. Because of the parallel architecture, the first stage 1-D DWT generates two outputs concurrently, and these two outputs must be merged in the second stage 1-D DWT. The signal arrangement unit performs this merging; Fig. 16 shows its processing diagram. At the beginning, signals H0 and H1 arrive from IN1 and IN2 and are stored in R3 and R4, respectively. At the next clock, H0 and H1 are moved to R1 and R2, respectively, and concurrently the new signals H3 and H4 from IN1 and IN2 are stored in R3 and R4, respectively. The signal arrangement unit repeats this operation to supply the input signals of the second stage 1-D DWT.
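The shift-based MAC arithmetic of Section 5.1 can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the circuit of Fig. 14: it follows the standard JPEG2000 reversible 5/3 convention, in which the predict step subtracts the neighbor average (presumably the role of the 2's complement converter) and the update step adds a rounding offset of 2 before the shift by 2. The function names and the symmetric border extension are likewise assumptions.

```python
def mac_high(a1: int, a2: int, a3: int) -> int:
    """Predict step: a2 - floor((a1 + a3) / 2), division done as a right shift by 1."""
    return a2 - ((a1 + a3) >> 1)


def mac_low(d1: int, a2: int, d3: int) -> int:
    """Update step: a2 + floor((d1 + d3 + 2) / 4), division done as a right shift by 2."""
    return a2 + ((d1 + d3 + 2) >> 2)


def lift_53(x):
    """One level of 1-D 5/3 lifting on an even-length list of integers."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("expected an even number of samples")
    high = []
    for i in range(1, n, 2):
        # Assumed symmetric extension at the right border: x[n] mirrors x[n - 2].
        right = x[i + 1] if i + 1 < n else x[i - 1]
        high.append(mac_high(x[i - 1], x[i], right))
    low = []
    for k in range(n // 2):
        # Assumed symmetric extension at the left border: d[-1] mirrors d[0].
        left = high[k - 1] if k > 0 else high[0]
        low.append(mac_low(left, x[2 * k], high[k]))
    return low, high


low, high = lift_53([10, 20, 30, 40])
```

Only additions, a subtraction, and right shifts appear in the two MAC functions, which mirrors the adders, complementer, and shifters listed in the text. Python's `>>` on negative integers rounds toward minus infinity, matching an arithmetic shift in hardware.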
Fig. 15. The block diagram of the second stage 1-D LDWT.

Fig. 16. Signal merging process for the signal arrangement unit.

5.3 2-D LDWT architecture

In our IRSA operation, IN1 and IN2 read the signals of the even rows and the odd rows, respectively, in zig-zag order, as detailed in Fig. 17. The block diagram of the proposed 2-D LDWT is shown in Fig. 19(b). It consists of two stages, the first stage 1-D DWT and the second stage 1-D DWT. This architecture needs only a small amount of transpose memory.

Let us consider a 4×4 image. The signal processing of the first stage 1-D DWT is shown in Fig. 18. The pixels from the even rows are processed by the upper 1-D DWT, and the pixels from the odd rows are processed by the lower 1-D DWT. Each 1-D DWT generates one set of 2×2 high-frequency coefficients and one set of 2×2 low-frequency coefficients. The generated coefficients are fed to the second stage 1-D DWT in the direction indicated by the arrowheads, so the input of each second stage 1-D DWT is a set of 2×4 signals. The signal processing of the second stage 1-D DWT is shown in Fig. 19. Each second stage 1-D DWT processes its 2×4 signals to generate HH, HL, LH, and LL, each containing 2×2 signal data.

The complete architecture of the 2-D LDWT is shown in Fig. 19. The complete 2-D LDWT consists of four parts: two sets of the first stage 1-D DWT, two sets of the second stage 1-D DWT, the control unit, and the MAC unit.

Fig. 17. The input signal sequences. (a) IN1 reads the signal of the even rows in zig-zag order. (b) IN2 reads the signal of the odd rows in zig-zag order.
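The 4×4 example above can be checked with a functional model of the two-stage transform. The sketch below is a plain separable 5/3 transform (rows first, then columns), not a model of the IRSA scan order or of the parallel hardware; the function names, the row-then-column ordering, the symmetric border extension, and the JPEG2000-style rounding are all assumptions made for illustration.

```python
def lift_53(x):
    """One level of 1-D 5/3 lifting (assumed JPEG2000 reversible form)."""
    n = len(x)
    high = []
    for i in range(1, n, 2):
        right = x[i + 1] if i + 1 < n else x[i - 1]  # symmetric extension
        high.append(x[i] - ((x[i - 1] + right) >> 1))
    low = []
    for k in range(n // 2):
        left = high[k - 1] if k > 0 else high[0]  # symmetric extension
        low.append(x[2 * k] + ((left + high[k] + 2) >> 2))
    return low, high


def transform_columns(block):
    """Second stage: apply the 1-D lifting to every column of a block."""
    pairs = [lift_53(list(col)) for col in zip(*block)]
    low = [list(row) for row in zip(*(lo for lo, _ in pairs))]
    high = [list(row) for row in zip(*(hi for _, hi in pairs))]
    return low, high


def dwt2_53(image):
    """First stage on rows, second stage on columns; returns the four subbands."""
    pairs = [lift_53(list(row)) for row in image]
    l_band = [lo for lo, _ in pairs]  # low-frequency (LF) part
    h_band = [hi for _, hi in pairs]  # high-frequency (HF) part
    ll, lh = transform_columns(l_band)  # LF part yields LL and LH
    hl, hh = transform_columns(h_band)  # HF part yields HL and HH
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


image = [[4 * r + c for c in range(4)] for r in range(4)]
subbands = dwt2_53(image)
flat = dwt2_53([[5] * 4 for _ in range(4)])
```

For any 4×4 input, each of the four subbands comes out as a 2×2 block, in line with the sizes quoted in the text; for a constant image, all the energy ends up in LL and the other three subbands are zero.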
Fig. 18. The signal process of the two-stage LDWT. (a) First stage 1-D LDWT. (b) Second stage 1-D LDWT.

According to (12) and (13), the proposed IRSA architecture can also be applied to the 9/7 mode LDWT. Fig. 20 illustrates the approach. From Figs. 10 and 11 in Section 3, the original signals (denoted as black circles) for both the 5/3 and 9/7 mode LDWT can be processed by the same IRSA for the first stage 1-D DWT operation. The high-frequency signals (denoted as grey circles) and the correlated low-frequency signals, together with the results of the first stage, are used to compute the second stage 1-D DWT coefficients. The 5/3 mode LDWT is much easier to compute than the 9/7 mode LDWT, and the register arrangement in Figs. 12 and 15 is simple. To implement the 9/7 mode LDWT with the same system architecture as the 5/3 mode LDWT, the following modifications are needed:

1) The control signals of the MUX in Figs. 12 and 15 must be modified. The registers must be rearranged so that the MAC block can process the 9/7 parameters.

2) The wavelet coefficients of the two modes are different. The coefficients are α = 1/2 and β = 1/4 for the 5/3 mode LDWT, but α = −1.586134142, β = −0.052980118, γ = +0.882911075, and δ = +0.443506852 for the 9/7 mode LDWT. For calculation simplicity and good precision, we can use the integer approach proposed by Huang et al. (Huang et al., 2004) and Martina and Masera (Martina & Masera, 2007) for the 9/7 mode LDWT calculation. Similar to the multiplication implementation by shifters and adders in the 5/3 mode LDWT, we can further adopt the shifter approach proposed in (Huang et al., 2005) to implement the 9/7 mode LDWT.

3) According to the characteristics of the 9/7 mode LDWT, the control unit in Fig. 19(b) must be modified accordingly.

Fig. 19. The complete 2-D DWT block diagram. (a) DSP diagram of the 2-D LDWT. (b) System diagram of the 2-D LDWT.
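The idea of replacing the 9/7 multiplications by shifters and adders can be sketched generically as follows. This is not the specific scheme of the cited works: it simply quantizes each lifting coefficient to a fixed-point integer and accumulates one shifted copy of the input per set bit. The word length `FRAC_BITS = 8` and the helper names `quantize` and `shift_add_multiply` are assumptions chosen for illustration only.

```python
# 9/7 lifting coefficients as quoted in the text.
ALPHA = -1.586134142
BETA = -0.052980118
GAMMA = 0.882911075
DELTA = 0.443506852

FRAC_BITS = 8  # assumed number of fractional bits; not taken from the source


def quantize(coeff: float, frac_bits: int = FRAC_BITS) -> int:
    """Round a real coefficient to an integer with frac_bits fractional bits."""
    return round(coeff * (1 << frac_bits))


def shift_add_multiply(x: int, coeff: float, frac_bits: int = FRAC_BITS) -> int:
    """Approximate x * coeff using only shifts, adds, and one negation."""
    q = quantize(coeff, frac_bits)
    magnitude = abs(q)
    acc = 0
    bit = 0
    while magnitude:
        if magnitude & 1:
            acc += x << bit  # one shifted copy of x per set bit of the constant
        magnitude >>= 1
        bit += 1
    if q < 0:
        acc = -acc  # 2's complement negation for negative coefficients
    return acc >> frac_bits  # drop the fractional bits (floor)


quantized = {name: quantize(value) for name, value in
             [("alpha", ALPHA), ("beta", BETA), ("gamma", GAMMA), ("delta", DELTA)]}
approx = shift_add_multiply(100, ALPHA)
```

With 8 fractional bits, α is represented as −406/256, so an input of 100 maps to −159, against an exact product of about −158.6. A longer word length reduces this error at the cost of more adders.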
Fig. 20. The processing procedures of 2-D dual-mode LDWTs under the same IRSA architecture.
The multi-level DWT computation can be implemented in a similar manner with the high-performance 1-level 2-D LDWT. For the multi-level computation, this architecture needs N²/4 off-chip memory. As illustrated in Fig. 21, the off-chip memory temporarily stores the LL subband coefficients for the next iteration. The second level computation requires N/2 counters and N/2 FIFOs for the control unit. The third level computation requires N/4 counters and N/4 FIFOs for the control unit. In general, the jth level computation needs N/2^(j-1) counters and N/2^(j-1) FIFOs.

Fig. 21. The multilevel 2-D DWT architecture.

6. Experimental results and comparisons

The 2-D dual-mode LDWT is designed as a trade-off between low transpose memory and low complexity in the VLSI architecture. Tables 2 and 3 compare the performance of the proposed architecture with other similar architectures. The comparison results indicate that the proposed VLSI architecture outperforms previous works in terms of transpose memory size, requiring about 50% less memory than the JPEG2000 standard (Chen, 2004) architecture. Moreover, the 2-D LDWT is frame-based, and its implementation bottleneck is the huge transpose memory. Our architecture needs fewer memory units, and its latency is fixed at (3/2)N+3 clock cycles. It can also provide an embedded symmetrical extension function. The proposed IRSA approach has the advantages of memory efficiency and high speed. The proposed 2-D dual-mode LDWT adopts parallel and pipelined schemes to reduce the transpose memory and increase the operation speed. Shifters and adders replace multipliers in the computation to reduce the hardware cost.

Chen and Wu (Chen & Wu, 2002) proposed a folded and pipelined architecture to compute the 2-D 5/3 lifting-based DWT, using a transpose memory size of 2.5N for an N×N 2-D DWT.
This lifting architecture for vertical filtering, with two adders and one multiplier, is divided into two parts, and each part has one adder and one multiplier. Because the two parts are activated in different cycles, they can share the same adder and multiplier. This increases the hardware utilization and reduces the latency. However, given the characteristics of the signal flow, it also increases the complexity.

A 256×256 2-D LDWT was designed and simulated in Verilog HDL and then synthesized with the Synopsys Design Compiler using the TSMC 0.18 μm 1P6M CMOS standard process technology. The detailed specs of the 256×256 2-D LDWT are listed in Table 4.

Table 3. Comparisons of the 2-D architectures for 9/7 LDWT. (L: the filter length.)
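The closed-form expressions quoted in the text (N²/4 off-chip memory, N/2^(j-1) counters and FIFOs at level j, and a latency of (3/2)N+3 clock cycles), together with the computing-time expression (3/4)N²+(3/2)N+7 listed in Table 4, can be evaluated directly for a given image size. The short Python sketch below does so; the function names are hypothetical, the unit of the off-chip memory figure is left as in the text, and N is assumed to be a power of two.

```python
def offchip_memory_size(n: int) -> int:
    """Off-chip storage for the LL subband between levels: N^2 / 4."""
    return (n * n) // 4


def control_units_at_level(n: int, level: int) -> int:
    """Counters (and, equally, FIFOs) needed at level j: N / 2^(j-1)."""
    if level < 1:
        raise ValueError("level must be at least 1")
    return n >> (level - 1)


def latency_cycles(n: int) -> int:
    """Latency in clock cycles: (3/2)N + 3."""
    return (3 * n) // 2 + 3


def computing_time_cycles(n: int) -> int:
    """Computing time in clock cycles: (3/4)N^2 + (3/2)N + 7."""
    return (3 * n * n) // 4 + (3 * n) // 2 + 7


N = 256
summary = {
    "offchip_memory": offchip_memory_size(N),
    "level2_units": control_units_at_level(N, 2),
    "level3_units": control_units_at_level(N, 3),
    "latency": latency_cycles(N),
    "computing_time": computing_time_cycles(N),
}
```

For N = 256 the sketch gives a latency of 387 clock cycles and a computing time of 49,543 clock cycles, which agrees with the figures reported in Table 4.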
Table 2. 5/3 LDWT architectures (Ours; Chiang & Hsia, 2005; Mei et al., 2006; Huang et al., 2005; Wu & Lin, 2005; Diou et al., 2001; Andra et al., 2002; ...) compared by transpose memory (bytes), computation time, adders, and multipliers.

Table 4. Design specification of the proposed 2-D DWT.

Chip specification: N = 256, Tile...
...: 29,196 gates
...: 1.8 V
...: TSMC 0.18 μm 1P6M (CMOS)
... (Transpose + Internal): 2-D 5/3 DWT: 512 bytes; 2-D 9/7 DWT: 1,024 bytes
Latency: (3/2)N+3 = 387 clock cycles
Computing time: (3/4)N²+(3/2)N+7 = 49,543 clock cycles
Maximum clock rate: 83 MHz

References

... (April 2004) pp. 386-398
Chen, P. & Woods, J. W. (2004). Bidirectional MC-EZBC with lifting implementation, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 10, (October 2004) pp. 1183-1194
Chen, S.-C. & Wu, C.-C. (2002). An architecture of 2-D 3-level lifting-based discrete wavelet transform, VLSI Design/CAD Symposium, (August 2002) pp. 351-354
Chiang, J.-S. & Hsia, C.-H. (2005). An efficient ...
... Transactions on Signal Processing, Vol. 53, No. 4, (April 2005) pp. 1575-1586
Huang, C.-T.; Tseng, P.-C. & Chen, L.-G. (2005). Generic RAM-based architecture for two-dimensional discrete wavelet transform with line-based method, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 7, (July 2005) pp. 910-919
ISO/IEC 15444-1 JTC1/SC29 WG1 (2000). JPEG 2000 Part 1 Final Committee Draft Version 1.0, Information Technology
ISO/IEC JTC1/SC29/WG1 Wgln 1684 (2000). JPEG 2000 Verification Model 9.0
ISO/IEC 15444-1 JTC1/SC29 WG1 (2000). Motion JPEG2000, ISO/IEC 15444-3, Information Technology
ISO/IEC JTC1/SC29 WG11 (2001). Coding of Moving Pictures and Audio, Information Technology
Jiang, W. & Ortega, A. (2001). ...
... Multimedia Workshops, (September 2000) pp. 45-49
Marino, F. (2000). Efficient high-speed/low-power pipelined architecture for the direct 2-D discrete wavelet transform, IEEE Transactions on Circuits and Systems II, Vol. 47, No. 12, (December 2000) pp. 1476-1491
Martina, M. & Masera, G. (2007). Folded multiplierless lifting-based wavelet pipeline, IET Electronics Letters, Vol. 43, No. 5, (March 2007) pp. 27-28
Mei, ...
... High-speed and memory-efficient VLSI design of 2-D DWT for JPEG2000, IET Electronics Letters, Vol. 42, No. 16, (August 2006) pp. 907-908
Ohm, J.-R. (2005). Advances in scalable video coding, Proceedings of the IEEE, Invited Paper, Vol. 93, No. 1, (January 2005) pp. 42-56
Richardson, I. (2003). H.264 and MPEG-4 Video Compression, John Wiley & Sons Ltd
Seo, Y.-H. & Kim, D.-W. (2007). VLSI architecture of line-based ...