Design of integer motion estimator of HEVC for asymmetric motion-partitioning mode and 4K-UHD

J. Byun, Y. Jung and J. Kim

A design for an integer motion estimator of high-efficiency video coding (HEVC) is presented. HEVC supports the 64 × 64 coding tree unit, the recursive quad-tree coding unit structure and the asymmetric motion-partitioning mode to achieve a high compression ratio. These features require an integer-motion-estimation structure that is more complex than that of H.264/AVC. New structures for a memory read controller and a sum of absolute difference (SAD) summation block are proposed. The new memory read controller reduces the internal memory read time, and the new SAD summation block structure supports the recursive quad-tree coding unit structure and the asymmetric motion-partitioning mode. The proposed design is implemented in Verilog HDL and synthesised using 65 nm CMOS technology. The gate count is 3.56 M, and the internal static random access memory is about 20 kbyte. The operation frequency is 250 MHz when a 4K-Ultra high definition (UHD) (3840 × 2160P at 30 Hz) sized video is encoded.

Introduction: To provide a compression ratio higher than that of the previous standards, the inter-prediction of high-efficiency video coding (HEVC) uses a basic unit size of 64 × 64, which is called the coding tree unit (CTU), the recursive quad-tree coding unit structure and the asymmetric motion-partitioning (AMP) mode [1, 2]. These features allow more flexible size partitioning than previous standards do, but they make it difficult to implement motion-estimator hardware. Previous motion-estimator system structures are not suitable for supporting these features [3–5]. Therefore, HEVC requires a motion-estimator structure that is different from that of the previous standards.

Top-level structure: Our system consists of search area memories, a current memory, 256 process elements (PEs), a sum of absolute difference (SAD) summation block, a cost block and a comparison tree. The search area memories and the current memory store the pixel values of the reference frame and the current coding unit, respectively. One PE calculates the SAD value of a 4 × 4 block. The SAD summation block calculates the various SAD values using the results of the PEs. The cost block computes the cost values of the variously sized blocks, and the comparison tree block selects the best mode, which is the one with the smallest cost value. Since the basic unit of HEVC is 16 times larger than that of H.264/AVC, and HEVC uses the recursive quad-tree coding unit and the AMP mode, new structures for the memory read controller and the SAD summation block are required.

Memory read controller: Fig. 1 shows the scan order of the processing area, which is the region of the search area that is being calculated at a given moment. Since the search area memories consist of line memories, each line memory of the search area reads only one byte per clock cycle. There is no problem when the scan order is in direction (a) or (c). However, when the scan order is in direction (b), the line memory of the last search area line has to read 64 bytes per clock cycle. The memory read cycles increase by four clock cycles when the memory bit width is 128 bits, which creates 388 800 unnecessary clock cycles in one 4K-Ultra high definition (UHD) (3840 × 2160P at 30 Hz) frame. To solve this problem, we added registers on the bottom line. Fig. 2 shows the data flow in the search area registers. The solid line indicates the data flow in direction (a), and the two types of dashed lines show the data flow in direction (b) or (c). The grey registers are added to the registers on the bottom line. By reading the data beforehand, these registers reduce the read cycles to only one clock cycle in direction (b).

Fig. 1 Scan order of search area memories (search area 127 × 127, processing area 64 × 64, scan directions a, b and c)

Fig. 2 Data flow of search area registers

SAD summation block: The SAD summation block computes SAD values of various sizes using the 256 4 × 4 SAD values that are calculated by the PEs. H.264/AVC uses only seven block sizes. However, because HEVC uses the recursive quad-tree coding unit structure and the AMP mode, it needs 27 block sizes [1, 2]. The various block sizes need a SAD summation block with a structure different from that of H.264/AVC. Fig. 3a shows the structure of the SAD summation block when N is 4, 8 or 16, and Fig. 3b shows the structure of the SAD summation block when N is 32. Since HEVC uses the recursive quad-tree coding unit structure, the number of structures for N = 4 is 16, for N = 8 it is 8, and for N = 16 and for N = 32 only one is needed. As shown in Fig. 3c, these structures are connected hierarchically. If N = 32, the process of the SAD summation block is similar to that of H.264/AVC. However, the bold lines in Fig. 3a indicate the AMP mode when N is 4, 8 or 16. These parts calculate the SAD values of the AMP mode efficiently by reusing small SAD values. The proposed SAD summation block computes the SAD values of every HEVC inter-prediction mode and depth by adding small neighbouring SADs.

Fig. 3 Structure of SAD summation block
a N = 4, 8 or 16
b N = 32
c Hierarchical structure of SAD summation block

Cost block and comparison tree: The cost block calculates the cost values of every prediction mode and depth using the SAD values and a motion vector. The comparison tree determines the final prediction mode and depth of the CTU by comparing the results of the cost block calculation.

Pipeline process: Fig. 4 shows the pipeline process of the proposed system. The memory read stage uses only one clock cycle; because of the registers added on the bottom line, no additional clock cycles are required in scan direction (b). In total, the proposed integer-motion-estimator system uses 4105 clock cycles to process the integer motion estimation of one CTU.

Fig. 4 Pipeline process of proposed system (stages: memory read, PE, SAD summation, cost block, comparison tree; 4096 iterations, 4105 clock cycles in total)

Synthesised results: The proposed system was implemented in Verilog HDL and synthesised using 65 nm CMOS technology. The gate count is 3.56 M and the internal static random access memory (SRAM) is 20 225 bytes. The operation frequency is 250 MHz when a 4K-UHD-sized video is encoded. Table 1 shows a comparison of the proposed system with the previous H.264/AVC integer-motion-estimation system [5]. The proposed system supports a greater variety of block sizes and a higher-resolution 4K-UHD video than the previous one does.

Table 1: Comparison of proposed system with previous H.264/AVC integer-motion-estimation system [5]

                              [5]                                   Proposed
Video standard                H.264/AVC                             HEVC
Process                       0.18 μm                               65 nm
Gate count (SRAM)             1.45 M (2.97 kb)                      3.56 M (20.23 kb)
Block size                    16 × 16 to 4 × 4                      64 × 64 and below
                              (seven kinds, without AMP)            (27 kinds, with AMP)
Search range                  64 × 64                               64 × 64
Number of reference frames
Operation frequency           130 MHz (FHD)                         250 MHz (4K-UHD)

Conclusion: This Letter presents a motion-estimator structure that effectively supports the recursive quad-tree coding unit and the AMP mode and reduces the number of memory read cycles. The designed integer-motion-estimator system uses 65 nm CMOS technology. The gate count is 3.56 M with 20.23 kb of internal SRAM. It can encode a 4K-UHD video in real time at a clock speed of 250 MHz.

Acknowledgment: This work was supported by the IT R&D program of MOTIE/KEIT (10035389, research on high speed and low power wireless communication SoC for high resolution video information mining).

© The Institution of Engineering and Technology 2013
24 March 2013
doi: 10.1049/el.2013.0936

J. Byun, Y. Jung and J. Kim (School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea)

References

1 Bross, B., Han, W.-J., Sullivan, G.J., Ohm, J.-R., and Wiegand, T.: ‘High Efficiency Video Coding (HEVC) Text Specification Draft 9’, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), October 2012, JCTVC-K1003
2 Francois, E., Guillo, L., Ichigaya, A., and Yu, H.: ‘TE12: report on AMP evaluation’, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), October 2010, JCTVC-C030
3 Kang, J.S., Lee, Y.T., and Jeon, J.W.: ‘Motion estimator with adaptive reduction of search points’, Electron. Lett., 2003, 39, (22), pp. 1584–1586
4 Hsia, S.-C., and Hong, P.-Y.: ‘Very large scale integration (VLSI) implementation of low-complexity variable block size motion estimation for H.264/AVC coding’, IET Circuits Devices Syst., 2010, 4, (5), pp. 414–424
5 Kao, C.Y., and Lin, Y.L.: ‘A memory-efficient and highly parallel architecture for variable block size integer motion estimation in H.264/AVC’, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2010, 18, (6), pp. 866–874

ELECTRONICS LETTERS 29th August 2013 Vol. 49 No. 18
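As a rough cross-check of the figures quoted in the 'Memory read controller' and 'Pipeline process' sections, the arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the authors' derivation: the CTU count is taken by area (2025 per frame), 'four clock cycles' is read here as the total time for a 64-byte row on a 128-bit memory (three more than the single-cycle case), and the number of direction-(b) steps per CTU (`B_STEPS_PER_CTU = 64`) is a hypothetical value chosen because it reproduces the quoted 388 800 cycles.

```python
# Back-of-envelope check of two figures quoted in the Letter.
FRAME_W, FRAME_H, FPS = 3840, 2160, 30   # 4K-UHD at 30 Hz (from the Letter)
CTU = 64                                  # CTU edge in pixels (from the Letter)
SRAM_WIDTH_BITS = 128                     # memory bit width (from the Letter)
ROW_BYTES = 64                            # bytes needed on a direction-(b) step (from the Letter)
CYCLES_PER_CTU = 4105                     # pipeline length per CTU (from the Letter)

# ASSUMPTION: CTUs are counted by area, which gives 2025 per frame.
ctus_per_frame = (FRAME_W * FRAME_H) // (CTU * CTU)

# Without the extra registers, a 64-byte row needs four 128-bit reads.
reads_per_b_step = (ROW_BYTES * 8) // SRAM_WIDTH_BITS
# ASSUMPTION: the penalty is the reads beyond the single-cycle case.
extra_cycles_per_b_step = reads_per_b_step - 1

# ASSUMPTION: 64 direction-(b) steps per CTU (hypothetical, fits the quoted total).
B_STEPS_PER_CTU = 64
extra_cycles_per_frame = ctus_per_frame * B_STEPS_PER_CTU * extra_cycles_per_b_step

# Clock rate needed to process every CTU of every frame in real time.
required_hz = ctus_per_frame * CYCLES_PER_CTU * FPS

print(extra_cycles_per_frame)  # 388800
print(required_hz)             # 249378750
```

Under these assumptions the required clock rate comes out just below 250 MHz, which is consistent with the stated operating frequency. Counting partial CTU rows at the frame edge (60 × 34 CTUs) would change both results, so the numbers should be read as a plausibility check only.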
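The idea described in the 'SAD summation block' section, building every larger SAD by adding the 4 × 4 SADs produced by the 256 PEs, can be sketched in software as follows. This is a minimal behavioural model, not the hardware structure of Fig. 3: the function names (`pe_sads`, `rect_sad`, `partition_sads`) are hypothetical, the partition labels follow the usual HEVC naming (2NxnU, 2NxnD, nLx2N, nRx2N for AMP), and the sketch recomputes each rectangle directly, whereas hardware would reuse partial sums.

```python
import random

PE = 4    # each PE covers one 4x4 block (from the Letter)
CTU = 64  # CTU edge in pixels (from the Letter)

def pe_sads(cur, ref):
    """Return a 16x16 grid of 4x4 SADs, one entry per PE."""
    n = CTU // PE
    grid = [[0] * n for _ in range(n)]
    for y in range(CTU):
        for x in range(CTU):
            grid[y // PE][x // PE] += abs(cur[y][x] - ref[y][x])
    return grid

def rect_sad(grid, x, y, w, h):
    """Sum the PE SADs inside a pixel rectangle aligned to the 4x4 grid."""
    assert all(v % PE == 0 for v in (x, y, w, h))
    return sum(grid[r][c]
               for r in range(y // PE, (y + h) // PE)
               for c in range(x // PE, (x + w) // PE))

def partition_sads(grid, x, y, size):
    """SADs of all partitions of one size x size coding unit (size 16, 32 or 64)."""
    half, q = size // 2, size // 4
    def s(dx, dy, w, h):
        return rect_sad(grid, x + dx, y + dy, w, h)
    return {
        "2Nx2N": [s(0, 0, size, size)],
        "2NxN":  [s(0, 0, size, half), s(0, half, size, half)],
        "Nx2N":  [s(0, 0, half, size), s(half, 0, half, size)],
        "NxN":   [s(0, 0, half, half), s(half, 0, half, half),
                  s(0, half, half, half), s(half, half, half, half)],
        # Asymmetric (AMP) partitions: one quarter plus three quarters.
        "2NxnU": [s(0, 0, size, q), s(0, q, size, size - q)],
        "2NxnD": [s(0, 0, size, size - q), s(0, size - q, size, q)],
        "nLx2N": [s(0, 0, q, size), s(q, 0, size - q, size)],
        "nRx2N": [s(0, 0, size - q, size), s(size - q, 0, q, size)],
    }

# Example with random pixel data (illustrative input only).
random.seed(0)
cur = [[random.randrange(256) for _ in range(CTU)] for _ in range(CTU)]
ref = [[random.randrange(256) for _ in range(CTU)] for _ in range(CTU)]
grid = pe_sads(cur, ref)
for mode, parts in partition_sads(grid, 0, 0, 64).items():
    print(mode, parts)
```

Because every partition is a sum over disjoint 4 × 4 SADs, the parts of each mode always add up to the 2Nx2N value of the same coding unit, which gives a simple self-check for any implementation of the summation tree.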
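The 'Cost block and comparison tree' section states only that costs are derived from SAD values and a motion vector and then compared. A minimal sketch of that flow is given below. The cost formula is an assumption, not taken from the Letter: it uses the common encoder heuristic of SAD plus a weighted motion-vector rate, with a hypothetical rate proxy based on signed Exp-Golomb code lengths and a hypothetical weight `lam`. The comparison tree is modelled as a pairwise minimum reduction.

```python
def mv_bits(mvx, mvy):
    """HYPOTHETICAL rate proxy: signed Exp-Golomb length of each component."""
    def se_len(v):
        k = 2 * abs(v) - (1 if v > 0 else 0)  # signed-to-unsigned mapping
        return 2 * (k + 1).bit_length() - 1   # Exp-Golomb code length in bits
    return se_len(mvx) + se_len(mvy)

def cost(sad, mv, lam=4):
    """ASSUMED cost model: SAD plus lam times the motion-vector rate proxy."""
    return sad + lam * mv_bits(*mv)

def comparison_tree(candidates):
    """Pairwise minimum reduction over (cost, label) tuples."""
    level = list(candidates)
    while len(level) > 1:
        nxt = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Illustrative inputs only: three candidate modes with made-up SADs and vectors.
candidates = [
    (cost(120, (0, 0)), "2Nx2N"),
    (cost(95, (1, -1)), "2NxnU"),
    (cost(130, (3, 2)), "Nx2N"),
]
best = comparison_tree(candidates)
print(best)
```

The tree form mirrors how a hardware comparator network would halve the number of candidates at each level; in software the result is the same as taking the minimum over the whole list.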