Scientific and technological research

HARDWARE DESIGN SOLUTION FOR RESIDUAL SYNTAX ELEMENT GENERATION IN HEVC CABAC ENCODER

Tran Dinh Lam1*, Tran Xuan Tu2, Luu Thi Thu Hong1, Nguyen Manh Cuong1

Abstract: Context Adaptive Binary Arithmetic Coding (CABAC) is the only entropy coding method used in the High Efficiency Video Coding (HEVC) standard. It supports the high compression rates needed to transmit real-time UHD 4K/8K video sequences in modern video services. However, it is also considered the main throughput bottleneck of the HEVC encoder, which challenges the deployment of the standard. Since the standard was published, numerous research efforts have succeeded in proposing high-speed CABAC hardware designs that address this issue. Once the CABAC throughput is improved, its input data, i.e. the Syntax Elements (SEs), must be supplied fast enough to avoid stage stalls, which would degrade the throughput of the whole HEVC encoder. This paper proposes a hardware design solution for generating the residual SEs, which form the main workload of CABAC and require multiple scans of the residual data memory for the various SEs. While meeting the high-throughput requirement, the paper also presents an efficient residual SE generation method that reduces the number of memory accesses, which in turn reduces the dynamic power consumption and processing delay of the CABAC encoder.

Keywords: HEVC; CABAC; Residual Syntax Element; Hardware Implementation.

1. INTRODUCTION

With the growing diversity of multimedia services and the emerging popularity of High Definition (HD) and beyond-HD video formats (e.g. 4k×2k or 8k×4k resolutions), higher coding efficiency than that of the currently popular standard, H.264/AVC, is needed. The newest video coding standard, HEVC, was created by the Joint Collaborative Team on Video Coding (JCT-VC) as the successor of H.264/AVC. It has been designed to face the challenges of transmitting
real-time, high-quality video sequences over limited-bandwidth media [1]. The HEVC standard achieves almost double the compression rate of H.264/AVC, so the same video quality can be carried at half the bit rate and therefore half the bandwidth. Besides coding efficiency, processing speed, power consumption and area cost also need to be considered when adopting HEVC in high-quality video services and battery-powered applications [2].

Entropy coding is the final stage of the HEVC encoder, where CABAC is applied. This entropy coding method contributes greatly to the coding efficiency of HEVC. However, due to its high data dependency and sequential coding characteristics, CABAC is a well-known throughput bottleneck in the HEVC architecture, as it is difficult to parallelize and pipeline. This also leads to high computational and hardware complexity in the development of CABAC architectures [3]. Since the standard was published, numerous studies worldwide have proposed hardware architectures for HEVC CABAC that trade off multiple goals, including coding efficiency, throughput, hardware resources, and power consumption [2].

(Tạp chí Nghiên cứu KH&CN quân sự [Journal of Military Science and Technology Research], Special Issue of the FEE National Conference, 10-2020; section: Electronic Engineering, Physics, Measurement)

Once CABAC has been designed to encode high-throughput video sequences, its data providers must also supply enough workload to avoid stage stalls, which would degrade the overall performance of the HEVC encoder. In the HEVC hierarchy, CABAC’s workload comes from different sources such as General Encoder Control parameters, Prediction data, Filter parameters and Residual Coefficients [1]. These data appear at the input of CABAC as sequences of SEs, which are then converted into binary symbols (bins) and encoded into a bit string at the CABAC output. Table 1 shows the contributions of these sources to the CABAC input data. Obviously, Transform Unit (TU)
data, which is the matrix form of the Residual Coefficients, occupies a significant share of the CABAC workload: 75% on average and over 90% in the worst case [4, 5]. Therefore, it is necessary to focus on design strategies for this type of CABAC input data, as it is one of the main causes of CABAC throughput degradation. Besides throughput, power and area are also criteria to be considered in hardware implementation.

Table 1. Major bin contributors among the HEVC data hierarchy [4]

                                      Common Test Condition
Hierarchy level                       AI       LD-P     LD-B     RA       Worst-case
Coding tree unit/coding unit bins     5.4%     15.8%    16.7%    11.7%    1.4%
Prediction unit bins                  9.2%     20.6%    19.5%    18.8%    5.0%
Transform unit bins                   85.4%    63.7%    63.8%    69.4%    94.0%

Note: The results are reported for each hierarchy level within the HEVC context: Coding Tree Unit/Coding Unit, Prediction Unit, and Transform Unit. The common test conditions used are All-Intra (AI), Low-Delay P (LD-P), Low-Delay B (LD-B), and Random Access (RA).

This paper proposes a hardware design solution that implements Residual SE Generation targeting power savings while still providing enough data for high-speed CABAC encoders. Our contribution is a residual SE generation algorithm and a hardware implementation solution that reduce dynamic power consumption and processing delay. Generating residual SEs requires multiple accesses to the Transform Block (TB) memory for multiple scan passes. These accesses increase the dynamic power consumption and processing delay of the design; our proposed solution is therefore an efficient residual SE generation implementation in terms of power consumption and processing speed.

The rest of the paper is organized as follows. The principle of residual SE generation in the HEVC CABAC encoder and the related state of the art are presented in Section 2. Section 3 proposes the hardware architecture for residual SE generation, together with hardware design strategies for delay reduction and power savings. Section 4 gives the
implementation results and discussion, followed by the conclusion in Section 5.

2. OVERVIEW OF RESIDUAL DATA GENERATION IN HEVC

2.1 Residual Syntax Generation for CABAC encoder

The HEVC standard provides a flexible method of partitioning residual TBs, ranging from 4×4 up to 32×32 pixels, which are converted to residual coefficients after the Transformation and Quantization steps [6, 7].

Figure 1. Residual SE generation block in the block diagram of the HEVC encoder (blocks T, Q, T^-1 and Q^-1; the residual coefficients enter Residual SE Generation, whose residual SEs feed CABAC, which produces the output bits).

As shown in Figure 1, the Residual SE Generation block is applied right after the Transform and Quantization steps. It processes the residual coefficients to generate the residual SE sequences that feed the CABAC encoder. While H.264/AVC applies a zigzag scan pattern, HEVC supports a diagonal scan pattern for all TBs to convert these 2-D blocks of residual coefficients into 1-D arrays [7]. The diagonal scan starts from the bottom-right of a TB and progressively scans up to its top-left. The first diagonal scan divides a large TB into non-overlapping 4×4 sub-blocks of coefficients. These 4×4 sub-blocks are processed with the same logic and procedures across different TB sizes. The second scan occurs within each 4×4 sub-block to form a 1-D array of 16 consecutive coefficients, named a Coefficient Group (CG). Figure 2 [5, 6] describes the process of these diagonal scans.

Figure 2. Diagonal scanning: (a) in a large TB and (b) within a 4×4 TB [5].

For TBs larger than 4×4, after division into non-overlapping 4×4 sub-blocks of coefficients, a set of flags is determined to indicate whether each sub-block is significant. A significant sub-block has at least one non-zero coefficient and is signaled by a “1” flag, while a “0” flag is used to signal the insignificant
sub-block that has all zero coefficients. This set of flags is named the Coded Sub-Block Flags (CSBFs) [4]. It is then sent to CABAC as CSBF residual SEs to tell the encoder whether to process a sub-block (CSBF = “1”) or only to send CSBF = “0” without processing that sub-block. In addition, the last significant sub-block is scanned to find the last significant coefficient position, which is the entry point for the remaining scans of that TB. Figure 3 shows an example of the scanning process that generates and signals the CSBF SEs and the last significant coefficient position for a 16×16 TB.

Figure 3. Example of CSBF generation for a 16×16 TU, with the last significant coefficient position and the last significant sub-block position marked (CSBFs = “0001011111111111”).

After this step, all 4×4 sub-blocks (and 4×4 TBs as well) with CSBF = “1” are processed to generate the remaining residual SEs. The set of SEs representing the residual coefficients of each 4×4 sub-block (i.e. CG) and their binarization methods are defined in Table 2 [8]. To generate this set of residual SEs, each CG undergoes five scan passes following the same scan pattern [6]. Each scan pass generates one type of residual SE.

Table 2. Syntax Elements of 4×4 Residual Transform data [9]

Syntax Element                   Description                                                                            Binarization method
sig_coeff_flag                   Indicates whether a coefficient is zero or non-zero by a “0” or “1” flag               Fixed-Length
coeff_abs_level_greater1_flag    Flag indicating whether the absolute value of a coefficient level is greater than 1    Fixed-Length
coeff_abs_level_greater2_flag    Flag indicating whether the absolute value of a coefficient level is greater than 2    Fixed-Length
coeff_sign_flag                  Sign of a significant coefficient (0: positive; 1: negative)                           Fixed-Length
coeff_abs_level_remaining        Remaining value for the absolute value of a coefficient level                          Coefficient Absolute Level Remaining

Figure 4 shows the process of scanning to generate the residual SEs listed in Table 2 from a sub-block [5]. The diagonal scan pattern is applied to form the CG, which then undergoes the five scan passes consecutively.

Figure 4. Process of diagonal scan and scan passes [5].

2.2 State-of-the-art

Since the standard was issued, most research work has focused on CABAC implementation, as it is the main throughput bottleneck. In recent years, the input workload of CABAC has been considered a potential cause of the HEVC throughput bottleneck, particularly the residual data, which accounts for 75% of the CABAC input data on average. Bampi’s group [4, 10] has led this research direction, proposing high-throughput implementations for residual SE generation. Saggiorato et al. [10] proposed a multi-core residual SE generation architecture that can process coefficients simultaneously to provide enough data for a high-throughput CABAC encoder. Ramos et al. [4] also proposed four pipelined SE processing cores to avoid starving the CABAC input. In addition, a power-gating scheme based on an analysis of input data statistics is proposed to save energy. Their solutions are based on pipelined multi-core design strategies. This is the principal method to increase throughput; however, it comes at the cost of increased hardware area and power consumption.

In this paper, we propose a hardware design solution for residual SE generation for all TB sizes that saves power while still providing enough data for the CABAC encoder in UHD video applications. Our solution is based on a careful analysis of the internal mechanism of the scanning processes for all TB sizes in order to reduce
the number of memory accesses. This results in a reduction of both dynamic power consumption and processing delay.

3. PROPOSED ARCHITECTURE AND HARDWARE IMPLEMENTATION FOR RESIDUAL DATA GENERATION

3.1 Residual SE generation method and proposed efficient scanning algorithm

As presented in sub-section 2.1, each TB goes through several processing steps to generate the set of residual SEs that feeds the CABAC encoder. The scanning process that determines the significance of the sub-blocks (CSBFs), the last significant sub-block position and the last significant coefficient position is shown in Figure 5.

Figure 5. Diagonal scans for significant SEs: the 16×16 TB memory, the last significant sub-block position, the coefficients of the last significant sub-block, the last significant coefficient position in that sub-block, and the LUT giving the X and Y coordinates, from (0,0) to (3,3), of the last significant sub-block position and of the last significant position.

A transform block of 16×16 coefficients, for example, is diagonally scanned to divide it into 16 sub-blocks of 4×4 coefficients, in the order 0 to 15. In this scan, the CSBFs are determined (“0001011111111111” in this example) to signal the significance of each sub-block. Then the position of the last significant sub-block (the third one in the example) is identified as the entry point of the next scan. Based on this position, a Look-Up Table (LUT) is applied to determine the (Xsb, Ysb) coordinates of that last significant sub-block, which are (3, 1) for this example.
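The CSBF scan and the coordinate LUT described above can be sketched in software as follows. This is a minimal illustration, not the authors' hardware design: the function names `diag_scan` and `csbf_scan` are hypothetical, the LUT is assumed to follow the HEVC up-right diagonal scan order, and the CSBFs are assumed to be listed in reverse scan order (bottom-right sub-block first). Under these assumptions the sketch reproduces the CSBF string and the (3, 1) coordinates of the example.

```python
def diag_scan(n):
    """Up-right diagonal scan order of an n x n block as a list of (x, y)."""
    order = []
    for d in range(2 * n - 1):      # anti-diagonals where x + y == d
        for x in range(d + 1):      # x increases while y decreases
            y = d - x
            if x < n and y < n:
                order.append((x, y))
    return order


def csbf_scan(tb):
    """Scan a square TB (tb[y][x], side a multiple of 4) sub-block by sub-block.

    Returns (csbfs, idx, (x_sb, y_sb)): the CSBF string in reverse scan
    order, the index of the first "1" in that string, and the coordinates
    of that last significant sub-block (None, None if the TB is all zero).
    """
    n_sb = len(tb) // 4
    rev = diag_scan(n_sb)[::-1]     # reverse scan: bottom-right first
    flags = []
    for xs, ys in rev:
        significant = any(
            tb[4 * ys + dy][4 * xs + dx] != 0
            for dy in range(4)
            for dx in range(4)
        )
        flags.append(1 if significant else 0)
    csbfs = "".join(str(f) for f in flags)
    if 1 not in flags:
        return csbfs, None, None
    idx = flags.index(1)            # first significant sub-block met by the scan
    return csbfs, idx, rev[idx]     # rev acts as the position-to-(x, y) LUT


# Hypothetical 16x16 TB whose significance pattern matches the example.
tb = [[0] * 16 for _ in range(16)]
forward = diag_scan(4)
for p in list(range(11)) + [12]:    # significant sub-blocks, forward scan index
    xs, ys = forward[p]
    tb[4 * ys][4 * xs] = 1
print(csbf_scan(tb))                # ('0001011111111111', 3, (3, 1))
```

Using the reversed scan list directly as the LUT keeps the position-to-coordinate mapping in one table, which mirrors the single LUT that the text reuses for both sub-block and coefficient positions.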
These coordinates are used to calculate the coordinates of the last significant coefficient position, as described later. The process of determining the last significant position of the TB starts at the last significant sub-block. This 4×4 sub-block is diagonally scanned to find the position of its last significant coefficient, which is the fifth coefficient in the example. The position of this last coefficient (5) is then used to obtain its (x, y) coordinates, equal to (1, 3) in this example, from the same LUT. The X and Y coordinates of the last significant coefficient in the 16×16 TB are calculated by the generalized equation (1) below:

last_sig_coeff_z = 4 × Zsb + last_z    (1)

In which:
- last_sig_coeff_z: the x or y coordinate of the last significant coefficient of the TB, i.e. last_sig_coeff_x or last_sig_coeff_y.
- Zsb: the x or y coordinate of the last significant sub-block in the TB, i.e. Xsb or Ysb.
- last_z: the x or y coordinate of the last significant coefficient within the last significant sub-block, i.e. last_x or last_y.

For the example in Figure 5, we have (Xsb, Ysb) = (3, 1) and (last_x, last_y) = (1, 3); applying equation (1) gives the coordinates of the last significant position: last_sig_coeff_x = 4 × 3 + 1 = 13 and last_sig_coeff_y = 4 × 1 + 3 = 7.

After this step, all of the significant 4×4 sub-blocks are processed by the same procedure, which includes five scan passes, to generate the residual SEs of each sub-block as described in Table 2. Figure 6 shows the process of generating the residual SEs of a 4×4 sub-block and their output order [8]. The first scan pass generates the sig_coeff_flags, which mark a significant (non-zero) coefficient with a “1” and an insignificant (zero) coefficient with a “0”. The second scan pass evaluates whether the absolute value of a significant coefficient is greater than one, signaled by a “1” or “0” flag. Up to a maximum of 8 significant coefficients, counted from the last significant one, are
signaled by coeff_abs_level_greater1_flag. The third scan pass signals coeff_abs_level_greater2_flag, based on the absolute value of the first coefficient that has been signaled with a coeff_abs_level_greater1_flag of “1”. If the absolute value of this coefficient is greater than two, coeff_abs_level_greater2_flag is “1”, otherwise “0”. The fourth scan pass generates the signs of the significant coefficients, coeff_sign_flag, where a positive coefficient is signaled by “0” and a negative one by “1”. The last scan pass calculates coeff_abs_level_remaining, the remaining level of a significant coefficient [9].

Figure 6. Generated SEs and output order [8]: a 4×4 sub-block and the values produced by the 1st scan pass (sig_coeff_flag), 2nd scan pass (coeff_abs_level_greater1_flag), 3rd scan pass (coeff_abs_level_greater2_flag), 4th scan pass (coeff_sign_flag) and 5th scan pass (coeff_abs_level_remaining), followed by the data output order.

Figure 7. Scanning and SE generation architecture: the TB memory feeds a first scan (last significant and CSBF scanning, producing the CSBFs, last_sig_coeff_x and last_sig_coeff_y), a second scan starting from the last significant position (significant, sign, greater-one and greater-two scanning, producing sig_coeff_flags, coeff_abs_level_greater1_flags, coeff_abs_level_greater2_flag and coeff_sign_flags) and a third scan using the greater2 position (coefficient absolute level remaining scanning, producing the CALRs).

As described, for each sub-block of 4×4 coefficients, the residual SEs are generated after five scan passes, each of which accesses the TB memory to evaluate the residual coefficients. These memory accesses are the main cause of dynamic power consumption and processing delay. Except for coeff_abs_level_remaining, the remaining SEs (sig_coeff_flag, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag and coeff_sign_flag) are flagged SEs, in which each SE is a
flag, i.e. a one-bit value. In addition, Table 2 shows that Fixed-Length binarization is used for all of these flagged SEs. Therefore, we propose an efficient scanning strategy that uses a single scan to determine all flagged SEs in the same data path. Compared with the traditional method, this reduces the latency and power consumption caused by memory accesses. The proposed architecture of the Syntax Element Generation with this efficient method is shown in Figure 7.

3.2 Hardware implementation of residual SE generation with efficient scanning algorithm

Figure 8 shows the proposed hardware implementation of the architecture in Figure 7 for residual SE generation. It includes a significance scanning part (last significant and CSBF scanning), which determines the significance of each sub-block in a large TB and the last significant coefficient position within each sub-block. The remaining part (Flagged_SEs and CALR generation) generates the residual SEs of each 4×4 sub-block.
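The single-scan strategy for the flagged SEs can be illustrated with a small software model. It is a sketch under stated assumptions rather than the proposed hardware: the function names and the dictionary keys are hypothetical, the CG is assumed to be given already in coding order (starting from the last significant coefficient), and HEVC details such as context selection, inferred flags and sign data hiding are left out. The multi-pass function models the traditional approach, with one read of the CG per flagged SE type, and serves only as a reference for comparison.

```python
def flagged_ses_single_scan(cg):
    """Derive all four flagged SEs while reading each coefficient once."""
    sig, gt1, gt2, sign = [], [], [], []
    gt2_pos = None
    for pos, c in enumerate(cg):            # one memory read per coefficient
        s = 1 if c != 0 else 0
        sig.append(s)
        if not s:
            continue
        sign.append(1 if c < 0 else 0)
        if len(gt1) < 8:                    # greater1 only for the first 8 significant ones
            g1 = 1 if abs(c) > 1 else 0
            gt1.append(g1)
            if g1 and gt2_pos is None:      # greater2 only for the first greater1 == 1
                gt2_pos = pos
                gt2.append(1 if abs(c) > 2 else 0)
    return {
        "sig_coeff_flag": sig,
        "coeff_abs_level_greater1_flag": gt1,
        "coeff_abs_level_greater2_flag": gt2,
        "coeff_sign_flag": sign,
        "greater2_pos": gt2_pos,
    }


def flagged_ses_multi_scan(cg):
    """Reference model: one pass over the CG per flagged SE type (4 reads each)."""
    sig = [1 if c != 0 else 0 for c in cg]                      # pass 1
    gt1 = [1 if abs(c) > 1 else 0 for c in cg if c != 0][:8]    # pass 2
    gt2, gt2_pos, seen = [], None, 0
    for pos, c in enumerate(cg):                                # pass 3
        if c == 0:
            continue
        seen += 1
        if seen > 8:
            break
        if abs(c) > 1:
            gt2_pos, gt2 = pos, [1 if abs(c) > 2 else 0]
            break
    sign = [1 if c < 0 else 0 for c in cg if c != 0]            # pass 4
    return {
        "sig_coeff_flag": sig,
        "coeff_abs_level_greater1_flag": gt1,
        "coeff_abs_level_greater2_flag": gt2,
        "coeff_sign_flag": sign,
        "greater2_pos": gt2_pos,
    }


# Hypothetical CG: both models agree, but the single scan reads
# 16 coefficients where the multi-pass model reads 4 x 16.
cg = [0, 1, -3, 0, 2, 0, -1, 5] + [0] * 8
assert flagged_ses_single_scan(cg) == flagged_ses_multi_scan(cg)
```

In this model the position of the greater2 coefficient is returned alongside the flags, in the same way that the architecture passes a greater2 position on to the later coeff_abs_level_remaining scan.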