Hardware design solution for residual syntax element generation in HEVC CABAC encoder

Abstract: Context Adaptive Binary Arithmetic Coding (CABAC) is the only entropy coding method used in the High Efficiency Video Coding (HEVC) standard, whose high compression rate allows real-time UHD 4K/8K video sequences to be transmitted in various modern video services. However, CABAC is also considered the main throughput bottleneck of the HEVC encoder, which challenges the deployment of the standard. Since the standard was published, numerous research efforts have succeeded in proposing high-speed CABAC hardware designs that address this issue. Once the CABAC throughput is improved, its input data, i.e. the Syntax Elements (SEs), must be supplied fast enough to avoid stage stalls, which would degrade the throughput of the whole HEVC encoder. This paper proposes a hardware design solution for generating the residual Syntax Elements, which form the main workload of CABAC and require accessing the residual data memory to perform multiple scans for the various SEs. While meeting the high-throughput requirement, the paper also presents an efficient method of residual SE generation that reduces the number of memory accesses, resulting in lower dynamic power consumption and processing delay in the CABAC encoder.

Scientific and Technological Research
Journal of Military Science and Technology (Tạp chí Nghiên cứu KH&CN quân sự), Special Issue of the FEE National Conference, 10 - 2020

Tran Dinh Lam 1*, Tran Xuan Tu 2, Luu Thi Thu Hong 1, Nguyen Manh Cuong 1

Keywords: HEVC; CABAC; Residual Syntax Element; Hardware Implementation.

1. INTRODUCTION

With the diversification of multimedia services, High Definition (HD) and beyond-HD video formats (e.g. 4k×2k or 8k×4k resolutions) have become an emerging trend, so a coding efficiency higher than that of the current popular standard, H.264/AVC, is necessary.
The newest video coding standard, HEVC, has been created by the Joint Collaborative Team on Video Coding (JCT-VC) as the successor of H.264/AVC. It has been designed to face the challenges of transmitting real-time, high-quality video sequences over limited-bandwidth media [1]. The HEVC standard achieves almost double the compression rate of H.264/AVC, so the same video quality can be carried at about half the bit rate and therefore about half the bandwidth. Besides coding efficiency, processing speed, power consumption and area cost also need to be considered when adopting HEVC in high-quality video services and battery-powered applications [2].

Entropy coding is the final stage of the HEVC encoder, where CABAC is applied. This entropy coding method contributes greatly to the coding efficiency of HEVC. However, due to its high data dependency and sequential coding characteristic, CABAC is a well-known throughput bottleneck in the HEVC architecture, as it is difficult to parallelize and pipeline. This also leads to high computational and hardware complexity in the development of CABAC architectures [3]. Since the standard was published, numerous studies worldwide have proposed hardware architectures for HEVC CABAC that trade off multiple goals, including coding efficiency, throughput, hardware resources and power consumption [2].

Once CABAC has been designed to encode high-throughput video sequences, its data providers must also supply enough workload to avoid stage stalls, which degrade the overall performance of the HEVC encoder. In the HEVC hierarchy, CABAC's workload comes from different sources such as General Encoder Control parameters, Prediction data, Filter parameters and Residual Coefficients [1]. These data appear at the input of CABAC as sequences of SEs, which are then converted into binary symbols (bins) and encoded into a bit string at the CABAC output. Table 1 shows the contributions of these sources to the CABAC input data. Transform Unit (TU) data, which is the matrix form of the Residual Coefficients, accounts for a significant share of the CABAC workload: 75% on average and over 90% in the worst case [4, 5]. Therefore, it is necessary to focus on design strategies for this type of CABAC input data, as it is one of the main causes of CABAC throughput degradation. Besides throughput, power and area are also criteria to be considered in hardware implementation.

Electronic Engineering, Physics, Measurement. T. D. Lam, …, N. M. Cuong, “Hardware design solution in HEVC CABAC encoder.”

Table 1. Major bin contributors in the HEVC data hierarchy [4].

| Hierarchy level | AI | LD-P | LD-B | RA | Worst case |
|---|---|---|---|---|---|
| Coding tree unit/coding unit bins | 5.4% | 15.8% | 16.7% | 11.7% | 1.4% |
| Prediction unit bins | 9.2% | 20.6% | 19.5% | 18.8% | 5.0% |
| Transform unit bins | 85.4% | 63.7% | 63.8% | 69.4% | 94.0% |

Note: The results are reported for each hierarchy level within the HEVC context: Coding Tree Unit/Coding Unit, Prediction Unit, and Transform Unit. The columns AI to RA are the common test conditions: All-Intra (AI), Low-Delay P (LD-P), Low-Delay B (LD-B), and Random Access (RA).

This paper proposes a hardware design solution for Residual SE Generation that targets power saving while still providing enough data for high-speed CABAC encoders. Our contribution is a residual SE generation algorithm and a hardware implementation solution that reduce dynamic power consumption and processing delay. Generating residual SEs requires multiple accesses to the Transform Block (TB) memory for multiple scan passes, which increases the dynamic power consumption and processing delay of the design. Our proposed solution therefore aims at an efficient residual SE generation in terms of power consumption and processing speed.
The rest of the paper is organized as follows. Section 2 presents the principle of residual SE generation in the HEVC CABAC encoder and the related state of the art. Section 3 proposes the hardware architecture for residual SE generation and the hardware design strategies for delay reduction and power saving. Section 4 gives the implementation results and discussion, followed by the conclusion in Section 5.

2. OVERVIEW OF RESIDUAL DATA GENERATION IN HEVC

2.1. Residual Syntax Generation for CABAC encoder

The HEVC standard provides a flexible partitioning of residual TBs ranging from 4×4 up to 32×32 pixels, which are converted to residual coefficients after the Transformation and Quantization steps [6,7].

Figure 1. Residual SEs generation block in block diagram of HEVC encoder.
[Figure 1: block diagram with the T, Q, Q-1 and T-1 blocks; the Residual Coefficients enter the Residual SE Generation block, whose Residual SEs feed CABAC, which produces the output bits.]

As shown in Figure 1, the Residual SE Generation block is applied right after the Transform and Quantization steps and processes the residual coefficients to generate the Residual SE sequences that feed the CABAC encoder. While H.264/AVC applies a zigzag scan pattern, HEVC supports a diagonal scan pattern for all TBs to convert these 2-D blocks of residual coefficients into 1-D arrays [7]. The diagonal scan starts from the bottom-right of a TB and progressively scans up to its top-left. The first diagonal scan divides a large TB into non-overlapping 4×4 sub-blocks of coefficients. These 4×4 sub-blocks are processed using the same logic and procedures across different TB sizes. The second scan occurs within each 4×4 sub-block to form a 1-D array of 16 consecutive coefficients, named a Coefficient Group (CG). Figure 2 [5,6] describes the process of these diagonal scans.

Figure 2. Diagonal scanning: (a) in large TB and (b) within 4×4 TB [5].
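The two-level diagonal scan described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`forward_diagonal_scan`, `reverse_scan_lut`, `to_coefficient_group`) are hypothetical and do not come from the paper, and blocks are assumed to be indexed as `block[y][x]` with (0, 0) at the top-left. The same order is assumed to apply both to the grid of 4×4 sub-blocks inside a larger TB and to the coefficients inside one sub-block.

```python
def forward_diagonal_scan(n=4):
    """Up-right diagonal order of an n x n grid as (x, y) pairs, top-left first."""
    order = []
    for d in range(2 * n - 1):            # d = x + y indexes one anti-diagonal
        # Walk each anti-diagonal from its bottom-left end to its top-right end.
        for y in range(min(d, n - 1), max(0, d - n + 1) - 1, -1):
            order.append((d - y, y))
    return order


def reverse_scan_lut(n=4):
    """Scan positions from the bottom-right corner back to the top-left corner."""
    return forward_diagonal_scan(n)[::-1]


def to_coefficient_group(sub_block):
    """Flatten a 4x4 sub-block (sub_block[y][x]) into a 16-entry 1-D array."""
    return [sub_block[y][x] for (x, y) in reverse_scan_lut(4)]
```

With this numbering, position 0 is the bottom-right corner (3, 3) and position 15 is the top-left corner (0, 0). Calling `reverse_scan_lut(4)` on the 4×4 grid of sub-blocks of a 16×16 TB gives the order in which the sub-blocks are visited, and `to_coefficient_group` applies the same order inside one sub-block.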
For TBs larger than 4×4, after the division into non-overlapping 4×4 sub-blocks of coefficients, a set of flags is determined to indicate whether each sub-block is significant. A significant sub-block has at least one non-zero coefficient and is signaled by a “1” flag, while a “0” flag signals an insignificant sub-block whose coefficients are all zero. This set of flags is named Coded Sub-Block Flags (CSBFs) [4]. It is then sent to CABAC as CSBF residual SEs to tell the encoder whether to process the sub-block (CSBF = “1”) or to send only CSBF = “0” without processing that sub-block. In addition, the last significant sub-block is also scanned to find the last significant coefficient position, which is the entry point for the remaining scans of that TB. Figure 3 shows an example of the scanning process that generates and signals the CSBF SEs and the last significant coefficient position for a 16×16 TB.

Figure 3. Example of CSBF generation for 16×16 TU.
[Figure 3: a 16×16 block of coefficients divided into 4×4 sub-blocks, with CSBFs = “0001011111111111” and with the last significant sub-block position and the last significant coefficient position marked.]

After this step, all 4×4 sub-blocks (and 4×4 TBs as well) with CSBF = “1” are processed to generate the remaining residual SEs. The set of SEs representing the residual coefficients of each 4×4 sub-block (i.e. CG) and their binarization methods are defined in Table 2 [8]. To generate this set of residual SEs, each CG undergoes 5 scan passes following the same scan pattern [6]. Each scan pass generates one type of residual SE.

Table 2. Syntax Elements of 4×4 Residual Transform data [9].

| Syntax Element | Description | Binarization method |
|---|---|---|
| Sig_coeff_flag | Indicates whether a coefficient is zero (“0”) or non-zero (“1”). | Fixed-Length |
| Coeff_abs_level_greater1_flag | Flag indicating whether the absolute value of a coefficient level is greater than 1. | Fixed-Length |
| Coeff_abs_level_greater2_flag | Flag indicating whether the absolute value of a coefficient level is greater than 2. | Fixed-Length |
| Coeff_sign_flag | Sign of a significant coefficient (0: positive; 1: negative). | Fixed-Length |
| Coeff_abs_level_remaining | Remaining value for the absolute value of a coefficient level. | Coefficient Absolute Level Remaining |

Figure 4 shows the scanning process that generates the residual SEs listed in Table 2 from a sub-block [5]. The diagonal scan pattern is applied to form the CG, which then undergoes the 5 scan passes consecutively.

Figure 4. Process of diagonal scan and scan passes [5].
[Figure 4: a 4×4 sub-block is diagonally scanned into a CG (positions 0 to 15), which then goes through the scan passes to produce the SEs.]

2.2. State-of-the-art

Since the standard was issued, most research work has focused on CABAC implementation, as it is the main throughput bottleneck. In recent years, the input workload of CABAC has been considered a potential cause of the HEVC throughput bottleneck, particularly the residual data, which makes up 75% of the CABAC input data on average. Bampi's group [4,10] has been prominent in this research direction, proposing high-throughput implementations for residual SE generation. Saggiorato et al. [10] proposed a multi-core residual SE generation architecture that can process 4 coefficients simultaneously to provide enough data for a high-throughput CABAC encoder. Ramos et al. [4] proposed four pipelined SE processing cores to avoid starving the CABAC input. In addition, they proposed a power-gating scheme, based on an analysis of input data statistics, to save energy. Their solution is based on pipelined multi-core design strategies. This is the principal method to increase throughput; however, it comes at the cost of increased hardware area and power consumption.

In this paper, we propose a hardware design solution for residual SE generation for all TB sizes that saves power while still providing enough data for the CABAC encoder in UHD video applications. Our solution is based on a careful analysis of the internal mechanism of the scanning processes for all TB sizes in order to reduce the number of memory accesses. This reduces both the dynamic power consumption and the processing delay.

3. PROPOSED ARCHITECTURE AND HARDWARE IMPLEMENTATION FOR RESIDUAL DATA GENERATION

3.1. Residual SE generation method and proposed efficient scanning algorithm

As presented in sub-section 2.1, each TB goes through several processing steps to generate the residual SE set that feeds the CABAC encoder. Figure 5 shows the scanning process that determines the significance of the sub-blocks (CSBFs), the last significant sub-block position and the last significant coefficient position.

Figure 5. Diagonal scans for significant SEs.
[Figure 5: the 16×16 TB memory, the coefficients of the last significant sub-block, and the scan-position grid used for both the sub-blocks and the coefficients. Scan positions, rows from top to bottom: 15 13 10 6 / 14 11 7 3 / 12 8 4 1 / 9 5 2 0. The corresponding (x, y) coordinates run from (0,0) at the top-left to (3,3) at the bottom-right.]

A transform block of 16×16 coefficients, for example, is diagonally scanned and divided into 16 sub-blocks of 4×4 coefficients in the order 0 to 15. In this scan, the CSBFs are determined (“0001011111111111” in this example) to signal the significance of each sub-block. Then the position of the last significant sub-block (position 3 in the example) is identified as the entry point of the next scan. Based on this position, a Look-Up Table (LUT) gives the (Xsb, Ysb) coordinates of that last significant sub-block, which are (3, 1) in this example. These coordinates are used to calculate the coordinates of the last significant coefficient position, as described below. The search for the last significant position of the TB starts at the last significant sub-block. This 4×4 sub-block is diagonally scanned to find the position of its last significant coefficient, which is position 5 in the example. This position (5) is then converted to (x, y) coordinates, equal to (1, 3) in this example, by the same LUT. The X and Y coordinates of the last significant coefficient in the 16×16 TB are calculated by the generalized equation (1) below.

$$\mathrm{last\_sig\_coeff\_z} = 4 \times Z_{sb} + \mathrm{last\_z} \qquad (1)$$

In which:
- last_sig_coeff_z: the x or y coordinate of the last significant coefficient of the TB, i.e. last_sig_coeff_x or last_sig_coeff_y.
- Zsb: the x or y coordinate of the last significant sub-block in the TB, i.e. Xsb or Ysb.
- last_z: the x or y coordinate of the last significant coefficient within the last significant sub-block, i.e. last_x or last_y.

For the example in Figure 5, we have (Xsb, Ysb) = (3, 1) and (last_x, last_y) = (1, 3); applying equation (1) gives the coordinates of the last significant position: last_sig_coeff_x = 4 × 3 + 1 = 13 and last_sig_coeff_y = 4 × 1 + 3 = 7.

After this step, all significant 4×4 sub-blocks are processed by the same procedure, which includes 5 scan passes, to generate the residual SEs of each sub-block as described in Table 2. Figure 6 shows the process of generating the residual SEs of a 4×4 sub-block and their output order [8]. The first scan pass generates the sig_coeff_flags, which indicate a significant (non-zero) coefficient by a “1” and an insignificant (zero) one by a “0”.
The second scan pass evaluates whether the absolute value of a significant coefficient is greater than one, signaling the result with a “1” or “0” flag. At most 8 significant coefficients, counted from the last significant one, are signaled by coeff_abs_level_greater1_flag. The third scan pass signals coeff_abs_level_greater2_flag, based on the absolute value of the first coefficient whose coeff_abs_level_greater1_flag is “1”. If the absolute value of this coefficient is greater than two, coeff_abs_level_greater2_flag is “1”, otherwise “0”. The fourth scan pass generates the signs of the significant coefficients, coeff_sign_flag, where a positive coefficient is signaled by “0” and a negative one by “1”. The last scan pass calculates coeff_abs_level_remaining, the remaining level of a significant coefficient [9].

Figure 6. Generated SEs and output order [8].

Figure 7. Scanning and SE generation architecture.

As described, for each sub-block of 4×4 coefficients, the residual SEs are generated in 5 scan passes, each of which accesses the TB memory to evaluate the residual coefficients. These memory accesses are the main cause of dynamic power consumption and processing delay. Except for coeff_abs_level_remaining, the other SEs (sig_coeff_flag, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag and coeff_sign_flag) are flags, i.e. one-bit values.
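The five scan passes described above can be illustrated with a short Python sketch. It is a software illustration only, not the paper's hardware architecture: the names `reverse_scan_lut`, `generate_residual_ses` and `output_order` are hypothetical, the sub-block is assumed to be indexed as `sub_block[y][x]`, and the sig_coeff_flag list is assumed to start at the last significant coefficient itself, as in the Figure 6 listing. The rule used for coeff_abs_level_remaining (emit the excess over what the flags of that coefficient can already express) is also an assumption, chosen because it matches the Figure 6 values. Note that the sketch reads the sub-block once and derives all five lists from that single copy.

```python
def reverse_scan_lut(n=4):
    """(x, y) positions of an n x n block from bottom-right to top-left."""
    forward = [(d - y, y)
               for d in range(2 * n - 1)
               for y in range(min(d, n - 1), max(0, d - n + 1) - 1, -1)]
    return forward[::-1]


def generate_residual_ses(sub_block):
    """Return the five SE lists of one 4x4 sub-block, or None if it is all zero."""
    cg = [sub_block[y][x] for (x, y) in reverse_scan_lut(4)]  # single read of the block
    nonzero = [i for i, c in enumerate(cg) if c != 0]
    if not nonzero:
        return None                      # insignificant sub-block (CSBF = 0)
    scanned = cg[nonzero[0]:]            # last significant coefficient down to (0, 0)
    levels = [c for c in scanned if c != 0]

    sig = [1 if c != 0 else 0 for c in scanned]                # pass 1
    greater1 = [1 if abs(c) > 1 else 0 for c in levels[:8]]    # pass 2: at most 8 flags
    g2_index = greater1.index(1) if 1 in greater1 else None
    greater2 = []                                              # pass 3: one flag at most
    if g2_index is not None:
        greater2 = [1 if abs(levels[g2_index]) > 2 else 0]
    sign = [1 if c < 0 else 0 for c in levels]                 # pass 4
    remaining = []                                             # pass 5 (assumed rule)
    for i, c in enumerate(levels):
        # Largest absolute value the flags of this coefficient can express.
        base = 1 if i >= 8 else (3 if i == g2_index else 2)
        if abs(c) >= base:
            remaining.append(abs(c) - base)
    return {"sig_coeff_flag": sig,
            "coeff_abs_level_greater1_flag": greater1,
            "coeff_abs_level_greater2_flag": greater2,
            "coeff_sign_flag": sign,
            "coeff_abs_level_remaining": remaining}


def output_order(ses):
    """Concatenate the five SE lists in scan-pass order."""
    return [v for values in ses.values() for v in values]
```

For the sub-block of the Figure 6 example, with rows `[9, 3, 0, -1]`, `[-6, 0, 0, 0]`, `[0, 1, 0, 0]` and `[0, 0, 0, 0]`, the sketch gives the same SE values and output order as listed for that figure.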
[Figure 6 content: sub-block coefficients, rows from top to bottom: 9 3 0 -1 / -6 0 0 0 / 0 1 0 0 / 0 0 0 0.]

| Scan pass | SEs | Values |
|---|---|---|
| 1st scan pass | sig_coeff_flag | 1 0 1 0 0 0 0 1 1 1 |
| 2nd scan pass | coeff_abs_level_greater1_flag | 0 0 1 1 1 |
| 3rd scan pass | coeff_abs_level_greater2_flag | 1 |
| 4th scan pass | coeff_sign_flag | 1 0 0 1 0 |
| 5th scan pass | coeff_abs_level_remaining | 0 4 7 |

Data output order: 1 0 1 0 0 0 0 1 1 1 0 0 1 1 1 1 1 0 0 1 0 0 4 7

[Figure 7 labels: TB Memory; Last significant, CSBF scanning; Significant, Sign, Greater_one, Greater_two scanning; Coeff Absolute Level Remaining scanning; First scan, Second scan, Third scan; signals: CSBFs, last_sig_coeff_x, last_sig_coeff_y, last significant position, Greater2 position, sig_coeff_flags, coeff_sign_flags, coeff_abs_level_greater1_flags, coeff_abs_level_greater2_flag, CALRs.]

In addition, Table 2 shows that Fixed-Length binarization is used for all of these flagged