Scientific and technological research
Journal of Military Science and Technology Research (Tạp chí Nghiên cứu KH&CN quân sự), Special Issue of the FEE National Conference, 10 - 2020
HARDWARE DESIGN SOLUTION FOR RESIDUAL SYNTAX
ELEMENT GENERATION IN HEVC CABAC ENCODER
Tran Dinh Lam¹*, Tran Xuan Tu², Luu Thi Thu Hong¹, Nguyen Manh Cuong¹
Abstract: Context Adaptive Binary Arithmetic Coding (CABAC) is the only
entropy coding method used in the High Efficiency Video Coding (HEVC) standard,
which supports high compression rates to allow real-time transmission of
UHD 4K/8K video sequences in various modern video services. However, CABAC is
also considered the main throughput bottleneck of the HEVC encoder, which
challenges the deployment of the standard. Since the standard was published,
numerous research efforts have succeeded in proposing high-speed CABAC hardware
designs that address this issue. Once the CABAC throughput has been improved, its
input data, i.e. Syntax Elements (SEs), must be supplied fast enough to avoid
stage stalls, which would degrade the throughput of the whole HEVC encoder. This
paper proposes a hardware design solution for generating the residual SEs, which
form the main workload of CABAC and require access to the residual data memory
to perform multiple scans for various SEs. While meeting the high throughput
requirement, the paper also presents an efficient method of residual SE
generation that reduces the number of memory accesses, resulting in lower
dynamic power consumption and processing delay of the CABAC encoder.
Keywords: HEVC; CABAC; Residual Syntax Element; Hardware Implementation.
1. INTRODUCTION
With the growing diversity of multimedia services and the emerging popularity of High
Definition (HD) and beyond-HD video formats (e.g. 4k×2k or 8k×4k resolutions), higher
coding efficiency than that of the current popular standard, H.264/AVC, is needed. The
newest video coding standard, HEVC, has been created by the Joint Collaborative Team on
Video Coding (JCT-VC) as the successor of H.264/AVC. It has been designed to face the
challenges of transmitting real-time, high-quality video sequences over
limited-bandwidth media [1]. The HEVC standard achieves almost double the compression
rate of H.264/AVC, so it needs about half the bit rate, and hence half the bandwidth, to
carry video sequences of the same quality. Besides coding efficiency, processing speed,
power consumption and area cost also need to be considered when adopting HEVC in
high-quality video services and battery-based applications [2].
Entropy coding, where CABAC is applied, is the final stage of the HEVC encoder. This
entropy coding method contributes greatly to the coding efficiency of HEVC.
However, due to its high data dependency and sequential coding characteristics, CABAC
is a well-known throughput bottleneck in the HEVC architecture, as it is difficult to
parallelize and pipeline. This also leads to high computational and hardware
complexity in the development of CABAC architectures [3]. Since the standard was
published, numerous studies worldwide have proposed hardware architectures for HEVC
CABAC that trade off multiple goals, including coding efficiency, high throughput,
hardware resources, and low power consumption [2].
Once CABAC has been well designed to encode high-throughput video sequences, its
data providers must also be able to supply enough workload to avoid stage stalls, which
degrade the overall performance of the HEVC encoder. In the HEVC hierarchy,
CABAC’s workload comes from different sources such as General Encoder Control
parameters, Prediction data, Filter parameters and Residual Coefficients [1]. These data
appear at the input of CABAC as sequences of SEs, which are then converted into binary
symbols (bins) and encoded into a bit string at the CABAC output. Table 1 shows the
contributions of these sources to the CABAC input data. Transform Unit (TU) data, which
is the matrix form of the Residual Coefficients, accounts for a significant share of the
CABAC workload: 75% on average and over 90% in the worst case [4, 5]. Therefore, it
is necessary to focus on design strategies for this type of CABAC input data, as it is one of
the main causes of CABAC throughput degradation. Besides throughput, power and area
are also criteria that need to be considered in hardware implementation.
Table 1. Major Bins contributors among HEVC data hierarchy [4].
                                      Common Test Condition
Hierarchy level                       AI       LD-P     LD-B     RA       Worst-case
Coding tree unit/coding unit bins     5.4%     15.8%    16.7%    11.7%    1.4%
Prediction unit bins                  9.2%     20.6%    19.5%    18.8%    5.0%
Transform unit bins                   85.4%    63.7%    63.8%    69.4%    94.0%
Note: The results are reported for each hierarchy level within the HEVC context:
Coding Tree Unit/Coding Unit, Prediction Unit, and Transform Unit. The common test
conditions used are All-Intra (AI), Low-Delay P (LD-P), Low-Delay B (LD-B), and
Random Access (RA).
This paper proposes a hardware design solution that implements Residual SE
Generation targeting power savings while still providing enough data for high-speed
CABAC encoders. Our contribution is a residual SE generation algorithm and a
hardware implementation solution that reduce dynamic power consumption and processing
delay. Generating residual SEs requires multiple accesses to the Transform Block (TB)
memory for multiple scan passes. These accesses increase the dynamic power
consumption and processing delay of the design; our proposed solution is therefore an
efficient residual SE generation implementation in terms of power consumption and
processing speed.
The rest of the paper is organized as follows. Section 2 presents the principle of
Residual SE generation in the HEVC CABAC encoder and the related state of the art.
Section 3 proposes the hardware architecture for residual SE generation and the hardware
design strategies for delay reduction and power savings. Section 4 gives the
implementation results and discussion, followed by the conclusion in Section 5.
2. OVERVIEW OF RESIDUAL DATA GENERATION IN HEVC
2.1. Residual Syntax Generation for CABAC encoder
The HEVC standard provides a flexible method of partitioning residual TBs, ranging from
4×4 up to 32×32 pixels, which are converted to residual coefficients after the
Transform and Quantization steps [6,7].
Figure 1. Residual SEs generation block in block diagram of HEVC encoder.
[Figure 1 labels: Residual Coefficients; T, Q; T⁻¹, Q⁻¹; Residual SE Generation; Residual SEs; CABAC; Output bits.]
As shown in Figure 1, the Residual SE Generation block is placed right after the
Transform and Quantization steps and processes the residual coefficients to generate the
Residual SE sequences that feed the CABAC encoder.
While H.264/AVC applies a zigzag scan pattern, HEVC supports a diagonal scan pattern
for all TBs to convert these 2-D blocks of residual coefficients into 1-D arrays [7].
The diagonal scan starts from the bottom-right of the TB and progressively scans up
to its top-left. The first diagonal scan divides a large TB into non-overlapping
4×4 sub-blocks of coefficients. These 4×4 sub-blocks are processed using the same
logic and procedures across different TB sizes. The second scan occurs within each
4×4 sub-block to form a 1-D array of 16 consecutive coefficients, named a
Coefficient Group (CG). Figure 2 [5,6] describes the process of these diagonal scans.
Figure 2. Diagonal scanning: (a) in large TB and (b) within 4×4 TB [5].
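The two-level diagonal scan described above can be sketched in Python as follows. This is only an illustration, not the hardware design: the function names `diagonal_scan_order` and `to_coefficient_group` are hypothetical, and the numbering (position 0 at the bottom-right corner, the last position at the top-left) is assumed from the scan description. The example sub-block is the one shown in Figure 4.

```python
def diagonal_scan_order(n=4):
    """Return the (x, y) coordinates of an n x n grid in diagonal scan order.

    Assumed numbering: position 0 is the bottom-right corner and the last
    position is the top-left corner, as described in the text.
    """
    order = []
    for d in range(2 * (n - 1), -1, -1):        # anti-diagonals, x + y = d
        for x in range(min(d, n - 1), -1, -1):  # x decreases along a diagonal
            y = d - x
            if y < n:
                order.append((x, y))
    return order


def to_coefficient_group(sub_block):
    """Flatten a 4x4 sub-block (list of rows) into a 16-entry CG in scan order."""
    return [sub_block[y][x] for (x, y) in diagonal_scan_order(4)]


# Sub-block taken from Figure 4 (rows from top to bottom).
sub_block = [
    [6, 1, 4, 1],
    [-3, 2, 1, 0],
    [0, 0, 0, 0],
    [3, 2, 0, 0],
]
print(to_coefficient_group(sub_block))
```

The same `diagonal_scan_order` function can also give the order in which the 4×4 sub-blocks of a larger TB are visited, for example with `n=4` for the 4×4 grid of sub-blocks in a 16×16 TB.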
For TBs larger than 4×4, after the division into non-overlapping 4×4 sub-blocks
of coefficients, a set of flags is determined to indicate whether each sub-block
is significant. A significant sub-block has at least one non-zero coefficient and is
signaled by a “1” flag, while a “0” flag signals an insignificant sub-block, in which
all coefficients are zero. This set of flags is named Coded Sub-Block Flags (CSBFs) [4].
The flags are then sent to CABAC as CSBF residual SEs to tell the encoder whether to
process the sub-block (CSBF = “1”) or only send CSBF = “0” without processing that
sub-block. In addition, the last significant sub-block is scanned to find the last
significant coefficient position, which is the entry point for the remaining scans of
that TB. Figure 3 shows an example of the scanning process that generates and signals
the CSBF SEs and the last significant coefficient position for a 16×16 TB.
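As a software illustration of the CSBF step, the sketch below computes one flag per 4×4 sub-block, visiting the sub-blocks in diagonal scan order. The function names are hypothetical and the scan numbering (position 0 at the bottom-right sub-block) is an assumption taken from the description above; the input is the 16×16 example TB of Figure 3.

```python
def diagonal_scan_order(n):
    """(x, y) coordinates of an n x n grid in diagonal scan order (position 0 = bottom-right)."""
    return [(x, d - x)
            for d in range(2 * (n - 1), -1, -1)
            for x in range(min(d, n - 1), -1, -1)
            if d - x < n]


def coded_sub_block_flags(tb):
    """Return the CSBFs of a square TB as a string, one flag per 4x4 sub-block in scan order."""
    flags = []
    for (x_sb, y_sb) in diagonal_scan_order(len(tb) // 4):
        significant = any(tb[4 * y_sb + dy][4 * x_sb + dx] != 0
                          for dy in range(4) for dx in range(4))
        flags.append("1" if significant else "0")
    return "".join(flags)


# 16x16 example TB from Figure 3 (rows from top to bottom).
TB_16X16 = [
    [-1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, -1, 1, 0, -2, -1],
    [0, 1, 0, 0, -1, -1, 0, 1, -1, 0, 0, 0, 1, 0, 1, 0],
    [2, -2, 0, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [-1, -1, -1, 0, 0, 0, 0, 0, 0, 0, -1, 0, 1, 0, 0, 0],
    [-4, -1, 1, 2, 0, -1, 0, 0, -2, 2, -2, 1, 6, 1, 4, 1],
    [6, 2, -4, -1, 1, 1, -1, 0, 1, 0, -1, -1, -3, 2, 1, 0],
    [3, 5, 3, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [-3, 3, 2, 0, 0, 0, 0, 0, -1, 2, -1, 0, 3, 2, 0, 0],
    [0, 0, 1, 0, -2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, -1, 0, 1, 2, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, -1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, -1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, -1, -1, -2, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, -1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, -1, 0, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]
print(coded_sub_block_flags(TB_16X16))  # 0001011111111111
```

The printed string matches the CSBFs reported in Figure 3 for this TB.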
Figure 3. Example of CSBF generation for 16×16 TU.
[Figure 3 content: a 16×16 TB (16 samples × 16 samples, divided into sub-blocks of 4 samples × 4 samples), with the last significant sub-block position and the last significant coefficient position marked.]
-1  0  0  0  0  0  0  0  0  1  1 -1  1  0 -2 -1
 0  1  0  0 -1 -1  0  1 -1  0  0  0  1  0  1  0
 2 -2  0  0 -1  0  0  1  0  0  0  0  0  1  0  0
-1 -1 -1  0  0  0  0  0  0  0 -1  0  1  0  0  0
-4 -1  1  2  0 -1  0  0 -2  2 -2  1  6  1  4  1
 6  2 -4 -1  1  1 -1  0  1  0 -1 -1 -3  2  1  0
 3  5  3 -3  0  0  0  0  0  0  0  0  0  0  0  0
-3  3  2  0  0  0  0  0 -1  2 -1  0  3  2  0  0
 0  0  1  0 -2  1  1  1  0  0  0  0  0  0  0  0
 0 -1  0  1  2 -1  0  0  0  0  0  0  0  0  0  0
 0 -1  0  0  0  0  2  0  0  0  0  0  0  0  0  0
 0 -1  0  0  0  0  0  1  0  0  0  0  0  0  0  0
 1  0 -1 -1 -2  1  2  0  0  0  0  0  0  0  0  0
 0  0  0  1  0  1  0  0  0  0  0  0  0  0  0  0
 0  0 -1  0  0  0  1  0  0  0  0  0  0  0  0  0
 0  1  0  0 -1  0 -1  1  0  0  0  0  0  0  0  0
CSBF of each 4×4 sub-block, arranged by sub-block location in the TB:
1 1 1 1
1 1 1 1
1 1 0 0
1 1 0 0
CSBFs = “0001011111111111”

After this step, all 4×4 sub-blocks (and 4×4 TBs as well) with CSBF = “1” are processed
to generate the remaining residual SEs. The set of SEs representing the residual
coefficients of each 4×4 sub-block (i.e. CG) and their binarization methods are defined in
Table 2 [8]. To generate this set of residual SEs, each CG undergoes 5 scan passes
following the same scan pattern [6]. Each scan pass generates one type of residual SE.
Table 2. Syntax Elements of 4×4 Residual Transform data [9].
Syntax Element                   Description                                      Binarization method
sig_coeff_flag                   Indicates whether a coefficient is zero or       Fixed-Length
                                 non-zero by a “0” or “1” flag.
coeff_abs_level_greater1_flag    Flag indicating whether the absolute value of    Fixed-Length
                                 a coefficient level is greater than 1.
coeff_abs_level_greater2_flag    Flag indicating whether the absolute value of    Fixed-Length
                                 a coefficient level is greater than 2.
coeff_sign_flag                  Sign of a significant coefficient                Fixed-Length
                                 (0: positive; 1: negative).
coeff_abs_level_remaining        Remaining value for the absolute value of        Coefficient Absolute
                                 a coefficient level.                             Level Remaining
Figure 4 shows the scanning process that generates the residual SEs listed in Table 2
from a sub-block [5]. The diagonal scan pattern is applied to form the CG, which then
undergoes 5 consecutive scan passes.
Figure 4. Process of diagonal scan and scan passes [5].
2.2. State-of-the-art
Since the standard was issued, most research work has focused on CABAC
implementation, as it is the main throughput bottleneck. In recent years, the input
workload of CABAC has been considered a potential HEVC throughput bottleneck,
particularly the residual data, which accounts on average for 75% of the CABAC input data.
Bampi’s group [4,10] has led this research direction, proposing high-throughput
implementations for residual SE generation. Saggiorato et al. [10]
proposed a multi-core residual SE generation architecture that can process 4 coefficients
simultaneously to provide enough data for a high-throughput CABAC encoder. Ramos et al.
[4] also proposed four pipelined SE processing cores to avoid starving the CABAC input.
In addition, they proposed a power-gating scheme based on an analysis of input data
statistics to save energy. Their solution is based on pipelined multi-core
design strategies. This is the principal method of increasing throughput; however, it
comes at the cost of increased hardware area and power consumption. In this paper, we
propose a hardware design solution for residual SE generation for all TB sizes that saves
power while still providing enough data for the CABAC encoder in UHD
video applications. Our solution is based on a careful analysis of the internal mechanism
of the scanning processes for all TB sizes to reduce the number of memory accesses. This
results in a reduction of dynamic power consumption and processing delay.
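[Figure 4 content: the 4×4 sub-block
 6  1  4  1
-3  2  1  0
 0  0  0  0
 3  2  0  0
is diagonally scanned into the CG (positions 15 down to 0): 6 -3 1 0 2 4 3 0 1 1 2 0 0 0 0 0, which then goes through the scan passes to produce the SEs.]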
3. PROPOSED ARCHITECTURE AND HARDWARE IMPLEMENTATION FOR
RESIDUAL DATA GENERATION
3.1. Residual SE generation method and proposed efficient scanning algorithm
As presented in sub-section 2.1, each TB goes through several processing steps to
generate the set of residual SEs that provides data for the CABAC encoder. The scanning
process that determines the significance of sub-blocks (CSBF), the last significant
sub-block position and the last significant coefficient position is shown in Figure 5.
Figure 5. Diagonal scans for significant SEs.
A transform block of 16×16 coefficients, for example, is diagonally scanned and divided
into 16 sub-blocks of 4×4 coefficients, numbered 0 to 15. In this scan, the CSBFs
are determined (“0001011111111111” in this example) to signal the significance of each
sub-block. Then, the position of the last significant sub-block (position 3 in the
example) is found as the entry point of the next scan. Based on this position, a
Look-Up Table (LUT) is applied to determine the (X_sb, Y_sb) coordinates of that last
significant sub-block, which are (3, 1) in this example. These coordinates are used to
calculate the coordinates of the last significant coefficient position, as described later.
The search for the last significant position of the TB starts at the last significant
sub-block. This 4×4 sub-block is diagonally scanned to find the position of its last
significant coefficient, which is position 5 in the example. This position (5) is then
used to obtain the (x, y) coordinates ((1, 3) in this example) from the same
LUT. The X and Y coordinates of the last significant coefficient in the 16×16 TB are
calculated by the generalized equation (1) below.
$$\text{last\_sig\_coeff\_z} = 4 \times Z_{sb} + \text{last\_z} \qquad (1)$$
In which:
- last_sig_coeff_z: the x or y coordinate of the last significant coefficient of the
TB, i.e. last_sig_coeff_x or last_sig_coeff_y.
- Z_sb: the x or y coordinate of the last significant sub-block in the TB, i.e.
X_sb or Y_sb.
- last_z: the x or y coordinate of the last significant coefficient within the last
significant sub-block, i.e. last_x or last_y.
For the example in Figure 5, we have (X_sb, Y_sb) = (3, 1) and (last_x, last_y) = (1, 3);
applying equation (1) gives the coordinates of the last significant position:
$$\text{last\_sig\_coeff\_x} = 4 \times 3 + 1 = 13 \quad \text{and} \quad \text{last\_sig\_coeff\_y} = 4 \times 1 + 3 = 7$$
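The LUT lookup and equation (1) can be sketched in software as below. This is a minimal illustration under stated assumptions: `last_significant_position` and the returned field names are hypothetical, the LUT is generated by a helper rather than stored, and scanning starts at position 0 (bottom-right), so the first non-zero value met is the last significant one. The 8×8 block `toy_tb` is made-up data used only to show a call.

```python
def diagonal_scan_order(n=4):
    """LUT: scan position -> (x, y) in an n x n grid, position 0 at the bottom-right."""
    return [(x, d - x)
            for d in range(2 * (n - 1), -1, -1)
            for x in range(min(d, n - 1), -1, -1)
            if d - x < n]


def last_significant_position(tb):
    """Locate the last significant coefficient of a square TB (list of rows).

    Returns None for an all-zero TB.
    """
    coeff_lut = diagonal_scan_order(4)           # positions inside a 4x4 sub-block
    sb_lut = diagonal_scan_order(len(tb) // 4)   # positions of the sub-blocks in the TB
    for sb_pos, (x_sb, y_sb) in enumerate(sb_lut):
        for coeff_pos, (last_x, last_y) in enumerate(coeff_lut):
            if tb[4 * y_sb + last_y][4 * x_sb + last_x] != 0:
                return {
                    "sub_block_pos": sb_pos,
                    "sub_block_xy": (x_sb, y_sb),
                    "coeff_pos": coeff_pos,
                    "coeff_xy": (last_x, last_y),
                    # Equation (1): last_sig_coeff_z = 4 * Z_sb + last_z
                    "last_sig_coeff_x": 4 * x_sb + last_x,
                    "last_sig_coeff_y": 4 * y_sb + last_y,
                }
    return None


# Made-up 8x8 TB with two non-zero coefficients, only to demonstrate the call.
toy_tb = [[0] * 8 for _ in range(8)]
toy_tb[0][0] = 5    # (x, y) = (0, 0)
toy_tb[2][5] = -1   # (x, y) = (5, 2)
print(last_significant_position(toy_tb))
```

Applied to the 16×16 TB of Figure 5, the same function returns sub-block position 3 at (3, 1), coefficient position 5 at (1, 3), and the coordinates (13, 7), in agreement with the worked example.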
[Figure 5 content: the 16×16 TB memory (same coefficients as in Figure 3), the LUT of diagonal scan positions used both for the sub-blocks in the TB and for the coefficients in a sub-block,
15 13 10  6
14 11  7  3
12  8  4  1
 9  5  2  0
the corresponding X and Y coordinates,
(0,0) (1,0) (2,0) (3,0)
(0,1) (1,1) (2,1) (3,1)
(0,2) (1,2) (2,2) (3,2)
(0,3) (1,3) (2,3) (3,3)
and the coefficients of the last significant sub-block:
 6  1  4  1
-3  2  1  0
 0  0  0  0
 3  2  0  0
]

After this step, all of the significant 4×4 sub-blocks are processed by the same
procedure, which includes 5 scan passes, to generate the residual SEs of each sub-block as
described in Table 2. Figure 6 shows the process of generating the residual SEs of a 4×4
sub-block and their output order [8]. The first scan pass generates the sig_coeff_flags,
which mark a significant (non-zero) coefficient with a “1” and an insignificant (zero)
one with a “0”. The second scan pass evaluates whether the absolute value of a significant
coefficient is greater than one, signaled by a “1” or “0” flag. At most 8 significant
coefficients, counted from the last significant one, are signaled by
coeff_abs_level_greater1_flag. The third scan pass signals
coeff_abs_level_greater2_flag, based on the absolute value of the first coefficient whose
coeff_abs_level_greater1_flag is “1”. If the absolute value of this
coefficient is greater than two, coeff_abs_level_greater2_flag is “1”,
otherwise “0”. The fourth scan pass generates the signs of the significant coefficients,
coeff_sign_flag, where a positive coefficient is signaled by “0” and a negative one by
“1”. The last scan pass calculates coeff_abs_level_remaining, the remaining level of a
significant coefficient [9].
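The five scan passes can be modelled in software as in the sketch below, which processes one pass after another exactly as described, without any of the access reduction discussed in this paper. The function name and the dictionary keys are hypothetical. The base-level rule used in the fifth pass (3 for the coefficient that carries the greater2 flag, 2 for other coefficients with a greater1 flag, 1 beyond the first 8 significant coefficients) is an assumption based on the usual HEVC derivation; it is chosen here because it reproduces the values of the Figure 6 example.

```python
def diagonal_scan_order(n=4):
    """(x, y) coordinates of an n x n grid in diagonal scan order (position 0 = bottom-right)."""
    return [(x, d - x)
            for d in range(2 * (n - 1), -1, -1)
            for x in range(min(d, n - 1), -1, -1)
            if d - x < n]


def residual_syntax_elements(sub_block, max_greater1_flags=8):
    """Generate the residual SEs of one 4x4 sub-block with five sequential passes."""
    cg = [sub_block[y][x] for (x, y) in diagonal_scan_order(4)]
    start = next((i for i, c in enumerate(cg) if c != 0), None)
    if start is None:
        return None  # insignificant sub-block: only CSBF = 0 would be sent
    coeffs = cg[start:]                      # from the last significant coefficient onward
    levels = [c for c in coeffs if c != 0]   # significant coefficients only

    # Pass 1: significance flags.
    sig = [1 if c != 0 else 0 for c in coeffs]
    # Pass 2: greater-than-1 flags for at most 8 significant coefficients.
    greater1 = [1 if abs(c) > 1 else 0 for c in levels[:max_greater1_flags]]
    # Pass 3: greater-than-2 flag for the first coefficient whose greater1 flag is 1.
    first_g1 = greater1.index(1) if 1 in greater1 else None
    greater2 = [] if first_g1 is None else [1 if abs(levels[first_g1]) > 2 else 0]
    # Pass 4: sign flags (0: positive, 1: negative).
    sign = [1 if c < 0 else 0 for c in levels]
    # Pass 5: remaining absolute level (assumed base-level rule, see lead-in).
    remaining = []
    for i, c in enumerate(levels):
        if i >= max_greater1_flags:
            base = 1
        elif i == first_g1:
            base = 3
        else:
            base = 2
        if abs(c) >= base:
            remaining.append(abs(c) - base)

    return {
        "sig_coeff_flag": sig,
        "coeff_abs_level_greater1_flag": greater1,
        "coeff_abs_level_greater2_flag": greater2,
        "coeff_sign_flag": sign,
        "coeff_abs_level_remaining": remaining,
        "output_order": sig + greater1 + greater2 + sign + remaining,
    }


# 4x4 sub-block of the Figure 6 example (rows from top to bottom).
example = [
    [9, 3, 0, -1],
    [-6, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
for name, values in residual_syntax_elements(example).items():
    print(name, values)
```

Each pass in this model reads the coefficients again, which mirrors the repeated TB memory accesses that the proposed design aims to reduce.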
Figure 6. Generated SEs and output order [8].
[Figure 6 content: the 4×4 sub-block
 9  3  0 -1
-6  0  0  0
 0  1  0  0
 0  0  0  0
and the SEs produced by the five scan passes:]
Scan pass        SEs                              Values
1st scan pass    sig_coeff_flag                   1 0 1 0 0 0 0 1 1 1
2nd scan pass    coeff_abs_level_greater1_flag    0 0 1 1 1
3rd scan pass    coeff_abs_level_greater2_flag    1
4th scan pass    coeff_sign_flag                  1 0 0 1 0
5th scan pass    coeff_abs_level_remaining        0 4 7
Data output order: 1 0 1 0 0 0 0 1 1 1 0 0 1 1 1 1 1 0 0 1 0 0 4 7

Figure 7. Scanning and SE generation architecture.
[Figure 7 labels: TB Memory; First scan (Last significant, CSBF scanning); Second scan (Significant, Sign, Greater_one, Greater_two scanning); Third scan (Coeff Absolute Level Remaining scanning); signals: CSBFs, last_sig_coeff_x, last_sig_coeff_y, Last significant position, Greater2 position, sig_coeff_flags, coeff_sign_flags, coeff_abs_level_greater1_flags, coeff_abs_level_greater2_flag, CALRs.]

As described, for each 4×4 sub-block of coefficients, the residual SEs are generated after 5
scan passes, each of which accesses the TB memory to evaluate the residual coefficients.
These memory accesses are the main cause of dynamic power consumption and processing
delay. Except for coeff_abs_level_remaining, the remaining SEs (sig_coeff_flag,
coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag and coeff_sign_flag) are
flags, i.e. one-bit values. In addition, Table 2 shows that Fixed-Length binarization
is used for all of these flagged