TASKS AND CONTENTS: To design and implement a polar code SC decoder on an FPGA using Verilog and improve the throughput of the semi-parallel successive cancellation (SC) decoder. To reduce latency cyc
INTRODUCTION
Overview
Polar codes, introduced by Arıkan in 2008, use channel polarization to achieve the capacity of symmetric channels. By combining and splitting channels, polar codes transform identical binary-input discrete memoryless channels into polarized channels. This process yields nearly noiseless "good channels" whose capacity approaches one and noisy "bad channels" whose capacity approaches zero. Transmitting information only over the good channels is what enables efficient and reliable communication.
As the number of channels, i.e. the code length, approaches infinity, the fraction of good channels converges to the capacity of the original channel. This property distinguishes polar codes from traditional channel codes such as Turbo and LDPC codes. Polar codes thus introduce a new concept in code design that departs from conventional approaches.
In practical applications, channel coding is a crucial technology for ensuring reliable transmission, particularly in wireless communications. Figure 1-1 illustrates the progression of channel code applications across 3G to 5G wireless systems. This roadmap highlights the role of channel coding in advancing the reliability and performance of wireless communication technologies over the years.
5G systems introduce more stringent requirements for transmission latency (1 ms) and reliability (99.999%), which traditional Turbo codes struggle to meet. The best readings on polar coding published online by the IEEE Communications Society in 2019 [2] show that polar codes provide excellent error-correcting performance with low decoding complexity for practical blocklengths when combined with list SC decoding and a CRC check. These favorable traits have led to the adoption of polar codes in the 5G wireless standard, which is a testament to their performance.
Figure 1-1 Roadmap of channel coding in wireless communication systems
For the decoding of polar codes, list successive cancellation decoding (LSCD) [3] [4] was introduced. LSCD generates L decoding paths by running L successive cancellation decoders (SCDs) in parallel [5] [6]. This method, however, comes with increased implementation complexity and decoding latency. Improvements to the successive cancellation (SC) decoder therefore directly improve the overall implementation of LSCD. Consequently, our focus is on optimizing the FPGA implementation of the semi-parallel SC decoder [5], which forms the core of the original LSCD approach. Designing a high-throughput, low-latency architecture is the key issue of the hardware implementation.
Related research
In hardware implementation, high-throughput and low-latency architectures are pursued for both Successive Cancellation (SC) and Successive Cancellation List (SCL) decoders. Leroux et al. [7] introduced the pipelined-tree architecture to enhance the throughput of the SC decoder, while in [5] they proposed a semi-parallel architecture for a similar purpose. Building upon
To enhance the performance of SC decoders, researchers have explored architectural optimizations and decoding algorithms. Zhang and Parhi developed sequential and overlapped architectures to minimize latency, while Yuan and Parhi introduced multi-bit decision to increase throughput. These advances aim to balance high throughput with low latency in hardware implementations of SC decoders.
Several papers have addressed the FPGA implementation of polar decoders [10] [5] [11] [12]. Pamuk [10] presented an FPGA implementation of a belief propagation decoder for polar codes. Leroux et al. [5] introduced a semi-parallel SC decoder architecture designed to use FPGA resources efficiently. However, the latency of the semi-parallel decoder architecture of [5] is at least $2N - 2$ cycles. Consequently, its throughput is limited to approximately $f_{max}/2N$ codewords per second (about $f_{max}/2$ coded bits per second), where $N$ denotes the length of the polar code and $f_{max}$ the maximum clock frequency. In another approach, Dizdar et al. [12] proposed an SC decoder architecture using only combinational logic circuits. Their work demonstrated that latency can be reduced by combining combinational and synchronous SC decoders of shorter lengths. These approaches address resource utilization, latency reduction, and overall decoding efficiency in FPGA implementations of polar decoders.
Y. Ideguchi proposed an efficient FPGA implementation of an SC decoder for polar codes [13], demonstrated for a 1024-bit polar code. Their FPGA decoder achieves a threefold increase in throughput over the conventional sequential semi-parallel decoder without a substantial increase in hardware resource utilization. This emphasis on higher throughput with optimized resource utilization is a significant stride in FPGA implementations of polar decoders.
As future work, the focus is directed towards further increasing the clock frequency, continuing the effort to improve the performance and efficiency of FPGA-based polar code decoders.
Table 1-1 Comparison implementation for 1024-bit polar codes using SCD architectures on Stratix IV FPGA
Tasks and expected results
To design and implement a polar code SC decoder on an FPGA using Verilog, and to improve the throughput of the semi-parallel SC decoder.
To reduce the latency in clock cycles by modifying the semi-parallel SC architecture so that part of the codeword is decoded in parallel.
To improve $f_{max}$ by analyzing and shortening the most critical delay path of the semi-parallel SC decoder.
To evaluate the FPGA-based polar decoder implementation in terms of resource utilization, $f_{max}$, latency, and throughput.
By improving the semi-parallel SC decoder, the expected results are:
Reduce the latency by $N/2$ clock cycles, i.e. 512 clock cycles for $N = 1024$.
Improve $f_{max}$ from 173 MHz to more than 200 MHz.
Improve throughput from 85 Mbps to more than 130 Mbps.
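The three targets above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that throughput is counted in coded bits ($N$ bits per decoded codeword) and backs the reference latency out of the quoted 173 MHz and 85 Mbps figures; the variable names are hypothetical.

```python
# Back-of-the-envelope check of the stated targets (assumption: throughput is
# counted in coded bits, i.e. N bits are delivered once per decoding latency).
N = 1024                 # code length used in the targets
F_REF_MHZ = 173.0        # reference fmax quoted in the text
TP_REF_MBPS = 85.0       # reference throughput quoted in the text
F_TARGET_MHZ = 200.0     # target fmax quoted in the text

# Latency implied by the reference figures: cycles = fmax * N / throughput.
cycles_ref = F_REF_MHZ * N / TP_REF_MBPS
# The architectural change removes N/2 cycles per codeword.
cycles_new = cycles_ref - N / 2

# Throughput if only the latency improves, and if fmax improves as well.
tp_latency_only = F_REF_MHZ * N / cycles_new
tp_target = F_TARGET_MHZ * N / cycles_new

print(f"implied reference latency : {cycles_ref:.0f} cycles")
print(f"latency after N/2 savings : {cycles_new:.0f} cycles")
print(f"throughput at 173 MHz     : {tp_latency_only:.1f} Mbps")
print(f"throughput at 200 MHz     : {tp_target:.1f} Mbps")
```

Under these assumptions, the cycle reduction alone does not reach 130 Mbps; the clock frequency improvement is needed as well.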
PRELIMINARIES
Polar Code Construction and Encoding
Polar codes are linear block codes of length $N = 2^n$ whose generator matrix is the $n$-th Kronecker power $F^{\otimes n}$ of the matrix $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$.
Figure 2-1 depicts the equivalent graph representation of $F^{\otimes 3}$, where $u = u_0^7$ represents the information-bit vector and $x = x_0^7$ represents the codeword transmitted through the channel. The vector notation follows the conventions of [1], namely $u_a^b$ consists of bits $u_a, \ldots, u_b$ of the vector $u$.
Figure 2-1 Polar code encoder with N=8
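As an illustration of the encoding graph, the following sketch computes $x = u F^{\otimes n}$ over GF(2) with XOR butterflies, which is equivalent to multiplying by the Kronecker power. The function name `polar_encode` is hypothetical, and the sketch uses natural bit ordering without any bit-reversal permutation.

```python
def polar_encode(u):
    """Return x = u * F^{(x)n} over GF(2), with F = [[1, 0], [1, 1]].

    The length of u must be a power of two. Each pass applies one stage of
    XOR butterflies; the stage span doubles from 1 up to N/2.
    """
    x = list(u)
    n_bits = len(x)
    step = 1
    while step < n_bits:
        for start in range(0, n_bits, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]   # upper branch gets the XOR of both inputs
        step *= 2
    return x


u = [0, 0, 0, 1, 0, 1, 1, 0]
x = polar_encode(u)
print(x)
# F^{(x)n} is its own inverse over GF(2), so encoding twice returns u.
print(polar_encode(x) == u)
```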
When received vectors are decoded with an SC decoder, the error probability of each estimated bit $\hat{u}_i$, under the assumption that bits $u_0^{i-1}$ were decoded correctly, tends to either 0 or 0.5. Moreover, as established in [1], the fraction of estimated bits with a low error probability converges to the capacity of the underlying channel. Polar codes exploit this phenomenon, known as channel polarization, by using the $K$ most reliable bits for information transmission while "freezing" the remaining $N - K$ bits to a predetermined value, often 0.
Successive Cancellation (SC) Decoding
In SC decoding, the received vector is used to sequentially estimate the transmitted bits, beginning with $u_0$ and ending with $u_{N-1}$. At each step $i$, if $i$ is not in the frozen set, the SC decoder estimates $\hat{u}_i$ as the more likely of the two candidate values:
$$\hat{u}_i = \begin{cases} 0, & \text{if } \dfrac{\Pr(y, \hat{u}_0^{i-1} \mid u_i = 0)}{\Pr(y, \hat{u}_0^{i-1} \mid u_i = 1)} \ge 1 \\ 1, & \text{otherwise} \end{cases} \tag{2}$$
where $\Pr(y, \hat{u}_0^{i-1} \mid u_i = b)$ is the probability that $y$ was received and the previously decoded bits are $\hat{u}_0^{i-1}$, given that the currently decoded bit is $b \in \{0, 1\}$. The ratio of probabilities in (2) is the likelihood ratio (LR) of bit $\hat{u}_i$.
The SC decoding algorithm sequentially evaluates the likelihood ratio (LR) $L_i$ of each bit $\hat{u}_i$. Arıkan demonstrated that these LR computations can be carried out efficiently in a recursive manner using a data flow graph resembling the structure of a fast Fourier transform. This structure, illustrated in Figure 2-2, is referred to as a butterfly-based decoder. The messages exchanged within the decoder are LR values denoted $L_{l,i}$, where $l$ and $i$ are the graph stage index and row index, respectively. Additionally, $L_{0,i} = L(\hat{u}_i)$ and $L_{n,i}$ is the LR directly calculated from the channel output $y_i$. The nodes in the decoder graph compute the messages using one of two functions:
$$L_{l,i} = \begin{cases} f(L_{l+1,i};\, L_{l+1,i+2^l}), & \text{if } B(l,i) = 0 \\ g(\hat{s}_{l,i-2^l};\, L_{l+1,i-2^l};\, L_{l+1,i}), & \text{if } B(l,i) = 1 \end{cases} \tag{3}$$
where $\hat{s}$ is a modulo-2 partial sum of decoded bits, $B(l,i) \triangleq \lfloor i/2^l \rfloor \bmod 2$, $0 \le l < n$, and $0 \le i < N$. In the LR domain, functions $f$ and $g$ can be expressed as:
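With $a$ and $b$ denoting the two input LRs, the LR-domain update rules are usually written as follows. This is the standard formulation from the SC decoding literature, stated here on the assumption that the numbers (4) and (5) used later refer to these two rules:

```latex
f(a, b) = \frac{1 + ab}{a + b} \tag{4}
\qquad
g(\hat{s}, a, b) = a^{\,1 - 2\hat{s}}\, b \tag{5}
```

For $\hat{s} = 0$ the $g$ rule reduces to the product $ab$, and for $\hat{s} = 1$ to the quotient $b/a$.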
The function $f$ can be computed as soon as $a = L_{l+1,i}$ and $b = L_{l+1,i+2^l}$ are available. The calculation of $g$, in contrast, requires knowledge of $\hat{s}$, which can be derived using the factor graph of the code. As illustrated in Figure 2-1, for example, $\hat{s}_{2,1}$ is estimated by propagating $\hat{u}_0^3$ in the factor graph: $\hat{s}_{2,1} = \hat{u}_1 \oplus \hat{u}_3$. This partial sum of $\hat{u}_0^3$ is then used to compute $L_{2,5} = g(\hat{s}_{2,1};\, L_{3,1};\, L_{3,5})$.
Figure 2-2 Butterfly-based SC decoder with N=8
The need for partial sum computations introduces strong data dependencies in the SC algorithm, constraining the order in which the likelihood ratios (LRs) can be computed in the graph. Figure 2-3 shows the schedule of the decoding process for $N = 8$ using a butterfly-based SC decoder. At each clock cycle (CC), LRs are evaluated by computing either function $f$ or $g$. It is assumed that these functions are computed as soon as the required data is available.
Once the channel information $y_0^{N-1}$ is available on the right-hand side of the decoder, the bits $\hat{u}_i$ are estimated successively by updating the relevant nodes in the graph from right to left. When bit $\hat{u}_i$ has been estimated, all partial sums involving $\hat{u}_i$ are updated, enabling subsequent evaluations of function $g$.
Figure 2-3 Scheduling for the butterfly-based SC decoder with N=8
It is evident that when stage $l$ is activated, at most $2^l$ operations can be executed simultaneously. Additionally, only one type of function (either $f$ or $g$) is used during the activation of a given stage. Furthermore, a stage $l$ is activated $2^{n-l}$ times during the decoding of a vector. Consequently, assuming one clock cycle per stage activation, the total number of clock cycles needed to decode a vector is:
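Summing the stated activation counts over the $n$ stages gives the cycle count directly (the symbol $\mathcal{L}_{butterfly}$ is introduced here only for illustration):

```latex
\mathcal{L}_{butterfly} = \sum_{l=0}^{n-1} 2^{\,n-l} = 2^{\,n+1} - 2 = 2N - 2
```

For $N = 8$ this yields 14 clock cycles.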
Despite the apparently parallel structure of this decoder, strong data dependencies constrain the decoding process and make the decoder inefficient. Specifically, defining an active node as one whose inputs are ready so that it can execute its operation, only a fraction of the nodes are active in each decoding clock cycle, as depicted in Figure 2-3. To quantify the efficiency of an SC decoder architecture, the utilization rate, denoted $\alpha$, is used. This rate is the average fraction of nodes that are active per clock cycle:
In SC decoding, $N \log_2 N$ node updates are required to decode one vector. A butterfly-based SC decoder performs this amount of computation with $N \log_2 N$ node processors which are used during $2N - 2$ clock cycles; its utilization rate is thus:
The utilization rate rapidly decreases towards 0 as $N$ grows. This indicates potential for a more efficient use of processing resources.
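Taking the utilization rate as the number of node updates divided by the product of the number of node processors and the number of clock cycles (an assumed reading of the definition above), the counts stated in the text give:

```latex
\alpha_{butterfly}
  = \frac{N \log_2 N}{N \log_2 N \,(2N - 2)}
  = \frac{1}{2N - 2}
```

For $N = 1024$ this is below 0.05%.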
Semi-parallel SC decoder
Hence, the proposal in [5] aims to improve efficiency by limiting the number of processing elements (PEs) integrated into the decoder.
Given that in the line decoder all $N/2$ PEs are simultaneously activated only twice during the decoding of a vector, irrespective of the code size, the utilization rate of a decoder can be increased by reducing the number of PEs without considerably affecting throughput. For instance, a modified line decoder with only $N/4$ PEs would incur only a 2-clock-cycle penalty compared to a full line decoder. This simplified architecture, termed the semi-parallel decoder, has lower complexity at the cost of a marginal increase in latency.
This approach can be extended to a further reduced number of processing elements; let $P < N/2$ denote the count of implemented PEs. Figure 2-4 illustrates the scheduling of a semi-parallel decoder with $\{P = 2; N = 8\}$, showing that this schedule requires only 2 extra clock cycles compared to the equivalent line decoder. Notably, the computations performed during clock cycles {0, 1} and {8, 9} in the semi-parallel decoder are each completed within a single clock cycle in a line decoder.
Furthermore, Figure 2-4 presents the data flow graph of the likelihood ratios (LRs) generated throughout the decoding process for $\{P = 2; N = 8\}$. Notably, data generated during CC = {0, 1} is no longer needed after CC = 5 and can thus be replaced by the data produced in CC = {8, 9}. Consequently, the same memory element can store the results of both computations.
In general, the memory requirements are unchanged compared to the line decoder: the semi-parallel SC decoder requires $N$ memory elements (MEs) for the channel information $y$ and $N - 1$ MEs for intermediate results. Consequently, for a code of length $N$, the memory requirements of the semi-parallel decoder do not depend on the number of implemented processing elements (PEs).
Figure 2-4 illustrates the scheduling and LR data flow graph of a semi-parallel SC decoder with N = 8 and P = 2. Note that the data dependencies related to $\hat{s}$ are not depicted in this figure. Despite the impression that data generated at CC = {8, 4, 2, 1} is directly consumed by BB = {7, 3}, BB actually waits until the symbols are decoded and re-encoded from AA. Although it may seem that the data generated at CC = {8, 9} could have been produced earlier, this is not feasible, as the value of $\hat{u}_3$ must be known to compute $L_{2,4}$, $L_{2,5}$, $L_{2,6}$ and $L_{2,7}$.
While the reduced number of processing elements in a semi-parallel SC decoder increases latency, this increase mainly affects the stages that require more than $P$ node updates. Building on this general observation, we now evaluate the exact impact of reducing the number of processing elements on latency. To keep the scheduling regular, we assume that the implemented number of processing elements, denoted $P$, is a power of 2, i.e. $P = 2^p$.
Within a semi-parallel decoder, a limited set of processing elements is used, so several clock cycles may be needed to complete a stage update. The stages satisfying $2^l \le P$ are unaffected, and their latency is unchanged. However, for stages requiring more LR computations than the number of implemented processing elements, completing the update takes multiple clock cycles. Specifically, $2^l/P$ clock cycles are required to update a stage $l$ with $P$ implemented PEs.
Therefore, the total latency of a semi-parallel decoder is:
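Combining the per-stage cycle counts above with the $2^{n-l}$ activations of each stage gives the following closed form, a worked derivation under the stated assumption $P = 2^p$:

```latex
\mathcal{L}_{SP}
  = \sum_{l=0}^{p} 2^{\,n-l} + \sum_{l=p+1}^{n-1} 2^{\,n-l}\,\frac{2^{l}}{P}
  = \left(2N - \frac{N}{P}\right) + (n - p - 1)\,\frac{N}{P}
  = 2N + \frac{N}{P}\log_2\!\left(\frac{N}{4P}\right)
```

For $\{P = 2; N = 8\}$ this gives 16 clock cycles, i.e. 2 more than the $2N - 2 = 14$ cycles of the line decoder, in agreement with the schedule of Figure 2-4.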
As anticipated, the latency of the semi-parallel decoder rises as the number of implemented processing elements (PEs) decreases. However, this latency penalty is not linear in $P$. To quantify the trade-off between the latency of the semi-parallel decoder ($\mathcal{L}_{SP}$) and $P$, we introduce the relative-speed factor:
This metric defines the throughput achievable by the semi-parallel decoder relative to that of the line decoder. Note that the definition of $\sigma_{SP}$ assumes both decoders can be clocked at the same frequency: $T_{clk-line} = T_{clk-SP}$. Synthesis results show that, due to the large number of PEs in the line decoder, we actually have $T_{clk-line} > T_{clk-SP}$. Consequently, the above function represents the least favorable case for the semi-parallel architecture.
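The trade-off can be explored numerically. The sketch below assumes the closed-form latency $2N + (N/P)\log_2(N/4P)$ for the semi-parallel decoder and $2N - 2$ for the line decoder, and takes the relative-speed factor as the ratio of the two latencies at equal clock frequency; the function names are hypothetical.

```python
import math


def line_cycles(n_bits):
    """Latency of the line decoder in clock cycles (2N - 2)."""
    return 2 * n_bits - 2


def semi_parallel_cycles(n_bits, num_pes):
    """Assumed closed form: 2N + (N/P) * log2(N / (4P)), with P a power of two."""
    return 2 * n_bits + (n_bits // num_pes) * int(math.log2(n_bits / (4 * num_pes)))


def relative_speed(n_bits, num_pes):
    """Throughput of the semi-parallel decoder relative to the line decoder,
    assuming both run at the same clock frequency."""
    return line_cycles(n_bits) / semi_parallel_cycles(n_bits, num_pes)


for n_exp in (10, 20):
    n_bits = 1 << n_exp
    for num_pes in (1, 8, 64):
        print(f"N=2^{n_exp:<2d} P={num_pes:<3d} "
              f"cycles={semi_parallel_cycles(n_bits, num_pes):<9d} "
              f"sigma={relative_speed(n_bits, num_pes):.3f}")
```

With $P = N/2$ the formula reduces to the line-decoder latency, which is a useful sanity check on the closed form.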
The utilization rate of a semi-parallel decoder, on the other hand, is defined as:
Figure 2-5 Utilization rate 𝛼 𝑆𝑃 and relative-speed factor 𝜎 𝑆𝑃 for the semiparallel SC decoder
Figure 2-5 illustrates $\sigma_{SP}$ and $\alpha_{SP}$ as $P$ varies from 1 to 128 for code lengths $N = \{2^{10}, 2^{11}, 2^{12}, 2^{20}\}$. These metrics vary only marginally with the code length. Moreover, the curves indicate that $\sigma_{SP}$ is close to 1 even for small values of $P$. This implies that a small number of PEs is sufficient to achieve a throughput comparable to that of a line decoder. For instance, the semi-parallel decoders in this figure attain over 90% of the throughput of a line SC decoder using only 64 PEs. The number of PEs is reduced by a factor of $N/2P$, which is 8192 for $N = 2^{20}$ and $P = 64$. For $P = 64$ and $N = 1024$, the utilization rate ($\alpha_{SP} = 3.5\%$) is improved by a factor of 8 compared to the line decoder, reflecting a more efficient use of processing resources during decoding.
This substantial reduction in complexity makes the processing resources very small compared to the memory resources required by this architecture.
The following sections describe the modules of the semi-parallel decoder, whose top-level architecture is depicted in Figure 2-6.
Figure 2-6 Semi-parallel SC decoder architecture.
Processing Elements
SC polar code decoders compute likelihood estimates through update rules (4) and (5). However, these equations involve divisions and multiplications, making them impractical for hardware implementation. To reduce complexity, [7] proposed replacing these likelihood ratio (LR) updates with equivalent functions in the logarithmic domain. Throughout this work, log-likelihood ratio (LLR) values are denoted by $\lambda_X = \log(X)$, where $X$ is an LR.
In the LLR domain, functions 𝑓 and 𝑔 become the sum-product algorithm (SPA) equations:
At first glance, $\lambda_f$ may seem more complex than its counterpart (4) because it involves hyperbolic functions. However, as demonstrated in [14], it can be approximated using the minimum function, resulting in the simpler min-sum (MS) equations:
$$\lambda_f(\lambda_a, \lambda_b) \approx \psi^*(\lambda_a)\,\psi^*(\lambda_b)\,\min(|\lambda_a|, |\lambda_b|) \tag{14}$$
$$\lambda_g(\hat{s}, \lambda_a, \lambda_b) = \lambda_a (-1)^{\hat{s}} + \lambda_b \tag{15}$$
where $|X|$ is the magnitude of variable $X$ and $\psi^*(X)$ its sign, defined as:
$$\psi^*(X) = \begin{cases} 1, & \text{when } X \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{16}$$
Equations (14) and (15) allow a considerably simpler hardware implementation than their counterparts in the LR domain. Furthermore, Figure 2-5 illustrates that, despite the approximation involved in (14), its influence on decoding performance is minimal.
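To make the update rules concrete, the following sketch implements a recursive SC decoder using the min-sum $f$ and $g$ functions. It is a behavioral model only and does not reflect the hardware schedule; it assumes natural bit ordering, LLRs defined so that a non-negative value favors bit 0, and frozen bits equal to 0. The names `f_ms`, `g_ms`, `encode` and `sc_decode` are hypothetical.

```python
def f_ms(a, b):
    """Min-sum f: product of the signs times the smaller magnitude."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g_ms(s_hat, a, b):
    """g update: (-1)^s_hat * a + b."""
    return b - a if s_hat else b + a


def encode(u):
    """x = u * F^{(x)n} over GF(2), natural order (used for the demo below)."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    left, right = encode(u[:half]), encode(u[half:])
    return [left[i] ^ right[i] for i in range(half)] + right


def sc_decode(llrs, frozen):
    """Return (u_hat, x_hat): decoded bits and their re-encoded partial sums.

    llrs   -- LLRs of this sub-block
    frozen -- frozen[i] is True when u_i is a frozen bit (forced to 0)
    """
    n_bits = len(llrs)
    if n_bits == 1:
        u = 0 if frozen[0] or llrs[0] >= 0 else 1   # threshold hard decision
        return [u], [u]
    half = n_bits // 2
    a, b = llrs[:half], llrs[half:]
    # The f nodes need only the LLRs; they feed the first half of the bits.
    u_left, x_left = sc_decode([f_ms(a[i], b[i]) for i in range(half)], frozen[:half])
    # The g nodes additionally need the partial sums of the first half.
    u_right, x_right = sc_decode(
        [g_ms(x_left[i], a[i], b[i]) for i in range(half)], frozen[half:])
    x_hat = [x_left[i] ^ x_right[i] for i in range(half)] + x_right
    return u_left + u_right, x_hat


u = [0, 0, 0, 1, 0, 1, 1, 0]
llrs = [4.0 if bit == 0 else -4.0 for bit in encode(u)]   # noiseless channel LLRs
u_hat, _ = sc_decode(llrs, [False] * len(u))
print(u_hat == u)
```

The recursion makes the data dependency explicit: the $g$ updates of a block cannot start before the first half of that block has been decoded and re-encoded.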
From a hardware perspective, our proposal merges $\lambda_f$ and $\lambda_g$ into a single processing element that uses the sign and magnitude (SM) representation for LLR values, as this simplifies the implementation of (14):
$$\psi(\lambda_f) = \psi(\lambda_a) \oplus \psi(\lambda_b) \tag{17}$$
$$|\lambda_f| = \min(|\lambda_a|, |\lambda_b|) \tag{18}$$
where $\psi(X)$, like $\psi^*(X)$, describes the sign of variable $X$, although in a way that is compatible with the sign and magnitude representation:
$$\psi(X) = \begin{cases} 0, & \text{when } X \ge 0 \\ 1, & \text{otherwise} \end{cases} \tag{19}$$
These calculations are carried out using a single XOR gate and a $(Q-1)$-bit compare-select (CS) operator, as depicted in Figure 2-7. Function $\lambda_g$, in contrast, is realized using an SM adder/subtractor. In SM format, $\psi(\lambda_g)$ and $|\lambda_g|$ depend not only on $\hat{s}$, $\psi(\lambda_a)$, $\psi(\lambda_b)$, $|\lambda_a|$ and $|\lambda_b|$ but also on the relation between the magnitudes $|\lambda_a|$ and $|\lambda_b|$. For instance, if $\hat{s} = 0$, $\psi(\lambda_a) = 0$, $\psi(\lambda_b) = 1$, and $|\lambda_a| > |\lambda_b|$, then $\psi(\lambda_g) = \psi(\lambda_a)$ and $|\lambda_g| = |\lambda_a| - |\lambda_b|$. This relation between $|\lambda_a|$ and $|\lambda_b|$ is represented by bit $\gamma_{ab}$, which is generated using a magnitude comparator:
$$\gamma_{ab} = \begin{cases} 1, & \text{if } |\lambda_a| > |\lambda_b| \\ 0, & \text{otherwise} \end{cases} \tag{20}$$
The sign $\psi(\lambda_g)$ depends on four binary variables: $\psi(\lambda_a)$, $\psi(\lambda_b)$, $\hat{s}$ and $\gamma_{ab}$. Applying conventional logic minimization techniques to the truth table of $\psi(\lambda_g)$ yields the following simplified boolean equation:
$$\psi(\lambda_g) = \overline{\gamma_{ab}} \cdot \psi(\lambda_b) + \gamma_{ab} \cdot (\hat{s} \oplus \psi(\lambda_a)) \tag{21}$$
where $\oplus$, $\cdot$ and $+$ represent binary XOR, AND and OR, respectively.
As shown in Figure 2-7, the computation of $\psi(\lambda_g)$ requires only an XOR gate and a multiplexer. Notably, $\gamma_{ab}$ is already available from the CS operator, which is shared between $\lambda_f$ and $\lambda_g$.
The magnitude $|\lambda_g|$, on the other hand, is the sum or difference of $\max(|\lambda_a|, |\lambda_b|)$ and $\min(|\lambda_a|, |\lambda_b|)$:
$$|\lambda_g| = \max(|\lambda_a|, |\lambda_b|) + (-1)^{\chi} \min(|\lambda_a|, |\lambda_b|) \tag{22}$$
$$\chi = \hat{s} \oplus \psi(\lambda_a) \oplus \psi(\lambda_b) \tag{23}$$
where bit $\chi$ determines whether $\min(|\lambda_a|, |\lambda_b|)$ should be inverted or not. The implementation of $|\lambda_g|$ involves an unsigned adder, a multiplexer, and a two's complement operator. The two's complement operator is used to negate a number, allowing the unsigned adder to perform subtraction through overflow. This implementation also uses the shared CS operator.
Figure 2-7 Sign and magnitude processing element architecture
Finally, the result of the processing element is selected by bit $B(l,i)$, such that:
$$\lambda_{L_{l,i}} = \begin{cases} \lambda_f, & \text{when } B(l,i) = 0 \\ \lambda_g, & \text{otherwise} \end{cases} \tag{24}$$
In [5], the PE architecture calculates $|L_g|$ by converting $\min(|L_a|, |L_b|)$ to two's complement representation and adding it to $\max(|L_a|, |L_b|)$. This operation involves a long carry path through the magnitude comparator, the two's complement conversion block (an adder), and the adder. Shortening this critical path makes it possible to increase the maximum clock frequency $f_{max}$.
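The sign-and-magnitude processing element can be checked with a small bit-level model. The sketch below assumes that $\gamma_{ab} = 1$ when $|\lambda_a| > |\lambda_b|$, that the inversion bit is $\chi = \hat{s} \oplus \psi(\lambda_a) \oplus \psi(\lambda_b)$, and that magnitudes are unbounded integers (no saturation); the helper names are hypothetical.

```python
def to_sm(x):
    """Integer -> (sign bit, magnitude); sign bit is 0 for x >= 0."""
    return (0 if x >= 0 else 1, abs(x))


def from_sm(sign, mag):
    """(sign bit, magnitude) -> integer; a negative zero maps to 0."""
    return -mag if sign else mag


def pe_sm(b_li, s_hat, lam_a, lam_b):
    """Merged f/g processing element operating on sign-magnitude values.

    b_li selects the output: 0 -> lambda_f, 1 -> lambda_g.
    """
    psi_a, mag_a = to_sm(lam_a)
    psi_b, mag_b = to_sm(lam_b)

    # Shared compare-select: gamma_ab plus the sorted magnitudes.
    gamma_ab = 1 if mag_a > mag_b else 0
    larger, smaller = (mag_a, mag_b) if gamma_ab else (mag_b, mag_a)

    # f: XOR of the signs, minimum of the magnitudes.
    psi_f, mag_f = psi_a ^ psi_b, smaller

    # g: sign from a 2:1 multiplexer controlled by gamma_ab.
    psi_g = (s_hat ^ psi_a) if gamma_ab else psi_b
    # chi tells whether the smaller magnitude is subtracted or added.
    chi = s_hat ^ psi_a ^ psi_b
    mag_g = larger - smaller if chi else larger + smaller

    return (psi_g, mag_g) if b_li else (psi_f, mag_f)


# Example from the text: s_hat = 0, lambda_a = +5, lambda_b = -3.
print(pe_sm(1, 0, 5, -3))
```

Comparing the model against plain signed arithmetic over a range of inputs is a quick way to validate the boolean sign equation before writing RTL.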
MAIN DESIGN, ALGORITHM OF THE THESIS
Architectural improvements
Figure 3-1 Enhanced semi-parallel SC decoder high-level architecture
Figure 3-2 Schedule for original reference and enhanced semi-parallel SC decoder
The reference architecture [5] of the semi-parallel SC decoder consists of four key components: an array of $P$ processing elements (PEs), the log-likelihood ratio (LLR) memory, the partial-sum update logic with the corresponding storage registers, and the controller, as depicted in Figure 3-1. Additionally, a read-only memory (ROM), implemented as a lookup table (LUT), stores the frozen bit indices corresponding to a specific channel signal-to-noise ratio, following the methodology described in [15].
To enhance the decoder's throughput, the architecture is modified to reduce the number of cycles required to decode one codeword by $N/2$. This is achieved by decoding two bits in parallel when the decoder is in the last stage ($s = 0$), exploiting the fact that two consecutive $\hat{u}_i$ are obtained from an $f$ node and a $g$ node that share the same input LLRs. Since stage 0 is activated $N$ times per codeword, merging each $f$/$g$ pair into one cycle saves $N/2$ cycles. However, the $g$ node needs the output of the preceding $f$ node as its $\hat{u}_{s0,z}$ input, and the reference semi-parallel architecture therefore computes the $g$ node in the cycle after the $f$ node [5]. In the modified architecture, both possible $g$-node outputs are calculated speculatively while the $f$ output is computed, and the correct $g$-node output is then selected with a negligible additional combinational delay. Figure 3-2 illustrates the new, shortened schedule of $f$ and $g$ nodes for parallel decoding of two bits for $N = 8$ and $P = 4$, denoted CCparallel.
In terms of area, only one of the $P$ processing elements needs to perform this parallel decoding, so the increase in area is barely noticeable. However, since two bits are decoded in parallel, both must be taken into account in the $\hat{u}_s$ memory update logic, which slightly increases overall complexity.
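The speculative computation at stage 0 can be illustrated with a short behavioral model. The sketch below assumes min-sum updates, a threshold hard decision in which a non-negative LLR gives bit 0, and frozen bits equal to 0; all function names are hypothetical, and the comments map the operations onto clock cycles only loosely.

```python
def f_ms(a, b):
    """Min-sum f update."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))


def g_ms(s_hat, a, b):
    """g update: (-1)^s_hat * a + b."""
    return b - a if s_hat else b + a


def hard_decision(llr, frozen):
    """Frozen bits are forced to 0; otherwise threshold the LLR at zero."""
    return 0 if frozen or llr >= 0 else 1


def decode_pair_sequential(a, b, frozen_even, frozen_odd):
    """Reference behavior: f in one cycle, g in the next one."""
    u_even = hard_decision(f_ms(a, b), frozen_even)          # cycle t
    u_odd = hard_decision(g_ms(u_even, a, b), frozen_odd)    # cycle t + 1
    return u_even, u_odd


def decode_pair_speculative(a, b, frozen_even, frozen_odd):
    """Modified behavior: both g candidates are computed alongside f."""
    g_if_zero = g_ms(0, a, b)      # candidate for u_even = 0
    g_if_one = g_ms(1, a, b)       # candidate for u_even = 1
    u_even = hard_decision(f_ms(a, b), frozen_even)
    selected = g_if_one if u_even else g_if_zero             # 2:1 multiplexer
    return u_even, hard_decision(selected, frozen_odd)       # same cycle


print(decode_pair_speculative(-3, 5, False, False))
```

Because the two candidates do not depend on the $f$ result, only the final multiplexer sits behind the hard decision of the even bit, which is why the added delay is small.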
Optimized PE Implementation
The architecture proposed for all processing elements (PEs) is an enhancement of the PE architecture in [5], where both functions $f$ and $g$ are merged in a single PE, sharing a comparator and an XOR gate between the two functions. In the improved architecture, the LLRs are stored in sign-and-magnitude form. The value of $sign(L_f)$ is determined by $sign(L_a) \oplus sign(L_b)$, whereas $|L_f|$ is $\min(|L_a|, |L_b|)$.
To increase the maximum clock frequency ($f_{max}$), the critical paths within the processing element architecture of [5] must be shortened. The proposed architecture achieves this by calculating all three possible values of $|L_g|$ in parallel, namely $(|L_a| - |L_b|)$, $(|L_b| - |L_a|)$, and $(|L_a| + |L_b|)$, then selecting the correct output based on $\hat{u}_s$, $sign(L_a)$, and $sign(L_b)$, and finally saturating as required. This parallelization shortens the critical path, leading to a higher $f_{max}$. The optimized architecture is marked as "PE enhanced" in Figure 3-3. The value of $sign(L_g)$ is given by $\hat{u}_s \oplus sign(L_a)$ when $|L_a| > |L_b|$, and by $sign(L_b)$ otherwise. This enhanced architecture reduces the delay within the processing element (PE) compared to [5] at the cost of increased area. Given that the PEs account for only 5.5% of the total ALUTs (for $P = 64$), the impact on total area is small while the circuit $f_{max}$ improves significantly.
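The selection among the three precomputed magnitudes can be modeled as follows. This is a sketch under several assumptions that are not stated in the text: the quantization width is taken as $Q = 5$ bits, the comparison $|L_a| > |L_b|$ is derived from the sign of one of the two subtractions, and saturation clamps the magnitude to $2^{Q-1} - 1$. The function name is hypothetical.

```python
def g_enhanced(u_s, sign_a, mag_a, sign_b, mag_b, q_bits=5):
    """Return (sign bit, magnitude) of L_g using three parallel candidates."""
    max_mag = (1 << (q_bits - 1)) - 1     # largest magnitude in Q-bit SM format

    # The three candidates are independent and can be computed concurrently.
    diff_ab = mag_a - mag_b
    diff_ba = mag_b - mag_a
    total = mag_a + mag_b

    # Assumption: the comparator result is the borrow of the second subtractor.
    a_larger = diff_ba < 0
    # Effective sign of the first operand after applying (-1)^u_s.
    eff_sign_a = u_s ^ sign_a

    if eff_sign_a == sign_b:
        mag = total                        # same sign: magnitudes add
    elif a_larger:
        mag = diff_ab                      # opposite signs, |L_a| dominates
    else:
        mag = diff_ba                      # opposite signs, |L_b| dominates

    sign = eff_sign_a if a_larger else sign_b
    return sign, min(mag, max_mag)         # saturate as the last step


print(g_enhanced(u_s=1, sign_a=0, mag_a=9, sign_b=1, mag_b=12))
```

In hardware terms, the adder and both subtractors start at the same time, so the final selection replaces the comparator, two's complement block and adder that would otherwise sit in series.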
Figure 3-3 RTL architecture of a standard PE
Additionally, a special PE, named PE0 in Figure 3-3, is introduced that can compute two decoded bits in parallel as described in section 3.1. PE0 has an additional $g$ node output for parallel decoding, which is used in stage 0, as depicted in Figure 3-1. PE0 does not replicate a full $g$ node but shares the speculative computations of the standard PE. PE0 uses eight additional 2-input MUXs compared to a standard PE. PE0 also functions as a standard PE when used in stages 1, 2, … Although PE0 is larger than the standard PEs, its impact on the total area is minimal, as this change affects only a single PE in the entire design. The delay through PE0 is virtually the same as that of the other PEs.
LLR Memory
During the decoding process, the processing elements (PEs) compute log-likelihood ratios (LLRs) that are reused in subsequent steps. To allow this reuse, the decoder must store intermediate estimates in memory. As demonstrated in [7], $2N - 1$ $Q$-bit memory elements suffice to store the received vector and track all
The memory holding the intermediate LLR estimates can be represented as a tree with multiple levels, each storing the LLRs of one stage of the decoding graph. The leaf nodes contain the channel LLRs, while the root node provides the decoded bits.
To let the processing elements operate in a single clock cycle without additional delays, their inputs must be read and their outputs written within the same clock cycle. Although a register-based architecture, proposed in [7] for the line decoder, would be a straightforward solution, preliminary synthesis results revealed routing and multiplexing challenges, especially for the very large code lengths required by polar codes. Instead, a parallel-access approach using random access memory (RAM) is proposed. In a polar code decoder, PEs consume twice as much information as they produce. Therefore, the semi-parallel architecture uses a dual-port RAM with a write port of width $PQ$ and a read port of width $2PQ$, along with a specific data placement in memory. This RAM-based approach not only meets the operational requirements but also reduces the area per stored bit compared to the register-based approach.
Figure 3-4 Mirrored decoding graph for N=8
Within each memory word, LLRs must be properly aligned so that data is presented coherently to the PEs. For instance, in the {N = 8; P = 2} semi-parallel SC decoder depicted in Figure 2-2, $\lambda_{L_{1,0}}$ and $\lambda_{L_{1,1}}$ are computed by accessing a memory word containing $\{\lambda_{L_{2,0}}, \lambda_{L_{2,2}}, \lambda_{L_{2,1}}, \lambda_{L_{2,3}}\}$ in bit-reversed order as per the indexing scheme of [1]. This bit-reversal leads to a mirrored decoding graph, as illustrated in Figure 3-4, with bit-reversed vectors for the channel information and the decoded output $\hat{u}$.
The chosen ordering is beneficial because it allows the processing elements to access contiguous values in memory. For instance, the previously discussed LLRs $\{\lambda_{L_{2,0}}, \lambda_{L_{2,2}}, \lambda_{L_{2,1}}, \lambda_{L_{2,3}}\}$ are now located at LLR locations {8, 9, 10, 11} in memory, following the received vector $y$. This applies to any nodes emulated by a PE in the mirrored graph, enabling the decoder to feed a contiguous block of memory (word 2 in Figure 3-5 in the example) directly to the PEs. Note that this requires storing the received vector $y$ in bit-reversed order in memory, a modification easily implemented by adjusting the order in which the encoder transmits the codeword over the channel.
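The bit-reversed ordering referred to above can be generated with a few lines of code. The sketch below is illustrative only; the function names are hypothetical.

```python
def bit_reverse(index, width):
    """Reverse the `width` least significant bits of `index`."""
    reversed_index = 0
    for _ in range(width):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(length):
    """Indices 0..length-1 listed in bit-reversed order (length = 2^width)."""
    width = length.bit_length() - 1
    return [bit_reverse(i, width) for i in range(length)]


# Four stage-2 LLRs: natural indices 0, 1, 2, 3 are stored in this order.
print(bit_reversed_order(4))
# Received vector of an N = 8 code.
print(bit_reversed_order(8))
```

For four values this reproduces the index order 0, 2, 1, 3 of the memory word discussed above.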
To simplify memory address generation, the specific structure and data placement illustrated in Figure 3-5 are used. In this design, unused values computed by the PEs in stages where $l \le p$ are also stored in memory, which preserves a regular structure. This arrangement allows a direct connection between the dual-port RAM and the PEs, eliminating the need for complex multiplexing logic or interconnection networks. However, this layout incurs an overhead of $Q(2P\log_2 P + 1)$ bits over the minimum required amount of memory. This overhead is constant with respect to $N$, so its share of the overall memory requirements decreases as the code length increases for a fixed $P$. For instance, the approach requires an additional $769Q$ bits of RAM for $P = 64$, irrespective of the code size. The overhead for a $\{N = 1024; P = 64\}$ decoder is approximately 37.6%, and this percentage falls to around 1.17% for an $N = 32{,}768$ decoder with the same $P$.
Unlike [5], the LLR memory in this approach is implemented using registers to achieve a higher clock frequency, and is connected to the PEs through a multiplexer network. Given that each PE uses two $Q$-bit values to calculate one $Q$-bit value, the registers are of size $Q$.
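The overhead figures can be reproduced numerically. The sketch below expresses memory in units of $Q$-bit words and assumes that the minimum memory requirement is the $2N - 1$ words mentioned earlier for the received vector and the intermediate LLRs; the function names are hypothetical.

```python
import math


def overhead_words(num_pes):
    """Extra Q-bit words required by the regular layout: 2 * P * log2(P) + 1."""
    return 2 * num_pes * int(math.log2(num_pes)) + 1


def overhead_percent(n_bits, num_pes):
    """Overhead relative to the assumed minimum of 2N - 1 Q-bit words."""
    return 100.0 * overhead_words(num_pes) / (2 * n_bits - 1)


print(overhead_words(64))
for n_bits in (1024, 32768):
    print(n_bits, f"{overhead_percent(n_bits, 64):.2f}%")
```

The overhead depends only on $P$, so doubling the code length roughly halves its relative cost.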
Figure 3-5 Organization of the LLR memory for N=8 and P=2 with uniform memory block size.
Partial Sum Registers
Throughout the decoding process, specific partial sums are required as inputs to $\lambda_g$ in the processing elements. Additionally, whenever a bit $\hat{u}_i$ is estimated, many of these partial sums typically need updating.
Unlike the likelihood estimates in the LLR memory, partial sums do not have a regular structure that would allow them to be packed into memory words. Storing them in RAM would result in scattered memory accesses requiring multiple clock cycles, potentially reducing decoder throughput. To avoid this, the partial sums are stored in registers. Each $g$ node in the decoding graph is mapped to a specific flip-flop in the partial sum register. The partial sum update logic module, detailed in Section 3.9, updates the values of this register each time a bit $\hat{u}_i$ is estimated.
To efficiently store partial sums, N-1 bits suffice when employing time-multiplexing Nodes can be grouped into 2^l groups at each stage l, requiring only one memory bit per group In Figure 2-2, all partial sums in stage 0 are stored in a single bit, reset every odd clock cycle Similarly, in stage 1, the nodes are grouped into two partial sums, stored in the same locations and reset at clock cycles 3 and 7 This mapping, shown in Figure 3-6 for N = 8, ensures each partial sum is assigned to a specific 1-bit flip-flop.
Figure 3-6 Architecture of the partial sum registers with N=8
Partial Sum Update Logic
Every computation of function $\lambda_g$ requires a specific input $\hat{s}_{l,z}$, corresponding to a sum of a subset of the previously estimated bits $\hat{u}_0^{N-1}$, as indicated by equation (5). The subset of $\hat{u}_0^{N-1}$ needed for the $g$ node with index $z$ when decoding bit $i$ in stage $l$ is determined by the indicator function
(26) where $\cdot$ and $\prod$ denote the binary AND operation, $+$ the binary OR operation, and $B(a,b) \triangleq \lfloor b/2^a \rfloor \bmod 2$. An estimated bit $\hat{u}_i$ is included in the partial sum if the corresponding indicator function value is 1. For instance, the values of the indicator function when $N = 8$ and $l = 2$ are
0 0 ] and the first four partial sums are
Using the indicator function, the general form of the partial sum update equation is
𝑠̂ 𝑙,𝑧 = ∐ 𝑁−1 𝑖=0 𝑢̂ 𝑖 𝐼(𝑙, 𝑖, 𝑧) (27) where ∐ is the binary XOR operation
Concerning hardware implementation, since each evaluation of function 𝑔 requires a distinct partial sum 𝑠̂_{𝑙,𝑧}, flip-flops are employed to store all the required combinations. Because the hard decisions 𝑢̂_𝑖 are obtained sequentially as decoding advances, the content of flip-flop (𝑙, 𝑧) is generated by adding 𝑢̂_𝑖 (modulo 2) to the current flip-flop value if 𝐼(𝑙, 𝑖, 𝑧) = 1. Otherwise, the flip-flop value remains unaltered.
Through the time multiplexing described in the preceding section, the indicator function can be further simplified:
$$I(l, i, \breve{z}) = \overline{B(l, i)} \cdot \prod_{v=0}^{l-1} \left( \overline{B(v, \breve{z})} + B(v, i) \right) \tag{28}$$
where the overbar denotes binary NOT and 𝑧̆ corresponds to the index of the flip-flops within a stage. Since time multiplexing is used in the partial sum registers, a flip-flop in stage 𝑙 effectively holds 2^{𝑛−𝑙−1} partial sums at different points in the decoding process. Both indexing methods are illustrated in Figure 3-6.
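The sequential flip-flop update can be checked in software against a direct polar transform of the decoded bits. The sketch below assumes the simplified indicator takes the form I(l, i, z̆) = NOT B(l, i) AND the product over v = 0..l−1 of (NOT B(v, z̆) OR B(v, i)), with B(a, b) = ⌊b / 2^a⌋ mod 2. All function names are illustrative, and the code models only the first block of 2^l decoded bits, before any register reset.

```python
def B(a, b):
    """Bit a of the integer b, i.e. floor(b / 2**a) mod 2."""
    return (b >> a) & 1


def indicator(l, i, z):
    """Assumed simplified indicator for flip-flop z of stage l and bit index i."""
    if B(l, i):
        return 0
    for v in range(l):
        if B(v, z) and not B(v, i):  # factor (NOT B(v, z) OR B(v, i)) is 0
            return 0
    return 1


def partial_sums_sequential(u_hat, l):
    """Update the 2**l flip-flops of stage l as each bit estimate arrives."""
    regs = [0] * (1 << l)
    for i, u in enumerate(u_hat):
        for z in range(len(regs)):
            if indicator(l, i, z):
                regs[z] ^= u  # modulo-2 addition of the new estimate
    return regs


def polar_transform(u):
    """Natural-order polar transform (no bit reversal), used as a reference."""
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


u_hat = [1, 0, 1, 1]
print(partial_sums_sequential(u_hat, 2))  # [1, 1, 0, 1]
print(polar_transform(u_hat))             # [1, 1, 0, 1]
```

Under this assumption the stage-2 flip-flops hold û0⊕û1⊕û2⊕û3, û1⊕û3, û2⊕û3 and û3 once the first four bits are decoded, which matches the re-encoded values of those bits.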
Frozen Channel ROM
A polar code is entirely defined by its code length 𝑁 and the indices of the frozen bits in the vector 𝒖.
Our architecture incorporates a 1-bit ROM of size 𝑁 to store the indices of those frozen bits. Each generated soft output 𝜆 𝐿 0,𝑖 passes through the 𝑢_𝑖 computation block, which sets the output to the frozen bit value if indicated by the ROM's contents and otherwise makes a threshold-based hard decision. This ROM is addressed directly by the current decoded bit index 𝑖.
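The decision rule of this block can be sketched as follows. Two assumptions are made that the text does not state: the frozen bit value is 0, and a negative LLR maps to bit 1. The names `decide_bit` and `decode_decisions` are illustrative only.

```python
FROZEN_VALUE = 0  # assumption: frozen bits are fixed to 0


def decide_bit(llr, is_frozen):
    """Return the estimate for one position from its soft output and ROM flag."""
    if is_frozen:
        return FROZEN_VALUE
    return 1 if llr < 0 else 0  # assumed sign convention for the threshold


def decode_decisions(llrs, frozen_rom):
    """Apply the rule to every position; frozen_rom[i] is the 1-bit ROM word at address i."""
    return [decide_bit(llr, flag) for llr, flag in zip(llrs, frozen_rom)]


print(decode_decisions([-3, 5, -1, 2], [1, 0, 0, 0]))  # [0, 0, 1, 0]
```

Swapping in a different `frozen_rom` list changes the code configuration without touching the decision logic, which mirrors the reprogrammability of the ROM.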
Note that this implementation allows the decoder to be reprogrammed for different operating configurations by replacing the ROM's contents with a different set of frozen bits, thereby decoupling the decoder's architecture from its operational parameters. Since a polar code is designed for specific channel conditions, such as the noise variance of an AWGN channel, replacing the ROM with a RAM would allow the frozen bit indices to be adapted to the current channel conditions.
Controller
The controller module orchestrates the decoding process by computing the control signals, including the current decoded bit 𝑖, the current stage 𝑙, and the portion 𝑝_𝑠 of a stage.
The calculation of 𝑖 is straightforward, employing a simple 𝑛-bit counter enabled each time the decoder reaches 𝑙 = 0.
The determination of the stage number 𝑙 involves a slightly more intricate computation, illustrated in Figure 2-4, as it depends on both 𝑖 and 𝑝_𝑠. Whenever 𝑝_𝑠 ≥
After the value of 𝑖 is updated, the index 𝑙 is set to the position of the first set bit in the binary representation of 𝑖. If 𝑖 wraps around after reaching its maximum value (𝑁−1), 𝑙 is set to 𝑛−1 instead.
As stage 𝑙 needs 2^𝑙 calculations, it requires ⌈2^𝑙 / 𝑃⌉ uses of the processing elements, and therefore the same number of clock cycles, to perform all computations associated with this stage.
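The two controller rules above, the stage selected after 𝑖 is updated and the cycle count per stage, can be sketched in software as follows. The sketch assumes that the wrap-around case corresponds to 𝑖 returning to 0; the function names are illustrative and do not come from the RTL.

```python
def next_stage(i, n):
    """Stage index selected after the bit counter is updated to i."""
    if i == 0:  # assumed wrap-around case: no bit of i is set
        return n - 1
    return (i & -i).bit_length() - 1  # position of the lowest set bit of i


def stage_cycles(l, P):
    """Clock cycles needed for the 2**l calculations of stage l with P PEs."""
    return ((1 << l) + P - 1) // P  # integer form of ceil(2**l / P)


n, P = 10, 64  # N = 1024 with 64 processing elements
print([next_stage(i, n) for i in range(8)])     # [9, 0, 1, 0, 2, 0, 1, 0]
print([stage_cycles(l, P) for l in range(n)])   # [1, 1, 1, 1, 1, 1, 1, 2, 4, 8]
```

Stages with 2^𝑙 ≤ 𝑃 finish in a single cycle, so only the upper stages pay the semi-parallel penalty.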
Figure 3-8 RTL design of the stage number
Hence, a simple counter is employed to keep track of 𝑝_𝑠, and it is reset when the aforementioned condition is met.
Figure 3-9 RTL design of the portion 𝑝 𝑠 of a stage
To control the LLR memory module, the controller determines the read and write addresses, either to supply LLR data to the processing elements or to write the output data of the processing elements back to the LLR memory.
Figure 3-10 RTL design of the LLR memory read/write address
Throughout the decoding process, the controller also generates the address used to read data from the partial sum register in order to calculate the 𝑔 function.
Figure 3-11 RTL design of the partial sum register read address
Finally, the controller determines the function (𝑓 or 𝑔) performed by the processing elements based on i[0].
Figure 3-12 RTL design of the F/G function selecting signal
IMPLEMENTATION RESULTS
Polar Code Encoder & Decoder on Matlab
A (1024, 512) polar code with SC encoding and decoding was implemented in MATLAB using BPSK modulation over an AWGN channel. The system model and its functions are shown in the block diagram below:
Figure 4-1 Polar code system with BPSK-AWGN channels
We randomly generate 100,000 blocks, each consisting of a 512-bit message, the 1024-bit encoder output, 1024 quantized BPSK-AWGN channel values, and the MATLAB decoder output, to support verification in the ModelSim simulation.
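The channel and quantization steps of such a test bench can be sketched as below. This is not the MATLAB script of Figure 4-2. It assumes the mapping 0 → +1 and 1 → −1, unit symbol energy, a noise variance of 1 / (2·R·Eb/N0), and a hypothetical scaling factor applied before rounding. A random bit vector stands in for the encoder output.

```python
import math
import random


def bpsk_awgn(bits, ebno_db, rate, rng):
    """Map bits to BPSK symbols (0 -> +1, 1 -> -1) and add Gaussian noise."""
    sigma = math.sqrt(1.0 / (2.0 * rate * 10.0 ** (ebno_db / 10.0)))
    return [(1.0 - 2.0 * b) + rng.gauss(0.0, sigma) for b in bits]


def quantize(value, q_bits, scale):
    """Scale, round and saturate to the signed two's complement range of q_bits."""
    lo = -(1 << (q_bits - 1))
    hi = (1 << (q_bits - 1)) - 1
    return max(lo, min(hi, round(value * scale)))


rng = random.Random(0)                                  # fixed seed for repeatability
codeword = [rng.randint(0, 1) for _ in range(1024)]     # stand-in for encoder output
received = bpsk_awgn(codeword, ebno_db=3.0, rate=0.5, rng=rng)
samples = [quantize(y, q_bits=5, scale=4.0) for y in received]
print(len(samples), min(samples) >= -16, max(samples) <= 15)  # 1024 True True
```

Writing `samples` to a text file, one value per line, would give a memory image that a simulator test bench can load.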
Figure 4-2 MATLAB test bench data generating scripts
Figure 4-3 Test bench data generated by MATLAB
512-bit message randomly generated by MATLAB:
E746971A518DE08AA81DC63A6EE07E64AF9239E7DD55438C31D084A1CD8C BA53A1B33763EAE8CB8AABDD3A51C735D53AF7B9AE382D97FB92B63E5BB E4427228D
1024-bit encoder output:
DC29667A0AB9B3B5385A157CEA23AB0A67911DC645058DEC281C54F52579B74DF7F85871A08E162BFCF8E6D6A364659C5B34DF527FC1AC955E6FFE65A697A7665C63B4D9D462189383ED4DC6969C13ACFDF75D3E18942DF74A445DD7194FC71560005EA3A52BA5B050FD6A1D04A5C587D6E04BDBF92E14332B85233641DFCF83
BPSK modulation over an AWGN channel output:
Figure 4-4 BPSK modulation over an AWGN channel
MATLAB polar decoder output, 512-bit message (same as the input message):
E746971A518DE08AA81DC63A6EE07E64AF9239E7DD55438C31D084A1CD8C BA53A1B33763EAE8CB8AABDD3A51C735D53AF7B9AE382D97FB92B63E5BB E4427228D
Table 4-1 Simulation results: frame error rate of polar codes of length N = 1024 from the 3GPP 5G standard under SC decoding with offset min-sum. All simulations were performed using BPSK modulation over an AWGN channel.
Columns (R = 1/2), reported for successive cancellation list decoding and for successive cancellation decoding: EbNodB, FER_sim, BER_sim, Nblkerrs, Nbiterrs, Nblocks
Polar Code Decoder Simulation on ModelSim
With MATLAB's support, ModelSim reads the quantized BPSK-AWGN channel data into the test design memory and runs verification and simulation for the 100,000 generated test cases:
Figure 4-5 ModelSim test bench scripts
Figure 4-6 Memory loading and decoding process in ModelSim
Figure 4-7 ModelSim simulation result
In simulation, our decoder finishes decoding a 1024-bit codeword in only 1572 clock cycles, compared with 2084 cycles for the semi-parallel decoder of [5].
The output message is the same as the MATLAB decoder output and the original message:
E746971A518DE08AA81DC63A6EE07E64AF9239E7DD55438C31D084A1CD8C
BA53A1B33763EAE8CB8AABDD3A51C735D53AF7B9AE382D97FB92B63E5BB
More test results are shown in the Appendix. As Figure 7-21 shows, the decoder produces incorrect messages with 5 quantization bits, the same width as used in [5]. This is expected, because some F and G calculations overflow the 5-bit range. The error rate could be decreased by increasing the number of quantization bits.
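The overflow effect can be illustrated with a small fixed-point model. The sketch assumes the usual min-sum form of the two functions, f(a, b) = sign(a)·sign(b)·min(|a|, |b|) and g(a, b, s) = b + (1 − 2s)·a, and compares two's complement wrap-around with saturation. The helper names are illustrative and do not reflect the RTL.

```python
def f_min_sum(a, b):
    """Min-sum approximation of the f function (assumed form)."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))


def g_func(a, b, s):
    """g function with partial sum s (assumed form): b + a if s == 0, else b - a."""
    return b - a if s else b + a


def wrap(x, q_bits):
    """Keep only the low q_bits bits, interpreted as two's complement."""
    modulus = 1 << q_bits
    x %= modulus
    return x - modulus if x >= modulus // 2 else x


def saturate(x, q_bits):
    """Clamp x to the signed range of q_bits bits."""
    lo, hi = -(1 << (q_bits - 1)), (1 << (q_bits - 1)) - 1
    return max(lo, min(hi, x))


exact = g_func(12, 10, 0)
print(exact, wrap(exact, 5), saturate(exact, 5))  # 22 -10 15
print(wrap(f_min_sum(-16, -16), 5))               # -16
```

With 5 bits, a wrapped g result can flip sign, and even f can overflow when both inputs are −16, since +16 is not representable. A wider word or saturating arithmetic avoids both cases.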
Polar Code Decoder Function test on FPGA DE10
Figure 4-8 Function test on FPGA DE10
Using the Signal Tap Logic Analyzer of the Quartus tool, we could easily capture the output message of the decoder and compare it with the MATLAB or ModelSim results.
In this case, the FPGA DE10 gives the same results:
E746971A518DE08AA81DC63A6EE07E64AF9239E7DD55438C31D084A1CD8C
BA53A1B33763EAE8CB8AABDD3A51C735D53AF7B9AE382D97FB92B63E5BB
Polar Code Decoder Synthesis Result on FPGA Stratix IV
We demonstrate an FPGA implementation of our decoder for N = 1024 with 64 processing elements (PEs). In the implementation, the number of quantization bits for internal data was set to 5, the same as in the study of Leroux et al. [5].
To compare this decoder with that of Leroux et al., we selected a Stratix IV FPGA. We used Quartus Prime Version 22.1std.0 Build 915 10/25/2022 as the synthesis tool.
Figure 4-10 Max clock frequency result
Table 4-2 Comparison of our result for 1024-bit polar codes with other architectures on Stratix IV FPGA
The circuit synthesis results are shown in Table 4-2. As these results show, our FPGA decoder achieves 50% more throughput than [5].
Leveraging registers instead of RAM for the LLR memory, as opposed to Leroux's approach, results in significant resource savings. In our design, the LLR memory uses 50568 ALUTs and 10235 registers, leaving 5060 ALUTs and 1046 registers available for other modules. This allocation highlights the cost-saving benefits of our register-based LLR memory implementation.
645 registers and costs only 930 more ALUTs than [5].
CONCLUSION AND FUTURE IMPROVING WORK
We demonstrated an FPGA implementation of our decoder for N = 1024 with 64 processing elements. We met the target results: the decoding latency is reduced by 512 clock cycles, fmax is 203 MHz, and the throughput is 132 Mbps. Our FPGA decoder therefore achieves 50% more throughput than [5] without significantly increasing the hardware resources.
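The reported figures are mutually consistent if throughput is counted in coded bits, that is, N bits per decoded frame. That convention is an assumption here, not something the text states. A short check using only the numbers quoted above:

```python
def throughput_mbps(n_bits, fmax_mhz, cycles):
    """Bits decoded per microsecond: n_bits every `cycles` clock periods at fmax."""
    return n_bits * fmax_mhz / cycles


cycles_ours = 1572
cycles_ref = 2084  # semi-parallel decoder of [5], as quoted in the text

print(cycles_ref - cycles_ours)                          # 512
print(round(throughput_mbps(1024, 203, cycles_ours), 1)) # 132.2
```

Counting only the 512 information bits per frame would halve the figure, so the 132 Mbps value corresponds to the coded-bit convention.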
Future work will focus on raising the clock frequency by streamlining the partial sum update logic and the related registers, which are currently the main contributors to the critical path in successive cancellation decoding of polar codes.
We could also evaluate the performance and complexity of CRC-aided list SC decoding, which is used in real 5G applications, built on our improved SC decoding architecture.
APPENDIX
Figure 7-2 Flow Non-Default Global Settings
Figure 7-6 Analysis & Synthesis Source Files Read
Figure 7-7 Analysis & Synthesis Resource Usage Summary
Figure 7-8 Analysis & Synthesis Resource Utilization by Entity
Figure 7-9 Analysis & Synthesis Post-Synthesis Netlist Statistics for Top
Figure 7-12 Fitter Resource Usage Summary
Figure 7-13 Fitter Resource Utilization by Entity
Figure 7-15 Timing Analyzer SDC File List
Figure 7-17 Timing Analyzer Fmax Summary at Slow 900mV 85C Model
Figure 7-18 Timing Analyzer Fmax Summary at Slow 900mV 0C Model
Figure 7-19 Timing Analyzer Multicorner Timing Analysis Summary
24b836766352ae2848e2228638293ba5c356f51cfd5f2a9b072cbf675b2a4de8c18ff5d6 d0891fcebcf18fd6a77d80faf02d7114d38858c43e16a9d3d570adf3 de3fcaca61b07354d0e41972ff3b42c13f0b39ce54908a9792414bf50c7d189baca17b35 5e5e19c9d6079b7678c1f4cb366348fe64713fd71cf85bb85ea62f4f
5e812f31c984dc7a76d30443ed6adf7a3a8e74ce3e3d66471ebc7d902d02e5e76113c592 9c62183a8fe10ef5a864ee959d84531292b9cea32f4f3d03c89b6be9
0d386dc0158cd5dc24393d96ec743be488c8f31ad94d3ba5dd892d95aa98ea4579776e5 75c2bbc32798bf252731eff861ff1a17cf49830549872b16df90e4b5c
06d720310d26831accf65f84b20d72e09d33f5c126157bfb3f637a5e252df15df1217e3cf 2bb23cd141984ec960226346f1aae1cf343593e26406abea2da31d7
12bf547fb2431e64e2efdb7bf2f8d0aebc2c15671b2a975d8f80960ae7282b9511358282 d9fe19291c850c2d9849ef67e1067a525603319af4b197e4922c22bd cb13c8d90f1bb21d05b4d848887e98d4032b22ce22014de6e05ed8d7c5409b8b9bf6b73 40f350dcbb1e1013a544ac4134ced0d52ae19ceff92d5076e9fb34d49 da23b5226e75405663a8c78ed1efae12ea6c8e17753abdbe54fad9e635978366a0be69b8 ff83ab7c283004b4e2cbf09250e067561d1e84d2f06c005b4a61a338
1fb009054b81c2e6f642e8daeb94edff9870aff1a6122b9cab5618370417585b2ad599fa2 5d8fa88a450acf1d30d444851ed51c6bc3522a377df4ea528de6c31
0536944ef6c75cde4478daaa953c6ce9f3faf4a33044073574aae07b02f50f53422916492 d8a35080281f3f928cef1a65fe9f1155459f634840424f5b521ef22
08e6db880ef8be67df8a9d7d67f6cb313a4c9651ed96de3211f411f4a2256ed959594eb8 b121c729fc04f03e0ffc82f435503a023aa666ad7de6d4cdc1778db2
6fef28aa1db2a1f544b33d49ddfa32644b3b81f46e6571acd5d13c34e5e0fd3a49f3406d0 eb16689b05f4398b3d7980c74c43958178a72ea9af6ff6134787032
Figure 7-20 MATLAB decoder results, same as original message (1-12 test cases)
2cb836766352ae2848e2228638293ba5c356b51cfd5f2a9b072cbf675b2a4de8c18ff5d6 d0891fcebcf18fd6a77d80faf02d7114d38858c41e16a9d3d570adf3 d63fcaca69b06354d0e41972ff3b42413f0b39ce54908a9792414bf50c7d1899aca17b35 5e5e19c9d6079b7678c9f4cb366348fe64713fd71cf85bb85ea62f4f
5e812f31d184cc7a76d30443ed6adffa3a8e34ce3e3d66471ebc7d902d02e5e76113c593 9c62183a8fe10ef5a864ee959d84531292b9cea30f4f3d03c89b6be9
0d386dc00d8cc5dc24393d96ec743b6488cab31ad9cd3ba5dd892d91aa98ea4779776e5 75c2bbc32798bf252731eff861ff1a17cf4983054b872b16df90e4b5c
0ed720310526931accf65f84b20d72609d33b5c126157bfb3f637a5a252df15ff1217e3df 2bb23cd141984ec960226346f1aae1cf343593e26406abea2da31d7
12bf547faa431e64e2efdb7bf2f8d02ebc2c15671baa975dcf80960ee7282b9511358283 d9fe19291e854c2d9849ef67e1067a525603319af4b197e4922c22bd c313c8d91f1bb21d05b4d848887e9854032922ce22814de6e05ed8d3c5409b899bf6b73 40f350dcbb1e1013a544ac4134ced0d52ae19ceffb2d5076e9fb34d49 d223b5227675505663a8c78ed1efae92ea6cce17753abdbe14fad9e235978364a0be69b8 ff83ab7c2a3004b4e2cbf09250e067561d1e84d2f06c005b4a61a338
17b009054b81d2e6f642e8daeb94edff9870eff1a6122b9cab5618330417585b2ad599fb 25d8fa88a650acf1d30d444851ed51c6bc3522a377df4ea528de6c31
0536944ef6c75cde4478daaa953c6ce9f3f8f4a33044073534aae07f02f50f51422916482 d8a35080281f3f928cef1a65fe9f1155459f634a40424f5b521ef22
00e6db8816f8ae67df8a9d7d67f6cbb13a4cd651ed96de3211f411f0a2256ed959594eb8 b121c729fe04b03e0ffc82f435503a023aa666ad7de6d4cdc1778db2
67ef28aa1db2a1f544b33d49ddfa32644b39c1f46e6571ac95d13c34e5e0fd3a49f3406c0 eb16689b25f0398b3d7980c74c43958178a72ea9af6ff6134787032
Figure 7-21 ModelSim decoder simulation and verification results (1-12 test cases)
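The mismatch between the MATLAB and ModelSim listings can be quantified by counting differing bits. The sketch below does this for the first 16 hexadecimal digits of the first test case in each listing; the helper name is illustrative.

```python
def bit_errors(hex_a, hex_b):
    """Count differing bits between two hex strings of equal length (spaces ignored)."""
    a = hex_a.replace(" ", "")
    b = hex_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("inputs must contain the same number of hex digits")
    return bin(int(a, 16) ^ int(b, 16)).count("1")


matlab_prefix = "24b836766352ae28"    # start of the first MATLAB result
modelsim_prefix = "2cb836766352ae28"  # start of the first ModelSim result
print(bit_errors(matlab_prefix, modelsim_prefix))  # 1
```

Applying the same function to complete messages would give the per-frame bit error count for each test case.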
f43d06d2e3337065b828a48d8fe24cdcf484cee3017fc4ca9e8649efe72eb1dff0668115b d3312253a11adb155aadd1567472da176a6dbc8eca01ca70ea744c4 c2688bdc61e0cbf57c36d464bf2af8618d7a61c54c5574bdbc3a7f81e19a3017928cbaca 949ce5b8e0010e8cc6d459a300caa38bebee04a9df7d60bf5e1b0b56
31f1651f4dfe35c280af77edc465d578448ca0933bb40e5b097ec5ef894f5a47bee6f4c4e eb463884127af3455f524ff35a535e22449d5be4ec62704f2f7ce4a d54456417e39b2afc0bdb078cc6f2d6c88ffdf943b3ed38ab4a0d131fca02de09d679f633 c3e126c96ba5d3400a345eea402985b95d76e1e9a16b2b8e014f0cf c59bf374595488408ae1dd02e2b579d199b6a1da436ea9086c5787050fc0962a48cc264 1bea2dee272516abab24e6b54a5aadfac0808bc7d1ddcd9097d48d442 e57d9c734344cef40452ea8c86166e918a12d91d9a8e305f1c15fcde902ee1c9b3df61ec 2758b66401f1a3ff37ef23ff71bc26f4a9098a08682e1b063b02078c e3d5b6175f0b3200498bbb9c378a8e21dba42bf73e524c349f4b2d59f7dddc6bf4a4c272 3e0b700038f08dfd2fd8dc37df03ae1c526ec7a2af12c4b89fc4ec69
8cfd049c928b26677c947aceb736600936514d8b9dc3ccbcecaae8896d9ce16692994de c1296c35ab1cd8fd1f0382402ad4eea03a8db67f93bff37a82bca26cf da9d9319672f67c9bc52d5628c6c375a03d440f0cece685596cd70411edd9ad0d803c7d 708441fe5e25bcaa8df34f3c61997557eb09d2a1c647ad767632b9092 dc0fce5e72dbb66ddfcbbb0102ce9704559d9dab612debe34cbd06da59892cb27d1d62a d2dda8f379abb552840ed401ec1fe5d51456f44442ea5e0c94092d2f4 a4e199ef92b13b34a46d5e0a46d692ed8e4adc6500efb40db3640e501cffcf7418d5e77a 69d4f8ce1a6233a71419e67baeb1d30068e1e1da7532ecd4a333201b f18f8e2e476e9f273b31f27a279c50f50763714c7d70579700ddac884e0687eaa512720d e59dbfa270627fc4e7a44f565048726c1f38ced9c22a321b1303d207
Figure 7-22 MATLAB decoder results, same as original message (13-24 test cases)
f43d06d2f3336065b828a48d8fe24c5cf486cee301ffc4ca9e8649efe72eb1dff0668115bd 3312253811edb155a2dd1567472da176a6dbc8cca01ca70ea744c4 ca688bdc79e0cbf57c36d464bf2af8618d7a61c54c5574bdbc3a7f85e19a3015928cbacb 949ce5b8e0010e8cc6dc59a300caa38bebee04a9df7d60bf5e1b0b56
31f1651f45fe25c280af77edc465d5f8448ea0933b340e5b497ec5ef894f5a47bee6f4c5ee b463884327af3455fd24ff35a535e22449d5be6ec62704f2f7ce4a d54456416639a2afc0bdb078cc6f2dec88ffdf943bbed38ab4a0d135fca02de29d679f633 c3e126c96ba1d3400ab45eea402985b95d76e1e9a16b2b8e014f0cf c59bf374415488408ae1dd02e2b5795199b6a1da43eea9086c5787010fc0962a48cc264 0bea2dee270512abab2466b54a5aadfac0808bc7d3ddcd9097d48d442 e57d9c734344cef40452ea8c86166e918a12991d9a8e305f5c15fcde902ee1cbb3df61ec 2758b66401f1e3ff37e723ff71bc26f4a9098a08682e1b063b02078c e3d5b617570b2200498bbb9c378a8e21dba46bf73ed24c349f4b2d5df7dddc69f4a4c27 33e0b700038f0cdfd2fd0dc37df03ae1c526ec7a2af12c4b89fc4ec69
8cfd049c828b26677c947aceb736608936514d8b9d43ccbcecaae88d6d9ce16492994de d1296c35ab1cd8fd1f0302402ad4eea03a8db67f93bff37a82bca26cf d29d93196f2f67c9bc52d5628c6c37da03d400f0ce4e685596cd70411edd9ad0d803c7d 608441fe5e05b8aa8df34f3c61997557eb09d2a1c447ad767632b9092 d40fce5e7adbb66ddfcbbb0102ce9704559f9dab612debe34cbd06de59892cb07d1d62ad 2dda8f3798bb552840e5401ec1fe5d51456f44440ea5e0c94092d2f4 a4e199ef9ab13b34a46d5e0a46d692ed8e489c6500efb40db3640e541cffcf7618d5e77b 69d4f8ce186273a71419e67baeb1d30068e1e1da5532ecd4a333201b f18f8e2e4f6e9f273b31f27a279c50750763314c7df0579700ddac884e0687e8a512720d e59dbfa270623fc4e7a44f565048726c1f38ced9c22a321b1303d207
Figure 7-23 ModelSim decoder simulation and verification results (13-24 test cases)