Master’s Thesis
Submitted by:
Yadhunandana Rajathadripura Kumaraiah
Supervised by:
Dr. Moritz Harteneck, Alexander Heinz (Rohde & Schwarz)
Fabian Steiner, M.Sc., Peihong Yuan, M.Sc.
Master's Thesis at the
Lehrstuhl für Nachrichtentechnik (LNT)
Technische Universität München (TUM)
Title: Low Latency Polar FEC Chain Development in Software for 5G
Author: Yadhunandana Rajathadripura Kumaraiah
München, 21.11.2018
.......................................................................................
Place, Date (Yadhunandana Rajathadripura Kumaraiah)
Contents
2. Background
2.1. Background of Polar codes
2.1.1. Polar code construction
2.1.2. Encoding
2.1.3. Decoding
2.2. Processor Architecture Background
2.2.1. Cache memory
2.2.2. Instruction pipelining and branch predictors
2.2.3. Vector Processing Units
2.2.4. Recursive function calling mechanism
3. Polar Codes in 5G
3.1. 5G Physical Channels
3.1.1. Physical Broadcast Channel (PBCH)
3.1.2. Physical Downlink Control Channel (PDCCH)
3.1.3. Physical Uplink Control Channel (PUCCH)
3.1.4. Physical Uplink Shared Channel (PUSCH)
Bibliography
List of Figures
2.1. Channel polarization example for binary erasure channel with ε = 0.4
2.2. Butterfly circuit representing Arıkan kernel matrix
2.3. Polar encoder in circuit form for N = 8
2.4. Decoding tree
2.5. Local Decoder
2.6. SC decoding example
2.7. Pruned Decoder Tree
2.8. Memory Hierarchy
2.9. Instruction pipelining
2.10. Vector processing units [1]
List of Tables
Abstract
In this thesis, we study the feasibility of developing the complete polar FEC chain of the 5th generation cellular mobile communication standard (5G) [4] in software, specifically on general-purpose processors. The thesis attempts to meet stringent latency requirements through software, algorithmic and platform-specific optimizations. Many algorithms in the FEC chain are optimized for hardware implementations; implementing these algorithms directly in software results in poor performance. To obtain the best latency on general-purpose processors, these algorithms are modified or reformulated to suit the processor architecture and a software implementation. Initially, both the encoding and decoding FEC chains are implemented naively, without any optimization. Code profiling is performed on this naive implementation to identify the significant latency contributors. We split the algorithms of the significant latency-contributing components into primitive operations. These primitive operations are either optimized in software or mapped to specialized functional units of a general-purpose processor to achieve the best performance. Specialized units include vector processing units (SSE, AVX and AVX2) and cache-prefetching units.
We concentrate on the polar encoding and decoding FEC chains, which are used to transmit and receive control information. Latency-contributing components are identified, and their algorithms are reformulated to avoid or reduce expensive operations. The major latency contributors in the encoding FEC chain are the cyclic redundancy check (CRC) calculation, the polar code construction itself and the polar encoding. For the decoding FEC chain, the subblock deinterleaver, the polar decoder, the parity bit extraction and the CRC calculation constitute the major bottlenecks. The algorithms of these components are reformulated to suit software requirements and implemented using efficient vector processing instruction sets. Algorithms are modified to reduce complexity, and lookup tables are used to avoid complex computations. Other optimizations include function unrolling, avoiding superfluous copy operations, compiler hints for better instruction scheduling and block-wise copying. At the end of both the encoding and the decoding chapter, latency comparisons between the naive and optimized implementations are presented. In the decoding chapter, the latency of the decoder developed in this work is also compared with that of a state-of-the-art decoder.
1. Introduction and Motivation
The ability to perform computations has evolved tremendously since the first computer was designed by Charles Babbage in the 19th century. Towards the end of the 19th century another important event occurred: in 1897 the Italian inventor and engineer Guglielmo Marconi demonstrated radio's ability to maintain continuous contact with ships in the English Channel. A major breakthrough in the development of computers and wireless systems happened in 1948, when scientists at Bell Labs achieved groundbreaking results. Claude E. Shannon published his paper "A mathematical theory of communication", and John Bardeen, Walter Brattain, and William Shockley announced the invention of the transistor. These two landmark events paved the way for the widespread adoption of computers and wireless communication systems in numerous applications. Since then, the telecommunication industry has grown manifold, fueled by advancements in RF and transistor fabrication techniques, miniaturization and very large scale integration. These technological advances made computing devices smaller, cheaper and more reliable. Recent advances in wireless communication have enabled not only short-distance communication such as cellular communication, but also deep-space communication over distances of billions of kilometers.

Today, computing devices and wireless systems have become integral parts of our society. They allow communication between people even in remote areas. The invention of the internet has given people access to a world of information at their fingertips. Until recently, wireless devices were primarily used for information exchange between people. Today's wireless applications are entering new avenues such as industrial automation, telemedicine and autonomous driving. These applications demand ultra-reliability and ultra-low latency. The latest mobile communication standard, 5G, took a giant step towards providing service for such mission-critical applications. 5G has adopted several techniques to meet stringent latency requirements, for example different OFDM numerologies and a flexible frame structure. Traditionally, to achieve stringent latency requirements, wireless communication stacks are implemented in hardware, specifically in
Optimizations for hardware, such as recursive formulation or reducing look-up tables (LUTs) and flip-flops, are not always relevant in software. Consider the conflicts between optimizations targeted at hardware and at software. Most encoder/decoder algorithms are formulated in a recursive form. In hardware implementations, recursive formulations are particularly useful, since the same design can be replicated multiple times without significant effort and without a performance penalty. In software implementations, however, recursion incurs significant overhead, mainly due to a large number of branches, stack allocation/deallocation and pipeline flushing. The next optimization steps in hardware-targeted implementations are minimizing the required memory and the number of flip-flops, since the cost of a hardware implementation depends on the amount of memory and the number of flip-flops required [5]. In contrast, the general-purpose computing world can make use of cheap off-the-shelf memory. A software designer should instead reduce the number of cache misses and branch mispredictions [6]. In addition,
software implementations should also avoid expensive operations such as multiplications, divisions and modulus operations, or reformulate them using inexpensive bitwise operators.

1.2. Polar Forward Error Correction (FEC) Chain Development in Software

[Figure: mapping of polar FEC chain operations (memory accesses, shuffling/interleaving, conditional data copy, floating/integer operations, parity check operations) to specialized processor units (cache prefetching units, SHUFFLE/PERMUTE units).]
2. Background
Polar codes were introduced by Arıkan in his seminal work [8]. They belong to the class of capacity-achieving codes. In the past decade, polar codes have sparked interest from academia and industry alike, resulting in significant research on improving their performance. The 5th generation wireless systems (5G) standardization has adopted polar codes for uplink and downlink control information in the enhanced mobile broadband (eMBB) framework. They are also considered as potential coding schemes for the two other frameworks of 5G, namely ultra-reliable low-latency communications (URLLC) and massive machine-type communications (mMTC).
Polar codes achieve capacity asymptotically for binary-input memoryless channels. Although they are the first capacity-achieving codes with an explicit construction, capacity is approached only asymptotically: at short block lengths, their performance under successive cancellation decoding (SCD) is suboptimal compared to low-density parity-check (LDPC) or Turbo codes. In [9] the authors present an improved version of SCD called the successive cancellation list decoder (SCLD).

The construction of polar codes involves the identification of channel reliability values. Information bits are placed in the K (number of information bits) most reliable bit indices out of N (block-length) positions, and the remaining bits are set to zero. These N bits are passed through a polar encoding circuit to obtain the encoded bits. The selection of reliable indices depends on the code length and the channel signal-to-noise ratio. Due to the varying code lengths and channel conditions in 5G systems, a significant effort has been put into identifying reliable index sets with good error correction performance over different code lengths and channel conditions.
The mathematical foundation of polar codes lies in the polarization effect of the kernel matrix [8]

G = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix},

also called the Arıkan matrix. Polar codes are (N, K) linear block codes of size N = 2^n, where n is a natural number, N is the block length of the code and K is the number of information bits. The generator matrix is G^{\otimes n}, where G^{\otimes n} denotes the nth Kronecker power of G. The encoding process involves the multiplication of the N-bit vector U, consisting of K information bits and N − K frozen bits, with G^{\otimes n}.
In polar coding, the first step is to identify the channel reliability values for a particular block length; this step is also called polar code construction. The basic idea is to produce, out of N (block-length) independent copies of a given binary discrete memoryless channel, synthesized channels which are either completely noiseless or completely noisy. This process of creating extremal channels is called channel polarization. As N → ∞, the fraction of noiseless channels approaches the capacity of the channel. The reliability of the synthesized channels is estimated via the Bhattacharyya parameter [8], which indicates the reliability of an individual channel.

For a generic binary-input discrete memoryless channel (B-DMC) W : X → Y with input alphabet X, output alphabet Y and transition probabilities W(y|x), x ∈ X, y ∈ Y, the Bhattacharyya parameter is given by

Z(W) \triangleq \sum_{y \in \mathcal{Y}} \sqrt{W(y|0)\,W(y|1)}.   (2.1)
2.1. Background of Polar codes
The Bhattacharyya parameter indicates how unreliable the channel is. It is easy to see that Z(W) takes values in [0, 1]: the better the channel, the smaller Z(W). Polarization for N → ∞ creates channels with either Z(W) → 0 or Z(W) → 1.
[Figure 2.1.: Channel polarization example for the binary erasure channel with ε = 0.4; the channel indices split into fully reliable and fully unreliable channels.]
Figure 2.1 illustrates channel polarization for different block lengths for the binary erasure channel with erasure probability ε = 0.4. It can be seen that, as the block length increases, the channels polarize into extremal channels (either completely reliable or completely unreliable).
2.1.2. Encoding
After the N − K frozen bit positions have been found, they are set to zero and the information bits are placed in the remaining K positions. This N-bit vector U is multiplied with the generator matrix obtained as the nth Kronecker power of the Arıkan kernel matrix. The multiplication with the generator matrix can also be represented in circuit form; the Arıkan kernel matrix corresponds to the butterfly circuit shown in Figure 2.2.
For an n = 3 fold Kronecker product, the block length N becomes 8; for this case the encoding circuit is shown in Figure 2.3, which is a repeated application of the butterfly circuit. The marked locations are the frozen bit indices, which are set to zero; the information bits are inserted in the remaining positions. The output of the circuit is a codeword which is transmitted over the channel.

[Figure 2.2.: Butterfly circuit with inputs u0, u1 and outputs c0 = u0 ⊕ u1, c1 = u1.]

Let us consider an example with N = 8 and K = 4; the rate of this code
is R = K/N = 1/2. As given in the figure, the frozen bit indices are {0, 1, 2, 4}; the remaining indices contain information bits. Let the information bits which need to be transmitted be {1,1,0,0}. After placing the information bits at the reliable channel positions, the vector U becomes {0,0,0,1,0,1,0,0}. It is passed through the polar encoding circuit shown in Figure 2.3, and the result at the output of the encoder is {0,0,1,1,1,1,0,0}. These encoded bits are then transmitted over the channel.
[Figure 2.3.: Polar encoder in circuit form for N = 8, mapping inputs u0 ... u7 to codeword bits c0 ... c7.]
The encoding circuit is the recursive application of the transformation represented by the butterfly circuit of Figure 2.2. One butterfly unit transforms two uncorrelated bits (a, b) into two correlated output bits (a ⊕ b, b). This corresponds to a polarization into two channels; in this transformation, the reliability of u1 is increased compared to that of u0. Applying this operation recursively to the whole codeword results in the circuit shown in Figure 2.3. The codeword splits into two parts in stage 3, which again split into two parts in stage 2, and so on, until one reaches a single source bit ui in stage 1. The process of polar encoding for N = 8 thus involves three stages of butterfly operations. Generally, for a given code length N = 2^n, polar encoding consists of n stages, each with N/2 butterfly operations, which results in an encoding complexity of O(N log(N)).
2.1.3. Decoding
its error correction performance is inferior compared to SCL. A naive SCL algorithm is implemented in this work; however, its optimization is considered future work.
The decoding metric can be one of the three types shown below:

• log-likelihood ratio (LLR), where

L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) = \ln\!\left( \frac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \,|\, u_i = 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \,|\, u_i = 1)} \right);   (2.2)

• likelihood ratio (LR), where

LR_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) = \frac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \,|\, u_i = 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \,|\, u_i = 1)};   (2.3)

• log-likelihood (LL) pair, where

LL_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) = \left[ \ln W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \,|\, u_i = 0),\ \ln W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \,|\, u_i = 1) \right].   (2.4)
Decoding metrics computed from LLRs exhibit better numerical stability than those computed from LRs or LLs, so the LLR metric is used throughout this work. There are different ways to view and understand the operation of an SC decoder. In this work, decoding is viewed as a message passing algorithm on a binary tree with log(N) levels. Decoding is performed by traversing the tree from the root to the leaf nodes. The decoding process involves check node (CN) and variable node (VN) operations and threshold detection at the leaf nodes. The decoder receives an LLR value for every bit which needs to be decoded (including both frozen and information bits); hence, for a code with block length N, the SC decoder receives N LLR values. The decoding process estimates the bits ûi, where i = 1, 2, ..., N.
[Figure 2.4.: Decoding tree for N = 8 with levels 0 to 3.]

[Figure 2.5.: Local decoder at node v: incoming LLRs αv, messages αvl, βvl exchanged with the left child vl and αvr, βvr with the right child vr.]
In the decoding tree, the messages to the left child node are computed with the CN operation and those to the right child with the VN operation. Figure 2.5 shows how the messages are exchanged in a local component decoder. The CN and VN operations in the LLR domain are given by the following equations:

• Check node (CN) operation:

α_{v_l}[i] = 2 \tanh^{-1}\!\left( \tanh(α_v[i]/2)\, \tanh(α_v[i + N_v/2]/2) \right),   (2.5)

in practice approximated by the min-sum rule α_{v_l}[i] ≈ sgn(α_v[i]) sgn(α_v[i + N_v/2]) min(|α_v[i]|, |α_v[i + N_v/2]|);

• Variable node (VN) operation:

α_{v_r}[i] = α_v[i + N_v/2] + (1 − 2 β_{v_l}[i])\, α_v[i].   (2.6)

After decoding is done at both the left and the right child node, the bits are combined at the common parent node. The bit combining operation is given by

β_v[i] = β_{v_l}[i] ⊕ β_{v_r}[i]   if i < N_v/2,
β_v[i] = β_{v_r}[i − N_v/2]        otherwise.
In Figure 2.5 and in equations (2.5) and (2.6), αv and βv represent the intermediate LLR values and the estimated bits at the local decoder, respectively.

Figure 2.6 gives an example of SC decoding for block length N = 8 and K = 4 information bits. The 16-bit quantized LLR values are the input to the decoder. In the figure, decoded bits and intermediate βv are shown in black and the computed intermediate αv in green. The frozen pattern, provided below the leaf nodes, indicates the positions of information and frozen bits: a one in the frozen pattern indicates a frozen bit, a zero an information bit.
[Figure 2.6.: SC decoding example for N = 8 with channel LLRs [-4, 1, -1, -4, 1, -6, -6, 5], frozen pattern [1, 1, 1, 0, 1, 0, 0, 0] and decoded bits [0, 0, 0, 1, 0, 1, 1, 0].]
The authors in [13] extend the idea presented in [12] by identifying two additional kinds of special nodes which can be decoded without traversing the tree: single parity check (SPC) and repetition (REP) nodes. Both in [12] and [13], the node type is identified based on the frozen pattern at the component decoder. For an SPC node, only one frozen bit is present, at the leftmost position. For a REP node, the frozen pattern contains one information bit at the rightmost position; the remaining bits are frozen.

As an example, let the frozen indices for N = 8 be {0, 1, 3, 4}. The full decoding tree of Figure 2.4 is then reduced to a tree with fewer nodes, as shown in Figure 2.7. In the original decoder tree the number of nodes was 15; in the pruned tree it is reduced to 7, which results in a significant reduction in the number of computations and in decoding latency.
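The special-node decoders of [12] and [13] admit very simple closed forms, sketched below under the usual conventions (illustrative code, not the thesis implementation): a REP node repeats a single information bit, so its decision is the sign of the LLR sum; an SPC node takes hard decisions and flips the least reliable bit if the overall parity check fails.

```cpp
#include <vector>
#include <cmath>

// REP node: all leaves carry the same bit, decided by the LLR sum.
std::vector<int> decodeRep(const std::vector<double>& llr) {
    double sum = 0;
    for (double v : llr) sum += v;
    int bit = (sum < 0) ? 1 : 0;              // threshold detection
    return std::vector<int>(llr.size(), bit); // repeat the decision
}

// SPC node: hard decisions, then enforce even parity by flipping
// the bit with the smallest LLR magnitude.
std::vector<int> decodeSpc(const std::vector<double>& llr) {
    std::vector<int> bits(llr.size());
    int parity = 0;
    std::size_t weakest = 0;
    for (std::size_t i = 0; i < llr.size(); ++i) {
        bits[i] = (llr[i] < 0) ? 1 : 0;       // hard decision
        parity ^= bits[i];
        if (std::fabs(llr[i]) < std::fabs(llr[weakest])) weakest = i;
    }
    if (parity != 0) bits[weakest] ^= 1;      // satisfy the SPC check
    return bits;
}
```

Both run in a single linear pass over the component LLRs, which is why pruning to such nodes removes most of the tree traversal cost.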
[Figure 2.7.: Pruned decoder tree with levels 0 to 2.]

2.2. Processor Architecture Background

2.2.1. Cache memory

When a memory location is accessed for the first time, its content is copied from the RAM to the cache; future accesses to the same location are served from the cache. This fast memory is placed between the RAM and the processor. In modern
processors, instead of a single cache, multi-level caches are present. The main idea behind multi-level caches is that if the data is not found in the first level, the second level is checked, then the third level, and so on until the last level; if the data is still not found, the RAM is accessed. This model significantly reduces the probability of accessing the RAM compared to a single-level cache. The complete memory hierarchy of modern processors is shown in Figure 2.8 [14].

The figure shows a processor architecture with three cache levels, namely L1, L2 and L3, in the order of increasing access latency, decreasing cost and increasing size: the L1 cache is the fastest, most expensive and smallest of all caches. Data is mapped either to registers or to memory; if the available registers do not suffice, the data is stored in memory. If the data is not found in any cache level, the result is a cache miss, which stalls instruction execution until the data is fetched from RAM. Whenever a memory location is accessed for the first time, it always results in a cache miss. Modern processors provide special instructions to avoid these compulsory cache misses, called cache prefetch instructions, which allow a programmer to fetch data into the cache before it is accessed, hence hiding the memory access latency. Other software techniques to reduce cache misses are reusing allocated memory as much as possible and bit packing/unpacking to reduce the required memory. In this work, all of the above-mentioned techniques, namely the prefetch instructions provided by the AMD EPYC processor, reusing allocated memory and bit packing/unpacking, are used to reduce the memory access latency. The AMD EPYC processor used in this work has 3 MB of L1 cache, 16 MB of L2 cache and 64 MB of L3 cache.
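A typical use of software prefetching looks as follows (a minimal sketch using the GCC/Clang builtin; the prefetch distance of 16 elements is an illustrative tuning parameter, not a value taken from this work):

```cpp
#include <cstddef>

// Issue a prefetch hint a fixed distance ahead of the current access
// so the data is already in cache when the loop reaches it.
long long sumWithPrefetch(const int* data, std::size_t n) {
    long long acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/1);
        acc += data[i];
    }
    return acc;
}
```

The hint is advisory: the code is correct with or without it, and the benefit depends on the access pattern and the hardware prefetchers already in place.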
2.2.2. Instruction pipelining and branch predictors

Traditionally, processors were designed to follow the steps fetch, decode, execute, memory access and finally write-back, and only then fetch the next instruction. Although these steps are sufficient to solve any problem at hand, this is very inefficient in terms of hardware utilization: during the instruction fetch phase, all modules except the fetch module are idle. Similarly, during the other phases, only the module processing the current phase of an instruction is active; the remaining modules are idle. To overcome this underutilization of hardware resources, modern processors implement instruction pipelining: while the current instruction is in the decoding phase, the next instruction is concurrently fetched by the fetch module. The pipelining mechanism increases the instruction throughput by significantly reducing the cycles per instruction (CPI). An example of sequential and pipelined execution is shown in Figure 2.9 [15].
The example shown in Figure 2.9 assumes only five phases of instruction execution. Modern processors divide instruction execution into nineteen or more phases, which allows running the processor at a much higher frequency due to the reduced critical path delay. The maximum advantage of pipelining can only be exploited when there is no pipeline stalling or flushing, which happens on data dependencies, cache misses or branch instructions. The major contributors to pipeline stalling are cache misses and branch instructions. As explained in the previous section, cache misses can be reduced by a combination of different optimization techniques. The next culprit is branch instructions: whether to branch or not is decided only at the execution stage. By the time the branch is decided, many future instructions have already been fetched; if the decision is to jump, all the prefetched instructions must be flushed, which introduces a stall in the pipeline. To overcome this issue, branch predictors are designed to proactively fetch instructions from the correct address, hence avoiding flushing of the pipeline. Branch predictors function by storing the previous branching decisions, whether taken or not, and hence require the correct previous state to proactively fetch future instructions. This method reduces the pipeline flushes caused by loop-type code. For scenarios where
there are no looping instructions, just if or if-else constructs, branch predictors fail to correctly fetch the future instructions. These scenarios can be mitigated by avoiding branch instructions wherever possible and by providing hints to the compiler via built-in macros, which reduce the cost of branching through better placement of assembly instructions (a kind of instruction scheduling). One such hint is GCC's __builtin_expect, commonly wrapped in likely/unlikely macros, which tells the compiler to generate code in such a way that the more frequently executed path directly follows the branch instruction, minimizing pipeline flushing. Code snippet 2.1 shows a typical usage of such a hint.

Listing 2.1.: Branching hints to the compiler

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

uint32_t bitMask = 0;
/* ... */
if (unlikely(bitMask == 0)) {
    /* rarely taken path */
}
Both code variants achieve the same result; however, in Listing 2.2 a pipeline flush can happen due to a branch misprediction, whereas for Listing 2.3 the compiler recognizes the conditional move construct and generates a CMOV instruction, potentially avoiding the pipeline flush. Optimizations such as minimizing branches, using built-in macros and using constructs which help the compiler to identify such patterns are utilized in this work.
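The conditional move pattern mentioned above can be sketched as follows (an illustrative example; whether the compiler actually emits CMOV depends on the target and optimization level):

```cpp
#include <cstdint>

// A ternary assignment with no side effects is a pattern compilers
// can lower to a CMOV instruction instead of a conditional jump,
// so a mispredicted branch cannot flush the pipeline here.
uint32_t selectMax(uint32_t a, uint32_t b) {
    uint32_t m = a;
    m = (b > a) ? b : m; // branch-free conditional copy
    return m;
}
```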
2.2.3. Vector Processing Units

[Figure 2.10.: Vector processing units (SIMD): a single instruction applied by the vector unit to multiple elements of a data pool [1].]
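As an illustration of this data parallelism, a single SSE2 instruction can XOR sixteen bytes at once, which is the core operation of the encoder's butterfly stages (a minimal x86-specific sketch with illustrative names):

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// XOR sixteen bytes of src into dst with one vector instruction,
// using unaligned loads/stores so no alignment guarantee is needed.
void xorBlocks16(uint8_t* dst, const uint8_t* src) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_xor_si128(a, b));
}
```

With AVX2 the same idea extends to 32 bytes per instruction, which is how the wider instruction sets mentioned in the abstract are exploited.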
2.2.4. Recursive function calling mechanism

Most encoder and decoder implementations are realized through recursion. A recursive function is a function which calls itself. It is a powerful tool in computer science which can be used to solve many interesting problems [17]. When it comes to performance, however, recursive solutions fall behind algorithms which use loops to solve the same problem. Every time a recursive function is called, a new stack frame is allocated, local data is pushed onto the call stack and execution must branch to the beginning of the function. Branching and pushing data onto a call stack are expensive operations. In the
case of polar codes, both encoding and decoding are implemented as recursive functions. To reduce the latency of the FEC chain, the encoding and decoding implementations are unrolled to avoid recursion. For the encoder, unrolling is carried out by manually implementing multiple inline functions; in the decoder implementation, unrolling is carried out using the template mechanism of C++. Although unrolling increases the code size, for the application at hand latency is of far greater importance than code size. Unrolling the encoder and decoder implementations significantly improved the latency of the FEC chain. Listing 2.6 shows a recursive implementation of the factorial calculation; Listing 2.7 illustrates the same operation implemented with templates.
Listing 2.6.: Recursive implementation

int fact(int n) {
    if (n > 1)
        return n * fact(n - 1);
    else
        return 1;
}

Listing 2.7.: Unrolled implementation using C++ templates

template <int N>
inline int fact() {
    return N * fact<N - 1>();
}

template <>
inline int fact<0>() {
    return 1;
}
3. Polar Codes in 5G
Polar codes are a class of capacity-achieving codes introduced by Arıkan in 2009 [8]. The 5th generation wireless systems (5G) standard has adopted polar codes as the channel coding scheme for uplink and downlink control information in the enhanced mobile broadband (eMBB) framework [4]. This chapter explains the different types of 5G physical channels which use polar codes for channel coding, the purpose of these physical channels and the modulation formats used. The chapter also presents the generic encoding FEC chain of the polar codes and explains the FEC chain parameters, such as the adopted polar code block lengths and the Cyclic Redundancy Check (CRC) type and size.
The generic polar FEC chain is shown in Figure 3.1. This FEC chain is configured with the parameters of a particular physical channel as specified in [4] and [18].

[Figure 3.1.: Generic polar FEC chain: A information bits → CRC generation (L bits, K = A + L) → input bit interleaver (IIL) → PC bit insertion (K + nPC) → polar code construction and polar encoding (N) → subblock interleaver and rate matching (E) → channel interleaver (IBIL).]
The polar FEC chain receives A information bits from the upper layer. These bits need to be transmitted through a code of length E bits. The size of A depends on the physical channel and on the size of the uplink or downlink control information. The value of E depends on the scheduling and resource allocation parameters and is configured by higher layers. After receiving the A information bits, an L-bit CRC is attached, resulting in K = A + L bits. The value of L is determined by the physical channel and the sizes of A and E. After attaching the CRC bits, the K bits are interleaved by the input bit interleaver. The input bit interleaver is configured with the parameter IIL, which can take the value 0 or 1: the module is enabled when IIL = 1 and disabled otherwise. The next component in the FEC chain is the parity check (PC) bit calculation. The number of PC bits is configured with the parameter nPC, whose value is selected based on the physical channel and the sizes of A and E. Another parameter of the PC component is nPC^wm, which indicates how many parity bits are to be placed in the rows of minimum Hamming weight of the polar code generator matrix [4]. After the parity bit calculation, the polar code construction component identifies K + nPC reliable indices: K positions for the A + L bits and nPC positions for the parity bits. The parameter N = 2^n is the block length of the polar code. The value of n is calculated from the values of E, K and nmax, where nmax is configured based on the physical channel. The 5G polar FEC chain supports the block lengths N = 32, 64, 128, 256, 512 and 1024. The encoded bits go through subblock interleaving and rate matching to obtain E bits from the N-bit codeword. The next configurable parameter is IBIL, which configures the channel interleaver and can take the value 0 or 1: IBIL = 1 enables the channel interleaver, otherwise it is disabled. Details of the different physical channels and their FEC chain parameters are presented in the following sections.
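The derivation of n from E, K and nmax can be sketched as follows; this reflects the author's reading of the rate-matching rule in TS 38.212, Section 5.3.1 [4], and the exact thresholds should be treated as an assumption to be checked against the specification:

```cpp
#include <algorithm>
#include <cmath>

// Sketch of mother-code length selection (cf. TS 38.212, 5.3.1):
// n1 limits N to roughly the rate-matched length E (with a shortcut
// when only slightly more than half of N would be transmitted and the
// rate is low), n2 caps the code rate at R_min = 1/8, and the result
// is clamped to [5, nMax].
int motherCodeLength(int E, int K, int nMax) {
    int ceilLog2E = static_cast<int>(std::ceil(std::log2(static_cast<double>(E))));
    int n1 = ceilLog2E;
    if (E <= (9 * (1 << (ceilLog2E - 1))) / 8 && 16 * K < 9 * E)
        n1 = ceilLog2E - 1;                      // thresholds 9/8 and 9/16
    int n2 = static_cast<int>(std::ceil(std::log2(8.0 * K))); // R_min = 1/8
    int n = std::min(std::min(n1, n2), nMax);
    n = std::max(n, 5);
    return 1 << n;                               // N = 2^n
}
```

For the fixed PBCH configuration described below (K = 56, nmax = 9), this rule yields N = 512.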
3.1. 5G Physical Channels

This section presents the 5G physical channels which use polar codes. The 5G standard adopted polar codes for the uplink and downlink control channels. Uplink control channels carry information such as channel quality indicators and acknowledgments. Downlink control channels carry resource allocation information, uplink power control instructions and the information required for the user equipment (UE) to access the network. The following sections explain each of these uplink and downlink control channels and their polar FEC chain parameters.
3.1.1. Physical Broadcast Channel (PBCH)

In the downlink, polar coding is applied to the PBCH, which carries essential information required for the UE to access the network, such as the system bandwidth and the current system frame number. The polar FEC chain parameters of PBCH are fixed: the payload size A of PBCH is always 32 bits. The other fixed parameters of PBCH are E = 864, L = 24, nmax = 9, IIL = 1, IBIL = 0, and nPC = nPC^wm = 0. The modulation format used for PBCH is always QPSK. PBCH is explained in more detail in [4].
3.1.2. Physical Downlink Control Channel (PDCCH)

The PDCCH is another downlink control channel which uses polar codes. Resources requested by the UE are assigned by the base station, and this resource allocation information is transmitted via the PDCCH. The PDCCH also carries information related to uplink power control, downlink resource grants and system paging [18]. The PDCCH contains a message called Downlink Control Information (DCI), which carries all the control information for the UE. The payload size of the PDCCH is not fixed; it varies with the DCI format, and as a consequence the values of A, N and E vary. The DCI type is configured by the higher layer. Except for A, N and E, the parameters of the PDCCH polar FEC chain are the same as for PBCH. Complete details about the different DCI formats are presented in Section 7.3 of [4].
3.1.3. Physical Uplink Control Channel (PUCCH)

In the uplink, the PUCCH carries Uplink Control Information (UCI), similar to the DCI in the downlink. UCI carries channel state information, acknowledgments and scheduling requests. The payload size of the PUCCH varies with the PUCCH format [18]. PUCCH uses different channel coding techniques depending on the payload size; polar codes are used when the payload size is A ≥ 12. The PUCCH polar FEC chain parameters also vary depending on the values of A and E. There are three different cases, based on the values of A and E:

• Case 1: A ≥ 20, with L = 11, nmax = 10, IIL = 0, IBIL = 1, and nPC = nPC^wm = 0.
The parameters of the polar FEC chain for different physical channels of 5G are summa-
rized in Table 3.1.
4. Encoding FEC Chain
In the 5G standard, polar codes are used in the downlink to encode Downlink Control Information (DCI) on the Physical Downlink Control Channel (PDCCH) and the Master Information Block (MIB) on the Physical Broadcast Channel (PBCH), and in the uplink to encode Uplink Control Information (UCI) on the Physical Uplink Control Channel (PUCCH) and the Physical Uplink Shared Channel (PUSCH). In this work, the notation introduced in the 3GPP technical specification [4] is used.
This chapter presents the details of the polar encoding FEC chain in 5G with a block
diagram. The following sections explain the functionality and potential latency contribution
of the individual components in the FEC chain. Each of these components
is extensively profiled to identify expensive operations and its latency contribution. After
identifying the bottlenecks, both algorithmic and software optimization techniques are
employed. Algorithmic optimizations include reformulating the problem to avoid expensive
operations, pruning the encoder tree using lookup tables, etc. Large latency reductions are
also achieved through software optimizations. The major software optimization
methods are unrolling the encoder function, exploiting data parallelism with SIMD, avoiding
exponentially complex operations and, finally, reformulating the polar code construction
to avoid expensive remove, erase and copy operations.
Figure 4.1 represents the complete polar FEC chain for PBCH and PDCCH. In general,
A bits have to be transmitted with a code of length E bits. L CRC bits are added to
the information bits, resulting in K = A + L bits. The resulting K bits are then
passed through an input bit interleaver. The interleaved bits are concatenated with parity
bits. In the next step, the information bit indices are identified and the information bits are
inserted at those positions to obtain a vector u of length N, where N = 2^n. Encoding is
performed with a mother code with parameters (N, K) through
d = uG_N, where the generator matrix G_N = G^⊗n is obtained as the n-th Kronecker power
of the Arıkan kernel matrix. The codeword d is passed through a subblock interleaver, which divides
the codeword into 32 blocks and interleaves them. The interleaving pattern is given
in Figure 4.2. The next step is rate matching, which maps the mother code block-length
N to the rate matching size of E bits. Rate matching can be repetition, puncturing or
shortening; the decision is taken based on the values of E, N and K. When E > N,
repetition is applied: parts of the codeword are repeated to create
E bits from N bits. For the case E < N, either shortening or puncturing is applied; in
this mode, bits are discarded to create E bits from N. Finally, channel interleaving is
performed to improve the error correction performance for higher-order modulations. This
chapter dives into the implementation details of each block at an algorithmic level, with
small code snippets where necessary. It also analyzes the bottlenecks and presents
solutions through different optimization techniques. All latency measurements in this
section are performed on an AMD EPYC processor running at 1.6 GHz with Turbo disabled.
Turbo mode allows the processor to dynamically increase its frequency when the load is
high; to get accurate measurements, it is disabled.
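The repetition/puncturing/shortening decision described above can be sketched as follows. This is an illustrative helper, not the thesis implementation; the 7/16 rate threshold for choosing between puncturing and shortening follows the 5G rule discussed later for the decoding chain.

```cpp
#include <cstdint>

enum class RateMatchMode { Repetition, Puncturing, Shortening };

// Hypothetical helper choosing the rate matching mode from E, N and K.
inline RateMatchMode selectRateMatchMode(uint32_t E, uint32_t N, uint32_t K) {
    if (E >= N)
        return RateMatchMode::Repetition;   // repeat codeword bits: N -> E
    // E < N: bits must be discarded; puncture at low rates (K/E <= 7/16),
    // shorten at high rates.
    if (16ULL * K <= 7ULL * E)
        return RateMatchMode::Puncturing;
    return RateMatchMode::Shortening;
}
```

For instance, the PDCCH scenario discussed later (E = 846, N = 1024, K = 130) falls into the puncturing case.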
[Figure 4.1: complete polar FEC chain — polar code construction, CRC attachment and input bit interleaving (K = A + L bits), polar encoding (K bits to N bits), subblock interleaving (N bits) and rate matching/channel interleaving (N bits to E bits).]
[Figure 4.2: subblock interleaver pattern mapping the 32 input block indices (I/P 0–31) to their interleaved output positions (O/P).]
4.1. Data Packing and Unpacking Operations
• Increased memory footprint: for 1024 bits, 64 · 1024 bits of memory need to be allocated,
which is equivalent to 8 kilobytes. Allocating and initializing this memory can introduce
significant latency.
• More cache misses: if more memory is allocated, more data needs to be accessed
from RAM, which can result in more cache misses.
To avoid the above disadvantages and to enable data parallelism, this work packs
multiple bits into a single integer. Although packing multiple bits into a single integer has
advantages, some operations, such as bitwise interleaving, require efficient access to each
individual bit. To exploit the advantages of bit packing as well as of accessing each bit
separately, it is necessary to convert between the two representations. This is where
the power of SIMD instructions in modern processors comes into play: they provide
special hardware instructions that pack and unpack data efficiently.
Data bits are used in packed format when data parallelism needs to be exploited and in
unpacked format when certain operations require bits to be accessed individually. These
pack/unpack instructions are very efficient and have low latency. Details of the AMD
EPYC processor's instructions with their corresponding latencies are provided in [19].
Some instructions used for fast packing and unpacking are:
Listing 4.1.: Sample packing/unpacking instructions
Listing 4.2.: Bit packing example
template<>
inline int8_t packBits<8>(int8_t s[]) {
    // Gather the eight bit values (one per byte) into an MMX register.
    __m64 v8 = _mm_set_pi8(s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7]);
    // Shift each bit into the MSB of its byte and collect the MSBs.
    int8_t idx = (int8_t)_mm_movemask_pi8(_mm_slli_si64(v8, 7));
    return idx;
}
Listing 4.2 shows an example of bit packing. The function receives a vector of size 8
containing 8-bit integers, each representing one bit, and returns a single 8-bit integer
with all eight bits packed.
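Where the MMX intrinsic is unavailable, the same packing (and the corresponding unpacking) can be expressed portably. These helper names are illustrative, not part of the thesis implementation:

```cpp
#include <cstdint>
#include <cstddef>

// Portable bit packing: s[0] ends up in the most significant bit, matching
// the argument order of the _mm_set_pi8/_mm_movemask_pi8 version above.
inline uint8_t packBits8(const uint8_t s[8]) {
    uint8_t packed = 0;
    for (size_t i = 0; i < 8; ++i)
        packed = (uint8_t)((packed << 1) | (s[i] & 1u));
    return packed;
}

// Portable unpacking: the inverse of packBits8.
inline void unpackBits8(uint8_t packed, uint8_t s[8]) {
    for (size_t i = 0; i < 8; ++i)
        s[i] = (packed >> (7 - i)) & 1u;
}
```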
g6(x) = x^6 + x^5 + 1 (4.1)
g24(x) = x^24 + x^23 + x^21 + x^20 + x^17 + x^15 + x^13 + x^12 + x^8 + x^4 + x^2 + x + 1 (4.3)
Concatenating the information bits with a CRC increases the error correction performance of
polar codes significantly: the CRC is used for selecting the correct codeword out of the
candidates in the list. With CRC-aided decoding, polar codes perform better than
LDPC and turbo codes at short block-lengths. To reduce the latency of the encoding FEC
chain, the CRC needs to be calculated very efficiently. A naive implementation of the CRC
calculation uses a shift register; this method calculates the CRC sequentially, one bit
at a time, as given in [20]. As explained in Section 4.1, it is very inefficient to process bits
sequentially. The algorithm in [21] is adapted to calculate the CRC for the polynomial in (4.3)
using a lookup-table-based approach. This algorithm calculates the CRC blockwise with
the help of a lookup table: the data is divided into blocks of B bits, the CRC value
corresponding to each block is read from the lookup table, and the individual CRCs of
the blocks are combined in a predefined way to form the CRC of the complete data.
Here, the data bits are divided into blocks of 8 bits and packed into 8-bit integers. The
CRC value corresponding to each 8-bit integer is read from the lookup table and combined
with the CRC of the previous 8-bit integer; this process continues until the CRC is
calculated over all data bits. If the number of data bits is not a multiple of 8, zeros are
appended at the MSB position. Table 4.1 presents the latency values of the naive and
optimized CRC calculation methods for a payload of 41 bits; the optimized method is
significantly faster than the naive implementation.
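A minimal sketch of the table-driven approach for the 24-bit polynomial g24 (whose 24-bit representation is 0xB2B117). The table construction and the byte-wise update loop are the standard textbook form, not the thesis code; a bitwise reference implementation is included for comparison, and all names are illustrative.

```cpp
#include <cstdint>
#include <cstddef>

static const uint32_t kCrc24Poly = 0xB2B117;  // g24 without the x^24 term

struct Crc24Table {
    uint32_t entry[256];
    Crc24Table() {
        for (uint32_t b = 0; b < 256; ++b) {
            uint32_t crc = b << 16;            // align the byte with the CRC MSB
            for (int i = 0; i < 8; ++i)
                crc = (crc & 0x800000) ? (crc << 1) ^ kCrc24Poly : (crc << 1);
            entry[b] = crc & 0xFFFFFF;
        }
    }
};

// Table-driven CRC: one table lookup per input byte.
inline uint32_t crc24(const uint8_t* data, size_t len) {
    static const Crc24Table table;             // built once, reused
    uint32_t crc = 0;
    for (size_t i = 0; i < len; ++i)
        crc = ((crc << 8) ^ table.entry[((crc >> 16) ^ data[i]) & 0xFF]) & 0xFFFFFF;
    return crc;
}

// Bitwise (shift register) reference used to sanity-check the table version.
inline uint32_t crc24_bitwise(const uint8_t* data, size_t len) {
    uint32_t crc = 0;
    for (size_t i = 0; i < len; ++i) {
        crc ^= (uint32_t)data[i] << 16;
        for (int b = 0; b < 8; ++b)
            crc = (crc & 0x800000) ? ((crc << 1) ^ kCrc24Poly) & 0xFFFFFF
                                   : (crc << 1) & 0xFFFFFF;
    }
    return crc;
}
```

Both functions compute the same remainder; the table version simply processes eight bits per step instead of one.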
4.3. Input Bit Interleaver
The information bit stream of length A is concatenated with the CRC of length L and
interleaved by the input bit interleaver to create a distributed CRC. This step enables
early termination of decoding at the receiver if CA-SCL (CRC-aided SCL) decoding is used.
The interleaving pattern is designed such that every CRC remainder bit is placed
after the information bits it depends on. This allows paths to be discarded during list
decoding if the CRC computed from the previously decoded information bits does not match
the received CRC bit. In other words, the input bit interleaver distributes the CRC to enable
early termination when a list decoder is used. The interleaving pattern is calculated
at runtime since it depends on the number of information bits (K). Not much optimization
is performed for this part of the implementation, since the interleaving needs to be carried
out sequentially and K is not very large.
Complete details of the input bit interleaver and the calculation of the interleaving
pattern can be found in [4].
4.4. Polar Code Construction
Polar code construction is the process of identifying the information and frozen bit positions,
i.e., K out of N positions. This step determines the error correction performance of the polar
code. There are many methods in the literature to construct polar codes. Arıkan [8]
proposed using the Bhattacharyya parameter as the reliability metric for Binary Erasure
Channels (BEC) and deriving reliability values using Monte Carlo simulation. For other
channels, Mori and Tanaka [22] use more accurate density evolution (DE) methods, which
however suffer from huge complexity. Tal and Vardy proposed the Gaussian Approximation (GA)
to reduce the complexity of DE with approximations [23]. Still, the GA method has a high
computational complexity, which scales linearly with the code block-length; it is therefore
unsuitable when SNR, block-length and code rate vary. In use cases such as 5G,
where the channel is continuously varying, it is not feasible to construct polar codes on
the fly due to the stringent latency requirements of both encoder and decoder. The polar
code construction in 5G therefore takes a suboptimal approach: instead of constructing polar codes
for every different SNR, block-length and code rate, the construction is carried out in such
a way that the constructed code performs sufficiently well over a large range of SNRs,
block-lengths and code rates. The 5G polar code construction method is based on a
contribution from Huawei, which uses a β-expansion method with the universal partial order
(UPO) property of channel reliability, as presented in [24].
The 5G standard has adopted six different polar code block-lengths, given by
N ∈ {32, 64, 128, 256, 512, 1024}. For each of these block-lengths, the reliability
index values are specified in [4]. The polar code construction also depends on the rate
matching mode, since rate matching affects the reliability of the bit indices. The construction
is straightforward when the rate matching output size E is greater than or equal to the
block-length N: in this case, code construction simply selects the K most reliable indices for
the information bits and freezes the remaining positions, since the bit reliabilities are not
affected by rate matching. Example 4.4 shows such a case. However, when the rate matching
output size E is smaller than the block-length N, the selection of reliability indices becomes
more involved, as described in the following paragraphs.
As an example, for N = 32 the channel reliability values extracted from the
reliability table provided in [4] are given by
Q_0^31 = {0, 1, 2, 6, 3, 7, 9, 16, 4, 8, 11, 17, 13, 19, 20, 26, 5, 10, 12, 18, 14, 21, 24, 27, 15, 23, 22, 28, 25, 29, 30, 31}
Q_I^K = {5, 10, 12, 18, 14, 21, 24, 27, 15, 23, 22, 28, 25, 29, 30, 31}
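Since the reliability sequence is sorted in ascending order of reliability, the simple case (E ≥ N) reduces to taking the K entries from the tail of the sequence. A minimal sketch with illustrative naming:

```cpp
#include <vector>
#include <cstdint>

// E >= N case: the K most reliable positions become information bits, the
// rest are frozen. `reliability` lists indices in ascending order of
// reliability, as in the 5G reliability tables.
std::vector<uint32_t> selectInfoPositions(const std::vector<uint32_t>& reliability,
                                          uint32_t K) {
    // The K most reliable indices sit at the end of the sequence.
    return std::vector<uint32_t>(reliability.end() - K, reliability.end());
}
```

Applied to the N = 32 sequence above with K = 16, this returns exactly the set Q_I^K.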
For any other case, when E < N, either puncturing or shortening is performed during rate
matching. Empirically, it has been observed for polar codes that puncturing works better
at low rates and shortening at high rates [25]. In 5G, scenarios where the rate matching
output E is less than the block-length N are not uncommon. In such scenarios, some
bits need to be discarded in the rate matching stage through puncturing or shortening.
When encoded bits are discarded in the rate matching stage, the reliability of the bit channels
is affected; identifying the reliable bits while taking the rate matching procedure into account
makes polar code construction complex in terms of time. The naive implementation of the
reliability index selection algorithm provided in [4] was carried out in C++ as shown in Algorithm 1.
Profiling the encoder FEC chain implementation showed that this selection algorithm is the
most time-consuming part among all stages of the encoding FEC chain.
The following algorithm gives a simplified picture of the functional implementation to select
the information bit indices while taking the effect of rate matching into account. The notation
used in the algorithm is the same as specified in the 5G standard [4].
J(n): Subblock interleaver pattern for a particular block-length N.
E: Rate matcher output size.
N: Mother code block-length.
K: Number of information bits.
Q_0^(N−1): Reliability index array for block-length N in ascending order of reliability.
Q_I^N: Information bit positions.
Algorithm 1 shows how the information bit indices are selected by taking rate matching
into account. Finding and removing the incapable bits due to rate matching is an expensive
operation. As can be seen in lines 4, 7, 11 and 15 of the algorithm, the subblock interleaving
pattern is also taken into account when identifying the incapable bits, due to the presence
of the subblock interleaver between rate matching and the encoder.
Algorithm 1 (excerpt):
21 for i = 0 to frozenSize do
22     iterator = remove(Q_I^N, Q_F,i) ;
23     erase(Q_I^N, iterator) ;
24 end
25 startIdxInfo = N − K − nPC ;
26 Q_I^K = {Q_I,startIdxInfo, Q_I,startIdxInfo+1, ..., Q_I,END}
Due to the presence of time-consuming operations such as sorting, set union, search,
remove and erase, the contribution of this function is the highest among all components
of the FEC chain. In terms of latency, a scenario with E = 846, N = 1024, K = 130 requires
puncturing. For these parameters, polar code construction, encoding, subblock interleaving,
rate matching and channel interleaving take 411 µs in total, of which polar code construction
alone contributes 377 µs.
Algorithm 2 (excerpt):
21 for i = 0 to frozenSize do
22     if mode == puncturing then
23         index = lookUpTable[Q_F,i] ;
24     else
25         index = lookUpTable[J[i + E]] ;
26     end
27     Q_I^N[index] = INVALID ;
28 end
29 idx = K ;
30 for i = size(Q_I^N) to 0 do
31     i = i − 1 ;
32     if Q_I^N[i] ≠ INVALID then
33         Q_I^K[idx] = Q_I^N[i] ;
34         idx = idx − 1 ;
35     end
36 end
Let us analyze the complexity of each of these operations. Sorting is O((N − E) log(N − E))
and the set union is likewise O((N − E) log(N − E)) [26]. The block-length
N is derived from E and K, so (N − E) is small compared to N. The next operations in
the algorithm are remove and erase, used directly from the standard
C++ library. After deciding the rate matching type (shortening/puncturing) and identifying
the incapable bit indices, these locations must be frozen; this requires traversing the
reliability index array and removing the incapable bit locations, which is done with the
remove and erase functions. The remove function searches the reliability array
for an incapable bit index and removes the element; the erase operation deallocates the
memory of the removed element and resizes the array. The complexity of a single remove
is O(N): it has to search through all elements of the array for every frozen value and move
elements to overwrite the removed position, and the array size N can be as large as 1024.
The erase operation has to deallocate memory and resize the container. Together, remove
and erase are O(N^2).
In this work, the algorithm is reformulated to avoid searching, copying and memory
deallocation while removing incapable bit indices. To avoid search operations, a lookup
table is built whose values indicate the position of a particular reliability value. After
identifying the position, the element is marked as removed instead of actually being removed.
Marking has two advantages: first, it avoids memory deallocation and copying; second, it
keeps the order of the elements unchanged, which makes it possible to use the same lookup
table for finding the next incapable bit index. After all incapable bit indices are
marked as removed, only the unmarked elements are considered as reliable bit positions
for placing information bits.
The next optimization avoids copying the subblock interleaving pattern into the frozen index
array in the case of shortening; instead, the subblock interleaving pattern is used directly
from the lookup table to mark the reliability indices as removed. In addition to the
above-mentioned optimizations, minor ones are applied, such as reserving the required memory
in advance instead of allocating it dynamically and employing pointer operations to avoid
copying. Finally, the information bit positions are obtained by iterating the
reliability table from the end (since the indices are sorted in ascending order of reliability)
and extracting the K unmarked positions. These optimizations reduced the latency of polar
code construction from 377 µs to 15 µs.
Algorithm 2 presents the optimized reformulation of Algorithm 1 without the erase, remove and copy operations.
4.5. Polar Encoding
Polar encoding is the multiplication of the input vector u with the generator matrix G_N
(d = uG_N). The state-of-the-art algorithm for direct multiplication of a row vector with a
matrix is O(N^2.38) [27]. However, the generator matrix of polar codes follows a regular
structure, and it has been shown that encoding can be reduced to a recursive structure with
a complexity of O(N log N) [8]. There are different ways to visualize the encoder: one is
the encoding circuit and another is the tree structure. The former is suitable when encoding
is performed in hardware, where groups of bits are processed in parallel. Since this work
focuses on implementation and optimization in software, the tree structure is considered.
An example of encoding visualized as a binary tree for N = 8 is illustrated in Figure 4.3.
Every node in the tree splits its i-bit vector into two i/2-bit vectors and XORs bits 0 to
i/2 − 1 with bits i/2 to i − 1. This process continues until the bit vector length becomes one.
As can be observed from the tree, the same operation is performed at every node,
only the size of the vector differs. Due to this regular structure, the encoding problem fits
a recursive algorithmic form. Algorithm 3 shows a naive implementation of the recursive
encoder. In this algorithm, each bit is represented as one integer and processed
serially, hence parallelism is not exploited. Profiling the implementation also
identified lines 8 and 9 as bottlenecks, since copying is involved. One more issue
with the algorithm is the recursive implementation itself: although encoding can easily be
implemented this way, each recursive function call in software is expensive, since it
requires a new stack frame to be allocated. Background on the overhead of recursive
function calls is presented in the previous chapter.
To avoid the disadvantages mentioned above, the following optimization techniques are
considered.
• Data parallelism: To avoid serial processing of bits and thus improve the parallelism
factor, the method described in Section 4.1 is used, i.e., multiple bits are packed
into a single integer. In this particular instance, every 64 bits are packed into a 64-bit
integer so that 64 bits can be processed in parallel on a 64-bit processor. This results in a
parallelism factor (P) of 64. Packing also helps to further increase the parallelism factor
with the SIMD processing units of modern processors: SIMD instructions can process
128 or 256 bits in a single instruction, which results in a parallelism factor of 256 with
AVX and 128 with SSE instructions.
• Avoiding copy operations: The encoding in Algorithm 3 splits N bits into two
N/2-bit vectors and copies them into temporarily allocated variables. Code profiling
pointed out that these copy operations are a bottleneck. In the optimized algorithm,
instead of copying, C++ pointers are used: the index where the next block of the vector
starts is calculated and passed to the next node for further processing.
• Pruning the encoder tree: As shown in Figure 4.3, the encoding process can
be represented as the traversal of a binary tree. When traversing the tree towards the
leaf nodes, the bit vector size becomes smaller than 8 bits, so 4, 2 or 1 bits of an integer
need to be accessed. On standard processors, the smallest unit of data accessible in
software is an 8-bit integer; to access 4/2/1 bits, masking operations are needed. The
number of nodes in the binary tree that access 4/2/1 bits is large, therefore a significant
number of masking operations are needed, which introduces considerable overhead. Pruning
the tree at the level where the bit vector size is 8 avoids this overhead, in addition to
reducing the number of nodes to be traversed. Pruning is done by building a lookup table
containing the encoded value for every combination of an 8-bit vector; when the bit vector
size is 8, the encoded value is simply read from this table. The lookup table has 256
entries, one encoded value for each combination of the 8-bit vector.
Pruning the tree yields a significant latency improvement, which can be illustrated with an
example. For N = 1024, the unpruned tree requires traversing 2047 nodes, of which 1024
are 1-bit, 512 are 2-bit and 256 are 4-bit nodes. With the pruned tree, the 4/2/1-bit nodes
are not present, and the number of nodes to be traversed reduces from 2047 to 255 (an 87%
reduction). Pruning thus avoids the masking operations in addition to reducing the number
of nodes in the tree.
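The three optimizations can be illustrated together in a compact sketch. This is an illustrative reconstruction under stated assumptions (N a multiple of 64, LSB-first bit packing), not the thesis implementation: bits are packed into 64-bit words, nodes operate in place via pointer offsets instead of copying half-vectors, and the lowest tree levels collapse into shift-and-mask stages inside a word, playing the same role as the 256-entry leaf lookup table.

```cpp
#include <cstdint>
#include <cstddef>

// In-place recursive polar encoder on packed words (bit i of a word = u_i,
// LSB first). `words` = N/64. Illustrative name, not the thesis code.
void polarEncodePacked(uint64_t* u, size_t words) {
    if (words == 1) {
        // Remaining butterfly stages inside one 64-bit word; for 8-bit
        // leaves the last three stages could instead be a single
        // 256-entry table lookup, as described above.
        uint64_t x = u[0];
        x ^= x >> 32;                              // halves of 64
        x ^= (x >> 16) & 0x0000FFFF0000FFFFULL;    // halves of 32
        x ^= (x >> 8)  & 0x00FF00FF00FF00FFULL;    // halves of 16
        x ^= (x >> 4)  & 0x0F0F0F0F0F0F0F0FULL;    // halves of 8
        x ^= (x >> 2)  & 0x3333333333333333ULL;    // halves of 4
        x ^= (x >> 1)  & 0x5555555555555555ULL;    // halves of 2
        u[0] = x;
        return;
    }
    size_t half = words / 2;
    for (size_t i = 0; i < half; ++i)
        u[i] ^= u[i + half];            // first half ^= second half, in place
    polarEncodePacked(u, half);         // recurse via pointer offset: no copies
    polarEncodePacked(u + half, half);
}
```

Setting only the last input bit (u_{N−1} = 1) yields the all-ones codeword, since the last row of the generator matrix is all ones; this makes a convenient sanity check.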
An example of a pruned, unrolled encoder, including the tree traversal, is shown in
Figure 4.4. The inline function names for the different bit vector sizes are also shown in
the figure. One can see that tree traversal ends at the bitMult8 function due to pruning;
the traversal flow is represented by an orange line in the figure.
A sample code snippet of the node operation with SIMD instructions is shown in Listing 4.3.
Listing 4.3.: Node operation using SIMD instructions
4.6. Sub-block Interleaver
There are no optimizations performed for this block in the FEC chain.
To obtain an overall picture of the improvement from all the above optimizations, it is
worthwhile to look at the worst-case latency of the FEC chain, which occurs with the
maximum block size (N = 1024), with puncturing (which requires more time to identify the
reliability indices as well as for rate matching) and with both input bit-interleaving and
channel interleaving. The following table compares the naive and optimized versions.
Worst-case latency FEC chain parameters:
IIL = 1, nmax = 10, nPC = nPC^wm = 0, IBIL = 1, E = 846, K = 106.
4.11. Summary
In this chapter, we implemented the polar encoding FEC chain. The different components
of the FEC chain were analyzed to understand their complexity and latency contributions.
Through extensive analysis and code profiling, three major latency contributors were
identified: CRC calculation, polar code construction, and polar encoding. All of these
components were optimized using algorithmic and platform-specific optimizations. The
latency of the CRC calculation was reduced by using a lookup table and exploiting data
parallelism. The biggest latency contributor in the FEC chain is the polar code construction,
due to the presence of expensive search, remove and copy operations; the construction
algorithm was reformulated to avoid these operations. Polar encoding is another latency
contributor. The encoder was optimized using a number of techniques; significant ones
include pruning the encoder tree, implementing the encoding with SIMD instructions,
unrolling the recursive function and avoiding superfluous copy operations. All the
above-mentioned optimizations reduced the latency of the FEC chain by 10x compared to
the naive implementation.
5. Decoding FEC Chain
In this chapter, the implementation and optimization details of the 5G polar decoding FEC
chain are presented, including the challenges faced in achieving low-latency decoding. In the
decoding FEC chain, the decoder is the critical part due to the inherently sequential nature of
polar decoding: the i-th bit is decoded using all previously decoded bits, hence bit i
depends on bits 1 to i − 1. Due to this sequential decoding process, significant latency is
introduced by the decoder. This chapter presents the optimization techniques employed
to improve the latency of the decoding FEC chain, including both algorithmic and
platform-specific optimizations. Each technique is explained in a separate section; every
section covers one particular component of the FEC chain and presents its implementation
details and the optimizations employed. In this work, the FEC chain considered is
part of the base station, therefore uplink control information is decoded at the receiver:
PUCCH and PUSCH contain the polar-encoded information. The received signal after
demodulation is quantized to 16-bit LLR (log-likelihood ratio) values. Decoding is performed
with LLRs rather than probabilistic likelihoods due to their numerical stability and low
computational complexity. The receiver-side FEC chain reverses the operations performed
at the transmitter. Figure 5.1 shows the receiver-side polar decoding FEC chain.
[Figure 5.1: receiver-side polar decoding FEC chain — E LLRs pass through channel deinterleaving, inverse rate matching and subblock deinterleaving to give N LLRs; polar code construction and polar decoding follow; finally, parity bits are extracted and the input bit deinterleaver yields the (K = A + L) bits.]
Significant research effort has been devoted by both academia and industry to improving
the decoding latency of the SC algorithm, which can be viewed as the traversal of a binary
tree. A major improvement that reduced the decoding latency is the identification of special
kinds of nodes in the tree. These special nodes allow immediate decoding of multiple bits
without requiring the full tree traversal. The algorithms presented in [12] and [13] provide
such improvements by identifying special nodes, specifically Rate-0, Rate-1, RPC and SPC
nodes, where RPC and SPC stand for repetition and single parity check codes, respectively.
Identifying special nodes requires finding particular patterns in the frozen bit locations of
the constructed polar code. To gain the full advantage of the Fast-SSC (Fast Simplified
Successive Cancellation) algorithm, the special nodes must be identified efficiently. In this
work, the 5G RX FEC chain is implemented with the Fast-SSC algorithm, optimized in
software, and the feasibility of achieving the desired latency (< 50 µs) is analyzed.
This section briefly discusses the functionality of the different blocks of the
decoding FEC chain. Figure 5.1 shows the complete receiver-side FEC chain. It is almost
an inverse of the encoding FEC chain, except for a few differences related to PUCCH
and PUSCH, which contain parity check bits (nPC). The decoding FEC chain receives
the Uplink Control Information (UCI) in the form of E 16-bit quantized LLR values. Before
passing the LLR values to the decoder, the following operations are performed on them:
channel deinterleaving, inverse rate matching and subblock deinterleaving.
These steps are grouped by a pink rectangle in Figure 5.1. After these steps, the polar code
construction is performed using the same optimized method as presented in the previous
chapter. The polar code construction procedure outputs the information bit positions, from
which the frozen pattern is obtained. The next step in the FEC chain is polar decoding:
the N LLR values and the frozen pattern are passed to the polar decoder, which outputs the
decoded bits. The polar construction and decoding blocks are colored green in the FEC chain
figure. Using the information bit positions obtained in the polar construction procedure,
K + nPC + L bits are extracted from the N decoded bits. These K + nPC + L bits contain
nPC parity bits; extracting them requires identifying the row of minimum weight in the
generator matrix of the polar code. Finally, input bit deinterleaving is applied to the
remaining K + L bits to obtain the concatenated information and CRC bits. The blocks for
extracting the parity bits and the input bit deinterleaver are grouped by a blue rectangle.
[Figure 5.2: sub-block deinterleaver pattern mapping the 32 sub-block indices to their deinterleaved positions.]
The inverse rate matching step maps the E LLR values back to the mother code block size N.
Rate matching has three modes: puncturing, shortening, and repetition. The mode is
selected based on the rate matcher output size (E) and the mother code size (N). If E > N,
repetition is performed; otherwise either puncturing or shortening is done: if K/E > 7/16,
shortening is performed, else puncturing. The major optimizations in inverse rate matching
are utilizing the SIMD capability for soft combining when E > N and performing block-wise
copying.
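The soft combining step for repetition can be sketched as a saturating accumulation over the repeated LLR positions. It is shown here in scalar form for clarity; the inner operation maps one-to-one onto the saturating SIMD adds (e.g. `_mm_adds_epi16`) used for the vectorized version. Function and parameter names are illustrative.

```cpp
#include <cstdint>
#include <cstddef>
#include <algorithm>

// Accumulate repeated LLRs into acc with saturation, so combined values
// stay inside the 16-bit quantization range.
void softCombine(int16_t* acc, const int16_t* repeated, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        int32_t sum = (int32_t)acc[i] + repeated[i];
        acc[i] = (int16_t)std::min(32767, std::max(-32768, sum));
    }
}
```

Saturation matters here: a plain 16-bit add could wrap a strongly reliable LLR into the opposite sign, flipping a hard decision.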
After inverse rate matching, the E values are mapped to N LLRs, where N is always a power
of two. The subblock interleaver/deinterleaver divides the block of N LLRs into 32 subblocks,
each containing N/32 LLRs. Functionally, subblock deinterleaving can be implemented as the
inverse of the interleaving operation presented in [4]. Measuring the latency contribution
of the subblock deinterleaver showed that it takes 19 µs: the computation of the interleaving
indices is expensive due to the use of multiplication, division and modulus operations.
Looking at Figure 5.2, one can see that not all LLR positions are interleaved, only 18
positions out of 32. Calculating the interleaving positions at runtime is expensive; instead,
they can be pre-calculated and stored in a lookup table. For a mother code size of 1024,
interleaving with pre-calculated positions requires looping 576 times. Modern processors with
AVX and AVX2 extensions provide special swizzle instructions, which allow shuffling,
permuting and blending of vectors. These instructions process a vector of values and hence
allow data parallelism (multiple data elements processed in parallel). To make use of
swizzle instructions [30] for sub-block deinterleaving, the operation must be reformulated to
fit the functionality provided by the platform-specific SIMD instructions. It is divided into
three parts, each independent of the others. Parts one and three perform exactly the same
operation: each deals with 8 sub-blocks and performs the operation marked
green in Figure 5.2, which is mapped to permute SIMD instructions. Part two deals with
16 subblocks, marked blue in the figure; its operation is achieved with the blend and
permute SIMD instructions provided by the AVX2 vector extension.
The code snippet in Listing 5.1 shows a sample SIMD implementation of the sub-block
deinterleaving operation for a mother code size (N) of 64.
// Listing 5.1 begins mid-function in the original text; a hypothetical
// signature is added for completeness (y: input LLRs, d: output LLRs;
// with N = 64 16-bit LLRs, each 32-bit lane carries one 2-LLR sub-block).
// v256_perm0..v256_perm3 are precomputed permutation patterns whose
// definitions are not shown here.
void subblockDeinterleave64(const int16_t* y, int16_t* d) {
    __m256i v256_in, v256_out, v256_out2, v256_blended;
    // prepare part1
    v256_in = _mm256_loadu_si256((const __m256i*)y);
    v256_out = _mm256_permutevar8x32_epi32(v256_in, v256_perm0);
    _mm256_storeu_si256((__m256i*)d, v256_out);
    // prepare part2
    v256_in = _mm256_loadu_si256((const __m256i*)(y + 16));
    v256_out = _mm256_permutevar8x32_epi32(v256_in, v256_perm1);
    v256_in = _mm256_loadu_si256((const __m256i*)(y + 32));
    v256_out2 = _mm256_permutevar8x32_epi32(v256_in, v256_perm2);
    v256_blended = _mm256_blend_epi32(v256_out, v256_out2, 0b11110000);
    _mm256_storeu_si256((__m256i*)(d + 16), v256_blended);
    v256_out2 = _mm256_permutevar8x32_epi32(v256_out, v256_perm3);
    v256_out = _mm256_permutevar8x32_epi32(v256_in, v256_perm1);
    v256_blended = _mm256_blend_epi32(v256_out2, v256_out, 0b11110000);
    _mm256_storeu_si256((__m256i*)(d + 32), v256_blended);
    // prepare part3, same as part1
}
The results of the latency optimization of the sub-block deinterleaver for N = 1024 are given
in Table 5.1.
5.6. Decoder Optimization
If the CN operation needs to be performed on a received vector, the naive method would be
to access each value of the vector and perform the CN operation on it. Sequentially
performing CN and VN operations in this way is very inefficient. The size of the vector
received at each node is always a power of 2. Due to this fact, CN and VN operations
naturally fit the vector processing units provided in modern processors. In this work, both
CN and VN operations are efficiently implemented using the SSE and AVX instruction
extensions provided by the AMD EPYC platform. During CN and VN operations the
memory access pattern is regular; therefore, data required in the future can be fetched
from main memory to the cache to reduce cache misses. Listings 5.2 and 5.3 show the
efficient vectorized CN and VN operations. The bit combination operation is an XOR of
the decoded bits from the child nodes and is again implemented using SIMD vector XOR
operations.
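As a hedged illustration (Listings 5.2 and 5.3 themselves are not reproduced here, and the function names below are illustrative, not the thesis code), a minimal SSE2 sketch of the min-sum CN operation f(a, b) = sign(a) · sign(b) · min(|a|, |b|) and the VN operation g(a, b, u) = b + (1 − 2u) · a, each processing eight 16-bit LLRs per call, could look as follows:

```cpp
#include <immintrin.h>
#include <cstdint>

// Min-sum check-node (f) operation on eight 16-bit LLRs at once:
//   f(a, b) = sign(a) * sign(b) * min(|a|, |b|)
void checkNode8(const int16_t a[8], const int16_t b[8], int16_t out[8]) {
    __m128i va   = _mm_loadu_si128((const __m128i*)a);
    __m128i vb   = _mm_loadu_si128((const __m128i*)b);
    __m128i zero = _mm_setzero_si128();
    __m128i absA = _mm_max_epi16(va, _mm_sub_epi16(zero, va)); // |a|
    __m128i absB = _mm_max_epi16(vb, _mm_sub_epi16(zero, vb)); // |b|
    __m128i vmin = _mm_min_epi16(absA, absB);
    // 0xFFFF in lanes where the signs of a and b differ
    __m128i neg  = _mm_srai_epi16(_mm_xor_si128(va, vb), 15);
    // conditional two's-complement negation: (x ^ m) - m
    __m128i vout = _mm_sub_epi16(_mm_xor_si128(vmin, neg), neg);
    _mm_storeu_si128((__m128i*)out, vout);
}

// Variable-node (g) operation: g(a, b, u) = b + (1 - 2u) * a,
// i.e. b + a when the partial-sum bit u is 0 and b - a when u is 1.
void varNode8(const int16_t a[8], const int16_t b[8],
              const int16_t u[8], int16_t out[8]) {
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    __m128i vu = _mm_loadu_si128((const __m128i*)u);
    // 0xFFFF where the partial-sum bit u equals 1
    __m128i m = _mm_cmpeq_epi16(vu, _mm_set1_epi16(1));
    // negate a under the mask, then add b with saturation
    __m128i va2 = _mm_sub_epi16(_mm_xor_si128(va, m), m);
    _mm_storeu_si128((__m128i*)out, _mm_adds_epi16(vb, va2));
}
```

With AVX2 the same pattern widens to sixteen LLRs per instruction; saturating addition in g avoids 16-bit overflow for large LLR magnitudes.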
The naive SC algorithm is purely sequential; hence the decoder introduces significant latency
in the FEC chain. With improvements such as [12] and [13], the decoding tree can be
pruned by identifying particular patterns in the frozen bit positions. Pruning the tree allows
multiple bits to be decoded in parallel. Within a polar code, component codes can be iden-
tified which allow decoding without traversing the full decoder tree. The authors in
[12] and [13] identify four such codes, namely rate-0, rate-1, repetition and single parity
check codes. Decoding a codeword through component codes improves latency without com-
promising the error correction performance. However, to fully benefit from decoding
tree pruning, the implementation must be able to identify component codes efficiently.
One simple functional way is to go through all frozen bits and search by comparing with
predefined patterns. This naive pattern search introduces significant latency into the
decoding process. Processors with AVX and AVX2 support provide registers which can
store 256 bits (128 bits with SSE) in a single register. The frozen pattern array/vector
contains either ones or zeros, a one indicating a frozen bit and a zero an information bit.
Since one bit is enough to represent the type of a bit position, the frozen pattern can be
stored by packing the type information of multiple positions into a single 256-bit register.
Bit packing allows identifying a pattern by comparing it with an integer vector using a
single SIMD instruction. For example, for a mother code size N = 256 the information
about which positions are frozen and which are not is stored in a single 256-bit SIMD
register in bit-packed format. Checking whether a node is rate-0, rate-1, SPC or RPC then
requires one SIMD comparison instruction. The snippet in Listing 5.4 illustrates an
example of identifying the node type in the bit-packed frozen bit pattern.
Listing 5.4.: Checking rate-0 node
template<>
inline int identify_R0<256>(uint64_t s[]) {
    // compare all four packed 64-bit words of the frozen pattern with all-ones
    __m256i temp1 = _mm256_loadu_si256((__m256i*)s);
    __m256i temp2 = _mm256_set1_epi8((char)0xFF);
    __m256i pcmp  = _mm256_cmpeq_epi64(temp1, temp2);
    unsigned bitmask = _mm256_movemask_epi8(pcmp);
    return (bitmask == 0xffffffffU);
}
The R0 code is a special kind of node in the decoding tree in which all descendants
represent frozen bit positions; in other words, the corresponding node's frozen pattern
contains all ones. One such example is given in the background chapter. For such a node,
we know that all bits are frozen, hence the decoder can immediately decode the values as
zero: all decoded bits corresponding to such a node are set to zero. The rate-0 node allows
the decoder to avoid the VN and CN operations at all subsequent child nodes, in addition
to decoding multiple bits simultaneously.
A node is considered an R1 node if all of its descendants in the decoder tree are informa-
tion bits; in other words, a rate-1 node contains no frozen bits. Decoding of rate-1
nodes is performed without traversing to the leaves of the decoder tree, which avoids a
significant number of VN and CN operations and function calls. However, decoding a
rate-1 node is not as straightforward as a rate-0 node: decoding is performed through
threshold detection of all the LLRs, followed by the polar transform on the result to obtain
the decoded bits.
Although R1 nodes avoid CN and VN operations, the naive decoding is not parallel: the
decoder needs to go through each of the LLRs to decode the bits and finally perform the
polar transform, and both steps are costly operations. To improve the latency through data
parallelism, threshold detection can make use of SIMD instructions. Threshold detection
of a complete vector is performed with a single SIMD comparison instruction.
This improves the parallelism factor to sixteen for 16-bit LLRs with AVX2 instructions.
The code snippet in Listing 5.5 presents an example where rate-1 node decoding is
implemented using 128-bit SSE instructions; the resulting parallelism factor is eight.
Listing 5.5.: Rate-1 node decoding
template<unsigned Nv>
void decR_1(int16_t demodLLRs[], int8_t beta[], int8_t decodedBits[]) {
    __m128i temp1, tempDecodedVec;
    const __m128i zeros = _mm_setzero_si128();
    _m_prefetch(beta);
    _m_prefetch(decodedBits);
    for (unsigned i = 0; i < Nv; i = i + 8) {
        // threshold detection: bit = 1 where LLR < 0
        temp1 = _mm_loadu_si128((__m128i*)(demodLLRs + i));
        tempDecodedVec = _mm_cmplt_epi16(temp1, zeros);
        tempDecodedVec = _mm_and_si128(tempDecodedVec, _mm_set1_epi16(1));
        // narrow the eight 16-bit results to 8 bit and store only those 8 bytes
        tempDecodedVec = _mm_packs_epi16(tempDecodedVec, _mm_setzero_si128());
        _mm_storel_epi64((__m128i*)(beta + i), tempDecodedVec);
        _mm_storel_epi64((__m128i*)(decodedBits + i), tempDecodedVec);
    }
    polarTransform<Nv>(decodedBits, decodedBits);
}
The next step in decoding a rate-1 node is the polar transform operation. It is equivalent
to performing polar encoding; as explained in the previous chapter, a binary tree represents
the encoding process. The efficient polar transform implementation reuses the same
optimization techniques employed in encoding, such as SIMD vectorization and lookup
tables.
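As a plain scalar reference (a sketch only, not the thesis listing; the SIMD and lookup-table variants follow the same butterfly structure), the polar transform can be written as an in-place XOR butterfly:

```cpp
#include <cstdint>

// In-place polar transform over N bits (one bit per byte): equivalent to
// multiplying the bit vector by the n-fold Kronecker power of the Arıkan
// kernel over GF(2). The transform is its own inverse.
template<unsigned N>
void polarTransformRef(int8_t bits[]) {
    for (unsigned stage = 1; stage < N; stage <<= 1)      // butterfly stages
        for (unsigned i = 0; i < N; i += 2 * stage)       // blocks per stage
            for (unsigned j = i; j < i + stage; ++j)      // upper wires
                bits[j] ^= bits[j + stage];               // XOR butterfly
}
```

Because the transform is an involution over GF(2), applying it twice returns the original vector, which gives a cheap self-check for any optimized implementation.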
The repetition code (RPC) is another type of component code identified within a polar
code. It also allows decoding multiple bits without full tree traversal. A node is considered
an RPC node when only its rightmost descendant carries information and all remaining
bits are frozen. The bit-packed frozen pattern allows easy identification of an RPC node:
if the frozen pattern at the node is equal to one, it is an RPC node. RPC node decoding
is identical to decoding a simple repetition code: the bit is decoded by summing all LLR
values and thresholding the result. The result of the threshold detection is stored at the
information bit position and the remaining bits at the node are set to zero:
β_v[i] = { 0, when Σ_j α_v[j] ≥ 0,
         { 1, otherwise.
Again, the RPC node decoding process has room for improvement. Decoding requires
summing all LLRs; the summation is performed efficiently with SIMD instructions
through data parallelism instead of adding individual values. Figure 5.3 illustrates
how blockwise addition is performed using SIMD instructions. This vectorization yields a
significant gain when the sum of a large LLR vector is required.
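Figure 5.3 is not reproduced here; as a hedged sketch (the function name is illustrative), the blockwise summation could be expressed with SSE2 intrinsics like this:

```cpp
#include <immintrin.h>
#include <cstdint>

// Blockwise sum of Nv 16-bit LLRs (Nv a multiple of 8), as needed for
// repetition-node decoding. Lanes are widened to 32 bit before accumulating
// so that summing large vectors cannot overflow.
int32_t sumLLRs(const int16_t llr[], unsigned Nv) {
    __m128i acc = _mm_setzero_si128();             // four 32-bit partial sums
    for (unsigned i = 0; i < Nv; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i*)(llr + i));
        // sign-extend the low and high four 16-bit lanes to 32 bit
        __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(v, v), 16);
        __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(v, v), 16);
        acc = _mm_add_epi32(acc, _mm_add_epi32(lo, hi));
    }
    // horizontal reduction of the four partial sums to one scalar
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}
```

The decoded bit is then the threshold detection of the returned sum, per the formula above.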
Another type of constituent code identified in the decoding tree is the single parity check
(SPC) code. SPC codes can also be decoded efficiently without complete tree
traversal. These nodes have a code rate of (Nv − 1)/Nv: an SPC node has only one frozen
bit, at the left-most position, and the remaining bits are information bits. Optimal ML
decoding of SPC codes can be performed with very low complexity. Similar to R1 decoding,
SPC decoding requires threshold detection of all the LLRs; to achieve this efficiently, the
whole-vector threshold detection optimization of the rate-1 node is reused. SPC decoding
requires two additional steps, namely finding the position of the minimum-magnitude LLR
and calculating the parity of the decoded bits. If the parity of the decoded bits is not even,
the bit at the position of the lowest-magnitude LLR is flipped. The final step is to obtain
the decoded bit values through the polar transform. The steps described above are shown
in Algorithm 5.
SPC decoding uses the same two operations (threshold detection and polar transform) as
rate-1 decoding, which allows reusing the same optimization techniques. Two
additional operations in SPC decoding are finding the position of the minimum-magnitude
LLR and calculating the parity. The time complexity of finding the minimum-magnitude
LLR position is O(N ). It can be reduced to O(N/8) by mapping the search to a
SIMD instruction which processes vectors of size eight in parallel. The SSE4.1 instruction
phminposuw comes to the rescue: it processes a 128-bit vector, computes the minimum
among the packed unsigned 16-bit values and returns both the position and the value.
Parity calculation of the decoded bits also requires iterating through all bits obtained after
threshold detection. To perform the parity calculation efficiently, the decoded bits are
packed into a single integer; the number of set bits in an integer is obtained by the
hardware instruction popcnt. If the number of set bits is odd, the bit at the lowest-
magnitude LLR position is flipped. These optimizations reduced the latency of SPC
decoding to less than 50% of the naive algorithmic implementation shown in Algorithm 5.
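A minimal sketch of these two helper operations (function names are illustrative, not the thesis code; the `phminposuw` instruction is reached through its intrinsic `_mm_minpos_epu16`, and `popcnt` through the compiler built-in) could look as follows:

```cpp
#include <immintrin.h>
#include <cstdint>

// Position of the minimum |LLR| among eight 16-bit values, using the
// phminposuw instruction: result lane 0 holds the minimum, lane 1 its index.
__attribute__((target("sse4.1")))
unsigned minAbsPos8(const int16_t llr[8]) {
    __m128i v    = _mm_loadu_si128((const __m128i*)llr);
    __m128i absV = _mm_max_epi16(v, _mm_sub_epi16(_mm_setzero_si128(), v));
    __m128i r    = _mm_minpos_epu16(absV);
    return (unsigned)_mm_extract_epi16(r, 1);
}

// Parity of the decoded bits after packing them into a single integer;
// __builtin_popcountll compiles to the popcnt instruction where available.
bool parityIsOdd(uint64_t packedBits) {
    return (__builtin_popcountll(packedBits) & 1) != 0;
}
```

For node sizes above eight, the per-block minima returned by `phminposuw` are compared pairwise to find the global minimum position.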
(Figure: a pruned decoder tree in which an SSC<8> node is split into two SSC<4> child nodes.)
Recursive function calls and branch instructions cost processor cycles. Authors in [10]
propose complete unrolling of the decoder, hence avoiding recursive function calls and
branch instructions. With complete unrolling, the number of special nodes and their levels
in the decoder tree are known at compile time due to the fixed frozen bit positions. In the
polar FEC chain of 5G, full unrolling of the decoder is not possible due to the varying
block-length and code rate requirements: different code rates make it impossible to know
the frozen bit positions at compile time. Therefore it is necessary to identify the presence
of special nodes in the decoder tree dynamically. In this work, partial unrolling of the
decoder is done, i.e. recursive function calls are completely avoided. The template concept
of the C++ language is used for automatic unrolling of functions. However, branches are
still present to check for the presence of special nodes in the tree. Partial unrolling also
reduced the decoding latency considerably.
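The template mechanism can be sketched as follows (names are illustrative, not the thesis code): each node size gets its own instantiation, so the compiler resolves the recursion and inlines the calls, leaving no run-time recursive function calls.

```cpp
#include <cstdint>

// Combine<Nv> instantiates Combine<Nv/2> at compile time; applied at the
// root it performs the bit-combine over the whole tree, which is exactly
// the in-place polar transform.
template<unsigned Nv>
struct Combine {
    static void run(int8_t beta[]) {
        Combine<Nv / 2>::run(beta);            // left child
        Combine<Nv / 2>::run(beta + Nv / 2);   // right child
        for (unsigned i = 0; i < Nv / 2; ++i)  // bit-combine: XOR one half,
            beta[i] ^= beta[i + Nv / 2];       // the other half is untouched
    }
};

template<>
struct Combine<1> {
    static void run(int8_t[]) {}               // leaf: nothing to combine
};
```

The run-time branches that remain in the partially unrolled decoder are only the checks for special (rate-0, rate-1, RPC, SPC) nodes.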
The prefetching instruction can be executed by software well in advance of the memory
access. These instructions, if used efficiently, can hide the memory access latency. In this
work, the decoder implementation is optimized for the AMD EPYC platform, which provides
cache prefetching instructions through the 3DNow! extensions. Due to the regular memory
access pattern of the polar decoder, prefetching instructions are used wherever possible to
hide memory access latency.
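As a sketch of the idea (the function name and the prefetch distance of 64 elements are assumptions, not the thesis code), a linear LLR pass with software prefetching could look like this:

```cpp
#include <immintrin.h>
#include <cstdint>

// Threshold detection over a linear LLR buffer with software prefetching:
// the cache line needed a few iterations ahead is requested early, so its
// load overlaps with the current computation. 64 int16_t elements equal two
// 64-byte cache lines; the distance is a tunable assumption.
void hardDecideWithPrefetch(const int16_t llr[], int8_t bits[], unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        if ((i & 31u) == 0)  // one hint per 64-byte line (32 x int16_t)
            _mm_prefetch((const char*)(llr + i + 64), _MM_HINT_T0);
        bits[i] = (llr[i] < 0) ? 1 : 0;  // threshold detection
    }
}
```

Prefetch hints never fault, so requesting an address slightly past the end of the buffer is harmless; the hint is simply dropped.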
Every non-leaf node in the decoder performs the bit combine operation to obtain β. Only
half of the β values are calculated by the XOR operation; the remaining bits are copied
unchanged. One optimization is to eliminate these superfluous copy operations by choosing
a suitable memory layout for the β values. If i is the size of β, only i/2 values are modified.
After updating these i/2 bits, the same aligned memory address is passed to the parent
node. Since vector sizes are always powers of two, the memory passed to the parent node
is again implicitly aligned to the SIMD vector size. This alignment allows vectorization of
the bit combination operation at the parent node.
(Figure: a pruned decoder tree in which an SSC<64> node is split into two SSC<32> child nodes.)
Figure 5.6 shows the error correction performance versus the pruning level for a code rate of 0.5.
Table 5.3.: Latency comparison: this work versus state of the art [3]

               this work    [3]*
Latency (µs)   5.3          8

* Scaled according to frequency
Authors in [3] use an Intel i7 processor running at 3.1 GHz for the latency measurement. In
this work, the AMD EPYC processor runs at 1.6 GHz. Therefore the latency values in [3]
are scaled by the frequency factor for a meaningful comparison.
Figure 5.6.: Pruned code error correction performance for code rate 0.5 (BLER versus SNR in dB)
5.8. Extracting Information Bits
After extracting the parity bits, these positions are marked as frozen to ease the extraction
of information bits.
5.11. Summary
In this chapter, the polar decoding FEC chain is implemented. Each component of the
FEC chain is analyzed to understand its complexity and latency contribution. After ex-
tensive analysis and code profiling, three major latency contributors are identified, namely
sub-block deinterleaving, polar decoding and CRC calculation. Sub-block deinterleaving
is optimized through algorithm reformulation and by mapping the interleaving operations
to specialized SIMD instructions. The biggest latency contributor in the decoding FEC
chain is the polar decoder, and a number of optimizations are performed to reduce its
latency. The optimizations which resulted in major latency reductions are the SIMD
implementation of the CN, VN and bit combination operations, component code decoding,
unrolling the decoder, avoiding superfluous copying and cache prefetching. CRC calculation
is optimized through a lookup table and parallel bit processing. The above optimizations
reduced the latency of the FEC chain by 10x compared to the naive implementation.
6. Conclusion and Outlook
The objective of this work is to study the feasibility of developing the polar FEC chain of
5G in software on a general-purpose processor while satisfying stringent latency require-
ments. In other words, all components of the encoder and decoder FEC chains are developed
on the general-purpose AMD EPYC processor. The software satisfies the latency constraint
of less than 50 µs. In the first part of the thesis, we provide the necessary background on
polar encoding/decoding and computer architecture. In the second part, we develop the
encoding and decoding FEC chains and optimize them to satisfy the necessary latency
constraints.
To begin with, we provided the necessary mathematical background on polar code con-
struction, polar encoding, and decoding, including different polar decoding algorithms.
To understand FEC chain development in software it is necessary to know the basics of
modern computer architecture; the computer architecture section covers pipelining, cache
memory and vector processing units in modern general-purpose processors.
In the next chapter, we discuss the details of the polar encoding FEC chain. In this chap-
ter, we analyze the different components of the FEC chain to identify latency contributors.
Each of these latency contributors is further studied, and its algorithm is reformulated to
avoid costly operations. Algorithms are reformulated to fit the specialized functional units
of modern processors, such as vector processing units, which provide data parallelism in
addition to very fast mathematical computations. In encoding, the major latency contrib-
utors were polar code construction, CRC calculation, encoding, and rate matching. A wide
range of optimization techniques, both algorithmic and platform specific, is employed to
reduce the latency: reducing algorithm complexity, using lookup tables, compiler hints for
better instruction scheduling, vector processing instructions for data parallelism, and
avoiding superfluous copy operations. The optimizations reduced the worst-case latency of
the encoding FEC chain from 451 µs to 40 µs, which is more than a 10x reduction.
For the decoding FEC chain, the same steps as in the encoding chain are followed to
identify the latency contributors. The major contributors in the decoding FEC chain were
the channel deinterleaver, sub-block deinterleaver, polar decoder, parity bit extractor, and
CRC calculation.
The decoding FEC chain extensively uses SIMD, bit count and cache prefetching instruc-
tions to reduce latency. The sub-block deinterleaving operation is divided into three small
primitive operations which are implemented efficiently with permute and blend vector
instructions. The polar decoder is optimized by implementing the XOR, CN, VN, bit
combination and frozen pattern identification operations using vector processing instruc-
tions. The parity bit extractor is optimized by avoiding expensive remove and erase
operations; instead, a modified algorithm marks indexes and dynamically calculates the
Hamming weights of the generator matrix rows. Finally, for CRC calculation an algorithm
based on a lookup table is developed based on [21], which processes blocks of data bits to
calculate the CRC. These optimizations significantly reduced the latency of the decoding
FEC chain from 391 µs to 40 µs, almost a 10x reduction.
As an outlook: in the decoding FEC chain stated above, the decoder is developed with the
fast-SSC algorithm. This algorithm has much lower error correction performance than
LDPC and Turbo counterparts of similar block length. As part of this work, the CRC-Aided
Successive Cancellation List (CA-SCL) [9] decoding algorithm is also implemented; how-
ever, it is not optimized for software. CA-SCL ideally suits very low SNR scenarios such
as mmWave communication. It has approximately 1.5 dB gain over the fast-SSC algorithm
for N = 2048 and list size L = 8. An ideal continuation of this work would be to extend
the decoding chain by incorporating the CA-SCL algorithm into the FEC chain. It would be
interesting to see the latency values of this algorithm, which has expensive sort and copy
operations.
Bibliography
[2] V. Bioglio, C. Condo, and I. Land, “Design of Polar Codes in 5G New Radio,” ArXiv
e-prints, Apr. 2018.
[4] 3GPP, “Technical Specification Group Radio Access Network; NR; Multiplexing and
channel coding,” Technical Specification (TS) 38.212, 3rd Generation Partnership
Project (3GPP), April 2018. Version 15.1.1.
[6] U. Drepper, “What Every Programmer Should Know About Memory,” tech. rep.,
Red Hat, Inc., 2018.
[9] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inf. Theory, vol. 61,
pp. 2213–2226, May 2015.
[10] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast polar decoders:
Algorithm and implementation,” IEEE J. Sel. Areas Commun., vol. 32, pp. 946–957,
May 2014.
[14] A. Herkersdorf, “Chip Multi Core Processors.” Lecture Notes, Institute for Inte-
grated Systems, Technische Universität München, 2017.
[15] A. Herkersdorf, “System on Chip Technologies.” Lecture Notes, Institute for Inte-
grated Systems, Technische Universität München, 2016.
[17] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein,
Introduction to Algorithms. Cambridge, Massachusetts: The MIT Press, 2009.
[18] 3GPP, “Technical Specification Group Radio Access Network; NR; Physical channels
and modulation,” Technical Specification (TS) 38.211, 3rd Generation Partnership
Project (3GPP), April 2018. Version 15.1.0.
[19] A. Fog, “Instruction tables: Lists of instruction latencies, throughputs and micro-
operation breakdowns for Intel, AMD and VIA CPUs,” tech. rep., Technical Univer-
sity of Denmark, 2018.
[21] D. V. Sarwate, “Computation of cyclic redundancy checks via table look-up,” Com-
mun. ACM, vol. 31, pp. 1008–1013, Aug. 1988.
[22] R. Mori and T. Tanaka, “Performance of polar codes with the construction using
density evolution,” IEEE Commun. Lett., vol. 13, pp. 519–521, July 2009.
[23] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Trans. Inf. Theory,
vol. 59, pp. 6562–6582, Oct 2013.
[24] G. He, J. Belfiore, I. Land, G. Yang, X. Liu, Y. Chen, R. Li, J. Wang, Y. Ge,
R. Zhang, and W. Tong, “Beta-expansion: A theoretical framework for fast and
recursive construction of polar codes,” in IEEE Global Telecommun. Conf, pp. 1–6,
Dec 2017.
[31] T. Cover and J. Thomas, Elements of Information Theory. Wiley series in telecom-
munications, New York: John Wiley & Sons, 1991.
[32] K. Niu, K. Chen, J. Lin, and Q. T. Zhang, “Polar codes: Primary concepts and
practical decoding algorithms,” IEEE Commun. Mag., vol. 52, pp. 192–203, July
2014.
[34] G. Sarkis and W. J. Gross, “Increasing the throughput of polar decoders,” IEEE
Commun. Lett., vol. 17, pp. 725–728, April 2013.
[35] B. L. Gal, C. Leroux, and C. Jego, “Multi-Gb/s software decoding of polar codes,”
IEEE Trans. Signal Process., vol. 63, pp. 349–359, Jan 2015.
[36] H. Ji, S. Park, J. Yeo, Y. Kim, J. Lee, and B. Shim, “Ultra-reliable and low-latency
communications in 5G downlink: Physical layer aspects,” IEEE Wireless Commun.,
vol. 25, pp. 124–130, June 2018.
[37] G. Sarkis, Efficient Encoders and Decoders for Polar Codes: Algorithms and Imple-
mentations. PhD thesis, McGill University, Montreal, Canada, 2016.