VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER ENGINEERING DEPARTMENT
LE MINH PHUC
BÙI HỮU TRÍ
GRADUATION THESIS
RESEARCH AND EVALUATE HARDWARE
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER ENGINEERING DEPARTMENT
PhD LAM DUC KHAI
HO CHI MINH CITY, 2021
LIST OF THE THESIS DEFENSE COUNCIL
The graduation thesis grading council was established under Decision No. 62/QD-DHCNTT, dated 14/02/2022, of the Rector of the University of Information Technology.
ACKNOWLEDGMENTS
After studying at the University of Information Technology, we would like to send our sincere thanks to the teachers of the Computer Engineering Department, who have given us knowledge and invaluable experience.
We are especially thankful to PhD Lam Duc Khai for helping us and taking the time to guide and instruct us so that we could complete the graduation thesis. Once again, we would like to sincerely thank the council for spending their time and effort to help us during the graduation thesis process. We apologize to everyone for our mistakes, and we hope the teachers and readers can forgive them.
TABLE OF CONTENTS
Chapter 1 OVERVIEW
1.1 Introduction
1.2 Project's goal
Chapter 2 THEORETICAL BASIS
2.1 Neural Network
2.2 Feed Forward Algorithm
2.3 Convolutional Neural Network
2.3.1 Convolutional layer
2.3.2 Padding and Stride
2.3.2.1 Padding
2.3.2.2 Stride
2.3.3 Pooling Layer
2.3.4 Fully Connected Layer
2.3.5 Weight
2.3.6 Cross-Entropy
2.3.7 Activation Function
2.4 Backpropagation
2.4.1 Backpropagation in Fully Connected Block
2.4.2 Backpropagation in Convolutional Block
2.5 VGG-16 Structure
2.6 MNIST Dataset
2.8 Development Tool
Chapter 3 SYSTEM ARCHITECTURE DESIGN AND IMPLEMENTATION
3.1 CNN Datapath IP architecture
3.1.1 Feed-forward
3.1.1.1 Convolve module
3.1.1.2 ReLU module
3.1.1.3 Max Pooling module
3.2 VGG-16 IP
3.2.1 Rock-paper-scissors dataset
3.2.2 VGG-16 Overall architecture
3.2.2.1 Quantization module
3.2.2.2 Mem block
3.3 Testbench environment
3.3.1 CNN Datapath IP
3.3.2 VGG-16
Chapter 4 CONCLUSION AND DEVELOPMENT DIRECTIONS
4.1.1.4 Comparison with related articles (Testing mode only)
4.1.2 VGG-16
4.1.2.1 Waveform
4.1.2.2 Accuracy
4.1.2.3 System Resources
4.1.2.4 Comparison with related articles
4.2 What We Gained, Limitation and Direction of Development
4.2.1 What We Gained
4.2.2 Limitation
4.2.3 Direction of Development
LIST OF FIGURES
A complete convolutional neural network [6]
Kernel in the field of Computer Vision [9]
Simulation of Convolution [6]
Padding = 1 for input matrix
Red box - Kernel, Black box — Input
Convolve valid padding
Convolve same padding
Convolve full padding
Stride = 1 for data matrix, the yellow tiles are kernel core
Max pooling
Average pooling
Sum pooling [6]
Fully Connected Layer
Graph of Sigmoid function
Graph of Tanh function
Graph of ReLU function
Graph of Softmax function
Neural Network example
Convolutional example
Mask of Pooling
Convolve module block diagram
ReLU module block diagram
Max Pooling module block diagram
Flatten perceptron module block diagram
Softmax module block diagram
CFED module block diagram
Flatten perceptron module block diagram
CECC module block diagram
CECC module operation
Update weight module operation
Update kernel module operation
Convolve valid module operation
Floating-point adder module block diagram
Floating-point multiplication module block diagram
Figure 4.1: Mixing testing mode and learning mode
Figure 4.2: Testing mode
Figure 4.3: Learning mode
Mem block example
Mem block signal
Testbench environment block diagram
Test flow
Testbench environment block diagram
Test flow
LIST OF TABLES
*Table 3.1: CNN Datapath IP IO information
*Table 3.2: Convolve module IO information.
*Table 3.3: ReLU module IO information
*Table 3.4: Max Pooling module IO information
*Table 3.5: Max Pooling module IO information
*Table 3.6: Softmax module IO information
*Table 3.7: CFED module IO information
*Table 3.8: CDCP module IO information
*Table 3.9: CECC module IO information .
*Table 3.10: Update weight module IO information
*Table 3.11: Assumed LUT consumption if implemented with the Feed-forward Convolve architecture
*Table 3.12: Assumed LUT consumption if implemented with the 3.1.2.5 architecture
*Table 3.13: Update kernel module IO information
*Table 3.14: Floating-point adder IO information
*Table 3.15: Floating-point multiplication IO information
*Table 3.16: Floating-point division IO information
*Table 3.17: Exponential IO information
*Table 4.1: Accuracy per epoch started from epoch 0
*Table 4.2: Accuracy per epoch started from epoch 12
*Table 4.3: Accuracy per epoch started from epoch 24
*Table 4.4: The error of Convolve, Pooling, and FC
*Table 4.5: The error of Softmax
*Table 4.6: The error of CFED, CECC, and CDCP
*Table 4.7: The error of Update weight, Update kernel
*Table 4.8: Implementation result of CNN Datapath IP
*Table 4.9: Implementation result of Feed-forward sub-modules
*Table 4.10: Implementation result of Backpropagation sub-modules (P1)
*Table 4.11: Implementation result of Backpropagation sub-modules (P2)
*Table 4.12: Comparison between [1] and our work
*Table 4.13: Comparison between RTL accuracy and Matlab accuracy
*Table 4.14: Implementation result of VGG-16 sub-modules
*Table 4.15: Implementation result of VGG-16
*Table 4.17: Comparison between [2] and our work
*Table 4.18: Error increase per epoch while updating parameters (U#: update number)
LIST OF ACRONYMS
DSP: Digital Signal Processing
FF: Flip-Flop
FIFO: First in First Out
FPGA: Field Programmable Gate Array
GPU: Graphics Processing Unit
HDL: Hardware Description Language
IP: Intellectual Property
ISE: Integrated Synthesis Environment
LUT: Look-Up Table
RAM: Random Access Memory
ReLU: Rectified Linear Unit
RTL: Register-transfer Level
SRAM: Static Random-access Memory
SUMMARY OF THESIS
In recent years, Deep Learning models have attracted the interest of many scientists, notably the Neural Network model, which is a good candidate for solving problems such as object recognition.
Nowadays, many companies produce AI chips that rely on the cloud for the training process, which will consume a huge number of chips as the field develops in the future. In response to this problem, developing an IP capable of learning will reduce the cost of production.
We propose and design a Convolutional Neural Network architecture that can both learn and test, as well as a VGG-16 design that uses quantization to reduce resources and obtain better accuracy.
Chapter 1 OVERVIEW
1.1 Introduction
In recent years, along with the explosion of the Industry 4.0 revolution, Machine Learning has played an important role in human life. We can find it in many devices around us, and it is also a key building block of many future devices such as automated cars. CNN is one of the most popular machine learning models; it has many architectures and uses, but image classification is one of the most basic and most commonly used.
To implement machine learning, many companies nowadays have produced AI chips with a specific CNN architecture such as YOLO, GoogleNet, VGG-16, etc. Mostly, today's AI chips implement only the Feed-forward operation, and the parameters are trained via the Cloud. With the growth of AI as well as human needs, the amount of data also grows, and the need for servers and Ethernet to perform the AI training also increases, resulting in a high cost for server and Ethernet chips.
To solve the above problem, our team implements a CNN IP capable of doing both the learning and the applying. With a CNN IP capable of both learning and applying operations, the pressure on server and Ethernet chip production will be reduced.
Besides, to reduce the area consumption and increase the performance of AI chips, our team also implemented a VGG-16 IP based on the Quantization method. With the Quantization method, the AI chip's arithmetic operations use low-bit-width arithmetic such as 8-bit integer adders, multipliers, and shifters. Low-bit-width arithmetic results in a significant reduction of area and an increase in speed.
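To make the idea of low-bit-width arithmetic concrete, the sketch below shows one simple way a 32-bit floating-point value can be mapped to an 8-bit integer through a scale factor. The symmetric, per-tensor scheme and the function names are assumptions for illustration only; the actual quantization used in the VGG-16 IP is described in Chapter 3.

```python
# Illustrative sketch of int8 quantization (assumed symmetric, per-tensor scaling).
def quantize_int8(values):
    # Scale so that the largest magnitude maps to 127
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floating-point values from the 8-bit integers
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.95]
q, scale = quantize_int8(weights)
print(q, scale)
print(dequantize(q, scale))   # close to the original floating-point weights
```

With such a mapping, the expensive floating-point multipliers and adders can be replaced by small integer units, which is where the area and speed gain comes from.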
1.2 Project’s goal
There are 2 main goals in our project:
The first goal is to design a CNN model on hardware capable of learning and testing. There are two parts to this goal: one is making a Feed-forward structure to perform the testing, and the other is a Backpropagation structure to perform the learning, compatible with the Feed-forward, using floating-point values. The CNN model will be experimented with in both learning and testing by using the MNIST dataset.
The second goal is to design a VGG-16 IP based on the Quantization method to reduce resource consumption.
Chapter 2 THEORETICAL BASIS
2.1 Neural Network
A neural network is built based on a biological neural network, inspired by the human brain. It consists of neurons that connect and process information by passing information along the connections and calculating values at the neurons.
Each neuron produces an output value that depends on its inputs and weights. A layer is a group of neurons that share the same input but have different weights (Figure 2.1).
After going through all the layers, the input data will have been processed and is then combined in the final layer to create a prediction.
The neural network learns by generating the difference (error) between the network's predictions and the desired values, and then uses those errors to update the weights and biases for more accuracy.
Figure 2.1: Basic Neural Network
2.2 Feed Forward Algorithm
Figure 2.2: Feed-forward algorithm
In Figure 2.2, the inputs x are fed to the neuron along with the weights w on their paths.
$val = \sum_{i} w_i x_i + b$ (2.1)
In this formula, the bias value b is the weight w0 in Figure 2.2, which is called the bias weight.
When producing its output, each neuron has to go through an activation function, whose formula is shown in Figure 2.2:
out = f(val) (2.2)
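As a small illustration of equations (2.1) and (2.2), the sketch below computes one neuron's output in plain Python. The function names and the choice of sigmoid as the activation f are assumptions for illustration, not part of the hardware design.

```python
import math

def sigmoid(z: float) -> float:
    # An example activation function f
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b):
    # Equation (2.1): weighted sum of the inputs plus the bias weight (w0 in Figure 2.2)
    val = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Equation (2.2): activation function applied to produce the neuron output
    return sigmoid(val)

# Example: 3 inputs, 3 weights, 1 bias
print(neuron_forward([0.5, -1.0, 2.0], [0.1, 0.4, -0.3], 0.2))
```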
2.3 Convolutional Neural Network
A convolutional neural network is a commonly used network for image processing and classification. It includes two large blocks: a feature extraction block (containing convolutional blocks and pooling blocks) and a classification block.
Figure 2.3 shows an example of a CNN model that classifies 5 classes: car, truck, airplane, ship, and horse. Besides that, Figure 2.4 illustrates the shape of the image as it goes through each convolutional block.
Figure 2.3: CNN example [5]
Figure 2.4: Shape of the image through each block (Conv_1, Conv_2 with valid padding, 2x2 Max-Pooling, ReLU activation, Fully-Connected Neural Network)
2.3.1 Convolutional layer
The performance of the convolution operation also depends on the following factors: kernel matrix size, padding, and stride.
The convolution formula (Figure 2.6) for each element of the feature matrix is:
$y_{ij} = \mathrm{sum}(A \otimes K)$ (2.3)
In this formula, A is the image input and K is the kernel matrix applied to calculate the convolution; $y_{ij}$ is a single-pixel output of the convolution.
Figure 2.6: Simulation of Convolution [6]
Figure 2.5 shows the convolution operation and several common kernel matrices that we can use in computer vision.
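The following sketch illustrates equation (2.3) for a valid (non-padded) convolution with stride 1: each output pixel is the sum of the element-wise product of the kernel and the image patch it covers (as is common for CNNs, the kernel is not flipped). The function name and the example values are illustrative only.

```python
def convolve_valid(A, K):
    # A: input image (2D list), K: kernel (2D list); valid padding, stride 1
    h, w = len(A), len(A[0])
    kh, kw = len(K), len(K[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0
            for m in range(kh):
                for n in range(kw):
                    acc += A[i + m][j + n] * K[m][n]   # y_ij = sum(A patch * K)
            row.append(acc)
        out.append(row)
    return out

A = [[1, 2, 3, 0],
     [4, 5, 6, 1],
     [7, 8, 9, 2],
     [1, 0, 1, 3]]
K = [[1, 0, -1],
     [1, 0, -1],
     [1, 0, -1]]   # a common vertical edge-detection kernel
print(convolve_valid(A, K))   # 2x2 feature map
```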
2.3.2 Padding and Stride
2.3.2.1 Padding
Padding is simply a process of adding layers of zeros to our input images
We call padding = n, where n > 0 is the number of layers of zeros added outside the matrix.
Figure 2.7: Padding = 1 for input matrix
Padding (example in Figure 2.7) is usually used to get a more accurate feature map during the convolve. With padding, we can separate the convolve operation into 3 types, with the given example in Figure 2.8.
Figure 2.8: Red box - Kernel, Black box — Input
Padding valid: non-padding convolve (Figure 2.9); the output width will be smaller than the input width, equal to (input width - kernel width + 1).
Figure 2.11: Convolve full padding
Note that the number of zero layers added to the input depends on the kernel size and on the chosen type of convolving algorithm.
2.3.2.2 Stride
Stride is the parameter that determines how far the sliding window (Figure 2.12) can "jump".
We call stride = n, where n > 0 is the distance (in matrix cells) that the sliding window will jump.
When stride > 1, the output matrix that we collect will be smaller in size than the input matrix.
If stride = 1, the output matrix size will depend only on the padding and the kernel size.
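The effect of padding and stride on the output size can be summarized by the usual formula O = floor((W - F + 2P) / S) + 1, which reduces to (input width - kernel width + 1) for the valid case above. The helper below is an illustrative assumption added for clarity, not taken from the thesis RTL.

```python
def output_width(W: int, F: int, P: int, S: int) -> int:
    # W: input width, F: kernel width, P: padding layers, S: stride
    return (W - F + 2 * P) // S + 1

print(output_width(28, 3, 0, 1))  # valid padding, stride 1 -> 26
print(output_width(28, 3, 1, 1))  # same padding, stride 1  -> 28
print(output_width(28, 2, 0, 2))  # 2x2 window, stride 2    -> 14 (pooling case)
```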
Figure 2.15: Sum pooling [6]
2.3.4 Fully Connected Layer
After processing the image, the image features will be pushed into the fully connected layer. This layer has the function of converting the feature matrix of the previous layer into a vector containing the probabilities of the objects that need to be predicted.
Figure 2.16: Fully Connected Layer
For example, in Figure 2.16, the fully connected layer converts the feature tensor of the previous layer into a 2-dimensional vector representing the probabilities of the 2 corresponding classes.
Finally, the process of training the Convolutional Neural Network model for the image classification problem is similar to training other models. We need an error function to measure how far the model's prediction is from the label, and we use a backpropagation algorithm for the weight update process.
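A minimal sketch of what the fully connected layer computes: the feature matrix is flattened into a vector, and each output class score is a weighted sum plus a bias. Sizes and names are illustrative only; in the real network the scores are then passed through Softmax (Section 2.3.7).

```python
def fully_connected(features_2d, weights, biases):
    # Flatten the feature matrix into a vector
    flat = [v for row in features_2d for v in row]
    # One weighted sum (plus bias) per output class
    return [sum(w * x for w, x in zip(w_row, flat)) + b
            for w_row, b in zip(weights, biases)]

features = [[0.2, 0.5],
            [0.1, 0.7]]                     # 2x2 feature map -> 4 values
weights  = [[0.3, -0.1, 0.8, 0.05],         # class 0
            [-0.2, 0.4, 0.1, 0.6]]          # class 1
print(fully_connected(features, weights, [0.0, 0.1]))  # 2 class scores
```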
2.3.5 Weight
Each neuron in a neural network computes an output value by applying a specific function to the input values coming from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning in a Neural Network progresses by making iterative adjustments to these biases and weights.
The vector of weights and the bias are called a filter and represent a specific feature of the input (e.g., a particular shape). A distinguishing feature of Neural Networks is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields sharing that filter, as opposed to each receptive field having its own bias and weight vector.
2.3.6 Cross-Entropy
The Cross-Entropy (CE) function is responsible for the backpropagation calculation of the softmax function.
This function is often used in classification when the output uses the Softmax activation function.
In addition, it can help make the output of the CNN clearer.
For example, if we have 2 outputs of 0.8 and 0.2, then after passing them through the CE function we get 1 and 0, respectively, similar to one-hot encoding a sequence of numbers in which the largest number becomes 1.
The formula of the function:
$CE = -\sum_{k} p(k)\log\big(q(k)\big)$ (2.4)
Derivative:
$\dfrac{\partial CE}{\partial q(k)} = -\dfrac{p(k)}{q(k)}$ (2.5)
Derivative of softmax combined with CE: $q(k) - p(k)$ (2.6)
In these formulas, q(k) is the result that the neural network calculates and p(k) is the expected (label) value.
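The sketch below illustrates equations (2.4) and (2.6): it computes the cross-entropy of a softmax output against a one-hot label and shows that the combined gradient with respect to the logits is simply q(k) - p(k). Function and variable names are illustrative only.

```python
import math

def softmax(z):
    m = max(z)                               # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(p, q, eps=1e-12):
    # CE = -sum_k p(k) * log(q(k)), equation (2.4)
    return -sum(pk * math.log(qk + eps) for pk, qk in zip(p, q))

logits = [2.0, 0.5, -1.0]
p = [1.0, 0.0, 0.0]                          # one-hot label
q = softmax(logits)
print(cross_entropy(p, q))
# Gradient of CE(softmax(logits)) with respect to the logits, equation (2.6): q(k) - p(k)
print([qk - pk for qk, pk in zip(q, p)])
```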
2.3.7 Activation Function
Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ (2.7)
The disadvantage of the Sigmoid function is that if the input is too large or too small, the function will saturate, making it impossible to update the parameters (Figure 2.17 shows the graph of the Sigmoid function).
Derivative of Sigmoid: $\sigma(z)' = \sigma(z) \cdot \big(1 - \sigma(z)\big)$ (2.8)
Figure 2.17: Graph of Sigmoid function
Tanh: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (2.9)
As the input gets larger, the function gives a value closer to 1; vice versa, the function gives a value close to -1.
The disadvantage of the tanh function is similar to that of the sigmoid function (Figure 2.18 shows the graph of the Tanh function).
Derivative of Tanh: $\tanh(x)' = 1 - \tanh^{2}(x)$ (2.10)
Figure 2.18: Graph of Tanh function
ReLU: $f(x) = \max(0, x)$ (2.11)
The ReLU function is very popular because it is simpler to calculate and does not saturate like the sigmoid and tanh functions.
The ReLU function has the effect of filtering out all values < 0.
The disadvantage of the ReLU function is that if too many values are less than 0, the output of the ReLU function becomes 0; this is the phenomenon of Dying ReLU (Figure 2.19 shows the graph of the ReLU function).
The softmax function is used to calculate the ratio of the inputs to each other and is widely used for calculating the final classification result of neural networks.
Usually, we use softmax together with the cross-entropy function so that it can be easily calculated in the chain rule (Figure 2.20 shows the graph of the Softmax function).
2.4 Backpropagation
2.4.1 Backpropagation in Fully Connected Block
Figure 2.21: Neural Network example
We have an example of a Neural Network in Figure 2.21.
Assuming the correct values of the outputs are y1 and y2, we have the errors:
$E_{o1} = \frac{1}{2}\big(y_1 - out(o1)\big)^2$ (2.17)
$E_{o2} = \frac{1}{2}\big(y_2 - out(o2)\big)^2$ (2.18)
$E_{total} = E_{o1} + E_{o2}$ (2.19)
The derivative of $E_{total}$ with respect to a weight w is the effect of that w on $E_{total}$, which is also the adjustment of that w. For the adjustment of w5 we have:
$\dfrac{\partial E_{total}}{\partial w5} = \dfrac{\partial E_{total}}{\partial out(o1)} \cdot \dfrac{\partial out(o1)}{\partial val(o1)} \cdot \dfrac{\partial val(o1)}{\partial w5}$ (2.20)
In which: $E_{total} = \frac{1}{2}\big(y_1 - out(o1)\big)^2 + \frac{1}{2}\big(y_2 - out(o2)\big)^2$, so
$\dfrac{\partial E_{total}}{\partial out(o1)} = -\big(y_1 - out(o1)\big)$ (2.21)
$\dfrac{\partial out(o1)}{\partial val(o1)} = f'\big(val(o1)\big)$, where f is the activation function (2.22)
$val(o1) = out(h1) \cdot w5 + out(h2) \cdot w6 + b2 \cdot 1$
$\dfrac{\partial val(o1)}{\partial w5} = out(h1)$ (2.23)
The new w5:
$new(w5) = old(w5) - learning\_rate \cdot \dfrac{\partial E_{total}}{\partial w5}$ (2.24)
Similarly for the other weights, we update each new w, and through many updates the weights will gradually converge.
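A tiny numerical sketch of the update rule in equation (2.24): the weight is repeatedly moved against its gradient until it converges. The error function used here is a simple quadratic chosen for illustration, not the network of Figure 2.21.

```python
def update_weight(w, grad, learning_rate=0.5):
    # Equation (2.24): new_w = old_w - learning_rate * dE/dw
    return w - learning_rate * grad

# Example: minimise E(w) = 0.5 * (target - w * x)^2 for one input/target pair
x, target = 1.5, 3.0
w = 0.0
for step in range(20):
    out = w * x
    grad = -(target - out) * x          # dE/dw for the quadratic error above
    w = update_weight(w, grad)
print(w, w * x)                          # w converges so that w * x is close to target
```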
2.4.2 Backpropagation in Convolutional Block
Similar to the fully connected block, we use the derivative to calculate the adjustment value. Figure 2.22 below is an example for the Convolutional block:
$\dfrac{\partial error(x)}{\partial w} = \dfrac{\partial error(x)}{\partial out(c)} \cdot \dfrac{\partial out(c)}{\partial val(c)} \cdot \dfrac{\partial val(c)}{\partial w}$ (2.25)
$val(c) = conv(image, w)$
$\dfrac{\partial val(c)}{\partial w} = conv\_full\big(val(c), image\big)$ (2.28)
Figure 2.23: Mask of Pooling
When pooling, we need to create a mask (Figure 2.23) to save the initial positions of the elements of (B) in (A) before pooling.
The purpose of creating the mask is to serve the reshape during the Backpropagation phase and to create the derivative of ReLU.
How to make ∂pool:
Figure 2.24: Making ∂pool
Step 1: resize the flattened error back to the pooled shape and expand (double) it (Figure 2.24).
Step 2: multiply the matrix just created in Step 1 element-wise with the mask created during the feed-forward pass; we then get the pooling derivative (Figure 2.24).
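The following behavioural sketch puts Figures 2.23 and 2.24 together: a 2x2 max pooling that records a mask of where each maximum came from, and a backward step that expands the pooled error and multiplies it by that mask. It is a pure-Python illustration, not the hardware implementation.

```python
def maxpool2x2_with_mask(A):
    # Forward pass: 2x2 max pooling with stride 2, recording the max positions
    h, w = len(A), len(A[0])
    out = [[0.0] * (w // 2) for _ in range(h // 2)]
    mask = [[0] * w for _ in range(h)]           # 1 where the max was taken
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = [(A[i + di][j + dj], i + di, j + dj)
                      for di in (0, 1) for dj in (0, 1)]
            val, mi, mj = max(window)
            out[i // 2][j // 2] = val
            mask[mi][mj] = 1
    return out, mask

def maxpool_backward(error, mask):
    # Step 1: expand each pooled error over its 2x2 window,
    # Step 2: multiply element-wise by the mask (the pooling derivative).
    h, w = len(mask), len(mask[0])
    return [[error[i // 2][j // 2] * mask[i][j] for j in range(w)]
            for i in range(h)]

A = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [3, 6, 4, 4]]
pooled, mask = maxpool2x2_with_mask(A)
d_pool = maxpool_backward([[0.1, 0.2], [0.3, 0.4]], mask)
print(pooled)   # [[4, 5], [6, 7]]
print(d_pool)   # error routed back to the positions of the maxima
```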
2.5 VGG-16 Structure
VGG-16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" [4]. The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset of over 14 million images belonging to 1000 classes.
Figure 2.25: VGG-16 Structure
The VGG-16 structure (Figure 2.25) has 16 layers: 13 layers of convolutional block and
3 layers of fully connected block
In the convolutional blocks, it uses convolve with padding n = 1 to get an output size equal to the input size.
2.6 MNIST Dataset
The MNIST [7] database (Figure 2.26) is a large database of handwritten digits commonly used for training various image processing systems. This database is also widely used for training and testing in the field of machine learning. The database was created by "remixing" samples from the original NIST dataset. The database creators felt that, because the NIST training dataset was obtained from the US Census Bureau while the test dataset was obtained from US high school students, it was not well suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit 28x28 pixels.
Figure 2.26: Sample digits from the MNIST dataset [7]
Chapter 3 SYSTEM ARCHITECTURE DESIGN AND IMPLEMENTATION
3.1 CNN Datapath IP architecture
The CNN Datapath IP contains two main operations. One is the Feed-forward, which is used in both the testing mode and the learning mode. The other one is the Backpropagation, for the learning mode only. While in the testing mode, the CNN Datapath only performs the Feed-forward operation and the Backpropagation is off. In the learning mode, both the Feed-forward and the Backpropagation will be executed.
*Table 3.1: CNN Datapath IP IO information.
Name | No bits | Direction | Description
clk | 1 | Input | System clock
rst_n | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset
rst_backprop | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset except the parameter memory
en_update | 1 | Input | HIGH: Learning mode. LOW: Testing mode
pixel_in | 8 | Input | Image pixel input
valid_in | 1 | Input | Synchronous signal, indicates the pixel_in is valid
kernel_in | 4*32 | Input | Initial kernel input
weight_in | 10*32 | Input | Initial weight input
… | … | Input | Load initial kernels into the Parameters memory
predict | 4 | Output | The prediction output
valid_out_forward | 1 | Output | Indicates the predict signal is valid
… | … | Output | Indicates a learning process has been completed
3.1.1 Feed-forward
3.1.1.1 Convolve module
The Convolve module (Figure 3.3) is designed with a 3x3 kernel size for the convolve algorithm. The architecture uses Line-buffers to perform the sliding-window output. The edge detection in this module is in charge of detecting the end of an image line or frame.
By using two line buffers, the delay from the input to the output of the line buffer takes only WIDTH*2+2 cycles before the pixel comes out. The MAC contains multiple fully pipelined floating-point multipliers and a floating-point adder tree. The floating-point multiplication is optimized and takes only 6 cycles for the calculation, compared to 47 cycles in [1]. However, the delay of the floating-point adder takes more cycles than [1], namely 7 cycles due to performance optimization, while [1] only took 4 cycles. In total, the delay of the convolve is much shorter than [1]: it takes only 93 cycles for a 28x28 image before the first pixel output comes out, while [1] took 121 cycles.
Figure 3.3: Convolve module block diagram
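To clarify the line-buffer idea, the following simplified software model (an assumption for illustration, not the actual RTL) shows how two buffered rows plus the incoming row are enough to form every 3x3 window of a streamed image.

```python
def sliding_3x3_windows(pixels, width):
    # Split the flat pixel stream into image rows
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    prev2, prev1 = None, None               # the two line buffers (previous rows)
    windows = []
    for row in rows:
        if prev2 is not None:               # need two full previous rows
            for x in range(2, width):
                # The 3x3 window is completed by the newest pixel of the current row
                windows.append([prev2[x - 2:x + 1],
                                prev1[x - 2:x + 1],
                                row[x - 2:x + 1]])
        prev2, prev1 = prev1, row           # shift the line buffers
    return windows

image = list(range(28 * 28))                # a 28x28 test image as a flat stream
wins = sliding_3x3_windows(image, 28)
print(len(wins))                            # 26 * 26 = 676 windows for valid convolve
```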
*Table 3.2: Convolve module IO information
Name | No bits | Direction | Description
clk | 1 | Input | System clock
rst_n | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset
data_in | 32 | Input | Data input
valid_in | 1 | Input | Synchronous signal, indicates the data_in is valid
data_out | 32 | Output | The convolve output
valid_out | 1 | Output | Indicates the convolved output is valid
3.1.1.2 ReLU module:
The ReLU module (Figure 3.4) is designed by using only a MUX. The output will be zero if the input is a negative number, and equal to the input if it is a positive number. The valid_out in the ReLU module is bypassed from valid_in without delay.
Figure 3.4: ReLU module block diagram
*Table 3.3: ReLU module IO information
Name | No bits | Direction | Description
data_in | 32 | Input | Data input
data_out | 32 | Output | The ReLU output
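Because the data is a 32-bit floating-point word, the MUX described above only needs to look at the sign bit. The sketch below models that behaviour in Python, assuming the standard IEEE-754 single-precision format; it is an illustration, not the RTL.

```python
import struct

def relu_float32_word(word: int) -> int:
    # word is the 32-bit pattern of the input; bit 31 is the sign bit.
    # If the sign bit is set (negative number), the MUX selects the constant 0.
    return 0 if (word >> 31) & 1 else word

def float_to_word(x: float) -> int:
    return struct.unpack('<I', struct.pack('<f', x))[0]

def word_to_float(word: int) -> float:
    return struct.unpack('<f', struct.pack('<I', word))[0]

for x in (3.5, -2.25, 0.0):
    y = word_to_float(relu_float32_word(float_to_word(x)))
    print(x, '->', y)    # 3.5 -> 3.5, -2.25 -> 0.0, 0.0 -> 0.0
```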
3.1.1.3 Max Pooling module:
The Max Pooling module (Figure 3.5) is designed by using only one line buffer for the 2x2 sliding window. The stride detection performs the stride by removing the unused values in the Max Pooling operation and is designed with multiple counters. The main operation of Pooling is the Comparator, which compares 4 inputs and bypasses the highest value to the output.
The Max Pooling module has more latency than [1]. In our work, since we made more pipeline stages in the comparator than [1] to improve performance, our comparator takes 3 cycles while [1] only took 2 cycles. The single line buffer has the same delay (WIDTH+1) as [1].
Figure 3.5: Max Pooling module block diagram
*Table 3.4: Max Pooling module IO information
Name | No bits | Direction | Description
clk | 1 | Input | System clock
rst_n | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset
data_in | 32 | Input | Data input