VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER ENGINEERING DEPARTMENT
LE MINH PHUC
BÙI HỮU TRÍ
GRADUATION THESIS
RESEARCH AND EVALUATE HARDWARE
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER ENGINEERING DEPARTMENT
PhD LAM DUC KHAI
HO CHI MINH CITY, 2021
LIST OF THE THESIS DEFENSE COUNCIL
The graduation thesis grading council was established under Decision No. 62/QD-DHCNTT, dated 14/02/2022, of the Rector of the University of Information Technology.
ACKNOWLEDGMENTS
After studying at the University of Information Technology, we would like to send our sincere thanks to the teachers of the Computer Engineering Department, who have given us knowledge and invaluable experience.
We are especially thankful to PhD Lam Duc Khai for helping us and taking the time to guide and instruct us so that we could complete the graduation thesis. Once again, we would like to sincerely thank the council for spending their time and effort to help us during the graduation thesis process. We apologize to everyone for our mistakes, and we hope the teachers and readers can forgive them.
TABLE OF CONTENTS
Chapter 1 OVERVIEW
1.1 Introduction
1.2 Project's goal
Chapter 2 THEORETICAL BASIS
2.1 Neural Network
2.2 Feed Forward Algorithm
2.3 Convolutional Neural Network
2.3.1 Convolutional layer
2.3.2 Padding and Stride
2.3.2.1 Padding
2.3.2.2 Stride
2.3.3 Pooling Layer
2.3.4 Fully Connected Layer
2.3.5 Weight
2.3.6 Cross-Entropy
2.3.7 Activation Function
2.4 Backpropagation
2.4.1 Backpropagation in Fully Connected Block
2.4.2 Backpropagation in Convolutional Block
2.5 VGG-16 Structure
2.6 MNIST Dataset
2.8 Development Tool
Chapter 3 SYSTEM ARCHITECTURE DESIGN AND IMPLEMENTATION
3.1 CNN Datapath IP architecture
3.1.1 Feed-forward
3.1.1.1 Convolve module
3.1.1.2 ReLU module
3.1.1.3 Max Pooling module
3.2 VGG-16 IP
3.2.1 Rock-paper-scissors dataset
3.2.2 VGG-16 Overall architecture
3.2.2.1 Quantization module
3.2.2.2 Mem block
3.3 Testbench environment
3.3.1 CNN Datapath IP
3.3.2 VGG-16
Chapter 4 CONCLUSION AND DEVELOPMENT DIRECTIONS
4.1.1.4 Comparison with related articles (Testing mode only)
4.1.2 VGG-16
4.1.2.1 Waveform
4.1.2.2 Accuracy
4.1.2.3 System Resources
4.1.2.4 Comparison with related articles
4.2 What We Gained, Limitation and Direction of Development
4.2.1 What We Gained
4.2.2 Limitation
4.2.3 Direction of Development
LIST OF FIGURES
A complete convolutional neural network [6]
Kernel in the field of Computer Vision [9]
Simulation of Convolution [6]
Padding = 1 for input matrix
Red box - Kernel, Black box — Input
Convolve valid padding
Convolve same padding
Convolve full padding
Stride = 1 for data matrix, the yellow tiles are kernel core
Max pooling
Average pooling
Sum pooling [6]
Fully Connected Layer
Graph of Sigmoid function
Graph of Tanh function
Graph of ReLU function
Graph of Softmax function
Neural Network example
Convolutional example
Mask of Pooling
Convolve module block diagram
ReLU module block diagram
Max Pooling module block diagram
Flatten perceptron module block diagram
Softmax module block diagram
CFED module block diagram
Flatten perceptron module block diagram
CECC module block diagram
CECC module operation
Update weight module operation
Update kernel module operation
Convolve valid module operation
Floating-point adder module block diagram
Floating-point multiplication module block diagram
Figure 4.1: Mixing testing mode and learning mode
Figure 4.2: Testing mode
Figure 4.3: Learning mode
Mem block example
Mem block signal
Testbench environment block diagram
Test flow
Testbench environment block diagram
Test flow
LIST OF TABLES
*Table 3.1: CNN Datapath IP IO information
*Table 3.2: Convolve module IO information.
*Table 3.3: ReLU module IO information
*Table 3.4: Max Pooling module IO information
*Table 3.5: Max Pooling module IO information
*Table 3.6: Softmax module IO information
*Table 3.7: CFED module IO information
*Table 3.8: CDCP module IO information
*Table 3.9: CECC module IO information .
*Table 3.10: Update weight module IO information
*Table 3.11: Assumed LUT consumption if implemented with the Feed-forward Convolve architecture
*Table 3.12: Assumed LUT consumption if implemented with the 3.1.2.5 architecture
*Table 3.13: Update kernel module IO information
*Table 3.14: Floating-point adder IO information
*Table 3.15: Floating-point multiplication IO information
*Table 3.16: Floating-point division IO information
*Table 3.17: Exponential IO information
*Table 4.1: Accuracy per epoch started from epoch 0
*Table 4.2: Accuracy per epoch started from epoch 12
*Table 4.3: Accuracy per epoch started from epoch 24
*Table 4.4: The error of Convolve, Pooling, and FC
*Table 4.5: The error of Softmax
*Table 4.6: The error of CFED, CECC, and CDCP
*Table 4.7: The error of Update weight, Update kernel
*Table 4.8: Implementation result of CNN Datapath IP
*Table 4.9: Implementation result of Feed-forward sub-modules
*Table 4.10: Implementation result of Backpropagation sub-modules (P1)
*Table 4.11: Implementation result of Backpropagation sub-modules (P2)
*Table 4.12: Comparison between [1] and our work
*Table 4.13: Comparison between RTL accuracy and Matlab accuracy
*Table 4.14: Implementation result of VGG-16 sub-modules
*Table 4.15: Implementation result of VGG-16
*Table 4.17: Comparison between [2] and our work
*Table 4.18: Error increase per epoch while updating parameters (U#: update number)
LIST OF ACRONYMS
DSP: Digital Signal Processing
FF: Flip-Flop
FIFO: First in First Out
FPGA: Field Programmable Gate Array
GPU: Graphics Processing Unit
HDL: Hardware Description Language
IP: Intellectual Property
ISE: Integrated Synthesis Environment
LUT: Look-Up Table
RAM: Random Access Memory
ReLU: Rectified Linear Unit
RTL: Register-transfer Level
SRAM: Static Random-access Memory
SUMMARY OF THESIS
In recent years, Deep Learning models have attracted the interest of many scientists, notably the Neural Network model, which is a good candidate for solving problems such as object recognition.
Nowadays, many companies produce AI chips that rely on the cloud for the training process, which will consume a huge number of chips as the field develops in the future. In response to this problem, developing an IP capable of learning will reduce the cost of production.
We propose and design a Convolutional Neural Network architecture that can both learn and test, as well as a VGG-16 design that uses quantization to reduce resources and obtain better accuracy.
Chapter 1 OVERVIEW
1.1 Introduction
In recent years, along with the explosion of the Industry 4.0 revolution, Machine Learning has played an important role in human life. We can find it in many devices around us, and it is also a key building block of many future devices such as automated cars. CNN is one of the most popular machine learning models; it has many architectures and uses, but image classification is one of the most basic and most commonly used.
To implement machine learning, many companies nowadays have produced AI chips with a specific CNN architecture such as YOLO, GoogleNet, VGG-16, etc. Mostly, today's AI chips implement only the Feed-forward operation, and the parameters are trained via the Cloud. With the growth of AI as well as human needs, the amount of data also grows, and the need for servers and Ethernet to perform the AI training also increases, resulting in a high cost for server and Ethernet chips.
To solve the above problem, our team implements a CNN IP capable of doing both the learning and the applying. With a CNN IP capable of both learning and applying operations, the pressure on server and Ethernet chip production will be reduced.
Besides, to reduce the area consumption and increase the performance of AI chips, our team also implemented a VGG-16 IP based on the Quantization method. With the Quantization method, the AI chip's arithmetic operations use low-bit-width arithmetic such as 8-bit integer adders, multipliers, and shifters. Low-bit-width arithmetic results in a significant reduction of area and an increase in speed.
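To make the idea of low-bit-width arithmetic concrete, the sketch below shows one simple way a 32-bit floating-point value can be mapped to an 8-bit integer through a scale factor. The symmetric, per-tensor scheme and the function names are assumptions for illustration only; the actual quantization used in the VGG-16 IP is described in Chapter 3.

```python
# Illustrative sketch of int8 quantization (assumed symmetric, per-tensor scaling).
def quantize_int8(values):
    # Scale so that the largest magnitude maps to 127
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floating-point values from the 8-bit integers
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.95]
q, scale = quantize_int8(weights)
print(q, scale)
print(dequantize(q, scale))   # close to the original floating-point weights
```

With such a mapping, the expensive floating-point multipliers and adders can be replaced by small integer units, which is where the area and speed gain comes from.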
1.2 Project’s goal
There are 2 main goals in our project:
The first goal is to design a CNN model on hardware capable of learning and testing. There are two parts to this goal: one is making a Feed-forward structure to perform the testing, and the other is a Backpropagation structure to perform the learning, compatible with the Feed-forward, using floating-point values. The CNN model will be experimented with in both learning and testing by using the MNIST dataset.
The second goal is to design a VGG-16 IP based on the Quantization method to reduce resource consumption.
Chapter 2 THEORETICAL BASIS
2.1 Neural Network
A neural network is built based on a biological neural network, inspired by the human brain. It consists of neurons that connect and process information by passing information along the connections and calculating values at the neurons.
Each neuron produces an output value that depends on its inputs and weights. A layer is a group of neurons that share the same input but have different weights (Figure 2.1).
After going through all the layers, the input data will have been processed and is then combined in the final layer to create a prediction.
The neural network learns by generating the difference (error) between the network's predictions and the desired values, and then uses those errors to update the weights and biases for more accuracy.
Figure 2.1: Basic Neural Network
2.2 Feed Forward Algorithm
Figure 2.2: Feed-forward algorithm
In Figure 2.2, the inputs x are fed to the neuron along with the weights w on their paths.
$val = \sum_{i} w_i x_i + b$ (2.1)
In this formula, the bias value b is the weight w0 in Figure 2.2, which is called the bias weight.
When producing its output, each neuron has to go through an activation function, whose formula is shown in Figure 2.2:
out = f(val) (2.2)
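As a small illustration of equations (2.1) and (2.2), the sketch below computes one neuron's output in plain Python. The function names and the choice of sigmoid as the activation f are assumptions for illustration, not part of the hardware design.

```python
import math

def sigmoid(z: float) -> float:
    # An example activation function f
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b):
    # Equation (2.1): weighted sum of the inputs plus the bias weight (w0 in Figure 2.2)
    val = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Equation (2.2): activation function applied to produce the neuron output
    return sigmoid(val)

# Example: 3 inputs, 3 weights, 1 bias
print(neuron_forward([0.5, -1.0, 2.0], [0.1, 0.4, -0.3], 0.2))
```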
2.3 Convolutional Neural Network
A convolutional neural network is a commonly used network for image processing and classification. It includes two large blocks: a feature extraction block (containing convolutional blocks and pooling blocks) and a classification block.
Figure 2.3 shows an example of a CNN model that classifies 5 classes: car, truck, airplane, ship, and horse. Besides that, Figure 2.4 illustrates the shape of the image as it goes through each convolutional block.
Figure 2.3: CNN example [5]
Figure 2.4: Shape of the image through each block (Conv_1, Conv_2 with valid padding, 2x2 Max-Pooling, ReLU activation, Fully-Connected Neural Network)
2.3.1 Convolutional layer
The performance of the convolution operation also depends on the following factors: kernel matrix size, padding, and stride.
The convolution formula (Figure 2.6) for each element of the feature matrix is:
$y_{ij} = \mathrm{sum}(A \otimes K)$ (2.3)
In this formula, A is the image input and K is the kernel matrix applied to calculate the convolution; $y_{ij}$ is a single-pixel output of the convolution.
Figure 2.6: Simulation of Convolution [6]
Figure 2.5 shows the convolution operation and several common kernel matrices that we can use in computer vision.
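The following sketch illustrates equation (2.3) for a valid (non-padded) convolution with stride 1: each output pixel is the sum of the element-wise product of the kernel and the image patch it covers (as is common for CNNs, the kernel is not flipped). The function name and the example values are illustrative only.

```python
def convolve_valid(A, K):
    # A: input image (2D list), K: kernel (2D list); valid padding, stride 1
    h, w = len(A), len(A[0])
    kh, kw = len(K), len(K[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0
            for m in range(kh):
                for n in range(kw):
                    acc += A[i + m][j + n] * K[m][n]   # y_ij = sum(A patch * K)
            row.append(acc)
        out.append(row)
    return out

A = [[1, 2, 3, 0],
     [4, 5, 6, 1],
     [7, 8, 9, 2],
     [1, 0, 1, 3]]
K = [[1, 0, -1],
     [1, 0, -1],
     [1, 0, -1]]   # a common vertical edge-detection kernel
print(convolve_valid(A, K))   # 2x2 feature map
```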
2.3.2 Padding and Stride
2.3.2.1 Padding
Padding is simply a process of adding layers of zeros to our input images
We call padding = n, where n > 0 is the number of layers of zeros added outside the matrix.
Figure 2.7: Padding = 1 for input matrix
Padding (example in Figure 2.7) is usually used to get a more accurate feature map during the convolve. With padding, we can separate the convolve operation into 3 types, with the given example in Figure 2.8.
Figure 2.8: Red box - Kernel, Black box — Input
Padding valid: non-padding convolve (Figure 2.9); the output width will be smaller than the input width, equal to (input width - kernel width + 1).
Figure 2.11: Convolve full padding
Note that the number of zero layers added to the input depends on the kernel size and on the chosen type of convolving algorithm.
2.3.2.2 Stride
Stride is the parameter that determines how far the sliding window (Figure 2.12) can "jump".
We call stride = n, where n > 0 is the distance (in matrix cells) that the sliding window will jump.
When stride > 1, the output matrix that we collect will be smaller in size than the input matrix.
If stride = 1, the output matrix size will depend only on the padding and the kernel size.
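The effect of padding and stride on the output size can be summarized by the usual formula O = floor((W - F + 2P) / S) + 1, which reduces to (input width - kernel width + 1) for the valid case above. The helper below is an illustrative assumption added for clarity, not taken from the thesis RTL.

```python
def output_width(W: int, F: int, P: int, S: int) -> int:
    # W: input width, F: kernel width, P: padding layers, S: stride
    return (W - F + 2 * P) // S + 1

print(output_width(28, 3, 0, 1))  # valid padding, stride 1 -> 26
print(output_width(28, 3, 1, 1))  # same padding, stride 1  -> 28
print(output_width(28, 2, 0, 2))  # 2x2 window, stride 2    -> 14 (pooling case)
```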
Figure 2.15: Sum pooling [6]
2.3.4 Fully Connected Layer
After processing the image, the image features will be pushed into the fully connected layer. This layer has the function of converting the feature matrix of the previous layer into a vector containing the probabilities of the objects that need to be predicted.
Figure 2.16: Fully Connected Layer
For example, in Figure 2.16, the fully connected layer converts the feature tensor of the previous layer into a 2-dimensional vector representing the probabilities of the 2 corresponding classes.
Finally, the process of training the Convolutional Neural Network model for the image classification problem is similar to training other models. We need an error function to measure how far the model's prediction is from the label, and we use a backpropagation algorithm for the weight update process.
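A minimal sketch of what the fully connected layer computes: the feature matrix is flattened into a vector, and each output class score is a weighted sum plus a bias. Sizes and names are illustrative only; in the real network the scores are then passed through Softmax (Section 2.3.7).

```python
def fully_connected(features_2d, weights, biases):
    # Flatten the feature matrix into a vector
    flat = [v for row in features_2d for v in row]
    # One weighted sum (plus bias) per output class
    return [sum(w * x for w, x in zip(w_row, flat)) + b
            for w_row, b in zip(weights, biases)]

features = [[0.2, 0.5],
            [0.1, 0.7]]                     # 2x2 feature map -> 4 values
weights  = [[0.3, -0.1, 0.8, 0.05],         # class 0
            [-0.2, 0.4, 0.1, 0.6]]          # class 1
print(fully_connected(features, weights, [0.0, 0.1]))  # 2 class scores
```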
2.3.5 Weight
Each neuron in a neural network computes an output value by applying a specific function to the input values coming from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning in a Neural Network progresses by making iterative adjustments to these biases and weights.
The vector of weights and the bias are called a filter and represent a specific feature of the input (e.g., a particular shape). A distinguishing feature of Neural Networks is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields sharing that filter, as opposed to each receptive field having its own bias and weight vector.
2.3.6 Cross-Entropy
The Cross-Entropy (CE) function is responsible for the backpropagation calculation of the softmax function.
This function is often used in classification when the output uses the Softmax activation function.
In addition, it can help make the output of the CNN clearer.
For example, if we have 2 outputs of 0.8 and 0.2, then after passing them through the CE function we get 1 and 0, respectively, similar to one-hot encoding a sequence of numbers in which the largest number becomes 1.
The formula of the function:
$CE = -\sum_{k} p(k)\log\big(q(k)\big)$ (2.4)
Derivative:
$\dfrac{\partial CE}{\partial q(k)} = -\dfrac{p(k)}{q(k)}$ (2.5)
Derivative of softmax combined with CE: $q(k) - p(k)$ (2.6)
In these formulas, q(k) is the result that the neural network calculates and p(k) is the expected (label) value.
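The sketch below illustrates equations (2.4) and (2.6): it computes the cross-entropy of a softmax output against a one-hot label and shows that the combined gradient with respect to the logits is simply q(k) - p(k). Function and variable names are illustrative only.

```python
import math

def softmax(z):
    m = max(z)                               # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(p, q, eps=1e-12):
    # CE = -sum_k p(k) * log(q(k)), equation (2.4)
    return -sum(pk * math.log(qk + eps) for pk, qk in zip(p, q))

logits = [2.0, 0.5, -1.0]
p = [1.0, 0.0, 0.0]                          # one-hot label
q = softmax(logits)
print(cross_entropy(p, q))
# Gradient of CE(softmax(logits)) with respect to the logits, equation (2.6): q(k) - p(k)
print([qk - pk for qk, pk in zip(q, p)])
```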
2.3.7 Activation Function
Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ (2.7)
The disadvantage of the Sigmoid function is that if the input is too large or too small, the function will saturate, making it impossible to update the parameters (Figure 2.17 shows the graph of the Sigmoid function).
Derivative of Sigmoid: $\sigma(z)' = \sigma(z) \cdot \big(1 - \sigma(z)\big)$ (2.8)
Figure 2.17: Graph of Sigmoid function
Tanh: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (2.9)
As the input gets larger, the function gives a value closer to 1; vice versa, the function gives a value close to -1.
The disadvantage of the tanh function is similar to that of the sigmoid function (Figure 2.18 shows the graph of the Tanh function).
Derivative of Tanh: $\tanh(x)' = 1 - \tanh^{2}(x)$ (2.10)
Figure 2.18: Graph of Tanh function
ReLU: $f(x) = \max(0, x)$ (2.11)
The ReLU function is very popular because it is simpler to calculate and does not saturate like the sigmoid and tanh functions.
The ReLU function has the effect of filtering out all values < 0.
The disadvantage of the ReLU function is that if too many values are less than 0, the output of the ReLU function becomes 0; this is the phenomenon of Dying ReLU (Figure 2.19 shows the graph of the ReLU function).
The softmax function is used to calculate the ratio of the inputs to each other and is widely used for calculating the final classification result of neural networks.
Usually, we use softmax together with the cross-entropy function so that it can be easily calculated in the chain rule (Figure 2.20 shows the graph of the Softmax function).
2.4 Backpropagation
2.4.1 Backpropagation in Fully Connected Block
Figure 2.21: Neural Network example
We have an example of a Neural Network in Figure 2.21.
Assuming the correct values of the outputs are y1 and y2, we have the errors:
$E_{o1} = \frac{1}{2}\big(y_1 - out(o1)\big)^2$ (2.17)
$E_{o2} = \frac{1}{2}\big(y_2 - out(o2)\big)^2$ (2.18)
$E_{total} = E_{o1} + E_{o2}$ (2.19)
The derivative of $E_{total}$ with respect to a weight w is the effect of that w on $E_{total}$, which is also the adjustment of that w. For the adjustment of w5 we have:
$\dfrac{\partial E_{total}}{\partial w5} = \dfrac{\partial E_{total}}{\partial out(o1)} \cdot \dfrac{\partial out(o1)}{\partial val(o1)} \cdot \dfrac{\partial val(o1)}{\partial w5}$ (2.20)
In which: $E_{total} = \frac{1}{2}\big(y_1 - out(o1)\big)^2 + \frac{1}{2}\big(y_2 - out(o2)\big)^2$, so
$\dfrac{\partial E_{total}}{\partial out(o1)} = -\big(y_1 - out(o1)\big)$ (2.21)
$\dfrac{\partial out(o1)}{\partial val(o1)} = f'\big(val(o1)\big)$, where f is the activation function (2.22)
$val(o1) = out(h1) \cdot w5 + out(h2) \cdot w6 + b2 \cdot 1$
$\dfrac{\partial val(o1)}{\partial w5} = out(h1)$ (2.23)
The new w5:
$new(w5) = old(w5) - learning\_rate \cdot \dfrac{\partial E_{total}}{\partial w5}$ (2.24)
Similarly for the other weights, we update each new w, and through many updates the weights will gradually converge.
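A tiny numerical sketch of the update rule in equation (2.24): the weight is repeatedly moved against its gradient until it converges. The error function used here is a simple quadratic chosen for illustration, not the network of Figure 2.21.

```python
def update_weight(w, grad, learning_rate=0.5):
    # Equation (2.24): new_w = old_w - learning_rate * dE/dw
    return w - learning_rate * grad

# Example: minimise E(w) = 0.5 * (target - w * x)^2 for one input/target pair
x, target = 1.5, 3.0
w = 0.0
for step in range(20):
    out = w * x
    grad = -(target - out) * x          # dE/dw for the quadratic error above
    w = update_weight(w, grad)
print(w, w * x)                          # w converges so that w * x is close to target
```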
2.4.2 Backpropagation in Convolutional Block
Similar to the fully connected block, we use the derivative to calculate the adjustment value. Figure 2.22 below is an example for the Convolutional block:
$\dfrac{\partial error(x)}{\partial w} = \dfrac{\partial error(x)}{\partial out(c)} \cdot \dfrac{\partial out(c)}{\partial val(c)} \cdot \dfrac{\partial val(c)}{\partial w}$ (2.25)
$val(c) = conv(image, w)$
$\dfrac{\partial val(c)}{\partial w} = conv\_full\big(val(c), image\big)$ (2.28)
Figure 2.23: Mask of Pooling
When pooling, we need to create a mask (Figure 2.23) to save the initial positions of the elements of (B) in (A) before pooling.
The purpose of creating the mask is to serve the reshape during the Backpropagation phase and to create the derivative of ReLU.
How to make ∂pool:
Figure 2.24: Making ∂pool
Step 1: resize the flattened error back to the pooled shape and expand (double) it (Figure 2.24).
Step 2: multiply the matrix just created in Step 1 element-wise with the mask created during the feed-forward pass; we then get the pooling derivative (Figure 2.24).
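The following behavioural sketch puts Figures 2.23 and 2.24 together: a 2x2 max pooling that records a mask of where each maximum came from, and a backward step that expands the pooled error and multiplies it by that mask. It is a pure-Python illustration, not the hardware implementation.

```python
def maxpool2x2_with_mask(A):
    # Forward pass: 2x2 max pooling with stride 2, recording the max positions
    h, w = len(A), len(A[0])
    out = [[0.0] * (w // 2) for _ in range(h // 2)]
    mask = [[0] * w for _ in range(h)]           # 1 where the max was taken
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = [(A[i + di][j + dj], i + di, j + dj)
                      for di in (0, 1) for dj in (0, 1)]
            val, mi, mj = max(window)
            out[i // 2][j // 2] = val
            mask[mi][mj] = 1
    return out, mask

def maxpool_backward(error, mask):
    # Step 1: expand each pooled error over its 2x2 window,
    # Step 2: multiply element-wise by the mask (the pooling derivative).
    h, w = len(mask), len(mask[0])
    return [[error[i // 2][j // 2] * mask[i][j] for j in range(w)]
            for i in range(h)]

A = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [3, 6, 4, 4]]
pooled, mask = maxpool2x2_with_mask(A)
d_pool = maxpool_backward([[0.1, 0.2], [0.3, 0.4]], mask)
print(pooled)   # [[4, 5], [6, 7]]
print(d_pool)   # error routed back to the positions of the maxima
```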
2.5 VGG-16 Structure
VGG-16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" [4]. The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset of over 14 million images belonging to 1000 classes.
Figure 2.25: VGG-16 Structure
The VGG-16 structure (Figure 2.25) has 16 layers: 13 layers of convolutional block and
3 layers of fully connected block
In the convolutional blocks, it uses convolve with padding n = 1 to get an output size equal to the input size.
2.6 MNIST Dataset
The MNIST [7] database (Figure 2.26) is a large database of handwritten digits commonly used for training various image processing systems. This database is also widely used for training and testing in the field of machine learning. The database was created by "remixing" samples from the original NIST dataset. The database creators felt that, because the NIST training dataset was obtained from the US Census Bureau while the test dataset was obtained from US high school students, it was not well suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit 28x28 pixels.
Figure 2.26: Sample digits from the MNIST dataset [7]
Chapter 3 SYSTEM ARCHITECTURE DESIGN AND IMPLEMENTATION
3.1 CNN Datapath IP architecture
The CNN Datapath IP contains two main operations. One is the Feed-forward, which is used in both the testing mode and the learning mode. The other one is the Backpropagation, for the learning mode only. While in the testing mode, the CNN Datapath only performs the Feed-forward operation and the Backpropagation is off. In the learning mode, both the Feed-forward and the Backpropagation will be executed.
*Table 3.1: CNN Datapath IP IO information.
Name | No bits | Direction | Description
clk | 1 | Input | System clock
rst_n | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset
rst_backprop | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset except the parameter memory
en_update | 1 | Input | HIGH: Learning mode. LOW: Testing mode
pixel_in | 8 | Input | Image pixel input
valid_in | 1 | Input | Synchronous signal, indicates the pixel_in is valid
kernel_in | 4*32 | Input | Initial kernel input
weight_in | 10*32 | Input | Initial weight input
… | … | Input | Load initial kernels into the Parameters memory
predict | 4 | Output | The prediction output
valid_out_forward | 1 | Output | Indicates the predict signal is valid
… | … | Output | Indicates a learning process has been completed
3.1.1 Feed-forward
3.1.1.1 Convolve module
The Convolve module (Figure 3.3) is designed with a 3x3 kernel size for the convolve algorithm. The architecture uses Line-buffers to perform the sliding-window output. The edge detection in this module is in charge of detecting the end of an image line or frame.
By using two line buffers, the delay from the input to the output of the line buffer takes only WIDTH*2+2 cycles before the pixel comes out. The MAC contains multiple fully pipelined floating-point multipliers and a floating-point adder tree. The floating-point multiplication is optimized and takes only 6 cycles for the calculation, compared to 47 cycles in [1]. However, the delay of the floating-point adder takes more cycles than [1], namely 7 cycles due to performance optimization, while [1] only took 4 cycles. In total, the delay of the convolve is much shorter than [1]: it takes only 93 cycles for a 28x28 image before the first pixel output comes out, while [1] took 121 cycles.
Figure 3.3: Convolve module block diagram
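To clarify the line-buffer idea, the following simplified software model (an assumption for illustration, not the actual RTL) shows how two buffered rows plus the incoming row are enough to form every 3x3 window of a streamed image.

```python
def sliding_3x3_windows(pixels, width):
    # Split the flat pixel stream into image rows
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    prev2, prev1 = None, None               # the two line buffers (previous rows)
    windows = []
    for row in rows:
        if prev2 is not None:               # need two full previous rows
            for x in range(2, width):
                # The 3x3 window is completed by the newest pixel of the current row
                windows.append([prev2[x - 2:x + 1],
                                prev1[x - 2:x + 1],
                                row[x - 2:x + 1]])
        prev2, prev1 = prev1, row           # shift the line buffers
    return windows

image = list(range(28 * 28))                # a 28x28 test image as a flat stream
wins = sliding_3x3_windows(image, 28)
print(len(wins))                            # 26 * 26 = 676 windows for valid convolve
```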
*Table 3.2: Convolve module IO information
Name | No bits | Direction | Description
clk | 1 | Input | System clock
rst_n | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset
data_in | 32 | Input | Data input
valid_in | 1 | Input | Synchronous signal, indicates the data_in is valid
data_out | 32 | Output | The convolve output
valid_out | 1 | Output | Indicates the convolved output is valid
3.1.1.2 ReLU module:
The ReLU module (Figure 3.4) is designed by using only a MUX. The output will be zero if the input is a negative number, and equal to the input if it is a positive number. The valid_out in the ReLU module is bypassed from valid_in without delay.
Figure 3.4: ReLU module block diagram
*Table 3.3: ReLU module IO information
Name | No bits | Direction | Description
data_in | 32 | Input | Data input
data_out | 32 | Output | The ReLU output
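Because the data is a 32-bit floating-point word, the MUX described above only needs to look at the sign bit. The sketch below models that behaviour in Python, assuming the standard IEEE-754 single-precision format; it is an illustration, not the RTL.

```python
import struct

def relu_float32_word(word: int) -> int:
    # word is the 32-bit pattern of the input; bit 31 is the sign bit.
    # If the sign bit is set (negative number), the MUX selects the constant 0.
    return 0 if (word >> 31) & 1 else word

def float_to_word(x: float) -> int:
    return struct.unpack('<I', struct.pack('<f', x))[0]

def word_to_float(word: int) -> float:
    return struct.unpack('<f', struct.pack('<I', word))[0]

for x in (3.5, -2.25, 0.0):
    y = word_to_float(relu_float32_word(float_to_word(x)))
    print(x, '->', y)    # 3.5 -> 3.5, -2.25 -> 0.0, 0.0 -> 0.0
```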
3.1.1.3 Max Pooling module:
The Max Pooling module (Figure 3.5) is designed by using only one line buffer for the 2x2 sliding window. The stride detection performs the stride by removing the unused values in the Max Pooling operation and is designed with multiple counters. The main operation of Pooling is the Comparator, which compares 4 inputs and bypasses the highest value to the output.
The Max Pooling module has more latency than [1]. In our work, since we made more pipeline stages in the comparator than [1] to improve performance, our comparator takes 3 cycles while [1] only took 2 cycles. The single line buffer has the same delay (WIDTH+1) as [1].
Figure 3.5: Max Pooling module block diagram
*Table 3.4: Max Pooling module IO information
Name | No bits | Direction | Description
clk | 1 | Input | System clock
rst_n | 1 | Input | Asynchronous reset signal. When LOW, the IP will be reset
data_in | 32 | Input | Data input