
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

COMPUTER ENGINEERING DEPARTMENT

PHAN TUAN THANH

VŨ HOANG HY

GRADUATION THESIS

THE RESEARCH & IMPLEMENTATION OF CNN ALGORITHM (YOLO) ON ZEDBOARD ZYNQ 7000

NGHIEN CUU VA THUC HIEN THUAT TOAN CNN (YOLO)

TREN ZEDBOARD ZYNQ 7000

ENGINEER OF COMPUTER ENGINEERING

HO CHI MINH CITY, 2021


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

COMPUTER ENGINEERING DEPARTMENT

PHAN TUAN THANH - 16521807

VŨ HOANG HY - 16520545

GRADUATION THESIS

THE RESEARCH & IMPLEMENTATION OF CNN ALGORITHM (YOLO) ON ZEDBOARD ZYNQ 7000

NGHIEN CUU VA THUC HIEN THUAT TOAN CNN (YOLO)

TREN ZEDBOARD ZYNQ 7000

ENGINEER OF COMPUTER ENGINEERING

INSTRUCTOR

PhD NGUYEN MINH SON

HO CHI MINH CITY, 2021

LIST OF THE THESIS DEFENSE COMMITTEE

The graduation thesis grading committee was established under Decision No. 70/QD-DHCNTT, dated January 27, 2021, of the Rector of the University of Information Technology.

ACKNOWLEDGEMENTS

To complete this graduation thesis, we would like to send our sincere thanks to the teachers of the Computer Engineering Department and the University of Information Technology - Vietnam National University Ho Chi Minh City, who have taught us knowledge and imparted invaluable experience throughout the past learning journey.

Especially, we sincerely thank Mr. Nguyen Minh Son, who helped us and took the time to guide and instruct us throughout the thesis so that we could complete it.

Thanks to kith and kin who helped us find information during the time of this thesis. Thank you for accompanying us through the past 5 years of school.

Once again, we would like to sincerely thank everyone for their time and effort helping us during the graduation thesis process. We apologize to everyone because mistakes and faults are unavoidable during the thesis; we hope our teachers and you can overlook and forgive them.

TABLE OF CONTENTS

Chapter 1 OVERVIEW
1.1 Overseas Situation
1.2 Situation in the Country
1.3 Project's Goal
Chapter 2 THEORETICAL BASIS
2.1 Neural Network
2.1.1 Neural Layer
2.1.2 Nonlinear Layer
2.1.3 Pooling Layer
2.1.4 Fully Connected Layer
2.1.5 Weight
2.2 YOLOv1
2.2.1 Grid System
2.2.2 Neural Network for YOLOv1
2.2.3 Loss Function
2.2.4 Intersection Over Union
2.2.5 Disadvantages of YOLOv1
2.3 YOLOv2
2.3.1 Batch Normalization
2.3.2 High Resolution Classifier
2.3.3 Use Anchor Box Architecture to Make Predictions
2.3.4 K-mean Clustering for Anchor Selection
2.3.5 Direct Location Prediction
2.3.6 Add Fine-grained Features
2.3.7 Multi-Scale Training
2.3.8 Light-weight Backbone
2.3.9 YOLO9000
2.4 Vivado & Petalinux
2.4.1 Vivado HLS
2.4.2 Petalinux + Xilinx SDK
2.5 Training Before Recognizing
2.6 Field Programmable Gate Array (FPGA)
2.7 Overview of ZedBoard
2.8 Advanced Extensible Interface (AXI) Bus
2.9 Development Tool and Overall Architecture
2.9.1 Development Tool
2.9.2 Overall Architecture
2.9.3 Neuron Network Flow
2.9.4 Execution Time
2.9.5 Accelerator Overall Architecture
2.9.6 How It Works in General
Chapter 3 SYSTEM DESIGN AND IMPLEMENTATION
3.1 Input and Output Module
3.2 Neuron Module
3.3 Pooling Module
3.4 Reordering Module
3.5 Implementation of the Recognition Process
3.5.1 Implementation on Vivado
3.5.2 Implementation on Zedboard
3.5.2.1 Verified on Some Preliminary Scenarios
3.5.2.2 Verification and Accuracy
Chapter 4 CONCLUSION AND DEVELOPMENT DIRECTIONS
4.1 The Status of the Entire Project
4.1.1 Expected Result
4.1.2 Actual Result
4.2 System Resources
4.2.1 Synthesis
4.2.1.1 YOLO2_FPGA
4.2.1.2 Write_back_output
4.2.1.3 Weight_mmcpy_everyKxk
4.2.1.4 Weight_load_reorg
4.2.1.5 Reorg_yolo
4.2.1.6 Pool_yolo
4.2.1.7 Outputpixel2buf
4.2.1.8 Mmcpy_outputport1
4.2.1.9 Mmcpy_outputport
4.2.1.10 Mmcpy_outputpixel
4.2.1.11 Mmcpy_inputport
4.2.1.12 Mmcpy_inputport1
4.2.1.13 Mmcpy_inputport2
4.2.1.14 Mmcpy_inputport3
4.2.1.15 Copy_input2buf_row
4.2.1.16 Copy_input_weight
4.2.1.17 Compute4
4.2.2 Generate Bitstream
4.3 What We Gained, Limitations and Directions of Development
4.3.1 What We Gained
4.3.2 Limitations
4.3.3 Direction of Development

LIST OF FIGURES

Figure 1: The common Neural Network stream to process the input image and classify objects based on value
Figure 2: Neural calculation process
Figure 3: A 5x5 filter is used to detect the angle / edge
Figure 4: ReLU process
Figure 5: How YOLOv1 predicts
Figure 6: YOLOv1 architecture
Figure 7: Frames per second of each algorithm
Figure 8: Example of the only object that the square contains
Figure 9: Each square is responsible for predicting 2 boundary boxes
Figure 10: A vector has 2 boxes and 3 layers for each square
Figure 11: Neural Network model for YOLOv1
Figure 12: IOU example
Figure 13: NN output (i, j) maps to image (i, j)
Figure 14: Down sample
Figure 15: Anchor box refining the anchor box position and size
Figure 16: Predict the bounding box in YOLOv2
Figure 17: YOLOv2 Architecture
Figure 18: The Reorg technique in YOLOv2
Figure 19: Darknet-19
Figure 20: WordTree-YOLO9000
Figure 21: Vivado Design HLS flow
Figure 22: Config to make Zedboard use boot file on SD card
Figure 23: Config to attach bootloader to BOOT.BIN
Figure 24: Config to access the kernel file from SD card
Figure 25: General FPGA architecture
Figure 26: Structure of Configurable Logic Block
Figure 27: Functional Overlay of ZedBoard
Figure 28: Block Diagram of ZedBoard
Figure 29: Interface and interconnect
Figure 30: Channel in AXI interface
Figure 31: Overall Architecture
Figure 32: Execution Schedule
Figure 33: Overall architecture of accelerator
Figure 34: Single channel data transmission
Figure 35: Neuron module (Tn = 4 and Tm = 2)
Figure 36: Pooling module
Figure 37: Reorg module
Figure 38: Overall Architecture implemented on ZedBoard
Figure 39: Block design on Vivado
Figure 40: Result of simulation with close object
Figure 41: Output image of close object
Figure 42: Result of simulation with far object
Figure 43: Output image of far object
Figure 44: Aeroplane detection
Figure 45: Bicycle detection
Figure 46: Bus detection
Figure 47: Car detection
Figure 48: Fire hydrant detection
Figure 49: Giraffe and zebra detection
Figure 50: Result of processing on Zedboard
Figure 51: Detail of project in Vivado HLS
Figure 52: Result of running bitstream
Figure 53: Power Resources after generating bitstream

LIST OF TABLES

Table 1: Compare small filter and large filter
Table 2: Batch Normalization Transform, applied to activation x over a mini-batch
Table 3: Neuron Network Flow
Table 4: Compare result between software Darknet and our implementation
Table 5: Timing (ns) of YOLO2_FPGA module
Table 6: Utilization estimates of YOLO2_FPGA module
Table 7: Timing (ns) of Write_back_output module
Table 8: Utilization estimates of Write_back_output module
Table 9: Timing (ns) of Weight_mmcpy_everyKxk module
Table 10: Utilization estimates of Weight_mmcpy_everyKxk module
Table 11: Timing (ns) of Weight_load_reorg module
Table 12: Utilization estimates of Weight_load_reorg module
Table 13: Timing (ns) of Reorg_yolo module
Table 14: Utilization estimates of Reorg_yolo module
Table 15: Timing (ns) of Pool_yolo module
Table 16: Utilization estimates of Pool_yolo module
Table 17: Timing (ns) of Outputpixel2buf module
Table 18: Utilization estimates of Outputpixel2buf module
Table 19: Timing (ns) of Mmcpy_outputport1 module
Table 20: Utilization estimates of Mmcpy_outputport1 module
Table 21: Timing (ns) of Mmcpy_outputport module
Timing (ns) of Copy_input2buf_row module
Utilization estimates of Copy_input2buf_row module
Timing (ns) of Copy_input_weight module
Utilization estimates of Copy_input_weight module
Timing (ns) of Compute4 module
Utilization estimates of Compute4 module

LIST OF ACRONYMS

ASIC: Application Specific Integrated Circuit

AXI: Advanced Extensible Interface

BN: Batch Normalization

BRAM: Block Random Access Memory

CLB: Configurable Logic Block

COCO: Common Objects in Context

DDR: Double Data Rate

DMA: Direct Memory Access

DSP: Digital Signal Processing

DSP48E: Digital Signal Processing Logic Element

FF: Flip-Flop

FIFO: First in First Out

FPGA: Field Programmable Gate Array

GPU: Graphics Processing Unit

HDF: Hardware Definition File

HDL: Hardware Description language

HLS: High-level Synthesis

IOU: Intersection Over Union

IP: Intellectual Property

ISE: Integrated Synthesis Environment

LUT: Look-Up Table

mAP: mean average precision

NNW: Neural Network

PL: Programmable Logic

PS: Processing System

RAM: Random Access Memory

RCNN: Region-Based Convolutional Neural Networks

ReLU: Rectified Linear Units

RTL: Register-Transfer Level

SD: Secure Digital

SRAM: Static Random-Access Memory

YOLO: You Only Look Once

SUMMARY OF THESIS

In recent years, Deep Learning models have attracted the interest of many scientists for research, notably the Neural Network (NNW) model as a good candidate to solve problems such as object recognition.

Many practical applications demand low latency in software or a lot of resources in hardware. In response to this problem, optimization strategies such as implementation on FPGA are adopted.

We plan the strategy, design, and implement on FPGA a Neural Network accelerator architecture which uses the You Only Look Once version 2 (YOLOv2) detection algorithm.

Chapter 1 OVERVIEW

1.1 Overseas situation

Overseas, many countries lead in microchip development, such as the United States, India, Germany, China, and Japan. Therefore, documents about microchips are easy to find on the Internet, and India is one of the countries with the largest amount of online documentation about microchips.

There are many projects, events or reports about object detection such as:

At the Association for the Advancement of Artificial Intelligence (AAAI) Award 2020, Geoffrey Hinton (Google, The Vector Institute, and University of Toronto) brought a new approach to object recognition for neural networks. As of late 2017, China has deployed facial recognition and artificial intelligence technology in Xinjiang.¹

The project of WalkerLau about accelerating Neural Networks with FPGA on GitHub. In this project, they compare the speed of CPU-only and CPU + FPGA setups by executing face recognition with 7 neural layers. The result is that the CPU + FPGA acceleration system works 45x-75x faster.²

The project of Mohamed Atri, Fethi Smach, Johel Miteran and Mohamed Abid about the Design of a Neural Networks Classifier for Face Detection. In this topic, they apply the Multi-Layer Perceptron concept. The result of that project is 99.67% accuracy on 500 sample images using a Xilinx FPGA.³

1.2 Situation in the country

In Viet Nam, we have many limits on resources, funding, and technology, which force skilled engineers to study abroad to reach high quality. Most companies working on microchips also stop at packaging circuits or acting as intermediaries for large companies outside Viet Nam. So, most research topics are published by foreign companies.

1 The 16th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.

2 Acceleration of CNN computation with FPGA, from WalkerLau.

3 Neural Networks Classifier for Face Detection.

The research and implementation of the Neural Network YOLO algorithm on FPGA is difficult because we could not find any documents or projects made by people in Viet Nam.

1.3 Project’s goal

There are three main goals in our project:

First is the research of YOLO, based on the Neural Network algorithm. This is one of the advanced Deep Learning models; it helps us build intelligent systems with high accuracy today. So, our group needs to learn and understand this algorithm in order to convert and implement it on the Zedboard.

Second is implementing the YOLO algorithm on the board, then calculating speed and accuracy of the completed circuit with a small dataset. Most previous evaluations were done in software; this is one of the first times this method is implemented down to hardware. Making the circuit execute with good performance is not easy, so we will try to make it run as smoothly as we can.

The third goal is training a bigger dataset for detection. Our group needs to collect images of people and animals like dogs and cats to label. After that we will generate the collected data into a weights file and put it into our circuit.

Additionally, if we have enough time, we will make a Camera IP to run with our circuit to detect objects in real time, because real-time detection is useful in modern life.

Chapter 2 THEORETICAL BASIS

2.1 Neural Network

In deep learning, the Neural Network is one of the most popular and most influential deep learning models in the Computer Vision community. Neural Networks are used in many problems such as image recognition, video analysis, or articles in the field of natural language processing.⁴

The original architecture of the Neural Network model was introduced by a Japanese computer scientist in 1980.⁵ Then, in 1998, Yann LeCun first trained the Neural Network model with the backpropagation algorithm for the handwriting recognition problem.⁶ However, it was not until 2012, when Ukrainian computer scientist Alex Krizhevsky (advised by Geoffrey Hinton) built a Neural Network model and used a GPU to speed up training deep nets, that it reached the top.⁷

In Figure 1, the Neural Network includes a set of basic layers: neural layers with nonlinear layers, pooling layers, and a fully connected layer. Normally, an image is propagated through the first neural layer with a nonlinear layer, then the calculated values propagate through a pooling layer. The neural layer, nonlinear layer, and pooling layer can be repeated many times in the network. Then, data is propagated through a fully connected layer and softmax to calculate the probability of what the image contains.

4 Wiki: Convolutional Neural Network.


Figure 1: The common Neural Network stream to process the input image and

classify objects based on value

2.1.1 Neural Layer

The neural layer is the first layer to extract features from the input image. It maintains relationships between pixels by exploring image features using small squares of the input data. It is a mathematical operation with 2 inputs: an image matrix and a filter or kernel.⁸ See Figure 2; we have the recipe:

• An image matrix (volume) of dimension (h × w × d)
• A filter of dimension (f_h × f_w × d)
• An output volume of dimension (h − f_h + 1) × (w − f_w + 1) × 1

Figure 2: Neural calculation process
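To make the dimension recipe above concrete, here is a minimal NumPy sketch of a single-channel (d = 1), stride-1, no-padding convolution. It is illustrative, not thesis code; the function and array names are assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide an (fh x fw) filter over an (h x w) image with stride 1.

    Output shape is (h - fh + 1) x (w - fw + 1), matching the
    recipe above for a single channel with no padding."""
    h, w = image.shape
    fh, fw = kernel.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise multiply the window by the filter and sum.
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)
    return out

image = np.random.rand(28, 28)
edge_filter = np.array([[1, 0, -1]] * 3, dtype=float)  # a simple 3x3 edge detector
print(conv2d_valid(image, edge_filter).shape)  # (26, 26) = (28 - 3 + 1, 28 - 3 + 1)
```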

The neural layer has the main function of detecting the specifics of an image. These features include an underlying feature, an angle, an edge, or a more complex feature like an image's texture. Since the filter scans the entire image, these features can be anywhere in the image; even if the image is rotated left or right, these features will still be detected.⁹

8 The Definition of Convolution in Deep Learning by using Matrix.


Figure 3: A 5x5 filter is used to detect the angle / edge

We have an example of a filter: in Figure 3, the values 1 connect with one another to denote a feature which we want to detect.

The filter size is one of the most important parameters of the neural layer. This size is proportional to the number of parameters to learn at each neural layer and determines the receptive field of this layer. The most common filter size is 3x3.

Table 1: Compare small filter and large filter

Small filter:
• Sees only a small area of pixels at a time
• Extraction is characterized by high localization
• Smaller characteristics are detected
• Features extracted are more diverse and useful in later layers
• Reduces image size more slowly, thus allowing a deeper network
• Fewer weights, better weight sharing

Large filter:
• Receives a large receptive field
• Features are more general
• Captures the basic parts of the photo
• Less diverse features are extracted
• Reduces image size quickly, thus allowing only a shallow network
• Weight sharing is less meaningful

9 Object Detection with Deep Learning: A Review.

2.1.2 Nonlinear Layer

Rectified Linear Units, f = max(0, x), is the most common activation function for NNs, introduced by Geoffrey E. Hinton in 2010.¹⁰ The ReLU function is favored for its simple computation, which helps limit vanishing gradients and gives better results. ReLU is located right after the neural layer. ReLU sets negative values to 0 and keeps the input value when it is greater than 0. In Figure 4, all negative values are set to 0 after the ReLU process.

2.1.3 Pooling Layer

The pooling layer reduces the number of parameters when the image is too large. Pooling, also known as sub-sampling or down-sampling, reduces the size of each map while retaining important information. The pooling layer can have many different types:

• Max pooling
• Average pooling
• Sum pooling¹¹

Max pooling takes the largest element from each region of the input tensor, average pooling takes the mean of the elements, and taking the sum of all elements in the region is called sum pooling.

10 Rectified Linear Units Improve Restricted Boltzmann Machines.
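As an illustration of how the three pooling types differ only in the reduction applied to each window, here is a small NumPy sketch (not thesis code; names and shapes are assumptions):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: the stride equals the window size.

    An (h x w) map becomes (h // size) x (w // size); h and w are
    assumed to be multiples of `size`."""
    h, w = x.shape
    windows = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    return windows.sum(axis=(1, 3))  # sum pooling

x = np.arange(16.0).reshape(4, 4)
print(pool2d(x, 2, "max"))      # largest element of each 2x2 window
print(pool2d(x, 2, "average"))  # mean of each 2x2 window
```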

2.1.4 Fully Connected Layer

The last layer of the Neural Network model in the image classification problem is the fully connected layer. This layer has the function of converting the feature matrix of the previous layer into a vector containing the probabilities of the objects that need to be predicted.

For example, in the handwritten MNIST digit classification problem with 10 classes corresponding to the 10 digits 0-9, the fully connected layer converts the characteristic tensor of the previous layer into a 10-dimensional vector representing the probabilities of the 10 classes.

And finally, the process of training the Neural Network model for the image classification problem is similar to training other models. We need an error function to measure the difference between the model's prediction and the label, and we use the backpropagation algorithm for the weight update process.
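For instance, the fully connected layer plus softmax described above can be sketched as follows (illustrative shapes and names, not thesis code; the 10 classes mirror the MNIST example):

```python
import numpy as np

def fully_connected(features, weights, bias):
    """Flatten the feature tensor and apply one affine map."""
    return weights @ features.ravel() + bias

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

features = np.random.rand(7, 7, 13)        # feature tensor from earlier layers
weights = np.random.rand(10, 7 * 7 * 13)   # 10 classes (digits 0-9)
bias = np.zeros(10)
probs = softmax(fully_connected(features, weights, bias))
print(probs.shape, probs.sum())            # (10,) 1.0
```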

2.1.5 Weight

Each neuron in a neural network computes an output value by applying a specific function to the input values coming from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers).¹² Learning, in a Neural Network, progresses by making iterative adjustments to these biases and weights.

The vector of weights and the bias are called a filter and represent a specific feature of the input (e.g., a particular shape). A distinguishing feature of Neural Networks is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields sharing that filter, as opposed to each receptive field having its own bias and vector of weights.

11 RunPool: A Dynamic Pooling Layer for Convolutional Neural Networks.

12 Effect of bias in Neural Network.

2.2 YOLOv1

YOLO (You Only Look Once) is one of the best and fastest methods. It combines neural layers and connected layers: the neural layers extract features of the image, and the connected layers predict the probability and object coordinates.

YOLO has many applications in real life, such as self-driving vehicle tracking systems and people tracking systems.

The model input is an image; for object detection, we not only have to classify the objects in the image, but also locate them. Object detection has a number of applications, such as people tracking systems, which can help the government identify whether criminals are hiding somewhere, or self-driving car systems, which must identify where a pedestrian is and then decide how to move on.

Figure 5: How YOLOv1 predicts¹³

13 You Only Look Once: Unified, Real-Time Object Detection.

In Figure 5, YOLOv1 splits the image into S×S cells; after that, it detects all blocks which contain objects and classifies each cell into color groups: blue is dog, yellow is bicycle, and pink is car.

Figure 6 shows the idea of YOLOv1: divide the image into a grid of cells of size S×S (7×7 by default). For each grid cell, the model provides a prediction for B bounding boxes. For each box among these B bounding boxes, there are 5 parameters: x, y, w, h, confidence; respectively, the coordinates of the center (x, y), the width, the height, and the confidence of the prediction. For each grid cell in the S×S grid, the model also predicts the probability of falling on each class.

So, the total output of the model will be S × S × (5 × B + C). For example, with S = 7, B = 2, and C = 20 classes, the output is 7 × 7 × 30. YOLOv1's backbone network is inspired by the GoogLeNet architecture.

The advantages of YOLOv1:

• High FPS for real-time processing.
• YOLO detects objects on each cell in the image, which makes the algorithm faster than other algorithms.
• There are lots of projects related to the algorithm for research, written by professors and engineers.

Figure 7: Frames per second of each algorithm (Faster R-CNN, R-FCN, SSD, and YOLO, at low and high resolution)

2.2.1 Grid System

The image is divided into a 7x7 grid of squares, each containing a set of information that the model must predict.

Each square contains at most one object. The square in which the center of the object lies is the one that contains that object. For example, in Figure 8, the girl's center is in the green square, so the model must predict that that square's label is the girl. Note that even if part of the girl's photo is in another square, if the center does not belong to that square, it does not count as containing the girl's label. However, we can increase the grid size from 7x7 to a larger size so we can detect more objects. Also, the size of the input image must be a multiple of the grid size.

Figure 8: Example of the only object that the square contains.

Each square is responsible for predicting 2 boundary boxes of the object. Each boundary box's data contains whether it holds an object and the location information of the boundary box: the object's center within the boundary box, and the width and length of the boundary box. In Figure 9, the green square needs to predict 2 boundary boxes containing the girl, as shown below. One thing to note: at training time, we do not predict pixel values; we normalize the image size to the segment [0, 1] and predict the deviation of the center of the object within the box containing that object. For example, instead of predicting the pixel position of the red point, we need to predict the deviations a, b within the square containing the center of the object.

Figure 9: Each square is responsible for predicting 2 boundary boxes

In conclusion, for each square, the prediction consists of the following information:

• Does the square contain any object or not?
• The deviation of the 2 boxes containing the object from the current square.
• The class of that object.

For each square, we need to predict a vector with (Nbox + 4 × Nbox + Nclass) dimensions. For example, see Figure 10: if we need to predict 2 boxes and there are 3 classes for each square, each square needs a 13-dimensional vector (2 + 4 × 2 + 3), so the grid is a 7×7×13 tensor containing all the necessary information.

Figure 10: A vector has 2 boxes and 3 layers for each square

2.2.2 Neural Network for YOLOv1

The next important thing is to build a Neural Network model that produces an output with the appropriate shape. For example, with a grid size of 7×7, where each square predicts 2 boxes and there are 3 types of objects in all, we need to output a 7×7×13 shape from the Neural Network model.

Figure 11: Neural Network model for YOLOv1¹⁴

YOLOv1 uses linear regression to predict the information in each square. Therefore, the last layer does not use any activation function at all. In Figure 11, with an input image of 448×448, a Neural Network model with 6 max pooling layers of size 2×2 will reduce the image size by 64 times, to 7×7 at the output. Instead of using a fully connected layer as the last layer, the Neural Network can replace it with a 1×1 neural layer with 13 feature maps to easily output the 7×7×13 shape.

14 Low cost object detection using YOLO.

2.2.3 Loss Function

YOLO uses a squared loss between prediction and label to calculate the loss for the model. It equals the total of 3 losses:

• Loss of predicting the label of the object (classification loss)
• Loss of predicting the position, width and height of the boundary box (localization loss)
• Loss of whether the cell contains an object or not (confidence loss)

$$L_{total} = L_{classification} + L_{localization} + L_{confidence}$$

On a training image, for each cell that has an object, the model will increase the classification score, then find which of the 2 predicted boxes overlaps the labeled boundary box best. Next, it increases the localization score of that boundary box and pulls the information of this box toward the label. For a cell that does not have an object, the confidence score will decrease, and we do not use the classification score and localization score.

1. Classification Loss:

$$L_{classification} = \sum_{i=0}^{S^2} \mathbb{1}_i^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2$$

$\mathbb{1}_i^{obj}$: equals 1 if the cell has an object, 0 otherwise.

$\hat{p}_i(c)$: the conditional probability of class c in the corresponding square predicted by the model.

2. Localization Loss:

It is used for calculating the error of the predicted boundary box, including the x, y offsets and the width and height. Only boxes which have an object are counted.

Localization loss is calculated as the total squared loss of offset x, offset y, and the width and height of cells which have an object. For each cell, we choose the 1 boundary box with the best IOU (Intersection over Union) and calculate the error of that boundary box:

$$L_{localization} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right]$$

3. Confidence Loss:

It shows the error between the predicted boundary box and the actual label. Confidence loss is calculated on both cells which have an object and cells which do not:

$$L_{confidence} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2$$
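Putting the three terms together, the arithmetic for a single cell that contains an object can be sketched as follows. This is a toy example with made-up numbers, not thesis code; it assumes the responsible box has already been chosen by best IOU, and uses the weights lambda_coord = 5 and lambda_noobj = 0.5 from the YOLOv1 paper.

```python
import numpy as np

lambda_coord, lambda_noobj = 5.0, 0.5            # weights from the YOLOv1 paper

# One cell that contains an object; the responsible predicted box.
pred_box = np.array([0.52, 0.48, 0.30, 0.61])    # x, y, w, h (normalized)
true_box = np.array([0.50, 0.50, 0.28, 0.64])
pred_conf, true_conf = 0.8, 1.0                  # target confidence is the box IOU
pred_cls = np.array([0.7, 0.2, 0.1])             # per-class probabilities
true_cls = np.array([1.0, 0.0, 0.0])

loc = lambda_coord * (np.sum((pred_box[:2] - true_box[:2]) ** 2)
      + np.sum((np.sqrt(pred_box[2:]) - np.sqrt(true_box[2:])) ** 2))
conf = (pred_conf - true_conf) ** 2              # object present: no lambda_noobj
cls = np.sum((pred_cls - true_cls) ** 2)
print(loc + conf + cls)                          # this cell's contribution to L_total
```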

2.2.4 Intersection Over Union

IOU (Intersection Over Union) is a function that evaluates the accuracy of theobject detector on a specific dataset

We only keep boundary boxes that have an object. To calculate the IOU between 2 boxes, we need to calculate the area of their intersection and divide it by the area of their union.

Figure 12: IOU example (IoU 0.4034: poor; IoU 0.7330: good; IoU 0.9264: excellent)

Figure 12 shows the accuracy of the object detection. If IoU > 0.5, the prediction is assessed as good; otherwise it is not good. The calculation for IoU, in terms of the colored regions of Figure 12:

IOU = S-yellow / (S-red + S-yellow + S-violet)
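A direct translation of this computation into code (a sketch, not from the thesis, assuming boxes are given as corner coordinates):

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle (the yellow region in Figure 12).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Union = both areas minus the double-counted intersection.
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143: poor
```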

2.2.5 Disadvantages of YOLOv1

YOLOv1 imposes spatial constraints on bounding boxes: each grid cell can predict only a small number B of bounding boxes and only one class. These constraints limit our ability to recognize objects that are close together, as well as small objects.

During training, the loss function does not evaluate separately the error of a small bounding box versus the error of a large bounding box. Treating them as the same type and aggregating them affects the overall accuracy of the network. Small errors on large boxes generally do little harm, but small errors on very small boxes especially affect the IOU value.

2.3 YOLOv2

YOLOv2, named YOLO9000, was announced by Joseph Redmon and Ali Farhadi in late 2016. The main improvements of this version are that it is better, faster, and more advanced, to catch up with Faster R-CNN (a method using a Region Proposal Network), and it handles the problems encountered by YOLOv1.¹⁵

2.3.1 Batch Normalization

The Batch Normalization technique was introduced after all of YOLOv2's neural layers. This technique not only reduces training time, but also increases the generalization of the network. In YOLOv2, Batch Normalization increased mAP by about 2%. The network also does not need additional Dropout to increase generalization.

Table 2: Batch Normalization Transform, applied to activation x over a mini-batch¹⁶

Input: values of x over a mini-batch B = {x_1, ..., x_m}; parameters to be learned: γ, β

$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$ (mini-batch mean)

$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$ (mini-batch variance)

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (normalize)

$y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i)$ (scale and shift)

15 YOLO9000: Better, Faster, Stronger.

16 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

Batch Normalization is an effective method when training a neural network model. The goal of this approach is standardizing the features (the output of each layer after passing through the activation) to a zero-mean state with unit standard deviation.

Batch Normalization can help prevent the value x from falling into saturation after passing through nonlinear activations, so it ensures that no activation is either too high or too low. Weights which, without BN, might never be able to learn can now learn normally. This helps YOLOv2 reduce the dependency on the parameters' initialization values.

2.3.2 High Resolution Classifier

YOLO is trained in 2 phases. The first phase trains a classifier network with a small 224×224 input, and the next phase removes the fully connected layer and uses this classifier to train the detection network. Note that small input images are often used to train classifier networks, which are then used as the pretrained model for the backbone portion of detection networks. In the next phase, YOLO first finetunes the network with the larger 448×448 input image, letting the network gradually get used to the large input size, then uses this result to train the detection process. This increases the mAP of YOLOv2 by 4%.

2.3.3 Use Anchor Box Architecture to Make Predictions

In YOLOv2, the authors remove the fully connected layer in the middle of the network and use the anchor box architecture to predict the bounding boxes. Predicting offsets relative to the anchor box is much easier than predicting the bounding box coordinates directly. This change reduces the mAP a bit but increases the recall.

The position of an anchor box is determined by mapping the location of the network output back to the input image. As Figure 13 shows, the process is replicated for every network output. The result is a set of anchor boxes tiled across the entire image, with each anchor box making two predictions per location, as in the image below.

Figure 13: NN output (i, j) maps to image (i, j)

The distance, or stride, between the tiled anchor boxes is a function of the amount of down-sampling present in the NN. Down-sampling factors between 4 and 16 are common. These down-sampling factors produce coarsely tiled anchor boxes, which can lead to localization errors.

To fix localization errors, deep learning object detectors learn offsets to apply to each tiled anchor box, refining the anchor box position and size.


Figure 15: Anchor box refining the anchor box position and size

Multiscale processing enables the network to detect objects of varying size. To achieve multiscale detection, you must specify anchor boxes of varying size, such as 64-by-64, 128-by-128, and 256-by-256. Specify sizes that closely represent the scale and aspect ratio of objects in your training data.¹⁷

17 Anchor Boxes for Object Detection.
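As a sketch of how tiled anchor centers follow from the output grid and the down-sampling stride described above (illustrative values, not thesis code; each center would then carry anchor boxes of the several sizes just mentioned):

```python
def tile_anchor_centers(out_h, out_w, stride):
    """Map each network output location (i, j) back to input-image pixels."""
    return [((j + 0.5) * stride, (i + 0.5) * stride)
            for i in range(out_h) for j in range(out_w)]

# A 13x13 output on a 416x416 input implies a stride of 416 / 13 = 32.
centers = tile_anchor_centers(13, 13, 32)
print(len(centers), centers[0])  # 169 tiled locations, the first at (16.0, 16.0)
```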

2.3.4 K-mean Clustering for Anchor Selection

Instead of having to manually select the anchor boxes, YOLOv2 uses the k-means algorithm to come up with the best anchor box selection for the network. This produces a better mean IOU.

The goal of this clustering algorithm: from the input data and the number of clusters we want to find, indicate the center of each group and assign the data points to the corresponding groups.

Input: the data X and the number of clusters to find, K.

Output: the centers M and the label vector for each data point of Y.

• Step 1: Choose any K points as the initial centers.

• Step 2: Assign each data point to the cluster with the nearest center.

• Step 3: If the assignment of data to each cluster in step 2 does not change compared to the previous loop, stop the algorithm.

• Step 4: Update the center of each cluster by taking the average of all the data points assigned to that cluster after step 2.

• Step 5: Go back to step 2.

The algorithm will stop after a finite number of loops. The loss function is a positive number, and after every step 2 or step 4 its value decreases; a sequence of numbers that decreases and is bounded below converges. Furthermore, the number of possible cluster assignments for all the data is finite, so at some point the assignments will stop changing, and we can stop the algorithm there. A sketch of this procedure, specialized to anchor selection, follows below.
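The following is a minimal NumPy sketch of these steps specialized to anchor selection. It assumes, as the YOLO9000 paper describes, that the distance between a box and a center is 1 − IOU of their (width, height) pairs; the data here is synthetic and the names are illustrative.

```python
import numpy as np

def wh_iou(wh, centers):
    """IOU of (w, h) pairs as if the boxes shared the same center."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0] * wh[:, 1]
    union = union[:, None] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]       # step 1
    labels = None
    for _ in range(iters):
        new_labels = np.argmax(wh_iou(wh, centers), axis=1)   # step 2: highest IOU = nearest
        if labels is not None and np.array_equal(labels, new_labels):
            break                                             # step 3: assignments unchanged
        labels = new_labels
        for c in range(k):                                    # step 4: update centers
            members = wh[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers                                            # step 5 is the loop itself

wh = np.abs(np.random.default_rng(1).normal(100, 40, size=(500, 2)))  # box sizes (pixels)
print(kmeans_anchors(wh, k=5))
```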

2.3.5 Direct Location Prediction

YOLOv2 uses the sigmoid (σ) function to limit the value to between 0 and 1, thereby limiting bounding box predictions around the grid cell and making the model more stable during training.

See Figure 16: given the anchor box of size (p_w, p_h) on the grid cell with top-left position (c_x, c_y), the model predicts the offsets and scales t_x, t_y, t_w, t_h, and the bounding box (b_x, b_y, b_w, b_h). The confidence of the prediction is σ(t_o).


Figure 16: Predict the bounding box in YOLOv2

YOLOv2 added 5% mAP when applying this method.

2.3.6 Add Fine-grained Features

YOLOv2 uses a 13×13 feature map to make predictions, larger than the 7×7 of YOLOv1. YOLOv2 also combines features at different layers to make predictions; specifically, the original architecture of YOLOv2 combines the 26×26 feature map taken from near the end of the network with the 13×13 feature map at the end. These feature maps are concatenated to form a block used for prediction.

See Figure 17 for the YOLOv2 architecture. After 4 neural layers, there is a Reorg layer to decrease the feature maps of the image; it can skip 2 neural layers to reduce execution time. Figure 18 shows that the matrix is split into 4 groups.

Figure 17: YOLOv2 Architecture¹⁸

Reorg is a memory reorganization technique that turns the 26×26 feature map into a 13×13 one with greater depth, so that it can be concatenated with the 13×13 feature map at the end.

In the general case of the Reorg, the Neural Network turns a feature map of size [N, C, H, W] into size [N, 4C, H/2, W/2]. The number of parameters in the feature map remains the same: when we want to reduce the width and height by 2 times, the number of channels must be increased by 4 times. This transformation is completely different from a resize in image processing. For easy visualization, see the drawing below:

Figure 18: The Reorg technique in YOLOv2

18 Determination of Vehicle Trajectory through Optimization of Vehicle Bounding Boxes Using a Convolutional Neural Network.

This is one channel of the 4x4 feature map. To bring it to size 2x2, that is, to reduce the width by 2 times and the height by 2 times, we split the channel of the 4x4 feature map into 4 tensors as shown above, corresponding to the 4 depth channels of the new 2x2 feature map. The positions of the values in each channel of the new 2x2 feature map are sparse on the original 4x4 feature map, with stride = 2 along the width and height.

2.3.7 Multi-Scale Training

After adding the anchor box technique to YOLOv2, the authors changed the network input to 416×416 instead of 448×448. However, YOLOv2 is designed with only neural and pooling layers, so it can adapt to many different input image sizes. YOLOv2 can train on many different image sizes to increase its adaptability to a variety of image sizes.

a variety of image sizes

2.3.8 Light weight Backbone

This network consists of 19 neural layers and 5 max pooling layers, which makes its speed faster than YOLOv1.¹⁹

Figure 19: Darknet-19

19 Darknet: Open source neural networks in C.

Look at Figure 19: the input image is 224×224; after convolving with the kernel matrix, the output image will be scaled to 112×112 because the pooling size is 2. After 5 pooling layers, the size of the image is 7×7.
