Vietnam National University, Ho Chi Minh City
VNUHCM - University of Information Technology
Faculty of Computer Engineering
Nguyễn Công Danh - 16520178
Nguyễn Đức Hoàng - 16520437
GRADUATE THESIS
RESEARCH AND IMPLEMENTATION CNN ARCFACE
ALGORITHM ON KIT ZYNQ7020 FOR FACE
RECOGNITION
ENGINEER OF COMPUTER ENGINEERING
HO CHI MINH, 2021
PROTECTION COUNCIL OF THE GRADUATE THESIS
The protection council of the graduate thesis was established under the Decision No. ............, dated ............, of the Rector of the University of Information Technology.
1. .................................................. - President
2. .................................................. - Secretary
3. .................................................. - Commissioner
Our group would like to express our sincere thanks to the teachers in the Faculty of Computer Engineering of the University of Information Technology, Vietnam National University, Ho Chi Minh City, for creating opportunities for us to do this thesis.
In particular, we would like to give our sincere thanks and gratitude to our mentor, Dr. Nguyen Minh Son, who wholeheartedly supported us during the time of our thesis.
Sincere thanks to our friends and former students who helped us learn more ideas and knowledge from previous topics.
In the process of making this thesis, we have learned many valuable lessons from reality and treasured techniques. However, in the process of making it, mistakes can hardly be avoided, and we hope that the council will forgive them.
Thank you sincerely!
Representative Student
NGUYEN CONG DANH
Faculty of Computer Engineering - 2021
TABLE OF CONTENTS

1.3 The problem that the thesis focuses on solving
CHAPTER 2. ELEMENTARY THEORY
2.1 Library's Overview
2.1.1 OpenCV
2.1.2 Dlib
2.1.3 ncnn
2.2 Machine Learning in Face Recognition
2.2.1 Machine Learning Overview
2.2.2 Machine Learning Models
2.2.3 Convolution Image Processing
2.3 Arcface Algorithm
2.3.2 Transform Softmax to Arcface
2.3.3 Comparison
2.3.4 Articles Evaluation Results
2.4 Hardware and tools
2.4.1 Xilinx
2.4.2 Vivado Design Suite
2.4.3 Vivado IDE
2.4.4 Vivado HLS
2.4.5 KIT ZYNQ 7020
CHAPTER 3. SYSTEM DESIGN AND IMPLEMENTATION
3.1 Linux Implementation
3.1.1 Overview
3.1.2 System Diagram
3.2 Vivado HLS Implementation
3.2.1 Implementation Theory
3.2.2 Implementation Practice
3.3 Project Implementation Process
CHAPTER 4. EXPERIMENTAL EVALUATION
4.1 Linux Result Evaluation
4.1.1 Advantages
4.1.2 Disadvantages
4.2 Vivado Result Evaluation
4.2.1 Advantages
4.2.2 Disadvantages
CHAPTER 5. CONCLUSION AND FUTURE WORK
5.1 Solved Problems
5.2 Unsolved Problems
5.3 Future works and proposals

TABLE OF FIGURES

Figure 1: website paperswithcode.com
Figure 2: OpenCV logo
Figure 3: Dlib logo
Figure 4: Example of machine learning
Figure 5: Example of a neural network
Figure 6: Example of a training model
Figure 7: The example of small image and kernel
Figure 8: The formula of convolution
Figure 9: Example of padding
Figure 10: Example of the pooling layer
Figure 11: Example of Arcface algorithm
Figure 12: Softmax algorithm
Figure 13: Weights Normalization
Figure 14: Formula after adjustment
Figure 15: Sphere Face Loss
Figure 16: Sphere Face Loss Piece Wise
Figure 17: Piece Wise Function
Figure 18: SphereFace-FNorm
Figure 19: Additive Cosine Margin
Figure 20: Additive Angular Margin Loss
Figure 21: Geometrical interpretation of Arcface
Figure 22: Angle between the Feature and Target Center
Figure 23: Angle between the Feature and Target Center
Figure 24: Decision margins of different loss functions
Figure 25: Angle distributions of both positive and negative pairs on different datasets
Figure 26: CMC and ROC curves of different models on MegaFace
Example of Vivado IDE
Vivado HLS Design Flow
ZYNQ-7000 SoC Overview
Z-turn Board (with ZYNQ-7020)
Linux implementation system diagram
Example of the biometrics numbers
Vivado HLS implementation
Implementation Process
LIST OF ACRONYMS
ReLU Layers: Rectified Linear Units
LFW: Labeled Faces in the Wild
YTF: YouTube Faces
CALFW: Cross-Age Labeled Faces in the Wild
CPLFW: Cross-Pose Labeled Faces in the Wild
HLS: High-Level Synthesis
PS: Processing System
PL: Programmable Logic
FFT: Fast Fourier Transform
SRL: Shift Register LUT
LUT: Look-up Table
Face recognition with CNN algorithms is one of the most challenging problems nowadays. Many programmers and scientists have created algorithms and methods to solve this problem. These techniques are compared with each other in terms of accuracy and speed, and among all of them one stands out, called Arcface.
Arcface is a new algorithm that was first introduced in February 2019 in the paper titled ArcFace: Additive Angular Margin Loss for Deep Face Recognition, whose authors are four researchers: Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. This method is currently being researched in many countries, but the country that develops it most is China, which has many papers and articles about this new algorithm. In our country, Vietnam, this method has not received much research at the moment, and most of the articles in Vietnam are only introductory.
The advantages of the proposed Arcface can be summarized as follows: Engaging, Effective, Easy and Efficient. But one of the difficulties with this method is that the paper and most of the code examples run in software; few implement this algorithm in a hardware system. Because of that, our group was assigned to research this algorithm and to try to implement this Arcface method in hardware, specifically on the kit Zynq7020.
CHAPTER 1. OVERVIEW OF THE PROJECT
1.1 The past and the future of face recognition
1.1.1 The creation of face recognition
In the early 1990s, the United States Department of Defense was looking for a solution or a technology that could spot criminals who furtively crossed borders, and because of that, face recognition gained much popularity. The Defense Department then roped in eminent university scientists and experts in the field of face recognition for this purpose by providing them with research financing.
In early 2001, facial recognition made bold headlines after this technology was used at a public event, Super Bowl XXXV in Tampa, by the law enforcement authorities to search for criminals and terrorists among the crowd. Soon after that, facial recognition systems were installed basically everywhere in the US in order to keep track of felonious activities.
1.1.2 Present and future predictions
Nowadays, and likely in the future, implementing face recognition on small devices or kits is a trend in many parts of the world. By doing this, individuals can solve many problems in their daily life; for example, they can create a simple face recognition device for their house, a smart home for short. Corporations can also gain many benefits from this technology, because it solves simple problems such as checking the attendance of employees or strengthening their security measures.
1.2 Research technology and related applications
1.2.1 Domestic Research
In our country, Vietnam, IC design and hardware design are not major branches, and because of that we have many limits in doing research in the field of face recognition. Our country does not have enough expenditure, the technology does not reach the requirements, and human resources are low, so most face recognition research comes from developed countries.
And Arcface is no exception: research papers about this algorithm in Vietnam are really few. The main reason is that this method was only announced in 2019, and, combined with the limited access to new technology, not many researchers in Vietnam have learned about this algorithm. But in some programming communities, people still talk about Arcface and have some demos of it. Our prediction is that in the future this technique will be widely used because of its speed and high accuracy, and with all the above, we are really proud to be among the first people to work with this algorithm.
1.2.2 Foreign Countries Research
In contrast to our country, developed countries such as America, England, Japan, and China have much research about face recognition. There are many papers about it; some are public, while some remain confidential because of their potential and copyright.
According to the website paperswithcode.com, a reliable website for programmers to study that even features code alongside articles, the famous article about Arcface is the paper titled ArcFace: Additive Angular Margin Loss for Deep Face Recognition, whose authors are four researchers: Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. The paper includes an introduction to Arcface, the mathematical derivation of the algorithm, and a comparison between this method and others. This may be one of the most detailed articles, and might be the only in-depth article, about this technique. Our group mainly used the article introduced above in order to create and execute our thesis.
Figure 1: website paperswithcode.com
1.3 The problem that the thesis focuses on solving
There are many problems that this thesis could solve; however, the specific thing our group wants to focus on is the implementation, or demo, of this algorithm on the ARM side of the KIT. In the detailed syllabus many achievements have been introduced, but in doing this thesis we cannot avoid drawbacks. The biggest disadvantage our group has is that we do not have access to many articles; therefore we have shifted our focus to the demo on the ARM core. We believe that this is one of many approachable directions for this technique. We have already successfully run the code in the Linux environment, and with the Vivado tool we can input our C/C++ code of Arcface to run the simulation or the synthesis. Therefore, with many tools and many advantages, we decided to focus our thesis on this problem in order to prove that our vision is right. The Vivado tool provides us with many systems that ease our simulation, such as the synthesis function, the analysis function and many more. With all of the above, we conclude that the main thing we want to prove in this thesis is the simulation of this algorithm and the focus on the ARM core.
CHAPTER 2. ELEMENTARY THEORY
2.1 Library’s Overview
2.1.1 OpenCV
OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code.
The library has more than 2500 optimized algorithms, which include a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D point clouds from stereo cameras, stitch images together to produce a high-resolution image of an entire scene, find similar images in an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality, etc. OpenCV has a user community of more than 47 thousand people and an estimated number of downloads exceeding 18 million. The library is used extensively in companies, research groups and by governmental bodies.
Along with well-established companies like Google, Yahoo, Microsoft, Intel,
IBM, Sony, Honda, Toyota that employ the library, there are many startups such
as Applied Minds, VideoSurf, and Zeitera, that make extensive use of OpenCV. OpenCV's deployed uses span the range from stitching street-view images together, detecting intrusions in surveillance video in Israel, monitoring mine equipment in China, helping robots navigate and pick up objects at Willow Garage, detecting swimming pool drowning accidents in Europe, running interactive art in Spain and New York, checking runways for debris in Turkey, and inspecting labels on products in factories around the world, on to rapid face detection in Japan.
It has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision applications and takes advantage of MMX and SSE instructions when available. Full-featured CUDA and OpenCL interfaces are being actively developed right now. There are over 500 algorithms and about 10 times as many functions that compose or support those algorithms. OpenCV is written natively in C++ and has a template interface that works seamlessly with Standard Template Library containers.
2.1.2 Dlib
Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real-world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high-performance computing environments. Dlib's open source licensing allows you to use it in any application, free of charge.
Since development began in 2002, Dlib has grown to include a wide variety of tools. As of 2016, it contains software components for dealing with networking, threads, graphical user interfaces, data structures, linear algebra, machine learning, image processing, data mining, XML and text parsing, numerical optimization, Bayesian networks, and many other tasks. In recent years, much of the development has been focused on creating a broad set of statistical machine learning tools, and in 2009 Dlib was published in the Journal of Machine Learning Research.
Dlib contains a wide range of machine learning algorithms, all designed to be highly modular, quick to execute, and simple to use via a clean and modern C++ API. It is used in a wide range of applications including robotics, embedded devices, mobile phones, and large high-performance computing environments.
Figure 3: Dlib logo
2.1.3 ncnn
ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. ncnn is deeply considerate about deployment and use on mobile phones from the beginning of its design. ncnn does not have third-party dependencies, it is cross-platform, and it runs faster than all known open source frameworks on mobile phone CPUs. Developers can easily deploy deep learning algorithm models to the mobile platform by using the efficient ncnn implementation, create intelligent apps, and bring artificial intelligence to your fingertips. ncnn is currently being used in many Tencent applications, such as QQ, Qzone, WeChat, Pitu and so on.
ncnn supports convolutional networks, supports multiple inputs and multi-branch structure, and can calculate part of a branch. It supports multi-core parallel computing acceleration, ARM and GPU acceleration, and so on. Therefore, with some adjustments in the source code and the installation of libraries, the code can be run, and the camera can detect and recognize individuals.
ncnn features:
• Supports convolutional neural networks, supports multiple inputs and multi-branch structure, can calculate part of a branch.
• No third-party library dependencies; does not rely on BLAS / NNPACK or any other computing framework.
• Pure C++ implementation, cross-platform, supports Android, iOS and so on.
• ARM NEON assembly-level careful optimization; calculation speed is extremely high.
• Sophisticated memory management and data structure design; very low memory footprint.
• Supports multi-core parallel computing acceleration, ARM big.LITTLE CPU scheduling optimization.
• Supports GPU acceleration via the next-generation low-overhead Vulkan API.
• The overall library size is less than 700K, and can be easily reduced to less than 300K.
• Extensible model design; supports 8-bit quantization and half-precision floating point storage; can import caffe/pytorch/mxnet/onnx models.
• Supports direct memory zero-copy reference to load a network model.
• Custom layer implementations can be registered and extended.
2.2 Machine Learning in Face Recognition
2.2.1 Machine Learning Overview
Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they can carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all the steps required to solve the problem at hand; on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step.
The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers, for example, to recognize handwritten numbers or to classify spam email. We can create generic algorithms for machine learning, then feed data into them, and they will help us resolve the problems.
Figure 4: Example of machine learning
2.2.2 Machine Learning Models
2.2.2.1 Neural Networks
A biological neural network is composed of groups of chemically connected or functionally associated neurons. A single neuron may be connected to many other neurons, and the total number of neurons and connections in a network may be extensive. Connections, called synapses, are usually formed from axons to dendrites, though dendrodendritic synapses and other connections are possible. Apart from electrical signaling, there are other forms of signaling that arise from neurotransmitter diffusion.
Artificial intelligence, cognitive modeling, and neural networks are information processing paradigms inspired by the way biological neural systems process data. Artificial intelligence and cognitive modeling try to simulate some properties of biological neural networks. In the artificial intelligence field, artificial neural networks have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots.
Historically, digital computers evolved from the von Neumann model, and operate via the execution of explicit instructions with access to memory by a number of processors. On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems. Unlike the von Neumann model, neural network computing does not separate memory and processing.
Neural network theory has served both to better identify how the neurons in the brain function and to provide the basis for efforts to create artificial intelligence.
Figure 5: Example of a neural network
2.2.2.2 Training Models
Usually, machine learning models require a lot of data in order for them to perform well. When training a machine learning model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, or data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model. Trained models derived from biased data can result in skewed or undesired predictions. Algorithmic bias is a potential result of data not fully prepared for training.
2.2.3 Convolution Image Processing
Convolution is a simple mathematical operation which is fundamental to many common image processing operators. Convolution provides a way of multiplying together two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of numbers of the same dimensionality. This can be used in image processing to implement operators whose output pixel values are simple linear combinations of certain input pixel values.
In an image processing context, one of the input arrays is normally just a graylevel image. The second array is usually much smaller, and is also two-dimensional (although it may be just a single pixel thick), and is known as the kernel.
The convolution is performed by sliding the kernel over the image, generally starting at the top left corner, so as to move the kernel through all the positions where the kernel fits entirely within the boundaries of the image. Each kernel position corresponds to a single output pixel, the value of which is calculated by multiplying together the kernel value and the underlying image pixel value for each of the cells in the kernel, and then adding all these numbers together.
2.2.3.2 Convolution Algorithm
In an image processing context, one of the input arrays is normally just a gray level image. The second array is usually much smaller, is also two-dimensional, and is known as the kernel, just like the example image shown below.
Figure 7: The example of small image and kernel
The convolution is performed by sliding the kernel over the image, generally starting at the top left corner, so as to move the kernel through all the positions where the kernel fits entirely within the boundaries of the image. Each kernel position corresponds to a single output pixel, the value of which is calculated by multiplying together the kernel value and the underlying image pixel value for each of the cells in the kernel, and then adding all these numbers together. If the image has M rows and N columns, and the kernel has m rows and n columns, then the output image will have M - m + 1 rows and N - n + 1 columns. Mathematically, we can write the convolution as:

O(i, j) = \sum_{k=1}^{m} \sum_{l=1}^{n} I(i + k - 1, j + l - 1) \, K(k, l)
Figure 8: The formula of convolution
Note that many implementations of convolution produce a larger output image than this, because they relax the constraint that the kernel can only be moved to positions where it fits entirely within the image. Instead, these implementations typically slide the kernel to all positions where just the top left corner of the kernel is within the image. Therefore the kernel 'overlaps' the image on the bottom and right edges. One advantage of this approach is that the output image is the same size as the input image. Unfortunately, in order to calculate the output pixel values for the bottom and right edges of the image, it is necessary to invent input pixel values for places where the kernel extends off the end of the image. Typically, pixel values of zero are chosen for regions outside the true image, but this can often distort the output image at these places. Therefore, in general, if you are using a convolution implementation that does this, it is better to clip the image to remove these spurious regions. Removing n - 1 pixels from the right hand side and m - 1 pixels from the bottom will fix things.
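The sliding-window computation described above can be sketched in a few lines. The following is an illustrative Python/NumPy version of the valid-convolution formula (the function name conv2d_valid is ours; the thesis itself targets C/C++ for Vivado HLS):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: the kernel only visits positions where it
    fits entirely inside the image, so an M x N image and an m x n kernel
    give an (M - m + 1) x (N - n + 1) output, matching the formula above."""
    M, N = image.shape
    m, n = kernel.shape
    out = np.zeros((M - m + 1, N - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the kernel with the image patch under it, then sum
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out
```

Note that, like the formula above, this sketch does not flip the kernel; flipping K before the loop gives the strict mathematical convolution.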
2.2.3.3 Padding
Padding is a term relevant to convolutional neural networks, as it refers to the amount of pixels added to an image when it is being processed by the kernel of a CNN. For example, if the padding in a CNN is set to zero, then every pixel value that is added will be of value zero. If, however, the zero padding is set to one, there will be a one-pixel border added to the image with a pixel value of zero.
Padding works by extending the area over which a convolutional neural network processes an image. The kernel is the neural network's filter which moves across the image, scanning each pixel and converting the data into a smaller, or sometimes larger, format. In order to assist the kernel with processing the image, padding is added to the frame of the image to allow for more space for the kernel to cover the image. Adding padding to an image processed by a CNN allows for more accurate analysis of images.
Figure 9: Example of padding
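As a small illustration of the zero padding described above (a hedged NumPy sketch; the name zero_pad is ours, built on numpy.pad):

```python
import numpy as np

def zero_pad(image, p):
    """Surround the image with a border of p pixels whose value is zero,
    so an M x N input becomes an (M + 2p) x (N + 2p) array."""
    return np.pad(image, pad_width=p, mode="constant", constant_values=0)
```

For an odd m x n kernel, choosing p = (m - 1) // 2 before a valid convolution yields an output the same size as the input, which is the "same" padding shown in the figure.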
2.2.3.4 ReLU (Rectified Linear Units) Layers
After each conv layer, it is convention to apply a nonlinear layer (or activation layer) immediately afterward. The purpose of this layer is to introduce nonlinearity to a system that has basically just been computing linear operations during the conv layers (just element-wise multiplications and summations). In the past, nonlinear functions like tanh and sigmoid were used, but researchers found out that ReLU layers work far better, because the network is able to train a lot faster (thanks to the computational efficiency) without making a significant difference to the accuracy. It also helps to alleviate the vanishing gradient problem, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers. The ReLU layer applies the function f(x) = max(0, x) to all of the values in the input volume. In basic terms, this layer just changes all the negative activations to 0. This layer increases the nonlinear properties of the model and the overall network without affecting the receptive fields of the convolution layer.
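The activation described here is one line in practice; a minimal NumPy sketch (the name relu is ours):

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x) applied element-wise: every negative activation
    becomes 0, and non-negative values pass through unchanged."""
    return np.maximum(0, x)
```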
2.2.3.5 Pooling Layers
After some ReLU layers, programmers may choose to apply a pooling layer. It is also referred to as a downsampling layer. In this category, there are also several layer options, with max pooling being the most popular. This basically takes a filter (normally of size 2x2) and a stride of the same length. It then applies it to the input volume and outputs the maximum number in every subregion that the filter convolves around. Other options for pooling layers are average pooling and L2-norm pooling. The intuitive reasoning behind this layer is that once we know that a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features. As you can imagine, this layer drastically reduces the spatial dimension (the length and the width change, but not the depth) of the input volume. This serves two main purposes. The first is that the amount of parameters or weights is reduced by 75%, thus lessening the computation cost. The second is that it will control overfitting. This term refers to when a model is so tuned to the training examples that it is not able to generalize well for the validation and test sets. A symptom of overfitting is having a model that gets 100% or 99% on the training set, but only 50% on the test data.
Figure 10: Example of the pooling layer
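The 2x2, stride-2 max pooling described above can be sketched as follows (an illustrative NumPy version for a single channel; the name max_pool is ours):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a single 2-D channel: each output pixel is the
    maximum of a size x size subregion, so width and height shrink while
    the depth of a volume would stay unchanged."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the maximum of the subregion under the filter
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out
```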
2.3 Arcface Algorithm
For the softmax loss, the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem. For the triplet loss, there is a combinatorial explosion in the number of face triplets, especially for large-scale datasets, leading to a significant increase in the number of iteration steps; moreover, semi-hard sample mining is a quite difficult problem for effective model training.
We propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process. As illustrated, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation. We utilise the arc-cosine function to calculate the angle between the current feature and the target weight. Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. In face recognition, DCNNs map the face image, typically after a pose normalisation step, into a feature that has small intra-class and large inter-class distance.
There are two main lines of research to train DCNNs for face recognition: those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier, and those that learn an embedding directly, such as the triplet loss. Based on large-scale training data and elaborate DCNN architectures, both the softmax-loss-based methods and the triplet-loss-based methods can obtain excellent performance on face recognition. The advantages of the proposed ArcFace can be summarised as follows:
• Engaging. ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and arc in the normalised hypersphere. We intuitively illustrate what happens in the 512-D space via analysing the angle statistics between features and weights.
• Effective. ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets.
• Easy. ArcFace only needs several lines of code as given in Algorithm 1 and is extremely easy to implement in computational-graph-based deep learning frameworks, e.g. MxNet, Pytorch and Tensorflow. Furthermore, contrary to other works, ArcFace does not need to be combined with other loss functions in order to have stable performance, and can easily converge on any training dataset.
• Efficient. ArcFace only adds negligible computational complexity during training. Current GPUs can easily support millions of identities for training, and the model parallel strategy can easily support many more identities.
Margin-Loss: GDis(F_i, T_{y_i}) + m < GDis(F_i, T_j)
2.3.2 Transform Softmax to Arcface
2.3.2.1 Softmax
Figure 12: Softmax algorithm
• The batch size and the class number are m and n, respectively.
• The softmax loss function does not explicitly optimise the features to have a higher similarity score for positive pairs and a lower similarity score for negative pairs, which leads to a performance gap.
• Predictions only depend on the angle between the feature vector and the weight.
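For reference, the batch-averaged softmax cross-entropy discussed in these bullet points can be sketched as follows (an illustrative NumPy version; the name softmax_loss is ours):

```python
import numpy as np

def softmax_loss(logits, y):
    """Standard softmax cross-entropy averaged over a batch of size m.
    logits: (m, n) class scores; y: (m,) ground-truth labels."""
    z = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # negative log-probability of the correct class, averaged over the batch
    return -np.mean(np.log(p[np.arange(len(y)), y]))
```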
Trang 33cll#il[ cos(4y,)
ao log
||: || cos(Ø„; ) i 5||#¿ || cos Ø;
m 4 ell#: vi) +} »y ellz: Ì 5
Figure 14: Formula after adjustment
2.3.2.3 Multiplicative Angular Margin
In SphereFace, angular margin m is introduced by
multiplication on the angle:
L = -\frac{1}{m} \sum_{i=1}^{m} \log \frac{e^{\|x_i\| \cos(m\theta_{y_i})}}{e^{\|x_i\| \cos(m\theta_{y_i})} + \sum_{j=1,\, j \neq y_i}^{n} e^{\|x_i\| \cos\theta_j}}
Figure 15: Sphere Face Loss
where θ_yi belongs to [0, π/m]. In order to remove this restriction, cos(θ_yi) is substituted by a piece-wise monotonic function ψ(θ_yi).
Figure 16: Sphere Face Loss Piece Wise
Softmax supervision is incorporated to guarantee the convergence of training, and the weight is controlled by a dynamic hyper-parameter λ.
Trang 34e Features for good quality frontal faces have a high
L2-norm while blurry faces with extreme pose
have low L2-norm
e Gradient norm may be extremely large when the
feature norm from low-quality face image is very
small, which potentially increase the risk ofgradient explosion
• The intuitive insight behind feature and weight normalization is to remove the radial variation and push every feature to distribute on a hypersphere.
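The whole transformation of this section (normalise features and weights, recover the angle with the arc-cosine, add the additive angular margin m to the target angle, then re-scale by the fixed norm s) can be sketched as follows. This is an illustrative NumPy version with the paper's typical hyper-parameters s = 64 and m = 0.5, not the thesis's C/C++ HLS implementation, and the function name is ours:

```python
import numpy as np

def arcface_logits(x, W, y, s=64.0, m=0.5):
    """Sketch of the ArcFace logit adjustment for one batch.
    x: (batch, d) features; W: (d, n_classes) weights; y: (batch,) labels."""
    # normalise features and weight columns onto the unit hypersphere
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(x @ W, -1.0, 1.0)            # cos(theta) for every class
    theta = np.arccos(cos)                     # angle via the arc-cosine
    rows = np.arange(len(y))
    target = theta[rows, y]                    # angle to the ground-truth centre
    cos[rows, y] = np.cos(target + m)          # additive angular margin
    return s * cos                             # re-scale by the fixed norm s
```

Feeding these logits into the ordinary softmax cross-entropy yields the ArcFace loss; the margin only penalises the target class, so the target logit shrinks from s*cos(theta) to s*cos(theta + m).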