Vietnam National University, Ho Chi Minh City
VNUHCM - University of Information Technology
Faculty of Computer Engineering
Nguyễn Công Danh - 16520178
Nguyễn Đức Hoàng - 16520437
GRADUATE THESIS
RESEARCH AND IMPLEMENTATION CNN ARCFACE
ALGORITHM ON KIT ZYNQ7020 FOR FACE
RECOGNITION
ENGINEER OF COMPUTER ENGINEERING
HO CHI MINH, 2021
PROTECTION COUNCIL OF THE GRADUATE THESIS
The protection council of the graduate thesis was established under the Decision No. ............, dated ............, of the Rector of the University of Information Technology.
1. .................................................. - President
2. .................................................. - Secretary
3. .................................................. - Commissioner
Our group would like to express our sincere thanks to the teachers in the Faculty of Computer Engineering of the University of Information Technology, Vietnam National University, Ho Chi Minh City, for creating opportunities for us to do this thesis.
In particular, we would like to give our sincere thanks and gratitude to our mentor, Dr. Nguyen Minh Son, who wholeheartedly supported us during the time of our thesis.
Sincere thanks to our friends and former students who helped us learn more ideas and knowledge from previous topics.
In the process of making this thesis, we have learned many valuable lessons from reality and treasured techniques. However, in the process of making it, mistakes can hardly be avoided, and we hope that the council will forgive them.
Thank you sincerely!
Representative Student
NGUYEN CONG DANH
Faculty of Computer Engineering - 2021
TABLE OF CONTENTS

1.3 The problem that the thesis focuses on solving
CHAPTER 2. ELEMENTARY THEORY
2.1 Library's Overview
2.1.1 OpenCV
2.1.2 Dlib
2.1.3 ncnn
2.2 Machine Learning in Face Recognition
2.2.1 Machine Learning Overview
2.2.2 Machine Learning Models
2.2.3 Convolution Image Processing
2.3 Arcface Algorithm
2.3.2 Transform Softmax to Arcface
2.3.3 Comparison
2.3.4 Articles Evaluation Results
2.4 Hardware and tools
2.4.1 Xilinx
2.4.2 Vivado Design Suite
2.4.3 Vivado IDE
2.4.4 Vivado HLS
2.4.5 KIT ZYNQ 7020
CHAPTER 3. SYSTEM DESIGN AND IMPLEMENTATION
3.1 Linux Implementation
3.1.1 Overview
3.1.2 System Diagram
3.2 Vivado HLS Implementation
3.2.1 Implementation Theory
3.2.2 Implementation Practice
3.3 Project Implementation Process
CHAPTER 4. EXPERIMENTAL EVALUATION
4.1 Linux Result Evaluation
4.1.1 Advantages
4.1.2 Disadvantages
4.2 Vivado Result Evaluation
4.2.1 Advantages
4.2.2 Disadvantages
CHAPTER 5. CONCLUSION AND FUTURE WORK
5.1 Solved Problems
5.2 Unsolved Problems
5.3 Future works and proposals

TABLE OF FIGURES

Figure 1: website paperswithcode.com
Figure 2: OpenCV logo
Figure 3: Dlib logo
Figure 4: Example of machine learning
Figure 5: Example of a neural network
Figure 6: Example of a training model
Figure 7: The example of small image and kernel
Figure 8: The formula of convolution
Figure 9: Example of padding
Figure 10: Example of the pooling layer
Figure 11: Example of Arcface algorithm
Figure 12: Softmax algorithm
Figure 13: Weights Normalization
Figure 14: Formula after adjustment
Figure 15: Sphere Face Loss
Figure 16: Sphere Face Loss Piece Wise
Figure 17: Piece Wise Function
Figure 18: SphereFace-FNorm
Figure 19: Additive Cosine Margin
Figure 20: Additive Angular Margin Loss
Figure 21: Geometrical interpretation of Arcface
Figure 22: Angle between the Feature and Target Center
Figure 23: Angle between the Feature and Target Center
Figure 24: Decision margins of different loss functions
Figure 25: Angle distributions of both positive and negative pairs on different datasets
Figure 26: CMC and ROC curves of different models on MegaFace
Example of Vivado IDE
Vivado HLS Design Flow
ZYNQ-7000 SoC Overview
Z-turn Board (with ZYNQ-7020)
Linux implementation system diagram
Example of the biometrics numbers
Vivado HLS implementation
Implementation Process
LIST OF ACRONYMS
ReLU Layers: Rectified Linear Units
LFW: Labeled Faces in the Wild
YTF: YouTube Faces
CALFW: Cross-Age Labeled Faces in the Wild
CPLFW: Cross-Pose Labeled Faces in the Wild
HLS: High-Level Synthesis
PS: Processing System
PL: Programmable Logic
FFT: Fast Fourier Transform
SRL: Shift Register LUT
LUT: Look-up Table
Face recognition with CNN algorithms is one of the most challenging problems nowadays. Many programmers and scientists have created algorithms and methods to solve this problem. These techniques are compared with each other in terms of accuracy and speed, and among all of them one stands out, called Arcface.
Arcface is a new algorithm that was first introduced in February 2019 in the paper titled ArcFace: Additive Angular Margin Loss for Deep Face Recognition, whose authors are four researchers: Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. This method is currently being researched in many countries, but the country that develops it most is China, which has many papers and articles about this new algorithm. In our country, Vietnam, this method has not received much research at the moment, and most of the articles in Vietnam are only introductory.
The advantages of the proposed Arcface can be summarized as follows: Engaging, Effective, Easy and Efficient. But one of the difficulties with this method is that the paper and most of the code examples run in software; few implement this algorithm in a hardware system. Because of that, our group was assigned to research this algorithm and to try to implement this Arcface method in hardware, specifically on the kit Zynq7020.
CHAPTER 1. OVERVIEW OF THE PROJECT
1.1 The past and the future of face recognition
1.1.1 The creation of face recognition
In the early 1990s, the United States Department of Defense was looking for a solution or a technology that could spot criminals who furtively crossed borders, and because of that, face recognition gained much popularity. The Defense Department then roped in eminent university scientists and experts in the field of face recognition for this purpose by providing them with research financing.
In early 2001, facial recognition made bold headlines after this technology was used at a public event, Super Bowl XXXV in Tampa, by the law enforcement authorities to search for criminals and terrorists among the crowd. Soon after that, facial recognition systems were installed basically everywhere in the US in order to keep track of felonious activities.
1.1.2 Present and future predictions
Nowadays, and likely in the future, implementing face recognition on small devices or kits is a trend in many parts of the world. By doing this, individuals can solve many problems in their daily life; for example, they can create a simple face recognition device for their house, a smart home for short. Corporations can also gain many benefits from this technology, because it solves simple problems such as checking the attendance of employees or strengthening their security measures.
1.2 Research technology and related applications
1.2.1 Domestic Research
In our country, Vietnam, IC design and hardware design are not major branches, and because of that we have many limits in doing research in the field of face recognition. Our country does not have enough expenditure, the technology does not reach the requirements, and human resources are low, so most face recognition research comes from developed countries.
And Arcface is no exception: research papers about this algorithm in Vietnam are really few. The main reason is that this method was only announced in 2019, and, combined with the limited access to new technology, not many researchers in Vietnam have learned about this algorithm. But in some programming communities, people still talk about Arcface and have some demos of it. Our prediction is that in the future this technique will be widely used because of its speed and high accuracy, and with all the above, we are really proud to be among the first people to work with this algorithm.
1.2.2 Foreign Countries Research
In contrast to our country, developed countries such as America, England, Japan, and China have much research about face recognition. There are many papers about it; some are public, while some remain confidential because of their potential and copyright.
According to the website paperswithcode.com, a reliable website for programmers to study that even features code alongside articles, the famous article about Arcface is the paper titled ArcFace: Additive Angular Margin Loss for Deep Face Recognition, whose authors are four researchers: Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. The paper includes an introduction to Arcface, the mathematical derivation of the algorithm, and a comparison between this method and others. This may be one of the most detailed articles, and might be the only in-depth article, about this technique. Our group mainly used the article introduced above in order to create and execute our thesis.
Figure 1: website paperswithcode.com
1.3 The problem that the thesis focuses on solving
There are many problems that this thesis could solve; however, the specific thing our group wants to focus on is the implementation, or demo, of this algorithm on the ARM side of the KIT. In the detailed syllabus many achievements have been introduced, but in doing this thesis we cannot avoid drawbacks. The biggest disadvantage our group has is that we do not have access to many articles; therefore we have shifted our focus to the demo on the ARM core. We believe that this is one of many approachable directions for this technique. We have already successfully run the code in the Linux environment, and with the Vivado tool we can input our C/C++ code of Arcface to run the simulation or the synthesis. Therefore, with many tools and many advantages, we decided to focus our thesis on this problem in order to prove that our vision is right. The Vivado tool provides us with many systems that ease our simulation, such as the synthesis function, the analysis function and many more. With all of the above, we conclude that the main thing we want to prove in this thesis is the simulation of this algorithm and the focus on the ARM core.
CHAPTER 2. ELEMENTARY THEORY
2.1 Library’s Overview
2.1.1 OpenCV
OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code.
The library has more than 2500 optimized algorithms, which include a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D point clouds from stereo cameras, stitch images together to produce a high-resolution image of an entire scene, find similar images in an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality, etc. OpenCV has a user community of more than 47 thousand people and an estimated number of downloads exceeding 18 million. The library is used extensively in companies, research groups and by governmental bodies.
Along with well-established companies like Google, Yahoo, Microsoft, Intel,
IBM, Sony, Honda, Toyota that employ the library, there are many startups such
as Applied Minds, VideoSurf, and Zeitera, that make extensive use of OpenCV. OpenCV's deployed uses span the range from stitching street-view images together, detecting intrusions in surveillance video in Israel, monitoring mine equipment in China, helping robots navigate and pick up objects at Willow Garage, detecting swimming pool drowning accidents in Europe, running interactive art in Spain and New York, checking runways for debris in Turkey, and inspecting labels on products in factories around the world, on to rapid face detection in Japan.
It has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision applications and takes advantage of MMX and SSE instructions when available. Full-featured CUDA and OpenCL interfaces are being actively developed right now. There are over 500 algorithms and about 10 times as many functions that compose or support those algorithms. OpenCV is written natively in C++ and has a template interface that works seamlessly with Standard Template Library containers.
2.1.2 Dlib
Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real-world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high-performance computing environments. Dlib's open source licensing allows you to use it in any application, free of charge.
Since development began in 2002, Dlib has grown to include a wide variety of tools. As of 2016, it contains software components for dealing with networking, threads, graphical user interfaces, data structures, linear algebra, machine learning, image processing, data mining, XML and text parsing, numerical optimization, Bayesian networks, and many other tasks. In recent years, much of the development has been focused on creating a broad set of statistical machine learning tools, and in 2009 Dlib was published in the Journal of Machine Learning Research.
Dlib contains a wide range of machine learning algorithms, all designed to be highly modular, quick to execute, and simple to use via a clean and modern C++ API. It is used in a wide range of applications including robotics, embedded devices, mobile phones, and large high-performance computing environments.
Figure 3: Dlib logo
2.1.3 ncnn
ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. ncnn is deeply considerate about deployment and use on mobile phones from the beginning of its design. ncnn does not have third-party dependencies, it is cross-platform, and it runs faster than all known open source frameworks on mobile phone CPUs. Developers can easily deploy deep learning algorithm models to the mobile platform by using the efficient ncnn implementation, create intelligent apps, and bring artificial intelligence to your fingertips. ncnn is currently being used in many Tencent applications, such as QQ, Qzone, WeChat, Pitu and so on.
ncnn supports convolutional networks, supports multiple inputs and multi-branch structure, and can calculate part of a branch. It supports multi-core parallel computing acceleration, ARM and GPU acceleration, and so on. Therefore, with some adjustments in the source code and the installation of libraries, the code can be run, and the camera can detect and recognize individuals.
ncnn features:
• Supports convolutional neural networks, supports multiple inputs and multi-branch structure, can calculate part of a branch.
• No third-party library dependencies; does not rely on BLAS / NNPACK or any other computing framework.
• Pure C++ implementation, cross-platform, supports Android, iOS and so on.
• ARM NEON assembly-level careful optimization; calculation speed is extremely high.
• Sophisticated memory management and data structure design; very low memory footprint.
• Supports multi-core parallel computing acceleration, ARM big.LITTLE CPU scheduling optimization.
• Supports GPU acceleration via the next-generation low-overhead Vulkan API.
• The overall library size is less than 700K, and can be easily reduced to less than 300K.
• Extensible model design; supports 8-bit quantization and half-precision floating point storage; can import caffe/pytorch/mxnet/onnx models.
• Supports direct memory zero-copy reference to load a network model.
• Custom layer implementations can be registered and extended.
2.2 Machine Learning in Face Recognition
2.2.1 Machine Learning Overview
Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they can carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all the steps required to solve the problem at hand; on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step.
The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers, for example, to recognize handwritten numbers or to classify spam email. We can create generic algorithms for machine learning, then feed data into them, and they will help us resolve the problems.
Figure 4: Example of machine learning
2.2.2 Machine Learning Models
2.2.2.1 Neural Networks
A biological neural network is composed of groups of chemically connected or functionally associated neurons. A single neuron may be connected to many other neurons, and the total number of neurons and connections in a network may be extensive. Connections, called synapses, are usually formed from axons to dendrites, though dendrodendritic synapses and other connections are possible. Apart from electrical signaling, there are other forms of signaling that arise from neurotransmitter diffusion.
Artificial intelligence, cognitive modeling, and neural networks are information processing paradigms inspired by the way biological neural systems process data. Artificial intelligence and cognitive modeling try to simulate some properties of biological neural networks. In the artificial intelligence field, artificial neural networks have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots.
Historically, digital computers evolved from the von Neumann model, and operate via the execution of explicit instructions with access to memory by a number of processors. On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems. Unlike the von Neumann model, neural network computing does not separate memory and processing.
Neural network theory has served both to better identify how the neurons in the brain function and to provide the basis for efforts to create artificial intelligence.
Figure 5: Example of a neural network
2.2.2.2 Training Models
Usually, machine learning models require a lot of data in order for them to perform well. When training a machine learning model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, or data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model. Trained models derived from biased data can result in skewed or undesired predictions. Algorithmic bias is a potential result of data not fully prepared for training.
2.2.3 Convolution Image Processing
Convolution is a simple mathematical operation which is fundamental to many common image processing operators. Convolution provides a way of multiplying together two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of numbers of the same dimensionality. This can be used in image processing to implement operators whose output pixel values are simple linear combinations of certain input pixel values.
In an image processing context, one of the input arrays is normally just a graylevel image. The second array is usually much smaller, and is also two-dimensional (although it may be just a single pixel thick), and is known as the kernel.
The convolution is performed by sliding the kernel over the image, generally starting at the top left corner, so as to move the kernel through all the positions where the kernel fits entirely within the boundaries of the image. Each kernel position corresponds to a single output pixel, the value of which is calculated by multiplying together the kernel value and the underlying image pixel value for each of the cells in the kernel, and then adding all these numbers together.
2.2.3.2 Convolution Algorithm
In an image processing context, one of the input arrays is normally just a gray level image. The second array is usually much smaller, is also two-dimensional, and is known as the kernel, just like the example image shown below.
Figure 7: The example of small image and kernel
The convolution is performed by sliding the kernel over the image, generally starting at the top left corner, so as to move the kernel through all the positions where the kernel fits entirely within the boundaries of the image. Each kernel position corresponds to a single output pixel, the value of which is calculated by multiplying together the kernel value and the underlying image pixel value for each of the cells in the kernel, and then adding all these numbers together. If the image has M rows and N columns, and the kernel has m rows and n columns, then the output image will have M - m + 1 rows and N - n + 1 columns. Mathematically, we can write the convolution as:

O(i, j) = \sum_{k=1}^{m} \sum_{l=1}^{n} I(i + k - 1, j + l - 1) \, K(k, l)
Figure 8: The formula of convolution
Note that many implementations of convolution produce a larger output image than this, because they relax the constraint that the kernel can only be moved to positions where it fits entirely within the image. Instead, these implementations typically slide the kernel to all positions where just the top left corner of the kernel is within the image. Therefore the kernel 'overlaps' the image on the bottom and right edges. One advantage of this approach is that the output image is the same size as the input image. Unfortunately, in order to calculate the output pixel values for the bottom and right edges of the image, it is necessary to invent input pixel values for places where the kernel extends off the end of the image. Typically, pixel values of zero are chosen for regions outside the true image, but this can often distort the output image at these places. Therefore, in general, if you are using a convolution implementation that does this, it is better to clip the image to remove these spurious regions. Removing n - 1 pixels from the right hand side and m - 1 pixels from the bottom will fix things.
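The sliding-window computation described above can be sketched in a few lines. The following is an illustrative Python/NumPy version of the valid-convolution formula (the function name conv2d_valid is ours; the thesis itself targets C/C++ for Vivado HLS):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: the kernel only visits positions where it
    fits entirely inside the image, so an M x N image and an m x n kernel
    give an (M - m + 1) x (N - n + 1) output, matching the formula above."""
    M, N = image.shape
    m, n = kernel.shape
    out = np.zeros((M - m + 1, N - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the kernel with the image patch under it, then sum
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out
```

Note that, like the formula above, this sketch does not flip the kernel; flipping K before the loop gives the strict mathematical convolution.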
2.2.3.3 Padding
Padding is a term relevant to convolutional neural networks, as it refers to the amount of pixels added to an image when it is being processed by the kernel of a CNN. For example, if the padding in a CNN is set to zero, then every pixel value that is added will be of value zero. If, however, the zero padding is set to one, there will be a one-pixel border added to the image with a pixel value of zero.
Padding works by extending the area over which a convolutional neural network processes an image. The kernel is the neural network's filter which moves across the image, scanning each pixel and converting the data into a smaller, or sometimes larger, format. In order to assist the kernel with processing the image, padding is added to the frame of the image to allow for more space for the kernel to cover the image. Adding padding to an image processed by a CNN allows for more accurate analysis of images.
Figure 9: Example of padding
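As a small illustration of the zero padding described above (a hedged NumPy sketch; the name zero_pad is ours, built on numpy.pad):

```python
import numpy as np

def zero_pad(image, p):
    """Surround the image with a border of p pixels whose value is zero,
    so an M x N input becomes an (M + 2p) x (N + 2p) array."""
    return np.pad(image, pad_width=p, mode="constant", constant_values=0)
```

For an odd m x n kernel, choosing p = (m - 1) // 2 before a valid convolution yields an output the same size as the input, which is the "same" padding shown in the figure.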
2.2.3.4 ReLU (Rectified Linear Units) Layers
After each conv layer, it is convention to apply a nonlinear layer (or activation layer) immediately afterward. The purpose of this layer is to introduce nonlinearity to a system that has basically just been computing linear operations during the conv layers (just element-wise multiplications and summations). In the past, nonlinear functions like tanh and sigmoid were used, but researchers found out that ReLU layers work far better, because the network is able to train a lot faster (thanks to the computational efficiency) without making a significant difference to the accuracy. It also helps to alleviate the vanishing gradient problem, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers. The ReLU layer applies the function f(x) = max(0, x) to all of the values in the input volume. In basic terms, this layer just changes all the negative activations to 0. This layer increases the nonlinear properties of the model and the overall network without affecting the receptive fields of the convolution layer.
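The activation described here is one line in practice; a minimal NumPy sketch (the name relu is ours):

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x) applied element-wise: every negative activation
    becomes 0, and non-negative values pass through unchanged."""
    return np.maximum(0, x)
```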
2.2.3.5 Pooling Layers
After some ReLU layers, programmers may choose to apply a pooling layer. It is also referred to as a downsampling layer. In this category, there are also several layer options, with max pooling being the most popular. This basically takes a filter (normally of size 2x2) and a stride of the same length. It then applies it to the input volume and outputs the maximum number in every subregion that the filter convolves around. Other options for pooling layers are average pooling and L2-norm pooling. The intuitive reasoning behind this layer is that once we know that a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features. As you can imagine, this layer drastically reduces the spatial dimension (the length and the width change, but not the depth) of the input volume. This serves two main purposes. The first is that the amount of parameters or weights is reduced by 75%, thus lessening the computation cost. The second is that it will control overfitting. This term refers to when a model is so tuned to the training examples that it is not able to generalize well for the validation and test sets. A symptom of overfitting is having a model that gets 100% or 99% on the training set, but only 50% on the test data.
Figure 10: Example of the pooling layer
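The 2x2, stride-2 max pooling described above can be sketched as follows (an illustrative NumPy version for a single channel; the name max_pool is ours):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a single 2-D channel: each output pixel is the
    maximum of a size x size subregion, so width and height shrink while
    the depth of a volume would stay unchanged."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the maximum of the subregion under the filter
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out
```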
2.3 Arcface Algorithm
For the softmax loss, the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem. For the triplet loss, there is a combinatorial explosion in the number of face triplets, especially for large-scale datasets, leading to a significant increase in the number of iteration steps; moreover, semi-hard sample mining is a quite difficult problem for effective model training.
We propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process. As illustrated, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation. We utilise the arc-cosine function to calculate the angle between the current feature and the target weight. Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. In face recognition, DCNNs map the face image, typically after a pose normalisation step, into a feature that has small intra-class and large inter-class distance.
There are two main lines of research to train DCNNs for face recognition: those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier, and those that learn an embedding directly, such as the triplet loss. Based on large-scale training data and elaborate DCNN architectures, both the softmax-loss-based methods and the triplet-loss-based methods can obtain excellent performance on face recognition. The advantages of the proposed ArcFace can be summarised as follows:
• Engaging. ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and arc in the normalised hypersphere. We intuitively illustrate what happens in the 512-D space via analysing the angle statistics between features and weights.
• Effective. ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets.
• Easy. ArcFace only needs several lines of code as given in Algorithm 1 and is extremely easy to implement in computational-graph-based deep learning frameworks, e.g. MxNet, Pytorch and Tensorflow. Furthermore, contrary to other works, ArcFace does not need to be combined with other loss functions in order to have stable performance, and can easily converge on any training dataset.
• Efficient. ArcFace only adds negligible computational complexity during training. Current GPUs can easily support millions of identities for training, and the model parallel strategy can easily support many more identities.
Margin-Loss: GDis(F_i, T_{y_i}) + m < GDis(F_i, T_j)
2.3.2 Transform Softmax to Arcface
2.3.2.1 Softmax
Figure 12: Softmax algorithm
• The batch size and the class number are m and n, respectively.
• The softmax loss function does not explicitly optimise the features to have a higher similarity score for positive pairs and a lower similarity score for negative pairs, which leads to a performance gap.
• Predictions only depend on the angle between the feature vector and the weight.
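For reference, the batch-averaged softmax cross-entropy discussed in these bullet points can be sketched as follows (an illustrative NumPy version; the name softmax_loss is ours):

```python
import numpy as np

def softmax_loss(logits, y):
    """Standard softmax cross-entropy averaged over a batch of size m.
    logits: (m, n) class scores; y: (m,) ground-truth labels."""
    z = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # negative log-probability of the correct class, averaged over the batch
    return -np.mean(np.log(p[np.arange(len(y)), y]))
```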
Trang 33cll#il[ cos(4y,)
ao log
||: || cos(Ø„; ) i 5||#¿ || cos Ø;
m 4 ell#: vi) +} »y ellz: Ì 5
Figure 14: Formula after adjustment
2.3.2.3 Multiplicative Angular Margin
In SphereFace, angular margin m is introduced by
multiplication on the angle:
L = -\frac{1}{m} \sum_{i=1}^{m} \log \frac{e^{\|x_i\| \cos(m\theta_{y_i})}}{e^{\|x_i\| \cos(m\theta_{y_i})} + \sum_{j=1,\, j \neq y_i}^{n} e^{\|x_i\| \cos\theta_j}}
Figure 15: Sphere Face Loss
where θ_yi belongs to [0, π/m]. In order to remove this restriction, cos(θ_yi) is substituted by a piece-wise monotonic function ψ(θ_yi).
Figure 16: Sphere Face Loss Piece Wise
Softmax supervision is incorporated to guarantee the convergence of training, and the weight is controlled by a dynamic hyper-parameter λ.
Trang 34e Features for good quality frontal faces have a high
L2-norm while blurry faces with extreme pose
have low L2-norm
e Gradient norm may be extremely large when the
feature norm from low-quality face image is very
small, which potentially increase the risk ofgradient explosion
• The intuitive insight behind feature and weight normalization is to remove the radial variation and push every feature to distribute on a hypersphere.
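The whole transformation of this section (normalise features and weights, recover the angle with the arc-cosine, add the additive angular margin m to the target angle, then re-scale by the fixed norm s) can be sketched as follows. This is an illustrative NumPy version with the paper's typical hyper-parameters s = 64 and m = 0.5, not the thesis's C/C++ HLS implementation, and the function name is ours:

```python
import numpy as np

def arcface_logits(x, W, y, s=64.0, m=0.5):
    """Sketch of the ArcFace logit adjustment for one batch.
    x: (batch, d) features; W: (d, n_classes) weights; y: (batch,) labels."""
    # normalise features and weight columns onto the unit hypersphere
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(x @ W, -1.0, 1.0)            # cos(theta) for every class
    theta = np.arccos(cos)                     # angle via the arc-cosine
    rows = np.arange(len(y))
    target = theta[rows, y]                    # angle to the ground-truth centre
    cos[rows, y] = np.cos(target + m)          # additive angular margin
    return s * cos                             # re-scale by the fixed norm s
```

Feeding these logits into the ordinary softmax cross-entropy yields the ArcFace loss; the margin only penalises the target class, so the target logit shrinks from s*cos(theta) to s*cos(theta + m).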