VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
INFORMATION SYSTEM FACULTY
PHI LONG NGUYEN
INFORMATION SYSTEM ENGINEERING
Ho Chi Minh City, 2022
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
INFORMATION SYSTEM FACULTY
PHI LONG NGUYEN - 18521043
Advisor: Dr. PHAN XUAN THIEN
Ho Chi Minh City, 2022
INFORMATION OF THE GRADUATE THESIS COUNCIL

The graduation thesis grading committee was established under Decision No. … dated … of the Rector of the University of Information Technology.

… - Chairman
… - Secretary
… - Commissioner
… - Commissioner
Acknowledgments
For the purpose of completing this graduation thesis, besides my own constant efforts, it is impossible not to mention the support and help of the teachers working at the University of Information Technology, VNU-HCMC. I would like to express my deep and sincere thanks to Dr. Phan Xuan Thien, my instructor. He has wholeheartedly helped me since the days I started to study deep learning and has trusted and encouraged me during difficult times while working on this thesis. In addition, I am incredibly grateful to him for contributing ideas from the days of applying for the topic. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my thesis.
In the process of implementation, despite my efforts to learn, research, and experiment, and although some encouraging initial results were achieved, shortcomings are inevitable due to my limited knowledge and experience. I look forward to receiving your comments to edit and improve the thesis.
Ho Chi Minh City, / /
Advisor Signature
Table of Contents

Chapter 1: Overview
1.1 Problem Statement
1.2 Problem Solution
1.3 Challenges
1.4 Goals and Study Scope
1.5 Thesis Structure
Chapter 2: Methods
2.1 Artificial Neural Network
2.2 Convolutional Neural Network
2.2.1 Convolutional Layer
2.2.1.1 The Input Data
2.2.1.2 Convolution Kernels
2.2.2 Activation Functions
2.2.2.1 ReLU
2.2.2.2 Leaky ReLU
2.2.3 Sub-sampling or Pooling Layer
2.2.4 Fully Connected Layer
2.2.4.1 Flattening
2.2.4.2 Dropout
2.3 Transfer Learning
2.3.1 Fine-tuning
2.3.2 Pre-trained ResNet Model
2.4 Experiments
2.4.1 Dataset Preparing
2.4.2 Image Preprocessing
2.4.3 Model Building
2.4.3.1 Model Initialization
2.4.3.2 Model Compiling
2.4.3.3 Model Training
2.4.4 Training Result
Chapter 3: Android Application
3.1 TensorFlow Lite
3.2 Wireframe Design
List of Figures

Figure 1.1 Seafood Production: wild fish catch vs aquaculture from 1960 to 2010
Figure 2.1 Feed-forward Neural Network
Figure 2.2 Artificial Neural Network Architecture
Figure 2.3 ANN Perceptron
Figure 2.4 Convolutional Neural Network Architecture
Figure 2.5 Convolutional Layer
Figure 2.6 Array of Pixels in a Digital Image
Figure 2.7 Grayscale Image and RGB Color Image
Figure 2.8 Kernel Types
Figure 2.9 Convolution Layer Operation
Figure 2.10 How the Convolution Layer Works
Figure 2.11 Convolutional Layer
Figure 2.12 Convolution Operation Problem
Figure 2.13 The Input Matrix with Padding p=1
Figure 2.14 Feature Map with Padding Applied
Figure 2.15 Input Matrix with Stride of One
Figure 2.16 Feature Map with Stride Applied
Figure 2.17 Rejected region when the kernel goes out of the matrix with s = 2
Figure 2.18 CNN for Car Recognizing
Figure 2.19 Linear Function and Non-linear Function
Figure 2.20 Linear Function Problem
Figure 2.21 Non-linear Function
Figure 2.22 Activation Functions
Figure 2.23 ReLU Function
Figure 2.24 ReLU Activation Mapped on Feature Map
Figure 2.25 Difference between ReLU and Leaky ReLU
Figure 2.26 Pooling Example
Figure 2.27 Max Pooling Example
Figure 2.28 Average Pooling Example
Figure 2.29 CNN Overview
Figure 2.30 Flattening Example
Figure 2.31 Overfitting Example
Figure 2.32 Dropout Example
Figure 2.33 Knowledge Transfer Example
Figure 2.34 Transfer Learning Architecture
Figure 2.35 VGG16 Pre-Trained Model after Fine-tuning
Figure 2.36 Training error and test error on CIFAR-10 with 20-layer and 56-layer networks
Figure 2.37 A residual block, the fundamental building block of residual networks
Figure 2.38 Example network architectures for ImageNet
Figure 2.39 Training curves on ImageNet-1K
Figure 2.40 A Large-Scale Fish Dataset from Kaggle
Figure 2.41 Fish Species Dataset from Mendeley Data
Figure 2.42 A Number of Images Are Used in the Thesis
Figure 2.43 Some Species Samples of the Dataset
Figure 2.44 Deep Learning Process
Figure 2.45 Input Images After Preprocessing
Figure 2.46 The Input Image Under the Machine's Perspective
Figure 2.47 Model Architecture
Figure 2.48 Model Summary
Figure 2.49 ResNet Architecture in Our Model
Figure 2.50 Feature Map After Blocks Applied
Figure 2.51 Deep Learning Frameworks Adoption 2021
Figure 2.52 One-Hot Encoder
Figure 2.53 Training Process
Figure 2.54 Model Accuracy Per Epoch
Figure 2.55 Classification Results
Figure 2.56 Evaluation on Test Dataset Result
Figure 3.1 TF Model to TFLite
Figure 3.2 Wireframe Design
Figure 3.3 System Architecture
Figure 3.4 Restful API. Source: E. Forbes, 2017
Figure 4.1 Our Project Restful API
Figure 4.2 API Data
Figure 4.3 Main Screen
Figure 4.4 History Screen
Figure 4.5 Two Options for Importing an Image
Figure 4.6 Image Scan Result
Figure 4.7 Camera Scan Result
Figure 5.1 Gantt Chart
List of Tables

Table 2.1 Comparison between ResNet-50 and ResNet-101 on ImageNet-1K
Table 2.2 Model Compile Configuration
Table 2.3 Model Training Configuration
Table 4.1 Pre-trained Models Details
Table 4.2 Pre-trained Models Training and Testing Results
List of Acronyms

AI: Artificial Intelligence
API: Application Programming Interface
ANN: Artificial Neural Network
CNN: Convolutional Neural Network
MLP: Multilayer Perceptron
NN: Neural Network
FC Layer: Fully-connected Layer
RNN: Recurrent Neural Network
ReLU: Rectified Linear Unit
ResNet: Residual Network
Abstract

The high demand for fishing nowadays causes the risk of overfishing, which can lead to the depletion of aquatic resources, especially river and sea fish. In addition, the lack of information about fish species and further details about fish may also affect the productivity and effectiveness of fishing activities.

Due to that fact, we believe that building a tool that can recognize fish automatically and supply detailed information about them, to improve the productivity of fishing and help the government and organizations control fishing activities more efficiently, is urgently necessary and important, especially in developing countries like Vietnam and some other countries. Toward that goal, we research and develop a method that can automatically detect and recognize fish species based on photos or scanned images of them. In our proposal, a Convolutional Neural Network is used for classification, and transfer learning is also applied to improve the accuracy of the classification. Our experiments show that the accuracy after validation is 89.13%, which is an acceptable and promising result. Moreover, to enhance the applicability of our proposal, we also build a software tool for Android mobile devices based on our method to help users approach and use our method easily and efficiently.
Keywords: Convolutional Neural Network, CNN, Deep Learning, Machine Learning,
Neural Network, Fish Classifications, ResNet, Keras.
Chapter 1: Overview
The content of this chapter presents the problem statement, an overview of the problem, the challenges encountered, the goals and scope of the thesis, and finally the layout of the thesis.
1.1 Problem Statement
According to the 2022 State of World Fisheries and Aquaculture (FAO, 2022), production of aquatic animals in 2020 was more than 60 percent higher than the average in the 1990s, significantly outpacing world population growth, largely due to increasing aquaculture production and fishing. Global consumption of aquatic foods increased at an average annual rate of 3.0 percent from 1961 to 2019, almost twice the rate of annual world population growth (1.6 percent) over the same period. Most countries saw a rise in their aquatic food consumption per capita during that time. In 2020, 89 percent (157 million tons) of world production was used for direct human consumption, compared with 67 percent in the 1960s, and this share is expected to continue to grow from 89 percent in 2020 to 90 percent by 2030 [1]. Therefore, in order to satisfy the demand for seafood, the rate of overfishing will increase. Some of the reasons for this problem are listed below:
1. Lack of knowledge regarding fish populations
2. Difficulties in regulating fishing areas due to lack of resources and tracking activity
3. Fishing areas are largely unprotected
To be more detailed, the lack of knowledge leads to overfishing: people will catch even small fish that are not within the allowed size. In addition, the insufficiency of information can cause people to catch and kill rare and precious fish species that need to be protected to maintain the ecosystem.
Figure 1.1 Seafood Production: wild fish catch vs aquaculture from 1960 to 2010. Aquaculture is the farming of aquatic organisms including fish, molluscs, crustaceans and aquatic plants; capture fishery production is the volume of wild fish catches landed for all commercial, industrial, recreational and subsistence purposes. Source: Food and Agriculture Organization of the United Nations (via World Bank), OurWorldInData.org/fish-and-overfishing, CC BY.
Figure 1.1 shows the fish catch and aquaculture data published by Hannah Ritchie and Max Roser. Globally, the share of fish stocks which are overexploited (we catch them faster than they can reproduce to sustain population levels) has more than doubled since the 1980s.¹
1.2 Problem Solution
Due to that fact, we believe that building a tool that can recognize fish automatically and supply detailed information about them, to improve the productivity of fishing and help the government and organizations control fishing activities more efficiently, is urgently necessary and important, especially in developing countries like Vietnam and some other countries. Toward that goal, we research and develop a method that can automatically detect and recognize fish species based on photos or scanned images of them.

¹ H. Ritchie and M. Roser, Our World in Data, 2021. https://ourworldindata.org/fish-and-overfishing
In our proposal, a CNN model is used for classification, and transfer learning is also applied to improve the accuracy of the classification. More specifically, we used the pre-trained model ResNet101, a CNN that is 101 layers deep; ResNet-101 is an improved model over the ResNet-50 and ResNet-50V2 versions. Our experiments show that the accuracy after validation is 89.13%, which is an acceptable and promising result. Moreover, to enhance the applicability of our proposal, we also build a software tool for Android mobile devices based on our method to help users approach and use our method easily and efficiently.
1.3 Challenges
Since we will have to recognize fish through existing photos or new ones taken with the camera, the image quality will not always be good under natural environmental conditions. In natural environments, any classification task is challenged by diversity in background complexity; turbidity and light propagation will also reduce the accuracy of deep learning.

Furthermore, we want the tool to be easy, convenient, and accessible for everyone, so we will have to bring deep learning to electronic devices: specifically, in this thesis, Android phones. This is also a challenge in identifying fish using CNN algorithms in deep learning. No current phone model can compare with a computer, so running the deep learning process, which can involve up to several hundred or even thousands of tasks and algorithms, is not easy for data science.
1.4 Goals and Study Scope
Before starting the thesis, we need to clarify the goal of the research direction we aim at: to help people know more information about fish individuals (e.g., scientific name, size, as well as habitat). We have studied image classification, thereby helping readers to understand more clearly how classification algorithms work and how computers learn to perform such tasks. Within the scope of the graduation thesis, the main objectives of this thesis include:
1. Research an overview of the problem of object recognition in images
2. Learn about the architectures in deep learning recognition models
3. Build models and apply them on Android phones
1.5 Thesis Structure

Chapter 2: Methods: In this chapter, the thesis presents the types of neural networks, their architectures, and how they work, along with a description of the dataset, the experimental process, and the model building.
Chapter 3: Android Application: This chapter goes into detail about the Android application in which the trained model is applied.
Chapter 4: Evaluation: This chapter evaluates the results of the model building process, presents some recommendations based on the obtained results, and states the problems that occurred during the experiments.
Chapter 5: Conclusion: Finally, chapter 5 summarizes the results and the knowledge of deep learning as well as the process of image classification, and provides ideas and development directions for the thesis topic.
Chapter 2: Methods
For image classification tasks, there are two distinct types of neural network in deep learning: the CNN and the ANN. There is also the RNN, but it is used for time series, text, and audio data, so we do not focus on it in this thesis. These types of neural networks are at the core of the deep learning revolution, powering applications such as face detection, speech recognition, and self-driving cars. Each type has different advantages as well as disadvantages.
2.1 Artificial Neural Network
An ANN is a group where we have multiple perceptrons or neurons at each layer; it is also known as a Feed-Forward Neural Network. A neural network is essentially made up of neurons and the connections between them. A neuron is a function that has numerous inputs and only one output. Its job is to take all the numbers from the input, apply a function to them, and output the result. The connections between neurons are like channels, which connect the output of one neuron with the input of another so they can send digits to each other. Each connection has only one parameter: the weight. All inputs are processed only in the forward direction. As information travels through numerous input nodes in one direction until it reaches the output nodes, this particular type of neural network is among the simplest neural network variations.
Figure 2.1 Feed-forward Neural Network
To prevent the network from falling into anarchy, the neurons are linked by layers, not randomly. Within a layer, neurons are not connected, but they are connected to neurons of the next and previous layers.

Figure 2.2 Artificial Neural Network Architecture
In Figure 2.2, there are a total of three layers in the ANN architecture: the input layer, the hidden layer, and the output layer. The input layer accepts the input data and passes it to the hidden layer; the hidden layer processes the input data; and the output layer produces the result. Each layer tries to learn weights. Data in the network flows strictly in one direction. Nevertheless, no one writes out individual neurons and connections when materializing a NN in the real world. For greater performance, everything is instead represented as matrices, and the calculation is based on matrix multiplication.
Figure 2.3 ANN Perceptron

Figure 2.3 shows the perceptron architecture. A NN is given a set of weights when it is initialized; during the training phase on the training set, these weights are then optimized, yielding the ideal weights. The weighted sum of the inputs is the initial calculation a neuron does:
$$Y = \sum_{i=1}^{n} (\text{weight}_i \times \text{input}_i) + \text{bias} = x_1 w_1 + x_2 w_2 + \cdots + x_n w_n + \text{bias}$$
Here the bias is simply a constant value which is added to the product of inputs and weights. The bias shifts the outcome of the activation function toward either the positive or the negative side. Furthermore, the activation function is the powerhouse of the ANN. There are many activation functions (e.g., linear, logistic, bipolar), but we will not focus much on them in this section. A small sketch of the weighted-sum computation follows.
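This sketch uses made-up input, weight, and bias values, with ReLU as an example activation:

```python
import numpy as np

# Minimal sketch of a single perceptron's forward pass (illustrative only;
# the weight, input, and bias values below are made up for demonstration).
def perceptron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
    # Weighted sum: Y = sum(w_i * x_i) + bias
    y = np.dot(weights, inputs) + bias
    # ReLU used here as an example activation
    return max(0.0, y)

x = np.array([0.5, 0.2, 0.1])   # example inputs
w = np.array([0.4, -0.3, 0.8])  # example weights
print(perceptron(x, w, bias=0.1))  # 0.32
```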
While solving an image classification problem using an ANN, the first step is to convert the 2-dimensional image into a 1-dimensional vector prior to training the model. This process has a drawback: the number of trainable parameters increases drastically with the size of the image. For instance, if the size of the image is 224x224 (with three color channels, that is 224 × 224 × 3 = 150,528 inputs), then the number of trainable parameters at the first hidden layer with just 4 neurons is estimated at around 600 thousand, so the ANN has a huge hardware dependency. Moreover, with its perceptron equation, there is no specific rule for determining the structure of an ANN; an appropriate network structure is achieved through experience, trial and error.
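As a quick check of that estimate, the short computation below assumes the three-channel 224x224 input from the example feeding a dense layer of 4 neurons:

```python
# Rough parameter count for the first dense layer of an ANN
# fed a flattened 224x224 RGB image (example values from the text).
height, width, channels = 224, 224, 3
inputs = height * width * channels  # 150,528 input values
neurons = 4
weights = inputs * neurons          # 602,112 weights
biases = neurons                    # one bias per neuron
print(weights + biases)             # 602,116 trainable parameters
```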
2.2 Convolutional Neural Network
In 2012, the AI system known as AlexNet (named after its main creator, Alex Krizhevsky) won the ImageNet computer vision contest with an amazing 85 percent accuracy. The runner-up scored a modest 74 percent on the test. At the heart of AlexNet was a CNN, a special type of neural network that roughly imitates human vision. Over the years, CNNs have become a particularly important part of many Computer Vision applications.
In numerous applications, CNN models are widely employed, but they are especially prevalent in image and video processing projects. This type of neural network computational model uses a variation of the MLP and contains one or more convolutional layers that can be either connected or pooled.

Furthermore, in image classification, a CNN tends to be a more powerful and accurate way of solving classification problems that an ANN can hardly solve or cannot solve at all. For example, in an ANN, concrete data points must be provided (e.g., the size of the snout and the length of the ears need to be explicitly included as data points in a model that attempts to differentiate between dogs and cats). When using a CNN, these spatial features are extracted from the image input. This makes a CNN ideal when thousands of features need to be extracted: instead of having to measure each individual feature, the CNN gathers these features on its own. In addition, as mentioned in the ANN part, image classification tasks become more challenging when 2-dimensional images need to be converted to 1-dimensional vectors, since this increases the number of trainable parameters and takes storage and processing capability. Meanwhile, a CNN can automatically detect the prominent characteristics without any human interference.
Apart from that, compared with an ANN, no neurons and weights are used in this neural network for feature extraction. Instead, a CNN uses filters to examine picture inputs and applies several layers to images. Those four layers are the mathematics layer (convolutional layer), the activation layer (i.e., ReLU, SoftMax), the pooling layer, and the fully connected layer. These layers' functions include processing data output and producing an n-dimensional vector in order to comprehend patterns that the network can "see". With the help of that n-dimensional output, distinct characteristics can be identified and connected to the provided image input in the fully connected layer. The user can then receive the classification output from it.
Figure 2.4 Convolutional Neural Network Architecture
In Figure 2.4, the CNN architecture can be split into two separate categories: the first category is feature learning, which is what happens from the head of the network to almost its tail end; at the very end of the tail is classification. So there are two parts: a feature learning part and a classification part. In the feature extraction part, we have three operations which can be repeated; we can call them Convolution Blocks. We first apply convolution, then we can apply ReLU or any activation on the output feature map (the result matrix from the convolution operation), and then pooling, also called a sub-sampling layer, is applied. The basic functionality of the CNN can be broken down into four key areas:
1. As found in other forms of ANN, the input layer will hold the pixel values of the image as a matrix and transfer it to the first convolutional layer.
2. The convolutional layer will determine the output of the layer through the calculation of the matrix product between its weights (values in the kernel matrix) and the region connected to the input volume. In some cases, ReLU applies an activation function to the output produced by the previous layer.
3. The pooling layer will then simply perform down-sampling along the spatial dimensionality of the given input, further reducing the number of parameters within that activation.
4. After that, the fully connected layers will operate the same tasks as in conventional ANNs and try to derive a class from the activations, which may be applied to classification. Additionally, it is suggested that ReLU might be applied between these layers to enhance performance.
Designing a CNN is quite difficult; it involves choosing many design features (i.e., the input and output size of each layer, where and when to apply dropout layers, what activation functions to use). For the best performance, we configured the model and trained it many times to get the best accuracy. A minimal sketch of such a stack is shown below.
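To make the four key areas concrete, here is a minimal, hypothetical Keras sketch (the layer sizes, the 224x224x3 input, and the 9-class output are illustrative assumptions, not the exact model trained in this thesis):

```python
# A minimal CNN sketch in Keras following the four key areas above.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(224, 224, 3)),             # input layer holds the pixel matrix
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU activation
    layers.MaxPooling2D((2, 2)),                   # pooling (down-sampling)
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # 2D feature maps -> 1D vector
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dense(9, activation="softmax"),         # one score per class (9 assumed)
])
model.summary()
```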
2.2.1 Convolutional Layer
The main building block of a CNN is the convolutional layer, and it always exists in image classification tasks. Its purpose is to detect the presence of a set of features in the image given as an input matrix. Convolution is a mathematical operation that merges two sets of information (i.e., the input matrix and the convolution kernel). The convolution layer is applied on the input data using a convolution kernel to produce a feature map. In convolutional layers, the weights are represented as the multiplicative factors of the filters (kernels).
in-a4 40-3400 + x*x + x * x x*.x* *x ^^ CO—¬CC—C—¬C—- H HH wd HH HH HH ^©O—=—C—=CCC.~
+
+
Convoluted feature
Figure 2.5 Convolutional Layer
Unlike in traditional methods (ANN), the features are not predefined; they are learned by the network during the training process. For more details, we move to the input data and the convolution kernel.
2.2.1.1 The Input Data
A CNN works with both grayscale and RGB images. From the computer's point of view, images are represented as arrays of pixel values. The pixel values can range from 0 to 255, and each number represents a color code. When using the image as input and passing it through the NN, the computation with high numeric values may become complex. To reduce this, we can normalize the pixel values to the range 0 to 1. In this way, the numbers are small, and the computation becomes easier and faster. To get the normalized value, we divide the pixel values by 255, which converts them to the range 0 to 1, as in the sketch below.
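A minimal sketch of this normalization, assuming the Pillow library is available and using a placeholder filename:

```python
import numpy as np
from PIL import Image  # assumes Pillow is installed

# Load an image and normalize its pixel values from [0, 255] to [0, 1].
# "fish.jpg" is a hypothetical filename for illustration.
img = np.asarray(Image.open("fish.jpg").convert("RGB"), dtype=np.float32)
print(img.shape)        # (height, width, 3) for an RGB image
normalized = img / 255.0
print(normalized.min(), normalized.max())  # values now within [0.0, 1.0]
```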
Figure 2.6 Array of Pixels in a Digital Image. This example image is the planet Venus, as viewed in reflected microwaves. Digital images are represented by a two-dimensional array of numbers, each called a pixel. In this image, the array is 200 rows by 200 columns, with each pixel a number between 0 and 255. When this image was acquired, the value of each pixel corresponded to the level of reflected microwave energy. A grayscale image is formed by assigning each of the 0 to 255 values to varying shades of gray.
There is only one color channel in a grayscale image, so a grayscale image is represented as (height, width, 1) or simply (height, width). An RGB image, by contrast, has three color channels (Red, Green, and Blue) and is therefore represented as (height, width, 3), where 3 denotes the number of channels in the image.
Figure 2.7 Grayscale Image and RGB Color Image
2.2.1.2 Convolution Kernels
A kernel or filter exists in the convolution layer as a set of filters whose parameters are learned throughout training. It is used to extract features from the image. The size of the kernel is usually smaller than that of the input volume. Each kernel is convolved with the input volume to compute an activation map. Kernel sizes are divided into smaller and larger ones: the frequently used small sizes are 1 × 1, 2 × 2, 3 × 3 and 4 × 4, whereas the larger ones are 5 × 5 and beyond. After winning the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012), the AlexNet CNN architecture submitted by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton² was introduced; it used larger kernel sizes like 11 × 11 and 5 × 5, which consumed two to three weeks in training. We no longer use such huge kernel sizes as a result of the considerably prolonged training time required and the cost.

² O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, 'ImageNet Large Scale Visual Recognition Challenge', 2015, p. 19.
Figure 2.8 Kernel Types (original/identity, Gaussian blur, sharpen, and edge detection kernels)
The basic functionality of the convolution layer operation can be summarized as below (a runnable sketch follows the list):

1. Take two matrices (which both have the same dimensions).
2. Multiply them, element by element.
3. Sum the elements together.
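The following sketch implements this sliding-window computation in plain NumPy (single channel, no padding, stride one; the example input and kernel values are arbitrary):

```python
import numpy as np

# Minimal sketch of the sliding-window (cross-correlation) operation
# described above, assuming a single-channel input and no padding.
def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]           # two same-size matrices
            feature_map[i, j] = np.sum(region * kernel)  # multiply element-wise, then sum
    return feature_map

image = np.arange(49, dtype=float).reshape(7, 7)  # a 7x7 example input
kernel = np.ones((3, 3)) / 9.0                    # a 3x3 averaging kernel
print(convolve2d(image, kernel).shape)            # (5, 5), matching (n - m + 1)
```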
We perform the convolution operation by sliding this filter over the input (Figure 2.10). At every position, we do the element-wise matrix multiplication and sum the result. This sum goes into the feature map. The matrix multiplication in a CNN can be split into two types of operation: the convolution operation and the cross-correlation operation.
Figure 2.10 How the Convolution Layer Works
Cross-Correlation Operation and Convolution Operation

Cross-correlation and convolution are both operations applied to images. The cross-correlation operation means sliding a kernel (filter) across an image:
$$G[m,n] = (f \star h)[m,n] = \sum_{j}\sum_{k} h[j,k]\, f[m+j,\, n+k]$$
where the input image is denoted by f and our kernel by h. The indexes of the rows and columns of the result matrix are marked with m and n, while j and k represent the position of the row and column in the kernel and its reflection on the input data, respectively. In the convolution operation, the kernel is flipped before sliding, and the computation otherwise works like cross-correlation.
Padding and Stride
With the purpose of calculating the result matrix size, take the example of a simplified image. As shown in Figure 2.11, the input matrix has both a height and width of seven and the kernel matrix has both a height and width of three, so we get an output shape of 5 × 5. Assuming that the input shape is $n_h \times n_w$ and the convolution kernel shape is $m_h \times m_w$, the equation to calculate the output shape is:

$$(n_h - m_h + 1) \times (n_w - m_w + 1)$$
Figure 2.11 Convolutional Layer
When the input image goes through the convolution operation as mentioned in the previous part, the output matrix shrinks, and if we take the output and keep repeating, we get a small image after 'filtering' at the end. In the example of Figure 2.11, we have an input matrix of dimension 7 × 7 and a kernel of dimension 3 × 3, and we get a result matrix of 5 × 5. Thus, there are two problems when we convolve a kernel with an image:
1. The shrinking output: every time we apply a convolution operation, our image shrinks. It goes from seven-by-seven down to five-by-five, for instance. We can only apply the convolution operation a few times before the image starts getting tiny, which can make the classification task less accurate.
2. Throwing away information from the edges of the image: in Figure 2.12, the four orange rectangles represent the four corners of the input matrix, and the dotted rectangle represents each sliding position of the kernel during the convolution operation. If we take a pixel at one of the four corners of the matrix, the kernel covers it only once per operation, so we miss a lot of information at each sliding step. Meanwhile, if we take a pixel in the middle, there are many three-by-three kernel regions that slide over and overlap it. Comparing how many times the middle pixel is covered with the corner ones shows that our convolution operation works inefficiently.
Figure 2.12 Convolution Operation Problem
With the purpose of solving both of the problems above, what we can do before applying the convolution operation is pad the input matrix. Padding simply means adding extra pixels outside the image; we can imagine it as a border around the matrix.
Figure 2.13 The Input Matrix with Padding p=1
After we apply padding with parameter p = 1 on the input matrix, as in Figure 2.13, the four corner pixels of the original image are no longer on the edge of the matrix. Instead, they become pixels inside the matrix. In this way, we will not throw away any information from the edges of the image. Furthermore, we also solve the problem of the image shrinking to a tiny size after each convolution operation. Because the size of the input matrix has changed, the equation used to calculate the output matrix size, $(n_h - m_h + 1) \times (n_w - m_w + 1)$, must be changed; the output becomes:

$$(n_h - m_h + 2p + 1) \times (n_w - m_w + 2p + 1)$$
where $n_h \times n_w$ is the original size of the input matrix. Using the example of Figure 2.13, our input matrix after padding becomes a nine-by-nine image, and if we convolve it with a kernel of size three, we get a seven-by-seven output image, the same as the original image. Thus, we managed to preserve the original input size.
Figure 2.14 Feature Map with Padding Applied
For a deep learning project, we can face an extremely deep NN with one hundred layers or even more. Therefore, padding is necessary in convolutional layers to solve such training problems.

Figure 2.15 Input Matrix with Stride of One. Source: Seb, Programmathically
Because of the stride action, the step range per slide changes. Thus, the equation for the outcome matrix becomes, with $s_h$ the stride for the height and $s_w$ the stride for the width:

$$\left\lfloor \frac{n_h - m_h + 2p + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{n_w - m_w + 2p + s_w}{s_w} \right\rfloor$$
Figure 2.16 Feature Map with Stride Applied
In some cases, when the stride parameter is big, the matrix does not have enough pixels left for the kernel; we then completely discard that region. The helper below sketches the resulting feature map size.
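This small helper follows the formula above and makes the floor behavior explicit (a sketch; symmetric padding p and a single stride value are assumed):

```python
import math

# Output size of a convolution along one dimension, following the formula
# above (n = input size, m = kernel size, p = padding, s = stride).
# The floor accounts for positions where the kernel would run off the matrix.
def conv_output_size(n: int, m: int, p: int = 0, s: int = 1) -> int:
    return math.floor((n - m + 2 * p + s) / s)

print(conv_output_size(7, 3))            # 5  -> the 7x7 input, 3x3 kernel example
print(conv_output_size(7, 3, p=1))       # 7  -> padding p=1 preserves the size
print(conv_output_size(7, 3, p=0, s=2))  # 3  -> stride 2 skips positions
```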
Figure 2.17 Rejected region when the kernel goes out of the matrix with s = 2
Padding and stride are two techniques used to improve convolution operations and make them more efficient. Same padding is especially important in the training of very deep neural networks: if we have a lot of layers, it becomes increasingly difficult to keep track of the dimensionality of the outputs if the dimensions change in every layer. Furthermore, the size of the feature maps would be reduced at every layer, resulting in information loss at the borders. This is likely to depress the performance of the neural network. Stride, on the other hand, has lost some of its importance in practical applications due to the increase in computational power available to deep learning practitioners.
2.2.2 Activation Functions
An activation function is a function used in an ANN to compute the weighted sum of inputs and bias (recall Figure 2.3 for more details), which defines the output of a neuron that contains the parameters in the data. These activation functions can be either linear or non-linear depending on the function they represent. In deep learning, extremely complicated tasks (e.g., image classification, object detection, speech-to-text transformation) need to be addressed with the help of NNs and activation functions. The output of a convolutional layer goes through an activation function before it becomes the input matrix for the next convolutional layer.
Figure 2.19 Linear Function and Non-linear Function
On the other hand, with a complicated dataset, a linear function faces a problem when trying to cover all the information (Figure 2.20). Meanwhile, a non-linear function can handle the task better (Figure 2.21).
Figure 2.20 Linear Function Problem
2.2.2.1 ReLU

ReLU is the most widely used activation function. ReLU equals x if x > 0 and 0 if x < 0, where x is the output parameter in the feature map:

$$f(x) = \max(0, x)$$
The above function means that the negative values in the feature map are mapped to zero and the positive values are returned (Figure 2.24).
Figure 2.24 ReLU Activation Mapped on Feature Map. Source: ResearchGate
On the other hand, nothing is perfect: ReLU still has a problem where a ReLU activation function always outputs the same value (zero, as it happens) for any input. This problem, called 'Dying ReLU', arises when a large negative bias is learned so that the ReLU's inputs are all negative, causing zeros to be mapped onto the feature map. Once a ReLU ends up in this state, it is unlikely to recover, because with 0 values in the feature map, the filter cannot alter the weights for the next convolution blocks (see the convolution layer operation in the previous part).
2.2.2.2 Leaky ReLU
Another popular function is Leaky ReLU; in some respects this function works the same as ReLU. However, it can be used to solve the Dying ReLU problem: Leaky ReLU has a non-zero gradient, so the slope is non-zero on the negative side of its graph. The equation of Leaky ReLU for solving Dying ReLU is:

$$f(x) = \max(0.01x, x)$$

A small sketch of both activations is shown below.
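Both activations can be sketched in a few lines of NumPy (the example feature map values are arbitrary):

```python
import numpy as np

# Sketch of ReLU and Leaky ReLU applied element-wise to a feature map.
def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)        # f(x) = max(0, x)

def leaky_relu(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    return np.maximum(alpha * x, x)  # f(x) = max(0.01x, x)

feature_map = np.array([[-3.0, 1.5], [0.5, -0.2]])
print(relu(feature_map))        # negatives mapped to 0
print(leaky_relu(feature_map))  # negatives scaled by 0.01 instead of zeroed
```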
Figure 2.25 Difference between ReLU and Leaky ReLU. Source: ResearchGate
2.2.3 Sub-sampling or Pooling Layer
Pooling layers are used to reduce the dimensions of the feature maps. As a result, the amount of computation done in the network is reduced, along with the number of parameters that must be learned. This type of layer is often placed between two convolutional layers. The pooling layer improves the efficiency of the network and avoids over-learning. The basic principle of pooling is remarkably similar to the convolution operation: it still has a kernel which slides over the input matrix. The most commonly used kernel size is 2 × 2, sliding with a stride of 2.

There are several approaches to pooling; the most used are max-pooling and average-pooling. Based on the type of pooling operation applied, the pooling kernel calculates an output for the output matrix. Nevertheless, the pooling kernel is not the same as the one in the convolution operation: it does not contain any weights. A minimal sketch of both variants follows.
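Here is that sketch, assuming a 2x2 kernel with stride 2 and an input whose sides are even:

```python
import numpy as np

# Sketch of 2x2 max-pooling and average-pooling with stride 2.
def pool2d(feature_map: np.ndarray, mode: str = "max") -> np.ndarray:
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = feature_map[i:i + 2, j:j + 2]  # weightless 2x2 kernel
            out[i // 2, j // 2] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [4, 8, 3, 1]], dtype=float)
print(pool2d(fm, "max"))  # [[6. 4.] [8. 9.]]
print(pool2d(fm, "avg"))  # [[3.75 2.25] [5.25 3.25]]
```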
³ J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler, New York University
Figure 2.28 Average Pooling Example
2.2.4 Fully Connected Layer
The fully connected layer is the last layer in a CNN. The basic approach of the fully connected layer is to learn the characteristics associated with each label and classify them. After the image has passed through multiple convolution and pooling layers, we have learned the relative characteristics of the image. Before entering the fully connected layer, because these layers only accept 1-dimensional data while our data is not, the output of the last layer in the convolutional block goes through a function called flatten (Figure 2.29).

2.2.4.1 Flattening
This is a plain step. After pooling, we get the pooled feature map, which has a matrix size of three, for instance (Figure 2.30). Our next mission is to convert this three-by-three matrix into a single column. We supply these numbers as input values to the FC layer, which is why flattening is necessary; a minimal sketch follows.
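Here is the sketch, using an arbitrary 3x3 pooled feature map:

```python
import numpy as np

# Sketch of flattening: a 3x3 pooled feature map becomes a single
# column of 9 values that can feed the fully connected layer.
pooled_feature_map = np.array([[1, 0, 2],
                               [4, 3, 1],
                               [0, 5, 2]])
flat = pooled_feature_map.flatten()  # row by row into one vector
print(flat)        # [1 0 2 4 3 1 0 5 2]
print(flat.shape)  # (9,)
```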
Figure 2.30 Flattening Example