
DOCUMENT INFORMATION

Title: Building An Android Application For Fish Recognition
Author: Phi Long Nguyen
Advisor: Dr. Phan Xuan Thien
University: University of Information Technology
Major: Information System Engineering
Document type: Graduation Thesis
Year: 2022
City: Ho Chi Minh City
Pages: 85
File size: 49.13 MB



VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

INFORMATION SYSTEM FACULTY

PHI LONG NGUYEN

BUILDING AN ANDROID APPLICATION FOR FISH RECOGNITION

INFORMATION SYSTEM ENGINEERING

Ho Chi Minh City, 2022


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

INFORMATION SYSTEM FACULTY

PHI LONG NGUYEN - 18521043

BUILDING AN ANDROID APPLICATION FOR FISH RECOGNITION

Advisor: Dr. PHAN XUAN THIEN

Ho Chi Minh City, 2022

INFORMATION OF THE GRADUATE THESIS COUNCIL

The graduation thesis grading committee was established under Decision No. ........, dated ........, of the Rector of the University of Information Technology:

Chairman: ........................
Secretary: ........................
Commissioner: ........................
Commissioner: ........................

ACKNOWLEDGEMENTS

For the purposes of completing this graduation thesis, besides my own constant efforts, it is impossible not to mention the support and help of the teachers working at the University of Information Technology, VNU-HCMC. I would like to express my deep and sincere thanks to Dr. Phan Xuan Thien, my instructor. He has wholeheartedly helped me since the days I started to study deep learning and has trusted and encouraged me during difficult times while working on this thesis. In addition, I am incredibly grateful to him for contributing ideas from the days of applying for the topic. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my thesis.

In the process of implementation, despite efforts to learn, research, and experiment that initially achieved some encouraging results, shortcomings are inevitable due to limited knowledge and experience. I look forward to receiving your comments to revise and improve the thesis.

Ho Chi Minh City, / /

Advisor Signature

Table of Contents

Chapter 1: Overview
  1.1 Problem Statement
  1.2 Problem Solution
  1.3 Challenges
  1.4 Goals and Study Scope
  1.5 Thesis Structure
Chapter 2: Methods
  2.1 Artificial Neural Network
  2.2 Convolutional Neural Network
    2.2.1 Convolutional Layer
      2.2.1.1 The Input Data
      2.2.1.2 Convolution Kernels
    2.2.2 Activation Functions
      2.2.2.1 ReLU
      2.2.2.2 Leaky ReLU
    2.2.3 Sub-sampling or Pooling Layer
    2.2.4 Fully Connected Layer
      2.2.4.1 Flattening
      2.2.4.2 Dropout
  2.3 Transfer Learning
    2.3.1 Fine-tuning
    2.3.2 Pre-trained ResNet Model
  2.4 Experiments
    2.4.1 Dataset Preparing
    2.4.2 Image Preprocessing
    2.4.3 Model Building
      2.4.3.1 Model Initialization
      2.4.3.2 Model Compiling
      2.4.3.3 Model Training
    2.4.4 Training Result
Chapter 3: Android Application
  3.1 TensorFlow Lite
  3.2 Wireframe Design

List of Figures

Figure 1.1 Seafood Production: wild fish catch vs aquaculture from 1960 to 2010
Figure 2.1 Feed-forward Neural Network
Figure 2.2 Artificial Neural Network Architecture
Figure 2.3 ANN Perceptron
Figure 2.4 Convolutional Neural Network Architecture
Figure 2.5 Convolutional Layer
Figure 2.6 Array of Pixels in a Digital Image
Figure 2.7 Grayscale Image and RGB Color Image
Figure 2.8 Kernel Types
Figure 2.9 Convolution Layer Operation
Figure 2.10 How the Convolution Layer Works
Figure 2.11 Convolutional Layer
Figure 2.12 Convolution Operation Problem
Figure 2.13 The Input Matrix with Padding p=1
Figure 2.14 Feature Map with Padding Applied
Figure 2.15 Input Matrix with Stride of One
Figure 2.16 Feature Map with Stride Applied
Figure 2.17 Rejected Region when the Kernel Goes Out of the Matrix with s=2
Figure 2.18 CNN for Car Recognition
Figure 2.19 Linear Function and Non-linear Function
Figure 2.20 Linear Function Problem
Figure 2.21 Non-linear Function
Figure 2.22 Activation Functions
Figure 2.23 ReLU Function
Figure 2.24 ReLU Activation Mapped on Feature Map
Figure 2.25 Difference between ReLU and Leaky ReLU
Figure 2.26 Pooling Example
Figure 2.27 Max Pooling Example
Figure 2.28 Average Pooling Example
Figure 2.29 CNN Overview
Figure 2.30 Flattening Example
Figure 2.31 Overfitting Example
Figure 2.32 Dropout Example
Figure 2.33 Knowledge Transfer Example
Figure 2.34 Transfer Learning Architecture
Figure 2.35 VGG16 Pre-Trained Model after Fine-tuning
Figure 2.36 Training Error and Test Error on CIFAR-10 with 20-layer and 56-layer Networks
Figure 2.37 A Residual Block, the Fundamental Building Block of Residual Networks
Figure 2.38 Example Network Architectures for ImageNet
Figure 2.39 Training Curves on ImageNet-1K
Figure 2.40 A Large-Scale Fish Dataset from Kaggle
Figure 2.41 Fish Species Dataset from Mendeley Data
Figure 2.42 A Number of Images Are Used in the Thesis
Figure 2.43 Some Species Samples of the Dataset
Figure 2.44 Deep Learning Process
Figure 2.45 Input Images After Preprocessing
Figure 2.46 The Input Image Under the Machine's Perspective
Figure 2.47 Model Architecture
Figure 2.48 Model Summary
Figure 2.49 ResNet Architecture in Our Model
Figure 2.50 Feature Maps After Blocks Applied
Figure 2.51 Deep Learning Frameworks Adoption 2021
Figure 2.52 One-Hot Encoder
Figure 2.53 Training Process
Figure 2.54 Model Accuracy per Epoch
Figure 2.55 Classification Results
Figure 2.56 Evaluation on Test Dataset Result
Figure 3.1 TF Model to TFLite
Figure 3.2 Wireframe Design
Figure 3.3 System Architecture
Figure 3.4 RESTful API (Source: E. Forbes, 2017)
Figure 4.1 Our Project's RESTful API
Figure 4.2 API Data
Figure 4.3 Main Screen
Figure 4.4 History Screen
Figure 4.5 Two Options for Importing an Image
Figure 4.6 Image Scan Result
Figure 4.7 Camera Scan Result
Figure 5.1 Gantt Chart

List of Tables

Table 2.1 Comparison between ResNet-50 and ResNet-101 on ImageNet-1K
Table 2.2 Model Compile Configuration
Table 2.3 Model Training Configuration
Table 4.1 Pre-trained Model Details
Table 4.2 Pre-trained Model Training and Testing Results

List of Acronyms

AI: Artificial Intelligence
API: Application Programming Interface
ANN: Artificial Neural Network
CNN: Convolutional Neural Network
MLP: Multilayer Perceptron
NN: Neural Network
FC Layer: Fully-connected Layer
RNN: Recurrent Neural Network
ReLU: Rectified Linear Unit
ResNet: Residual Network

Abstract

The high demand for fishing nowadays causes the risk of overfishing, which can lead to the depletion of aquatic resources, especially river and sea fishes. In addition, the lack of information about fish species and further details of fishes may also affect the productivity and effectiveness of fishing activities.

Due to that fact, we believe that building a tool that can recognize fishes automatically and supply detailed information about fishes, to improve the productivity of fishing and help the government and organizations control fishing activities more efficiently, is urgently necessary and important, especially in developing countries like Vietnam and some other countries. Toward that goal, we research and develop a method that can automatically detect and recognize fish species based on photos or scanned images of them. In our proposal, a Convolutional Neural Network is used for classification, and transfer learning is also applied to improve the accuracy of the classification. Our experiments show that the accuracy after validating is 89.13%, which is an acceptable and promising result. Moreover, to enhance the applicability of our proposal, we also build a software tool for Android mobile devices based on our method to help users approach and use our method easily and efficiently.

Keywords: Convolutional Neural Network, CNN, Deep Learning, Machine Learning, Neural Network, Fish Classification, ResNet, Keras.

Chapter 1: Overview

The content of this chapter presents the problem statement, an overview of the problem, the challenges encountered, the goals and scope of the thesis, and finally the layout of the thesis.

1.1 Problem Statement

According to the 2022 State of World Fisheries and Aquaculture (FAO, 2022), production of aquatic animals in 2020 was more than 60 percent higher than the average in the 1990s, significantly outpacing world population growth, largely due to increasing aquaculture production and fishing. Global consumption of aquatic foods increased at an average annual rate of 3.0 percent from 1961 to 2019, a rate almost twice that of annual world population growth (1.6 percent) over the same period. Most countries saw a rise in their aquatic food consumption per capita in that time. In 2020, 89 percent (157 million tons) of world production was used for direct human consumption, compared with 67 percent in the 1960s, and this share is expected to continue to grow from 89 percent in 2020 to 90 percent by 2030 [1]. Therefore, in order to satisfy the demand for seafood supply, the rate of overfishing will increase. Some of the reasons for this problem are listed below:

1. Lack of knowledge regarding fish populations.

2. Difficulties in regulating fishing areas due to lack of resources and activity tracking.

3. Fishing areas are largely unprotected.

To be more detailed, the lack of knowledge leads to the situation of overfishing, where people catch even small fish that are not within the allowed size. In addition, the insufficiency of information can cause people to catch and kill rare and precious fish species that need to be protected to maintain the ecosystem.

(Chart: Seafood production, wild fish catch vs aquaculture, World. Aquaculture is the farming of aquatic organisms including fish, molluscs, crustaceans and aquatic plants. Capture fishery production is the volume of wild fish catches landed for all commercial, industrial, recreational and subsistence purposes. Source: Food and Agriculture Organization of the United Nations, via World Bank; OurWorldInData.org/fish-and-overfishing; CC BY.)

Figure 1.1 Seafood Production: wild fish catch vs aquaculture from 1960 to 2010

Figure 1.1 shows the fish catch and aquaculture data published by Hannah Ritchie and Max Roser. Globally, the share of fish stocks which are overexploited (we catch them faster than they can reproduce to sustain population levels) has more than doubled since the 1980s.¹

1.2 Problem Solution

Due to that fact, we believe that building a tool that can recognize fishes automatically and supply detailed information about fishes, improving the productivity of fishing and helping the government and organizations control fishing activities more efficiently, is urgently necessary and important, especially in developing countries like Vietnam and some other countries. Toward that goal, we research and develop a method that can automatically detect and recognize fish species based on photos or scanned images of them.

In our proposal, a CNN model is used for classification, and transfer learning is also applied to improve the accuracy of the classification. More specifically, we used the pre-trained model ResNet101, a CNN that is 101 layers deep. ResNet-101 is an improved model over the ResNet-50 and ResNet-50V2 versions. Our experiments show that the accuracy after validating is 89.13%, which is an acceptable and promising result. Moreover, to enhance the applicability of our proposal, we also build a software tool for Android mobile devices based on our method to help users approach and use our method easily and efficiently.

¹ H. Ritchie and M. Roser, Our World in Data, 2021. https://ourworldindata.org/fish-and-overfishing

1.3 Challenges

Since we will have to recognize fishes through existing photos or new ones taken with the camera, the image quality will not be as good as under controlled conditions. In natural environments, any classification task is challenged by diversity in background complexity, turbidity, and light propagation, all of which reduce the accuracy of deep learning.

Furthermore, we want the tool to be easy, convenient, and accessible for everyone, so we will have to bring deep learning to electronic devices, specifically, in this thesis, Android phones. This is also a challenge in identifying fishes using CNN algorithms in deep learning. Among current phone models, no device can compare with a computer, so running the deep learning process, which can involve up to several hundred or even thousands of tasks and algorithms, is not easy for data science.

1.4 Goals and Study Scope

Before starting the thesis, we need to clarify the goal of the research direction that we aim at: to help people know more information about fish individuals (e.g., scientific name, size, as well as habitat). We have studied image classification, thereby helping readers to understand more clearly how classification algorithms work and how computers learn to perform such tasks. Within the scope of the graduation thesis, the main objectives of this thesis will include:

1. Research an overview of the problem of object recognition in images.

2. Learn about the architectures in deep learning recognition models.

3. Build models and apply them on Android phones.

1.5 Thesis Structure

Chapter 2: Methods: In this chapter, the thesis presents the types of neural networks and their architecture, as well as how they work. In addition, it describes the dataset, the experimental process, and the model building.

Chapter 3: Android Application: This chapter goes into detail about the Android application to which the trained model is applied.

Chapter 4: Evaluation: This chapter evaluates the results of the model building process and presents some recommendations based on the obtained results. It also states the problems that occurred during the experiments.

Chapter 5: Conclusion: Finally, chapter 5 summarizes the results, the knowledge of deep learning, as well as the process of image classification, and provides ideas and development directions for the thesis topic.

Chapter 2: Methods

For image classification tasks, there are two distinct types of neural networks in deep learning: CNN and ANN. There are also RNNs, but they are used for time series, text, and audio data, so we do not focus on them in this thesis. These types of neural networks are at the core of the deep learning revolution, powering applications such as face detection, speech recognition, and self-driving cars. Each type has different advantages as well as disadvantages.

2.1 Artificial Neural Network

An ANN is a group where we have multiple perceptrons or neurons at each layer, and it is also known as a Feed-Forward Neural Network. A neural network is essentially made up of neurons and the connections between them. A neuron is a function that has numerous inputs and only one output. Its job is to take all the numbers from its input, apply a function to them, and output the result. The connections between neurons are like channels which connect the output of one neuron with the input of another so they can send digits to each other. Each connection has only one parameter: its weight. All inputs are processed only in the forward direction. As information travels through numerous input nodes in one direction until it reaches the output node, this particular type of neural network is among the simplest neural network variations.

Figure 2.1 Feed-forward Neural Network

To prevent the network from falling into anarchy, the neurons are linked by layers, not randomly. Within a layer, neurons are not connected, but they are connected to the neurons of the next and previous layers.

Figure 2.2 Artificial Neural Network Architecture

In Figure 2.2, there are three layers in the ANN architecture: the input layer, the hidden layer, and the output layer. The input layer accepts the input data and passes it to the hidden layer; the hidden layer processes the input data; and the output layer produces the result. Each layer tries to learn weights. Data in the network goes strictly in one direction. Nevertheless, no one writes out neurons and connections when materializing a NN in the real world. For greater performance, everything is instead represented as matrices, and the calculation is based on matrix multiplication.

Figure 2.3 ANN Perceptron

Figure 2.3 shows the perceptron architecture. A NN is given a set of weights when it is initialized, before being trained on the training set. During the training phase, these weights are then optimized, yielding the ideal weights. The weighted sum of the inputs is the initial calculation a neuron does:

$$Y = \sum_{i} (\text{weight}_i \times \text{input}_i) + \text{bias} = x_1 w_1 + x_2 w_2 + \cdots + x_n w_n + \text{bias}$$

Bias is simply a constant value which is added to the product of inputs and weights. Using the bias, the activation function's outcome is shifted either to the positive or negative side. Furthermore, the activation function is the powerhouse of an ANN. There are many activation functions (e.g., linear, logistic, bipolar), but we will not focus much on them in this section.
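As a small illustration of this computation, here is a NumPy sketch of a single perceptron; the sigmoid activation is an arbitrary choice for the example, not the function used later in the thesis.

```python
import numpy as np

def perceptron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """Weighted sum of the inputs plus bias, passed through an activation."""
    y = np.dot(weights, inputs) + bias   # Y = x1*w1 + x2*w2 + ... + xn*wn + bias
    return 1.0 / (1.0 + np.exp(-y))      # sigmoid activation (one of many options)

x = np.array([0.5, 0.3, 0.2])            # example inputs
w = np.array([0.4, 0.7, 0.2])            # example weights
print(perceptron(x, w, bias=0.1))
```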

While solving an image classification problem using an ANN, the first step is to convert the 2-dimensional image into a 1-dimensional vector prior to training the model. This process has a drawback: the number of trainable parameters increases drastically with the size of the image. For instance, if the size of the image is 224x224, the number of trainable parameters at the first hidden layer with just 4 neurons is around 600 thousand, so an ANN has a huge hardware dependency. Moreover, given its perceptron equation, there is no specific rule for determining the structure of an ANN, and an appropriate network structure is achieved through experience, trial and error.
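A quick back-of-the-envelope check of that parameter count, assuming the 224x224 image has three RGB channels:

```python
height, width, channels = 224, 224, 3
neurons = 4

inputs = height * width * channels   # 150,528 values after flattening
weights = inputs * neurons           # 602,112 weights in the first hidden layer
params = weights + neurons           # plus one bias per neuron
print(params)                        # 602,116, i.e., around 600 thousand
```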

2.2 Convolutional Neural Network

In 2012, the AI system known as AlexNet (named after its main creator, Alex Krizhevsky) won the ImageNet computer vision contest with an amazing 85 percent accuracy. The runner-up scored a modest 74 percent on the test. At the heart of AlexNet was a CNN, a special type of neural network that roughly imitates human vision. Over the years, CNNs have become a particularly important part of many computer vision applications.

In numerous applications, CNN models are widely employed, but they are especially prevalent in image and video processing projects. This type of neural network computational model uses a variation of MLP and contains one or more convolutional layers that can be either connected or pooled.

Furthermore, in image classification, CNN tends to be a more powerful and accurate way of solving classification problems that an ANN can hardly solve, or cannot solve at all. For example, in an ANN, the concrete data points must be provided (e.g., the size of the snout and the length of the ears need to be explicitly included as data points in a model that attempts to differentiate between dogs and cats). When using a CNN, these spatial features are extracted from the image input. This makes CNN ideal when thousands of features need to be extracted. Instead of having to measure each individual feature, CNN gathers these features on its own. In addition, as mentioned in the ANN part, image classification tasks become more challenging when 2-dimensional images need to be converted to 1-dimensional vectors, which increases the trainable parameters and takes storage and processing capability. Meanwhile, CNN can automatically detect the prominent characteristics without any human interference.

Apart from that, compared with an ANN, no neurons and weights are used in this neural network for feature extraction. Instead, CNN uses filters to examine picture inputs and applies several layers to images. Those four layers are the mathematics layer (convolutional layer), the activation layer (i.e., ReLU, SoftMax), the pooling layer, and the fully connected layer. These layers' functions include processing data output and producing an n-dimensional vector in order to comprehend patterns that the network can "see". With the help of that n-dimensional output, distinct characteristics can be identified and connected to the provided image input in the fully connected layer. The user can then receive the classification output from it.


Figure 2.4 Convolutional Neural Network Architecture

In Figure 2.4, the CNN architecture can be split into two separate categories: the first category is feature learning, which is what happens from the head to the middle and almost to the tail end of the network; at the end of the tail is classification. So there are two parts: the feature learning part and the classification part. In the feature extraction part, we have three operations which can be applied repeatedly; we can call them Convolution Blocks. We first apply convolution, then we can apply ReLU or any activation on the output of the feature map, which is the result matrix from the convolution operation, and then pooling, also called the sub-sampling layer, is applied. The basic functionality of the CNN can be broken down into four key areas:

1. As found in other forms of ANN, the input layer will hold the pixel values of the image as a matrix and transfer it to the first convolutional layer.

2. The convolutional layer will determine the output of the layer through the calculation of the matrix product between its weights (the values in the kernel matrix) and the region connected to the input volume. In some cases, ReLU is applied as an activation function to the output produced by the previous layer.

3. The pooling layer will then simply perform down-sampling along the spatial dimensionality of the given input, further reducing the number of parameters within that activation.

4. After that, the fully connected layers will perform the same tasks as in conventional ANNs and try to derive a class from the activations, which may be applied to classification. Additionally, it is suggested that ReLU might be applied between these layers to enhance performance.

Designing a CNN is quite difficult; it involves choosing many design features (i.e., the input and output size of each layer, where and when to apply dropout layers, and what activation functions to use). For the best performance, we configured the model and trained it many times to get the best percentage accuracy. A minimal sketch of the whole pipeline follows.
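To make the four stages concrete, here is a minimal Keras sketch of such a convolution, activation, pooling, and fully connected pipeline. The layer sizes and the class count are illustrative assumptions, not the final model of this thesis.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # 1. input: the pixel matrix
    layers.Conv2D(32, (3, 3), activation="relu"),  # 2. convolution + ReLU
    layers.MaxPooling2D((2, 2)),                   # 3. pooling: down-sampling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # flatten before the FC layers
    layers.Dense(128, activation="relu"),          # 4. fully connected layers
    layers.Dense(9, activation="softmax"),         # assumed number of classes
])
model.summary()
```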

2.2.1 Convolutional Layer

The main building block of a CNN is the convolutional layer, which always exists in image classification tasks. Its purpose is to detect the presence of a set of features in the image given as the input matrix. Convolution is a mathematical operation that merges two sets of information (i.e., the input matrix and the convolution kernel). The convolution layer is applied to the input data using a convolution kernel to produce a feature map. In convolutional layers, the weights are represented as the multiplicative factors of the filters (kernels).


Figure 2.5 Convolutional Layer

Unlike traditional methods (ANN), the features are not predefined. They are learned by the network during the training process. For more details, we move to the input data and the convolution kernel.

2.2.1.1 The Input Data

CNN works with both grayscale and RGB images. From the computer's point of view, images are represented as arrays of pixel values. The pixel values can range from 0 to 255, and each number represents a color code. When using the image as input and passing it through the NN, the computation of high numeric values may become more complex. To reduce this, we can normalize the pixel values to the range 0 to 1. In this way, the numbers are small, and the computation becomes easier and faster. To get the normalized values, we divide the pixel values by 255, which converts them to the range 0 to 1.
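A one-line NumPy version of this normalization; the random array here is only a stand-in for a real decoded image:

```python
import numpy as np

# Stand-in for a decoded RGB image: height x width x 3 channels, values 0..255.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

normalized = image.astype(np.float32) / 255.0   # every value now lies in [0, 1]
print(normalized.min(), normalized.max())
```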

Digital image structure: this example image is the planet Venus, as viewed in reflected microwaves. Digital images are represented by a two-dimensional array of numbers, each called a pixel. In this image, the array is 200 rows by 200 columns, with each pixel a number between 0 and 255. When this image was acquired, the value of each pixel corresponded to the level of reflected microwave energy. A grayscale image is formed by assigning each of the 0 to 255 values to varying shades of gray.

Figure 2.6 Array of Pixels in a Digital Image

There is only one color channel in a grayscale image, so a grayscale image is represented as (height, width, 1) or simply (height, width). On the other hand, in an RGB image there are three color channels (Red, Green, and Blue). Thus, an RGB image is represented as (height, width, 3), where 3 denotes the number of channels in the image.

Figure 2.7 Grayscale Image and RGB Color Image

2.2.1.2 Convolution Kernels

A kernel or filter exists in the convolution layer as a set of filters or kernels, the parameters of which are learned throughout the training. It is used to extract the features from the image. The size of the kernel is usually smaller than that of the input volume. Each kernel is convolved with the input volume to compute an activation map. Kernel sizes are divided into smaller and larger ones. The frequently used kernel sizes are 1x1, 2x2, 3x3, and 4x4, whereas the larger ones consist of 5x5 and beyond. In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012), after winning the competition, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton introduced the AlexNet CNN architecture²; it used larger kernel sizes like 11x11 and 5x5, which consumed two to three weeks of training. We no longer use such huge kernel sizes as a result of the considerably prolonged training time required and the cost.

² O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, 'ImageNet Large Scale Visual Recognition Challenge', 2015, p. 19.

(Kernel examples shown: Original, Gaussian Blur, Sharpen, Edge Detection.)

Figure 2.8 Kernel Types

The basic functionality of the convolution layer operation can be summarized as below:

1. Take two matrices (which both have the same dimensions).

2. Multiply them, element by element.

3. Sum the elements together.

We perform the convolution operation by sliding this filter over the input (Figure 2.10). At every position, we multiply the matrices element by element and sum the result. This sum goes into the feature map. The matrix multiplication in CNN can be split into two types of operation (i.e., the convolution operation and the cross-correlation operation).

Figure 2.10 How Convolution Layer Work

Cross-Correlation Operation and Convolution Operation

Cross-correlation and convolution are both operations applied to images. The cross-correlation operation means sliding a kernel (filter) across an image:

$$G[m,n] = (f \star h)[m,n] = \sum_{j} \sum_{k} h[j,k] \, f[m+j,\, n+k]$$

where the input image is denoted by $f$ and our kernel by $h$. The indexes of the rows and columns of the result matrix are marked with $m$ and $n$. In addition, $j$ and $k$ represent the position of the row and column in the kernel matrix and its reflection on the input data, respectively. In the convolution operation, the kernel is flipped and then works like cross-correlation.
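A direct NumPy translation of the cross-correlation formula, as a sketch with no padding and a stride of one:

```python
import numpy as np

def cross_correlate2d(f: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Slide kernel h over image f, summing the element-wise products."""
    out_h = f.shape[0] - h.shape[0] + 1   # n_h - m_h + 1
    out_w = f.shape[1] - h.shape[1] + 1   # n_w - m_w + 1
    G = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            region = f[m:m + h.shape[0], n:n + h.shape[1]]
            G[m, n] = np.sum(h * region)
    return G

f = np.arange(49, dtype=float).reshape(7, 7)   # 7x7 input, as in Figure 2.11
h = np.ones((3, 3))                            # 3x3 kernel
print(cross_correlate2d(f, h).shape)           # (5, 5), matching (n - m + 1)
# Convolution proper flips the kernel first: cross_correlate2d(f, np.flip(h))
```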


Padding and Stride

With the purpose of calculating the result matrix size, take the example of a simplified image. As shown in Figure 2.11, the input matrix has a height and width of seven and the kernel matrix has a height and width of three, so we get an output shape of dimension $5 \times 5$. Assuming that the input shape is $n_h \times n_w$ and the convolution kernel shape is $m_h \times m_w$, we have an equation to calculate the output shape:

$$(n_h - m_h + 1) \times (n_w - m_w + 1)$$

Figure 2.11 Convolutional Layer

When the input image goes through the convolution operation, as we mentioned in the previous part, the output matrix shrinks, and if we take an output result and keep repeating the operation, it gives us a small image after 'filtering' at the end. In the example in Figure 2.11, we have an input matrix of dimension 7x7 and a kernel of dimension 3x3, and we get the result matrix as 5x5. Thus, there are two problems when we convolve a kernel with an image:

1. The shrinking output: every time we apply a convolution operation, our image shrinks. It goes from seven-by-seven down to five-by-five, for instance. We can only apply the convolution operation a few times before the image starts getting tiny. Thus, it can make the classification task less accurate.

2. Throwing away information from the edges of the image: reusing Figure 2.12, the four orange rectangles represent the four corners of the input matrix, and the dotted rectangle represents each slide of the kernel during the convolution operation. If we take a pixel at one of the four corners of the matrix, the kernel covers it only once per operation. Thus, we miss a lot of information near the edges with each sliding kernel. Meanwhile, if we take a pixel in the middle, there are a lot of three-by-three kernel regions that slide over and overlap the pixel. Compare how often the kernel covers the middle pixel with the corner ones: our convolution operation works inefficiently.


Figure 2.12 Convolution Operation Problem


With the purpose of solving both of these problems, what we can do is pad the input matrix before applying the convolution operation. Padding is simply adding extra pixels outside the image; we can imagine it as a border around the matrix.

Figure 2.13 The Input Matrix with Padding p=1

After we apply padding with parameter p = 1 to the input matrix, as in Figure 2.13, the four corner pixels of the original image are no longer on the edge of the matrix. Instead, they become pixels that are inside the matrix. In this way, we will not throw away any information from the edges of the image. Furthermore, we can also solve the problem of the image shrinking after each convolution operation. Because of the changed size of the input matrix, the equation used to calculate the output matrix size, $(n_h - m_h + 1) \times (n_w - m_w + 1)$, must be changed; the output then becomes:

$$(n_h - m_h + 2p + 1) \times (n_w - m_w + 2p + 1)$$

where $n_h$ and $n_w$ are the original (unpadded) input dimensions. Using the example in Figure 2.13, our input matrix after padding becomes a nine-by-nine image, and if we convolve it with a kernel of size three, we get a seven-by-seven output image, the same size as the original. Thus, we managed to preserve the original input size.


Figure 2.14 Feature Map with Padding Applied

For a deep learning project, we can face an extremely deep NN which is one hundred layers deep or even more. Therefore, padding is necessary in the convolutional layers to solve such training problems.

Figure 2.15 Input Matrix with Stride of One (Source: Seb, Programmathically)

Because of the stride action, the step range per slide changes. Thus, with $s_h$ the stride for the height and $s_w$ the stride for the width, the equation for the outcome matrix becomes:

$$\left( \left\lfloor \frac{n_h - m_h + 2p}{s_h} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n_w - m_w + 2p}{s_w} \right\rfloor + 1 \right)$$


Figure 2.16 Feature map with Stride applied
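A small helper to check these output-size formulas for any combination of kernel size, padding, and stride, assuming they are applied independently per dimension:

```python
def conv_output_size(n: int, m: int, p: int = 0, s: int = 1) -> int:
    """Output length along one dimension: floor((n - m + 2p) / s) + 1."""
    return (n - m + 2 * p) // s + 1

print(conv_output_size(7, 3))            # 5 -> no padding, stride 1
print(conv_output_size(7, 3, p=1))       # 7 -> padding preserves the input size
print(conv_output_size(7, 3, p=1, s=2))  # 4 -> stride 2 roughly halves the output
```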

In some cases, when the stride parameter is big, the matrix dimension does not have enough pixels for the kernel. Thus, we completely discard that region.

Figure 2.17 Reject region when kernel goes out of the matrix with s = 2

Padding and stride are two techniques used to improve convolution operations and make them more efficient. Same padding is especially important in the training of very deep neural networks. If we have a lot of layers, it becomes increasingly difficult to keep track of the dimensionality of the outputs if the dimensions change in every layer. Furthermore, the size of the feature maps will be reduced at every layer, resulting in information loss at the borders. This is likely to depress the performance of your neural network. Stride, on the other hand, has lost its importance in practical applications due to the increase in computational power available to deep learning practitioners.

2.2.2 Activation Functions

An activation function is a function used in an ANN to compute the weighted sum of the inputs and bias (recall Figure 2.3 for more details), which defines the output of a neuron given the parameters in the data. These activation functions can be either linear or non-linear depending on the function they represent. In deep learning, extremely complicated tasks (e.g., image classification, object detection, speech-to-text transformation) need to be addressed with the help of NNs and activation functions. The output of a convolutional layer goes through an activation function before becoming the input matrix for the next convolutional layer.


Figure 2.19 Linear Function and Non-linear Function


On the other hand, with a complicated dataset, a linear function faces a problem when it tries to cover all the information (Figure 2.20). Meanwhile, a non-linear function can handle the task better (Figure 2.21).

Figure 2.20 Linear Function Problem


2.2.2.1 ReLU

ReLU is the most widely used activation function. ReLU is equal to $x$ if $x > 0$ and equal to $0$ if $x < 0$, where $x$ is the output parameter in the feature map:

$$f(x) = \max(0, x)$$

The above function means that the negative values in the feature map are mapped to zero and the positive values are returned (Figure 2.24).

Figure 2.23 ReLU Function

Figure 2.24 ReLU Activation Mapped on Feature Map (Source: ResearchGate)

On the other hand, nothing is perfect; ReLU still has a problem when a ReLU activation function always outputs the same value (i.e., zero, as it happens) for any input. This problem, called 'Dying ReLU', arises from learning a large negative bias for the ReLU activation inputs, causing zeros to be mapped onto the feature map. Once a ReLU ends up in this state, it is unlikely to recover, because with 0 values in the feature map, the filter cannot alter the weights for the next convolution blocks (see the convolution layer operation in the previous part).

2.2.2.2 Leaky ReLU

Another popular function is Leaky ReLU; from some perspectives this function works the same as ReLU. However, it can be used to solve the Dying ReLU problem. Leaky ReLU has a non-zero gradient: the slope is non-zero for the negative side in the Leaky ReLU graph. The equation of Leaky ReLU for solving Dying ReLU is:

$$f(x) = \max(0.01x, x)$$


Figure 2.25 Difference between ReLU and Leaky ReLU (Source: ResearchGate)
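Both activations are one-liners in NumPy; a minimal sketch:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)        # negatives become 0, positives pass through

def leaky_relu(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    return np.maximum(alpha * x, x)  # negatives keep a small non-zero slope

feature_map = np.array([-3.0, -0.5, 0.0, 1.5])
print(relu(feature_map))             # [0.    0.    0.    1.5]
print(leaky_relu(feature_map))       # [-0.03  -0.005  0.     1.5]
```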

2.2.3 Sub-sampling or Pooling Layer

Pooling layers are used to reduce the dimensions of the feature maps. As a result, the amount of computation done in the network is reduced along with the number of parameters that must be learned. This type of layer is often placed between two convolutional layers. The pooling layer improves the efficiency of the network and avoids over-learning. The basic principle of pooling is remarkably similar to the convolution operation: it still has a kernel which slides over the input matrix. The most commonly used kernel size is 2x2, and it slides with a stride of 2.

There are several approaches to pooling. The most used approaches are max-pooling and average-pooling. Based on the type of pooling operation applied, the pooling kernel calculates an output for the output matrix. Nevertheless, the pooling kernel is not the same as the one in the convolution operation; it does not contain any weights, as the sketch below shows.
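A sketch of 2x2 max-pooling and average-pooling with a stride of 2 in NumPy; note that, unlike a convolution kernel, the pooling window carries no weights:

```python
import numpy as np

def pool2d(x: np.ndarray, size: int = 2, stride: int = 2, mode: str = "max") -> np.ndarray:
    """Slide a size x size window with the given stride and reduce each window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(pool2d(x, mode="max"))   # [[6. 8.] [3. 4.]]
print(pool2d(x, mode="avg"))   # [[3.75 5.25] [2.   2.  ]]
```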

³ J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler, New York University.

Figure 2.28 Average Pooling Example

2.2.4 Fully connected Layer

The fully connected layer is the last layer in a CNN. The basic approach of the fully connected layer is to learn the characteristics associated with each label and classify them. After the image has passed through multiple convolution and pooling layers, we have learned the relative characteristics of the image. Before entering the fully connected layer, because these layers only accept 1-dimensional data while our data is not, the output of the last layer in the convolutional block goes through a function which is called flatten (Figure 2.29).

2.2.4.1 Flattening

This is a plain step. After pooling, we have the pooled feature map, which has a matrix size of three, for instance (Figure 2.30). Our next mission is to convert the three-by-three matrix into a single column. We supply these numbers as input values to the FC layer, which is why flattening is necessary, as the sketch below shows.


Figure 2.30 Flattening Example
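In NumPy terms, flattening is just a reshape; a minimal sketch:

```python
import numpy as np

pooled = np.array([[1, 0, 4],
                   [2, 1, 0],
                   [5, 2, 9]])   # a 3x3 pooled feature map, as in Figure 2.30

column = pooled.flatten()        # read row by row into a single 1-D vector
print(column)                    # [1 0 4 2 1 0 5 2 9] -> input to the FC layer
```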
