NGUYEN MINH LY - 20521592
GRADUATION THESIS
EVALUATING THE EFFECTIVENESS OF CNN-TRANSFORMERS
ARCHITECTURE IN CLASSIFYING ROAD DAMAGE CONDITIONS
BACHELOR OF COMPUTER SCIENCE
ADVISOR: Dr. Le Kim Hung
HO CHI MINH CITY, 2024
Dissertation Defense Committee List
The dissertation defense committee, established according to Decision No. 154/QD-DHCNTT dated March 1, 2023, by the Rector of the University of Information Technology:
- Chairperson
- Secretary
- Member
First of all, I would like to express my sincere gratitude to all the professors and teachers working and teaching at the University of Information Technology - VNU-HCM for the knowledge, lessons, and valuable experiences that I have acquired during my recent journey. I wish the Department of Computer Science in particular, and the University of Information Technology - VNU-HCM in general, continued brilliant success in the field of education, always training talents for the country, and remaining a firm and attractive destination for future generations of students.
I would also like to extend my heartfelt thanks to Dr. Le Kim Hung. Thanks to his experience, lessons, care, assistance, and guidance, I have overcome the difficulties and challenges in the process of completing this graduation thesis.
Next, I would like to express my sincere thanks to my family for always believing in and encouraging me throughout my studies at the University of Information Technology - VNU-HCM, giving me additional motivation to strive for development and achieve the success I have today.

Finally, I would like to thank my fellow students at the University of Information Technology - VNU-HCM for their companionship, assistance, enthusiasm in sharing opinions, and suggestions to help improve and perfect my graduation thesis.
Ho Chi Minh, 2024
Table of Contents

Acknowledgements 4
ABSTRACT 11
CHAPTER 1 INTRODUCTION 12
CHAPTER 2 THEORETICAL BACKGROUND 16
2.1 Overview of Artificial Intelligence, Machine Learning, Deep Learning 16
2.2 Typical Architectures in Deep Learning 21
2.2.1 Neural Network 21
2.2.2 Convolutional Neural Network 25
2.2.3 Transformers 29
2.3 Image classification problem 34
2.4 Some studies on the problem of detecting damaged roads 35
CHAPTER 3 PROPOSED MODEL 36
3.1 Overview of the Model 36
3.2 ResNet model
3.3 EfficientNet
3.4 Transformers
3.5 Zero-shot learning
CHAPTER 4 EXPERIMENTS AND EVALUATION
4.1 Dataset
4.2 Evaluation metrics
4.2.1 Confusion matrix
4.2.2 Recall
4.2.3 Precision
4.2.4 F1 score
4.3 Experiment
4.3.1 Experimental context
4.3.2 Experimental results
CHAPTER 5 SUMMARY AND DEVELOPMENT DIRECTIONS
5.1 The achieved results
List of Figures

Figure 1. Overview, applications, and classification of the fields of AI, ML, and DL 16
Figure 2. The performance differences among various AI and ML model groups 20
Figure 3. Transformers architecture 31
Figure 4. Proposed model architecture 36
Figure 5. The skip connection architecture introduced in ResNet 39
Figure 6. Architecture of the ResNet50 model 40
Figure 7. Ways to scale up neural network architectures 42
Figure 8. The baseline architecture of the EfficientNet model 43
Figure 9. Comparison of the accuracy of the models on the ImageNet dataset 44
Figure 10. The architecture of the EfficientNetB3 model 44
Figure 11. Image classification with zero-shot learning 46
Figure 12. CLIP training process on 400 million image-caption data pairs 48
Figure 13. The process of predicting the label of an image using the CLIP model
Figure 14. Distribution of data in the train, valid, and test sets 51
Figure 15. The ratio of images between the classes in the dataset; 0: non-damaged road class; 1: damaged road class 52
Figure 16. Distribution of images for each country in the train data 52
Figure 17. Distribution of images for each country in the valid data 53
Figure 18. Distribution of images for each country in the test data 53
Figure 19. Some images from the dataset. a, b: images of undamaged roads; c, d: images of damaged roads 54
Figure 20. The loss and accuracy scores of the model during training 63
Figure 21. The GradCAM results display the important regions on an image that the model uses to make predictions 65
List of Tables

Table 1. Details of the number of images in the data 51
Table 2. Values of TP, FN, FP, TN in the confusion matrix 55
Table 3. Simple prompt 59
Table 4. Detailed prompt 59
Table 5. Results of zero-shot learning for each class 60
Table 6. The overall result for zero-shot learning 60
Table 7. The number of parameters for each model 61
Table 8. The detailed accuracy metrics of the proposed model using various measurements 64
Table 9. The results of comparing the accuracy of our model ("Ours") with the CNN models 67
Table 10. Evaluation results 68
List of Abbreviations

AI: Artificial Intelligence
ABSTRACT

Detecting road damage is an essential and crucial task for ensuring road infrastructure, traffic safety, and maintaining the supply chain for the economy. With the rapid technological development in recent years, especially in automated image processing techniques in the field of artificial intelligence, this thesis researches solutions for automatically detecting road damage. The aim is to improve accuracy while keeping computational requirements reasonable for practical deployment, meeting the significant global demand.

In this thesis, I propose a new model architecture that combines convolutional neural networks (CNN) and transformers for image classification problems. This architecture leverages the strength of CNNs in extracting intermediate- to high-level features from input images, and then processes these features through transformer encoder blocks. This process uncovers hidden features in the feature vector that CNNs may not detect, thereby enhancing accuracy for image classification. Moreover, we compile various related datasets to create a comprehensive dataset for evaluating the effectiveness of the proposed solution compared to the best available solutions.
to improving transportation efficiency, especially in the context where road transport accounts for a predominant share of freight transportation in many countries. A reliable statistic shows that the proportion of goods transported by road in European countries accounted for 77.3% in the year 2021 [1].
However, current solutions such as human monitoring through video or image analysis from unmanned aerial vehicles, while partly addressing the issue, face many limitations regarding labor and equipment costs, raising questions about their widespread applicability. Additionally, management agencies often lack the technological expertise to deploy and maintain these advanced and modern systems.

Recently, new solutions, although applying advanced technologies like machine learning and automatic image processing, still cannot fully meet these needs. For example, the crack detection model on roads developed by the University of Tokyo, while achieving high accuracy, requires up to 1500 ms to process a single image when tested on a smartphone. This clearly does not meet the real-time requirements of practical applications [2]. Another solution is the work of Yachao Yuan and colleagues, who attempted to address this issue by combining various simple image processing techniques to increase processing speed [3]. Although the model achieved impressive accuracy and prediction speed, the dataset used was not large and diverse enough to conclusively demonstrate the solution's effectiveness across different road types in various countries. This highlights a significant limitation in developing models that can generalize effectively across various road types worldwide.

Machine Learning (ML) and Artificial Intelligence (AI) have opened a new approach to addressing this issue. The combination of AI with image processing techniques holds great potential for automatically and accurately detecting and classifying road damage. However, developing AI models that can generalize well across various environmental and road conditions remains a significant challenge. Additionally, processing and analyzing the large volume of collected data is a considerable challenge, requiring a smooth integration of advanced image processing techniques and machine learning.
In recent years, the emergence of the transformer architecture has revolutionized the field of artificial intelligence, particularly natural language processing. Many studies have utilized the power of transformers to improve accuracy in image processing tasks, achieving significant advancements. The superiority of this approach is due to the attention mechanism, which allows the model to view the entire context rather than just focusing on local areas, as is the case with traditional CNNs. Therefore, a current trend is to research models that combine both CNNs and transformers to create significant technological advancements.

The research and development of new solutions are necessary both technically and socially. Advanced research is needed to develop AI models capable of quickly, accurately, and efficiently classifying road damage at a reasonable cost. This will play a crucial role in improving the safety and efficiency of the global road traffic system.
In this context, the topic of this thesis aims not only to solve a technical problem but also to contribute to the sustainable development of global road infrastructure. The research and development of AI-based automatic road damage classification solutions is not only a significant step forward in the technical field but also a practical contribution to improving the quality of life and safety of the community.
1.3 Objectives of the Thesis
- Explore the fields of artificial intelligence, machine learning, and deep learning, and their applications.
- Research several existing studies and methods for the problem of classifying road damage conditions.
- Focus on designing and testing architectures that combine Convolutional Neural Networks (CNN) and Transformers. The goal is to create an optimal solution for the road damage classification problem, leveraging the advantages of both architectures: the powerful image feature recognition capability of CNNs and the ability of Transformers to process sequential, complex data.
- Compare and evaluate the performance of the proposed model against models that use only the CNN architecture.
- Evaluate the ability to deploy the proposed model on an edge device.
1.4 Subjects and Scope of Research
1.4.1 Research Subjects
This thesis focuses on:
- Modern network architectures in machine learning and artificial intelligence, including ANN (Artificial Neural Networks), CNN (Convolutional Neural Networks), and Transformers.
- The problem of image classification for the detection of damaged road surfaces in photographs.
- The feasibility of deploying these models in practical applications, particularly on devices with limited hardware capabilities.
1.4.2 Scope of Research
The scope of research includes:
- Research on existing solutions to address the problem of road damage classification.
- Research on effective models in the field of image processing, specifically for image classification problems.
CHAPTER 2 THEORETICAL BACKGROUND
2.1 Overview of Artificial Intelligence, Machine Learning, Deep Learning
Figure 1. Overview, applications, and classification of the fields of AI, ML, and DL. (Artificial Intelligence: a technique which enables machines to mimic human behaviour. Machine Learning: a subset of AI techniques which uses statistical methods to enable machines to improve with experience. Deep Learning: a subset of ML which makes the computation of multi-layer neural networks feasible.)
Artificial Intelligence: In the field of computer science, artificial intelligence (AI), sometimes referred to as synthetic intelligence, is intelligence exhibited by machines, contrasting with the natural intelligence of humans. Typically, the term "artificial intelligence" is used to describe server (or computer) systems capable of mimicking "cognitive" functions often associated with the human mind, such as "learning" and "problem-solving" [4].
Artificial intelligence systems are classified based on their ability to replicate
human characteristics and are broadly divided into three main types as follows:
- Narrow Artificial Intelligence (ANI): At this level, artificial intelligence can only solve problems in a specialized domain, such as image classification or spam email filtering.
- Artificial General Intelligence (AGI): This type of intelligence is similar to human capabilities, meaning it can perform tasks that humans can do and can be considered a miniature representation of human intelligence.
- Artificial Super Intelligence (ASI): At this level, artificial intelligence surpasses human intelligence in its capabilities.
Machine Learning: Machine learning is a field of artificial intelligence that involves researching and developing algorithms to enable computer systems to "learn" automatically from data in order to solve specific problems. Its goal is to build a model based on training data to make predictions or decisions without explicit programming. For example, machines can "learn" how to classify emails as spam or not and automatically organize them into the corresponding folders. Machine learning is closely related to statistical inference, although there are differences in terminology [5].
To better understand how to create a machine learning model, we need the following elements:

- Datasets: A dataset consists of a set of entities or samples sharing the same structure. Creating a good dataset requires significant time, effort, and sometimes financial investment. A good dataset trains good machine learning models, making it arguably the most crucial part of model training.
- Features: Features of the data are essential for the model, considered the key factors used to represent the problem being solved. Features are often characteristics of an observed phenomenon. Selecting informative, differentiating, and independent features for training is crucial to creating effective models for tasks such as recognition, classification, or regression.
© Algorithm: Algorithms are the definitions and constraints in the learning
and development process of a model automatically, without the need forspecific task programming There are various algorithms, from classical tomodern ones, to solve different problems such as linear regression, k-
means, deep learning networks, and recommendation systems The
accuracy and speed of predictions depend on the type of algorithm used
Machine learning is categorized into:
- Supervised Learning: The model is trained on pre-labeled data. Since the model is fine-tuned to fit the accurately prepared data (labels), after training it represents the characteristics of the data accurately, meeting the user's needs.
- Unsupervised Learning: The model is trained on unlabeled data. The goal is to find features and hidden knowledge in the dataset. As there are no labels for evaluation, this method may sometimes produce models that do not meet expectations.
- Semi-supervised Learning: Uses a dataset with some labeled and some unlabeled data. The labeled data guides the model according to the creator's purpose. Due to the relatively low cost of collecting unlabeled data, the model is trained on a large dataset, gaining a better understanding of the data and often yielding better prediction results.
- Reinforcement Learning: Considered the most ambitious learning approach, predicted to create super artificial intelligence models in the future. Reinforcement learning studies how an agent should choose actions to maximize a reward; it focuses on maximizing cumulative rewards over time.
Deep Learning: Deep learning is a subset of machine learning built on the principles of artificial neural networks. It mimics how the biological brain processes information through interconnected nodes and distributed networks. While it originates from the biological brain, artificial neural networks in deep learning still have distinctive features, reflecting clear differences between natural and artificial structures [6].

Artificial neural networks in deep learning are constructed with multiple layers, each consisting of a large number of artificial neurons. Each neuron receives input, processes information, and transmits signals to other neurons. The training process in deep learning involves adjusting the weights of the connections between neurons to optimize the accuracy of the model when making predictions for input data.

An important aspect of deep learning is the ability to learn feature representations. Instead of manually defining the features needed for processing (as in traditional machine learning algorithms), deep learning models can automatically discover and refine complex features from input data. This enables the model to adapt to complex and diverse types of data.

While inspired by the biological brain, artificial neural networks have clear differences from their natural counterparts. The biological brain is a dynamic, flexible, and continuously evolving system, whereas artificial neural networks are often designed with static and highly symbolic structures. This means that although capable of learning and adapting, artificial neural networks cannot autonomously develop or change their fundamental structure like the human brain.

Deep learning has become a powerful tool in various fields, from image recognition to analyzing complex data. Its ability to self-learn and adapt to input data makes deep learning an ideal choice for solving highly complex problems, such as the road damage classification addressed in this thesis.

In summary, deep learning is not only a technology that simulates the workings of the biological brain but also an advanced method with the ability to self-learn and adapt, opening up new possibilities for solving complex real-world problems.
Figure 2. The performance differences among various AI and ML model groups: as the amount of data grows, deep learning algorithms outperform traditional machine learning algorithms.
The figure above illustrates the difference between traditional machine learning models (other approaches) and neural networks. As the amount of data and computational resources increases, scaling up neural networks allows the creation of models with superior accuracy. In the current era, with data growing exponentially, neural network models increasingly demonstrate their dominance in automated tasks.
2.2 Typical Architectures in Deep Learning
2.2.1 Neural Network
a Overview
Neural Network, also known as ANN (Artificial Neural Network), is an algorithm inspired by the functionality of the human brain, used to create models capable of solving common problems without the need for detailed programming. To illustrate: when we see images around us through our eyes (which can be considered as data), signals from the eyes are transmitted to neurons in the brain (akin to the smallest elements in a neural network) to process and make decisions about those images (output). In reality, for a neural network to process such information, it must undergo a training process to acquire the necessary knowledge before being used to predict outcomes. ANN can be used to solve various problems, such as pattern recognition, image classification, speech recognition, etc., as long as suitable data is available for training.

Mathematically, an ANN can be viewed as a function y = f(x), where x is the input data, and the role of f is to map x to a predicted result.
b Architecture of ANN

ANN consists of 3 layers:

- Input layer: serves as the unit representing input information, used to feed input data into the network.
- Hidden layer: In an ANN, to achieve high accuracy, networks are often designed with multiple hidden layers. Each neuron in these layers is connected to neurons in the preceding and following layers, facilitating the combination of feature information from the data and the learning of hidden knowledge from it.
- Output layer: serves as the unit representing output information. Signals from the previous layers are transmitted to this layer to provide the predicted result of the model.
c Neural architecture in ANN

A fundamental unit in each layer of an ANN is called a neuron (or node, unit). It receives input from the preceding layer and computes an output. Each input received by the neuron has a corresponding weight, representing the contribution of that input to solving the given problem. The goal of training an ANN model is to find suitable values for these parameters so that when predicting a new data sample, the expected result is obtained. The output of a neuron is calculated using the dot product of the weights and the corresponding inputs.

Formula to calculate the output of a neuron:

y = Σ_i (w_i × x_i) + b

However, because of the limited representational capability of linear functions, it becomes challenging to map inputs to correct predictions in complex problems. Therefore, we need to make this representation nonlinear. The solution is to pass the output through a nonlinear function called an activation function. Commonly used activation functions include ReLU, Sigmoid, and Tanh. With an activation function, the formula for the output of the neuron becomes:

y = f(Σ_i (w_i × x_i) + b), where f is the activation function.
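To make the formula concrete, here is a minimal NumPy sketch of a single neuron's output with ReLU as the activation function f; the input values, weights, and bias are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def neuron_output(x, w, b):
    """Output of a single neuron: y = f(sum_i w_i * x_i + b), with f = ReLU."""
    return relu(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 3.0])   # example inputs (illustrative values)
w = np.array([0.5, 0.25, -0.1])  # learned weights (illustrative values)
b = 0.2                          # bias
y = neuron_output(x, w, b)       # weighted sum is -0.1, so ReLU gives 0.0
```

Without the ReLU, the neuron would be purely linear; the activation is what lets stacked layers represent nonlinear mappings.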
Trang 22d Neural network training process
As mentioned earlier, the training process involves updating the weights in each neuron to fit the data. Therefore, during training, the model goes through two stages:

Feed Forward Propagation: This is the process of calculating the output of the network by passing input data through the network in the forward direction (from input to output). In the initial stage of training, the output values often deviate significantly from the actual labels because the model has not yet learned from the data. After obtaining the prediction results, they are compared with the actual labels, and the prediction error is calculated. The goal is to update the weights in such a way that, after training, this error is minimized.

Mathematical representation of the loss function:

Loss = g(y_label, y_predict)

In which:

- y_label: the ground truth
- y_predict: the prediction of the model
- g: a function that measures the error between predicted results and ground truth.
Back Propagation: This is the process of updating the model's weights. The challenge is to find weights for which the loss function has the smallest value. An effective way to achieve this is to compute the partial derivative of the loss function with respect to each weight and update the weight in the direction opposite to the gradient. The chain rule is used to facilitate the calculation of these derivatives. This weight-updating process is repeated many times until the loss reaches the smallest possible value.
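As a minimal sketch of these two stages, the following trains a single linear neuron with mean squared error as the loss function g and plain gradient descent; the toy data, learning rate, and choice of loss are illustrative assumptions:

```python
import numpy as np

# Toy data for the target relation y = 2x, modeled by one linear neuron y = w*x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_label = 2.0 * x

w = 0.0     # initial weight
lr = 0.01   # learning rate

for _ in range(500):
    # Feed forward propagation: predictions with the current weight.
    y_pred = w * x
    # Loss = g(y_label, y_pred), here mean squared error.
    loss = np.mean((y_pred - y_label) ** 2)
    # Back propagation: dLoss/dw, derived with the chain rule.
    grad = np.mean(2.0 * (y_pred - y_label) * x)
    # Update the weight opposite to the gradient direction.
    w -= lr * grad

# After training, w is close to 2 and the loss is near zero.
```

Each iteration runs one forward pass to compute the loss and one backward step that moves the weight against the gradient, which is exactly the cycle described above.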
e Summary of advantages and disadvantages of ANN
The architecture of Artificial Neural Networks (ANN) has the following advantages:

- Self-learning and adaptability: ANN can learn from data, extract relevant features, and adjust its network weights without the need for specific programming.
- Non-linear data processing: Thanks to activation functions, ANN is powerful in handling non-linear models, making it suitable for various complex problems.
- Automatic feature extraction: In deep learning, the hidden layers of an ANN can automatically learn complex features from data, reducing the need for manual feature selection techniques.
However, it also has some notable drawbacks:

- Data issues: ANN tends to be less effective when trained on small, less diverse datasets.
- Training complexity: Training an ANN, especially deep models with multiple layers, can be complex and requires significant computational resources.
- Sensitivity: ANN is sensitive to small changes in input data, requiring careful data preprocessing.
2.2.2 Convolutional Neural Network
Convolutional Neural Network [7] is a well-known form of a Neural Network,widely used in various tasks related to image processing such as image classification,object detection, face recognition, etc., due to its ability to learn powerful features
a Convolution layer
Convolution is an important operation that has played a significant role since the early days of digital signal processing. Finding suitable filters for each type of signal and each problem has been extensively researched and taught in technical curricula. In the late 1980s, Yann LeCun proposed a two-dimensional convolutional model for image data, achieving great success in handwritten digit classification tasks [8]. Since then, many successes have followed, with numerous studies leading to neural network architectures that use convolutional operations to great effect in image processing. In the context of this thesis, we will delve deeper into two-dimensional convolution to illustrate how neural networks operate on images.

The 2D convolution operation is one of the most crucial components of a CNN, widely employed in processing and learning from image data. This operation not only helps the model learn essential features from the image but also preserves the spatial structure of the data. In a CNN, the 2D convolution is performed by sliding a filter (kernel) over the input image from left to right and top to bottom. This filter is typically a small matrix (e.g., 3x3 or 5x5) containing weights that need to be learned.
Let us assume I is the input matrix (e.g., an image) and K is the filter (kernel) with dimensions m×n. The 2D convolution operation is defined as follows:

(I * K)(i, j) = Σ_{u=0}^{m-1} Σ_{v=0}^{n-1} I(i + u, j + v) × K(u, v)

In which:

- (i, j) represents the position in the output matrix;
- I(i + u, j + v) and K(u, v) are the values at the corresponding positions in the input matrix and the filter, respectively.
Some other factors to consider in CNN:

- Stride: During the convolution operation, the value of the stride is also important. Stride is the number of pixels the kernel moves after each convolution operation on the image. With a stride of 1, the kernel moves one unit at a time across the input image.
- Padding: Sometimes the filter size does not perfectly match the input image; in such cases, additional columns or rows with zero values are often added around the input image matrix. This is called padding.
- Activation function: From the convolution formula, it is easy to see that the convolution operation is also a linear function, as mentioned in the ANN section. Linear functions are often not powerful enough to represent complex data. Therefore, in CNN, an activation function is added to enhance the network's representational capacity.
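The sliding-window computation, together with stride and zero padding, can be sketched as a naive loop-based implementation; real frameworks use heavily optimized equivalents, and the example image and kernel below are arbitrary:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution (cross-correlation form, as used in CNNs):
    out(i, j) = sum_u sum_v image(i*stride + u, j*stride + v) * kernel(u, v)."""
    if padding > 0:
        # Zero padding around the border of the input.
        image = np.pad(image, padding, mode="constant", constant_values=0.0)
    m, n = kernel.shape
    out_h = (image.shape[0] - m) // stride + 1
    out_w = (image.shape[1] - n) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride + m, j*stride:j*stride + n]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # arbitrary 2x2 filter
feat = conv2d(image, kernel)                       # 3x3 feature map
```

Increasing the stride shrinks the output, while padding preserves the border: `conv2d(image, kernel, padding=1)` yields a 5x5 map and `conv2d(image, kernel, stride=2)` a 2x2 map.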
b Pooling layer
Because the convolution operation is performed on small parts of the image, larger images require more resources and computational time. Therefore, the pooling layer is used to reduce the size of the input image or feature maps. There are three types of pooling:

- Max pooling: Divides the feature map into small regions of size N×N, moving from top to bottom and left to right, then takes the maximum value in each N×N region as the representative value.
- Average pooling: Similar to max pooling in the division of regions, but the representative value for each region is calculated as the average of the values.
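Both pooling variants can be sketched in a few lines; the feature-map values below are arbitrary:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample by taking the max (or mean) of each non-overlapping
    size x size region, scanning top-to-bottom and left-to-right."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            region = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = region.max() if mode == "max" else region.mean()
    return out

fm = np.array([[ 1.0,  2.0,  5.0,  6.0],
               [ 3.0,  4.0,  7.0,  8.0],
               [ 9.0, 10.0, 13.0, 14.0],
               [11.0, 12.0, 15.0, 16.0]])
pooled_max = pool2d(fm, 2, "max")   # [[4, 8], [12, 16]]
pooled_avg = pool2d(fm, 2, "mean")  # [[2.5, 6.5], [10.5, 14.5]]
```

With size 2, each output is half the input in each dimension, which is what makes subsequent layers cheaper to compute.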
c Characteristics of CNN

Translational Equivariance:

Due to the use of shared weights (the kernel), when an object appears at any position in the image, the convolution operation still produces the object's features at the corresponding position [9].

In mathematical terms, a function f(x) is considered equivariant if f(g(x)) = g(f(x)), where g is a translation function. Therefore, applying the translation g to an image and then performing convolution yields the same result as performing convolution first and then translating.

Thanks to this property, with a cat as the input image, regardless of its position or breed, a CNN can still extract features specific to the cat, allowing it to detect the object in the image as a cat.
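This property can be checked numerically: the same small object placed at two positions yields feature maps that differ only by the same shift. The toy image and kernel are assumptions for illustration, and SciPy's `correlate2d` stands in for a convolution layer:

```python
import numpy as np
from scipy.signal import correlate2d

# The same bright 2x2 "object" placed at two positions in an 8x8 image.
img_a = np.zeros((8, 8)); img_a[1:3, 1:3] = 1.0
img_b = np.zeros((8, 8)); img_b[4:6, 4:6] = 1.0   # shifted by (3, 3)

kernel = np.ones((2, 2))                            # shared-weight filter
feat_a = correlate2d(img_a, kernel, mode="valid")   # 7x7 feature map
feat_b = correlate2d(img_b, kernel, mode="valid")

# Shifting the input shifts the output by the same amount: f(g(x)) = g(f(x)).
assert np.allclose(feat_a[0:4, 0:4], feat_b[3:7, 3:7])
```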
Translational Invariance:

The concept of translational invariance differs slightly from translational equivariance. The goal of this property is that regardless of the subject's position in the image, its features will ultimately be pulled towards the center of the representation, where the model is most likely to focus [10].

This property is achieved through the operation of the pooling layer. For example, max pooling reduces the size of the output (feature map) by half when using pooling with a size of 2. The representative value for each region is the maximum, so the stronger features of the image are preserved in the end.
Local Receptive Field:

In each movement of the kernel over the image, it only "looks" at a portion of the image as input for the convolution operation. Therefore, in each calculation it observes only a local region of the image, commonly referred to as the local receptive field [11].

Thanks to this property, we can design networks with fewer parameters, saving resources and computational costs. Moreover, with fewer parameters the model is less prone to overfitting when trained on a dataset that is not excessively large, and it converges faster.
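The parameter savings from the local receptive field and weight sharing are easy to quantify; the 32x32 image size below is an arbitrary example:

```python
# Parameter count for mapping a 32x32 input to a 32x32 feature map.

# Fully connected: every output unit connects to every input pixel.
h = w = 32
dense_params = (h * w) * (h * w) + (h * w)   # weights + biases = 1,049,600

# Convolution: one shared 3x3 kernel plus a bias scans the whole image.
conv_params = 3 * 3 + 1                      # = 10
```

The convolutional layer needs roughly five orders of magnitude fewer parameters here, which is why CNNs remain tractable on large images.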
d Disadvantages of CNN architecture
Alongside its strengths, CNN still has the following drawbacks:

- To achieve good performance, the model still requires a relatively large dataset.
- It is unable to provide a convincing explanation of the mechanism through which the CNN model learns and makes decisions, which is particularly crucial in sensitive applications requiring high accuracy, such as healthcare.
- In some cases, the identification or classification of objects in data with more complex structures may not be effective.
- Training a CNN model, especially deep and complex ones, demands substantial computational resources and lengthy training times.
- Due to the local receptive field, it does not capture the entire context of an image at once, potentially not fully utilizing the information present in the data.
2.2.3 Transformers
a Introduction
In recent years, the Transformer architecture has emerged as a groundbreaking advancement in the field of machine learning, particularly in applications related to natural language processing (NLP). Initially introduced by Vaswani et al. in the 2017 paper "Attention is All You Need" [12], this architecture has quickly demonstrated its superior strength, particularly in handling long and complex sequences, a challenge faced by earlier models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) [13].

The core difference of Transformers compared to traditional models lies in the use of the "Self-Attention" mechanism, which allows the model to attend to the entire input to better understand context and meaning. This not only improves accuracy in tasks such as machine translation, text classification, and text generation but also opens up the capability to handle much more complex tasks.
Additionally, the Transformer architecture brings computational efficiency benefits compared to RNNs, LSTMs, or GRUs. Thanks to its non-sequential structure, it can parallelize processing, significantly reducing training time compared to models based on RNN or LSTM. This has changed the approach to model design in much research and many applications, shifting the focus from improving sequential processing to optimizing parallel processing capabilities.
Furthermore, Transformers have extended their influence beyond the field of NLP. Recent studies have demonstrated their capability to address problems in computer vision, including tasks such as image recognition [14] [15], object detection [16], and segmentation [17]. Marking a significant milestone in constructing versatile and powerful deep learning models, the flexibility and efficiency of the Transformer architecture have not only opened new avenues for language processing methodologies but also served as a crucial step for integration and development in various other machine learning domains.
Therefore, the Transformer architecture is not only a turning point in the history of machine learning development but also a key to unlocking new possibilities in addressing the complex challenges of the real world.
Figure 3. Transformers architecture
Transformers consist of two parts: the encoder and the decoder. While the encoder is responsible for extracting and synthesizing features from the input data, the decoder focuses on decoding these features into signals that humans can comprehend, commonly text. Within the scope of this thesis, since only the encoder part of the Transformer is used for image classification, the in-depth analysis will concentrate on this component.
b. Input embedding and positional encoding
Input embedding is a step that transforms input data into vectors so that the model can perform computations. Specifically:
• For text data, each token (word or phrase) is typically represented as a vector.
• For image data, the image is usually divided into regions, and each region is embedded into a vector (by flattening it, passing it through a neural network, etc.).
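The image-embedding step can be sketched in a few lines of NumPy. This is only an illustrative example, not code from the thesis: the function name, the 8×8 patch size, and the random projection matrix standing in for a learned linear layer are all my own choices.

```python
import numpy as np

def embed_image_patches(img, patch, W):
    """Split an image into non-overlapping patches, flatten each patch,
    and project it to an embedding vector with a linear map W."""
    H, Wd, C = img.shape
    flat = (img.reshape(H // patch, patch, Wd // patch, patch, C)
               .transpose(0, 2, 1, 3, 4)          # gather each patch's pixels together
               .reshape(-1, patch * patch * C))   # (num_patches, patch*patch*C)
    return flat @ W                               # (num_patches, d_model)

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))                # toy 32x32 RGB image
W = rng.normal(size=(8 * 8 * 3, 64))              # flattened 8x8 patch -> 64-dim embedding
emb = embed_image_patches(img, 8, W)
print(emb.shape)  # (16, 64): 16 patches, each embedded into 64 dimensions
```

In a real model, `W` would be learned (or replaced by a small CNN), but the reshape-and-project structure is the same.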
Positional encoding: In the case of Transformers, the data is input all at once (unlike networks such as RNN or GRU, where data is input sequentially). Therefore, we need to add information about the position of each input vector. This is because, in text data, a change in the order of tokens can lead to a misinterpretation of a sentence's meaning. In the paper "Attention Is All You Need", the following method is used to represent positional information:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:
• pos is the position of the token in the sequence
• i is the dimension-pair index, ranging from 0 to d_model/2 − 1
• d_model is the number of dimensions in the embedding vector for one token
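The sinusoidal scheme from the cited paper can be implemented directly in NumPy. The sketch below is illustrative (the function name is my own); it fills even embedding dimensions with sines and odd ones with cosines.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need".
    Returns an array of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]           # token positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]        # dimension-pair index
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(16, 64)
print(pe.shape)  # (16, 64)
```

The resulting matrix is simply added to the input embeddings, giving each token a unique, smoothly varying position signature.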
c. Transformers Encoder
The Transformer encoder is an architecture used to map input data into feature vectors. During this process, it leverages information from the entire input to examine relationships within the data. As a result, the feature vectors become richer in knowledge. The Transformer encoder consists of two main components: multi-head attention and fully connected layers. It also incorporates residual connections between blocks to mitigate the issue of vanishing gradients.
Multi-head attention:
• In the encoder architecture, multi-head attention utilizes a special type of attention called self-attention. Self-attention allows the model to learn relationships between each word in the input and the rest of the words. This helps identify which words are strongly connected to each other, capturing the data's distinctive features.
• Specifically, in self-attention, each embedding vector in the input passes through three fully connected layers to generate three vectors: Query (Q), Key (K), and Value (V). The roles of these vectors can be envisioned as follows: "The query key and value concept come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your query against a set of keys (video title, description etc.) associated with candidate videos in the database, then present you the best matched videos (values)." [18]
• As envisioned in the roles of the three vectors K, V, and Q, to discover the association between the key and the query, we perform a dot-product operation to obtain a representation matrix. This matrix indicates the degree of correlation between each pair of words.
• The next step is to scale this representation matrix down and turn it into a probability distribution for easier computation. Specifically, we scale it down by dividing by the square root of the dimension of the key vector, √d_k. Afterward, we pass the result through a softmax function to convert the values into probabilities.
• Finally, we combine the input with the output, because the architecture has a residual connection linking these two locations. After that, the result is normalized and passed through a pointwise feed-forward network for additional processing. In the end, we obtain a new, more informative representation vector of the data.
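The steps above (Q/K/V projections, scaled dot-product attention, residual connection, normalization, pointwise feed-forward) can be sketched for a single attention head in NumPy. This is a simplified illustration under my own naming and dimensions: real implementations use multiple heads and learned scale/bias parameters in the normalization.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    """Row-wise softmax, turning scores into probability distributions."""
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(X, Wq, Wk, Wv, W1, W2):
    """One simplified (single-head) Transformer encoder block."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # scaled dot-product attention
    X = layer_norm(X + attn)                     # residual connection + norm
    ffn = np.maximum(X @ W1, 0) @ W2             # pointwise feed-forward (ReLU)
    return layer_norm(X + ffn)                   # second residual + norm

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                      # 5 tokens, d-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
out = encoder_block(X, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (5, 8): same shape as the input, but context-enriched
```

Note how the output keeps the input's shape: encoder blocks can therefore be stacked, each pass enriching the token representations with more context.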
d. Advantages and disadvantages
Advantages:
• Effective Handling of Long Input Context: The Transformer excels at processing long sequences of data efficiently, thanks to its attention mechanism, which allows the model to focus on crucial parts of the input data.
• Parallelized Learning Capability: Unlike LSTM or GRU models, the Transformer does not depend on the previous state of the sequence, enabling parallelized training and accelerating the training process.
Disadvantages:
• High Resource Requirements: Transformers often demand a significant amount of computational resources, especially in terms of RAM and GPU, for both training and deployment.
• Overfitting: Due to their large number of parameters, Transformers are prone to overfitting, particularly when trained on a small dataset.
2.3 Image classification problem
The image classification task is one of the most important and common problems in the field of image processing. When employing artificial intelligence techniques to address this task, the goal is to enable computers to automatically categorize images into different labels. The simplest example is classifying images as either dogs or cats, i.e., determining whether the image contains a dog or a cat.
Before the advent of neural networks, traditional machine learning algorithms were primarily used to address this problem. The performance of these algorithms was relatively low, almost inadequate for practical applications. Since the emergence of neural networks, especially CNNs, together with a significant increase in training data and computational resources, this problem has gradually been solved with much higher accuracy. To enhance performance further, it is common to design new CNN architectures that can learn better feature representations, potentially reducing parameters without compromising accuracy.
2.4 Some studies on the problem of detecting road damage
The problem of detecting road damage has long attracted significant attention from research communities and industry. In the initial stages, some studies focused solely on image signal processing using conventional techniques, most prominently filters such as Gabor filters [20] [21]; however, the results achieved did not meet expectations.
In 2016, research on using CNNs to address this problem began [22]. Since then, the research community has paid more attention to the use of neural networks to tackle the issue [23] [24] [25]. Although initial progress has yielded improved results, the cost and complexity of deployment remain factors that hinder widespread adoption.
CHAPTER 3. PROPOSED MODEL
3.1 Overview of the Model
Figure 4. Proposed model architecture
In the proposed model architecture, we employ a hybrid approach using two advanced convolutional neural network (CNN) architectures, ResNet50 and EfficientNetB3, to extract features from images. The primary goal of utilizing both networks is to leverage the unique advantages that each network brings, aiming to enhance the understanding and analysis of the image data. ResNet50 is renowned for its
ability to address the vanishing gradient problem through the use of residual connections,allowing the network to learn deep features without losing important information across
layers. On the other hand, EfficientNetB3 is designed to optimize both performance and accuracy by carefully balancing the depth, width, and resolution of the network.
After extracting features from both networks, we concatenate them using a vector concatenation function. This process generates a consolidated feature vector, combining information from both sources and harnessing the strengths and distinctive characteristics of each model. Next, this consolidated feature vector is divided into 16 patches, and each patch is then fed into a transformer encoder architecture. Using a transformer helps the model focus on important parts of the image and understand the spatial relationships between different parts.
Finally, the representation vector obtained from the transformer is passed through a fully connected (FC) layer. This FC layer is responsible for classification: based on the representation vector, it determines the appropriate label for the image. Thus, the model can classify images into specific labels based on the features and spatial relationships it has learned.
In summary, the combination of ResNet50, EfficientNetB3, and the transformer architecture in this model creates a powerful and flexible approach for analyzing and classifying images, leveraging the strengths of both CNNs and transformers in feature learning and image understanding.
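The data flow just described (two backbone feature vectors, concatenation, split into 16 patches for the transformer encoder) can be made concrete with stand-in arrays. This is a shape-only sketch: the random values substitute for real backbone outputs, and the 2048 and 1536 widths are the standard pooled output sizes of ResNet50 and EfficientNetB3 respectively, not figures stated in the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the pooled/flattened backbone outputs (real values would
# come from the trained ResNet50 and EfficientNetB3 feature extractors).
resnet_features = rng.normal(size=(1, 2048))
effnet_features = rng.normal(size=(1, 1536))

# Concatenate into a single fused feature vector ...
fused = np.concatenate([resnet_features, effnet_features], axis=-1)
print(fused.shape)   # (1, 3584)

# ... then split it into 16 "patches" that act as tokens
# for the transformer encoder (3584 / 16 = 224 dims per token).
tokens = fused.reshape(1, 16, -1)
print(tokens.shape)  # (1, 16, 224)
```

Each of the 16 tokens then receives a positional encoding and passes through the encoder before the final FC classification layer.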
Specifically, we observe that the proposed architecture brings several novel and positive effects:
- Instead of directly dividing the image into patches and passing them through a multilayer perceptron (MLP) as in vision transformers [26], our architecture uses CNNs to extract features first. These features are then passed through a small transformer encoder layer to extract additional information. This approach saves a considerable number of parameters, as the responsibility for feature extraction is distributed between the CNNs and the transformer, with the transformer handling a smaller set of parameters.
Trang 37- The use of two small CNN models for feature extraction, as opposed to one largeCNN model, has its rationale We believe that each CNN model extracts relativelydifferent features Using multiple types of models provides more diverse featureinformation, similar to the concept of multi-model voting, where having more opinionsleads to more reliable information that complements each other Additionally, using alarge model may face the risk of overfitting when trained on a smaller dataset in thefuture.
- Initially, we assumed that when using CNNs, the information extracted from the different networks would somewhat resemble each other and provide sufficient feature information for prediction. Therefore, we considered passing the output of the CNNs directly through transformers (i.e., having two transformer blocks after the two CNNs). However, after multiple experiments, it became evident that the information in the two CNN branches is quite different. Consequently, we decided to concatenate the feature vectors from the two CNN branches and use a single transformer block. The results showed improvement, and this was further supported when visualizing the model's focus areas using GradCam during the experiments.
3.2 ResNet model
Researchers in the field of deep learning used to believe that "the deeper, the better." This makes sense because deeper networks typically tend to represent knowledge better (partly due to having more parameters for representation). However, experiments have shown that after increasing the depth of a model beyond a certain point, the accuracy of the model decreases, and more resources are needed for computation [27].
There are several reasons provided by researchers to explain this phenomenon, but the most convincing one is the vanishing gradient [28]. This is a phenomenon in which the gradient of the loss function becomes very small, to the point of being