NGUYEN MINH LY - 20521592
GRADUATION THESIS
EVALUATING THE EFFECTIVENESS OF CNN-TRANSFORMERS
ARCHITECTURE IN CLASSIFYING ROAD DAMAGE CONDITIONS
BACHELOR OF COMPUTER SCIENCE
ADVISOR: Dr. Le Kim Hung
HO CHI MINH CITY, 2024
Dissertation Defense Committee List
The dissertation defense committee, established according to Decision No. 154/QD-DHCNTT dated March 1, 2023, by the Rector of the University of Information Technology:
- Chairperson
- Secretary
- Member
First of all, I would like to express my sincere gratitude to all the professors and teachers working and teaching at the University of Information Technology - VNU-HCM for the knowledge, lessons, and valuable experiences that I have acquired during my recent journey. I wish the Department of Computer Science in particular, and the University of Information Technology - VNU-HCM in general, continued brilliant success in the field of education, always training talents for the country, and remaining a firm and attractive destination for future generations of students.
I would also like to extend my heartfelt thanks to Dr. Le Kim Hung. Thanks to his experience, lessons, care, assistance, and guidance, I have overcome the difficulties and challenges in the process of completing this graduation thesis.
Next, I would like to express my sincere thanks to my family for always believing in and encouraging me throughout my studies at the University of Information Technology - VNU-HCM, giving me additional motivation to strive for development and achieve the success I have today.

Finally, I would like to thank my fellow students at the University of Information Technology - VNU-HCM for their companionship, assistance, enthusiasm in sharing opinions, and suggestions to help improve and perfect my graduation thesis.
Ho Chi Minh, 2024
Table of Contents

Acknowledgements 4
ABSTRACT 11
CHAPTER 1 INTRODUCTION 12
CHAPTER 2 THEORETICAL BACKGROUND 16
2.1 Overview of Artificial Intelligence, Machine Learning, Deep Learning 16
2.2 Typical Architectures in Deep Learning 21
2.2.1 Neural Network 21
2.2.2 Convolutional Neural Network 25
2.2.3 Transformers 29
2.3 Image classification problem 34
2.4 Some studies on the problem of detecting damaged roads 35
CHAPTER 3 PROPOSED MODEL 36
3.1 Overview of the Model 36
3.2 ResNet model
3.3 EfficientNet
3.4 Transformers
3.5 Zero-shot learning
CHAPTER 4 EXPERIMENTS AND EVALUATION
4.1 Dataset
4.2 Evaluation metrics
4.2.1 Confusion matrix
4.2.2 Recall
4.2.3 Precision
4.2.4 F1 score
4.3 Experiment
4.3.1 Experimental context
4.3.2 Experimental results
CHAPTER 5 SUMMARY AND DEVELOPMENT DIRECTIONS
5.1 The achieved results
List of Figures

Figure 1. Overview, applications, and classification of the fields of AI, ML, and DL 16
Figure 2. The performance differences among various AI and ML model groups 20
Figure 3. Transformers architecture 31
Figure 4. Proposed model architecture 36
Figure 5. The skip connection architecture introduced in ResNet 39
Figure 6. Architecture of the ResNet50 model 40
Figure 7. Ways to scale up neural network architectures 42
Figure 8. The baseline architecture of the EfficientNet model 43
Figure 9. Comparison of the accuracy of the models on the ImageNet dataset 44
Figure 10. The architecture of the EfficientNetB3 model 44
Figure 11. Image classification with zero-shot learning 46
Figure 12. CLIP training process on 400 million image-caption data pairs 48
Figure 13. The process of predicting the label of an image using the CLIP model
Figure 14. Distribution of data in the train, valid, and test sets 51
Figure 15. The ratio of images between the classes in the dataset; 0: non-damaged road class; 1: damaged road class 52
Figure 16. Distribution of images for each country in the train data 52
Figure 17. Distribution of images for each country in the valid data 53
Figure 18. Distribution of images for each country in the test data 53
Figure 19. Some images from the dataset. a, b: images of undamaged roads; c, d: images of damaged roads 54
Figure 20. The loss and accuracy scores of the model during training 63
Figure 21. The GradCAM results display the important regions on an image that the model uses to make predictions 65
List of Tables

Table 1. Details of the number of images in the data 51
Table 2. Values of TP, FN, FP, TN in the confusion matrix 55
Table 3. Simple prompt 59
Table 4. Detailed prompt 59
Table 5. Results of zero-shot learning for each class 60
Table 6. The overall result for zero-shot learning 60
Table 7. The number of parameters for each model 61
Table 8. The detailed accuracy metrics of the proposed model using various measurements 64
Table 9. The results of comparing the accuracy of our model ("Ours") with the CNN models 67
Table 10. Evaluation results 68
List of Abbreviations

AI: Artificial Intelligence
ABSTRACT

Detecting road damage is an essential and crucial task for ensuring road infrastructure, traffic safety, and maintaining the supply chain for the economy. With the rapid technological development in recent years, especially in automated image processing techniques in the field of artificial intelligence, this thesis researches solutions for automatically detecting road damage. The aim is to improve accuracy while keeping computational requirements reasonable for practical deployment, meeting the significant global demand.

In this thesis, I propose a new model architecture that combines convolutional neural networks (CNN) and transformers for image classification problems. This architecture leverages the strength of CNNs in extracting intermediate- to high-level features from input images, and then processes these features through transformer encoder blocks. This process uncovers hidden features in the feature vector that CNNs may not detect, thereby enhancing accuracy for image classification. Moreover, we compile various related datasets to create a comprehensive dataset for evaluating the effectiveness of the proposed solution compared to the best available solutions.
to improving transportation efficiency, especially in the context where road transport accounts for a predominant share of freight transportation in many countries. A reliable statistic shows that the proportion of goods transported by road in European countries accounted for 77.3% in the year 2021 [1].
However, current solutions such as human monitoring through video or image analysis from unmanned aerial vehicles, while partly addressing the issue, face many limitations regarding labor and equipment costs, raising questions about their widespread applicability. Additionally, management agencies often lack the technological expertise to deploy and maintain these advanced and modern systems.

Recently, new solutions, although applying advanced technologies like machine learning and automatic image processing, still cannot fully meet these needs. For example, the crack detection model on roads developed by the University of Tokyo, while achieving high accuracy, requires up to 1500 ms to process a single image when tested on a smartphone. This clearly does not meet the real-time requirements of practical applications [2]. Another solution is the work of Yachao Yuan and colleagues, who attempted to address this issue by combining various simple image processing techniques to increase processing speed [3]. Although the model achieved impressive accuracy and prediction speed, the dataset used was not large and diverse enough to conclusively demonstrate the solution's effectiveness across different road types in various countries. This highlights a significant limitation in developing models that can generalize effectively across various road types worldwide.

Machine Learning (ML) and Artificial Intelligence (AI) have opened a new approach to addressing this issue. The combination of AI with image processing techniques holds great potential for automatically and accurately detecting and classifying road damage. However, developing AI models that can generalize well across various environmental and road conditions remains a significant challenge. Additionally, processing and analyzing the large volume of collected data is a considerable challenge, requiring a smooth integration of advanced image processing techniques and machine learning.
In recent years, the emergence of the transformer architecture has revolutionized the field of artificial intelligence, particularly natural language processing. Many studies have utilized the power of transformers to improve accuracy in image processing tasks, achieving significant advancements. The superiority of this approach is due to the attention mechanism, which allows the model to view the entire context rather than just focusing on local areas, as is the case with traditional CNNs. Therefore, a current trend is to research models that combine both CNNs and transformers to create significant technological advancements.

The research and development of new solutions are necessary both technically and socially. Advanced research is needed to develop AI models capable of quickly, accurately, and efficiently classifying road damage at a reasonable cost. This will play a crucial role in improving the safety and efficiency of the global road traffic system.
In this context, the topic of this thesis aims not only to solve a technical problem but also to contribute to the sustainable development of global road infrastructure. The research and development of AI-based automatic road damage classification solutions is not only a significant step forward in the technical field but also a practical contribution to improving the quality of life and safety of the community.
1.3 Objectives of the Thesis
- Explore the fields of artificial intelligence, machine learning, and deep learning, and their applications.
- Research several existing studies and methods for the problem of classifying road damage conditions.
- Focus on designing and testing architectures that combine Convolutional Neural Networks (CNN) and Transformers. The goal is to create an optimal solution for the road damage classification problem, leveraging the advantages of both architectures: the powerful image feature recognition capability of CNNs and the ability of Transformers to process sequential, complex data.
- Compare and evaluate the performance of the proposed model against models that use only the CNN architecture.
- Evaluate the ability to deploy the proposed model on an edge device.
1.4 Subjects and Scope of Research
1.4.1 Research Subjects
This thesis focuses on:
- Modern network architectures in machine learning and artificial intelligence, including ANN (Artificial Neural Networks), CNN (Convolutional Neural Networks), and Transformers.
- The problem of image classification for the detection of damaged road surfaces in photographs.
- The feasibility of deploying these models in practical applications, particularly on devices with limited hardware capabilities.
1.4.2 Scope of Research
The scope of research includes:
- Research on existing solutions to address the problem of road damage classification.
- Research on effective models in the field of image processing, specifically for image classification problems.
CHAPTER 2 THEORETICAL BACKGROUND
2.1 Overview of Artificial Intelligence, Machine Learning, Deep Learning
Figure 1. Overview, applications, and classification of the fields of AI, ML, and DL. (Artificial Intelligence: a technique which enables machines to mimic human behaviour. Machine Learning: a subset of AI techniques which uses statistical methods to enable machines to improve with experience. Deep Learning: a subset of ML which makes the computation of multi-layer neural networks feasible.)
Artificial Intelligence: In the field of computer science, artificial intelligence (AI), sometimes referred to as synthetic intelligence, is intelligence exhibited by machines, contrasting with the natural intelligence of humans. Typically, the term "artificial intelligence" is used to describe server (or computer) systems capable of mimicking "cognitive" functions often associated with the human mind, such as "learning" and "problem-solving" [4].
Artificial intelligence systems are classified based on their ability to replicate
human characteristics and are broadly divided into three main types as follows:
- Narrow Artificial Intelligence (ANI): At this level, artificial intelligence can only solve problems in a specialized domain, such as image classification or spam email filtering.
- Artificial General Intelligence (AGI): This type of intelligence is similar to human capabilities, meaning it can perform tasks that humans can do and can be considered a miniature representation of human intelligence.
- Artificial Super Intelligence (ASI): At this level, artificial intelligence surpasses human intelligence in its capabilities.
Machine Learning: Machine learning is a field of artificial intelligence that involves researching and developing algorithms to enable computer systems to "learn" automatically from data in order to solve specific problems. Its goal is to build a model based on training data to make predictions or decisions without explicit programming. For example, machines can "learn" how to classify emails as spam or not and automatically organize them into the corresponding folders. Machine learning is closely related to statistical inference, although there are differences in terminology [5].
To better understand how to create a machine learning model, we need the following elements:

- Datasets: A dataset consists of a set of entities or samples sharing the same structure. Creating a good dataset requires significant time, effort, and sometimes financial investment. A good dataset trains good machine learning models, making it arguably the most crucial part of model training.
- Features: Features of the data are essential for the model, considered the key factors used to represent the problem being solved. Features are often characteristics of an observed phenomenon. Selecting informative, differentiating, and independent features for training is crucial to creating effective models for tasks such as recognition, classification, or regression.
© Algorithm: Algorithms are the definitions and constraints in the learning
and development process of a model automatically, without the need forspecific task programming There are various algorithms, from classical tomodern ones, to solve different problems such as linear regression, k-
means, deep learning networks, and recommendation systems The
accuracy and speed of predictions depend on the type of algorithm used
Machine learning is categorized into:
- Supervised Learning: The model is trained on pre-labeled data. Since the model is fine-tuned to fit the accurately prepared data (labels), after training it represents the characteristics of the data accurately, meeting the user's needs.
- Unsupervised Learning: The model is trained on unlabeled data. The goal is to find features and hidden knowledge in the dataset. As there are no labels for evaluation, this method may sometimes produce models that do not meet expectations.
- Semi-supervised Learning: Uses a dataset with some labeled and some unlabeled data. The labeled data guides the model according to the creator's purpose. Due to the relatively low cost of collecting unlabeled data, the model is trained on a large dataset, gaining a better understanding of the data and often yielding better prediction results.
- Reinforcement Learning: Considered the most ambitious learning approach, predicted to create super artificial intelligence models in the future. Reinforcement learning studies how an agent should choose actions to maximize a reward; it focuses on maximizing cumulative rewards over time.
Deep Learning: Deep learning is a subset of machine learning built on the principles of artificial neural networks. It mimics how the biological brain processes information through interconnected nodes and distributed networks. While it originates from the biological brain, artificial neural networks in deep learning still have distinctive features, reflecting clear differences between natural and artificial structures [6].

Artificial neural networks in deep learning are constructed with multiple layers, each consisting of a large number of artificial neurons. Each neuron receives input, processes information, and transmits signals to other neurons. The training process in deep learning involves adjusting the weights of the connections between neurons to optimize the accuracy of the model when making predictions for input data.

An important aspect of deep learning is the ability to learn feature representations. Instead of manually defining the features needed for processing (as in traditional machine learning algorithms), deep learning models can automatically discover and refine complex features from input data. This enables the model to adapt to complex and diverse types of data.

While inspired by the biological brain, artificial neural networks have clear differences from their natural counterparts. The biological brain is a dynamic, flexible, and continuously evolving system, whereas artificial neural networks are often designed with static and highly symbolic structures. This means that although capable of learning and adapting, artificial neural networks cannot autonomously develop or change their fundamental structure like the human brain.

Deep learning has become a powerful tool in various fields, from image recognition to analyzing complex data. Its ability to self-learn and adapt to input data makes deep learning an ideal choice for solving highly complex problems, such as the road damage classification addressed in this thesis.

In summary, deep learning is not only a technology that simulates the workings of the biological brain but also an advanced method with the ability to self-learn and adapt, opening up new possibilities for solving complex real-world problems.
Figure 2. The performance differences among various AI and ML model groups: as the amount of data grows, deep learning algorithms outperform traditional machine learning algorithms.
The figure above illustrates the difference between traditional machine learning models (other approaches) and neural networks. As the amount of data and computational resources increases, scaling up neural networks allows the creation of models with superior accuracy. In the current era, with data growing exponentially, neural network models increasingly demonstrate their dominance in automated tasks.
2.2 Typical Architectures in Deep Learning
2.2.1 Neural Network
a Overview
Neural Network, also known as ANN (Artificial Neural Network), is an algorithm inspired by the functionality of the human brain, used to create models capable of solving common problems without the need for detailed programming. To illustrate: when we see images around us through our eyes (which can be considered as data), signals from the eyes are transmitted to neurons in the brain (akin to the smallest elements in a neural network) to process and make decisions about those images (output). In reality, for a neural network to process such information, it must undergo a training process to acquire the necessary knowledge before being used to predict outcomes. ANN can be used to solve various problems, such as pattern recognition, image classification, speech recognition, etc., as long as suitable data is available for training.

Mathematically, an ANN can be viewed as a function y = f(x), where x is the input data, and the role of f is to map x to a predicted result.
b Architecture of ANN

ANN consists of 3 layers:

- Input layer: serves as the unit representing input information, used to feed input data into the network.
- Hidden layer: In an ANN, to achieve high accuracy, networks are often designed with multiple hidden layers. Each neuron in these layers is connected to neurons in the preceding and following layers, facilitating the combination of feature information from the data and the learning of hidden knowledge from it.
- Output layer: serves as the unit representing output information. Signals from the previous layers are transmitted to this layer to provide the predicted result of the model.
c Neural architecture in ANN

A fundamental unit in each layer of an ANN is called a neuron (or node, unit). It receives input from the preceding layer and computes an output. Each input received by the neuron has a corresponding weight, representing the contribution of that input to solving the given problem. The goal of training an ANN model is to find suitable values for these parameters so that when predicting a new data sample, the expected result is obtained. The output of a neuron is calculated using the dot product of the weights and the corresponding inputs.

Formula to calculate the output of a neuron:

y = Σ_i (w_i × x_i) + b

However, because of the limited representational capability of linear functions, it becomes challenging to map inputs to correct predictions in complex problems. Therefore, we need to make this representation nonlinear. The solution is to pass the output through a nonlinear function called an activation function. Commonly used activation functions include ReLU, Sigmoid, and Tanh. With an activation function, the formula for the output of the neuron becomes:

y = f(Σ_i (w_i × x_i) + b), where f is the activation function.
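To make the formula concrete, here is a minimal NumPy sketch of a single neuron's output with ReLU as the activation function f; the input values, weights, and bias are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def neuron_output(x, w, b):
    """Output of a single neuron: y = f(sum_i w_i * x_i + b), with f = ReLU."""
    return relu(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 3.0])   # example inputs (illustrative values)
w = np.array([0.5, 0.25, -0.1])  # learned weights (illustrative values)
b = 0.2                          # bias
y = neuron_output(x, w, b)       # weighted sum is -0.1, so ReLU gives 0.0
```

Without the ReLU, the neuron would be purely linear; the activation is what lets stacked layers represent nonlinear mappings.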
Trang 22d Neural network training process
As mentioned earlier, the training process involves updating the weights in each neuron to fit the data. Therefore, during training, the model goes through two stages:

Feed Forward Propagation: This is the process of calculating the output of the network by passing input data through the network in the forward direction (from input to output). In the initial stage of training, the output values often deviate significantly from the actual labels because the model has not yet learned from the data. After obtaining the prediction results, they are compared with the actual labels, and the prediction error is calculated. The goal is to update the weights in such a way that, after training, this error is minimized.

Mathematical representation of the loss function:

Loss = g(y_label, y_predict)

In which:

- y_label: the ground truth
- y_predict: the prediction of the model
- g: a function that measures the error between predicted results and ground truth.
Back Propagation: This is the process of updating the model's weights. The challenge is to find weights for which the loss function has the smallest value. An effective way to achieve this is to compute the partial derivative of the loss function with respect to each weight and update the weight in the direction opposite to the gradient. The chain rule is used to facilitate the calculation of these derivatives. This weight-updating process is repeated many times until the loss reaches the smallest possible value.
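As a minimal sketch of these two stages, the following trains a single linear neuron with mean squared error as the loss function g and plain gradient descent; the toy data, learning rate, and choice of loss are illustrative assumptions:

```python
import numpy as np

# Toy data for the target relation y = 2x, modeled by one linear neuron y = w*x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_label = 2.0 * x

w = 0.0     # initial weight
lr = 0.01   # learning rate

for _ in range(500):
    # Feed forward propagation: predictions with the current weight.
    y_pred = w * x
    # Loss = g(y_label, y_pred), here mean squared error.
    loss = np.mean((y_pred - y_label) ** 2)
    # Back propagation: dLoss/dw, derived with the chain rule.
    grad = np.mean(2.0 * (y_pred - y_label) * x)
    # Update the weight opposite to the gradient direction.
    w -= lr * grad

# After training, w is close to 2 and the loss is near zero.
```

Each iteration runs one forward pass to compute the loss and one backward step that moves the weight against the gradient, which is exactly the cycle described above.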
e Summary of advantages and disadvantages of ANN
The architecture of Artificial Neural Networks (ANN) has the following advantages:

- Self-learning and adaptability: ANN can learn from data, extract relevant features, and adjust its network weights without the need for specific programming.
- Non-linear data processing: Thanks to activation functions, ANN is powerful in handling non-linear models, making it suitable for various complex problems.
- Automatic feature extraction: In deep learning, the hidden layers of an ANN can automatically learn complex features from data, reducing the need for manual feature selection techniques.
However, it also has some notable drawbacks:

- Data issues: ANN tends to be less effective when trained on small, less diverse datasets.
- Training complexity: Training an ANN, especially deep models with multiple layers, can be complex and requires significant computational resources.
- Sensitivity: ANN is sensitive to small changes in input data, requiring careful data preprocessing.
2.2.2 Convolutional Neural Network
Convolutional Neural Network [7] is a well-known form of a Neural Network,widely used in various tasks related to image processing such as image classification,object detection, face recognition, etc., due to its ability to learn powerful features
a Convolution layer
Convolution is an important operation that has played a significant role since the early days of digital signal processing. Finding suitable filters for each type of signal and each problem has been extensively researched and taught in technical curricula. In the late 1980s, Yann LeCun proposed a two-dimensional convolutional model for image data, achieving great success in handwritten digit classification tasks [8]. Since then, many successes have followed, with numerous studies leading to neural network architectures that use convolutional operations to great effect in image processing. In the context of this thesis, we will delve deeper into two-dimensional convolution to illustrate how neural networks operate on images.

The 2D convolution operation is one of the most crucial components of a CNN, widely employed in processing and learning from image data. This operation not only helps the model learn essential features from the image but also preserves the spatial structure of the data. In a CNN, the 2D convolution is performed by sliding a filter (kernel) over the input image from left to right and top to bottom. This filter is typically a small matrix (e.g., 3x3 or 5x5) containing weights that need to be learned.
Let us assume I is the input matrix (e.g., an image) and K is the filter (kernel) with dimensions m×n. The 2D convolution operation is defined as follows:

(I * K)(i, j) = Σ_{u=0}^{m-1} Σ_{v=0}^{n-1} I(i + u, j + v) × K(u, v)

In which:

- (i, j) represents the position in the output matrix;
- I(i + u, j + v) and K(u, v) are the values at the corresponding positions in the input matrix and the filter, respectively.
Some other factors to consider in CNN:

- Stride: During the convolution operation, the value of the stride is also important. Stride is the number of pixels the kernel moves after each convolution operation on the image. With a stride of 1, the kernel moves one unit at a time across the input image.
- Padding: Sometimes the filter size does not perfectly match the input image; in such cases, additional columns or rows with zero values are often added around the input image matrix. This is called padding.
- Activation function: From the convolution formula, it is easy to see that the convolution operation is also a linear function, as mentioned in the ANN section. Linear functions are often not powerful enough to represent complex data. Therefore, in CNN, an activation function is added to enhance the network's representational capacity.
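The sliding-window computation, together with stride and zero padding, can be sketched as a naive loop-based implementation; real frameworks use heavily optimized equivalents, and the example image and kernel below are arbitrary:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution (cross-correlation form, as used in CNNs):
    out(i, j) = sum_u sum_v image(i*stride + u, j*stride + v) * kernel(u, v)."""
    if padding > 0:
        # Zero padding around the border of the input.
        image = np.pad(image, padding, mode="constant", constant_values=0.0)
    m, n = kernel.shape
    out_h = (image.shape[0] - m) // stride + 1
    out_w = (image.shape[1] - n) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride + m, j*stride:j*stride + n]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # arbitrary 2x2 filter
feat = conv2d(image, kernel)                       # 3x3 feature map
```

Increasing the stride shrinks the output, while padding preserves the border: `conv2d(image, kernel, padding=1)` yields a 5x5 map and `conv2d(image, kernel, stride=2)` a 2x2 map.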
b Pooling layer
Because the convolution operation is performed on small parts of the image, larger images require more resources and computational time. Therefore, the pooling layer is used to reduce the size of the input image or feature maps. There are three types of pooling:

- Max pooling: Divides the feature map into small regions of size N×N, moving from top to bottom and left to right, then takes the maximum value in each N×N region as the representative value.
- Average pooling: Similar to max pooling in the division of regions, but the representative value for each region is calculated as the average of the values.
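Both pooling variants can be sketched in a few lines; the feature-map values below are arbitrary:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample by taking the max (or mean) of each non-overlapping
    size x size region, scanning top-to-bottom and left-to-right."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            region = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = region.max() if mode == "max" else region.mean()
    return out

fm = np.array([[ 1.0,  2.0,  5.0,  6.0],
               [ 3.0,  4.0,  7.0,  8.0],
               [ 9.0, 10.0, 13.0, 14.0],
               [11.0, 12.0, 15.0, 16.0]])
pooled_max = pool2d(fm, 2, "max")   # [[4, 8], [12, 16]]
pooled_avg = pool2d(fm, 2, "mean")  # [[2.5, 6.5], [10.5, 14.5]]
```

With size 2, each output is half the input in each dimension, which is what makes subsequent layers cheaper to compute.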
c Characteristics of CNN

Translational Equivariance:

Due to the use of shared weights (the kernel), when an object appears at any position in the image, the convolution operation still produces the object's features at the corresponding position [9].

In mathematical terms, a function f(x) is considered equivariant if f(g(x)) = g(f(x)), where g is a translation function. Therefore, applying the translation g to an image and then performing convolution yields the same result as performing convolution first and then translating.

Thanks to this property, with a cat as the input image, regardless of its position or breed, a CNN can still extract features specific to the cat, allowing it to detect the object in the image as a cat.
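This property can be checked numerically: the same small object placed at two positions yields feature maps that differ only by the same shift. The toy image and kernel are assumptions for illustration, and SciPy's `correlate2d` stands in for a convolution layer:

```python
import numpy as np
from scipy.signal import correlate2d

# The same bright 2x2 "object" placed at two positions in an 8x8 image.
img_a = np.zeros((8, 8)); img_a[1:3, 1:3] = 1.0
img_b = np.zeros((8, 8)); img_b[4:6, 4:6] = 1.0   # shifted by (3, 3)

kernel = np.ones((2, 2))                            # shared-weight filter
feat_a = correlate2d(img_a, kernel, mode="valid")   # 7x7 feature map
feat_b = correlate2d(img_b, kernel, mode="valid")

# Shifting the input shifts the output by the same amount: f(g(x)) = g(f(x)).
assert np.allclose(feat_a[0:4, 0:4], feat_b[3:7, 3:7])
```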
Translational Invariance:

The concept of translational invariance differs slightly from translational equivariance. The goal of this property is that regardless of the subject's position in the image, its features will ultimately be pulled towards the center of the representation, where the model is most likely to focus [10].

This property is achieved through the operation of the pooling layer. For example, max pooling reduces the size of the output (feature map) by half when using pooling with a size of 2. The representative value for each region is the maximum, so the stronger features of the image are preserved in the end.
Local Receptive Field:

In each movement of the kernel over the image, it only "looks" at a portion of the image as input for the convolution operation. Therefore, in each calculation it observes only a local region of the image, commonly referred to as the local receptive field [11].

Thanks to this property, we can design networks with fewer parameters, saving resources and computational costs. Moreover, with fewer parameters the model is less prone to overfitting when trained on a dataset that is not excessively large, and it converges faster.
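The parameter savings from the local receptive field and weight sharing are easy to quantify; the 32x32 image size below is an arbitrary example:

```python
# Parameter count for mapping a 32x32 input to a 32x32 feature map.

# Fully connected: every output unit connects to every input pixel.
h = w = 32
dense_params = (h * w) * (h * w) + (h * w)   # weights + biases = 1,049,600

# Convolution: one shared 3x3 kernel plus a bias scans the whole image.
conv_params = 3 * 3 + 1                      # = 10
```

The convolutional layer needs roughly five orders of magnitude fewer parameters here, which is why CNNs remain tractable on large images.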
d Disadvantages of CNN architecture
Alongside its strengths, CNN still has the following drawbacks:

- To achieve good performance, the model still requires a relatively large dataset.
- It is unable to provide a convincing explanation of the mechanism through which the CNN model learns and makes decisions, which is particularly crucial in sensitive applications requiring high accuracy, such as healthcare.
- In some cases, the identification or classification of objects in data with more complex structures may not be effective.
- Training a CNN model, especially deep and complex ones, demands substantial computational resources and lengthy training times.
- Due to the local receptive field, it does not capture the entire context of an image at once, potentially not fully utilizing the information present in the data.
2.2.3 Transformers
a Introduction
In recent years, the Transformer architecture has emerged as a groundbreaking advancement in the field of machine learning, particularly in applications related to natural language processing (NLP). Initially introduced by Vaswani et al. in the 2017 paper "Attention is All You Need" [12], this architecture has quickly demonstrated its superior strength, particularly in handling long and complex sequences, a challenge faced by earlier models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) [13].

The core difference of Transformers compared to traditional models lies in the use of the "Self-Attention" mechanism, which allows the model to attend to the entire input to better understand context and meaning. This not only improves accuracy in tasks such as machine translation, text classification, and text generation but also opens up the capability to handle much more complex tasks.
Additionally, the Transformer architecture brings computational efficiency benefits compared to RNNs, LSTMs, or GRUs. Thanks to its non-sequential structure, it can parallelize processing, significantly reducing training time compared to models based on RNN or LSTM. This has changed the approach to model design in much research and many applications, shifting the focus from improving sequential processing to optimizing parallel processing capabilities.
Furthermore, Transformers have extended their influence beyond the field of NLP. Recent studies have demonstrated their capability to address problems in computer vision, including tasks such as image recognition [14] [15], object detection [16], and segmentation [17]. Marking a significant milestone in constructing versatile and powerful deep learning models, the flexibility and efficiency of the Transformer architecture have not only opened new avenues for language processing methodologies but also served as a crucial step for integration and development in various other machine learning domains.
Therefore, the Transformer architecture is not only a turning point in the history of machine learning development but also a key to unlocking new possibilities in addressing the complex challenges of the real world.
Figure 3. Transformers architecture
Transformers consist of two parts: the encoder and the decoder. While the encoder is responsible for extracting and synthesizing features from the input data, the decoder focuses on decoding these features into signals that humans can comprehend, commonly text. Within the scope of this thesis, since only the encoder part of the Transformer is used for image classification, the in-depth analysis will concentrate on this component.
b. Input embedding and positional encoding
Input embedding is a step that transforms input data into vectors so that the model can perform computations. Specifically:
• For text data, each token (word or phrase) is typically represented as a vector.
• For image data, the image is usually divided into regions, and each region is embedded into a vector (by flattening it, passing it through a neural network, etc.).
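The image-embedding step can be sketched in a few lines of NumPy. This is only an illustrative example, not code from the thesis: the function name, the 8×8 patch size, and the random projection matrix standing in for a learned linear layer are all my own choices.

```python
import numpy as np

def embed_image_patches(img, patch, W):
    """Split an image into non-overlapping patches, flatten each patch,
    and project it to an embedding vector with a linear map W."""
    H, Wd, C = img.shape
    flat = (img.reshape(H // patch, patch, Wd // patch, patch, C)
               .transpose(0, 2, 1, 3, 4)          # gather each patch's pixels together
               .reshape(-1, patch * patch * C))   # (num_patches, patch*patch*C)
    return flat @ W                               # (num_patches, d_model)

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))                # toy 32x32 RGB image
W = rng.normal(size=(8 * 8 * 3, 64))              # flattened 8x8 patch -> 64-dim embedding
emb = embed_image_patches(img, 8, W)
print(emb.shape)  # (16, 64): 16 patches, each embedded into 64 dimensions
```

In a real model, `W` would be learned (or replaced by a small CNN), but the reshape-and-project structure is the same.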
Positional encoding: In the case of Transformers, the data is input all at once (unlike networks such as RNN or GRU, where data is input sequentially). Therefore, we need to add information about the position of each input vector. This is because, in text data, a change in the order of tokens can lead to a misinterpretation of a sentence's meaning. In the paper "Attention Is All You Need", the following method is used to represent positional information:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:
• pos is the position of the token in the sequence
• i is the dimension-pair index, ranging from 0 to d_model/2 − 1
• d_model is the number of dimensions in the embedding vector for one token
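The sinusoidal scheme from the cited paper can be implemented directly in NumPy. The sketch below is illustrative (the function name is my own); it fills even embedding dimensions with sines and odd ones with cosines.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need".
    Returns an array of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]           # token positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]        # dimension-pair index
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(16, 64)
print(pe.shape)  # (16, 64)
```

The resulting matrix is simply added to the input embeddings, giving each token a unique, smoothly varying position signature.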
c. Transformers Encoder
The Transformer encoder is an architecture used to map input data into feature vectors. During this process, it leverages information from the entire input to examine relationships within the data. As a result, the feature vectors become richer in knowledge. The Transformer encoder consists of two main components: multi-head attention and fully connected layers. It also incorporates residual connections between blocks to mitigate the issue of vanishing gradients.
Multi-head attention:
• In the encoder architecture, multi-head attention utilizes a special type of attention called self-attention. Self-attention allows the model to learn relationships between each word in the input and the rest of the words. This helps identify which words are strongly connected to each other, capturing the data's distinctive features.
• Specifically, in self-attention, each embedding vector in the input passes through three fully connected layers to generate three vectors: Query (Q), Key (K), and Value (V). The roles of these vectors can be envisioned as follows: "The query key and value concept come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your query against a set of keys (video title, description etc.) associated with candidate videos in the database, then present you the best matched videos (values)." [18]
• As envisioned in the roles of the three vectors K, V, and Q, to discover the association between the key and the query, we perform a dot-product operation to obtain a representation matrix. This matrix indicates the degree of correlation between each pair of words.
• The next step is to scale this representation matrix down and turn it into a probability distribution for easier computation. Specifically, we scale it down by dividing by the square root of the dimension of the key vector, √d_k. Afterward, we pass the result through a softmax function to convert the values into probabilities.
• Finally, we combine the input with the output, because the architecture has a residual connection linking these two locations. After that, the result is normalized and passed through a pointwise feed-forward network for additional processing. In the end, we obtain a new, more informative representation vector of the data.
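The steps above (Q/K/V projections, scaled dot-product attention, residual connection, normalization, pointwise feed-forward) can be sketched for a single attention head in NumPy. This is a simplified illustration under my own naming and dimensions: real implementations use multiple heads and learned scale/bias parameters in the normalization.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    """Row-wise softmax, turning scores into probability distributions."""
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(X, Wq, Wk, Wv, W1, W2):
    """One simplified (single-head) Transformer encoder block."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # scaled dot-product attention
    X = layer_norm(X + attn)                     # residual connection + norm
    ffn = np.maximum(X @ W1, 0) @ W2             # pointwise feed-forward (ReLU)
    return layer_norm(X + ffn)                   # second residual + norm

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                      # 5 tokens, d-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
out = encoder_block(X, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (5, 8): same shape as the input, but context-enriched
```

Note how the output keeps the input's shape: encoder blocks can therefore be stacked, each pass enriching the token representations with more context.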
d. Advantages and disadvantages
Advantages:
• Effective Handling of Long Input Context: The Transformer excels at processing long sequences of data efficiently, thanks to its attention mechanism, which allows the model to focus on crucial parts of the input data.
• Parallelized Learning Capability: Unlike LSTM or GRU models, the Transformer does not depend on the previous state of the sequence, enabling parallelized training and accelerating the training process.
Disadvantages:
• High Resource Requirements: Transformers often demand a significant amount of computational resources, especially in terms of RAM and GPU, for both training and deployment.
• Overfitting: Due to their large number of parameters, Transformers are prone to overfitting, particularly when trained on a small dataset.
2.3 Image classification problem
The image classification task is one of the most important and common problems in the field of image processing. When employing artificial intelligence techniques to address this task, the goal is to enable computers to automatically categorize images into different labels. The simplest example is classifying images as either dogs or cats, i.e., determining whether the image contains a dog or a cat.
Before the advent of neural networks, traditional machine learning algorithms were primarily used to address this problem. The performance of these algorithms was relatively low, almost inadequate for practical applications. Since the emergence of neural networks, especially CNNs, together with a significant increase in training data and computational resources, this problem has gradually been solved with much higher accuracy. To enhance performance further, it is common to design new CNN architectures that can learn better feature representations, potentially reducing parameters without compromising accuracy.
2.4 Some studies on the problem of detecting road damage
The problem of detecting road damage has long attracted significant attention from research communities and industry. In the initial stages, some studies focused solely on image signal processing using conventional techniques, most prominently filters such as Gabor filters [20] [21]; however, the results achieved did not meet expectations.
In 2016, research on using CNNs to address this problem began [22]. Since then, the research community has paid more attention to the use of neural networks to tackle the issue [23] [24] [25]. Although initial progress has yielded improved results, the cost and complexity of deployment remain factors that hinder widespread adoption.
CHAPTER 3. PROPOSED MODEL
3.1 Overview of the Model
Figure 4. Proposed model architecture
In the proposed model architecture, we employ a hybrid approach using two advanced convolutional neural network (CNN) architectures, ResNet50 and EfficientNetB3, to extract features from images. The primary goal of utilizing both networks is to leverage the unique advantages that each network brings, aiming to enhance the understanding and analysis of the image data. ResNet50 is renowned for its
ability to address the vanishing gradient problem through the use of residual connections,allowing the network to learn deep features without losing important information across
layers. On the other hand, EfficientNetB3 is designed to optimize both performance and accuracy by carefully balancing the depth, width, and resolution of the network.
After extracting features from both networks, we concatenate them using a vector concatenation function. This process generates a consolidated feature vector, combining information from both sources and harnessing the strengths and distinctive characteristics of each model. Next, this consolidated feature vector is divided into 16 patches, and each patch is then fed into a transformer encoder architecture. Using a transformer helps the model focus on important parts of the image and understand the spatial relationships between different parts.
Finally, the representation vector obtained from the transformer is passed through a fully connected (FC) layer. This FC layer is responsible for classification: based on the representation vector, it determines the appropriate label for the image. Thus, the model can classify images into specific labels based on the features and spatial relationships it has learned.
In summary, the combination of ResNet50, EfficientNetB3, and the transformer architecture in this model creates a powerful and flexible approach for analyzing and classifying images, leveraging the strengths of both CNNs and transformers in feature learning and image understanding.
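The data flow just described (two backbone feature vectors, concatenation, split into 16 patches for the transformer encoder) can be made concrete with stand-in arrays. This is a shape-only sketch: the random values substitute for real backbone outputs, and the 2048 and 1536 widths are the standard pooled output sizes of ResNet50 and EfficientNetB3 respectively, not figures stated in the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the pooled/flattened backbone outputs (real values would
# come from the trained ResNet50 and EfficientNetB3 feature extractors).
resnet_features = rng.normal(size=(1, 2048))
effnet_features = rng.normal(size=(1, 1536))

# Concatenate into a single fused feature vector ...
fused = np.concatenate([resnet_features, effnet_features], axis=-1)
print(fused.shape)   # (1, 3584)

# ... then split it into 16 "patches" that act as tokens
# for the transformer encoder (3584 / 16 = 224 dims per token).
tokens = fused.reshape(1, 16, -1)
print(tokens.shape)  # (1, 16, 224)
```

Each of the 16 tokens then receives a positional encoding and passes through the encoder before the final FC classification layer.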
Specifically, we observe that the proposed architecture brings several novel and positive effects:
- Instead of directly dividing the image into patches and passing them through a multilayer perceptron (MLP) as in vision transformers [26], our architecture uses CNNs to extract features first. These features are then passed through a small transformer encoder layer to extract additional information. This approach saves a considerable number of parameters, as the responsibility for feature extraction is distributed between the CNNs and the transformer, with the transformer handling a smaller set of parameters.
Trang 37- The use of two small CNN models for feature extraction, as opposed to one largeCNN model, has its rationale We believe that each CNN model extracts relativelydifferent features Using multiple types of models provides more diverse featureinformation, similar to the concept of multi-model voting, where having more opinionsleads to more reliable information that complements each other Additionally, using alarge model may face the risk of overfitting when trained on a smaller dataset in thefuture.
- Initially, we assumed that when using CNNs, the information extracted from the different networks would somewhat resemble each other and provide sufficient feature information for prediction. Therefore, we considered passing the output of the CNNs directly through transformers (i.e., having two transformer blocks after the two CNNs). However, after multiple experiments, it became evident that the information in the two CNN branches is quite different. Consequently, we decided to concatenate the feature vectors from the two CNN branches and use a single transformer block. The results showed improvement, and this was further supported when visualizing the model's focus areas using GradCam during the experiments.
3.2 ResNet model
Researchers in the field of deep learning used to believe that "the deeper, the better." This makes sense because deeper networks typically tend to represent knowledge better (partly due to having more parameters for representation). However, experiments have shown that after increasing the depth of a model beyond a certain point, the accuracy of the model decreases, and more resources are needed for computation [27].
There are several reasons provided by researchers to explain this phenomenon, but the most convincing one is the vanishing gradient [28]. This is a phenomenon in which the gradient of the loss function becomes very small, to the point of being