Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông

Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .Giải pháp học thích ứng trên nền tảng mạng học sâu ứng dụng nhận dạng đối tượng tham gia giao thông .

Introduction

Artificial intelligence (AI) refers to the intelligence exhibited by artificial systems and has become ubiquitous in modern life It is integrated into various applications, including office software, automated customer service systems, intelligent traffic control, and smart home technologies As computer hardware has advanced, AI has significantly progressed and is now widely utilized across numerous sectors of society.

Artificial intelligence aims to create algorithms and applications that enhance human decision-making and facilitate autonomous decisions during data identification and acquisition Key research areas include object detection, object action recognition, and human action recognition, which are applied in various domains such as security surveillance, remote control systems, assistive technologies for the visually impaired, sports analytics, automated robotics, and self-driving vehicles Numerous studies have introduced diverse solutions for AI development, including heuristic algorithms, evolutionary algorithms, Support Vector Machines, Hidden Markov Models, expert systems, and neural networks However, traditional methods often necessitate significant human intervention and large datasets for analysis, resulting in low accuracy and limited identification capabilities.

3.2.2.158To overcome those shortcomings, machine learning with focusing on Deep

Learning Method (Deep Learning) is now being applied in artificial intelligence in terms of object detection and action recognition.

Deep Learning, a prominent subset of machine learning, has sparked extensive debate in the AI community This technology aims to enhance artificial neural networks, leading to significant advancements in areas like voice recognition, image recognition, and natural language processing In recent years, Deep Learning has driven remarkable progress in challenging fields such as object perception and machine translation, making previously complex tasks more accessible for artificial intelligence researchers.

3.2.2.160However, despite of the fact that issues related to AI were solved, Deep

Learning has still remained limitations that need to be settled.

- Firstly, to create a system capable of identifying a variety of objects, a huge amount of

Deep Learning requires 15 input data points to facilitate computer learning, a process that demands significant time and the support of powerful processors typically found in large server systems.

Deep Learning currently struggles to recognize complex social interactions and similar objects due to a lack of advanced technology that enables logical recognition Additionally, integrating abstract knowledge into machine learning systems presents significant challenges, particularly in understanding object identity, usage, and human interaction Consequently, machine learning has not yet achieved the level of common knowledge that humans possess.

The inquiry focuses on how a machine learning system can autonomously acquire, select, and update knowledge to construct a cohesive data set similar to human capabilities Research in Adaptive Learning offers potential solutions to address the limitations of Deep Learning, particularly in areas that remain unaddressed by current Deep Learning methodologies.

An advanced Adaptive Learning model enables an autonomous robotic system to achieve self-learning and self-intelligence, mimicking human brain functions As the device operates, its intelligence progressively enhances, allowing the system to automatically select relevant data while continuously retraining and updating its model to replace outdated versions.

The proposed Adaptive Learning model shows significant potential for application in various Auto Robot systems, particularly in the context of self-driving vehicles This doctoral research will involve studying and experimenting with the operational processes of these autonomous vehicles Key recognition objects for self-driving cars include traffic elements such as other vehicles (motorcycles, cars, trucks, and passenger cars), pedestrians, traffic signs, and roadside features.

Research goal

The objective of this thesis is to explore artificial intelligence, focusing on the methods and algorithms utilized in the field It aims to assess the limitations of existing techniques and propose enhanced solutions to improve the efficiency and accuracy of AI in object detection.

- Study, analyze and evaluate traditional methods: Support Vector Machine, Hidden Markov Model, Neural network, and so on.

- Study and evaluate the application of Deep Learning in classification and object detection in traffic (Pedestrians, traffic vehicles, traffic signs, etc.).

To improve the performance of Deep Learning models utilizing an Adaptive Learning approach, it is essential to conduct experiments focusing on adaptive learning techniques and hyperparameter optimization specifically for self-driving vehicles (ADAS) Implementing tailored adaptive learning strategies can significantly enhance model accuracy and responsiveness, while systematic hyperparameter tuning will refine the learning process By integrating these solutions, we can achieve more robust and efficient deep learning systems for autonomous driving applications.

- Develop data sets for training and recognizing objects in traffic.

Research method

The information collection method involved gathering foundational materials on algorithms and artificial intelligence, along with documents and articles focused on Deep Learning, Adaptive Learning, and object detection Experimental data were sourced from real-time traffic cameras and online videos.

3.2.2.168-Comparison method: Summary and comparison between the obtained documents to provide an overview of the methods, advantages and disadvantages of those methods as well.

- Analysis method: Analyze the algorithms, their operation and characteristics The effectiveness of the algorithms applied to specific cases is evaluated and analyzed to get the best results.

- Expert method: Consult from AI experts to complete the area need to be studied.

- Experimental method: Installing and testing algorithms applied to each method for a better understanding From this, the advantages and disadvantages of each method are then evaluated and verified.

- Conduct experiments on Google's machine learning open-source system (TensorFlow), MathWorks (Matlab) to have comparison with the results of research experiments.

To develop and evaluate the proposed algorithms, it is essential to gather and create real empirical data sets that include objects in traffic such as pedestrians, vehicles, and traffic signs These data sets are compiled from actual road photographs and videos sourced from the internet, ensuring a comprehensive training and testing resource.

- Install research results on the system to prove experiment.

Research subject and scope

3.2.2.172+ Deep Learning method and Adaptive Learning method

3.2.2.173+ Propose solutions to enhance on-road object detection quality of self- driving car system.

3.2.2.174+ Study and propose Adaptive Learning solution which is applied in on- road object detection.

3.2.2.175+ Create data and experiment, analyze results.

The structure of the thesis

3.2.2.181 Chapter 1: Overview of artificial intelligence

This section provides an overview of artificial intelligence and traditional algorithms, highlighting key methods such as decision trees, random forests, support vector machines, and artificial neural networks It also discusses both domestic and international research focused on on-road object detection and adaptive learning solutions for self-driving vehicle systems.

3.2.2.185 Chapter 2: Identifying objects by Deep Learning

3.2.2.186 Proposes solution to on-road object detection by Deep Learning: pedestrians, vehicles

- Deep Learning in pedestrian action prediction

- Deep Learning in vehicle classification

Learning techniques in object recognition

Based on the research findings presented in Chapter 2, the Adaptive Learning solution for self-driving vehicle systems is continuously evolving This innovative model demonstrates the ability to self-learn and exhibit intelligence autonomously, without requiring any human intervention.

3.2.2.199 Adaptive learning techniques in vehicle, traffic sign recognition and advanced driver assistance systems

3.2.2.200 Chapter 4: Optimization of hyperparameter set in Adaptive Learning

3.2.2.201 Basing on the proposed model mentioned in Chapter 3, the Adaptive Learning solution of algorithms and parameters is

3.2.2.207 Adaptive learning through optimization of the training hyperparameter set based on a new dataset related to traffic sign and vehicle recognition

3.2.2.208 efficiency and on-road object detection accuracy.

OVERVIEW OF ARTIFICIAL INTELLIGENCE

Overview of artificial intelligence

3.2.2.219There have been many different definitions of artificial intelligence, or AI in the world, specifically:

Artificial intelligence (AI) refers to the intelligence exhibited by artificial systems, particularly computers designed for various purposes This term encompasses both the theoretical foundations and practical applications of AI technology.

• According to Bellman, artificial intelligence is the automation of activities that we associate with human thinking, activities such as decision-making, problem solving, learning, etc.

• Rich and Knight: “Artificial intelligence is the study of how to make computers do things at which, at the moment, people are better''.

Artificial intelligence (AI) can be understood as a branch of computer science that is grounded in a robust theoretical framework It focuses on automating intelligent behaviors in computers, enabling them to emulate human-like capabilities such as thinking, decision-making, problem-solving, learning, and self-adaptation.

3.2.2.223The history of artificial intelligence [15, 16, 17] has gone over many different stages of development, as shown in Figure 1.1.

3.2.2.229 Figure 1.1 History of artificial intelligence (Source: https://connectjaya.com/)

Machine learning and identification techniques

3.2.2.231As an AI subfield, machine learning uses algorithms that enable computers to learn from data to perform tasks instead of being explicitly programmed [18].

3.2.2.233Image processing problem solve issues of analyzing information from images or performing some transformations Some examples are:

• Image tagging, like Facebook, an algorithm that automatically detects your face and your friends' photos Basically, this algorithm learns from photos you've tagged yourself before.

Optical Character Recognition (OCR) is a technology that converts images of typed, handwritten, or printed text into machine-readable text This process involves algorithms that learn to identify and interpret the visual representation of characters.

Self-driving cars utilize advanced image processing techniques, where a machine learning algorithm analyzes each video frame captured by the vehicle's camera This technology allows the cars to effectively detect road edges, traffic signs, and obstacles, ensuring safe navigation on the road.

• Text analysis is a work of transforming or classifying free texts The texts here can be Facebook posts, emails, chats, documents, etc Some common examples are:

• Spam filtering is one of the most popular spam text classification applications Text classification, here, is to identify the subject definition of a text The spam filter can also

“learn” what each user views as spam based on the user identifying email message and its subject.

• Sentiment Analysis learns how to classify an expression as positive, negative, or neutral

• Information Extraction is the process of extracting information from textual sources, learn how to useful information, address, a person's name, a keyword, etc for ex.

Data mining is the process of uncovering valuable insights and making predictions from data sets, where each record represents an object and each column signifies a feature By analyzing learned records, the values of new records can be predicted, or the records can be categorized into groups This technique has a wide range of applications across various fields.

• Anomaly detection is a technique for finding an unusual point, credit card fraud detection or, for example A suspicious transaction may be discovered based on a change in consumer normal behavior.

Association rules are valuable in understanding customer behavior, especially in settings like supermarkets and e-commerce platforms By analyzing purchasing patterns, businesses can identify which items are frequently bought together, revealing what customers are likely to purchase next This data-driven insight can inform strategic marketing decisions, enhancing promotional efforts and improving customer engagement.

• Grouping, for example, in a SaaS platform, users are grouped by their behavior or by profile information.

• Predictions, the value columns (of a new record in the database) For example, the price of an apartment can be predicted based on the previous price data.

Machine learning has significantly impacted the fields of video games and robotics, particularly through the use of reinforcement learning This approach enables characters in games to navigate and avoid obstacles by learning from their experiences In reinforcement learning, positive feedback is received when a character successfully reaches its destination, while negative feedback occurs when it collides with obstacles.

1.2.2 Basic recognition techniques in machine learning

The integration of AI methods with image processing for object recognition is a crucial aspect of computer vision Machine learning techniques are categorized into supervised and unsupervised learning Supervised machine learning encompasses various techniques such as decision trees, neural networks, support vector machines (SVM), boosting, and random forests In this approach, classification relies on a labeled sample dataset, allowing for structured learning and accurate predictions.

Experts utilize a training dataset to analyze and develop recognition models, a process known as model training This involves training the recognition machine to identify objects In contrast, the unsupervised method relies on unlabeled data, allowing the algorithm to classify the data independently The identification of objects in this approach is based on statistical analysis of the input dataset.

• Decision trees are a specific field of research in machine learning Decision tree techniques are widely used in the fields of knowledge exploitation and pattern recognition

A decision tree is a predictive model that utilizes a tree-like structure to organize data samples according to specific rules In this model, the leaves signify the classification outcomes, while the branches illustrate the combinations of features that guide these classifications.

A decision tree is developed by partitioning the training dataset into subsets based on the evaluation of single or multiple attribute values This process involves using mathematical deductive techniques to achieve straightforward classification combinations Ultimately, the training of the classification model culminates in the creation of a decision tree.

Random forests (RF) are a supervised learning algorithm that constructs numerous decision trees through a random selection of features, allowing for both classification and regression tasks Developed by Tin Kam Ho in 1998 and published in the IEEE Journal, RF effectively addresses issues related to missing values and reduces the risk of overfitting by utilizing multiple trees This technique has gained popularity in computer vision and object classification applications.

Boosting techniques are ensemble machine learning algorithms that create multiple weak classifiers simultaneously, which are then combined into a single strong classifier through weighted aggregation One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting), introduced by Freund and Schapire in 1999 AdaBoost operates on the principle of enhancing weak classifiers by assigning weights to challenging-to-classify data samples, thereby emphasizing difficult classifications while reducing the influence of easier ones As training progresses, each weak classifier adjusts its weights to focus more on misclassified samples, allowing subsequent classifiers to improve on the weaknesses of their predecessors Ultimately, weak classifiers are integrated based on their classification accuracy, resulting in a robust final strong classifier.

The support-vector machine (SVM) is a supervised learning algorithm introduced by Corinna and Vapnik in 1995, initially designed for classification tasks and later adapted for multilayer classifications SVM trains a model to classify data samples into two predefined classes, utilizing a hyperplane that maximizes the distance from the training data points in multi-dimensional space To effectively classify samples, they must be represented within the same space, with SVM categorizing each sample based on its position relative to the classification superplane.

• Figure 1.2 Classification simulation of SVM (Source: https://towardsai.net)

Support Vector Machine (SVM) is a leading classification technique in computer science and data analysis, known for its effectiveness on large datasets and high-dimensional data It excels in classifying various data types, including images, text, and audio SVM can utilize multiple kernel functions and supports both linear and non-linear classification methods Notably, it achieves a high level of accuracy, outperforming many traditional machine learning approaches.

An artificial neural network (ANN), commonly known as a neural network, is inspired by biological neural networks and consists of nodes (neurons) and arcs (edges) The architecture is organized into layers, including an input layer, hidden layers, and an output layer, where each arc connects pairs of neurons to facilitate information transmission and output processing The propagation function, along with its associated weights, defines the relationships between nodes, with the network architecture typically established beforehand and weights adjusted during training Notably, certain networks, such as multilayer neural networks (MLNN) and self-organizing maps (SOM), can adapt their architecture based on real data during the learning process.

• Capable of self-learning in neural networks is one of the important components of

A neural network is a complex adaptive system capable of modifying its internal structure in response to incoming information This adaptability is primarily achieved through adjustments in weighting, allowing the network to learn and improve over time.

Deep Learning and Adaptive Learning

1.3.1 Overview of Deep Learning and Adaptive Learning

Deep Learning, a burgeoning field within computer vision and machine learning, comprises a set of algorithms designed to tackle complex high-level discrete data models through multi-layered architectures and nonlinear transformations, surpassing traditional machine learning methods Although introduced in the early 1990s alongside various other machine learning techniques, Deep Learning faced challenges due to its intricate architecture and the limitations of computing resources at the time Notably, Lecun emerged as a pioneer in this domain, contributing significantly to the advancement of Deep Learning solutions.

Deep Learning, a concept first introduced by Rina Dechter in 1986, gained significant traction in 1989 when Lecun and his team developed a neural network that utilized backpropagation algorithms to achieve impressive handwriting recognition accuracy This neural network laid the groundwork for future research and applications in Deep Learning As a sophisticated technique, Deep Learning employs network models to tackle complex problems, enabling feature extraction, classification, and recognition in areas such as voice recognition, computer vision, natural language processing, and predictive analytics Its growing prominence in computer science is attributed to its ability to deliver superior accuracy compared to traditional methods, thereby advancing various domains within the field.

Adaptive learning originated from the desire to create intelligent systems that mimic the human brain Various artificial neural networks, such as AlexNet, GoogLeNet, Microsoft ResNet, R-CNN, Fast R-CNN, and Faster R-CNN, have been developed, demonstrating high accuracy and the ability to recognize multiple objects effectively.

While advancements in models like [32] and VGGNet [33] have primarily concentrated on modifying network structures, fine-tuning parameters, and refining training techniques, there remains a lack of progress in enhancing models' ability to autonomously increase their intelligence over time Currently, the intelligence of these models still relies heavily on external intervention and labeled data.

An effective Adaptive Learning model autonomously identifies objects, trains itself, assesses its performance, and updates its intelligence, minimizing human intervention after its initial setup The model's adaptability is demonstrated through the incorporation of diverse data, enhanced recognition of unfamiliar and complex objects, and ongoing adjustments to training parameters based on evolving datasets For instance, in an autonomous vehicle system, the initial model may only detect basic shapes of vehicles, lanes, pedestrians, trees, buildings, and traffic signs However, as the vehicle navigates, the system progressively learns to recognize more intricate forms of these objects, continually training and updating to improve its intelligence on the road.

A Deep Neural Network (DNN) is an advanced form of artificial neural network (ANN) characterized by multiple hidden layers and numerous nodes, allowing it to model complex non-linear relationships Unlike simpler networks, DNNs process data through a greater number of layers and nodes, enhancing their ability to recognize patterns and make predictions effectively.

•Figure 1.4 Simple Deep Learning network with one layer and Deep Learning network with multiple hidden layers (Source: https://www.kdnuggets.com)

Early deep neural networks consisted of an input layer, an output layer, and a single hidden layer Over time, these networks evolved to include multiple hidden layers, leading to the development of deeper architectures Therefore, the term "deep" in deep learning refers to the number of hidden layers present in neural networks.

In deep learning networks, each layer's nodes undergo extensive training with unique features derived from the outcomes of prior layers As data progresses through the inner layers of neural networks, it becomes increasingly complex Through this process, nodes can recognize, synthesize, and recombine features from prior layers, effectively displaying features at higher levels, a phenomenon known as feature learning or hierarchical representation.

Hierarchical featuring refers to the process of creating complex and abstract data structures through a hierarchy Deep neural networks are utilized to analyze extensive datasets across multiple dimensions, employing billions of parameters that are processed using non-linear functions.

Deep neural learning networks excel at identifying patterns in unlabeled and unstructured data, which is prevalent in real-world applications Research indicates their effectiveness in analyzing various forms of unstructured data, including raw multimedia, images, documents, sounds, and videos Consequently, deep neural learning techniques are capable of addressing challenges related to the analysis, recognition, and classification of unstructured, homologous, or anomalous data.

Convolutional Neural Networks (CNNs) are a type of deep learning architecture inspired by the biological processes of the visual cortex in animals Pioneered by researchers like LeCun, CNNs utilize regularized multilayer perceptrons to streamline the pre-analysis of data Each neuron in a CNN responds to stimuli within a specific area called the receptive field, and these fields overlap to comprehensively cover the entire visual input, mimicking the connectivity patterns found in the brain.

• Figure 1.5 Architecture of a simple convolution neural network (Source: https://medium.com)

The architecture of a Convolutional Neural Network (CNN) comprises an input layer, an output layer, and several hidden layers that include convolutional, pooling, rectified linear unit (ReLU), fully connected, and normalization layers Generally, CNNs feature multiple convolutional and pooling layers, along with normalization layers, and may also incorporate fully connected layers.

• Some CNNs which have been introduced and commonly used are AlexNet [27], GoogLeNet [28], Microsoft ResNet [29], R-CNN [30], Fast R-CNN [31],

Domestic and international research

From the 1990s to the early 2000s, Vietnam saw significant contributions to artificial intelligence research, particularly in image processing and recognition Prominent researchers such as Assoc Prof Ngo Quoc Tao, Assoc Dr Do Nang Toan, and Assoc Dr Luong Chi Mai made notable advancements in areas including handwriting recognition and Vietnamese handwriting processing, as well as speech recognition technologies.

Research in recognition and human face detection, alongside human body simulation, predominantly utilizes classic algorithms like SVM, Random Forest, hidden Markov models, and artificial neural networks These studies serve as essential references for students and graduates Additionally, numerous publications focusing on image processing and object recognition have emerged, contributing to the field's advancement.

• After the first decade of the 20th century, AI growth, along with computer hardware, enables the fields of machine learning and object recognition to make advance.

In Vietnam, research on Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN) began at a primitive stage, primarily driven by overseas Vietnamese PhD students, with no significant domestic contributions initially Since 2015, there has been a notable increase in publications in international journals such as ISI and Scopus, with contributions from prominent research groups including Hanoi University of Technology, Ton Duc Thang University, National University of Ho Chi Minh City, and Duy Tan University in Da Nang Additionally, numerous independent studies have emerged, focusing on applications in health, transportation, agriculture, and national defense, encompassing innovations like autonomous vehicles, robotics, and human action recognition.

• The AI history and machine learning has gone through many phrases The intelligence of the machine has been simulated and demonstrated by Alan Turing Since

In 1955, American computer scientist John McCarthy coined the term "Artificial Intelligence," referring to the field of intelligent computer engineering The following year, he organized the Dartmouth Conference, the inaugural gathering focused on AI, which brought together experts from prestigious institutions like Carnegie Mellon University, the Massachusetts Institute of Technology, and IBM Since then, "artificial intelligence" has become a widely recognized term in the scientific community.

Artificial Intelligence (AI) and machine learning are continuously evolving, focusing on key algorithms such as Support Vector Machines, Random Forests, Neural Networks, K-means, Decision Trees, and Boosting, which are essential for advancements in recognition, object classification, and data processing Since the late 1990s, the growth of computer hardware has propelled the development of Deep Learning and Convolutional Neural Networks (CNNs), leading to significant real-world applications Yann LeCun, a pioneer in this domain, developed LeNet, a renowned CNN architecture in the 1990s, featuring a structure of two convolutional and max-pooling layers, followed by two fully connected layers and a softmax output layer, achieving an impressive recognition accuracy of 99%.

In 2012, AlexNet, developed by Alex Krizhevsky and his team, revolutionized the field of deep learning by winning the ImageNet LSVRC-2012 contest with a significant error rate improvement of 15.3% compared to the previous best of 26.2% This convolutional neural network (CNN) features an impressive 60 million parameters, showcasing a substantial increase in complexity and capability over earlier models like LeNet.

• ReLU is used instead of sigmoid (or tanh) to deal with non-linearity, increasing computing speed by 6 times.

• DropOut is used as a new regularization method applied to CNN Dropouts not only enable the model to avoid over-fitting but reduce model training time.

• OverLap pooling is used reduce the size of the model (Traditionally pooling regions does not overlap).

• Local response normalization is used to normalize each layer.

• Data augmentation technique is used to create additional training data by translations and horizontal reflections.

• AlexNet is trained by 90 epochs within 5 to 6 days with 2 GTX 580 GPUs Using SGD at learning rate 0.01, momentum 0.9 and weight decay 0.0005.

• The architecture of AlexNet consists of 5 convolutional layers and 3 fully connection layers Activation ReLU is used after each convolution and fully connection layer.

• This was followed by new models proposed in turn, decreasing the error percentage, increasing the model's complexity with a deep architecture The proposed models include VggNet 2014, GoogleNet2014, MicrosoftResNet 2015, Densenet 2016, etc.

With advancements in network architecture, models have achieved high accuracy in experimental training and recognition of nearly all real-world objects A notable example is AlexNet, which can identify and classify 1,000 distinct objects effectively.

Numerous research institutes and universities globally have published studies offering solutions to specific AI challenges in areas such as robotics and autonomous vehicles Each sector is further categorized into various levels to address these issues effectively For example, the challenges associated with self-driving cars can be classified into distinct cases.

- The lane - recognition problem for self-driving cars

- The on – road object recognition problem for self-driving cars

- The traffic sign recognition problem for self-driving cars

- The distance measurement problem for self-driving cars

- The pedestrian movement prediction for self-driving cars

- The obstacle recognition problem for self-driving cars

• For pedestrian detection, nowadays, there are many contributions that using

Tracking technologies play a crucial role in detecting and recognizing objects, offering high accuracy However, the lengthy processing time associated with these technologies poses significant challenges for autonomous vehicles, particularly in emergency situations.

Recent advancements in pedestrian recognition technologies include methods such as Histograms of Oriented Gradients (HOG), a feature descriptor widely used in computer vision HOG enhances accuracy by analyzing directional and color/grayscale variations within local image areas and standardizing contrast between blocks Additionally, the Latent SVM algorithm classifies objects based on their parts and geometric location constraints Effective detection relies on a trained model utilizing an image dataset that includes both target and contrasting images, alongside techniques like the Kanade–Lucas–Tomasi (KLT) algorithm.

Convolutional Neural Networks (CNNs) are highly effective for feature extraction, with notable models including AlexNet, GoogleNet, Microsoft ResNet, and various Region Based CNNs like R-CNN, Fast R-CNN, and Faster R-CNN Each model offers unique advantages in processing speed and accuracy This thesis focuses on optimizing feature extraction by utilizing pre-trained CNN models The features extracted from these models are then employed in various classification algorithms, such as k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Random Forest, and Fully Connected Networks, tailored to meet specific requirements.

Several algorithms for pedestrian action recognition have been proposed in previous studies, but they primarily focus on recognizing pedestrians without considering specific scenarios within traffic systems As a result, autonomous vehicles (AVs) struggle to assess varying levels of alertness associated with different danger levels.

Detecting and recognizing vehicles in single images or video footage captured by roadside cameras can be achieved through various methods Research in this field primarily focuses on two approaches: utilizing traditional techniques alone or integrating these methods with Deep Learning technologies for enhanced accuracy and efficiency.

Traditional vehicle recognition and tracking methods include the Gaussian Mixture Model (GMM) and the Kalman Filter, with GMM focusing on vehicle recognition and the Kalman Filter aiding in tracking under varying lighting conditions Another effective approach is Optical Flow estimation, which leverages edge features identified by the Canny algorithm to detect moving vehicles In the realm of feature extraction, techniques such as Scale Invariant Features Transform (SIFT) and Histogram of Oriented Gradients (HOG) are commonly employed, often in conjunction with Support Vector Machine (SVM) classifiers for transport identification Evaluations indicate that using HOG for feature extraction alongside SVM classification yields promising results, and recent studies advocate for these methods to enhance vehicle recognition, counting, and classification accuracy.

This approach utilizes Gaussian Mixture Models (GMM) for image segmentation, followed by the Canny edge detector to delineate boundaries and extract features Traditional techniques typically focus on extracting features related to the shape, color, and texture of images to represent objects of interest (IO) Subsequently, a classifier architecture is employed to interpret the context of the transportation scenario.

• Researches of Deep Learning often use high-performance Convolution Neural Network models, such as AlexNet[27], GoogleNet [28], Microsoft ResNet [29], R-CNN

Object recognition problems

Artificial intelligence has significantly transformed various aspects of life, with machine learning, particularly its subset Deep Learning, playing a crucial role in this evolution Utilizing convolutional neural networks, Deep Learning has made remarkable advancements in fields such as voice and object recognition, medical applications, smart transportation, and robotics.

Deep Learning has significantly advanced in its ability to recognize and process images, particularly through the training of models to accurately identify objects This study examines the effectiveness of Convolutional Neural Network (CNN) models in object recognition, specifically in the context of autonomous vehicles The focus is on evaluating how well these CNN models can recognize various objects essential for the functionality of self-driving technology.

- Vehicles: motorbikes, cars and vans.

- Other objects such as houses, trees and sky

Recognizing pedestrians is a crucial aspect of autonomous vehicle development, as it poses significant challenges due to the complexities of pedestrian movement and behavior Accurately predicting pedestrian actions and walking speeds is essential to ensure the safety of both pedestrians and vehicles There are three primary types of pedestrians - crossing, walking, and waiting pedestrians - which encompass all possible interactions between pedestrians and autonomous vehicles By analyzing images of pedestrians, features such as gestures, locations, and scenes can be extracted, enabling the training of predictive models to recognize and forecast pedestrian movement.

The proposed approach consists of two main phases: first, a classifier model is trained to predict pedestrian movement using features extracted from CNN models, specifically the AlexNet architecture In the second phase, real-time video frames from autonomous vehicles (AV) are processed by detecting pedestrians, extracting regions of interest (ROI), and predicting pedestrian movement within these ROIs The ACF algorithm is employed for pedestrian detection, while a Support Vector Machine (SVM) model is utilized for training and predicting pedestrian movement.

• Figure 2.1 The process of extracted features by CNN model from image dataset

• Figure 2.2 The process of pedestrian movement prediction

• The resolution of used camera is 2 Megapixels or more with the minimum resolution of collected images of 72 dpi.

The increasing number of vehicles on the road has heightened the need for effective traffic control and separation As technology advances, the demand for precise automatic control systems becomes essential to address the challenges of managing vehicle flow Implementing these systems can significantly enhance traffic safety and efficiency.

In Intelligent Transportation Systems (ITS), effective monitoring and decision-making rely on training vehicle region extraction solutions, which utilize sensors to collect data from devices attached to vehicles and leverage internet connectivity for vehicle networking However, many proposed solutions face challenges in real-world application due to limitations in device production, internet bandwidth, and high establishment costs Therefore, implementing an automatic recognition and classification system for vehicles is crucial for enhancing the functionality and efficiency of these systems.

The proposed solution begins with acquiring images from surveillance cameras in Intelligent Transportation Systems (ITS) to recognize objects of interest and identify vehicle types This article emphasizes recognition models over traditional vehicle detection methods, utilizing a semantic segmentation model based on SegNet's CNN architecture Detected vehicles are then extracted to define regions of interest (ROI), focusing on samples of the vehicles The method allows for the integration of CNN models and data augmentation techniques to enhance accuracy The recognition results play a crucial role in the ITS system, enabling alerts for vehicles that are not permitted to cross limit lines and addressing violations effectively.

• Figure 2.3 Proposed vehicle detection model

Suggested solution

• Object recognition has been introduced with three basic steps:

(1) Detecting and extracting areas of interest

(2) Extracting features and training recognition models

• However, the step 1 may be unnecessary once target object was identified.

• Each step can have different techniques:

- Detecting and extracting areas of interest: use image meaning to extract areas of interest (pedestrians, vehicles, traffic signs, etc ).

- Extracting features and training recognition models: build and introduce Deep Learning models to extract features of objects It is suggested to use SVM model to train recognition models.

- Recognizing objects: use trained recognition models to recognize and classify objects according to individual problems.

2.2.1.1 Extracting features and training classifier model

In machine learning, convolutional neural networks (CNNs) are a specialized class of deep learning models primarily used for visual imagery analysis Various CNN models, each with unique architectural characteristics, sizes, and layer counts, have been developed, including notable examples like AlexNet, GoogleNet, Microsoft ResNet, and Region-Based CNNs (R-CNN, Fast R-CNN, and Faster R-CNN), all of which demonstrate low error rates This thesis proposes the AlexNet CNN model, which is designed to enhance processing efficiency.

The AlexNet model effectively extracts and retains essential features from input images, as illustrated in Figure 2.4 Utilizing a dataset of 3000 images, which includes 1000 images each of crossing, walking, and waiting pedestrians sourced from real street videos online, the images were carefully processed and cropped to suitable frames The Convolutional Neural Network (CNN) captures rich features such as pedestrian postures, roadways, roadsides, and pedestrian positions, as shown in Figure 2.5 These extracted features are then employed to train the Support Vector Machine (SVM) classifier model.

• Figure 2.4 Input images and simulate rich features of image

In the CNN model, various feature layers can be extracted, including convolutional and fully connected layers Among these, layer 19, known as fc7, which is a 4096 fully connected layer, stands out as the most beneficial, as it is positioned just before the classification layer.

Object recognition, particularly for animals, objects, and vehicles, boasts a high accuracy rate of 90% to 100% However, when predicting pedestrian actions, the analysis extends beyond just the specific object to include surrounding elements like vehicles, buildings, trees, and roadside objects, as illustrated in Figure 2.5.

The ACF algorithm plays a crucial role in enhancing the accuracy of pedestrian movement prediction by detecting pedestrians prior to extracting the region of interest (ROI) and classifying their actions.

• Pedestrian detection by ACF ACF classification model, specified as 'inria-

The 'inria-100x41' and 'caltech-50x21' models are designed for person detection, with the former trained on the INRIA Person dataset and the latter on the Caltech Pedestrian dataset The 'inria-100x41' model is the default option in the ACF algorithm, which generates an M-by-1 vector of classification scores ranging from 0 to 1, indicating detection confidence When a pedestrian is detected, a bounding box appears, displaying confidence values as percentages; higher scores signify greater accuracy in detection However, the ACF algorithm may encounter challenges in complex images.

To enhance accuracy in real-time experimental processes, a score value of 0.25 is recommended to effectively minimize error recognition For instance, a score of 0.1 may lead to inaccuracies in certain scenarios, as illustrated in Figure 2.7 (a), whereas a score of 0.25 significantly improves result precision, as shown in Figure 2.7 (b).

• Figure 2.6 Example input image for recognition

• Figure 2.7 Pedestrian detection with scores = 0.1 (a) and scores = 0.25 (b)

When autonomous vehicles (AVs) navigate roads, they often encounter numerous pedestrians within a single video frame To enhance accuracy, it is essential to extract multiple separate frames from the original, focusing on specific regions known as Regions of Interest (ROIs) Given that the high-resolution images received from AVs contain a significant amount of irrelevant data, it is crucial to isolate the ROI at a specific scale to eliminate surrounding distractions for each detected pedestrian This extraction process aids the Convolutional Neural Network (CNN) in accurately identifying features and minimizes error rates during action recognition and classification by the Support Vector Machine (SVM) The proposed size for the ROI is outlined as follows:

In the context of a rectangle encompassing a pedestrian object, let H represent height and W denote width The coordinates of the rectangle's top left corner are indicated by x and y Additionally, Width and Height refer to the dimensions of the input image The parameters x1, y1, W1, and H1 specify the size of the Region of Interest (ROI) defined as follows.

In specific instances, if x1, y1, W1, or H1 are less than the frame's edge value or exceed the dimensions of the input image, they will be adjusted to match the image's edge values.

• On the other hand, when ROI is out of image input size, the offset value of ROI on the opposite side is proposed in Figure 2.8.

• Figure 2.8 ROI extraction from pedestrian image

Pedestrian_crossing Pedestrian_waiting Pedestrian_walking

Pedestrian movement prediction involves extracting a region of interest (ROI) from a single image, followed by feature extraction using a Convolutional Neural Network (CNN) model These features are then classified with a Support Vector Machine (SVM) classifier, resulting in outputs that are labeled based on the predicted pedestrian behavior, including categories such as Pedestrian Crossing, Pedestrian Waiting, and Pedestrian Walking.

Pedestrian safety is crucial in urban environments, particularly at pedestrian crossings where individuals navigate through traffic It's important for pedestrians waiting at the roadside to remain vigilant before crossing to ensure their safety Additionally, those walking along the edges of the road should stay alert and adhere to safety guidelines to avoid accidents.

• Figure 2.9 The order of classifications of pedestrians when there are many pedestrians on the road in an input image

In our study, we developed a custom 24-layer Convolutional Neural Network (CNN) architecture for vehicle recognition, as existing pre-trained models like AlexNet and GoogleNet are not suitable due to size discrepancies with actual images and inadequate training parameters for accuracy enhancement Our model comprises an input layer, convolution layers, ReLU layers, cross-normalization, max-pooling, and a fully connected layer, transforming 128×128×3 input images into a hierarchical descriptor The initial filters focus on the RGB color channels, operating independently and collectively across hidden layers, while the final layer extracts the feature vector for classification This approach addresses the specific challenges of vehicle recognition effectively.

• Table 2.1 CNN architecture with 22 hidden layers, 1 input layer, and the final classification layer

• 5 • Max Pooling • 3x3 max pooling with stride [1 1]

• 8 • Max Pooling • 2x2 max pooling with stride [1 1]

•12 • Max Pooling • 2x2 max pooling with stride [1 1]

•20 • Fully Connected • 1024 fully connected layer

•22 • Fully Connected • 4 fully connected layer

Output • crossentropyex with 4 other classes

•The training data set classified during the collection is shown in Figure 2.10.

To enhance the accuracy of vehicle recognition, we recommend augmenting the data by a factor of ten This involves rotating images within a range of [-5°, 0°, 5°], flipping them, or adding noise, all while maintaining image quality during training The augmented training dataset is detailed in Table 2.5.

Experimental evaluation

2.3.1.1 Extracting features and training classifier model

• The experiment is carried out with about 3,000 images being extracted by CNN model There features are used for training of SVM classifier model Table

2.2 shows the image and label datasets of extracted and trained features.

• Table 2.2 Image and label datasets of extracted and trained features

• 90% of images from each set is used for the training data and the rest 10% is used for the data validation.

2.3.1.2 Pedestrian detection and action prediction

Using the pedestrian detection ACF algorithm on the input images, we generate outputs that facilitate action prediction through an SVM classifier When multiple pedestrians are present in a single frame, we extract Regions of Interest (ROIs) into individual images for analysis Each extracted image features specific characteristics that the SVM classification model utilizes to predict pedestrian actions and issue relevant alerts for autonomous vehicles.

• Figure 2.10 Pedestrians detected and ROI extracted

• The maximum results of rate-recognition after training and comparing with dataset in Table 2.2 are as follow:

•Table 2.3 Maximum confusion matrix for pedestrian action prediction

• •Pedestrian crossing • Pedestrian waiting • Pedestrian walking

The experiment on real-time video processing for pedestrian detection achieved an accuracy rate ranging from 82% to 97%, with a processing speed of just 0.6 seconds per detected pedestrian, indicating significant potential for advancements in self-driving technology.

We conducted experiments using a comprehensive database of vehicles, including motorcycles, cars, coaches, and trucks, collected from real traffic scenarios in Nha Trang city, Khanh Hoa province, Vietnam Our camera systems capture signals from vehicles in various traffic situations, and the dataset comprises 8,558 images categorized into four vehicle classes The dataset is strategically divided, with 60% allocated for training and 40% reserved for evaluation, ensuring a robust framework for analysis.

• Figure 2.11 Some examples of vehicle categories

•Number of samples • • Sample size

• Table 2.5 Training data after augmentation and balance data

• Catego ries • Number of samples

• Result obtained after CNN model training is shown as follows:

(i) Filter parameters: The first convolution layer uses 64 filters, whose filter's weight is shown in Figure 2.12:

The first convolution layer, depicted in Figure 2.12, utilizes 64 filters of size 7x7, each linked to three RGB image input channels When sample images are processed through these convolution filters, the resulting data highlights distinct components from the original RGB images, revealing various vehicle features The convolution output may include negative values, necessitating normalization through linear adjustment Below, the output from certain layers is presented, showcasing the input pattern of the motor sample.

(a) The output of 64 convolutions at the first convolution layer

(b) The linear correction value after the first convolution layer

(c) The output of 64 samples at the second Convolution layer

• Figure 2.13 Some results of linear convolution and linear correction for the input images being motors

In an experiment evaluating three distinct methods on a consistent sample dataset, the results are summarized in Table 2.4 The methods assessed include traditional approaches utilizing Histogram of Oriented Gradients (HOG) and Support Vector Machines (SVM), a Convolutional Neural Network (CNN), and a CNN enhanced with data augmentation techniques.

• The accuracy of the HOG and SVM method on the sample data set was 89.31%. Details of the sample size for each type and recognition result are shown in Table 2.6.

• Table 2.6 Confusion matrix of vehicle recognition using HOG and SVM

• Mot or • • Car •Coach •Truck

• The evaluated accuracy of the CNN method based on original data was achieved 90.10% on average, as shown in Table 2.7.

• Table 2.7 Confusion matrix of vehicle recognition using CNN

• Mot or • • Car •Coach •Truck

• The evaluated accuracy of the CNN method based on data augmentation was achieved 95.59% on average, as shown in Table 2.8.

• Table 2.8 Confusion matrix of vehicle recognition using CNN and data augmentation

In this study, we compared the proposed CNN model with a traditional approach that utilizes the HOG feature descriptor and SVM classifier, as illustrated in Figure 2.14.

• Figure 2.14 Comparison of HOG+SVM, CNN model and CNN with augmenting data

Recent advancements in artificial intelligence, particularly through machine learning and Deep Learning networks, have significantly enhanced computer systems Chapter 2 illustrates the object recognition capabilities of Convolutional Neural Networks (CNNs) and their intelligence in specific scenarios, showcasing foundational Deep Learning techniques and their application potential However, a key limitation of current artificial intelligence is its inability to self-learn, self-update, and think independently To address this gap, Chapter 3 focuses on developing Adaptive Learning systems that empower autonomous systems to learn and evolve without human intervention, thereby bridging the divide between artificial and human intelligence.

• Chapter 2, the author mentions the two research works which are papers PP 1.1, PP 1.2,

LEARNING TECHNIQUE IN OBJECT RECOGNITION

This chapter builds on the research findings presented in Chapter 2, introducing an Adaptive Learning solution for self-driving vehicle system data The proposed model demonstrates the ability to self-learn and exhibit intelligence autonomously, without the need for human intervention.

Adaptive learning problem in object recognition

Recent advancements in object recognition techniques, particularly through deep convolutional neural networks (CNNs), have significantly improved accuracy These models, supported by enhanced computer hardware, feature complex structures with numerous layers and extensive training data, enabling them to identify a wide range of object classes effectively However, their performance diminishes when objects deviate from the training data, as real-world conditions—such as varying brightness, rain, fog, and movement—can greatly affect image capture Consequently, even large training datasets cannot encompass all possible object variations Furthermore, training on excessively large datasets poses challenges due to limited computational resources and time constraints To address these issues, a proposed adaptive approach aims to automatically upgrade recognition models, thereby enhancing accuracy.

Tiêu đề	Adaptive Learning Solution Based On Deep Learning For Traffic Object Recognition
Trường học	Duy Tan University
Chuyên ngành	Computer Science
Thể loại	Doctor Of Philosophy
Năm xuất bản	2022
Thành phố	Da Nang

Định dạng
Số trang	119
Dung lượng	5,18 MB