
Design of an Advanced Driver Assistance System Based on Deep Learning (Thesis)


Structure

  • CHAPTER 1: INTRODUCTION
    • 1.1 OVERVIEW
    • 1.2 GOALS
    • 1.3 LIMITATIONS
    • 1.4 OUTLINES
  • CHAPTER 2: LITERATURE REVIEW
    • 2.1 DEEP LEARNING
      • 2.1.1 Convolutional Neural Network
      • 2.1.2 Convolutional Layer
      • 2.1.3 Pooling Layer
      • 2.1.4 Fully Connected Layer
    • 2.2 OBJECT DETECTION
      • 2.2.1 Two-Stage Object Detection
      • 2.2.2 One-Stage Object Detection
    • 2.3 YOLOv6 OBJECT DETECTION ARCHITECTURE
      • 2.3.1 RepVGG Backbone
      • 2.3.2 RepPAN Neck
      • 2.3.3 Decoupled Head
  • CHAPTER 3: SYSTEM DESIGN
    • 3.1 OVERALL SYSTEM
    • 3.2 COMPARISON OF OBJECT DETECTION MODELS
    • 3.3 TRAFFIC SIGN RECOGNITION
      • 3.3.1 TSR Overview
      • 3.3.2 Training Process
      • 3.3.3 TSR Algorithm
    • 3.4 FORWARD COLLISION WARNING
      • 3.4.1 FCW Overview
      • 3.4.2 Training Process
      • 3.4.3 FCW Algorithm
    • 3.5 LANE DEPARTURE WARNING
      • 3.5.1 LDW Overview
      • 3.5.2 LDW Algorithm
    • 3.6 GRAPHIC USER INTERFACE
    • 3.7 ELECTRICAL/ELECTRONIC ARCHITECTURE
      • 3.7.1 Overview
      • 3.7.2 Hardware Utilization
  • CHAPTER 4: RESULTS
    • 4.1 SIMULATION RESULTS
      • 4.1.1 Traffic Sign Recognition Results
      • 4.1.2 Forward Collision Warning Results
      • 4.1.3 Lane Departure Warning Results
    • 4.2 EXPERIMENTAL RESULTS
  • CHAPTER 5: CONCLUSION AND FUTURE WORK
    • 5.1 CONCLUSION
    • 5.2 FUTURE WORK

Content

INTRODUCTION

OVERVIEW

In the automotive industry, Advanced Driver Assistance Systems (ADAS) are key technologies built into vehicles to assist the human driver in various ways. The main goal of an ADAS is to ensure an optimal user experience and enhance safety. Many Original Equipment Manufacturers (OEMs) have focused on expanding ADAS into broader and more complex fields of application as the market grows. These systems aim to prevent accidents caused by human error, such as driver drowsiness and distraction, provide warnings about potentially dangerous scenarios, evaluate driving performance, and offer suggestions. Advanced Driver Assistance Systems have therefore emerged as essential technologies in intelligent automotive research.

Automotive manufacturers are forging the path toward highly and fully autonomous vehicles based on the technological innovation of ADAS. They found that human error is the leading cause of, or a contributing factor in, 94% of automotive accidents [1]; vehicles, environmental factors, and other unidentified causes account for roughly 2% of all accidents. Speeding, inattentiveness, and improper lookout were the most frequently cited human causes in that research. The vast majority of these accidents can be effectively avoided with vital ADAS applications [2] such as forward collision warning with brake assistance, adaptive cruise control, traffic sign recognition, lane keeping, and lane departure warning. While these are admirable goals, drivers will only use ADAS if they find it beneficial [3]. This will be the case if the ADAS is easy to use and understand and does not bother the driver with useless or untimely information. Therefore, another significant, still understudied component of personalization in ADAS is the design of the interface between the vehicle and the driver.

These ADAS applications often rely on a mono front camera or a stereo-vision camera. The camera data is sometimes complemented with data from other sensors, such as light detection and ranging (LiDAR) or radio detection and ranging (RADAR). ADAS cameras are often mounted against the front windshield, behind the central rear-view mirror. The ADAS camera's field of view is positioned in the wiper area to keep the glass in front of the camera as clean as possible. Sometimes RADAR sensing, vision sensing, and data fusion are coupled in a single module.

This work proposes ADAS applications using road information obtained from a mono color camera and the NVIDIA Jetson AGX Orin vehicle computer for high-performance computing workloads, as shown in Figure 1.1. The intelligent computation of the proposed ADAS is implemented using deep learning models such as YOLOv6 and Ultra Fast Lane Detection, which make decisions based on critical observations of surrounding objects. The vehicle computer processes the data from the road, including traffic signs, road lanes, vehicles, and pedestrians, to perform essential ADAS applications, including FCW, TSR, and LDW, and then displays the assistance information and warnings on the GUI.

Figure 1.1: Diagram of the proposed system.

GOALS

This project aims to propose ADAS software based on a centralized E/E hardware architecture that can operate in real time to complement the current ADAS applications in the automotive industry [3], which will appear in every upcoming vehicle. The system focuses on useful applications such as forward collision warning, lane departure warning, traffic sign recognition, and a graphic user interface to enhance user experience and safety.

LIMITATIONS

Because of the vast field of ADAS applications and the variety of road scenarios, this work mainly develops models based on videos selected from the internet. Furthermore, this project has certain limitations. The ADAS cannot operate in real-life environments with crowds, unexpected changes, low light, or complex traffic scenarios, implying that it can only work in well-monitored environments. For traffic sign recognition, for example, the camera angle in each video strongly affects detection performance, so this ADAS software may not work with arbitrary dashcam videos but only with those that have been pre-selected. Similarly, the safe distance for forward collision warning is based on reliable research, but since the videos are downloaded from the internet, the safe distance used in the program is only an assumption.

OUTLINES

The main aspects of this thesis are summarized as follows:

• Chapter 1: Introduction

This chapter introduces the topic, the objectives, the limitations, the related works of the research, and the layout of this thesis.

• Chapter 2: Literature Review

This chapter gives the fundamental theory, the framework, and the algorithms for the implementation of the thesis, using the relevant studies as a reference source.

• Chapter 3: System Design

This chapter presents a detailed design of the proposed software and hardware, including data collection, algorithms, procedure, evaluation, and operation.

• Chapter 4: Results

This chapter presents the results of the proposed work.

• Chapter 5: Conclusion and Future Work

This chapter gives the conclusion and some future works which will be conducted.

LITERATURE REVIEW

DEEP LEARNING

Deep learning has emerged as an efficient method for big-data analysis, utilizing complicated algorithms and artificial neural networks to train machines and computers to learn from experience and to categorize and identify data and images in much the same way the human brain does. At its heart is the Convolutional Neural Network (CNN), an artificial neural network frequently used in deep learning for image and object detection and classification. CNNs are essential in various applications such as image processing, computer vision tasks such as object detection and segmentation, video analysis, recognizing obstacles in self-driving cars, and speech recognition in natural language processing.

A CNN is a sequence of layers. There are three main layers used to build CNN architectures, as shown in Figure 2.1: the convolutional layer, the pooling layer, and the fully connected layer. In contrast with fully connected networks, where every neuron is linked to all neurons in the next layer, a CNN has a relatively concise organization, because each neuron only responds to a specific receptive field. This reduction in connectivity means decreased computational cost. The structure of a CNN captures the spatial features of an image. The features become more and more complex through each layer, distinguishing the object from other objects in the image and helping to identify the object accurately.

Figure 2.1: A basic CNN architecture for image classification.

The convolutional layer (Conv) is the foundation of a CNN. It is the initial layer that extracts features from an input image. The Conv layer's parameters are a collection of learnable filters. Every filter is small spatially (along width and height) but has the same depth as the input volume. Convolution preserves the relationship between pixels by learning image features. The operation therefore requires two inputs: the input image matrix and the filter (kernel).

Figure 2.2: Demonstration of convolution operation with 5×5 input and 3×3 kernel.

Stride and padding are two new notions in this layer. Stride is the number of pixels the filter is shifted across the input matrix. When the stride is 1, we shift the filter one pixel at a time; when the stride is 2, we shift it two pixels at a time. Figure 2.3 demonstrates how convolution works with a stride of 1.

Figure 2.3: Demonstration of convolution operation 3×3 kernel over 5×5 image with padding = 0, stride = 1 [4].

The filter does not always fit the input image exactly. Hence, to help the kernel process the image, padding is added around the image frame to provide additional space for the kernel to cover the image. Padding allows us to use a Conv layer without reducing the output image size and is therefore essential for building deeper networks.
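As a concrete illustration of how kernel size, stride, and padding determine the output size, the short sketch below evaluates O = (W − K + 2P) / S + 1 and cross-checks it with a PyTorch convolution; the tensor shapes are chosen only for demonstration.

```python
import torch
import torch.nn as nn

def conv_output_size(w: int, k: int, p: int, s: int) -> int:
    """Output width/height of a convolution: O = floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# 5x5 input and 3x3 kernel, as in Figures 2.3 and 2.4.
print(conv_output_size(5, 3, 0, 1))   # 3  -> padding = 0 shrinks the feature map
print(conv_output_size(5, 3, 1, 1))   # 5  -> padding = 1 preserves the size

# Cross-check with an actual PyTorch layer (single-channel toy input).
x = torch.randn(1, 1, 5, 5)
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, stride=1)
print(conv(x).shape)                  # torch.Size([1, 1, 5, 5])
```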

Figure 2.4: Demonstration of convolution operation 3×3 kernel over 5×5 image with padding = 1, stride = 1 [4].

Finally, the convolutional output is added with a bias and passed through an activation function before becoming the input of the network's next convolutional layer. The ReLU (Rectified Linear Unit) function, a piecewise linear function that outputs a positive input directly and returns zero otherwise, is a frequent activation function in CNNs. Because ReLU is cheap to compute, it can increase deep neural network training speed compared to typical activation functions (sigmoid, tanh). As a result, the earliest layers (including the first hidden layer) can still receive error signals from the later layers, and all the weights between layers can be updated. A common activation function such as the sigmoid is limited to values between 0 and 1, so only very small error signals reach the first hidden layer, which can leave the network poorly trained. The combination is therefore known as the Conv-ReLU layer.

When the feature maps are too large, adding pooling layers lowers the number of parameters. A pooling layer is commonly inserted between convolutional layers in a CNN architecture. Spatial pooling, also known as sub-sampling or down-sampling, decreases the dimensionality of each feature map while retaining critical information. Spatial pooling can take several forms: max pooling takes the largest element from the rectified feature map, average pooling takes the average of all elements in the feature map, and sum pooling takes their sum.

Figure 2.6: Demonstration of max pooling with 3×3 filters on 5×5 image and stride = 1 [4].

The typical CNN design stacks a few Conv-ReLU layers, follows them with pooling layers, and repeats this pattern until the image has been spatially reduced to a small size. A fully connected layer is then added as the last layer to hold the output after that sequence.

The network has learned a number of distinctive features of the picture after numerous Conv-ReLU and pooling layers. The output tensor of the final layer, of size h × w × c (height, width, and channels), is flattened into a one-dimensional vector. Finally, the flattened vector is fed to a fully connected layer.

Figure 2.7: Demonstration of the flattening process.

The fully connected layer is an artificial neural network whose input layer receives the feature map from the output of the Conv-ReLU-pooling layers and then passes it through multiple hidden layers of nodes before classifying the output with the soft-max activation function. The design may therefore include more than one fully connected layer.
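To tie the convolutional, pooling, and fully connected layers together, here is a minimal PyTorch sketch of the pattern just described; the channel counts, 32×32 RGB input size, and number of classes are illustrative assumptions, not the architecture used later in this thesis.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Conv-ReLU -> Pool repeated, then Flatten -> Fully connected layers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                        # 32 * 8 * 8 = 2048 features
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),         # soft-max is applied in the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = SimpleCNN()(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```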

OBJECT DETECTION

Deep learning has attracted extensive attention in the past decades. Various research has been carried out in recent years to develop practical approaches that accelerate the development of deep learning methodologies. Numerous developments have achieved excellent results and continuously reshaped how deep learning relates to people's everyday lives. The deep learning-based object detector is prominent among them, helping to solve many problems in medical image analysis, self-driving vehicles, business analytics, and face identification. Object detectors are divided into one-stage object detectors, such as the YOLO series [5-10], and two-stage object detectors, such as R-CNN and Faster R-CNN [11].

Two-stage detectors divide the detection work into (1) proposal generation and (2) prediction for these proposals. During the proposal generation phase, the detector attempts to identify regions in the picture that may contain objects. The goal is to propose regions with a high recall, so that every object in the image corresponds to at least one of these proposed regions. In the second stage, a deep learning-based model is used to classify these proposals with the appropriate labels: each region can be either background or an object from one of the predefined classes. Furthermore, the model may refine the initial localization given by the proposal generator.

Figure 2.8: Overview of different two-stage detection frameworks for generic object detection [11].

R-CNN is a ground-breaking two-stage deep learning-based object detector proposed in 2014. The R-CNN pipeline is separated into three parts: (1) proposal creation, (2) feature extraction, and (3) region classification. R-CNN generates a sparse collection of roughly 2000 proposals for each picture using Selective Search, which excludes regions easily recognized as background. These proposals are then fed to a CNN, which performs classification and bounding box regression. Deep convolutional networks extract the features of each proposal individually, resulting in highly repeated calculations; as a result, R-CNN requires a significant amount of time for training and testing [11].

Later, Faster R-CNN introduced an architecture in which features are shared across both stages, resulting in a considerable gain in efficiency. It is faster than R-CNN because it employs a convolutional backbone network, such as VGG or ResNet, to generate global feature maps. These convolutional maps are shared between the RPN and the detection network, lowering the cost of producing proposals externally. Many follow-up efforts have attempted to enhance detection accuracy using various techniques, such as creating better backbones that yield richer representations. For example, feature pyramid networks (FPN) have been proposed to harvest RoI features from various layers depending on the scale. Later research, such as ResNeXt with grouped convolutions or Res2Net, attempted to improve the internal connections of residual networks to better utilize the multi-scale features of convolutional maps [11].

Unlike two-stage detection methods, which divide the detection process into two stages, a one-stage object detector typically comprises three components: a backbone, a neck, and a head. The backbone primarily determines the feature representation ability; its design also significantly impacts inference efficiency, since it accounts for a considerable share of the computational cost. The neck combines low-level physical information with high-level semantic features and then builds pyramid feature maps at all levels. The head comprises several convolutional layers and predicts the final detection outcomes based on the multi-level features gathered by the neck. The Single Shot MultiBox Detector (SSD) and YOLO (You Only Look Once) were among the first to offer a single unified design that did not require a proposal computation [11].

Figure 2.9: Overview of different one-stage detection frameworks for generic object detection: a) YOLO and b) SSD [11].

YOLO approached object detection as a regression problem and split the entire image spatially into a predefined number of grid cells (e.g., a 7 × 7 grid). Each cell was considered a proposal for detecting the presence of one or more objects; in the initial implementation, each cell was assumed to hold the centers of up to two objects. For each cell, the prediction comprised the following information: whether the location contained an object, the bounding box coordinates and size (width and height), and the object's class. However, YOLO had specific difficulties: (1) it could only detect up to two objects at a given location, making it difficult to detect small or cluttered objects; (2) only the last feature map was used for prediction, which was insufficient for predicting objects of varying sizes and aspect ratios [11].
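As a quick illustration of the grid formulation, the original YOLO prediction tensor has size S × S × (B·5 + C); the snippet below assumes the commonly cited configuration of S = 7, B = 2 boxes per cell, and C = 20 classes, which is illustrative rather than specific to this thesis.

```python
# Size of the original YOLO prediction tensor: S x S x (B*5 + C).
# Each box contributes 4 coordinates + 1 objectness score, plus C class scores per cell.
S, B, C = 7, 2, 20          # assumed demo values (PASCAL VOC-style setup)
per_cell = B * 5 + C        # 30 values per grid cell
print(S * S, per_cell, S * S * per_cell)  # 49 cells, 30 values each, 1470 outputs
```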

The Single Shot MultiBox Detector (SSD) addresses YOLO's limitations. SSD comprises two parts: a backbone model and an SSD head. The backbone model, which serves as a feature extractor, is often a pre-trained image classification network such as ResNet or MobileNet with its last fully connected layer removed. The SSD head is simply one or more convolutional layers added to this backbone, whose final-layer activations are interpreted as bounding boxes and class scores for objects at each spatial location [11].

Furthermore, SSD predicts objects on several feature maps, with each feature map responsible for detecting objects of a specific size according to its receptive field. Several new convolutional feature maps were added to the original backbone design to detect large objects and increase the receptive fields. Using an end-to-end training approach, the whole network is optimized with a weighted sum of localization loss and classification loss across all prediction maps, and the final prediction is created by combining all detection scores from the different feature maps. Hard negative mining was used while training the detector to prevent the many negative proposals from dominating the training gradients, and intensive data augmentation was applied to boost detection accuracy. SSD obtained detection accuracy comparable to Faster R-CNN while gaining the ability to perform real-time inference [11].

However, the subsequent versions of YOLO considerably improved performance while maintaining real-time inference speed. YOLOv2 created improved anchor priors from the training data using k-means clustering, which helped reduce localization optimization challenges; by also integrating Batch Normalization layers and multi-scale training, YOLOv2 achieved strongly competitive results at the time. YOLOv3 uses a different feature extractor than its predecessor: a network of 3×3 and 1×1 convolutions called Darknet-53, a more advanced version of the Darknet-19 used in YOLOv2. As the name implies, this backbone comprises 53 convolutional layers. Adopting ResNet-style residual layers enhanced its accuracy while keeping its speed advantage; this feature extractor outperforms ResNet-101 and ResNet-152 while being 1.5× and 2× faster, respectively [5]. YOLOv4 [6] reorganized the detection framework into three distinct components (backbone, neck, and head) and verified the bag-of-freebies and bag-of-specials techniques of the time to create a framework suited for training on a single GPU [12].

A single CNN pass is required to generate the output of a one-stage object detector. In the case of two-stage object detectors, the high-score region proposals obtained from the first stage must additionally be processed by the second stage. Accordingly, the inference time can be written roughly as T_one = T_cnn for a one-stage detector and T_two = T_cnn + m · T_head for a two-stage detector, where m is the number of region proposals with confidence scores larger than a certain threshold. In other words, the inference time of a one-stage object detector is fixed, but the inference time of a two-stage object detector is not. As a result, real-time object detectors are almost always one-stage detectors.

Recently, the development of real-time object detectors has centered on creating efficient architectures. The backbones of real-time object detectors deployed on CPUs are mostly MobileNet, ShuffleNet, or GhostNet, while popular backbones for GPUs include ResNet and DarkNet. At present, state-of-the-art real-time object detectors are mainly based on YOLO, namely YOLOv5 [7], YOLOv6 [8], and YOLOv7 [9]. The ADAS applications in this work are developed using object detection based on YOLOv6, after evaluating the performance of each architecture as described in Chapter 3.

YOLOv6 OBJECT DETECTION ARCHITECTURE

After the great effort invested in pioneering works, the YOLO series has become the most preferred detection architecture in industrial applications due to its good balance of speed and accuracy. The authors created YOLOv6 after observing several essential characteristics that prompted them to improve the YOLO framework: (1) RepVGG re-parameterization is a superior strategy that is currently underrated in object detection. In addition, they found that simple model scaling of RepVGG blocks is no longer practical, and they believe strict network design consistency across small and large networks is unnecessary: for small networks, the plain single-path architecture is a better choice, whereas for larger models the exponential growth of parameters and the computation cost of the single-path architecture make it impractical. (2) Quantization of re-parameterization-based detectors also necessitates thorough treatment; otherwise, dealing with the performance degradation caused by the heterogeneous configuration during training and inference would be intractable. (3) Previous works tend to pay less attention to deployment, with latencies typically compared on high-cost machines such as the NVIDIA V100; there is a hardware gap when it comes to practical, real-world applications, where common GPUs and edge devices, such as the Tesla T4, RTX 3090, and Jetson embedded systems, are less costly and provide sufficient inference performance. (4) Considering the architectural variance, advanced domain-specific strategies, such as label assignment and loss function design, necessitate additional verification. (5) For deployment, training strategy adjustments that improve accuracy without increasing inference cost, such as knowledge distillation, can be tolerated. They offer two scaled re-parameterizable backbones and necks to support models of various sizes, as well as an efficient decoupled head using a hybrid-channel technique, based on the concept of hardware-friendly network design. Figure 2.10 presents the overall architecture of YOLOv6 [8].

Figure 2.10: The overall YOLOv6 architecture. RepBlock comprises a stack of RepVGG blocks with ReLU activations [8].

The critical factors in much of the research on developing effective backbone designs are the number of parameters, the amount of computation, and the computational density. Although many advanced CNNs outperform basic ones in terms of accuracy, their downsides are substantial: (1) complicated multi-branch architectures, such as ResNet's residual addition and Inception's branch concatenation, make the model harder to implement and adjust, slow down inference, and reduce memory utilization; (2) certain components, such as the depthwise convolutions in Xception and MobileNets and the channel shuffle in ShuffleNets, increase memory access costs and lack device support. With many variables influencing inference speed, the number of floating-point operations (FLOPs) does not accurately reflect the actual speed: some novel models have fewer FLOPs than classic ones such as VGG and ResNet-18/34/50, yet may not run more quickly. As a result, VGG and the original versions of ResNet are still often used in both research and real-world applications.

Figure 2.11: Presentation of RepVGG architecture versus ResNet and InceptionV2.

RepVGG has a plain architecture like the classic VGG, consisting of a stack of Conv, ReLU, and pooling layers without any branches, as shown in Figure 2.11. It is challenging for a plain model to reach a level of performance comparable to multi-branch architectures, because the vanishing gradient problem makes deep plain networks difficult to train: as the gradient is backpropagated to earlier layers, the value of the product of derivatives drops until, at some point, the partial derivative of the loss function approaches zero and vanishes.

As a result, the deeper the network extends, the more saturated or significantly degraded its performance becomes. ResNet makes the model an implicit ensemble of numerous shallower models by adding shortcut connections, using a 1×1 Conv layer for linear projection of a stack of feature maps where needed, so that training a multi-branch model avoids the vanishing gradient problem. The Inception block consists of four parallel branches: the first three use convolutional layers with window sizes of 1×1, 3×3, and 5×5 to extract information at different spatial scales. A larger kernel is preferred for information distributed more globally, and a smaller kernel for information distributed more locally. The middle two branches also add a 1×1 convolution of the input to reduce the number of channels and hence the model's complexity [13].

As discussed above, it is difficult for a plain model to compete with the performance of multi-branch architectures. However, the benefits of multi-branch architecture apply only to training, while its drawbacks are undesirable for inference. RepVGG is therefore proposed to decouple the training-time multi-branch architecture and the inference-time plain architecture via structural re-parameterization, which means transforming the parameters of one architecture into those of the other.

Specifically, during training, RepVGG utilizes identity and 1×1 branches, which are inspired by ResNet but implemented so that the branches may be removed by structural re-parameterization, as seen in Figure 2.11 (d). RepVGG performs the transformation with simple algebra after training: an identity branch can be regarded as a degraded 1×1 Conv, and the latter as a degraded 3×3 Conv, allowing it to construct a single 3×3 kernel from the trained parameters of the original 3×3 kernel, the identity and 1×1 branches, and the batch normalization (BN) layers. As a result, the transformed model contains only 3×3 Conv layers, which are saved for testing and deployment [13].
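Below is a minimal sketch of this structural re-parameterization for a single block, assuming stride 1, equal input/output channels, and that BN has already been folded into the convolution weights and biases (BN fusion is omitted for brevity); the layer shapes are illustrative and not taken from the official RepVGG code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reparameterize(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d, channels: int) -> nn.Conv2d:
    """Fold a 3x3 branch, a 1x1 branch, and an identity branch into one 3x3 conv."""
    fused = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=True)

    # Start from the 3x3 kernel and its bias.
    w = conv3x3.weight.data.clone()
    b = conv3x3.bias.data.clone()

    # A 1x1 conv is a degraded 3x3 conv: pad its kernel to 3x3 around the centre.
    w += F.pad(conv1x1.weight.data, [1, 1, 1, 1])
    b += conv1x1.bias.data

    # The identity branch is a degraded 3x3 conv with 1 at the centre of its own channel.
    for c in range(channels):
        w[c, c, 1, 1] += 1.0

    fused.weight.data, fused.bias.data = w, b
    return fused

# Sanity check: the fused conv reproduces the three-branch training-time output.
ch = 8
c3 = nn.Conv2d(ch, ch, 3, padding=1, bias=True)
c1 = nn.Conv2d(ch, ch, 1, bias=True)
x = torch.randn(2, ch, 16, 16)
y_train = c3(x) + c1(x) + x                         # multi-branch (training-time) form
y_plain = reparameterize(c3, c1, ch)(x)             # single 3x3 conv (inference-time) form
print(torch.allclose(y_train, y_plain, atol=1e-5))  # True
```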

Notably, the body of an inference-time RepVGG includes just one type of operator: 3×3 Conv followed by ReLU, which allows RepVGG to run quickly on general computing devices such as GPUs Even better, RepVGG enables specialized hardware to attain even faster speeds since, given chip size and power consumption, the fewer types of operators required, the more computing units we can fit into the device As a result, an inference device optimized for RepVGG can include a massive number of 3×3-ReLU units and fewer memory units [13].

Figure 2.12: Peak memory occupation in the residual and plain model [13].

Because the results of each branch must be retained until the addition or concatenation, a multi-branch topology is memory-inefficient and greatly increases the peak memory occupation. The input to a residual block must be held until the addition, as shown in Figure 2.12; assuming the block keeps the feature map size constant, the peak amount of extra memory occupied is roughly twice the input. A plain topology, in contrast, permits the memory consumed by the inputs to a layer to be released as soon as the operation is completed. When designing specialized hardware, a plain CNN therefore allows for deeper memory optimizations and lowers memory unit costs, allowing more computing units to be integrated onto the chip [13].

Various pyramid network algorithms are used to implement the neck of an object detector. Pyramid networks effectively fuse features from different levels of the backbone. There are two common types: feature pyramid networks (FPN) [14], used in YOLOv3, and path aggregation networks (PAN) [15], used in YOLOv4. The neck of YOLOv6, named Rep-PAN, follows the PANet topology, with RepVGG blocks for small models (nano, tiny, small) or CSPStackRep blocks for bigger models (medium, large). The PANet topology employs path augmentation to improve object localization by strengthening low-level patterns, and it uses concatenation and fusion to predict the object class and mask. The PANet structure is depicted in Figure 2.13.

Figure 2.13: YOLOv6 Rep-PAN neck architecture [8].

The complexity of the features increases as the image passes through the successive layers of the backbone, from low-level features such as edges and textures to whole object parts such as eyes and noses. However, due to the strided convolutions and pooling layers in the backbone, the spatial resolution of the feature maps drops. This results in a loss of spatial information, making these high-level features unsuitable for predicting pixel-level masks.

The PAN network is divided into stages, each having layers that generate feature maps with the same spatial size. For example, P3 and C3 in Figure 2.13 belong to the same stage (C denotes feature maps coming from the backbone and P the final feature maps). In the bottom-up path, the feature map of each stage is generated by applying a RepBlock to the previous feature map and concatenating the result with the same-stage feature map from the backbone; another 3×3 convolution is then performed on the result to produce the final features.
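The fusion step just described can be sketched as follows; this is a simplified, hypothetical PyTorch rendering of one bottom-up stage, where a plain stack of 3×3 Conv-ReLU layers stands in for the RepBlock and the channel counts are illustrative, not the actual YOLOv6 implementation.

```python
import torch
import torch.nn as nn

class BottomUpFusion(nn.Module):
    """One simplified Rep-PAN bottom-up step: transform and downsample the previous
    pyramid level, concatenate it with the same-stage backbone feature map, then
    fuse the result with a 3x3 convolution."""
    def __init__(self, prev_ch: int, backbone_ch: int, out_ch: int):
        super().__init__()
        self.rep_block = nn.Sequential(  # stand-in for a stack of RepVGG blocks, stride 2
            nn.Conv2d(prev_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(out_ch + backbone_ch, out_ch, 3, padding=1)

    def forward(self, p_prev: torch.Tensor, c_same: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.rep_block(p_prev), c_same], dim=1)
        return self.fuse(x)

# Toy shapes: previous pyramid level at 80x80, same-stage backbone map at 40x40.
p3 = torch.randn(1, 64, 80, 80)
c4 = torch.randn(1, 128, 40, 40)
print(BottomUpFusion(64, 128, 128)(p3, c4).shape)  # torch.Size([1, 128, 40, 40])
```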

The object detection head performs the final prediction of the bounding boxes and class scores. End-to-end object detection systems use several heads to correctly identify objects at different resolutions. There are three heads in the YOLO family, for classification, localization, and regression. However, classification and localization in a coupled head are two different tasks sharing almost identical features. Some works have noted the conflict between the two objective functions in the coupled head: the spatial misalignment between them can significantly harm the training process, as can the conflict between the classification and regression tasks [8].

Figure 2.14: YOLOv6 decoupled head architecture [8].

As shown in Figure 2.14, the efficient decoupled head differs from the coupled detection head of YOLOv5, in which parameters are shared between the classification and localization branches.

In YOLOv6, the authors use a hybrid-channel method from YOLOX [10] to create a more efficient decoupled head. They specifically restrict the number of middle 3×3 convolutional layers to one, and the head's width is jointly scaled by the width multiplier of the backbone and the neck. These changes significantly cut processing costs, resulting in shorter inference times [8].

SYSTEM DESIGN

OVERALL SYSTEM

Most car manufacturers use camera systems, at least on their high-end models, whether a simple rear-view camera to prevent back-over accidents, a front camera for lane departure warning and forward collision warning, or stereo cameras for estimating the depth of the environment ahead of the vehicle. Naturally, more complex applications need higher-end processing units and sensors, which are more expensive. But as safety regulations change and manufacturers compete, versatile and inexpensive sensors such as cameras will become more common in low-end cars and eventually become a standard feature of many vehicles.

Figure 3.1: Warning zone of the sensors [16].

As a result, this work proposes an ADAS based solely on the front camera and the Jetson Orin vehicle computer for three applications: forward collision warning, lane departure warning, and traffic sign recognition. The overall framework of these applications is shown in Figure 3.2.

Figure 3.2: The framework of the proposed ADAS.

From the beginning, each frame from the camera is transferred to the Jetson Orin, which is responsible for the primary processing. As shown in Figure 3.3, four threads are executed concurrently, because the three heavy object detection and lane detection tasks would otherwise consume the GUI application's hardware resources and cause the application to freeze. The input frame is fed into three threads: (1) traffic sign recognition, (2) forward collision warning, and (3) lane departure warning. After the threads have executed successfully, the results are asynchronously sent to the final block within the same time frame to visualize, investigate, and prepare the assistance information. Finally, all the assistance information is displayed on the GUI in the vehicle. These applications are detailed in the following sections.

Figure 3.3: Apply concurrent programming for the system.

COMPARISON OF OBJECT DETECTION MODELS

This study aims to create an ADAS application that can operate in real time on low-power edge devices in a vehicle. Hence, numerous evaluations have been carefully conducted to discover the object detection model with the best accuracy and inference speed. This study benchmarks the latest models in the YOLO family, including YOLOv5 [7], YOLOv6 [8], and YOLOv7 [9], on a custom benchmark dataset derived from the TT100K dataset [18], as shown in Figure 3.4. All the models are converted to FP16 precision with NVIDIA TensorRT [19] for speed tests on the Jetson Orin vehicle computer, because TensorRT offers a significant performance advantage by fusing ReLU into convolution. In addition, FP32 precision and FP16 precision are compared to evaluate the trade-off between accuracy and speed.
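One typical way to build such an FP16 engine from an exported ONNX model with the TensorRT Python API is sketched below; the file names are placeholders, and the exact export and deployment pipeline used in this work may differ.

```python
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX model and serialize a TensorRT engine with FP16 enabled."""
    builder = trt.Builder(LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse the ONNX file")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)       # allow FP16 kernels

    engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine)

# Example with placeholder file names:
# build_fp16_engine("yolov6s.onnx", "yolov6s_fp16.engine")
```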

The dataset described in this part is only used to train the benchmark models, not the final models used in the proposed ADAS software, since the author later discovered specific approaches that provide superior results, detailed in Sections 3.3 and 3.4. The benchmark dataset contains 16 classes in 5469 training images and 916 validation images, accounting for 85.6% and 14.4% of the benchmark dataset, respectively. In addition, as indicated in Table 3.1, this dataset comprises 9248 traffic sign labels from TT100K and 13891 labels added in this research for three classes: four-wheelers (cars, trucks, buses), two-wheelers (motorcycles, bicycles), and pedestrians.

Table 3.1: Classes and labels in Benchmark Dataset.

Number Classes Train labels Validation labels

Each YOLO model was trained on an RTX 3090 for 150 epochs with an input size of 1280×1280, a batch size of 12, a learning rate of 0.01, a momentum of 0.937, and slight data augmentation, as shown in Table 3.2:

Table 3.2: Data augmentation for training benchmark model.

Mix-up 0.15 Mix-up (probability)

The results of this benchmark are shown in Table 3.3 and Figure 3.5. YOLOv6s outperforms YOLOv5s in mAP50 (FP16) by 18.05% while being 11.41% slower. On the other hand, YOLOv6s impressively surpasses YOLOv5m in both accuracy and inference speed, by 0.13% and 75.49%, respectively. Although YOLOv7 produces the most precise results, it is also the slowest. Furthermore, the precision results of FP16 and FP32 among these models show no difference, even though FP32 inference is roughly twice as slow. Finally, the YOLOv6s model in FP16 precision was chosen for this ADAS software because of its excellent combination of accuracy and speed.

Table 3.3: Comparison of YOLO-series object detectors on the benchmark dataset.

Models | Params | FLOPs | Input size | mAP50 (val, FP16) | mAP50 (val, FP32) | FPS (FP16) | FPS (FP32)

Figure 3.5: The detection result of YOLOv6s in FP16 precision on benchmark dataset.

TRAFFIC SIGN RECOGNITION

Traditional TSR approaches are divided into color-based, shape-based, and machine learning-based methods. Color-based and shape-based detection techniques primarily use specific image colors and shapes to manually extract features, such as SIFT (scale-invariant feature transform) features [20] and Histograms of Oriented Gradients (HOG) [21], and match marker signs against templates. However, color-based and shape-based techniques are vulnerable to weather, illumination, and other environmental influences. Machine learning approaches extract invariant or comparable visual elements from traffic signs, detect traffic signs in images, and then categorize them with classification algorithms to comprehend the semantic information contained in the signs. Present traffic sign recognition systems have various weaknesses. First, traffic signs occupy only a small part of the actual road scene, making accurate traffic sign information challenging for detection systems to extract.

Figure 3.6: TT100K Traffic Sign Dataset.

For instance, in the Tsinghua-Tencent 100K dataset (TT100K) [18], as shown in Figure 3.7, traffic signs cover only about 0.2% of the image pixels. Second, large-scale and small-scale traffic signs appear in the same image; this variation in scale can easily cause false or missed detections, thereby reducing detection accuracy. To meet real-time detection requirements in complicated traffic situations, deployed traffic sign detection technology demands both high accuracy and fast inference speed.

Figure 3.7: Small traffic signs in the TT100K dataset’s image.

After surveying the YOLO-series architectures, the above problems can be addressed by the YOLOv6 architecture [8] together with the "bag-of-freebies" training techniques inspired by YOLOv4 [6].

Most small-scale traffic signs' spatial information exists only in the shallow layers and may not survive to the deep layers through the feature extraction process; YOLOv6's Rep-PAN neck helps preserve this spatial information. With the RepVGG backbone, the traffic sign detector can improve its feature extraction capability by going deeper without a vanishing gradient problem, better preserving small-scale traffic sign spatial information while reducing computation cost during inference.

This work pre-processes the TT100K dataset to suit Vietnamese traffic signs, emphasizing prohibition signs. Speed limit signs below 50 km/h in the TT100K dataset are removed because such signs have been removed in Vietnam since 2015 [22]. The custom TSR-TT100K dataset, as indicated in Table 3.4, contains 12 classes in 5160 training images and 620 validation images, accounting for 89.3% and 10.7% of the TSR-TT100K dataset, respectively.

Table 3.4: Classes and labels in the TSR-TT100K dataset.

Number Classes Train labels Validation labels

During training, this work further utilizes the "bag-of-freebies" strategy to improve the model's performance as follows. The YOLOv6s Traffic Sign Recognition model is trained from scratch on an RTX 3090 for 200 epochs with an input size of 1280×1280, a batch size of 20, a learning rate of 0.01, a momentum of 0.937, and minor data augmentation. The weights from the scratch training then serve as the teacher for transfer learning over 300 epochs with a learning rate of 0.0032 and heavy data augmentation, as shown in Table 3.5.

Table 3.5: Data augmentation for training from scratch and fine-tuning.

Parameters Scratch Values Finetune Values Descriptions

Mix-up 0.15 0.243 Mix-up (probability)

During training, the author found that the default input size of 640×640 used for the COCO2017 dataset was inefficient: the loss value did not converge after 50 epochs because of the tiny traffic sign size. Hence, this work chooses 1280×1280 as the input size, because for small objects such as the traffic signs in this dataset, increasing the resolution increases the richness of the features the YOLOv6 architecture can extract from the tiny labeled bounding boxes.

Training a YOLOv6s-TSR model at the higher resolution of 1280×1280 for 300 epochs takes 9 hours, nearly three times as long as training at 640×640. The training performance metrics shown in Figures 3.8 and 3.9 indicate the model's progress. There are two loss components: the IoU loss measures how well the predicted region of interest overlaps the object, and the classification loss indicates how successfully the algorithm predicts the proper class of an object. After 20 epochs, the loss values decline dramatically, indicating that the model is learning effectively. Because of the limited dataset, the accuracy of the scratch model increased quickly before plateauing after roughly 50 epochs.

Figure 3.8: YOLOv6s-TSR performance metrics during training from scratch.

The creators of YOLOv6 provide an early-stopping strategy, which is used here to choose the best weights as the teacher weights for fine-tuning. Transfer learning is an effective way of training an improved model on new data without retraining the entire network, which is why the first mAP50 of the fine-tuning process starts at 0.954. At first glance, the accuracy of the fine-tuned weights improves by only 1% after training. However, evaluation reveals more promising results, since the metrics reported during training are only the mean over the 20 images of each training batch rather than the whole data set. Furthermore, the confusion matrix created at 640×640, a quarter of the validation image resolution, shows deep blue cells on the diagonal, indicating that the detection results from the fine-tuned weights distinguish effectively across classes, as shown in Figure 3.10.

Figure 3.9: YOLOv6s-TSR performance metrics during fine-tuning training.

It is not easy to record a picture for every real-world scenario the model may encounter at inference time. Thus, the combination of transfer learning and strong data augmentation lets the model learn from a wider range of situations. The inference results of the TSR scratch and fine-tuned weights on a random road in Vietnam, shown in Figure 3.11, confirm that the fine-tuned model is superior: it detects the traffic signs in the image correctly and with high confidence, whereas the scratch weights have lower confidence and mistake the 80 km/h speed limit for the 100 km/h one.

Figure 3.11: Comparison of inference results of the scratch and fine-tuned TSR weights.

Table 3.6 below shows the mAP50 evaluation at 640×640 resolution after 200 epochs of training from scratch and 300 epochs of fine-tuning. The average accuracy across all classes of the fine-tuned weights improves by 4.76% compared to the scratch weights. These results are measured at an inference size of 640×640 with validation images resized to 640×640; inference of the fine-tuned weights at 1280×1280 with validation images at 1280×1280 has higher accuracy, by about 9.48%, but the average inference time is also significantly slower, almost fourfold (8.51 ms versus 2.41 ms). Furthermore, compared to the benchmark dataset, which contains practically all the classes included in the custom TT100K dataset except for the 70 km/h speed limit, the fine-tuned weights boost accuracy by 16.4%. The reason for this significant increase is that the benchmark dataset contains numerous instances of vehicles and pedestrians manually labeled by the author, causing the dataset to become imbalanced as some classes are overrepresented, which lowers performance. Based on the findings of the above analyses, the TSR fine-tuned weights were chosen for the YOLOv6s model due to their excellent accuracy and fast inference at 640×640.

Table 3.6 Evaluate results from scratch, fine-tune and benchmark dataset.

Classes mAP 50, Benchmark mAP 50, Scratch mAP 50, Finetune mAP 50, Finetune at 1280

After carefully evaluating the effectiveness of the fine-tuned weights on the TT100K dataset, the author converts the weights to a TensorRT floating-point-16 engine for faster inference speed while retaining the same accuracy, as shown in Table 3.4. Road signs are usually located above and to the right of the driver's field of view, as seen in Figure 3.12, so the TSR block is designed to detect traffic signs in these regions. If the TSR block detects a sign in an intended region with a confidence threshold of 0.8 and an IoU threshold of 0.25, the sign's coordinates and name are transferred to the final block for processing and visualization. The demonstration result when traffic signs appear in one of the two proposed regions is shown in Figure 3.13.
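A simple way to implement this region filter is sketched below; the region coordinates, threshold constants, and detection format (class name, confidence, pixel box) are illustrative assumptions rather than the exact implementation of the TSR block.

```python
from typing import List, Tuple

# Each region is (x_min, y_min, x_max, y_max) in pixels for an assumed 1280x720 frame.
UPPER_REGION = (0, 0, 1280, 240)       # signs mounted above the road
RIGHT_REGION = (960, 0, 1280, 480)     # signs on the right-hand side
CONF_THRESHOLD = 0.8

def box_center(box: Tuple[int, int, int, int]) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def in_region(box, region) -> bool:
    cx, cy = box_center(box)
    rx1, ry1, rx2, ry2 = region
    return rx1 <= cx <= rx2 and ry1 <= cy <= ry2

def filter_signs(detections: List[dict]) -> List[dict]:
    """Keep only confident detections whose centers fall in the upper or right region."""
    return [
        d for d in detections
        if d["conf"] >= CONF_THRESHOLD
        and (in_region(d["box"], UPPER_REGION) or in_region(d["box"], RIGHT_REGION))
    ]

# Example detection as produced by an (assumed) TSR post-processing step.
dets = [{"name": "limit_speed_60", "conf": 0.91, "box": (1050, 120, 1110, 180)}]
print(filter_signs(dets))  # the sign lies in the right region and passes the filter
```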

Figure 3.12: Common locations of traffic signs on the road.

Figure 3.13: TSR results for traffic signs in the above and right regions (images are zoomed in due to the small size of the detected signs).

FORWARD COLLISION WARNING

FCW is an effective method of lowering the chance of a collision with the vehicle in front. Many sensors, including radar, LiDAR, and cameras, have been used in this area. LiDAR and radar are the two most commonly used sensors for distance measurement and forward collision warning because of several benefits, including the ability to identify long-distance obstacles even in rainy or cloudy weather or at night. However, the conventional LiDAR-based [23] and radar-based [24] FCW techniques require expensive hardware and lack the object identification capacity needed to evaluate traffic scenes for applications such as traffic sign recognition and lane departure warning. On the other hand, a single vision system provides a cost benefit, since it does not require a separate sensor-matching process and is easy to deploy. Vehicles can be detected using a single vision system or edge information [25], but this can result in many false detections.

Figure 3.14: The visualization of FCW.

Due to its outstanding performance, a deep learning-based object classification algorithm may be used to detect vehicles; however, it typically cannot be used in a low-power embedded system, since it requires considerable computation. This study offers an FCW solution based on the YOLOv6s architecture and low-cost operations that is quick and accurate on a low-power edge device running inside a vehicle, as shown in Figure 3.14.

Instead of utilizing the YOLOv6s weights pre-trained on the COCO2017 dataset, which covers 80 classes in total [26], this study presents a new road dataset based on TT100K. The motivation for developing the custom dataset is that the TT100K dataset contains various real-life traffic scenarios that are appropriate for this project but have not been exploited yet. Besides, the COCO2017 dataset contains many images that are not appropriate for traffic scenarios, as shown in Figure 3.15, which makes custom training inefficient.

The customized FCW-TT100K dataset manually labels all the objects often seen in road scenarios, with 3754 training images and 759 validation images divided into three classes, accounting for 83.2% and 16.8% of the dataset, respectively. There are 18832 labels across the training and validation sets, covering four-wheelers (cars, trucks, buses), two-wheelers (motorcycles, bicycles), and pedestrians, as shown in Figure 3.16 and Table 3.7.

Figure 3.16: Sample labels in the FCW-TT100K dataset.

Table 3.7: Classes and labels in the custom FCW-TT100K dataset.

Classes Train labels Validation labels

Since a greater resolution only benefits small objects and this custom FCW-TT100K dataset contains mostly large objects, this study utilizes the same training strategy as YOLOv6s-TSR for the FCW model, with the resolution reduced from 1280×1280 to 640×640. Training for 400 epochs takes close to five hours; Figures 3.17 and 3.18 display the training performance metrics.

The loss value drops rapidly after 20 epochs. The mAP50 of the scratch training process increased steadily until halting after the 100th epoch. Early stopping was again applied to choose the best weights for fine-tuning. The fine-tuning loss and mAP50 values are unstable due to the significant data augmentation; however, the fine-tuned weights produce good results during evaluation and inference, as demonstrated in Table 3.8 and Figure 3.21.

Figure 3.17: YOLOv6s-FCW performance metrics during training from scratch.

Figure 3.18: YOLOv6s-FCW performance metrics during fine-tuning training.

The confusion matrix in Figure 3.19, after evaluation at 640×640, indicates a very high false-positive rate on the background class, which would normally be a bad sign for the whole YOLOv6 model; however, the model distinguishes between classes admirably. The images in the TT100K dataset were captured at a high resolution of 2048×2048 by an ultra-wide-angle camera, which captures many objects and details. This is an advantage for the future of this project, when its full potential is exploited, and a disadvantage in that it requires tremendous work to label all the objects, as shown in Figure 3.20. Therefore, many of the false-positive background detections are most likely correct detections of vehicles and pedestrians that are far away, near the edges, or partially hidden and hence hard to label, but that the model can still detect.

Figure 3.19: Confusion matrix from evaluating fine-tuning FCW weights.

Figure 3.21 shows the inference results of the scratch and fine-tuned FCW weights on a random road in Vietnam. The fine-tuned weights are superior in accuracy and in the number of objects they can recognize, while the scratch weights have lower accuracy and miss many objects. The yellow box marks where the fine-tuned weights draw an improper bounding box, and the red boxes illustrate the undetected objects.

Figure 3.21: Comparison of inference results of the scratch and fine-tuned FCW weights.

Table 3.8 shows the mAP50 evaluation at 640×640 resolution. The average accuracy across all classes of the fine-tuned weights improved by 4.76% and 12.47% compared to the scratch weights and the benchmark weights, respectively.

Table 3.8: Evaluate results from scratch, fine-tune and benchmark dataset.

Classes mAP 50, Benchmark mAP 50, Scratch mAP 50, Finetune

Currently, there is no tool to directly compare the detection results of the same class across YOLO models other than observing the detection results on randomly selected dashcam footage. The comparison between the pre-trained COCO weights and this work's FCW-TT100K fine-tuned weights, shown in Figures 3.22 and 3.23, shows that the fine-tuned FCW weights have higher accuracy. The COCO dataset has almost 100,000 labels for cars alone; still, the pre-trained weights must generalize across 80 classes, and the dataset's car and person images are not meant only for traffic scenarios [26], so their performance is not superior. This suggests a promising result if the proposed model is given more well-prepared data dedicated to traffic scenarios.

Figure 3.22: Example 1: detection results from the pre-trained COCO weights and the FCW-TT100K fine-tuned weights.

Figure 3.23: Example 2: detection results from the pre-trained COCO weights and the FCW-TT100K fine-tuned weights.

It is critical to establish acceptable driving distances to avoid a forward collision. The research in [27] gives the stopping distance (SD) parameters for several road types (dry, wet, and snow) and speeds. When a driver senses a risk, as depicted in Figure 3.24, they spend some time before deciding to brake; this is known as the thinking time, t1. Once the driver decides to brake, it takes additional time to move their leg and press the brake pedal; this interval, t2, is known as the reaction time. The braking system usually does not activate instantly; rather, it takes a short time to fill up the braking system and provide the requisite brake pressure; this interval, t3, is known as the brake effectiveness time. Throughout all of these intervals, the car maintains a steady speed. When the brake becomes effective, the vehicle decelerates at a constant rate, j, until it comes to a full stop; this is known as the braking period, t4.

Figure 3.24: Stopping distance dynamics when the driver perceives a danger [27].

The braking time t4 is affected by different factors such as the braking force, tire state, tire type, and friction force. Combining the constant-speed phase with the braking phase, the stopping distance can be written approximately as

SD = V(t1 + t2 + t3) + V^2 / (2g(f ± s)),

where V is the vehicle speed, t1, t2, and t3 are the thinking, reaction, and brake effectiveness times, g = 9.81 m/s² is the gravitational constant, f is the adhesion coefficient that varies with the road type, and s is the slope of the road (added for an uphill grade, subtracted for a downhill grade). This work uses typical values of t1, t2, and t3 of 0.5 s, 0.2 s, and 0.3 s, respectively. Since the proposed ADAS vehicle computer is not connected directly to the vehicle, it also assumes a typical speed of 50 km/h for all scenarios, the common speed limit in the city [22]. Taking all the given values, the proposed stopping distance for this work is 25 m.
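To make the arithmetic explicit, the snippet below evaluates the reconstructed formula with the values stated above and an assumed adhesion coefficient of f = 0.9 (dry asphalt) on a level road; the coefficient is an illustrative assumption, and the result is simply rounded to the 25 m safe distance used in this work.

```python
G = 9.81  # gravitational constant, m/s^2

def stopping_distance(v_kmh: float, t1: float, t2: float, t3: float,
                      f: float, s: float = 0.0) -> float:
    """SD = V*(t1 + t2 + t3) + V^2 / (2g(f + s)), with V in m/s, level road by default."""
    v = v_kmh / 3.6                            # km/h -> m/s
    reaction_part = v * (t1 + t2 + t3)         # distance covered before braking acts
    braking_part = v ** 2 / (2 * G * (f + s))  # distance covered while braking
    return reaction_part + braking_part

# 50 km/h, t1/t2/t3 = 0.5/0.2/0.3 s, f = 0.9 (assumed dry asphalt), level road.
print(round(stopping_distance(50, 0.5, 0.2, 0.3, 0.9), 1))  # ~24.8 m, rounded to 25 m
```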

As shown in Figure 3.25, there is a caution zone with a yellow mask and a danger zone with an orange mask. After vehicles and pedestrians are detected, the coordinates of these objects in the image are sent to the final block for comparison. The safe distance is taken to be 25 m, and the upper boundary of the danger zone is placed at this safe distance from the vehicle. If an object's coordinates fall within the danger zone, a danger alarm with the GUI and the proper caution is triggered. The caution zone, whose upper boundary lies 5 m beyond the upper boundary of the danger zone, only displays a warning through the GUI when an object's coordinates fall within it.
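A simplified version of this zone test, using the bottom edge of each detected bounding box and zone boundaries expressed as pixel rows, might look like the sketch below; the pixel values are placeholders that must be calibrated per camera, as discussed further below.

```python
from enum import Enum

class Zone(Enum):
    SAFE = 0
    CAUTION = 1
    DANGER = 2

# Pixel rows of the zone upper boundaries for an assumed 1280x720 frame; these must be
# calibrated per camera so that they correspond to the 25 m and 30 m distances.
DANGER_TOP_Y = 520       # everything below this row is within ~25 m
CAUTION_TOP_Y = 470      # everything below this row is within ~30 m (25 m + 5 m)

def classify_object(box: tuple) -> Zone:
    """Classify a detection (x1, y1, x2, y2) by where its bottom edge touches the road."""
    bottom_y = box[3]
    if bottom_y >= DANGER_TOP_Y:
        return Zone.DANGER     # trigger alarm and GUI warning
    if bottom_y >= CAUTION_TOP_Y:
        return Zone.CAUTION    # GUI warning only
    return Zone.SAFE

print(classify_object((600, 400, 700, 560)))  # Zone.DANGER
print(classify_object((600, 380, 700, 490)))  # Zone.CAUTION
```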

Figure 3.25: Caution and danger zone.

Normally, the camera installed in a car is fixed in one place, but this project is based on numerous dashcam videos; hence, the caution and danger zones vary and must be manually adjusted before operation. In addition, the upper boundaries of the caution and danger zones can be calibrated so that the pixel height and width correlate with a stopping distance of 25 m and a lane width of 3.5 m. This customization is required for each vehicle due to differences in camera angle and position. The demonstration results of the FCW block when a detected vehicle's bounding box appears in the caution and danger zones are presented in Figure 3.26.

LANE DEPARTURE WARNING

There are two popular approaches for lane detection: standard image processing methods [28] and deep segmentation methods [29]. Deep segmentation algorithms have recently achieved considerable success in this area because of their excellent representation and learning abilities, but some significant and difficult issues remain to be addressed [30]. Lane detection is actively used as a crucial component of autonomous driving, which necessitates a very low computational cost. Furthermore, current autonomous driving solutions are frequently built from several vision- and deep learning-based applications, which usually requires a reduced computational cost for each application. For this purpose, the proposed ADAS adopts a state-of-the-art lane detection method known as Ultra Fast Lane Detection (UFLD) [30], which is exceptionally fast, is based on a row-anchor approach with ResNet-18 as the backbone, as shown in Figure 3.27, and solves the no-visual-clue problem, as shown in Figure 3.28.

Figure 3.27: Illustration of selecting the left and right lanes. In the right part, selecting a row is shown in detail. Row anchors are the predefined row locations, and the formulation is defined as horizontally selecting on each row anchor. A background gridding cell is introduced on the right of the image to indicate that there is no lane in this row [30].

Figure 3.29 presents the comparison between UFLD and conventional segmentation methods. Assume the image size is H × W, the number of predefined row anchors is h, the number of gridding cells is w, and there are C lanes. In general, the number of predefined row anchors and the gridding size are significantly smaller than the image size, i.e., h ≪ H and w ≪ W. Hence, the original segmentation formulation must perform H × W × (C + 1) classifications, whereas UFLD only has to solve C × h × (w + 1) classification problems. As a result, the computation cost of the UFLD method is lowered significantly. Besides, the UFLD method uses global features as input, which have a larger receptive field than segmentation; context information and messages from other locations of the image can thus be utilized to address the no-visual-clue problem [30].
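For a sense of scale, the snippet below compares the two classification counts using values close to those reported for UFLD (C = 4 lanes, h = 56 row anchors, w = 100 gridding cells, 288×800 input); these numbers are illustrative rather than the exact configuration used in this work.

```python
# Classification counts for segmentation vs. the UFLD row-anchor formulation.
H, W = 288, 800          # input resolution (UFLD's typical training size)
C, h, w = 4, 56, 100     # lanes, row anchors, gridding cells (illustrative values)

segmentation = H * W * (C + 1)   # every pixel classified into C lanes + background
ufld = C * h * (w + 1)           # one (w+1)-way choice per lane per row anchor

print(segmentation, ufld, round(segmentation / ufld))  # 1152000 22624 ~51x fewer
```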

Figure 3.28: Illustration of no-visual-clue problem in lane detection Different lanes are marked with different colors Most challenging scenarios are severely occluded or distorted with various lighting conditions, resulting in little or no visual clues of lanes that can be used for lane detection [30].

Figure 3.29: UFLD method selects locations (grids) on rows, while segmentation classifies every pixel The dimensions used for classifying are also marked in red [30].

The LDW block assumes that the camera is mounted at the center of the windshield and then processes the lanes detected by UFLD. It considers the bottom positions of the left and right lanes, compares them to the center of the image frame, and uses prior knowledge of the width between the left and right lanes (e.g., 3.5 m in Vietnam) to estimate the real-world distance that corresponds to a pixel.

Figure 3.30: Illustration of lane detected result from UFLD.

As shown in Figure 3.21, the central axis x represents the point of view. Its value changes depending on the simulation footage or the actual position of the camera mounted on the vehicle. The bottom x-coordinates of the left and right lanes are chosen to calculate the off-center distance from the central axis since they are the most stable results returned by UFLD. The warning flag is raised if the off-center value exceeds 0.6 m. The off-center value, as displayed in Figure 3.31, is calculated following Algorithm 1:

Algorithm 1: Calculate the off-center value

Input: x-axis coordinates of the bottom points of the left and right main lanes

Output: The off-center value

x_left ← Left main lane bottom x-axis coordinate
x_right ← Right main lane bottom x-axis coordinate
x_view ← Central axis x-coordinate (point of view)
x_center ← (x_left + x_right) / 2
scale ← 3.5 / (x_right − x_left)
off_center ← |x_view − x_center| × scale, the distance away from the central x-axis

Figure 3.31: Example of lane detection and off-center value.
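One possible realization of Algorithm 1 in Python is sketched below; the function and variable names, as well as the example pixel values, are illustrative.

```python
def off_center_value(left_x, right_x, camera_center_x, lane_width_m=3.5):
    """Estimate how far the camera (vehicle) is from the lane center, in meters.

    left_x, right_x: bottom x-coordinates (pixels) of the left and right main
                     lanes returned by UFLD.
    camera_center_x: x-coordinate of the central axis (depends on camera mounting).
    lane_width_m:    assumed physical lane width (3.5 m in Vietnam).
    Returns a signed offset; a negative value means the camera sits left of the
    lane center, and the warning uses the magnitude of this value.
    """
    lane_center_px = (left_x + right_x) / 2.0
    meters_per_pixel = lane_width_m / float(right_x - left_x)
    return (camera_center_x - lane_center_px) * meters_per_pixel

# Example: raise the warning flag when the magnitude exceeds 0.6 m.
offset = off_center_value(left_x=420, right_x=860, camera_center_x=640)
lane_departure = abs(offset) > 0.6
```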

GRAPHIC USER INTERFACE

The fast expansion of the automotive industry has prompted a new revolution in vehicle design. The vehicle's graphic user interface has swiftly become one of the most useful features automakers utilize to personalize and enhance the user experience. As a result, this study proposes a GUI, built with Qt Creator, that interacts effectively with the ADAS applications.

Figure 3.32: The GUI in Tesla model Y.

The interface of this study is inspired by Tesla's elegant GUI, as demonstrated in Figure 3.32. The proposed GUI is divided into three sections: the dashboard, inference, and notification pages. The notification page provides driver assistance information from the ADAS applications and a button to switch between the inference and dashboard pages. As illustrated in Figure 3.33, the dashboard page includes various buttons representing the car's features, but only three of them are functional and activate the applications: FCW, TSR, and LDW. The inference page contains buttons for playing demo videos and a screen that loads the real-time inference result from the Python backend. The remaining graphic elements in the GUI are largely aesthetic or provide credits.

Figure 3.33: The dashboard page of GUI.

Figure 3.34: The inference page of GUI.
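To illustrate the toggle logic behind the FCW, TSR, and LDW buttons, a minimal PyQt5 sketch is given below. The actual GUI of this work is designed in Qt Creator, so the real widgets, signals, and styling differ; the class and attribute names here are assumptions.

```python
# Minimal PyQt5 sketch of the FCW/TSR/LDW toggle logic (assumed binding and names).
import sys
from PyQt5.QtWidgets import QApplication, QLabel, QPushButton, QVBoxLayout, QWidget

class AdasPanel(QWidget):
    def __init__(self):
        super().__init__()
        self.states = {"FCW": False, "TSR": False, "LDW": False}
        layout = QVBoxLayout(self)
        self.status = QLabel("Active: none")
        layout.addWidget(self.status)
        for name in self.states:
            button = QPushButton(name)
            button.setCheckable(True)
            # Each toggle flips the flag read by the corresponding inference thread.
            button.toggled.connect(lambda on, n=name: self.toggle(n, on))
            layout.addWidget(button)

    def toggle(self, name, on):
        self.states[name] = on
        active = [n for n, s in self.states.items() if s] or ["none"]
        self.status.setText("Active: " + ", ".join(active))

if __name__ == "__main__":
    app = QApplication(sys.argv)
    panel = AdasPanel()
    panel.show()
    sys.exit(app.exec_())
```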

ELECTRICAL/ELECTRONIC ARCHITECTURE

The introduction of the ECU into the automobile industry has advanced vehicle electrification and mechatronics. The ECU's functions have evolved from managing engine operation to controlling the chassis, electronic components, and in-car entertainment and networking devices. Currently, each vehicle feature is regulated by one or more ECUs. The number of electronic controllers has grown dramatically in recent years with the increasing demand for fuel saving, safety, comfort, and entertainment. A Level 2 premium car nowadays contains more than 100 ECUs.

The ECU is built around a microcontroller unit (MCU) and an embedded system. The embedded system is a microcomputer, while the MCU is mostly used for control rather than computation. As a result, a single ECU can only perform control functions such as engine control, battery management, and motor control, not data-intensive computation. The toughest challenge for future vehicle development will be the increasing demand for data processing and computing speed, whether from intelligent connectivity or autonomous driving technology. Developing driver assistance technologies, in particular, generates complex logical processes and unstructured data processing scenarios. The computational demand of ADAS software has already reached 10 TOPS (Tera Operations Per Second), and the computing demand of autonomous driving software is predicted to approach 100 TOPS, which the existing computing capacity of microcomputers cannot handle.

This work therefore requires a high-performance vehicle computer, the NVIDIA Jetson AGX Orin, as shown in Figure 3.35, to handle the three essential ADAS applications and the graphic user interface. It contains a 2048-core NVIDIA Ampere architecture GPU, a 12-core ARM Cortex-A78AE 64-bit CPU, and 32 GB of LPDDR5 RAM, and at a maximum power of 50 W it delivers up to 275 TOPS while keeping power efficiency in a small footprint with high-speed interfaces to support deep learning research [31]. In addition, a low-cost camera is used to capture images with a resolution of 1280×720 and transfer them to the vehicle computer through the USB port. The output of the ADAS software is displayed on a 1920×1080 screen through the DisplayPort interface.

Figure 3.35: Connection diagram of the proposed ADAS.
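Following the connection diagram in Figure 3.35, the camera input stage can be reduced to the OpenCV sketch below; the device index and the preview window are illustrative, since in the full system frames are handed to the application threads instead of being displayed directly.

```python
import cv2

# Open the USB camera at 1280x720, matching the input resolution used in this work.
cap = cv2.VideoCapture(0)  # device index depends on the installation
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Here the frame would be dispatched to the FCW/TSR/LDW threads.
    cv2.imshow("camera input", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```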

This project has to cope with several multithreading challenges for the ADAS system to achieve real-time performance on a low-power edge device in a vehicle while preserving GUI responsiveness. Thus, numerous evaluations based on sequential and concurrent execution have been carried out. According to the data presented in Table 3.9, concurrent execution outperforms sequential execution by 156.83% in begin-to-end time (the time elapsed between the application threads receiving the input frame and the last block rendering the output to the GUI). Currently, there is a lack of software support for precise statistics of hardware resources on embedded devices; however, by observing the system's execution and taking peak values, the concurrent execution uses 61% of the GPU resources. As a result, the system satisfies its real-time requirement with an FPS above 60, while the most power-consuming component in this vehicle computer, the GPU, draws only 12.8 W.

Table 3.9: Comparison of performance and hardware utilization between sequential and concurrent programming.

Model Sequential (FPS) Concurrent (FPS)
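The concurrent layout evaluated in Table 3.9 can be summarized by the following sketch, in which each application runs on its own thread and exchanges frames and results through queues; the thread, queue, and callable names are illustrative and do not reproduce the project's actual source code.

```python
# Sketch of the concurrent execution compared in Table 3.9 (names are illustrative).
import queue
import threading
import time

frame_queues = {name: queue.Queue(maxsize=1) for name in ("FCW", "TSR", "LDW")}
result_queue = queue.Queue()

# Placeholder inference callables; in the real system these wrap the deployed models.
models = {"FCW": lambda f: [], "TSR": lambda f: [], "LDW": lambda f: []}

def dispatch(frame):
    """Fan the newest camera frame out to every application thread."""
    for q in frame_queues.values():
        try:
            q.put_nowait(frame)
        except queue.Full:
            pass  # that application is still busy; skip this frame for it

def worker(name, infer):
    """Run one ADAS application on its own thread and report timed results."""
    while True:
        frame = frame_queues[name].get()
        start = time.perf_counter()
        output = infer(frame)
        result_queue.put((name, output, time.perf_counter() - start))

for name, infer in models.items():
    threading.Thread(target=worker, args=(name, infer), daemon=True).start()

# The capture loop calls dispatch(frame) for every new frame, and the GUI thread
# drains result_queue to render detections and notifications without blocking
# inference; GPU kernels release the Python GIL, so the workers genuinely overlap.
```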

RESULTS

SIMULATION RESULTS

The proposed ADAS selected state-of-the-art object detection and lane detection models by analyzing scientific research and benchmarking the candidate models on a customized dataset. Furthermore, this study investigates how to deploy the selected models most efficiently by applying dedicated training strategies and concurrent programming, enabling three heavy computer vision and deep learning tasks to run swiftly with a responsive GUI. The simulation results when the ADAS software runs on 1280×720 dashcam footage, as shown in Figure 4.1, are given in the following sections.

Figure 4.1: Example frames from dashcam footage to simulate the ADAS software.

4.1.1 Traffic Sign Recognition Results

The YOLOv6s FCW and TSR models accurately detect and recognize vehicles and signs in most frames. The speed limit 60 sign was detected with great precision (92%) in Figure 4.2 but was not present in the proposed region (light blue boxes); as a result, the indication is not displayed on the GUI. In Figure 4.3, the detected speed limit sign is inside the proposed region, so the traffic sign is displayed to the driver through the GUI.

Figure 4.2: 1 st Sample result of the traffic sign recognition during inference.

Figure 4.3: 2 nd Sample result of the traffic sign recognition during inference.
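The region test that decides whether a recognized sign is forwarded to the GUI can be expressed as a simple containment check, sketched below with illustrative region coordinates.

```python
# Hypothetical proposed region (the light blue boxes in Figures 4.2 and 4.3),
# given as (x1, y1, x2, y2) in pixels; actual values depend on calibration.
PROPOSED_REGION = (900, 150, 1280, 450)

def sign_in_region(box, region=PROPOSED_REGION):
    """Return True if the center of a detected sign's bounding box lies in the region."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    rx1, ry1, rx2, ry2 = region
    return rx1 <= cx <= rx2 and ry1 <= cy <= ry2
```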

4.1.2 Forward Collision Warning Results

The YOLOv6s-FCW model detects the vehicle in Figure 4.4 with high precision (95%), and the vehicle's coordinates in the frame were inside the caution zone; therefore, the warning was shown in the inference result frame and the notification "collision warning" on the GUI. In Figure 4.5, the detected vehicle was within the danger zone; therefore, the danger alert was raised and shown in the inference result frame and the notification "danger ahead" on the GUI.

Figure 4.4: 1 st Sample result of the forward collision warning during inference.

Figure 4.5: 2 nd Sample result of the forward collision warning during inference.

4.1.3 Lane Departure Warning Results

The UFLD accurately detects the lane when the lane markers are visible. As shown in Figure 4.6, there is no lane departure warning because the vehicle is heading straight. In Figure 4.7, taking the upper boundary (yellow line) of the caution zone as an indicator for easily observing the center of the lane (the LDW decision is still based on the off-center value), the lane departure notification "driving off lane" appears on the GUI because the car is leaning to the right.

Figure 4.6: 1 st Sample result of the lane departure warning.

EXPERIMENTAL RESULTS

This experiment adjusted the input to capture from a camera instead of taking direct input from dashcam footage in order to examine environments close to real traffic scenarios. Figure 4.8 depicts the proposed ADAS setup, which includes (1) the TV used to display traffic scenarios in Vietnam, (2) a camera to capture images and transfer them to the Jetson Orin vehicle computer powered by an AC-DC adapter, (3) the monitor to display the GUI, and (4) the keyboard and mouse for user interaction; (3) and (4) are subject to change to a touchscreen in the future for greater convenience.

Figure 4.8: Two angles showing the experimental setup.

The safe result of the system displayed in Figure 4.9 demonstrates that practically all vehicles are detected with extremely high accuracy, and the lane detection system detects and displays lanes correctly while the vehicle remains visibly in the center of the lane. In contrast, Figure 4.10 shows that the vehicle tends to drift to the left, and a caution is displayed on the vehicle's GUI.

Figure 4.9: The safe result with multiple detected objects of the experiment.

Figures 4.11 and 4.12 show that the FCW and TSR applications operate with the camera input similarly to when dashcam footage is used as direct input. Thus, we may deduce that the proposed ADAS would perform reliably when deployed to a real vehicle in a controlled environment.

Figure 4.11: The traffic sign recognition result of the experiment.

Figure 4.12: The forward collision warning result of the experiment.

CONCLUSION AND FUTURE WORK

CONCLUSION

In conclusion, this work proposes three fundamental ADAS applications using the cutting-edge object detection model YOLOv6 and the lane detection model UFLD, based on computer vision and deep learning, together with a user-friendly graphic user interface. Furthermore, this study improves YOLOv6 performance by fine-tuning during training and converting the models to FP16 for inference, achieving high speed while maintaining accuracy. In addition, this work provides a detailed benchmark of five YOLO models on a custom dataset targeted at traffic scenarios with over 18,832 labeled objects. The object detection models trained and fine-tuned on the custom TSR and FCW datasets achieve mAP50 of 88.6% and 82.1%, respectively. The ADAS system and GUI can operate in real time at 71 frames per second while utilizing just 61% of the GPU's capacity. The simulation and experimental results show that the system has successfully satisfied its goals, with important driver assistance features such as warnings and instructions working intuitively, precisely, and elegantly through the GUI. The initial accomplishment of this ADAS software will open the path for future software based on the vehicle E/E architecture to progressively advance toward autonomous cars.

FUTURE WORK

The following improvements will be made in the future to make this work more practical and to meet OEM expectations. Firstly, the FCW application can calculate and adapt the collision warning distance and perform automatic braking when combined with dedicated sensors and ECUs on the vehicle. Secondly, the TSR and FCW applications can improve object detection performance on a greater variety of signs, vehicles, and pedestrians by preparing data for more diverse traffic scenarios. Thirdly, UFLD can be enhanced further by replacing the ResNet18 backbone with a better model, such as RepVGG-A0 [13]. Finally, the models can be converted to INT8 precision with TensorRT for even faster inference speed [19].

[1] S. Singh, "Critical reasons for crashes investigated in the National Motor Vehicle Crash Causation Survey", US National Highway Traffic Safety Administration, DOT HS 812 506, Washington DC, USA, pp. 2-2, March 2018.

[2] L Yue, M Abdel-Aty, Y Wu, and L Wang, “Assessment of the safety benefits of vehicles’ advanced driver assistance, connectivity and low-level automation systems”, Accident Anal Prevention, vol 117, pp 55-64, Aug 2018.

[3] M. Hasenjäger, M. Heckmann and H. Wersing, "A Survey of Personalization for Advanced Driver Assistance Systems," in IEEE Transactions on Intelligent Vehicles, vol. 5, no. 2, pp. 335-344, June 2020, doi: 10.1109/TIV.2019.2955910.

[4] Dumoulin, Vincent, and Visin, Francesco “A guide to convolution arithmetic for deep learning.” arXiv, 2016, https://doi.org/10.48550/arXiv.1603.07285.

[5] Redmon, Joseph, and Farhadi, Ali. "YOLOv3: An Incremental Improvement." arXiv, 2018, https://doi.org/10.48550/arXiv.1804.02767.

[6] Bochkovskiy, Alexey, et al "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv, 2020, https://doi.org/10.48550/arXiv.2004.10934.

[7] Jocher Glenn. "YOLOv5 release v6.1" (2022). [Online]. Available: https://github.com/ultralytics/yolov5/releases/tag/v6.1

[8] Li, Chuyi, et al. "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications." arXiv, 2022, https://doi.org/10.48550/arXiv.2209.02976.

[9] Wang, Chien, et al “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.” arXiv, 2022, https://doi.org/10.48550/arXiv.2207.02696.

[10] Ge, Zheng, et al “YOLOX: Exceeding YOLO Series in 2021.” arXiv, 2021, https:// doi.org/10.48550/arXiv.2107.08430.

[11] Wu, Xiongwei, et al “Recent Advances in Deep Learning for Object Detection.” arXiv, 2019, https://doi.org/10.48550/arXiv.1908.03673.

[12] Carranza-García, M.; Torres-Mateo, J.; Lara-Benítez, P.; García-Gutiérrez, J On the Performance of One-Stage and Two-Stage Object Detectors in Autonomous Vehicles Using Camera Data Remote Sens 2021, 13, 89 https://doi.org/10.3390/rs13010089.

[13] Ding, Xiaohan, et al. "RepVGG: Making VGG-style ConvNets Great Again." arXiv, 2021, https://doi.org/10.48550/arXiv.2101.03697.

[14] Lin, Tsung, et al “Feature Pyramid Networks for Object Detection.” arXiv, 2016, https://doi.org/10.48550/arXiv.1612.03144.

[15] Zhang, Can, et al “PAN: Towards Fast Action Recognition via Learning Persistence of Appearance.” arXiv, 2020, https://doi.org/10.48550/arXiv.2008.03462.

[16] Deloitte, “Autonomous Driving” (2019) [Online] Available:

[17] National Highway Traffic Safety Administration, "Driver Assistance Technologies" (2022). [Online]. Available: https://www.nhtsa.gov/equipment/driver-assistance-technologies

[18] Z Zhu, D Liang, S Zhang, X Huang, B Li, and S Hu, “Traffic-Sign Detection and Classification in the Wild,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp 2110-2118, doi: 10.1109/CVPR.2016.232.

[19] NVIDIA TensorRT (2022). [Online]. Available: https://developer.nvidia.com/tensorrt

[20] Takaki, M.; Fujiyoshi, H Traffic Sign Recognition Using SIFT Features IEEJ Trans Electron Inf Syst 2009, 129, 824–831.

[21] Dalal, N.; Triggs, B Histograms of oriented gradients for human detection In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 10, pp. 886–893.

[22] Department of Transportation, "Circular Number: 91/2015/TT-BGTVT" (2015). [Online]. Available: Circular 91/2015/TT-BGTVT on speed and following distance of motor vehicles and special-purpose vehicles in road traffic

[23] Wei, Pan, et al “LiDAR and Camera Detection Fusion in a Real-Time Industrial Multi-Sensor Collision Avoidance System”, arXiv, 2018 https://doi.org/10.48550/arXiv.1807.10573.

[24] Ziebinski, A.; Cupek, R.; Erdogan, H.; Waechter, S. A Survey of ADAS Technologies for the Future Perspective of Sensor Fusion. In Computational Collective Intelligence; Nguyen, N.T., Iliadis, L., Manolopoulos, Y., Trawiński, B., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 135–146.

[25] Nur, SA; Ibrahim, M.; Ali, N.; Nur, FIY Vehicle detection based on underneath vehicle shadow using edge features In Proceedings of the 2016 6th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 25–27 November 2016; pp 407–412.

[26] Lin, Tsung, et al “Microsoft COCO: Common Objects in Context.” arXiv, 2014, https://doi.org/10.48550/arXiv.1405.0312.

[27] Elsagheer Mohamed, S.A.; Alshalfan, K.A.; Al-Hagery, M.A.; Ben Othman, M.T. Safe Driving Distance, and Speed for Collision Avoidance in Connected Vehicles Sensors

[28] Aly, Mohamed “Real-time Detection of Lane Markers in Urban Streets.” arXiv,

[29] Neven, D., De Brabandere, B., Georgoulis, S., Proesmans, M., Van Gool, L.: Towards end-to-end lane detection: an instance segmentation approach In: Proceedings of the IEEE Intelligent Vehicles Symposium pp 286–291 (2018)

[30] Qin, Zequn, et al “Ultra Fast Structure-aware Deep Lane Detection.” arXiv, 2020, https://doi.org/10.48550/arXiv.2004.11757.

[31] NVIDIA, "Jetson AGX Orin Developer Kit specification" (2022). [Online]. Available: Jetson AGX Orin for Advanced Robotics | NVIDIA

Figure 1: Plagiarism check result by Turnitin.
