VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER ENGINEERING DEPARTMENT
PHAN TRAN QUOC DAT
VÕ QUOC HUY
GRADUATION THESIS
RESEARCH AND IMPLEMENTATION OF DETECTING
AND TRACKING SYSTEM OF VEHICLE ON
NATIONAL WAYS
ENGINEER OF COMPUTER ENGINEERING
HO CHI MINH CITY, 2021
RESEARCH AND IMPLEMENTATION OF A SYSTEM FOR DETECTING
AND TRACKING VEHICLE SPEED ON NATIONAL HIGHWAYS
INSTRUCTOR
PhD LAM DUC KHAI
THESIS DEFENSE COMMITTEE
The thesis defense committee was established under Decision No.
70/QD-DQCNTT dated January 27th, 2021 of the Rector of the University of
Information Technology
We would like to express our gratitude to Ph.D. Lam Duc Khai and Ph.D. Nguyen Minh Son for their dedicated answers whenever we faced complicated questions, their useful advice whenever we ran into troublesome tasks, and their direction whenever we lost track of the situation.
We also wish to express our appreciation to all the instructors of the UIT Computer Engineering Department for every lesson, which provided us with knowledge and nurtured us to become better people for society.
Finally, we would like to thank all the people who constantly assisted us in carrying out the research, encouraged us in hard times, and supported us both financially and mentally. This research could not have been done without your enthusiastic help.
2.2 Problem and direction
Chapter 3 THEORY FOUNDATION
3.1 Review of object detection models
3.2 Detail of Yolo Model for object detection
3.2.1 Introduction
3.2.2 The reason for choosing Yolo
3.2.3 How it works
3.2.4 The result
3.3 Dataset and Training
3.3.1 Dataset and why choosing
3.3.2 Raw image
3.3.3 Processing raw images
3.3.5 Evaluate the training result
3.4 Review of object tracking algorithms
3.4.1 Meanshift
3.4.2 Particle filter
3.4.3 Kalman
3.5 SORT algorithm
3.5.1 Introduction
3.5.2 Processing flow of the SORT
3.6 Speed measuring
Chapter 4 PROJECT IMPLEMENTATION AND RESULT EVALUATION
4.1 Project implementation
4.1.1 Hardware
4.2 Result evaluation
4.2.1 Evaluating model based on detecting distance
4.2.2 Evaluating model based on detecting accuracy
4.2.3 Evaluating model based on tracking vehicle
Chapter 5 CONCLUSION AND FUTURE WORK
5.1 Conclusion
5.2 Future work
REFERENCES
LIST OF FIGURES
Figure 1-1: Traffic jam in Vietnam
Figure 1-2: Using internet of things in public transportation
Figure 1-3: Idea design
Figure 2-1: Intelligent transportation system:
Figure 3-1: Image processing in R-CNN
Figure 3-2: Image processing in Fast R-CNN
Figure 3-3: Left: Region proposal network (RPN). Right: Samples detection
Figure 3-4: Image processing time of Faster R-CNN compared to previous models
Figure 3-5: SSD mechanism in training and detecting
Figure 3-6: YOLO workflow
Figure 3-7: YOLO performance
Figure 3-8: Darknet framework loads 106 layers for every command
Figure 3-9: Darknet-53
Figure 3-10: Image by Ayoosh Kathuria
Figure 3-11: Image by Valentyn Sichkar(a)
Figure 3-12: Image by Valentyn Sichkar(b)
Figure 3-13: Total bounding boxes of three different scales
Figure 3-14: Calculate bounding box by using the anchor
Figure 3-15: Equation for objectness score - Image by Valentyn Sichkar
Figure 3-16: Image of objects detected in Ho Chi Minh City
Figure 3-17: The result displayed on terminal
Figure 3-18: Types of vehicle used to train model
Figure 3-19: Instances label
Figure 3-20: Weather types
Figure 3-21: Annotation format
Figure 3-22: Interface of labeling
Figure 3-23: Services from Colab
Figure 3-24: Training procedure
Figure 3-25: Evaluation chart after training
Figure 3-26: Detailed result after training
Figure 3-27: Meanshift illustration
Figure 3-28: Steps in the operation
Figure 3-29: Content of Kalman
Figure 3-30: Processing flow of the SORT
Figure 3-31: Estimate the speed of vehicles
Figure 4-1: Jetson Nano kit
Figure 4-2: OpenCV
Figure 4-3: Distance at sunny
Figure 4-4: Distance at cloudy
Figure 4-5: Distance at near night
Figure 4-6: Distance at rain
Figure 4-7: Distance at rainy night
Figure 4-8: Video result
Figure 4-9: Detect object partially obscured
LIST OF TABLES
Table 1: Testing results of many models in similar conditions
Table 2: Number of imported automobiles in November 2020
Table 3: Specification of hardware on Colab
Table 4: Compare the Tracking methods
Table 5: Performance of models on different hardware
Table 6: Jetson Nano kit detail
Table 7: Camera detail
Table 8: Distance at sunny
Table 9: Distance at cloudy
Table 10: Distance at near night
Table 11: Distance at rain
Table 12: Distance at rainy night
Table 13: Video Specifications
Table 14: Number of vehicles in practice
Table 15: Number of vehicles compare
Table 16: Compare the results
We carried out this research with the hope of providing a device that can help improve the quality of travel for people on roads in Vietnam.
Thanks to the development of deep learning and computer vision, the main functions of the device are detecting and tracking popular vehicles on national roads and, in addition, estimating their distance and speed at the same time.
The system will work as follows:
- The system will detect various kinds of vehicles using Yolo version 3. Input can be pre-recorded videos or a stream from a camera attached to the hardware.
- The detected objects will be tracked and their speed estimated by applying tracking algorithms.
According to the expected result, the performance of the detecting process can reach about 80%, and the ability to calculate speed is added.
Chapter 1 CURRENT PROBLEM AND POTENTIAL SOLUTION
1.1 Problem statement
Our country is developing in many fields, most noticeably in human resources, science, and technology. Throughout this process, there is no other solution but to be determined and deal with the challenges that arise along the way.
Figure 1-1: Traffic jam in Vietnam
A large-scale problem worth mentioning is national transportation. With 128 national roads totaling 17,530 kilometers in length and a tremendous number of vehicles that increases day by day, one can easily imagine how hard it is for the Ministry of Transport to manage such complex traffic.
Fortunately, advances in information technology have taken effect. By applying deep learning, particularly detection and tracking methods, we can assist in controlling traffic more adequately. A good detecting and tracking system can not only support traffic management but also raise citizens' awareness of safety when participating in traffic.
1.2 The idea
From primitive forms like walking or riding bicycles on trails, transportation has progressed to four-wheeled vehicles that appear regularly on asphalt roads, which are extended from time to time. Besides positive aspects such as the improvement of everyday quality of life, people have to confront troubles arising from that growth: more types of vehicles mean more rules need to be established to keep all of them under control; people get frustrated when they are stuck for hours in a traffic jam after a tiring day of work; and some lack a sense of responsibility and safety for themselves and others.
Figure 1-3: Idea design
With the application of deep learning to detect and track vehicles, the amount of work can be reduced. With a system that analyzes the number of vehicles in a specific time frame, we can give citizens announcements about the situation on the road so they can avoid driving on congested roads. For the traffic police, having accurate information will put them in the right location to direct traffic, or, in the office, they can detect vehicles that break the rules. Besides solving these external issues, the project may also help fix internal problems. People who know about the system will have to behave themselves if they do not want to be punished; gradually, they will form civilized habits and care more for their own safety and that of others at the same time.
1.3 Methodology
For this project, we propose three main steps that need to be completed:
— The first is to understand the basic concepts of deep learning. A foundation is always an important element when approaching new knowledge. Our group gathered the necessary information from various reliable sources: learning websites on the internet, articles from previous research, and the experience of qualified people.
— The second is the implementation of the project. This is the main stage of the project as it carries multiple tasks to be accomplished. Because the hardware plays a crucial role in deciding the performance of the system, the choice needs to consider both functionality and price. After an overall view of the market, the NVIDIA Jetson Nano kit seems to be a reasonable pick because of its balance of price and performance. After setting up the hardware with every requirement, the YOLO algorithm, a state-of-the-art object detector, was installed and a demo was run to see the result. The next step is collecting the dataset for our purpose. The dataset is considered one of the most basic factors in evaluating whether the model can be applied in reality. A richer dataset with precise annotation gives the model the ability to learn the features of the objects we want to detect more deeply; in other words, the result will gain better accuracy for this project. This dataset is intended to be from Vietnam, for Vietnam: most of the training data will be collected in our nation so that the model can get used to the environment of this country.
— Finally, the last target is training and testing the model. Theory usually differs from reality, so testing can never be ignored. After all, this project is meant to be applied in the real world. After testing the functionality and packing the vital components, we bring the system to a suitable environment for validation.
Chapter 2 RELATED WORK
2.1 Previous projects
2.1.1 Domestic
Notable domestic projects include:
— The ITS (intelligent transportation system) for Vietnam Expressway has been publicly available since March 2017. This project is the fruit of the cooperation of a consortium of Japanese companies led by Toshiba. The person in charge claimed that the system includes cutting-edge information processing technology for analyzing vehicles on the road, which reduces disruption along the network and inconvenience for its users.
Figure 2-1: Intelligent transportation system
— The Da Nang smart camera system has been carried out as a foundation for creating a smart city in the future. The system can run 24 hours a day, detect two kinds of vehicles (car and motorbike), and report traffic violations. The scale of the project is at national level: nearly 50 locations have smart cameras installed, and the total number of traffic cameras in the city is 143, with 125 surveillance cameras, 9 speed testing cameras and 9 observation cameras.
2.1.2 Foreign
— A Cascade of Boosted Generative and Discriminative Classifiers for Vehicle Detection, conducted by Pablo Negri, Xavier Clady, Shehzad Muhammad Hanif, and Lionel Prevost, presented a cascade of boosted classifiers for vehicle detection in road scene images. The project studied two main features: Haar-like features and HoG (histogram of oriented gradients) features, where Haar-like features are used to construct discriminative weak classifiers while HoG features are used to construct generative weak classifiers. The fusion detector combines the advantages of both Haar and HoG detectors and achieves a high correct detection rate of 94% and a low false alarm rate of 0.0003 per image. The result was evaluated on a 2.2 GHz processor and was not tested in practice.
— Siemens Mobility has developed technology to assist traffic management tasks. One example is Sitraffic Sivicam, which is easy to attach to available infrastructure elements and is able to detect vehicles at intersections. Their recent project is a cooperation with Australian authorities to improve the quality of highway transportation. The main use of the newly introduced system is the ability to exchange information on traffic disruptions in real time so that signals can be sent back to be processed. This can help vehicles such as emergency vehicles or public transport move more efficiently.
2.2 Problem and direction
Most of the projects above have been tested and proven in use, as they are provided by brands that have history and experience. The use of deep learning in image processing has proven useful in surveillance and in capturing the situation on streets, which helps in making good decisions about solving congestion and traffic violations.
Although the systems mentioned can operate smoothly, they are still national-level projects. At such a scale, the cost of equipping and maintaining them is extremely high, which reduces the coverage of the systems, not to mention the computing and network resources required.
This project proposes to build a system with reduced resource usage. Its main functions are:
— Receiving the data from the camera.
— Detecting the typical kinds of vehicles on Vietnam's national roads.
— Counting the number of vehicles.
— Estimating the speed of vehicles.
With the information extracted by the system, we hope this project can help resolve many current traffic issues and improve the satisfaction of citizens when traveling.
Chapter 3 THEORY FOUNDATION
3.1 Review of object detection models
Object detection is the task in which a model locates an object in an
image and draws a bounding box around it. Some common model
architectures for object detection are R-CNN, SSD, and YOLO, which are reviewed below.
3.1.1 R-CNN
R-CNN stands for regions with CNN features. The model is named after its
procedure: it extracts region proposals from the input image, warps each one to a
size compatible with the convolutional neural network, then computes the features for
each proposal. After that, the regions are classified by linear Support Vector Machines (SVMs).
Figure 3-1: Image processing in R-CNN
This model achieves a mean average precision of 53.7% on PASCAL VOC 2010.
As an early proposed model, it has disadvantages:
— Training is a multi-stage pipeline.
— Training is expensive in space and time as VGG16 is used as the backbone.
— Object detection is slow, as a ConvNet forward pass is performed for each object proposal.
3.1.2 Fast R-CNN
Fast R-CNN is an improvement that fixes the disadvantages of R-CNN. The network
takes an entire image and a set of object proposals as input. The image goes through
several convolutional and max pooling layers to produce a convolutional feature
map. Each RoI (region of interest) is pooled into a fixed-size feature map and
mapped to a feature vector by fully connected layers. The network then outputs two
vectors per RoI: softmax probabilities and per-class bounding box regression
offsets.
Some advantages of Fast R-CNN compared to R-CNN:
— Training is single-stage, using a multi-task loss.
— Training can update all network layers.
— No disk storage is needed for feature caching.
Figure 3-2: Image processing in Fast R-CNN
Faster R-CNN gives a quicker result owing to the replacement of
selective search, which is used by both R-CNN and Fast R-CNN, with an RPN to
identify the region proposals.
3.1.3 Faster R-CNN
Faster R-CNN comprises two modules: a deep convolutional network for
proposing regions and a Fast R-CNN as the detector. The region proposal network
(RPN) takes an image as input and outputs rectangular object proposals. Each
rectangle has an objectness score.
Figure 3-3: Left: Region proposal network (RPN). Right: Samples detection
Figure 3-4: Image processing time of Faster R-CNN compared to previous models
3.1.4 SSD
SSD, standing for single shot multibox detector, is a single-shot detector designed
to use a one-stage deep neural network for object detection in real time. SSD improves running speed compared to two-stage detectors like Faster R-CNN by
eliminating the proposal network.
SSD comprises two main parts: extraction of feature maps, and convolutional filters to detect objects. The model takes an image and ground truth boxes as inputs, then evaluates default boxes at each location in several feature maps with different scales (8 x 8, 4 x 4). The default boxes are chosen based on how well they match the ground truth boxes. For the chosen default boxes, the model then predicts both coordinates and confidences for all object classes.
(a) Image with GT boxes (b) 8 x 8 feature map (c) 4 x 4 feature map
Figure 3-5: SSD mechanism in training and detecting
3.1.5 YOLO
YOLO stands for You Only Look Once. Unlike the region-based models above, which use only parts of an image to detect objects, YOLO looks at the image as a whole to find out where an object is.
The input image is divided into an S x S grid, and each cell is responsible for predicting one class. Convolutional layers are used to extract features, which are then fed to fully connected layers to predict the output, including coordinates and class probabilities.
Due to the spatial constraints of the algorithm, YOLO struggles with small objects within the image.
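To make the grid idea concrete, the following is a minimal sketch of how a box centre can be mapped to the cell of an S x S grid that would be responsible for it. The function name `responsible_cell`, the grid size S = 7 and the 448 x 448 image size are illustrative assumptions, not values taken from this thesis.

```python
def responsible_cell(x_center, y_center, image_w, image_h, s):
    """Return (row, col) of the S x S grid cell that contains the box centre.

    Hypothetical helper for illustration: coordinates are in pixels and the
    grid is laid uniformly over the whole image.
    """
    # Scale the centre into grid units, then clamp so that a centre lying
    # exactly on the right or bottom border still maps to the last cell.
    col = min(int(x_center / image_w * s), s - 1)
    row = min(int(y_center / image_h * s), s - 1)
    return row, col


# Illustrative values only: a 448 x 448 image divided into a 7 x 7 grid.
print(responsible_cell(224, 100, 448, 448, 7))
```

With these example values, the centre (224, 100) falls in row 1, column 3 of the grid.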
Figure 3-6: YOLO workflow (panels: bounding boxes + confidence; class probability map)
Decision making in choosing an object detection model:
Table 1: Testing results of many models in similar conditions
13.5 38.1 52.0 16.2 39.8 52.1
unreliable. YOLO based on VGG16 lowers FPS to 21, but its 66.4 mAP makes up for
it. SSD300 outperforms both YOLO and Faster R-CNN with 74.3 mAP and 59 FPS on the same hardware. However, this comparison was made in 2016; as technology improves, the
results will differ.
In 2018, Joseph Redmon and Ali Farhadi introduced YOLOv3 with many improvements. The architecture grew to 106 layers and used 3 different scales for detection. These changes were significant and made YOLOv3 one of the best detectors for real-time tasks.
Figure 3-7: YOLO performance (methods compared: [B] SSD321, [C] DSSD321, [D] R-FCN, [E] SSD513, [F] DSSD513, [G] FPN FRCN, YOLOv3-320, YOLOv3-416)
3.2 Detail of Yolo Model for object detection
3.2.1 Introduction
Yolo, standing for You Only Look Once, is a state-of-the-art algorithm which is popular for its ability to detect objects and can be used in real-time applications.
3.2.2 The reason for choosing Yolo
Yolo was released to the community a few years ago, with many versions published over time. This project makes use of Yolov3 for its incredible speed and reliable accuracy compared to other detection algorithms. It can also detect multiple objects, including their class and location, in a single image. Above all, the model allows a simple tradeoff between speed and accuracy just by changing the network size, without requiring more resources.
3.2.3 How it works
3.2.3.1 Architecture
Yolo uses:
— 53 CNN layers (Darknet-53)
— For detection, 53 more layers are added
— Total 106 layers for Yolo version 3
Figure 3-8: Darknet framework loads 106 layers for every command
Figure 3-9: Darknet-53 (columns: Type, Filters, Size, Output)
— n: the number of images
— w: width
— h: height
3.2.3.3 Channel
The values of w and h are the resolution of the network, and they can be changed
to any number that is divisible by 32 without remainder. Increasing
the resolution of the network leads to improved accuracy in training and
detecting.
The Darknet framework has an integrated resize function, so the user may feed images
of any size to the network without trouble, as the images will be adjusted according
to the network size (width x height).
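The divisibility rule can be checked with a few lines of code. This is only a sketch: `is_valid_network_size` and `nearest_valid_size` are hypothetical helper names and are not part of the Darknet framework.

```python
def is_valid_network_size(width, height, stride=32):
    """True when both dimensions are positive multiples of the stride (32)."""
    return (width > 0 and height > 0
            and width % stride == 0 and height % stride == 0)


def nearest_valid_size(value, stride=32):
    """Round a requested dimension to a nearby multiple of the stride.

    Assumption for illustration: sizes below one stride are raised to 32.
    """
    return max(stride, round(value / stride) * stride)


print(is_valid_network_size(416, 416))
print(is_valid_network_size(400, 400))
print(nearest_valid_size(410))
```

Here 416 x 416 passes the check, 400 x 400 does not, and a requested size of 410 is rounded to 416.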
3.2.3.4 Detection
As mentioned above, layers 82, 94 and 106 are where detection is conducted.
At these layers, the input image has been downsampled by factors of 32,
16, and 8, respectively. These three numbers are called strides; a stride indicates how many
times smaller the resolution at that specific layer is than the input
network size. For example, if the input size is 416 x 416, then at layer 82
it is downsampled to 13 x 13; similarly, 26 x 26 at layer 94 and 52 x 52 at
layer 106.
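The relation between network size, stride and grid size is plain integer division, as the short sketch below shows. The function name `grid_sizes` is an illustrative assumption; the strides 32, 16 and 8 come from the text above.

```python
STRIDES = (32, 16, 8)  # downsampling factors at layers 82, 94 and 106


def grid_sizes(network_size, strides=STRIDES):
    """Return the side length of the detection grid at each scale."""
    if network_size % max(strides) != 0:
        raise ValueError("network size must be divisible by 32")
    return [network_size // stride for stride in strides]


print(grid_sizes(416))  # [13, 26, 52]
print(grid_sizes(608))  # [19, 38, 76]
```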
Figure 3-10: Image by Ayoosh Kathuria (YOLO v3 network architecture)
The reason for these conversions is to make detection more effective: the 13 x 13 resolution is responsible for detecting large objects, 26 x 26 is for medium objects and 52 x 52 is for small objects.
— B: number of bounding boxes that each cell is responsible for detecting. In the paper, the authors of Yolov3 state that each cell predicts 3 boxes.
— C: number of classes you want to detect.
— 5: the 5 attributes of each box: coordinates of the center of the bounding box (tx, ty), width and height of the bounding box (tw, th), and objectness score (p0). The confidences of each class (p1, p2, ..., pc) make up the C term.
For example, if we want Yolo to detect 5 classes, the formula B*(5+C) becomes 3*(5+5) = 30 attributes for each cell in the feature map of the detecting layers.
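The count of attributes per cell follows directly from B, C and the 5 box attributes. A minimal sketch, where the function name `attributes_per_cell` is a hypothetical helper:

```python
def attributes_per_cell(num_classes, boxes_per_cell=3):
    """Depth of the detection output per grid cell: B * (5 + C).

    The 5 covers tx, ty, tw, th and the objectness score p0.
    """
    return boxes_per_cell * (5 + num_classes)


print(attributes_per_cell(5))   # 30, the 5-class case used in this project
print(attributes_per_cell(80))  # 255, for an 80-class dataset
```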
Figure 3-11: Image by Valentyn Sichkar(a)
3.2.3.6 Anchors (prior)
Anchors (or priors) are predefined bounding boxes which take part in choosing
which bounding box in each cell predicts the right object. They are calculated
by K-means clustering. Through the whole process, a total of 9 anchor boxes are used
to calculate the bounding boxes, 3 for each scale.
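The clustering step can be sketched as follows. This is an illustrative pure-Python version and not the Darknet implementation: it assumes the distance 1 - IoU between box shapes described in the YOLOv2 paper, random initial centroids drawn from the data, and hypothetical function names `iou_wh` and `kmeans_anchors`.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster labelled box sizes into k anchors, sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_w = sum(w for w, _ in cluster) / len(cluster)
                mean_h = sum(h for _, h in cluster) / len(cluster)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Made-up (width, height) samples in pixels, for illustration only.
sample_boxes = [(10, 12), (11, 13), (12, 11), (50, 60), (52, 58),
                (48, 62), (120, 110), (118, 115), (122, 108)]
print(kmeans_anchors(sample_boxes, 3))
```

For YOLOv3, k would be 9, and the sorted anchors would be split into three groups of 3, one group per detection scale.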
Figure 3-12: Image by Valentyn Sichkar(b)
First, Yolo extracts the information in the kernel of each cell. Then it calculates and chooses the bounding box which has the highest probability for a specific class.
Repeating these steps for all the grid cells of all the scales, Yolo version 3 calculates a total of 10647 bounding boxes by the end of the process (supposing the
network input resolution is 416 x 416).
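The figure of 10647 can be verified by summing 3 boxes per cell over the three grids: 3 x (13 x 13 + 26 x 26 + 52 x 52) = 3 x 3549 = 10647. A short sketch of the same arithmetic, with `total_boxes` as a hypothetical helper name:

```python
def total_boxes(network_size, strides=(32, 16, 8), boxes_per_cell=3):
    """Total predicted boxes over all detection scales."""
    return sum(boxes_per_cell * (network_size // stride) ** 2
               for stride in strides)


print(total_boxes(416))  # 10647
```

The same function shows how quickly the count grows with the network size: a 608 x 608 input gives 22743 boxes.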
3.2.3.7 Calculate bounding box
To calculate the bounding box, Yolo evaluates the offsets to the anchor through these formulas
bw: width of the bounding box
bh: height of the bounding box
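The formulas themselves did not survive in the text here. As a sketch, the decoding published in the YOLOv3 paper is bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw * exp(tw), bh = ph * exp(th), where (cx, cy) is the cell offset in the grid and (pw, ph) is the anchor size. The code below assumes those formulas; multiplying the centre by the stride to obtain pixels, and the example anchor of 116 x 90 pixels, are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Turn raw network outputs into a box centre and size in pixels.

    cx, cy: column and row index of the grid cell.
    pw, ph: anchor width and height in pixels (assumed unit).
    """
    bx = (sigmoid(tx) + cx) * stride   # centre x, kept inside the cell
    by = (sigmoid(ty) + cy) * stride   # centre y, kept inside the cell
    bw = pw * math.exp(tw)             # anchor width scaled by exp(tw)
    bh = ph * math.exp(th)             # anchor height scaled by exp(th)
    return bx, by, bw, bh


# Zero offsets place the box at the centre of cell (6, 3) of the 13 x 13
# grid, with exactly the anchor's size.
print(decode_box(0, 0, 0, 0, cx=6, cy=3, pw=116, ph=90, stride=32))
```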
Figure 3-14: Calculate bounding box by using the anchor
3.2.3.8 Objectness score
The objectness score is one of the attributes of a bounding box. It is used when calculating which class the bounding box relates to, and that result is later used to choose the anchor box. The value of the objectness score stands for the probability that the bounding box has an object inside.
We need to recognize the difference between the objectness score and the
class confidences: the objectness score indicates the probability that the cell contains an object, while the confidences indicate which class that object belongs to.
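$$p_0 = P(\text{object}) \cdot \mathrm{IoU}(BB_1, BB_2)$$
- $P(\text{object})$: predicted probability
- $\mathrm{IoU}$: intersection over union between predicted BB2 and ground truth BB1
Figure 3-15: Equation for objectness score - Image by Valentyn Sichkar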
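The overlap term between the predicted and ground truth boxes can be computed as intersection over union (IoU). The sketch below assumes boxes given as (x1, y1, x2, y2) corner coordinates and reads the objectness target as the product of the predicted probability and the IoU; the function names `iou` and `objectness_target` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def objectness_target(p_object, predicted_box, truth_box):
    """Predicted probability weighted by the overlap with the ground truth."""
    return p_object * iou(predicted_box, truth_box)


# Two 10 x 10 boxes overlapping on half their width: IoU = 50 / 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```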
3.2.4 The result
Yolo version 3 with the Darknet framework outputs the bounding boxes and confidence
scores of what it detects. The reliability of detection depends on how the training process
is carried out.
Figure 3-16: Image of objects detected in Ho Chi Minh City
CM-day_6.jpg: Predicted in 31024.904000 milli-seconds.
(left_x: 167 top_y: 390 width: 117
(left_x: 223 top_y: 287 width: 83
(left_x: 241 top_y: 163 width: 39
(left_x: 380 top_y: 67 width: 14
(left_x: 392 top_y: 60 width: 14
(left_x: 394 top_y: 79 width: 22
(left_x: 411 top_y: 140 width: 38
(left_x: 423 top_y: 186 width: 52
(left_x: 426 top_y: 85 width: 23
(left_x: 426 top_y: 93 width: 25
(left_x: 477 top_y: width: 49
(left_x: 595 top_y: width:
Figure 3-17: The result displayed on terminal
3.3 Dataset and Training
3.3.1 Dataset and why choosing
Globally, one of the standards for evaluating the development level of a country is the automobile ownership rate. The variety of vehicle types appearing on the road also reflects the growth of the car industry. The Ministry of Industry and Trade predicts that the car market will flourish by 2025, when our nation could reach 600,000 cars per year.
The report of VAMA (Vietnam Automobile Manufacturers' Association) stated that a total of 36,359 automobiles were sold in October 2020, an increase of 9% compared
to the previous month and 22% compared to the same month of the previous year.
Although we have suffered from disadvantages such as the pandemic obstructing the economy, the number of automobiles imported, assembled and sold is still significant.
Table 2: Number of imported automobiles in November 2020
Types                                     Instances
Automobiles less than 9 seats             8,441
Automobiles more than 9 seats             12
Automobiles specializing in transporting  2,585
Others                                    1,199
Total                                     12,237
Given this evidence, we choose these types of vehicles: car, van, truck, lorry truck, and bus, as their numbers will only grow in the future, especially in the main cities of our nation like Ho Chi Minh City or the capital, Ha Noi.
3.3.2 Raw image
A good model relies on a good dataset, and Yolo is no exception. This project is conducted with the hope of providing a means to control traffic at today's growth rate, so the training data consist of popular vehicles running on Vietnam's national roads.
Our dataset used in this study currently has a total of 19200 annotated instances in 2430 images. 5 classes are identified to feed into the model: car, van, truck,
lorry truck and bus. We decided to choose those types of transportation as they are large in quantity and tend to increase in the future.