BUI MINH HIEU
RESEARCH AND DEVELOP SOLUTIONS TO ESTIMATE TRAFFIC DENSITY FROM TRAFFIC CAMERAS AT MAIN
INTERSECTIONS
Major: COMPUTER SCIENCE Major code: 8480101
MASTER’S THESIS
THIS THESIS IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisor: Assoc. Prof. Ph.D. Tran Minh Quang
Examiner 1: Assoc. Prof. Ph.D. Nguyen Van Vu
Examiner 2: Assoc. Prof. Ph.D. Nguyen Tuan Dang
This master’s thesis is defended at HCM City University of Technology, VNU-HCM City, on July 11th 2023
Master’s Thesis Committee:
(Please write down full name and academic rank of each member of the Master’s Thesis Committee)
1 Chairman: Assoc. Prof. Ph.D. Le Hong Trang
2 Secretary: Ph.D. Phan Trong Nhan
3 Reviewer 1: Assoc. Prof. Ph.D. Nguyen Van Vu
4 Reviewer 2: Assoc. Prof. Ph.D. Nguyen Tuan Dang
5 Member: Assoc. Prof. Ph.D. Tran Minh Quang
Approval of the Chairman of Master’s Thesis Committee and Dean of Faculty of Computer Science and Engineering after the thesis being corrected (If any)
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM Independence – Freedom - Happiness
THE TASK SHEET OF MASTER’S THESIS
Full name : Bui Minh Hieu Student ID : 2170461
Date of birth : 17/02/1997 Place of birth : HCM
Major : Computer science Major ID : 8480101
I THESIS TITLE : RESEARCH AND DEVELOP SOLUTIONS TO
ESTIMATE TRAFFIC DENSITY FROM TRAFFIC CAMERAS AT MAIN
INTERSECTIONS (NGHIÊN CỨU, XÂY DỰNG CÁC PHÉP ƯỚC LƯỢNG
MẬT ĐỘ GIAO THÔNG DỰA VÀO DỮ LIỆU CAMERA Ở NHỮNG NÚT GIAO THÔNG QUAN TRỌNG).
II TASKS AND CONTENTS :
The objective of this thesis is to research and develop a method for estimating traffic density at intersection areas using images or videos. Accordingly, the tasks involved in this work include determining the calculation method and evaluating traffic density for a given area, comparing existing papers and systems with the proposed method to identify any differences, proposing solutions to address challenges, and advancing the calculation method of the thesis. Additionally, a prototype system will be designed to demonstrate the functionality.
III THESIS START DAY : (According to the decision on assignment of
Master’s thesis) 06/09/2022
IV THESIS COMPLETION DAY : (According to the decision on assignment of
Master’s thesis) 08/06/2023
V SUPERVISOR : (Please fill in the supervisor’s full name and academic rank) Assoc Prof Ph.D Tran Minh Quang
HCM City, June 8th 2023
SUPERVISOR
(Full name and signature)
CHAIR OF PROGRAM COMMITTEE
(Full name and signature)
HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
ACKNOWLEDGEMENT
First and foremost, I would like to express my heartfelt gratitude to Assoc. Prof. Ph.D. Tran Minh Quang, the instructor who guided and helped me throughout the completion of my thesis.
I would also like to thank the Computer Science and Engineering instructors, as well as the teachers who generously shared their knowledge with me throughout my time at the HCM City University of Technology. Finally, I would like to thank my family and friends, who have always supported and encouraged me throughout the process of doing this research.
HCM City, June 8th 2023
ABSTRACT
Traffic congestion has become a pressing issue, not only for the government but also for the general public, as it directly impacts the quality of life. It is necessary to research and develop solutions to manage traffic conditions and minimize the impact of congestion. With this in mind, this study provides readers with a method to estimate traffic conditions, specifically traffic density at intersections. To achieve this, the research team divided the task into two main components: vehicle counting and area measurement. Vehicle counting is relatively simple with the assistance of modern technologies and techniques. Area measurement is harder, and this study offers methods to calculate the area of a region from images, without relying on camera specifications, based on the concept of a reference object. Additionally, we present the design of the testing system, the challenges faced and the accompanying solutions during the design process, as well as the results achieved when applied in a real-world environment. Finally, alongside the expected outcomes, some limitations need to be discussed and addressed in the future.
SUMMARY OF THE MASTER'S THESIS
Traffic congestion has become a pressing issue, not only for the government but also for the public, because it directly affects the quality of life. Researching and developing traffic management solutions to reduce the impact of congestion has become more necessary than ever. To better understand this problem, this study provides readers with a method to estimate and evaluate traffic conditions, specifically traffic density at intersections. To achieve this goal, the research team divided the work into two main components: counting vehicles and measuring the area of the region. Counting vehicles is fairly simple with the support of modern technologies and techniques. In contrast, measuring the area of the region is more difficult and challenging, so this study proposes methods to calculate the area of a region from images, without relying on camera specifications, based on the concept of a reference object. In addition, we present the system architecture we designed, the challenges encountered while running the system and the accompanying solutions, as well as the results achieved when applied in a real-world environment. Finally, alongside the expected results, some limitations need to be discussed and addressed in the future.
THE COMMITMENT
I confirm that this is my own research. The data utilized throughout the thesis's analytic process has a clear and transparent provenance, and it was released in compliance with scientific research standards and ethics. In this thesis, I have presented the results of my study openly and fairly. The thesis results are presented in this report for the first time and have not been published in any earlier thesis.
HCM City, June 8th 2023
TABLE OF CONTENTS
I INTRODUCTION
1.1 Research problem
1.2 Objectives of the topic
1.3 Scope of study
1.4 Scientific and practical significances
1.4.1 Practical significance
1.4.2 Academic significance
II THEORETICAL BASIS
2.1 Definition of traffic density
2.2 Definition of level of service (LOS)
2.3 Definition of computer vision - machine learning
2.3.1 Haar-cascade
2.3.1.1 Calculating Haar Features
2.3.1.2 Creating Integral Images
2.3.1.3 Adaboost Training
2.3.1.4 Implementing Cascading Classifiers
2.3.2 Convolutional neural network models
2.3.2.1 Convolution layer
2.3.2.2 Pooling layer
2.3.2.3 Fully connected layer
2.3.3 YOLO
2.3.3.1 Residual blocks
2.3.3.2 Bounding box regression
2.3.3.3 Intersection over union (IOU)
2.4 Definition of pixel per meter
III RELATED WORKS
3.1 Traffic situation in Ho Chi Minh City
3.1.1 Overview of traffic situation in Ho Chi Minh City
3.1.2 Statistics of damage
3.2 Vehicle detection and counting approaches
3.2.1 Foreground extraction
3.2.2 Haar-cascade
3.2.3 Convolutional neural network
3.2.4 YOLO
3.3 Object size measuring methods
3.3.1 Math-based calculating method
3.3.2 Object reference method
3.4 Traffic density calculating
IV PROPOSED SOLUTIONS
4.1 Calculate traffic density
4.2 Vehicle counting and categorizing
4.2.1 Vehicle counting
4.2.2 Vehicle classifying and converting
4.3 Calculate intersection area
4.3.1 Distance-based method
4.3.2 Mean-based method
V EXPERIMENTAL RESULT
5.1 Experiment setup environment
5.2 Experimental system architecture
5.2.1 Data collection module
5.2.2 Training server module
5.2.3 Data analysis
5.2.4 Diagnosis
5.2.4.1 Obtain real area approach
5.2.4.2 Calculate error rate
5.3 Result from training and detecting
5.4 Result from inferring intersection area
5.5 Result from evaluating traffic density
VI DISCUSSION AND FURTHER RESEARCH
6.1 Achievement
6.2 Limitation of the study
6.3 Recommendations for further research
PUBLICATIONS
I INTRODUCTION
1.1 Research problem
Urbanization is understood as the process of urban expansion, expressed as the percentage of urban area or population over the total area or population of a region. Moreover, urbanization is also considered a major development process that improves quality of life, maintains a balanced population, controls population density, and so on.
By the end of June 2021, the coverage rate of urban zoning planning compared to construction land area in urban areas across the country had reached about 53%, of which the 2 special urban areas (Hanoi and Ho Chi Minh City) and 19 grade I cities reached about 80–90%, and urban areas of grades II, III, and IV about 40–50%. The detailed coverage rate of urban planning is about 39% compared to the area of construction land [1]. According to several recent reports, the urbanization of Vietnam is proceeding at a gradually increasing pace, with the national urbanization rate reaching 40% in 2019 [2]. This process of urbanization brings many benefits to a country, such as accelerating economic growth, shifting labor and economic structures, and changing population distribution. Cities are not only big consumers of goods but also places that create job opportunities and income for workers [3].
by up to 6 billion USD annually [4] and the budget of the citizens joining the traffic from the waste of gasoline in traffic jams [5]. Furthermore, the government has made enormous investments in the installation of closed-circuit television (CCTV) camera systems, although their full potential has not been realized.
As a result, the research team must develop ways to evaluate traffic and utilize the capabilities of these cameras. Many kinds of metrics are used to measure the level of traffic on roadways and in particular areas. The study team focuses on traffic density in this study. Two aspects must be considered when calculating traffic density: the number of vehicles and the area of the region where the counting takes place, in this case, the junction area. Several studies and modern technologies, particularly machine learning, have been devoted to the vehicle counting problem. On the other hand, calculating the area of an intersection presents a different level of complexity. Each camera has different attributes and is positioned at a different height and location, making the calculation challenging and requiring significant effort in collecting these parameters. Additionally, for the convenience of applying the solution in practical settings and across the majority of intersections, we aim to find a solution that generalizes the aforementioned problem, which means calculating the area without relying on those specific technical specifications.
To solve the preceding issue, we propose an approach based on the use of a reference object [31]. The use of a reference object is a strategy that utilizes the known size of an object in space to estimate the size of another. This study found that reference objects can have dynamic rather than static attributes. For the computations, we use traffic vehicles as reference objects because they are recognized frequently. Based on that concept, we obtained numerous promising results from this study:
- Propose a way to calculate traffic density at intersections
- Propose a way to count vehicles appropriately
- Propose solutions for calculating the region’s area and evaluate them in real-world circumstances
1.2 Objectives of the topic
The objectives of the topic are as follows:
- Proposing a way to calculate traffic density at intersections
- Comparing and selecting appropriate machine learning models for determining the number of vehicles in a given region and categorizing them as motorbikes, cars, trucks, and so on
- Proposing methods to convert the intersection area from pixels to square meters
- Developing an experimental system for collecting and processing data
1.3 Scope of study
The main scientific idea of the project is to research and develop solutions to estimate traffic density from traffic cameras or videos at main intersections. The data used in this study is collected directly from the Transportation Facilities’ CCTV cameras. Ho Chi Minh City is the area used as the research target. The study focuses on evaluating traffic density at intersections using images and on solving concerns related to calculating traffic density.
1.4 Scientific and practical significances
This study benefits the subject of traffic density estimation by using an image-based technique, which has both scientific and practical implications. It provides insight into the unique traffic patterns seen in Ho Chi Minh City while also outlining feasible alternatives for strategic planning and traffic management within the city's transportation infrastructure.
1.4.1 Practical significance
intersections. This allows them to choose the best route and avoid crowded regions, increasing traffic efficiency and cutting down on travel times.
- Resource allocation and infrastructure planning: accurate traffic density estimates can help transportation agencies allocate resources effectively. Authorities can prioritize investments in infrastructure enhancements, such as more lanes, traffic signal optimization, or intelligent transportation systems, by identifying junctions with high traffic density.
- Congestion estimation tool: this research contributes to the practical aspect of traffic engineering by providing a tool for identifying congestion at intersections, which helps transportation agencies implement strategies to reduce traffic congestion in Ho Chi Minh City.
- Intelligent Transportation Systems (ITS) development: the suggested technique can help with the development of Intelligent Transportation Systems (ITS) by accurately estimating traffic density at intersections. Transportation agencies may improve the overall effectiveness and efficiency of traffic management, including adaptive traffic signal control, incident detection, and real-time traffic information distribution, by incorporating the traffic density data into an ITS framework. This might result in better traffic flow, a decrease in congestion, and better overall performance of the transportation system in Ho Chi Minh City.
1.4.2 Academic significance
In terms of academic significance, this research makes the following contributions:
- Improvement in traffic density estimation: by proposing a novel image-based approach for calculating traffic density at intersections, the research makes a scientific contribution. Compared with previous methods, the new strategy has the potential to increase the accuracy and reliability of traffic density estimation.
research highlights the use of machine learning methods in the transportation industry, showing the potential of artificial intelligence to address actual traffic issues.
- Validation and comparative analysis of methods for calculating a region’s area: the study reviews and contrasts several approaches to determining the total area of a region. It advances our understanding of the advantages and disadvantages of various methods by comparing the proposed strategies with current ones. This comparative analysis may inform the decisions of researchers and practitioners about the selection and use of suitable traffic density estimation methods in various urban settings.
II THEORETICAL BASIS
In this chapter, we present the theoretical foundations on which our research team has relied to develop the solutions used in the study. We will apply the appropriate theories, and evaluate those that are not suitable for the context of the problem, in Chapter III.
2.1 Definition of traffic density
The number of vehicles occupying a particular length of highway in a traffic lane is referred to as traffic density. It is measured in vehicles per mile, vehicles per kilometer, or vehicles per meter. For example, if four automobiles occupy 500 feet of a lane, the traffic density is 42.24 automobiles per mile. The volume of traffic is inversely proportional to the density. If the density is lower, the speed and traffic volume will be higher, and if the density is higher, the speed and traffic volume will be lower. When traffic congestion occurs at a certain spot, we may consider expanding the road, building a flyover, or installing an underpass based on the peak hour traffic flow [18].
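The arithmetic behind the 500-foot example can be checked with a short sketch. The function name `density_per_mile` is an assumption introduced here for illustration; only the figures (four vehicles, 500 feet) come from the text.

```python
FEET_PER_MILE = 5280  # length of a statute mile in feet


def density_per_mile(vehicle_count: int, segment_length_ft: float) -> float:
    """Linear traffic density, in vehicles per mile, for one lane segment."""
    if segment_length_ft <= 0:
        raise ValueError("segment length must be positive")
    return vehicle_count * FEET_PER_MILE / segment_length_ft


# Four vehicles observed on a 500-foot segment, as in the example above.
example_density = density_per_mile(4, 500)
print(round(example_density, 2))
```

The same scaling applies to vehicles per kilometer by replacing the constant and the unit of the segment length.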
2.2 Definition of level of service (LOS)
According to Wikipedia and the Oxford dictionary, level of service (LOS) is a qualitative metric used to assess how well motor vehicle traffic services are provided. LOS is used to study roads and intersections by classifying traffic flow and assigning quality levels of traffic based on performance measures such as vehicle speed, density, and congestion. In a general context, all services in the asset management sector may fall under the umbrella of levels of service [7].
2.3 Definition of computer vision - machine learning
Computer vision is the technique of using computers to understand digital images and videos. It aims to automate tasks that human vision can perform. Techniques for acquiring, processing, and understanding digital pictures, as well as retrieving data from the real world, are used to create information. It also has subcategories like object recognition, video tracking, and motion prediction, making it helpful for navigation, visualization of objects, and other applications.
Machine learning, which is a branch of artificial intelligence, is the study of algorithms and statistical models. Without explicit instructions, systems rely on patterns and inference to carry out a task. As a result, it applies to pattern recognition, software engineering, and computer vision. Computers accomplish machine learning with only a little assistance from software programmers. Data is used to make decisions, and data may be used in a variety of ways across fields. Learning can be divided into three categories: supervised learning, semi-supervised learning, and unsupervised learning.
Computer vision and machine learning are two fields that have developed close ties. Computer vision for tracking and recognition has improved because of machine learning, which provides efficient acquisition, image processing, and object focus techniques that are applied in computer vision. In turn, computer vision has expanded the application of machine learning. A computer vision system includes a digital image or video, a sensor device, an interpreting device, and the stage of interpretation. Machine learning is used in the interpreting device and in the interpretation stage [12].
2.3.1 Haar-cascade
can be used to develop detectors for any "things," including bananas, cooking utensils, buildings, automobiles, and structures.
In general, Haar cascade is a cascading window technique that attempts to calculate characteristics in each window and identify whether it may be an object [11], [12]. Although the Viola-Jones framework undoubtedly paved the way for object detection, subsequent approaches, such as deep learning and histogram of oriented gradients (HOG) + linear SVM, have far surpassed it.
The Haar cascade algorithm is divided into four stages: calculating Haar features, creating integral images, Adaboost training, and implementing cascading classifiers. However, it is important to recall that this approach, like other machine learning models, requires a large number of positive and negative images of the target objects to train the classifier.
2.3.1.1 Calculating Haar Features
The collection of Haar features is the initial phase, as shown in Fig 1. To put it simply, a Haar feature is a calculation performed on adjacent rectangular regions at a given position in a detection window. The calculation adds up the pixel intensities in each region and computes the difference between the sums. However, computing these features over a large image can be expensive. Integral images come into play in this situation because they reduce the number of operations required.
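As a minimal sketch of the calculation just described, the following computes one two-rectangle feature by summing pixel intensities directly. The function names and the toy 4x4 patch are assumptions made for illustration and do not come from [11].

```python
def region_sum(image, top, left, height, width):
    """Sum of pixel intensities in a rectangular region of a 2D list."""
    return sum(
        image[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def two_rectangle_feature(image, top, left, height, width):
    """Difference between the right and left halves of a window.

    The window is split vertically into two adjacent rectangles of equal
    size, so `width` must be even.
    """
    if width % 2 != 0:
        raise ValueError("width must be even")
    half = width // 2
    left_sum = region_sum(image, top, left, height, half)
    right_sum = region_sum(image, top, left + half, height, half)
    return right_sum - left_sum


# Toy grayscale patch: dark left half, bright right half (a vertical edge).
patch = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
edge_response = two_rectangle_feature(patch, 0, 0, 4, 4)
```

A large absolute response indicates an edge-like pattern inside the window, while a uniform patch gives zero. Summing every pixel for every window is what makes this direct form slow on large images.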
2.3.1.2 Creating Integral Images
According to Paul Viola and Michael Jones [11], rectangle features can be computed quickly using an intermediate image representation known as the integral image:
\[ ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y') \tag{1} \]
\[ s(x, y) = s(x, y - 1) + i(x, y) \tag{2} \]
\[ ii(x, y) = ii(x - 1, y) + s(x, y) \tag{3} \]
Where: ii(x, y) is the integral image, i(x, y) is the original image, and s(x, y) is the cumulative row sum, with s(x, -1) = 0 and ii(-1, y) = 0.
The integral image may be calculated in a single pass over the original picture. Using the integral image, any rectangular sum may be computed using four array references. Clearly, the difference between two rectangular sums may be computed using eight references. Because the two-rectangle features mentioned above involve adjacent rectangular sums, they may be computed using six array references, eight in the case of three-rectangle features, and nine for four-rectangle features.
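The single-pass construction and the four-reference rectangle sum can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [11]: the table is padded with an extra zero row and column to stand in for the boundary conditions, and the function names are hypothetical.

```python
def integral_image(image):
    """Build a (H+1) x (W+1) table where ii[y + 1][x + 1] holds the sum of
    all pixels image[0..y][0..x].

    The extra zero row and column play the role of the boundary
    conditions s(x, -1) = 0 and ii(-1, y) = 0.
    """
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    s = [0] * width  # running cumulative sum s(x, y) for each column x
    for y in range(height):
        for x in range(width):
            # Cumulative sum recurrence: s(x, y) = s(x, y - 1) + i(x, y)
            s[x] += image[y][x]
            # Integral recurrence: ii(x, y) = ii(x - 1, y) + s(x, y)
            ii[y + 1][x + 1] = ii[y + 1][x] + s[x]
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of a rectangle using exactly four references into the table."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
lower_right_block = rect_sum(table, 1, 1, 2, 2)  # pixels 5, 6, 8, 9
```

Once the table is built, the cost of a rectangle sum no longer depends on the size of the rectangle, which is what makes evaluating many Haar features per window affordable.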
Fig 2 The total number of pixels in rectangle A makes up the value of the integral
2.3.1.3 Adaboost Training
Adaboost, in practice, chooses the best features and trains the classifiers to utilize them. It combines many "weak classifiers" to create a "strong classifier" that the algorithm may use to detect objects.
Weak learners are produced by moving a window across the input picture and computing the Haar features for each area of the image. The resulting difference is compared with a learned threshold that separates objects from non-objects. Because they are "weak classifiers," a large number of Haar features is necessary to produce a strong classifier. The last phase, which employs cascade classifiers, combines these weak learners into a strong learner.
Fig 3 Representation of a boosting algorithm [11]
2.3.1.4 Implementing Cascading Classifiers
The cascade classifier consists of several stages, each of which is made up of a group of weak learners. By employing boosting during the training of weak learners, a highly accurate classifier can be created from the mean prediction of all weak learners.
Stages are created so that negative samples can be rejected as quickly as possible, because the majority of the windows do not contain anything of interest. Classifying an object as a non-object would significantly hamper the object recognition method; hence, it is crucial to keep the false-negative rate low. Before utilizing the Haar cascade, it is important to keep in mind that the model must be trained with proper hyperparameters.
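The early-rejection idea can be illustrated with a small sketch. Everything here is hypothetical: the stump thresholds, weights, and stage thresholds are made-up values rather than trained ones, and each stage is modeled as a weighted vote of weak classifiers, which is one common way of writing a boosted stage.

```python
from typing import Callable, List, Sequence, Tuple

# A weak classifier maps a vector of Haar feature values to 1 (object) or 0.
WeakClassifier = Callable[[Sequence[float]], int]
# A stage is a list of (weak classifier, weight) pairs plus a stage threshold.
Stage = Tuple[List[Tuple[WeakClassifier, float]], float]


def make_stump(feature_index: int, threshold: float) -> WeakClassifier:
    """Weak classifier that thresholds a single feature value."""
    return lambda features: 1 if features[feature_index] >= threshold else 0


def stage_passes(stage: Stage, features: Sequence[float]) -> bool:
    """Weighted vote of the weak classifiers in one stage."""
    weak_learners, stage_threshold = stage
    score = sum(alpha * clf(features) for clf, alpha in weak_learners)
    return score >= stage_threshold


def cascade_detect(stages: List[Stage], features: Sequence[float]) -> bool:
    """Accept a window only if it passes every stage; reject at the first failure."""
    for stage in stages:
        if not stage_passes(stage, features):
            return False  # later, more expensive stages are never evaluated
    return True


# Illustrative two-stage cascade with made-up parameters.
stages: List[Stage] = [
    ([(make_stump(0, 100.0), 1.0)], 1.0),
    ([(make_stump(1, 50.0), 0.6), (make_stump(2, 20.0), 0.4)], 0.5),
]
accepted = cascade_detect(stages, [150.0, 60.0, 5.0])
rejected_early = cascade_detect(stages, [30.0, 60.0, 25.0])
```

Because most windows fail the cheap first stage, the average cost per window stays low even when later stages contain many weak classifiers.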
2.3.2 Convolutional neural network models
According to [13], convolutional neural networks (CNNs), a type of artificial neural network inspired by the organization of the animal visual cortex, have come to dominate various computer vision tasks and have drawn interest across a range of domains. A CNN uses building blocks such as convolution layers, pooling layers, and fully connected layers to learn spatial hierarchies of features automatically and adaptively through backpropagation. The first two, convolution and pooling layers, are used to extract features, while the third, the "fully connected layer", maps the extracted features into the final output, such as a classification. A convolution layer is a key component of a CNN and is made up of a stack of mathematical operations such as convolution, a specialized form of linear operation.
Fig 4 An overview of a convolutional neural network (CNN) architecture and the training process [13]
2.3.2.1 Convolution layer
Convolutional layers perform a convolution operation on the input and transmit the output to the following layer. A convolution combines all of the pixels in its receptive field into a single value. For example, if you apply a convolution to a picture, you reduce the image size while also combining all of the information in the field into a single pixel. The output of a convolutional layer is a feature map. We may employ several types of convolutions depending on the sort of problem we need to solve and the features we want to learn [14].
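A minimal sketch of a single-channel convolution shows both effects described above: each output value combines one receptive field, and the output is smaller than the input. The sketch assumes no padding and a stride of 1, and it does not flip the kernel, which is the convention deep learning frameworks usually follow; the function name is hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    resulting feature map as a 2D list."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    output = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            # Combine every pixel of the receptive field into one value.
            output[r][c] = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
feature_map = conv2d_valid(image, box_kernel)  # 4x4 input becomes 2x2
```

In a trained network the kernel weights are learned rather than fixed, and a layer applies many kernels to produce many feature maps.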
2.3.2.2 Pooling layer
2.3.2.3 Fully connected layer
Last but not least, the output feature maps of the final convolution or pooling layer are often flattened, that is, turned into a one-dimensional (1D) array of numbers (a vector). They are then connected to one or more fully connected layers, also called dense layers, in which every input is connected to every output by a learnable weight [14]. Once the features have been extracted by the convolution layers and down-sampled by the pooling layers, they are mapped by a subset of fully connected layers to the network's final outputs, such as the probabilities for each class in classification tasks. The number of output nodes in the final fully connected layer is normally equal to the number of classes. As previously explained, each fully connected layer is followed by a nonlinear function such as ReLU.
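The flattening step and one fully connected layer can be sketched as below. The weights are made-up values, and the use of softmax to turn the final outputs into class probabilities is an assumption added for illustration; the text only states that the final outputs can be class probabilities.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2D feature maps into one 1D list of numbers."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output through a weight."""
    return [
        sum(w * x for w, x in zip(weight_row, inputs)) + bias
        for weight_row, bias in zip(weights, biases)
    ]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One 2x2 feature map, two output classes, illustrative weights.
feature_maps = [[[1, 0], [0, 1]]]
vector = flatten(feature_maps)
logits = dense(vector, weights=[[1, 1, 1, 1], [0, 0, 0, 0]], biases=[0, 0])
probabilities = softmax(logits)
```

The number of rows in `weights` equals the number of output nodes, which matches the number of classes in a classification task.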
2.3.3 YOLO
YOLO is an abbreviation of "You Only Look Once", first described in the seminal 2015 paper by Joseph Redmon et al. [15]. This method identifies and locates various objects in images in real time. YOLO treats object detection as a regression problem and outputs the class probabilities of the detected objects.
To detect objects in real time, YOLO utilizes convolutional neural networks (CNN). As the name implies, the approach requires just one forward propagation through a neural network to identify objects. This means that predictions for the entire image are made in a single algorithm run. The CNN is used to predict several class probabilities and bounding boxes at the same time. The YOLO algorithm has various variants, such as YOLO v1, YOLO v2 (YOLO9000) in 2016, YOLO v3 in 2018, YOLO v4 in 2020, etc. YOLO is nowadays very well known for object detection tasks, which makes it a natural fit for this thesis. The reasons are as follows:
- Speed: The real-time prediction capability of this method increases the speed of detection.
- Learning capabilities: The algorithm has excellent learning capabilities that allow it to pick up on object representations and use them to its advantage when detecting objects.
In general, this approach includes three techniques: residual blocks, bounding box regression, and intersection over union (IOU).
2.3.3.1 Residual blocks
The picture is first divided into a grid of square cells. Figure 5 below depicts how an input image is divided into equal grid cells. Each grid cell detects the objects that appear within it. For example, if the center of an object falls within a certain grid cell, that cell is in charge of detecting it.
Fig 5 An image is divided into grids Green grid cells detect pedestrian [40]
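The rule that the cell containing an object's center is responsible for detecting it can be written as a small sketch. The function name, the 448-pixel image size, and the 7x7 grid are assumptions chosen for illustration.

```python
def responsible_cell(center_x, center_y, image_width, image_height, grid_size):
    """Return the (row, column) of the grid cell that contains the point.

    The image is divided into grid_size x grid_size equal cells. Points on
    the far right or bottom border are assigned to the last cell.
    """
    col = min(int(center_x * grid_size / image_width), grid_size - 1)
    row = min(int(center_y * grid_size / image_height), grid_size - 1)
    return row, col


# An object centered at (224, 100) in a 448x448 image with a 7x7 grid.
cell = responsible_cell(224, 100, 448, 448, 7)
```

Only the cell returned here would be trained to predict the box for that object; neighboring cells that merely overlap the object are not responsible for it.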
2.3.3.2 Bounding box regression
A bounding box is an outline that draws attention to an object in a picture. Every bounding box in the image has the following properties: width (bw); height (bh); class (person, automobile, traffic light, etc.), represented by the letter c; and bounding box center (bx, by). YOLO predicts the height, width, center, and class of objects using a single bounding box regression. Figure 6 below depicts an example of a bounding box. The bounding box has been
Fig 6 The information provided by a bounding box [40]
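Because the bounding box above is described by its center (bx, by), width bw, and height bh, while overlap calculations are usually written with corner coordinates, a conversion between the two is often needed. The following is a small sketch with a hypothetical function name and made-up numbers.

```python
def center_to_corners(bx, by, bw, bh):
    """Convert a box given by center, width, and height into
    (x_min, y_min, x_max, y_max) corner coordinates."""
    half_w, half_h = bw / 2, bh / 2
    return (bx - half_w, by - half_h, bx + half_w, by + half_h)


# A box centered at (50, 40) that is 20 wide and 10 high.
corners = center_to_corners(50, 40, 20, 10)
```

The corner form is the [x1, y1, x2, y2] layout used when two boxes are compared.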
2.3.3.3 Intersection over union (IOU)
Intersection over union (IOU) is a measure used in object detection that describes how much two boxes overlap. YOLO uses IOU to generate an output box that correctly encloses the objects. Each grid cell predicts bounding boxes and their confidence scores.
IOU is mostly utilized in object detection applications, where we train a model to produce a box that precisely encloses an object. In YOLO implementations, IOU is also used in non-maximal suppression, which removes boxes that are positioned close to the same object, keeping the box with the higher confidence [16].
The concept behind calculating IOU is illustrated in Fig 7. Let us assume that box 1 is represented by [x1, y1, x2, y2], and box 2 is represented by [x3, y3, x4, y4]. Following Fig 8, IOU is calculated as in Eq (4):
\[ \mathrm{IOU} = \frac{\text{Area of Intersection of two boxes}}{\text{Area of Union of two boxes}} \tag{4} \]
Fig 7 Given two boxes, the dark area is the intersection area [16]
Fig 8 IOU equation is presented in picture form [16]
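The IOU of the two boxes [x1, y1, x2, y2] and [x3, y3, x4, y4] can be computed with a few lines of code. This sketch assumes each box is given by its top-left and bottom-right corners, with x2 > x1 and y2 > y1; the function name is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max)."""
    x1, y1, x2, y2 = box_a
    x3, y3, x4, y4 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(x2, x4) - max(x1, x3))
    inter_h = max(0.0, min(y2, y4) - max(y1, y3))
    intersection = inter_w * inter_h
    # Union = sum of both areas minus the part counted twice.
    union = (x2 - x1) * (y2 - y1) + (x4 - x3) * (y4 - y3) - intersection
    return intersection / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # two 2x2 boxes sharing a 1x1 corner
```

In non-maximal suppression, a value like `overlap` is compared against a chosen threshold to decide whether two boxes refer to the same object.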
2.4 Definition of pixel per meter
Pixel measurement refers to counting the pixels of a specific value in an image, or calculating their density. Depending on how much information is available, the real size of an object or a real distance can be calculated from the picture alone. According to [20], pixels per meter (PPM) is a measurement used to indicate how much potential visual detail a camera provides at a particular distance.
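To make the PPM idea concrete, the sketch below derives a scale from one object of known size and uses it to convert pixel lengths and pixel areas into metric units. It assumes a uniform scale across the region (no perspective distortion), which is a simplification, and all names and numbers are hypothetical rather than taken from the thesis experiments.

```python
def pixels_per_meter(length_px: float, length_m: float) -> float:
    """Scale factor obtained from an object whose real length is known."""
    if length_m <= 0:
        raise ValueError("real length must be positive")
    return length_px / length_m


def pixels_to_meters(length_px: float, ppm: float) -> float:
    """Convert a length measured in pixels to meters."""
    return length_px / ppm


def pixel_area_to_square_meters(area_px: float, ppm: float) -> float:
    """Convert an area in square pixels to square meters.

    Area scales with the square of the linear scale factor.
    """
    return area_px / (ppm * ppm)


# Illustrative values: a 4.5 m long vehicle that spans 90 pixels.
ppm = pixels_per_meter(90, 4.5)
region_area_m2 = pixel_area_to_square_meters(40000, ppm)
```

In a real camera view the scale changes with distance from the camera, so a single PPM value is only a local approximation.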
III RELATED WORKS
3.1 Traffic situation in Ho Chi Minh City
3.1.1 Overview of traffic situation in Ho Chi Minh City
According to statistics, the population of Ho Chi Minh City has expanded by about 2 million people in the last ten years; by 2021, the city's population was more than 9 million people, making it the most populated metropolis in the country. This causes a slew of issues for the city, including transportation congestion [21].
According to Department of Transportation data, the overall length of roads in the city is 4,734 km, there are 1,160 road bridges, and the total road surface area is 50.7 million m². The land traffic rate was 12.76%, 10% lower than the norm. The traffic density is 2.26 km/km², which is less than one-fifth of the benchmark and lower than similar cities like Bangkok, Taipei, and Singapore [21].
According to statistics from the Department of Transportation, more than 1,000 new vehicles are registered every day, including around 221 automobiles and 804 two-wheelers. Furthermore, by the end of the third quarter of 2022, Ho Chi Minh City was managing 8.7 million vehicles, comprising over 850,000 automobiles and approximately 7.8 million two-wheelers. The overall number of vehicles grew by 3.1% compared to the same period in 2021 (cars increased by 7.2% and two-wheelers by 2.7%) [21].
According to the Department of Transportation, this puts strain on the infrastructure system, resulting in ever greater traffic congestion. There are now 18 places in the region that are in danger of traffic congestion, of which 3 have made good adjustments, 8 have changed but are still complex, and 7 do not
minimize traffic congestion [22]. According to the Department of Transport, the key solution tasks in 2023 for reducing traffic jams include continuing to implement the project of developing traffic infrastructure in Ho Chi Minh City between 2020 and 2030; increasing public passenger transport in conjunction with controlling the use of personal motor vehicles in city traffic; and focusing on urban planning and development oriented towards public transportation (TOD model) [22].
3.1.2 Statistics of damage
The data was presented by a representative of Ho Chi Minh City's Department of Transport on July 12 at a conference summarizing the implementation of the Politburo's Resolution 53/2005 and Conclusion 27/2012 on socio-economic development, national defense, and security in the Southeast region and the Southern Key Economic Zone to 2010 and Ho Chi Minh City's orientation to 2020
3.2 Vehicle detection and counting approaches
The sections below introduce several methods and technologies for detecting and counting vehicles.
3.2.1 Foreground extraction
In this approach, the authors use background subtraction to extract objects, that is, to find the pixels that belong to a moving vehicle. Specifically, a background image of the street with no vehicles is used, and the current video frame is converted from a color (RGB) image to a gray-scale image. The gray intensity of the background image is then subtracted from that of the current frame at each pixel (x, y). The absolute value of the result is stored at the corresponding position in another image, known as the difference image [24].
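The subtraction step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in [24]: the function names, the luma weights used for the gray-scale conversion, and the threshold value of 30 are all assumptions made for the example, and the images are tiny hand-made arrays rather than real camera frames.

```python
def to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples to gray intensities (assumed luma weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def difference_image(background_rgb, frame_rgb):
    """Per-pixel absolute difference between the gray frame and the gray background."""
    bg, fr = to_gray(background_rgb), to_gray(frame_rgb)
    return [[abs(f - b) for f, b in zip(fr_row, bg_row)]
            for fr_row, bg_row in zip(fr, bg)]

def foreground_mask(diff, threshold=30.0):
    """Mark pixels whose difference exceeds a (hypothetical) brightness threshold."""
    return [[d > threshold for d in row] for row in diff]

# Toy data: an empty 4 x 6 "street" and a frame containing a bright 2 x 2 "vehicle".
background = [[(50, 50, 50)] * 6 for _ in range(4)]
frame = [list(row) for row in background]
for y in (1, 2):
    for x in (2, 3):
        frame[y][x] = (200, 200, 200)

diff = difference_image(background, frame)
mask = foreground_mask(diff)
foreground_pixels = sum(cell for row in mask for cell in row)
print(foreground_pixels)  # number of pixels flagged as belonging to a moving object
```

The threshold plays the role of the brightness parameter that has to be tuned by hand, which is one reason the approach is sensitive to lighting changes.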
Fig 9 Foreground extraction demo
Although this approach has advantages, such as being easy to implement and computationally cheap, it is very sensitive to external factors. In [24], the author has to adjust the brightness parameter when running experiments because of changes in pixel intensity. In addition, the method requires a clean background image, which makes it suitable mainly for experiments. Moreover, the ROI is selected manually, and there is no diagnostic step to check whether the selected ROI remains correct over time, even though external factors such as wind can strongly affect the selection.
3.2.2 Haar-cascade
Haar-cascade is used as a classifier for detecting and counting cars and buses in images taken from CCTV sources. The approach can be applied to both video and still images. The author also provides a pre-trained XML file (cascade file) for detecting vehicles from a given camera view angle, which is the same angle as the one used in this thesis. Like the method above, the Haar-cascade approach is easy to implement and processes images quickly.
Fig 10 Result of applying Haar-cascade in the same environment as training [25]
Fig 11 Result in real environment – low quality image
3.2.3 Convolutional neural network
A convolutional neural network (CNN) is a more advanced method than Haar-cascade in terms of both speed and accuracy. Furthermore, there are many CNN-based variants that outperform earlier versions, such as Faster R-CNN, Mask R-CNN, and ResNet-50. In experiments such as those reported in [27], the detection results are very good compared with the methods mentioned above, for both images and video.
Fig 12 Result of (a) Faster R-CNN, (b) Mask R-CNN, (c) ResNet-50 [27]
3.2.4 YOLO
Compared with the methods mentioned above, YOLO (You Only Look Once) can be considered one of the best current state-of-the-art approaches for meeting real-time requirements while ensuring high performance. In [28], the original version of YOLO is applied to detect vehicles in CCTV video. According to the results reported in [28], YOLO can process up to 45 frames per second, compared with 30 frames per second for the methods in Section 3.2.3. The reported accuracy is around 95% to 100%, which is higher than that of the convolutional neural network approaches.
Furthermore, YOLO has been released in many upgraded versions over time. Each version focuses on increasing performance, reducing training time, and reducing computation costs. A team including members from Vietnam and China applies YOLO v4, an advanced version of YOLO, to detect and count vehicles under mixed traffic conditions [29]. In their reported results, the authors compare YOLO v4 with Haar-cascade and background subtraction, both described in the sections above, using the COCO dataset. Thus, the YOLO approach is a candidate for use in this thesis.
Fig 14 Pre-trained model YOLO v7 in real environment
3.3 Object size measuring methods
3.3.1 Math-based calculating method
The research on traffic density estimation from highly noisy image sources [30] describes four basic steps for this approach. However, only two of them are relevant to this thesis.
Firstly, traditional image processing approaches for automatically detecting road segments do not work well with the authors' collection of noisy image sources. Instead, given a set of traffic images from a single image source, the authors mark a polygon in the image as the region of interest that represents a specific road segment. Basic geometric calibration relating to the position and angle of the camera is also required. In short, as with the methods mentioned above, the ROI is selected manually, not automatically.
Secondly, the authors map pixels to locations along the road segment using a graded measure, obtaining a weighted measure of the pixel intensity distributions that captures the concept of distance. This step requires physical specifications, such as the camera height, the distance from the camera to objects, and the camera angle, in order to evaluate the equations below.
Fig 15 Field of view [30]
Where: H is the height of the camera; C is the starting point of the field of view; d + Xmax is the actual endpoint of the field of view; E is the observed endpoint of the field of view; G is the actual road point under inspection; F is the observed position of G in the image; xi is the actual distance of the road point from C; h∆ is the observed height of the complete road length in the image; hi is the observed height of point G in the image.
The road's endpoints under consideration are C and D. The camera's field of view is represented by ∆ACD. Any distance greater than d + Xmax is outside the camera's field of view. There is a distance d = H * tanθ before the near point of the camera's field of view. The camera's zoom level and image quality affect its field of view and image clarity.
Given that CE has a height of h∆, the original image's scaling factor is h∆/p∆. Any point G on the road corresponds to a projection point (point F) in the simulated scaled-up projection. Let hi be the height of CF. If pi denotes the distance of the pixel corresponding to G from the start of the road section, then we get: hi = h∆ (pi / p∆).
This provides a mapping between the pixel position in the image and the actual distance in the real world. Towards the far end of the field of view, the actual density increases, but its effect on the camera image is reduced. To compensate, the authors developed a density function based on the geometric properties of the road segment and its image. The overall road traffic density is the sum over rows of the pixel count in row i, count(i), multiplied by its weight W(i). They derive the density function as Eq (5).
$$\text{Density } f(x) = \sum_{i} \text{count}(i) \cdot W(i) \qquad (5)$$
Where: i is the row index of the image; count(i) is the pixel count in the i-th row; W(i) is the weight function of the i-th row.
Based on geometric analysis, the authors derive that $W(i) = \frac{x_i + d}{d}$ and, from Fig 15, $\frac{x_i}{h_i} = \frac{x_i + d}{H}$. Thus, they obtain Eq (6). As a result, given H, Xmax and d, together with the input image, the authors can estimate the graded measure of the pixel count for the road segment.
$$\text{Density } f(x) = \sum_{i} \text{count}(i) \cdot \frac{H}{H - \left(\frac{p_i}{p_\Delta}\right) h_\Delta} \qquad (6)$$
$$\frac{X_{max}}{X_{max} + d} = \frac{h_\Delta}{H} \qquad (7)$$
Where: pi is the distance of the pixel from the beginning of the road segment; the image's scaling factor is h∆/p∆; d + Xmax is the actual endpoint of the field of view.
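To make the weighting concrete, the following sketch evaluates the row weight H / (H − (pi / p∆) h∆) and the weighted sum of Eq (5) for a toy example. It is a minimal illustration under stated assumptions, not code from [30]: the function names, the camera height, the angle θ, the visible length Xmax, and the per-row pixel counts are all hypothetical, and p∆ is assumed to be the pixel distance of the farthest road row.

```python
import math

def row_weight(p_i, p_delta, h_delta, H):
    """Weight of an image row, W(i) = H / (H - (p_i / p_delta) * h_delta)."""
    h_i = h_delta * (p_i / p_delta)  # observed height of the row
    return H / (H - h_i)

def weighted_density(counts, p_delta, h_delta, H):
    """Sum of count(i) * W(i); counts[i] is the pixel count at pixel distance p_i = i."""
    return sum(c * row_weight(i, p_delta, h_delta, H) for i, c in enumerate(counts))

# Hypothetical camera set-up (illustrative values only).
H = 10.0                            # camera height in meters
theta = math.radians(45.0)          # angle used in d = H * tan(theta)
d = H * math.tan(theta)             # ground distance before the near point C
X_max = 30.0                        # visible road length in meters
h_delta = H * X_max / (X_max + d)   # Eq (7) solved for h_delta

counts = [8, 4, 2]                  # foreground pixel counts for rows at p_i = 0, 1, 2
p_delta = 2                         # assumed pixel distance of the farthest row
density = weighted_density(counts, p_delta, h_delta, H)
print(round(density, 3))
```

With these values the nearest row has weight 1 and the farthest row has weight (Xmax + d) / d = 4, so distant rows, where each pixel covers more road, contribute more to the density. The sketch also shows why the method depends on H, d, and Xmax being known for every camera.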
In general, this approach can return an almost exact value by comparing the inferred size of the target object in an image with its real size. In addition, the paper [30] illustrates the equations with the geometry shown in Fig 15. However, the main drawback of this method is that it requires many physical specifications, which are hard to obtain because camera installations differ at each intersection and camera brands differ as well. Thus, although the method is mathematically sound, it cannot be applied in this thesis, since more than one intersection is used in the experiments. A method that does not rely heavily on camera specifications is needed. In the next section, the reference object concept is introduced to overcome this issue.
3.3.2 Object reference method
Fig 16 United States quarter is used as reference object [31]
This method allows the thesis to calculate the pixels-per-meter parameter without depending on physical specifications. It can be generalized to all intersections, not just a specific one, just like the Haar-cascade or CNN methods. Despite these advantages, the method has two important requirements. First, the camera view must be perpendicular to the plane so that the object size can be measured and calculated exactly. Second, the reference object should be unique in the image so that it can be detected. Even with these requirements, the method can be applied in this thesis with some modifications that adapt it to real cases at an approximate level.
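The idea can be illustrated with a short sketch: once a reference object of known real size has been measured in pixels, the same ratio converts other pixel measurements to meters. The function names and the numbers below (a 0.5 m wide reference spanning 40 pixels, a vehicle spanning 360 pixels) are hypothetical, and the sketch assumes the ideal case in which the camera view is perpendicular to the plane.

```python
def pixels_per_meter(reference_width_px, reference_width_m):
    """Ratio between the measured pixel width and the known real width of the reference."""
    if reference_width_px <= 0 or reference_width_m <= 0:
        raise ValueError("reference sizes must be positive")
    return reference_width_px / reference_width_m

def pixel_length_to_meters(length_px, ppm):
    """Convert a length measured in pixels to meters using the pixels-per-meter ratio."""
    return length_px / ppm

# Hypothetical measurements taken from a single image.
ppm = pixels_per_meter(reference_width_px=40, reference_width_m=0.5)
vehicle_length_m = pixel_length_to_meters(360, ppm)
print(ppm, vehicle_length_m)
```

Because the ratio is derived from the image itself, no camera height or angle is needed, although perspective distortion makes the result only approximate when the view is not perpendicular.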
3.4 Traffic density calculating
According to the definition in [32], traffic density is a fundamental characteristic that indicates how congested a road is with vehicles. It is given by Eq (8) below.
$$\text{density} = \frac{m}{L} \qquad (8)$$
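As a small worked example of Eq (8), the sketch below assumes the conventional reading in which m is the number of vehicles on a road segment and L is the length of that segment; the function name and the values are hypothetical.

```python
def traffic_density(vehicle_count, segment_length_km):
    """Eq (8): density = m / L, in vehicles per kilometer (assumed units)."""
    if segment_length_km <= 0:
        raise ValueError("segment length must be positive")
    return vehicle_count / segment_length_km

# Hypothetical observation: 15 vehicles on a 0.25 km road segment.
density = traffic_density(15, 0.25)
print(density)  # vehicles per km
```

The calculation presupposes a well-defined segment length L, which is straightforward for a straight road but not for an intersection area.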
Furthermore, some studies, such as [33], [34], and [35], note that the number of vehicles counted in a given ROI (Region of Interest) can be used as a measure of traffic density and categorized into three levels: light, medium, and solid. However, as their experiments show, these studies mostly operate on a road segment, not a region. Thus, Eq (8) cannot be applied to this work, for the following reasons.
First of all, there isn't a standard definition of length or width for intersections that can be used to decide which direction serves as the length when calculating density The direction in which vehicles move must be taken into account in order to solve this problem However, it becomes impossible to precisely estimate the vehicle direction for defining the length direction depending on the camera angle and the existence of various flows of vehicles at the crossing Additionally, it might be difficult to estimate the length of the road segment for crossroads like T-junctions or six-way intersections since we must take the entire area into account unless we divide the intersection into extremely small road segments for processing