Master's thesis in Computer Science: Research and develop solutions to estimate traffic density from traffic cameras at main intersections


BUI MINH HIEU

RESEARCH AND DEVELOP SOLUTIONS TO ESTIMATE TRAFFIC DENSITY FROM TRAFFIC CAMERAS AT MAIN INTERSECTIONS


THIS THESIS IS COMPLETED AT

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Assoc. Prof. Ph.D. Tran Minh Quang

Examiner 1: Assoc. Prof. Ph.D. Nguyen Van Vu

Examiner 2: Assoc. Prof. Ph.D. Nguyen Tuan Dang

This master’s thesis was defended at HCM City University of Technology, VNU-HCM City, on July 11th, 2023

Master’s Thesis Committee:

(Please write down full name and academic rank of each member of the Master’s Thesis Committee)

1. Chairman: Assoc. Prof. Ph.D. Le Hong Trang

2. Secretary: Ph.D. Phan Trong Nhan

3. Reviewer 1: Assoc. Prof. Ph.D. Nguyen Van Vu

4. Reviewer 2: Assoc. Prof. Ph.D. Nguyen Tuan Dang

5. Member: Assoc. Prof. Ph.D. Tran Minh Quang

Approval of the Chairman of the Master’s Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any)

CHAIRMAN OF THESIS COMMITTEE HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM Independence – Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

I THESIS TITLE: RESEARCH AND DEVELOP SOLUTIONS TO ESTIMATE TRAFFIC DENSITY FROM TRAFFIC CAMERAS AT MAIN INTERSECTIONS (NGHIÊN CỨU, XÂY DỰNG CÁC PHÉP ƯỚC LƯỢNG MẬT ĐỘ GIAO THÔNG DỰA VÀO DỮ LIỆU CAMERA Ở NHỮNG NÚT GIAO THÔNG QUAN TRỌNG).

II TASKS AND CONTENTS :

The objective of this thesis is to research and develop a method for estimating traffic density at intersection areas using images or videos. Accordingly, the tasks involved in this work include determining the calculation method and evaluating traffic density for a given area, comparing existing papers and systems with the proposed method to identify any differences, proposing solutions to address challenges, and advancing the calculation method of the thesis. Additionally, a prototype system will be designed to demonstrate the functionality.

III THESIS START DAY : (According to the decision on assignment of

(Full name and signature)

CHAIR OF PROGRAM COMMITTEE

(Full name and signature)

HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

(Full name and signature)


ACKNOWLEDGEMENT

First and foremost, I would like to express my heartfelt gratitude to Assoc. Prof. Ph.D. Tran Minh Quang, the instructor who guided and helped me throughout the completion of my thesis.

I would like to thank the Computer Science and Engineering instructors, as well as the teachers who generously shared their knowledge with me throughout my time at the HCM City University of Technology. Finally, I would like to thank my family and friends, who have always supported and encouraged me throughout the process of doing this research.

HCM City, June 8th 2023

Bui Minh Hieu


ABSTRACT

Traffic congestion has become a pressing issue, not only for the government but also for the general public, as it directly impacts the quality of life. It is necessary to research and develop solutions to manage traffic conditions and minimize the impact of congestion. With this in mind, this study provides readers with a method to estimate traffic conditions, specifically traffic density at intersections. To achieve this, the research team divided the task into two main components: vehicle counting and area measurement. Vehicle counting is relatively simple with the assistance of modern technologies and techniques. In contrast, measuring the area is more challenging, and this study offers readers methods to calculate the area of a region from images, without relying on camera specifications, based on the concept of a reference object. Additionally, through this research, we present the design of the testing system, the challenges faced and the accompanying solutions during the design process, as well as the results achieved when the system was applied in a real-world environment. Finally, alongside the expected outcomes, some limitations need to be discussed and addressed in the future.

Keywords: Traffic density, Area of a region, Object reference


SUMMARY OF THE MASTER'S THESIS

Traffic congestion has become a pressing problem, not only for the government but also for residents, because it directly affects quality of life. Researching and developing traffic management solutions that reduce the impact of congestion has become more necessary than ever. To better understand this problem, this study provides readers with a method for estimating and assessing traffic conditions, specifically for computing traffic density at intersections. To achieve this goal, the research team divided the work into two main parts: counting vehicles and measuring the area of the region. Counting vehicles is fairly simple with the support of modern technologies and techniques. In contrast, measuring the area of a region is harder and more challenging, so this study proposes methods for computing the area of a region from images, without relying on camera parameters, based on the concept of a reference object. In addition, through this study we present the system architecture we designed, the challenges encountered while running the system and the accompanying solutions, as well as the results obtained when the system was applied in a real environment. Finally, alongside the expected results, some limitations need to be discussed and addressed in the future.

Keywords: Traffic density, Area of a region, Object reference


THE COMMITMENT

I confirm that this is my own research. The data used throughout the analysis in this thesis has a clear and transparent provenance and was released in compliance with scientific research standards and ethics. In this thesis, I have presented the results of my study openly and fairly. The thesis results are presented in this report for the first time and have not been published in any earlier thesis.

HCM City, June 8th 2023

Bui Minh Hieu


II THEORETICAL BASIS

2.1 Definition of traffic density

2.2 Definition of level of service (LOS)

2.3 Definition of computer vision - machine learning

2.3.1 Haar-cascade

2.3.1.1 Calculating Haar Features

2.3.1.2 Creating Integral Images

2.3.1.3 Adaboost Training

2.3.1.4 Implementing Cascading Classifiers

2.3.2 Convolutional neural network models

2.3.3.2 Bounding box regression

2.3.3.3 Intersection over union (IOU)

2.4 Definition of pixel per meter

III RELATED WORKS

3.1 Traffic situation in Ho Chi Minh City

3.1.1 Overview of traffic situation in Ho Chi Minh City

3.2.4 YOLO

3.3 Object size measuring methods

3.3.1 Math-based calculating method

3.3.2 Object reference method

3.4 Traffic density calculating

IV PROPOSED SOLUTIONS

4.1 Calculate traffic density

4.2 Vehicle counting and categorizing

4.2.1 Vehicle counting

4.2.2 Vehicle classifying and converting

4.3 Calculate intersection area

4.3.1 Distance-based method

4.3.2 Mean-based method

V EXPERIMENTAL RESULT

5.1 Experiment setup environment

5.2 Experimental system architecture

5.2.1 Data collection module

5.2.2 Training server module

5.2.3 Data analysis

5.2.4 Diagnosis

5.2.4.1 Obtain real area approach

5.2.4.2 Calculate error rate

5.3 Result from training and detecting

5.4 Result from inferring intersection area

5.5 Result from evaluating traffic density

VI DISCUSSION AND FURTHER RESEARCH

6.1 Achievement

6.2 Limitation of the study

6.3 Recommendations for further research

PUBLICATIONS

REFERENCES


I INTRODUCTION

1.1 Research problem

Urbanization is understood as the process of urban expansion, expressed as the percentage of urban area or urban population relative to the total area or population of a region. Moreover, urbanization is also considered a major development process that improves quality of life, maintains a balanced population, controls population density, and so on.

By the end of June 2021, the coverage rate of urban zoning planning relative to construction land area in urban areas across the country had reached about 53%, of which the 2 special urban areas (Hanoi and Ho Chi Minh City) and 19 grade I cities reached about 80–90%, and urban areas of grades II, III, and IV about 40–50%. The detailed coverage rate of urban planning is about 39% compared to the area of construction land [1]. According to several recent reports, the urbanization of Vietnam is proceeding at a gradually increasing pace, with the percentage of the country’s coverage reaching 40% in 2019 [2]. This process of urbanization brings many benefits to a country, such as accelerating economic growth, shifting labor and economic structures, and changing population distribution. Cities are not only big consumers of goods but also places that create job opportunities and income for workers [3].

Consequently, the need for travel has led to an increase in the number of vehicles, which, together with the rapid population growth of large cities, has increased traffic congestion. Traffic jams have always been a nuisance for citizens of urbanized cities, since they make travel uncomfortable, and for Vietnam's government, since solving the problem costs a lot of money and planning effort, while leaving it as is would be very dangerous: transport within Vietnam will be delayed and the economy will be affected [3]. It is estimated that traffic jams in one of the most urbanized cities, Ho Chi Minh City, can cost the government budget up to 6 billion USD annually [4], and also cost road users through the gasoline wasted in traffic jams [5]. Furthermore, the government has made enormous investments in the installation of closed-circuit television (CCTV) camera systems, although their full potential has not been realized.

As a result, the research team must develop ways to evaluate traffic and utilize the capabilities of these cameras. Many kinds of metrics are used to measure the level of traffic on roadways and in particular areas. In this study, the team focuses on traffic density. Two quantities must be considered when calculating traffic density: the number of vehicles and the area of the region where the counting takes place, in this case, the junction area. Several studies and modern technologies, particularly machine learning, have been devoted to the vehicle counting problem. On the other hand, calculating the area of an intersection presents a different level of complexity. Each camera has different attributes and is positioned at a different height and location, making the calculation challenging and requiring significant effort to collect these parameters. Additionally, so that the solution can be applied conveniently in practical settings and across the majority of intersections, we aim to find a solution that generalizes the aforementioned problem, which means calculating the area without relying on those specific technical specifications.

To solve the preceding issue, we propose an approach that relies on a reference object [31]. Using a reference object is a strategy that utilizes the known size of one object in a scene to estimate the size of another. It was discovered in this study that reference objects have dynamic rather than static attributes. For the computations, we use vehicles in traffic, which are frequently recognized, as reference objects. Based on that concept, we obtained numerous promising results from this study:

- Propose a way to calculate traffic density at intersections

- Propose a way to count vehicles appropriately

- Propose solutions for calculating the region’s area and evaluate them in real-world circumstances

- Develop an experimental system for practical application


1.2 Objectives of the topic

The objectives of the topic are as follows:

- Proposing a way to calculate traffic density at intersections

- Comparing and selecting appropriate machine learning models for determining the number of vehicles in a given region and categorizing them as motorbikes, cars, trucks, and so on

- Proposing methods to convert the intersection area from pixels to square meters

- Developing an experimental system for collecting and processing data

1.3 Scope of study

The main scientific idea of the project is to research and develop solutions to estimate traffic density from traffic cameras or videos at main intersections. The data used in this study is collected directly from the Transportation Facilities’ CCTV cameras. Ho Chi Minh City is the area used as the research target. The study focuses on evaluating traffic density at intersections using images and on solving problems related to calculating traffic density.

1.4 Scientific and practical significances

This study benefits the field of traffic density estimation by using an image-based technique, which has both scientific and practical implications. It provides insightful knowledge of the unique traffic patterns seen in Ho Chi Minh City while also outlining feasible alternatives for strategic planning and traffic management within the city's transportation infrastructure.

1.4.1 Practical significance

Regarding practical significance, there are several factors to consider:

- Real-time traffic information: the suggested technique might contribute to the development of real-time traffic information systems. The system can give drivers up-to-date information by continually measuring traffic density at intersections. This allows them to choose the best route and avoid crowded regions, increasing traffic efficiency and cutting down on travel times.

- Resource allocation and infrastructure planning: accurate estimates of traffic density can help transportation agencies allocate resources effectively. Authorities can prioritize investments in infrastructure enhancements, such as more lanes, traffic signal optimization, or intelligent transportation systems, by identifying junctions with high traffic density.

- Congestion estimation tool: this research contributes to the practical aspect of traffic engineering by providing a tool for identifying congestion at intersections, which helps transportation agencies implement strategies to reduce traffic congestion in Ho Chi Minh City.

- Intelligent Transportation Systems (ITS) development: the suggested technique can help with the development of Intelligent Transportation Systems (ITS) by accurately estimating traffic density at crossings. By incorporating the traffic density data into an ITS framework, transportation agencies may improve the overall effectiveness and efficiency of traffic management, including adaptive traffic signal control, incident detection, and real-time traffic information distribution. This might result in better traffic flow, a decrease in congestion, and better overall performance of the transportation system in Ho Chi Minh City.

- Application of machine learning algorithms: the study uses machine learning techniques to identify objects in traffic images. This aspect of the research highlights the use of machine learning methods in the transportation industry, showing the potential of artificial intelligence to address real traffic issues.

- Validation and comparative analysis of methods for calculating a region’s area: the study reviews and contrasts several approaches to determining the total area of a region. By comparing the proposed strategies with existing ones, the study advances our understanding of the advantages and disadvantages of various methods. This comparative analysis may inform the decisions of researchers and practitioners about the selection and use of suitable traffic density estimation methodologies in various urban settings.

- Application to urban areas: although the research focuses on estimating traffic density in Ho Chi Minh City, it has the potential to be generalized to other cities that are experiencing comparable traffic problems. By addressing the unique traffic circumstances and complexity of Ho Chi Minh City, the research offers insights and approaches that may be utilized in other cities. This broadens the scope of the scientific relevance beyond a particular city and helps in the development of practical traffic management plans for different urban settings.


II THEORETICAL BASIS

In this chapter, we present the theoretical foundations that our research team relied on as a basis for developing the solutions in this study. We will apply the appropriate theories, and evaluate and consider the theories that are not suitable for the context of the problem, in Chapter III.

2.1 Definition of traffic density

The number of vehicles occupying a particular length of highway in a traffic lane is referred to as traffic density. It is measured in vehicles per mile, vehicles per kilometer, or vehicles per meter. For example, if four automobiles are observed in 500 feet, the traffic density is 42.24 automobiles per mile. The volume of traffic is inversely proportional to the density: if the density is lower, the speed and traffic volume will be higher, and if the density is higher, the speed and traffic volume will be lower. When traffic congestion occurs at a certain spot, we may consider expanding the road, building a flyover, or installing an underpass based on the peak-hour traffic flow [18].
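The worked example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `density_per_mile` and the constant `FEET_PER_MILE` are our own, not part of any cited source.

```python
FEET_PER_MILE = 5280  # one mile expressed in feet


def density_per_mile(vehicle_count, segment_length_ft):
    """Vehicles per mile for one lane, from a count over a segment given in feet."""
    if segment_length_ft <= 0:
        raise ValueError("segment length must be positive")
    return vehicle_count * FEET_PER_MILE / segment_length_ft


# Four automobiles observed over 500 feet, as in the example above.
print(density_per_mile(4, 500))  # 42.24
```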

2.2 Definition of level of service (LOS)

According to Wikipedia and the Oxford dictionary, level of service (LOS) is a qualitative measure used to assess the quality of motor vehicle traffic service. LOS is used to analyze roads and intersections by classifying traffic flow and assigning quality levels of traffic based on performance measures such as vehicle speed, density, congestion, etc. In a general context, all services in the asset management sector may fall under the umbrella of levels of service [7].

In general, the standards used to assess the level of traffic congestion vary between nations and cities. However, they are all based on the same primary factors, namely average speed and traffic flow on the road, along with secondary factors like waiting times at nodes, service levels, state persistence times, and queue length. In Vietnam, so far, there have been no specific criteria for determining traffic congestion based on the basic parameters of traffic flow (velocity, volume, density, etc.) [8], [9].


2.3 Definition of computer vision - machine learning

Computer vision is the technique of using computers to understand digital images and videos. It aims to automate tasks that human vision can perform. Techniques for acquiring, processing, and understanding digital images, as well as retrieving data from the real world, are used to create information. It also has subcategories like object recognition, video tracking, and motion prediction, making it helpful for navigation, visualization of objects, and other applications.

Machine learning, a branch of artificial intelligence, is the study of algorithms and statistical models that systems use to carry out a task without explicit instructions, relying instead on patterns and inference. As a result, it applies to pattern recognition, software engineering, and computer vision. Computers accomplish machine learning with only a little assistance from software programmers. Data is used to make decisions, and data may be used in a variety of ways across fields. Learning can be divided into three categories: supervised learning, semi-supervised learning, and unsupervised learning.

Computer vision and machine learning are two fields that have developed close ties. Computer vision for tracking and recognition has improved because of machine learning, which provides efficient acquisition, image processing, and object focus techniques that are applied in computer vision. In turn, computer vision has expanded the application of machine learning. A computer vision system includes a digital image or video, a sensor device, an interpreting device, and the stage of interpretation. Machine learning is used in the interpreting device and the interpretation stage [12].

2.3.1 Haar-cascade

This technique was first published by Paul Viola and Michael Jones in their 2001 study, Rapid Object Detection Using a Boosted Cascade of Simple Features [11], which has since become one of the most cited papers in computer vision research. This technology allowed for real-time object detection in video feeds. Viola and Jones concentrated on detecting faces in images. However, the framework can be used to develop detectors for any "things," including bananas, cooking utensils, buildings, automobiles, and structures.

In general, Haar cascade is a sliding-window technique that computes features in each window and decides whether the window may contain an object [11], [12]. Although the Viola-Jones framework undoubtedly paved the way for object detection, subsequent approaches, such as deep learning and histogram of oriented gradients (HOG) + linear SVM, have far surpassed it.

The Haar cascade algorithm is divided into four stages: calculating Haar features, creating integral images, Adaboost training, and implementing cascading classifiers. However, it is important to remember that this approach, like other machine learning models, requires a large number of positive images (containing the object) and negative images (not containing it) to train the classifier.

2.3.1.1 Calculating Haar Features

The collection of Haar features is the initial phase, as shown in Fig 1. To put it simply, a Haar feature is a calculation performed on adjacent rectangular regions at a specific position in a detection window. The calculation sums the pixel intensities in each region and computes the difference between the sums. However, computing these features across a large image can be expensive. Integral images come into play here, since they reduce the number of operations required.

Fig 1 Some examples of Haar feature’s types [11]


2.3.1.2 Creating Integral Images

According to Paul Viola and Michael Jones [11], rectangle features can be computed quickly using an intermediate image representation known as the integral image.

Fig 2 The value of the integral image at position 1 is the sum of the pixels in rectangle A. The sum within D can be computed as (4 + 1 - (2 + 3)) [11]
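To make the 4 + 1 - (2 + 3) rule in Fig 2 concrete, the following minimal sketch builds an integral image in plain Python and uses it to evaluate a rectangle sum and a simple two-rectangle Haar feature. The function names and the zero-padded layout are our own illustrative choices, not taken from [11].

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] is the sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four lookups: 4 + 1 - (2 + 3)."""
    bottom, right = top + height, left + width
    return ii[bottom][right] + ii[top][left] - ii[top][right] - ii[bottom][left]


def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle Haar feature: left half minus right half (width must be even)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
ii = integral_image(image)
print(rect_sum(ii, 1, 1, 2, 2))          # 6 + 7 + 10 + 11 = 34
print(two_rect_feature(ii, 0, 0, 2, 4))  # (1+2+5+6) - (3+4+7+8) = -8
```

Once the table is built, every rectangle sum costs four lookups regardless of the rectangle's size, which is why the integral image reduces the number of operations.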


2.3.1.3 Adaboost Training

In practice, Adaboost chooses the best features and trains the classifiers to use them. It combines many "weak classifiers" to create a "strong classifier" that the algorithm can use to detect objects.

Weak learners are produced by moving a window across the input image and computing the Haar features for each area of the image. Each feature value is compared against a trained threshold that separates objects from non-objects. Because these are "weak classifiers," a large number of Haar features is necessary to produce a strong classifier. The last phase, which employs cascade classifiers, combines these weak learners into a strong learner.

Fig 3 Representation of a boosting algorithm [11]
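The idea of combining thresholded features into a strong classifier can be sketched with a generic discrete AdaBoost over decision stumps. This is a textbook-style illustration under our own assumptions (labels are +1 and -1, each stump thresholds one feature value), not the exact training procedure of [11]; all names are hypothetical.

```python
import math


def stump(features, index, threshold, polarity):
    """Weak classifier: compare one feature value against a threshold."""
    return polarity if features[index] >= threshold else -polarity


def train_adaboost(samples, labels, rounds):
    """Return a list of (alpha, index, threshold, polarity) weak classifiers."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for index in range(len(samples[0])):
            for threshold in sorted({s[index] for s in samples}):
                for polarity in (1, -1):
                    error = sum(
                        w for w, s, y in zip(weights, samples, labels)
                        if stump(s, index, threshold, polarity) != y
                    )
                    if best is None or error < best[0]:
                        best = (error, index, threshold, polarity)
        error, index, threshold, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, index, threshold, polarity))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [
            w * math.exp(-alpha * y * stump(s, index, threshold, polarity))
            for w, s, y in zip(weights, samples, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, features):
    """Strong classifier: sign of the alpha-weighted vote of the weak classifiers."""
    score = sum(a * stump(features, i, t, p) for a, i, t, p in ensemble)
    return 1 if score >= 0 else -1


samples = [[1.0], [2.0], [3.0], [4.0]]   # one feature value per window
labels = [-1, -1, 1, 1]                  # -1 = non-object, +1 = object
model = train_adaboost(samples, labels, rounds=2)
print([predict(model, s) for s in samples])
```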

2.3.1.4 Implementing Cascading Classifiers

The cascade classifier consists of several stages, each of which is made up of a group of weak learners. The weak learners are trained with boosting, which allows a highly accurate classifier to be built from the mean prediction of all weak learners.

Based on this prediction, the classifier either decides to indicate that an object was identified (positive) or moves on to the subsequent region (negative).


Stages are designed so that negative samples can be rejected as quickly as possible, because the majority of the windows do not contain anything of interest. Classifying an object as a non-object would significantly hamper the object detection method; hence, it is crucial to keep the false-negative rate low. Before utilizing the Haar cascade, it is important to keep in mind that training the model with the proper hyperparameters is necessary.
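The early-rejection behavior described above can be sketched as follows. Here each stage is assumed to be a pair of a scoring function and a threshold; the toy stages built from `sum` and `max` are purely hypothetical stand-ins for boosted stage classifiers.

```python
def cascade_detect(window, stages):
    """Run a window through the stages; stop at the first stage that rejects it.

    Returns (is_object, stages_passed).
    """
    for stage_index, (score_fn, threshold) in enumerate(stages):
        if score_fn(window) < threshold:
            return False, stage_index  # rejected early; later stages are skipped
    return True, len(stages)


# Toy stages: a cheap test first, a stricter one afterwards.
example_stages = [(sum, 10), (max, 8)]

print(cascade_detect([1, 2, 3], example_stages))  # (False, 0)
print(cascade_detect([5, 6, 7], example_stages))  # (False, 1)
print(cascade_detect([5, 6, 9], example_stages))  # (True, 2)
```

Because most windows fail an early stage, the expensive later stages run on only a small fraction of the image.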

2.3.2 Convolutional neural network models

Convolutional neural networks (CNNs), a type of artificial neural network inspired by the architecture of the animal visual cortex, have been dominating various computer vision tasks and have drawn interest in a range of areas, including imaging, according to [13]. A CNN uses building blocks such as convolution layers, pooling layers, and fully connected layers to learn spatial hierarchies of features automatically and adaptively through backpropagation. The first two, convolution and pooling layers, are used to extract features, while the third, the "fully connected layer," is used to map the extracted features into the final output, such as classification and detection. The convolution layer is a key component of a CNN and is made up of a stack of mathematical operations such as convolution, a form of linear operation.

As stated in [13], pixel values in digital images are stored in a two-dimensional (2D) grid, i.e., an array of numbers, and a small grid of parameters called the kernel, an optimizable feature extractor, is applied at each image location. Because a feature may be found anywhere in the image, CNNs are particularly efficient for image processing. Extracted features can evolve hierarchically and become progressively more complex as one layer feeds its output into the next layer. Training is the process of adjusting parameters such as kernels to reduce the difference between outputs and ground truth labels using optimization algorithms like backpropagation and gradient descent, among others.

Fig 4 An overview of a convolutional neural network (CNN) architecture and the training process [13]

2.3.2.1 Convolution layer

Convolutional layers perform a convolution operation on the input and pass the output to the following layer. A convolution combines all of the pixels in its receptive field into a single value. For example, applying a convolution to an image reduces the image size while combining all of the information in the field into a single pixel. The convolutional layer's final output is a feature map. Several types of convolutions may be employed depending on the problem to be solved and the features to be learned [14].
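As an illustration of how a kernel combines a receptive field into one value, the sketch below applies a kernel without padding and with stride 1 (the sliding dot product used in CNN layers). It is a plain-Python teaching example with a hand-picked kernel, not an excerpt from any framework.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0
            # Weighted sum over the receptive field at (y, x).
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        output.append(row)
    return output


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
print(conv2d_valid(image, kernel))  # [[-4, -4], [-4, -4]]
```

A 3 x 3 input and a 2 x 2 kernel give a 2 x 2 output, which shows the size reduction mentioned above.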

2.3.2.2 Pooling layer

According to [14], a pooling layer performs a typical down-sampling operation, reducing the in-plane dimensionality of the feature maps to introduce translation invariance to small shifts and distortions and to reduce the number of subsequent learnable parameters. It is worth noting that pooling layers have no learnable parameters, although filter size, stride, and padding are hyperparameters in pooling operations, as in convolution operations.
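Max pooling is one common pooling operation. The sketch below assumes a 2 x 2 window with stride 2 and no padding; the function name and defaults are our own choices for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Down-sample a feature map by taking the maximum of each size x size window."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[y + dy][x + dx] for dy in range(size) for dx in range(size))
            for x in range(0, w - size + 1, stride)
        ]
        for y in range(0, h - size + 1, stride)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(max_pool2d(feature_map))  # [[6, 8], [14, 16]]
```

The function has only hyperparameters (`size`, `stride`) and nothing to learn, which matches the remark that pooling layers have no learnable parameters.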


2.3.2.3 Fully connected layer

Finally, the output feature maps of the final convolution or pooling layer are typically flattened, that is, turned into a one-dimensional (1D) array of numbers (a vector). They are then connected to one or more fully connected layers, also called dense layers, in which every input is connected to every output by a learnable weight [14]. Once the features extracted by the convolution layers and down-sampled by the pooling layers have been formed, they are mapped by a subset of fully connected layers to the network's final outputs, such as the probabilities for each class in classification tasks. The number of output nodes in the final fully connected layer is normally equal to the number of classes. Each fully connected layer is followed by a nonlinear function such as ReLU.
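The flatten, dense, and activation steps can be sketched in plain Python as below. The weights are arbitrary illustrative numbers, and the use of softmax to turn final scores into class probabilities is a common convention that we assume here rather than something stated in [14].

```python
import math


def flatten(feature_maps):
    """Turn a list of 2D feature maps into one 1D list of numbers."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: every input is linked to every output by a weight."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Nonlinear activation applied after a fully connected layer."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert final-layer scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


x = flatten([[[1, 2], [3, 4]]])                                  # [1, 2, 3, 4]
hidden = relu(dense(x, [[1, 0, 0, 0], [0, 0, 0, 1]], [0.5, -10]))  # [1.5, 0.0]
probabilities = softmax(dense(hidden, [[1, 0], [0, 1]], [0, 0]))   # two classes
print(x, hidden, probabilities)
```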

2.3.3 YOLO

YOLO is an abbreviation of "You Only Look Once", first described in the seminal 2015 paper by Joseph Redmon et al. [15]. This method detects and locates various objects in images in real time. YOLO treats object detection as a regression problem and outputs the class probabilities of the detected objects.

To detect objects in real time, YOLO utilizes convolutional neural networks (CNNs). As the name implies, the approach requires just one forward propagation through a neural network to identify objects. This means that prediction for the entire image is done in a single algorithm run. The CNN is used to predict several class probabilities and bounding boxes at the same time. The YOLO algorithm has various variants, such as YOLO v1, YOLO v2 (YOLO9000) in 2016, YOLO v3 in 2018, YOLO v4 in 2020, etc. Nowadays, YOLO is very well known for object detection tasks, which makes it well suited to this thesis. The reasons are as follows:

- Speed: The real-time item prediction capability of this method increases the speed of detection

- High accuracy: YOLO is a prediction approach that produces accurate findings with low background noise


- Learning capabilities: The algorithm has excellent learning capabilities that allow it to pick up on object representations and use them to its advantage when detecting objects

In general, this approach includes three techniques: residual blocks, bounding box regression, and intersection over union (IOU)

2.3.3.1 Residual blocks

The image is first divided into a grid of cells, each with equal dimensions. Figure 5 below depicts how an input image is divided into equal grid cells. Each grid cell detects objects that appear within it. For example, if an object's center falls within a certain grid cell, that cell is responsible for detecting it.

Fig 5 An image is divided into grids. Green grid cells detect the pedestrian [40]
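The rule that the cell containing an object's center is responsible for it can be expressed in a few lines. The 448 x 448 image size and 7 x 7 grid below are example values chosen for illustration, and `responsible_cell` is a hypothetical helper name.

```python
def responsible_cell(center_x, center_y, image_width, image_height, grid_size):
    """Return (row, col) of the grid cell that contains the object's center."""
    col = min(int(center_x * grid_size / image_width), grid_size - 1)
    row = min(int(center_y * grid_size / image_height), grid_size - 1)
    return row, col


# An object centered at pixel (100, 300) in a 448 x 448 image with a 7 x 7 grid.
print(responsible_cell(100, 300, 448, 448, 7))  # (4, 1)
```

The `min` clamp keeps a center lying exactly on the right or bottom image border inside the last cell.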

2.3.3.2 Bounding box regression

A bounding box is an outline that highlights an object in an image. Every bounding box in YOLO has the following attributes: width (bw); height (bh); class (person, automobile, traffic light, etc.), represented by the letter c; and bounding box center (bx, by). YOLO predicts the height, width, center, and class of objects using a single bounding box regression. Figure 6 below depicts an example of a bounding box. The bounding box is represented with a yellow outline.


Fig 6 The information provided by a bounding box [40]
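Since overlap computations usually work with corner coordinates, a box given by its center (bx, by), width bw, and height bh is often converted to [x1, y1, x2, y2] form first. The helper below is a small sketch of that conversion; its name is our own.

```python
def center_to_corners(bx, by, bw, bh):
    """Convert (center x, center y, width, height) to (x1, y1, x2, y2)."""
    return (bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2)


print(center_to_corners(50, 40, 20, 10))  # (40.0, 35.0, 60.0, 45.0)
```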

2.3.3.3 Intersection over union (IOU)

Intersection over union (IOU) is a measure used in object detection that describes how boxes overlap. YOLO uses IOU to produce an output box that fits the object well. Each grid cell predicts bounding boxes and their confidence scores.

IOU is mostly used in object detection applications, where we train a model to produce a box that precisely encloses an object. In YOLO implementations, IOU is also used in non-maximum suppression, which removes boxes positioned around the same object, keeping the box with the higher confidence [16].

The concept for calculating IOU is illustrated in Fig 7. Let us assume that box 1 is represented by [x1, y1, x2, y2], and box 2 is represented by [x3, y3, x4, y4]. Following Fig 8, IOU is calculated as in Eq (4):

$$\mathrm{IOU} = \frac{\text{Area of intersection of two boxes}}{\text{Area of union of two boxes}} \quad (4)$$


Fig 7 Given two boxes, the dark area is intersection area [16]

Fig 8 IOU equation is presented in picture form [16]
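Using the box notation above, [x1, y1, x2, y2] and [x3, y3, x4, y4], IOU and a greedy non-maximum suppression can be sketched as follows. The 0.5 threshold and the function names are illustrative assumptions, not values taken from [16].

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box_a
    x3, y3, x4, y4 = box_b
    inter_w = max(0.0, min(x2, x4) - max(x1, x3))
    inter_h = max(0.0, min(y2, y4) - max(y1, y3))
    intersection = inter_w * inter_h
    union = (x2 - x1) * (y2 - y1) + (x4 - x3) * (y4 - y3) - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In the usage example, the second box overlaps the first with an IOU of 81/119, about 0.68, so it is suppressed in favor of the higher-confidence box.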

2.4 Definition of pixel per meter

According to Techopedia, a pixel is the smallest unit of a digital image or graphic that can be displayed and represented on a digital display device. Pixels are combined to make a complete image, video, text, or anything else viewable on a computer monitor. A pixel is also known as a picture element (pix = picture, el = element), and it can be used as a countable unit in an image [19].


Pixel measurement refers to counting pixels, or calculating the density of pixels with some specific value, in an image. Given enough additional information, a real size or distance can be calculated from the picture alone. According to [20], pixels per meter (PPM) is a measurement that indicates how much potential visual detail a camera provides at a particular distance.

Digital pictures are divided into grids or blocks of pixels, with each block indicating the color of a specific spot in the image. The PPM value determines the number of divisions or points per meter captured by the camera. A higher PPM suggests that the image may include more detailed information. While PPM represents the amount of detail a camera can record at a particular distance, it does not by itself determine the image's sharpness or contrast. These depend on the quality of the lens, the subject's distance, and the atmospheric conditions between the subject and the camera. Two cameras with the same PPM rating may produce different levels of detail because of lens sharpness, and two cameras with the same PPM at different distances will likely show different amounts of perceived detail, since the longer-range shot suffers more atmospheric interference [20].
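To make the dependence of PPM on distance concrete, the sketch below uses a common working definition that is an assumption here, not a formula quoted from [20]: PPM is the horizontal pixel count divided by the width of the scene covered at the target distance, with the scene width derived from the horizontal field-of-view angle. The function names and the example camera values are hypothetical.

```python
import math


def scene_width_m(distance_m, hfov_deg):
    """Width of the scene (in meters) covered at a given distance from the camera."""
    return 2.0 * distance_m * math.tan(math.radians(hfov_deg) / 2.0)


def pixels_per_meter(horizontal_pixels, distance_m, hfov_deg):
    """Assumed definition: horizontal resolution divided by scene width."""
    return horizontal_pixels / scene_width_m(distance_m, hfov_deg)


# Hypothetical camera: 1920 pixels wide with a 90-degree horizontal field of view.
near = pixels_per_meter(1920, distance_m=10.0, hfov_deg=90.0)
far = pixels_per_meter(1920, distance_m=40.0, hfov_deg=90.0)
print(round(near, 2), round(far, 2))  # 96.0 24.0
```

Under this assumption the same camera delivers four times fewer pixels per meter when the subject is four times farther away.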


III RELATED WORKS

3.1 Traffic situation in Ho Chi Minh City

3.1.1 Overview of traffic situation in Ho Chi Minh City

According to statistics, the population of Ho Chi Minh City has grown by about 2 million people in the last ten years; by 2021, the city's population had reached more than 9 million, making it the most populous city in the country. This causes a range of issues for the city, including traffic congestion [21].

According to Department of Transportation data, the overall length of roads in the city is 4,734 km, there are 1,160 road bridges, and the total road surface area is 50.7 million m². The share of land used for traffic is 12.76%, 10% lower than the norm. The road density is 2.26 km/km², which is less than one-fifth of the benchmark and lower than in similar cities such as Bangkok, Taipei, and Singapore [21].

According to statistics from the Department of Transportation, more than 1,000 new vehicles are registered every day, including around 221 automobiles and 804 two-wheelers. Furthermore, by the end of the third quarter of 2022, Ho Chi Minh City was managing 8.7 million vehicles, comprising over 850,000 automobiles and approximately 7.8 million two-wheelers. The overall number of vehicles grew by 3.1% compared with the same period in 2021 (cars rose by 7.2% and two-wheelers by 2.7%) [21].

According to the Department of Transportation, this puts strain on the infrastructure system, resulting in increasingly severe traffic congestion. There are now 18 locations in the area at risk of traffic congestion, of which 3 have improved, 8 have changed but remain complex, and 7 have not changed.

From the beginning of the year to the end of the third quarter of 2022, Ho Chi Minh City researched and proposed three solutions to combat traffic congestion: collecting tolls from cars entering the city center; prohibiting passenger vehicles with more than 30 seats from entering the inner city during set time frames; and building 16 intersections throughout the city. Ho Chi Minh City is also pushing Ring Road projects 2, 3, and 4 to keep vehicles from passing through the city center and so minimize traffic congestion [22]. According to the Department of Transport, the key tasks in 2023 for reducing traffic jams include continuing to implement the project of developing traffic infrastructure in Ho Chi Minh City between 2020 and 2030; increasing public passenger transport in conjunction with controlling the use of personal motor vehicles in city traffic; and focusing on urban planning and development oriented towards public transportation (TOD model) [22].

3.1.2 Statistics of damage

The following data were presented by a representative of Ho Chi Minh City's Department of Transport on July 12 at a conference summarizing the implementation of the Politburo's Resolution 53/2005 and Conclusion 27/2012 on socio-economic development, national defense, and security in the Southeast region and the Southern Key Economic Zone to 2010 and Ho Chi Minh City's orientation to 2020.

According to Phan Cong Bang, Deputy Director of the Ho Chi Minh City Department of Transport, traffic congestion costs the city around USD 6 billion each year, based on estimates by the Department of Transport of Ho Chi Minh City and the Institute of Development Studies [23]. According to Mr Bang, Ho Chi Minh City accounts for almost a quarter of the entire country's traffic figures, indicating high traffic strain. He cited a seaport cargo volume of 174 million tons, accounting for 26% of the country's total; passenger traffic through Tan Son Nhat airport of 41.1 million passenger trips per year, approximately 25% of the country's total; and approximately 8.7 million registered vehicles, approximately 26% of the country's total [23].


3.2 Vehicle detection and counting approaches

The sections below introduce some methods and technologies for detecting and counting vehicles.

3.2.1 Foreground extraction

In this approach, the authors use background subtraction to extract objects. The method finds pixels that belong to a moving vehicle. Specifically, a background image of a street with no vehicles is used, and the current video frame is converted from a color (RGB) image to a gray-scale image. The gray intensity of the background picture is then subtracted from that of the current frame for each pixel (x, y). The absolute difference is stored at the corresponding position in another picture, known as the difference image [24].

According to [24], the next step after subtraction is identifying the region of interest (ROI). The ROI is a manually set, isolated zone where vehicle counting is carried out. A pixel threshold is then applied within the ROI to produce vehicle pixels in the form of a binary image. The output comprises vehicle shapes with pixel values set to 255 (white) and all other pixels set to 0 (black). Finally, the authors use a contour area threshold to track and count cars in the ROI.
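The pipeline described above (gray-scale conversion, absolute difference, ROI thresholding, area filtering) can be sketched with NumPy on a synthetic frame. This is not the implementation from [24]: the function names, the threshold values, and the use of a simple connected-region area filter in place of contour analysis are all assumptions made to keep the example self-contained.

```python
import numpy as np


def to_gray(rgb):
    """Convert an H x W x 3 RGB array to gray-scale using common luma weights."""
    return rgb.astype(float) @ np.array([0.299, 0.587, 0.114])


def difference_image(background_gray, frame_gray):
    """Absolute per-pixel difference between the current frame and the background."""
    return np.abs(frame_gray - background_gray)


def binary_mask(diff, roi, pixel_threshold=30.0):
    """255 where the difference exceeds the threshold inside the ROI, else 0."""
    return np.where((diff > pixel_threshold) & roi, 255, 0).astype(np.uint8)


def count_regions(mask, min_area=4):
    """Count 4-connected white regions whose pixel area is at least min_area."""
    rows, cols = mask.shape
    visited = np.zeros(mask.shape, dtype=bool)
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r, c] != 255 or visited[r, c]:
                continue
            stack, area = [(r, c)], 0
            visited[r, c] = True
            while stack:  # flood fill one region
                y, x = stack.pop()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny, nx] == 255 and not visited[ny, nx]):
                        visited[ny, nx] = True
                        stack.append((ny, nx))
            if area >= min_area:
                count += 1
    return count


# Synthetic scene: empty background, two bright "vehicles", one noisy pixel.
background = np.zeros((10, 10, 3), dtype=np.uint8)
frame = background.copy()
frame[1:4, 1:4] = 200
frame[6:9, 5:9] = 180
frame[9, 0] = 255
roi = np.ones((10, 10), dtype=bool)
roi[:, 9] = False  # the last column lies outside the manually chosen ROI

mask = binary_mask(difference_image(to_gray(background), to_gray(frame)), roi)
print(count_regions(mask))  # 2
```

The single noisy pixel is discarded by the area threshold, which plays the role of the contour area threshold in the text.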


Fig 9 Foreground extraction demo

Although this approach has some advantages, such as being easy to implement and computationally cheap, it is very sensitive to external factors. In [24], the author has to adjust the brightness parameter when running experiments because of changes in pixel intensity. Besides, the method requires a clean background image, which is practical only in experiments. Moreover, ROI selection is performed manually, and there is no validation method to check whether the selected ROI remains correct over time, especially since external factors such as wind can have a big impact on the selection result.

3.2.2 Haar-cascade

In this section, machine learning-based approaches are considered. The most basic one is Haar-cascade. In [25], Haar-cascade is used as a classifier for detecting and counting cars and buses in images taken from CCTV sources. This approach can be applied to both video and images. The author also provides a pre-trained XML file (cascade file) for detecting vehicles from a given camera view angle, which is the same angle as in the thesis. As with the previous method, the Haar-cascade approach is easy to implement and processes images quickly.

Fig 10 Result of applying Haar-cascade in the same environment as training [25]

However, this method requires a high-quality input image for detection, while the thesis data, taken from the traffic department, cannot meet this requirement due to the constraints of Cybersecurity Law 2018 No 24/2018/QH14 [26]. Besides, this approach is quite sensitive to changes in the environment, such as the camera view angle; changing the angle can have a big impact on the detection result. Overcoming this issue requires re-training, which takes a lot of effort and cannot be done in a general way for all intersections.


Fig 11 Result in real environment – low quality image

3.2.3 Convolutional neural network

A convolutional neural network (CNN) is a more advanced method than Haar-cascade in terms of not only speed but also accuracy and performance. Furthermore, CNN-based models have many variants that outperform older versions, such as Faster R-CNN, Mask R-CNN, ResNet-50, and so on. In experiments such as those in [27], the detection results are very good compared to the methods mentioned above for both images and video.

However, the drawbacks of [27] need to be considered. Firstly, computation takes a lot of time: the reported "real-time rate" is always far from 1. In short, it is hard to apply these approaches to a real-time scenario. Secondly, these methods require around 3200 images, together with 15 to 16 hours of training [27]. To be applied to the real-world Ho Chi Minh City environment, re-training will be required, and labeling the data will take a lot of effort. Moreover, each intersection has a different camera view angle and distance. Thus, in the worst case, each junction will require its own re-training phase, which takes a lot of time and effort.


Fig 12 Result of (a) Faster R-CNN, (b) Mask R-CNN, (c) ResNet-50 [27]

3.2.4 YOLO

Considering all the methods mentioned above, to meet real-time requirements and ensure high performance, YOLO (You Only Look Once) can be considered one of the best state-of-the-art methods at present. In [28], the original version of YOLO is applied to detect vehicles in CCTV video. According to the results in [28], YOLO can process up to 45 frames per second, compared with 30 frames per second for the methods in Section 3.2.3. The reported accuracy is around 95% to 100%, which is higher than that of the convolutional neural network approaches.

Fig 13 Detection result using YOLO on the ROI [28]


Furthermore, YOLO has been released in many upgraded versions over time. Each version focuses on increasing performance, reducing training time, and reducing computation costs. A team including members from Vietnam and China applies YOLO v4, an advanced version of YOLO, to detect and count vehicles under mixed traffic conditions [29]. In their reported results, the authors compare YOLO v4 with Haar-cascade and background subtraction, both mentioned in the sections above, using the COCO dataset. Thus, the YOLO approach can be considered for use in the thesis.
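Once a detector such as YOLO has produced labeled boxes, counting vehicles in a region of interest reduces to a filtering step. The sketch below assumes a hypothetical detection format of (label, confidence, (x_min, y_min, x_max, y_max)) tuples and a polygonal ROI; neither the format nor the 0.5 confidence threshold comes from [28] or [29], and no real detector API is called.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon [(x, y), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_vehicles(detections, roi_polygon, min_confidence=0.5):
    """Count detections per class whose box center falls inside the ROI."""
    counts = Counter()
    for label, confidence, (x_min, y_min, x_max, y_max) in detections:
        if confidence < min_confidence:
            continue
        cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
        if point_in_polygon(cx, cy, roi_polygon):
            counts[label] += 1
    return counts


roi = [(0, 0), (100, 0), (100, 100), (0, 100)]
detections = [
    ("car", 0.90, (10, 10, 30, 30)),        # inside the ROI
    ("motorcycle", 0.80, (50, 50, 60, 70)),  # inside the ROI
    ("car", 0.30, (40, 40, 60, 60)),        # below the confidence threshold
    ("bus", 0.95, (150, 10, 200, 40)),      # outside the ROI
]
counts = count_vehicles(detections, roi)
print(dict(counts))  # {'car': 1, 'motorcycle': 1}
```

Using the box center as the membership test is a design choice; a box straddling the ROI boundary could instead be assigned by overlap area.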

Fig 14 Pre-trained model YOLO v7 in real environment

3.3 Object size measuring methods

To obtain traffic density, one parameter to be considered is the region's area. The sections below introduce some current methods for converting from pixels to meters (in the thesis, from pixels to square meters), together with their advantages and disadvantages.


3.3.1 Math-based calculating method

The research on traffic density estimation from highly noisy image sources [30] describes four basic steps for this approach. However, only two of them are relevant to the thesis.

Firstly, traditional image processing approaches for automatically detecting road segments do not work well with the authors' collection of noisy picture sources. Instead, given a set of traffic images from a single image source, the authors mark a polygon area in the image as the region of interest that represents a specific road segment. Additionally, basic geometry calibrations relating to the position and angle of the camera are required. In short, as with the methods mentioned earlier, the ROI is selected manually, not automatically.

Secondly, the authors map pixels to locations along the road segment using a graded measure, obtaining a weighted measure for the pixel intensity distributions that captures the concept of distance. In this step, the authors require physical specifications, such as the camera height, the distance from the camera to objects, and the camera angle, to evaluate the equations below.

As explained in [30], Fig 15 shows the cross-section of the road area under consideration. For ease of evaluation, only one lane is considered. The real and observed points taken into account on a road segment are listed in the legend below.


Fig 15 Field of view [30]

Where: H is the height of the camera; C is the starting point of the field of view; d + Xmax is the actual endpoint of the field of view; E is the observed endpoint of the field of view; G is the actual road point under inspection; F is the observed position of G in the image; xi is the actual distance of the road point from C; h∆ is the observed height of the complete road length in the image; hi is the observed height of point G in the image

The endpoints of the road under consideration are C and D. The camera's field of view is represented by ∆ACD. Any length beyond d + Xmax is outside the camera's field of view. There is a distance d = H * tanθ before the near point in the camera's field of view. The camera's zoom level and image quality affect its field of view and image clarity.

The road that is projected on the image is CGD. Let p∆ denote the number of pixels that separate the start and end of the road segment when it is marked in the image. In other words, p∆ represents the projection of the whole segment CD on the camera screen. Although the camera scale is too small for geometric analysis, the projection CE can be thought of as a scaled-up version of the image.


Given that CE has a height of h∆, the scaling factor of the original image is h∆/p∆. Any point G on the road corresponds to a projection point F in the simulated scaled-up projection. Let hi be the height of CF. If pi denotes the distance, in pixels, of the pixel corresponding to G from the start of the road section, then we get: hi = h∆ (pi / p∆)

This provides a mapping between the pixel position in the image and the actual distance in the real-world setting. Towards the far end of the field of view, the same stretch of road occupies fewer pixels in the image, so distant vehicles contribute less to the raw pixel count. To compensate, the authors developed a density function based on the geometric properties of the road segment and its image. The overall road traffic density is the sum over rows of the pixel count in row i, count(i), multiplied by its weight W(i). The derivation starts from the similar-triangle relation in Eq (5):

$$\frac{h_i}{H} = \frac{x_i}{x_i + d} \qquad (5)$$

Thus, they obtain Eq (6). As a result, given H, Xmax and d, together with the input image, the authors can estimate the graded measure of the pixel count for the road segment.

$$\mathrm{Density}\; f(x) = \sum_i \mathrm{count}(i) \cdot \frac{H}{H - \left(\frac{p_i}{p_\Delta}\right) h_\Delta}, \qquad \frac{h_\Delta}{H} = \frac{X_{max}}{X_{max} + d} \qquad (6)$$

Where: H is the height of the camera; h∆ is the observed height of the complete road length in the image; pi represents the distance of the pixel corresponding to G from the beginning of the road segment; the image's scaling factor is h∆/p∆; d + Xmax is the actual endpoint of the field of view
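The geometry above can be checked numerically with a short sketch. It assumes the similar-triangle relation hi / H = xi / (xi + d), which gives xi = d * hi / (H - hi) and a per-row weight W(i) = H / (H - hi), with hi = h∆ (pi / p∆). The function names, the convention that row 0 is the nearest row, and the example camera values are hypothetical and are not taken from [30].

```python
def observed_full_height(H, x_max, d):
    """h_delta: observed height of the whole road length, from similar triangles."""
    return H * x_max / (x_max + d)


def row_weight(p_i, p_delta, H, h_delta):
    """W(i) = H / (H - h_i), where h_i = h_delta * (p_i / p_delta)."""
    h_i = h_delta * (p_i / p_delta)
    return H / (H - h_i)


def real_distance(p_i, p_delta, H, h_delta, d):
    """Actual distance x_i from the start of the field of view for pixel row p_i."""
    h_i = h_delta * (p_i / p_delta)
    return d * h_i / (H - h_i)


def weighted_density(row_counts, H, x_max, d):
    """Sum of count(i) * W(i); row 0 is assumed to be the row nearest the camera."""
    p_delta = len(row_counts)
    h_delta = observed_full_height(H, x_max, d)
    return sum(count * row_weight(i, p_delta, H, h_delta)
               for i, count in enumerate(row_counts))


# Hypothetical setup: camera 10 m high, d = 10 m, road visible for 90 m.
H, d, x_max = 10.0, 10.0, 90.0
h_delta = observed_full_height(H, x_max, d)
print(h_delta)                                     # 9.0
print(real_distance(100, 100, H, h_delta, d))      # 90.0 (the far endpoint)
print(round(weighted_density([4, 4], H, x_max, d), 3))  # 11.273
```

In the last line, the same pixel count of 4 contributes more in the far row than in the near row, which is the intended effect of the weighting.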

In general, this approach can return an almost exact value by comparing the inferred size of the target object in an image with its real size. Besides, according to the results in [30], the equation has already been demonstrated on the setup in Fig 15. However, the main drawback of this method is that it requires many physical specifications, which are hard to obtain because camera installation differs at each crossroads and camera brands differ as well. Thus, although this method is mathematically sound, it cannot be applied to the thesis, since more than one intersection will be used in the experiments. It is necessary to find other methods that do not rely heavily on camera specifications. In the next section, the object reference concept is introduced to overcome this issue.

3.3.2 Object reference method

As mentioned above, the math-based calculation method requires a lot of information about physical specifications, which are hard to obtain due to differences in camera installation. To overcome this issue, an approach that is independent of the specifications is considered, known as object reference [31]. Instead of collecting all physical specifications, this approach requires knowing the dimensions of a reference object (its width and height) in a measurable unit, such as millimeters or inches. Moreover, this reference object should be easy to find in an image or have a unique shape compared with other objects. In [31], the author uses a United States quarter as the reference object and calculates the ratio of pixels to real-world units, called pixels_per_metric.
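A minimal sketch of the reference-object idea follows. The ratio name pixels_per_metric comes from [31]; the helper names `to_real_length` and `to_real_area`, and the measured pixel widths, are hypothetical. The diameter of a United States quarter is 0.955 inches. The area conversion divides by the squared ratio, which is the step needed to go from pixels to square units.

```python
def pixels_per_metric(reference_width_px, reference_width_real):
    """Ratio of the reference object's width in pixels to its known real width."""
    return reference_width_px / reference_width_real


def to_real_length(length_px, ppm):
    """Convert a length in pixels to real-world units."""
    return length_px / ppm


def to_real_area(area_px, ppm):
    """Convert an area in square pixels to square real-world units."""
    return area_px / (ppm ** 2)


QUARTER_DIAMETER_IN = 0.955  # diameter of a US quarter in inches

# Hypothetical measurement: the quarter spans 150 pixels in the image.
ppm = pixels_per_metric(150.0, QUARTER_DIAMETER_IN)
print(round(ppm, 2))                          # 157.07 pixels per inch
print(round(to_real_length(300.0, ppm), 3))   # 1.91 inches
print(round(to_real_area(22500.0, ppm), 4))   # 0.912 square inches
```

The conversion is only as accurate as the assumption that the reference object and the measured object lie in the same plane at the same distance from the camera.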


Fig 16 United States quarter is used as reference object [31]

This method can help the thesis calculate the pixel per meter parameter without depending on physical specifications. It can be generalized to all intersections rather than a specific one, which is not the case for the Haar-cascade or CNN methods. Despite these advantages, the method has two important requirements. Firstly, the camera view must be perpendicular to the plane so that the object size can be measured exactly. Secondly, the reference object should be unique in the image so that it can be detected. Despite these constraints, this method can be applied to the thesis with some modifications that adapt it to real cases at an approximate level.

3.4 Traffic density calculating

According to the definition in [32], traffic density is a fundamental characteristic that indicates how congested the road is with vehicles. It is given by Eq (8):

$$\mathrm{density} = \frac{m}{L} \qquad (8)$$

Where: m is the number of vehicles occupying a segment of the road; L is the length of that segment
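As a worked instance of Eq (8), the sketch below divides a vehicle count by a segment length. The function name and the choice of kilometers as the unit are assumptions for illustration; [32] is not quoted for units here.

```python
def traffic_density(vehicle_count, segment_length_km):
    """Eq (8): density = m / L, here in vehicles per kilometer."""
    if segment_length_km <= 0:
        raise ValueError("segment length must be positive")
    return vehicle_count / segment_length_km


# 30 vehicles on a 0.5 km segment.
print(traffic_density(30, 0.5))  # 60.0
```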


Furthermore, some studies, such as [33], [34], and [35], mention that the number of vehicles counted in a given ROI (Region of Interest) can be used as a measurable unit for traffic density and categorized into three levels: light, medium, and solid. However, these studies mostly operate on a segment of road, as shown in their experiments, not on a region. Thus, Eq (8) cannot be applied to this work for the following reasons.

First of all, there is no standard definition of length or width for intersections that can be used to decide which direction serves as the length when calculating density. The direction in which vehicles move must be taken into account to solve this problem. However, depending on the camera angle and the presence of multiple vehicle flows at the crossing, it becomes impossible to estimate the vehicle direction precisely enough to define the length direction. Additionally, it is difficult to estimate the length of the road segment for crossroads such as T-junctions or six-way intersections, since the entire area must be taken into account unless the intersection is divided into extremely small road segments for processing.

Last but not least, if vehicles were simply counted without classification, a car would be treated the same as a motorcycle. Such a count fails to capture the spatial occupancy of cars in the area, and likewise of trucks and buses. Therefore, it is necessary to investigate a new approach to calculating traffic density.
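The limitation of counting without classification can be illustrated with two scenes that contain the same number of vehicles. The footprint values below are purely hypothetical placeholders chosen for illustration; they are not measurements and are not the method proposed in this thesis.

```python
# Hypothetical footprints in square meters, for illustration only.
FOOTPRINT_M2 = {"motorcycle": 1.5, "car": 8.0, "bus": 30.0}


def occupied_area(vehicles):
    """Total road area covered by the listed vehicles, using the assumed footprints."""
    return sum(FOOTPRINT_M2[v] for v in vehicles)


scene_a = ["motorcycle"] * 10
scene_b = ["car"] * 8 + ["bus"] * 2

print(len(scene_a), len(scene_b))                      # 10 10
print(occupied_area(scene_a), occupied_area(scene_b))  # 15.0 124.0
```

Both scenes have the same count, yet under these assumed footprints the second one covers roughly eight times more road surface, which a count-based density cannot distinguish.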