VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
GRADUATION THESIS Vehicle detection in surveillance videos
Major: Computer Science
THESIS COMMITTEE:
Computer Science and Engineering
SUPERVISOR: Assoc. Prof. Nguyễn Thanh Bình
REVIEWER: Ph.D. Phan Trọng Nhân
HO CHI MINH CITY, 10/2022
We promise that the thesis "Vehicle detection in surveillance videos" is conducted by us, under the supervision of Assoc. Prof. Nguyễn Thanh Bình. All of the references and related works are cited in the References section. The contents of this thesis have never been published in any format. If there are any violations, we will take full responsibility before the committee and presidents of the institution.
Group members
ACKNOWLEDGEMENT

First and foremost, we would like to thank our research supervisor, Assoc. Prof. Nguyễn Thanh Bình, lecturer of the Computer Science and Engineering Department (Ho Chi Minh City University of Technology, Vietnam National University, Ho Chi Minh City). Without his assistance and dedicated involvement in every step throughout the process, this thesis could not have been accomplished. We are extremely grateful to him for his support and understanding during the past time.
We would also like to show gratitude to the committee of Ho Chi Minh City University of Technology and the lecturers of the Computer Science and Engineering Department for getting us through this dissertation with their academic support, and we have many, many people to thank for listening.
Most importantly, none of this could have happened without our family. To our guardians: it would be an understatement to say that, as a family, we have experienced some ups and downs in the past three years. This dissertation stands as a testament to your unconditional love and encouragement.
Group members
1 Introduction
1.1 Thesis Introduction
With the recent trend of development in the field of technology, traffic surveillance systems have been deployed and widely used in many countries to monitor traffic conditions and ensure road safety, which has recently become a major concern. However, news about road accidents demonstrates the inefficiency of the current surveillance systems, while statistics show a large increase in the rate of road accidents every year.
An accurate, real-time traffic information collection technique is therefore necessary and essential for intelligent traffic surveillance and autonomous driving systems. Particularly crucial is vision-based technology that uses real-time video cameras to gather road images and automatically extract traffic data from those images. Due to the wide range of conditions, lighting, weather, vehicle positions, and scale patterns, traffic surveillance camera images are quite complex. Vehicle detection in surveillance videos is the most important and challenging stage of traffic surveillance using computer vision techniques. The vehicle detection system includes detecting moving vehicles, counting the number of vehicles, and classifying the detected vehicles.
This thesis plans to create a lightweight, real-time vehicle detection system by implementing an appropriate algorithm on video input taken from surveillance cameras.
1.2 Missions & Goals
1.2.1 Missions
Our team aims to apply computer vision algorithms, as they simply need video footage from the surveillance camera to collect traffic data. Those data can be used to analyze and find a solution to lower the accident rate in the future. Our vehicle detection system using surveillance video includes the following features:
- Vehicle detection
- Vehicle type classification
- Vehicle counting
1.2.2 Goals
In this thesis, we will investigate the background knowledge and related research. While doing the thesis, we will collect and enrich the dataset. For illustration, we will run a demonstration on a few labeled images taken in Viet Nam. In the future, we hope our traffic data can be used as a dataset for various applications like congestion analysis and accident detection.
1.3 Limitations
Our vehicle detection system aims to work normally on videos taken in good weather and light conditions (clear sky, sunny day) as well as on images taken in extreme weather and low light (rain, fog, blizzards, or low-light conditions). It can also operate when the vehicle at the front is blocking no more than 50% of the vehicle behind it. Finally, the inference speed is at least real time, at 24 frames per second, similar to the industry standard.
1.4 Structure of Thesis
The structure of this thesis is as follows:
- Chapter 2 presents the background knowledge and related research used in this thesis.
- Chapter 3 restates the problem and introduces the algorithms and methodologies that we use to build the system.
- Chapter 4 illustrates the results of the vehicle detection system and evaluates its performance.
- Chapter 5 summarizes the work, compares the pros and cons of the suggested method, and proposes future development.
2 Background & Related Works
2.1 Background
2.1.1 Object detection
In the past few years, we have seen a huge leap in many fields of information technology. One of the most notable developments is in object detection algorithms and implementations. Object detection has progressed from detecting simple objects to human faces, animals, vehicles, etc. With the recent development of technology and society, it has been implemented in many industries, like mass manufacturing, autonomous driving, and traffic monitoring.
Figure 2.1: Object detection development [15]
Object detection is a computer vision technique for detecting instances of objects in pictures or videos. Object detection algorithms usually integrate machine learning or deep learning to produce meaningful results. People can quickly identify and pinpoint objects of interest when viewing photos or videos; the goal of object detection is to replicate this intelligence with a computer.
Deep learning object detection algorithms are data-driven algorithms for feature extraction and object detection, as opposed to feature-engineering methods. Deep learning methods extract features from image data statistics and automatically learn object appearance features. Furthermore, CNN features have more representative characteristics, and the feature learning process is similar to that of the human visual mechanism. Based on network structure and feature learning progress, deep learning object detection algorithms are divided into two categories: two-stage object detection and one-stage object detection.
2.1.2 R-CNN
Feature extraction
Using the Caffe implementation of the CNN, the algorithm extracts a 4096-dimensional feature vector from each region proposal. Features are computed by forward-propagating a mean-subtracted 227×227 RGB image through five convolutional layers and two fully connected layers [3].
To compute features for a region proposal, the image data in that region must first be converted into a format that the CNN can understand (the CNN's architecture requires inputs of a fixed size of 227×227 pixels). R-CNN chose the simplest of the many possible transformations for arbitrary-shaped regions: all pixels in a tight bounding box around the candidate region are warped to the required size, regardless of the region's size or aspect ratio. The tight bounding box is dilated before warping so that there are exactly p pixels of warped image context around the original box at the warped size (p = 16).
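This warping step can be made concrete with a short sketch. The code below is a minimal illustration, not the original Caffe implementation; it assumes an OpenCV image in height × width × channel layout and a proposal given as (x1, y1, x2, y2), and the padding arithmetic approximates the paper's dilation so that roughly p = 16 warped pixels of context surround the original box.

```python
import cv2

def warp_proposal(image, box, out_size=227, context=16):
    """Warp a region proposal to the CNN's fixed input size, adding
    roughly `context` pixels of warped image context around the box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Dilate the tight box so that, after resizing to out_size x out_size,
    # about `context` warped pixels of surrounding image are included.
    pad_x = context * w / (out_size - 2 * context)
    pad_y = context * h / (out_size - 2 * context)
    x1 = int(max(0, x1 - pad_x)); y1 = int(max(0, y1 - pad_y))
    x2 = int(min(image.shape[1], x2 + pad_x)); y2 = int(min(image.shape[0], y2 + pad_y))
    crop = image[y1:y2, x1:x2]
    # Anisotropic resize: the aspect ratio is deliberately ignored, as in R-CNN.
    return cv2.resize(crop, (out_size, out_size))
```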
Run-time analysis
There are two things that make R-CNN detection efficient. First, all CNN parameters are shared across all categories. Second, when compared to other common approaches, such as spatial pyramids with bag-of-visual-words encodings, the feature vectors computed by the CNN are low-dimensional. The features of the UVA detection system, for example, are two orders of magnitude larger (360k vs. 4k dimensions).
As a result of this sharing, the time spent computing region proposals and features (13 s/image on a GPU or 53 s/image on a CPU) is amortized across all classes. The only class-specific computations are dot products between features and SVM weights, as well as non-maximum suppression.
In practice, all of an image's dot products are combined into a single matrix-matrix product. In most cases, the feature matrix is 2000 × 4096 and the SVM weight matrix is 4096 × N, where N is the number of classes. According to this analysis, R-CNN can scale to thousands of object classes without resorting to approximate techniques like hashing. Even with 100k classes, the matrix multiplication takes only about 10 seconds on a modern multi-core CPU. This efficiency is not solely due to the use of shared features and region proposals: due to its high-dimensional features, the UVA system would be two orders of magnitude slower, requiring 134 GB of memory just to store 100k linear predictors, versus only 1.5 GB for these lower-dimensional features.
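This per-image scoring reduces to one matrix product, which a minimal NumPy sketch makes concrete; the random arrays are stand-ins for real CNN features and trained SVM weights, with shapes taken from the analysis above (N = 20 is illustrative).

```python
import numpy as np

num_proposals, feat_dim, num_classes = 2000, 4096, 20

features = np.random.randn(num_proposals, feat_dim).astype(np.float32)   # 2000 x 4096
svm_weights = np.random.randn(feat_dim, num_classes).astype(np.float32)  # 4096 x N

# All class-specific dot products for the whole image in one matrix-matrix product.
scores = features @ svm_weights  # 2000 x N: score of every proposal for every class
```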
2.1.3 Fast R-CNN
The approach is similar to the R-CNN algorithm, but instead an entire image and a set of object proposals are fed into the Fast R-CNN network. To create a convolutional feature map, the network first processes the entire image with several convolutional and max pooling layers. A region of interest (RoI) pooling layer then extracts a fixed-length feature vector from the feature map for each object proposal. Each feature vector is fed into a series of fully connected layers, which branch out into two sibling output layers: one that generates softmax probability estimates for K object classes plus a catch-all "background" class, and another that generates four real-valued numbers for each of the K object classes. Each set of four values encodes the refined bounding-box position for one of the K classes.
The Region of Interest pooling layer
The RoI pooling layer employs max pooling to convert the features within any valid region of interest into a small feature map with a fixed spatial extent of H × W, where H and W are layer hyper-parameters that are independent of any specific RoI [4]. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is specified by a four-tuple (r, c, h, w): the top-left corner (r, c) and its height and width (h, w).
RoI max pooling divides the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W, and then max-pools the values in each sub-window into the corresponding output grid cell. As with standard max pooling, each feature map channel is pooled separately. The RoI layer is simply a one-level version of the spatial pyramid pooling layer used in SPPnets.
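For illustration, torchvision ships an RoI max pooling operator that behaves as described. A minimal sketch, assuming a conv feature map with stride 16 relative to the input image (hence spatial_scale = 1/16); all shapes are placeholders.

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 50, 50)  # (batch, channels, map height, map width)
# Each RoI is (batch_index, x1, y1, x2, y2) in input-image coordinates.
rois = torch.tensor([[0.0, 100.0, 120.0, 300.0, 360.0]])
# Max-pool every RoI into a fixed H x W = 7 x 7 grid, channel by channel.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 512, 7, 7])
```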
2.1.5 Two-stage detection algorithms
The two-stage detection algorithm divides the detection process into two stages: generating candidate regions (region proposals) and classifying them (generally with location refinements). First, most algorithms use a selective search [2] to generate proposal regions of interest (RoIs). Second, they employ an SVM classifier to determine the most accurate region and the type of the object.
R-CNN is a classic object detection algorithm that uses convolutional neural networks (CNNs) to extract vehicle features. The detection task is completed in two stages using this method. Fast R-CNN is an improved version of R-CNN that includes the RoI pooling layer, which eliminates duplicate computation in R-CNN. Furthermore, Fast R-CNN improved detection results by using a softmax layer instead of R-CNN's SVM classifier. Faster R-CNN is an improved version of Fast R-CNN that uses an RPN network instead of the selective search method used by R-CNN and Fast R-CNN. Its loss function smooths the training progress and improves the detection results, which is another Faster R-CNN innovation.
The region-based fully convolutional network (R-FCN) aims to improve detection performance by combining Faster R-CNN and FCN. Its main contributions are the introduction of FCN to achieve more parameter and feature sharing (compared to Faster R-CNN) and solving fully convolutional networks' location sensitivity deficiencies (using position-sensitive score maps).
Two-stage detection algorithms produce accurate vehicle detection results. However, they cannot achieve real-time inference.
2.1.6 One-stage detection algorithms
The main idea behind one-stage detection is to uniformly conduct intensive sampling at different locations of the image, using different scales and aspect ratios, and then use a CNN to extract features before directly classifying and regressing. As a result, its advantage is speed, but uniform intensive sampling has a significant disadvantage: training is more complicated, resulting in slightly lower model accuracy.
The detection task is transformed into a uniform, end-to-end regression problem, and the position and classification are obtained simultaneously in a single pass. The idea of transforming detection into regression was borrowed from YOLO, and a prior box similar to the anchor in Faster R-CNN was proposed to locate and classify targets at the same time; a pyramidal-feature-hierarchy-based detection approach, i.e., predicting targets on feature maps with different receptive fields, was also added.
YOLO implements detection using a CNN; the training and prediction process is end-to-end, and the algorithm is quick and simple. YOLO performs convolutional calculations on the entire image, giving it the advantage of a larger field of view during detection and making it less likely to misjudge the background. The attention module makes use of the full convolutional layer. Furthermore, YOLO has a good generalization capability and high model robustness when migrating. Additionally, the new Darknet-19 feature extraction network, adaptive anchor boxes, and multi-scale training improve YOLO's detection performance.
RetinaNet solves the problem of the loss from a large number of easy samples overwhelming the loss from hard cases (via its focal loss), and it is a one-stage model with accuracy comparable to two-stage detection algorithms. However, RetinaNet's inference time is slow compared to YOLO and SSD.
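RetinaNet's focal loss down-weights well-classified examples so that hard cases dominate training. The PyTorch sketch below is one common binary formulation, not RetinaNet's exact training code; alpha and gamma follow the paper's commonly quoted defaults.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: scales cross-entropy by (1 - p_t)^gamma so that
    easy examples contribute little and hard examples dominate."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)             # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```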
2.2 Related Works
Vehicle detection
Background subtraction is a typical method of detecting vehicles because traffic surveillance cameras are static. Elder and Corral-Soto [21] fitted a mixture model for the distribution of vehicle dimensions using labeled data. The model is combined with the previously established geometry of vehicle blobs in the current frame to determine the vehicle configuration based on background subtraction.
Similarly, Dubska et al. [20] obtain vehicle masks using background subtraction. Methodologies that depend on background subtraction may not be ideal for some traffic surveillance scenarios because they can be sensitive to changing lighting conditions or to vibrations of the traffic cameras.
Several methods for recognizing 3D bounding boxes have been published and tested on the KITTI dataset [25]. This dataset contains videos from a driving vehicle's egocentric point of view in diverse urban settings. The videos are annotated with 3D bounding boxes of pertinent objects, such as automobiles, cyclists, and pedestrians.
Numerous published methods use modified 2D object detectors. The creators of the CenterNet object detector [26] released an evaluation of a slightly modified version of their detector on the KITTI dataset. To create a final 3D bounding box, Mousavian et al. [27] start with a 2D bounding box, regress vehicle dimensions and orientation independently, and then merge those results with geometry constraints. The MonoDIS system adds a 3D detection head to a RetinaNet object detector.
Numerous researchers have employed camera motion compensation (CMC), aligning frames via image registration using maximization of the Enhanced Correlation Coefficient (ECC) or by matching features such as ORB.
Vehicle Speed Estimation
Based on a review of current techniques for calibrating traffic cameras, measuring vehicle speeds, and the datasets used for evaluation, many published results turn out to be based on limited datasets in which the actual speeds of only a small number of the monitored cars are known. Furthermore, the majority of the datasets used in published papers were not accessible to the general public. One group of authors offers their own dataset, the BrnoCompSpeed dataset, containing 21 videos totaling an hour in length that include 20,000 cars with known ground-truth speeds measured using laser gates. The authors additionally offer an evaluation script for this dataset.
Luvizón et al. [18] published a dataset containing five hours of video from an intersection. The dataset includes annotated license plate locations as well as ground-truth vehicle speeds determined by inductive loops placed in the road. The authors offer their own speed estimation pipeline based on the identification of license plates.
License Plate Detection
Even though it is not the main focus of this work, detecting license plates is an essential step in the suggested speed measurement system. The survey by Du et al. [20] examines the most recent license plate identification methods as of 2013. These algorithms depend on features like edges, texture, color, shape, and geometry. Among the difficulties encountered are poor maintenance, occlusion, changes in position and illumination, complicated backdrops, low contrast, low image quality, and motion blur.
Epstein et al.'s Stroke Width Transform (SWT) is a text detector that uses the orientations of gradients across edge pixels to calculate the local stroke widths of candidate letters.
Minetto et al.'s SnooperText [23] is a multi-scale text detector that distinguishes between characters and non-characters based on shape descriptors and morphological image segmentation.
SnooperText uses the T-HOG classifier [24], a specific gradient-based descriptor tailored for single-line text regions, to assess candidate text regions. Each of these algorithms includes pre-processing procedures as well as tests for removing inaccurate answers based on size and geometry.
3 Proposed methodologies
3.1 Problem Statement
Surveillance video systems typically employ cameras as automatic observation devices, processing their output to extract useful data. We carried out comparisons of CNN-based object detection methods, such as YOLOv7 and Faster R-CNN, to establish which method provides the optimum results for detecting vehicle objects in surveillance videos.
Because of their importance in providing appropriate information about vehicles on the road, these systems are considered an important part of Intelligent Transportation Systems. They should support:
- Vehicle type recognition
- Vehicle counting
- Operation in adverse weather conditions
Above are some of the interesting tasks that have emerged from efforts to solve these problems. These applications became more popular with the advent of computers and the development of image processing techniques. For such traffic applications, a video-based approach must offer quick processing time, high reliability, and low cost. The system's ability to process sequential frames while pursuing a variety of goals and maintaining acceptable accuracy and quality is a key consideration when developing alternatives to current traffic data collection methods. Current video-based systems, however, are sensitive to harsh environmental conditions such as weather and lighting, resulting in lower accuracy and reliability.
3.1.1 Vehicle Counting
Vehicle counting is an important method for assessing traffic conditions and estimating traffic flow. Video-based vehicle counting depends mostly on the effectiveness of vehicle detection; thus, the same approach can be applied to surveillance camera videos. Vehicle counting can help solve some of the following problems:
- Parking lot availability
- Traffic flow estimation
- Traffic jam prediction
A parking lot is definitely one of the places to test such a problem. There, cameras are usually placed with different perspectives from what is seen on urban highways: they are situated along the side of the road, and cars are frequently parked close together. Given the capacity of the area, a simple system can be used to assess the available parking spaces in a particular zone. Another application is traffic flow estimation: by tracking the intensity of vehicles through a surveillance camera, an intelligent traffic system can predict and manage traffic jams by guiding drivers to a more optimal path.
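As an illustration of detection-based counting, the sketch below counts vehicles whose tracked centers cross a virtual line in the image. It assumes an upstream tracker that yields a stable track ID and box center per vehicle per frame; all names here are illustrative rather than taken from a specific library.

```python
def count_line_crossings(tracks_per_frame, line_y):
    """tracks_per_frame: iterable of {track_id: (cx, cy)} dicts, one per frame.
    Counts each vehicle once, when its center first crosses y = line_y
    moving downward in the image."""
    last_y = {}
    count = 0
    for frame_tracks in tracks_per_frame:
        for tid, (cx, cy) in frame_tracks.items():
            prev = last_y.get(tid)
            if prev is not None and prev < line_y <= cy:
                count += 1
            last_y[tid] = cy
    return count
```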
3.1.2 Violation Detection
Although following traffic regulations is required, many offenses happen on the road, such as running red lights, crossing road markers, or speeding. Law enforcers have put numerous new ideas into practice to decrease the frequency of traffic violations, yet these are still ineffective at handling them. The constrained investment in road resources cannot keep up with the steadily growing number of vehicles and people, so a more clever and economical plan is needed to address the traffic management issue. The computer vision solution is not only more intelligent than conventional detection and monitoring based on physical equipment, but it can also lower the cost considerably. Through real-time traffic scene monitoring and data processing, experiments show that YOLOv7 can handle the demands of intelligently managing urban traffic. Some traffic violations can be tracked and reported autonomously from a single surveillance camera using machine learning, with better precision and uptime. This allows more coverage of the road infrastructure with minimal investment and opens up many more possibilities than our current manual traffic regulation system. For example:
- Speeding detection: by setting a threshold on a lane's speed limit, pictures of vehicles exceeding that speed can be recorded in real time (a rough version of this check is sketched after this list).
- Accident detection: the basis of an accident is a sudden change of lane or speed, where a vehicle drifts or stops immediately. By focusing on this idea, we can apply surveillance cameras on busy roads to report accidents as soon as possible.
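The speeding check can be sketched with simple arithmetic, assuming the camera has been calibrated: the frame rate and the meters-per-pixel factor below are placeholder values that would have to be measured per camera view.

```python
FPS = 24.0               # assumed camera frame rate
METERS_PER_PIXEL = 0.05  # assumed calibration for this camera view

def speed_kmh(displacement_px, frames_elapsed, fps=FPS, mpp=METERS_PER_PIXEL):
    """Rough speed estimate from a vehicle's pixel displacement
    between two detections that are `frames_elapsed` frames apart."""
    meters_per_second = displacement_px * mpp * fps / frames_elapsed
    return meters_per_second * 3.6

def is_speeding(displacement_px, frames_elapsed, limit_kmh=60.0):
    return speed_kmh(displacement_px, frames_elapsed) > limit_kmh
```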
3.1.3 Detection in adverse environment
Weather
Naturally, there are a variety of environmental elements that affect the background images of outdoor recordings. Driving on real-world roads may involve many weather circumstances, e.g., rain, fog, typhoons, and snow. Vehicles captured by traffic surveillance cameras in inclement weather display several degrees of vagueness, such as darkness, blurring, and partial occlusion. At the same time, snowflakes and raindrops that arrive in traffic scenes make it more difficult to extract the regions of interest, which are the vehicles.
False detections can happen as a result of the complexity of adverse weather, and some cars will not be detected correctly. With the advancement of computer vision technologies, more work has been done in more challenging situations, such as cloudy, foggy, or snowy days. Extreme weather has rarely been tested in vehicle detection up to now. On the one hand, not many driving scenes in bad weather are used and tested in released datasets. On the other hand, methods for detecting vehicles that are only applicable in unique situations have a very limited range of uses.
Figure 3.1: Vehicle in a blizzard [12]
Lighting
By monitoring ambient light in the real environment, video cameras can provide detailed contextual information. The exposure of an image controls how light or dark it looks once the camera has taken it. The problem has been researched for many years, yet it is still a difficult one for vehicle detection. Practical computer vision systems are deployed in metropolitan settings where lighting conditions change over time, locally or globally. Local illumination changes refer to the shadows or highlights of moving objects in the image, while global illumination changes refer to varied weather and daytime/night conditions in the scene.
Due to ambient light, the shapes of automobiles are noticeable during the day, and a variety of techniques can be used to extract the contour and texture information of automobiles in photos. However, during the period between dawn and nightfall, illumination shifts radically. Ambient light, being an uncontrolled environmental component, creates extra difficulty in determining vehicle appearance in low-light settings.
Vehicle detection in tunnels can also be classified as a special lighting scenario Verticalmargins in photographs may become blurry or even disappear due to camera overexposure totunnel lights
Figure 3.2: Minimal light condition [12]
3.2 Proposed methodologies
3.2.1 Detection - YOLOv7
YOLO [1] uses a single convolutional neural network to predict bounding boxes and class probabilities, considering the entire image in a single evaluation. In one step, YOLO predicts multiple bounding boxes, the class probabilities for each box, and all the bounding boxes across the classes, making it a one-stage detection model. Unlike earlier object detection models, which localize objects in images by using the regions of the image with the highest probability of containing an object, YOLO considers the full image.
Procedure
Figure 3.3: General object detection steps
In order to solve the detection problem, there are a few steps to follow to produce an inference (a minimal sketch of this pipeline follows the list):
- Input: images in 'bmp', 'jpg', 'jpeg', 'png', 'tif', 'tiff', 'dng', 'webp', or 'mpo' format, or videos in 'asf', 'avi', 'gif', 'm4v', 'mkv', 'mov', 'mp4', 'mpeg', 'mpg', 'ts', or 'wmv' format; the input can be of any resolution. Performance is higher at lower video resolutions, since there is less to look at and thus fewer regions of interest are produced.
- Preprocessing: since the input files can be of different shapes and sizes, a preprocessing step is required. The video's frames are saved as a list of images, and then these images are resized to a predefined shape.
- YOLOv7 model: the list of preprocessed images is passed to the pretrained model to produce detections. There is a clear trade-off between model inference speed and accuracy. Since inference speed matters here and the chosen dataset also comes in low resolution, we use the base YOLOv7, which is the fastest.
- Output: the final output of the whole process is the bounding box of each object. This is represented as (x1, y1, x2, y2, c), where (x1, y1) is the coordinate of the top-left corner and (x2, y2) is the bottom-right corner of the bounding box, and c is the class confidence from the model's prediction of how likely that object is a car or a truck.
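A minimal sketch of the pipeline above is given below. It assumes `model` is an already loaded, pretrained YOLOv7 network and that `detect(model, img)` returns a list of (x1, y1, x2, y2, c) tuples; both names are placeholders, not the actual YOLOv7 repository API.

```python
import cv2

def run_on_video(path, model, detect, input_size=640):
    cap = cv2.VideoCapture(path)  # Input: any supported video format
    all_detections = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Preprocessing: resize every frame to the model's predefined input shape.
        resized = cv2.resize(frame, (input_size, input_size))
        # Model: the pretrained YOLOv7 produces bounding boxes with confidences.
        all_detections.append(detect(model, resized))
    cap.release()
    return all_detections
```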
Figure 3.4: Average precision and frames-per-second trade-off for the YOLOv7 family [14]
Figure 3.5: Example result of vehicle detection with type and class confidence
The YOLO architecture is based on a convolutional neural network (with fully connected prediction layers in the original design). It predicts the bounding box coordinates and class probabilities for these boxes using the entire image as a single instance [1]. The most significant advantage of YOLO is its incredible speed; it also learns a generalized representation of objects. It is one of the best object detection algorithms, with accuracy comparable to the R-CNN algorithms while processing at a much higher speed. Since the main purpose of this research is vehicle detection in surveillance videos, our input is a stream of images, and YOLO's speed allows us to process it at an acceptable frames per second (FPS).
Figure 3.6: YOLOv7 performance compared to other former YOLO models [14]
High level idea
YOLO passes the image (n × n) once through the fully convolutional neural network, as opposed to other region-proposal classification networks, which perform detection on various region proposals and thus end up running prediction multiple times for different regions of an image.
The system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.
Then B bounding boxes and confidence scores are predicted for each grid cell. These confidence scores reflect both the model's belief that the box contains an object and how accurate it believes the predicted box to be. In the algorithm, confidence is defined as Pr(Object) × IOU(pred, truth). The confidence score should be zero if no object exists in that cell; otherwise, it should equal the intersection over union (IOU) between the predicted box and the ground truth.
There are five predictions in each bounding box: x, y, w, h, and confidence. The (x, y) coordinates represent the box's center relative to the bounds of the grid cell. The width and height are predicted relative to the entire image. Finally, the confidence prediction represents the IOU between the predicted box and any ground truth box.
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. Regardless of the number of boxes B, the model predicts only one set of class probabilities per grid cell. At test time, we multiply the conditional class probabilities by the individual box confidence predictions,

Pr(Class_i | Object) × Pr(Object) × IOU(pred, truth) = Pr(Class_i) × IOU(pred, truth),

which gives class-specific confidence scores for each box. These scores represent both the likelihood of that class appearing in the box and how well the predicted box fits the object.
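This score combination can be sketched with array broadcasting; the shapes follow the original YOLO configuration (S = 7, B = 2, C = 20), and the random arrays stand in for real network outputs.

```python
import numpy as np

S, B, C = 7, 2, 20
box_conf = np.random.rand(S, S, B)    # Pr(Object) * IOU for each predicted box
class_prob = np.random.rand(S, S, C)  # Pr(Class_i | Object) for each grid cell

# Class-specific confidence for every box:
# Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
scores = class_prob[:, :, None, :] * box_conf[:, :, :, None]  # shape (S, S, B, C)
```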
YOLO architecture
Image frames are first passed through a backbone, which extracts features that are then combined and mixed in the neck. These features are passed along to the head, where YOLO predicts the bounding boxes, the classes of the bounding boxes, and the objectness of the bounding boxes. To understand how YOLO manages to pull off such performance, we need to look at each module separately.
Figure 3.7: General architecture of YOLO
Phase 1: Input
First, the input layer is nothing but the image input we provide; it can be a two-dimensional array with three channels (red, green, and blue), or a video input in which each frame makes up an item in a list of images.
Phase 2: Backbone
The backbone is a deep neural network composed mainly of convolutional layers. Its main objective is to extract the essential features; the selection of the backbone is a key step, as it will improve the performance of object detection. Often pre-trained neural