VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
——————– * ———————
GRADUATE THESIS
3D OBJECT POSE DETECTION
FROM IMAGE
Major: Computer Science
Council: COMPUTER SCIENCE 02
Supervisor: Dr NGUYEN DUC DUNG
Reviewer: Msc LUU QUANG HUAN
—o0o—
Student: BUI VIET MINH QUAN (1710259)
HO CHI MINH CITY, 08/2021
We hereby declare that this thesis titled ‘3D OBJECT POSE DETECTION FROM IMAGE’ and the work presented in it are our own. We confirm that:
• This work was done wholly or mainly while in candidature for a degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where we have consulted the published work of others, this is always clearly attributed.
• Where we have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely our own work.
• We have acknowledged all main sources of help.
• Where the thesis is based on work done by ourselves jointly with others, we have made clear exactly what was done by others and what we have contributed ourselves.
First of all, we would like to express our greatest respect and gratitude to our supervisors, Dr. Nguyen Duc Dung and Dr. Pham Hoang Anh, for their profound trust, support, and guidance. They have spent so much time helping us revise this work rigorously. Furthermore, they gave us much good advice and strengthened our motivation when we were struggling. Their enthusiasm and energy have inspired us so much in our work. We surely could not have completed this thesis without them.
We thank our beloved colleagues, friends, and all the people who have encouraged us during this stage of our life. We thank HCMC University of Technology and the Faculty of Computer Science and Engineering for giving us this wonderful experience.
Monocular 3D object detection has recently become prevalent in autonomous driving and navigation applications due to its cost-efficiency and ease of embedding in existing vehicles. The most challenging task in monocular vision is to reliably estimate an object’s location because of the lack of depth information in RGB images. Many methods tackle this ill-posed problem by directly regressing the object’s depth or by taking the depth map as a supplementary input to enhance the model’s results. However, the performance relies heavily on the quality of the estimated depth map, which is biased toward the training data. In this work, we first propose depth-adaptive convolution to replace the traditional 2D convolution in order to deal with the divergent context of the image’s features. This leads to significant improvement in both training convergence and testing accuracy. Second, we propose a ground plane model that utilizes geometric constraints in the pose estimation process. With the new method, named GAC3D, we achieve better detection results. We demonstrate our approach on the KITTI 3D Object Detection benchmark, where it outperforms existing monocular methods. Benefiting from its simple structure, our method is much faster than many state-of-the-art methods and enables real-time inference.
Table of contents
1.1 Introduction
2 Background knowledge
2.1 Input Sensors
2.1.1 LiDAR
2.1.2 Camera
2.2 Camera Model
2.2.1 The Perspective Projection Matrix
2.2.2 The Intrinsic Parameters and the Normalized Camera
2.3 Object Detection
2.3.1 Feature Extraction
2.3.2 Regional Proposal
2.3.3 Object Classification
2.3.4 Object Regression
2.4 Feed-forward Neural Network
2.4.1 Overview
2.4.2 Activation Function
2.4.3 Loss Function
2.4.4 Gradient Descent
2.5 Convolutional Neural Network
2.6 Least Squares Problems and Singular Value Decomposition
2.6.1 Least Squares Problems
2.6.2 Singular Value Decomposition and Pseudo-inverse Matrix
2.7 CenterNet: Object Detector by Keypoints
2.7.1 Neural Network Layout
2.7.2 Loss Function
3 Datasets and Metrics
3.1 Datasets
3.2 Metrics
3.2.1 Fundamental terminologies
3.2.2 Definitions of various metrics
4 Related Work
4.1 Literature review
4.1.1 LiDAR-based 3D object detection
4.1.2 Monocular 3D object detection using representation transformation
4.1.3 Monocular 3D object detection using anchor-based detector
4.1.4 Monocular 3D object detection using center-based detector
4.1.5 Summary
5 GAC3D
5.1 Base Detection Network
5.1.1 Overview architecture
5.1.2 Backbone
5.1.3 Center Head
5.1.4 Keypoints Head
5.1.5 Pseudo-contact Point Head
5.1.6 Orientation Head
5.1.7 Dimension Head
5.1.8 3D Confidence head
5.2 Detection Network With Depth Adaptive Convolution
5.2.1 Depth Adaptive Convolution
5.2.2 Depth Adaptive Detection Head
5.2.3 Variant Guidance For Depth Adaptive Convolution
5.3 End-to-End Depth Guidance
5.4 Losses
5.5 Geometric Ground-Guide Module
5.5.1 Object’s Pseudo-position
5.5.2 2D-3D Transformation
5.6 Summary
6 GAC3D Implementation
6.1 Image Preprocessing and Data Augmentation
6.2 Network Implementation
7 Performance Analysis
7.1 Quantitative Results
7.2 Qualitative Results
7.3 Ablation Study
7.3.1 Accumulated impact of our proposed methods
7.3.2 Evaluation on Depth Adaptive Convolution
7.3.3 Evaluation on the impact of depth estimation quality on Adaptive Convolution
7.4 Abnormal detection cases of KITTI dataset
List of tables
2.1 Output layer activation functions and loss functions
6.1 Software specification of the training machine
7.1 KITTI Object Detection benchmark of our GAC3D method
7.2 KITTI Object Detection benchmark of our GAC3D-E2E-Lite method
7.3 KITTI Object Detection benchmark of our GAC3D-E2E method
7.4 Comparative results on the KITTI 3D object detection test set of the Car category
7.5 Comparative results of the Pedestrian and Cyclist on KITTI test set
7.6 Evaluation on accumulated improvement of our proposed methods on KITTI val set
7.7 Impact of Geometric Ground-Guide Module for 3D object detection on the KITTI val set
7.8 Comparisons of different depth estimation quality for 3D object detection on KITTI val set
11.1 The implementation of the detection network with ResNet-18 backbone
11.2 The implementation of the detection network with DLA-34 backbone
List of figures
1.1 An example of occluded object in driving scenario
2.1 Front-view-camera image and LiDAR 3D point cloud from a sample of KITTI dataset [20]
2.2 Image formation in a pinhole camera
2.3 The pinhole camera model
2.4 The optical axis, focal plane, and retinal plane
2.5 The coordinate system
2.6 Changing coordinate systems in the retinal plane
2.7 A neural network with 1 hidden layer
3.1 Examples from the KITTI dataset (from left color camera) [19]
3.2 Recording platform of KITTI dataset [19]
4.1 Network architecture of PointNet [50]
4.2 PV-RCNN [57] proposed architecture
4.3 CaDDN [52] proposed architecture
4.4 Predefined 3D anchors in M3D-RPN [6]
4.5 M3D-RPN [6] proposed architecture and the depth aware convolution
4.6 D4LCN [15] proposed architecture
4.7 Ground aware convolution from [39]
4.8 Network structure of SMOKE [41]
4.9 Proposed architecture of RTM3D [34]
4.10 KM3D-Net [33] proposed architecture
5.1 Overview of the proposed network
5.2 Overview of the baseline network
5.3 The layers configuration of a residual block
5.4 The ground truth heatmap of 2D bounding box’s center
5.5 Illustration of keypoints regression
5.6 The projected location on 2D image plane of the pseudo-contact point
5.7 Equal egocentric angles (left) and equal allocentric angles (right)
5.8 The moving car has the same egocentric (global) orientation (moving forward) in 3 frames while the allocentric (local) orientation with respect to the camera is changed
5.9 Egocentric (green color) and allocentric (orange color) angles in the bird’s-eye-view. Red arrow indicates the heading of the car while blue arrow is the ray between the origin and the car’s center
5.10 The orientation decomposition. We decompose the observation angle into three components: axis classification, heading classification, and relative offset regression
5.11 The default dimension setting of KITTI dataset and the impact of visual appearance of object on dimension regression
5.12 Illustration of 3D confidence outputs. Green: prediction, red: ground truth. The further car with less accurate 3D pose estimation yields lower 3D confidence score
5.13 Depth Adaptive Convolution Detection Head
5.14 Illustration of depth estimations guiding the depth adaptive convolution
5.15 Depth Adaptive Convolution Detection Head With Variant Guidance
5.16 The architecture of our detection network with end-to-end depth estimation
5.17 Illustration of ground truth (red color) and predicted (blue color) 3D bounding box in bird’s-eye-view
5.18 The depth of the ground plane generated from the extrinsic information of camera
5.19 Object’s pseudo-position P and related terms in the inference process using camera model and ground plane model
6.1 Color jittering augmentation
6.2 Horizontal flipping augmentation
6.3 Scaling and shifting augmentation
7.1 Illustration of unlabeled cases in KITTI val set
7.2 Visualization of the objects’ 2D centers heatmap from the center head of the detection network. Left: The monocular image from KITTI val split. Right: The corresponding detected heatmap of objects’ 2D centers for class Car
7.3 Visualization of the depth branch’s output. Left: The monocular image from KITTI val split. Right: The corresponding end-to-end depth estimation from the depth branch
7.4 Visualization of the 2D projected objects’ keypoints prediction. Top: The cropped region of cars from monocular image. Bottom: The nine projected keypoints estimated from the keypoints head
7.5 Detailed visualization of our 3D detection result
7.6 Qualitative illustration of our monocular 3D detection results (left: val set, right: test set). Green color: our predictions, red color: ground truth, dot: projected 3D center, diagonal cross: heading of object
7.7 Illustration of multi-class detection on KITTI val set. Green: Car, Yellow: Cyclist, Purple: Pedestrian
7.8 Visualization of the impact of pseudo-position for refining object’s position. Red: groundtruth z-position, Green: predicted z-position
7.9 Visualization of the impact of pseudo-position for refining object’s position. Red: groundtruth z-position, Green: predicted z-position
7.10 Trajectories of the optimization process for each detection head with standard convolution and depth adaptive convolution operation
7.11 Abnormal detection cases from KITTI val set. Left: ground truth labels, right: our predictions
9.1 Visualization of the detection result for image ’000422’ of KITTI val set
9.2 Visualization of the detection result for image ’000527’ of KITTI val set
10.1 Failure case: objects in congested situation
10.2 Failure case: very distant object
10.3 Failure case: highly occluded and truncated objects
Chapter 1
Introduction
In recent years, with the evolution of deep neural networks in computer vision [25, 24, 8], various methods have been proposed for the 2D object detection task [54, 23, 53, 67, 61], achieving remarkable performance that almost approaches human visual perception. Even so, in particular fields such as autonomous driving or infrastructure-less robot navigation, the demand for scene understanding, including detailed 3D poses, identities, and scene context, is still high. Researchers therefore pay attention to 3D object detection, especially in autonomous navigation applications. To obtain an accurate depth map of the environment, LiDAR sensors are widely adopted because of the reliable 3D point clouds they acquire using laser technology. Although such LiDAR-based systems [58, 29, 57, 22] achieve promising results, they also come with visible limitations, including high sensor cost, difficulty of mounting on vehicles, and sparse, unstructured data. Therefore, an alternative solution using a single RGB camera is required. It is far more affordable, versatile, and available on almost every modern vehicle. The main difficulty of image-based 3D object detection is the lack of depth information, which results in a significant performance gap compared to LiDAR-based methods. While stereo systems are available, their cameras have to be calibrated with relatively high accuracy. Commercial cameras, such as Bumblebee, are compact systems with low cost. The quality of the obtained depth map, however, is nowhere near the required standard for autonomous driving systems, especially in outdoor environments. Estimating depth via a single camera is a good choice since humans also perceive depth from 2D images. However, the accuracy of monocular depth estimation is not as good as that of time-of-flight (ToF) sensors like LiDAR. That makes the 3D object detection task on a single camera more challenging and has brought much attention from the community. In this work, we propose a 3D object detection system based on a single camera view. We also demonstrate that our proposed framework can bridge the gap between LiDAR and image-based detectors.
Figure 1.1: An example of an occluded object in a driving scenario: the red dot representing the 2D center of the car lies on the visual appearance of the other car
There are two main approaches in monocular 3D object detection: representation transformation and the 2D convolutional neural network (CNN). In the representation transformation approach, the general idea is to imitate the 3D point clouds of LiDAR by estimating depth information from images. This depth map is then projected into 3D space to generate pseudo point clouds. With the pseudo point clouds, one can employ algorithms designed for LiDAR data to detect objects. However, raw LiDAR point clouds are sparse due to the physical principle of the range laser sensor, whereas pseudo point clouds are considerably denser and depend heavily on the quality of the estimated depth map. Thus, applying object pose detection on these point clouds decreases the performance significantly. Besides, pseudo-LiDAR methods consist of several separate steps, which usually require training the depth estimation model separately, leading to training that is not end-to-end and to time-consuming inference.
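The projection of a depth map into 3D space mentioned above can be sketched as follows. This is a minimal illustration, assuming an ideal pinhole camera with focal lengths `fx`, `fy` and principal point `(cx, cy)`; the function name and the toy values are hypothetical and are not taken from any particular pseudo-LiDAR implementation.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (list of rows, in metres) into
    3D points expressed in the camera frame."""
    points = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column
            if z <= 0:                       # skip pixels without valid depth
                continue
            x = (u - cx) * z / fx            # invert u = fx * x / z + cx
            y = (v - cy) * z / fy            # invert v = fy * y / z + cy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel (illustrative values only).
depth = [[10.0, 10.0],
         [0.0, 20.0]]
cloud = depth_to_points(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Every valid pixel yields one 3D point, which is why pseudo point clouds are much denser than raw LiDAR scans and why any error in the estimated depth directly displaces the resulting points.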
On the other hand, the 2D CNN approaches extend the architecture of 2D object detectors to the 3D output representation and add several techniques to solve the ill-posed problem. In M3D-RPN [6] and D4LCN [15], the authors redefine the anchor from YOLO [53] as a 3D anchor by adding dimension, orientation, and depth information. [37] and [46] follow the pipeline of two-stage detectors like Faster R-CNN [54] to detect the 2D bounding box of the object and then regress the 3D box. In the proposal stage, they localize the 2D bounding boxes with additional regressed 3D attributes. In the second stage, the 3D bounding box is reconstructed via an optimization process that leverages the geometric constraints between the projected 3D box and the corresponding 2D box. These methods rely heavily on an accurate 2D detector: even a small error in the 2D bounding box can cause a poor 3D prediction.
Inspired by the anchor-free architecture of the 2D detector CenterNet [67], SMOKE [41] and RTM3D [34] add several regression heads parallel to the primary 2D center detection head to regress 3D properties. These anchor-free approaches are more lightweight and flexible than anchor-based approaches since they do not have to pre-define 3D anchor boxes, which are more complicated than their 2D counterparts. Unlike regular 2D object detection datasets such as COCO [36] and Pascal VOC [16], 3D datasets like KITTI [20] usually contain occluded objects due to the driving scenario of the data collection process. As illustrated in Fig. 1.1, in a dense object scene, the 2D center of the occluded car is located on the appearance of another instance, which potentially causes errors in the pose estimation process. Moreover, the standard convolution operation is content-agnostic [60], which means that once trained, the kernel weights remain unchanged regardless of the variance of the input scenario. Thus, the center misalignment phenomenon makes it hard for a center-based detector with traditional convolution filtering to identify object locations accurately. To overcome this issue, we introduce a novel convolution operation called the depth adaptive convolution layer, which leverages external guidance from a pre-trained depth estimator to enhance feature selection for regression and detection tasks. Our novel convolution filtering applies a set of secondary weights to the original convolution kernel based on the depth value variance at each pixel. As a consequence, this operator improves the precision and robustness of center-based object prediction.
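To make the idea of secondary weights concrete, the sketch below re-weights each tap of a 3×3 filter by how close its depth is to the depth at the window center. The Gaussian form of the secondary weight is an assumption made for illustration, in the style of pixel-adaptive filtering; the formulation actually used in this thesis is the one defined in Chapter 5, and the function name and toy values are hypothetical.

```python
import math


def depth_adaptive_response(feat, depth, kernel, r, c):
    """Response of one 3x3 filter at interior pixel (r, c).

    Each tap is multiplied by a secondary weight that decays with the
    depth difference between the tap and the window center.
    """
    out = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            diff = depth[r + dr][c + dc] - depth[r][c]
            guide = math.exp(-0.5 * diff * diff)  # assumed Gaussian secondary weight
            out += kernel[dr + 1][dc + 1] * guide * feat[r + dr][c + dc]
    return out


feat = [[1.0] * 3 for _ in range(3)]
kernel = [[1.0] * 3 for _ in range(3)]
flat_depth = [[30.0] * 3 for _ in range(3)]
# Left column belongs to a much closer, occluding object.
mixed_depth = [[10.0, 30.0, 30.0] for _ in range(3)]

same = depth_adaptive_response(feat, flat_depth, kernel, 1, 1)
masked = depth_adaptive_response(feat, mixed_depth, kernel, 1, 1)
```

With uniform depth the operator reduces to the ordinary convolution (all nine taps contribute), whereas the three taps that fall on the nearer object are almost entirely suppressed in the second call. This is the behavior one would want when the 2D center of an occluded car lies on another instance.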
In autonomous navigation and robotics applications, most moving obstacles stand on a ground plane. Thus, the height difference between the mounted camera and the obstacles’ bottoms is almost constant for each vehicle and is equal to the camera’s height. Moreover, assuming that the ground plane is parallel to the axis of the camera, we can reproject a 2D location in the image onto the ground plane to get the z-coordinate of the object’s bottom. This assumption holds for most real driving scenarios. The reprojection can significantly compensate for the lack of depth information in the monocular image. The geometric information obtained from these perspective priors can help mitigate the ill-posed problem in monocular 3D object detection.
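Under the flat-ground assumption, the depth of a ground-contact pixel follows from similar triangles: z = f_y · h / (v − c_y). The sketch below is a minimal illustration, assuming a pinhole camera whose optical axis is parallel to the ground and image rows that increase downward; the focal length `fy`, principal point row `cy`, and camera height `cam_height` are illustrative values, not calibrated ones.

```python
def ground_depth(v, fy, cy, cam_height):
    """Depth z (metres) of a point on the ground seen at image row v.

    Assumes a flat ground plane parallel to the optical axis and a camera
    mounted cam_height metres above it.
    """
    if v <= cy:
        raise ValueError("row is at or above the horizon; no ground intersection")
    return fy * cam_height / (v - cy)


# A contact point 70 rows below the principal point (illustrative numbers).
z = ground_depth(v=250.0, fy=700.0, cy=180.0, cam_height=1.65)
```

Rows closer to the horizon (v approaching c_y) map to rapidly growing depths, so a one-pixel error matters far more for distant objects than for nearby ones.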
In this work, we propose a single-stage monocular 3D object detector employing ideas from the above discussions. We name this method GAC3D (Geometric ground-guide and Adaptive Convolution for 3D object detection). Our work consists of three main contributions:
• Employ a novel depth adaptive convolution that acts as a set of secondary weights adapting to the depth variance at every pixel.
• Introduce the concept of pseudo-position that serves as an initial value for the estimation process.
• Introduce a ground-guide module to infer the 3D object bounding box information from 2D regression results.
In the next chapter, we discuss the background knowledge required for this work. Some of this knowledge is familiar within the computer vision discipline, including the basics of sensor and camera models and object detection techniques. We also present deep learning models and one important network called CenterNet for object detection using keypoints. In Chapter 3, we describe the dataset and metrics for evaluating this work. We then discuss the literature in Chapter 4. In Chapter 5, we present the details of our proposed method. The implementation of our method and the performance analysis are discussed in Chapters 6 and 7. Finally, we discuss some issues and conclude our work in Chapter 8.
Chapter 2
Background knowledge
In this chapter, we present the theoretical background required for this work. Section 2.1 introduces different types of input sensors in the autonomous driving field. Next, we go through some concepts of the camera model in Section 2.2. A brief description of object detection is given in Section 2.3. This is followed by an explanation of the basic concepts of artificial neural networks in Section 2.4 and convolutional neural networks in Section 2.5. Section 2.6 provides the fundamental mathematical concepts of the least squares problem and the singular value decomposition. The chapter ends with an explanation of the CenterNet [67] object detection method in Section 2.7.
2.1 Input Sensors
To drive safer, autonomous vehicles must “see” better than humans. Thus, a reliable vision system is a critical factor for self-driving cars. Automotive manufacturers commonly use two types of sensors in autonomous vehicles: LiDAR and camera. Figure 2.1 presents the two types of input from a camera and a LiDAR.
2.1.1 LiDAR
Light Detection and Ranging (LiDAR) is a commonly used sensor in autonomous vehicle applications. It emits laser pulses and captures the reflected pulses to measure the distance to surrounding objects, creating a 3D image called a point cloud, where each data point corresponds to a real-world 3D location at which the emitted light is reflected. Typically, data points are saved as a list of positions in space (x, y, z) and reflective intensities r. Its advantages include impressively accurate depth perception, which allows LiDAR to measure the distance to an object to within a few centimeters, up to dozens of meters away. However, a key factor that hinders the use of LiDAR for commercial purposes is its high cost.
2.1.2 Camera
An RGB camera is a cheap and versatile sensor that is available in modern vehicles. A digital camera gathers information from the real world by capturing the light intensity to build up an electric charge at every pixel location. This essentially means that information about how far away objects are is lost, as there is no depth information in typical 2D images.
A variant of the digital camera is the stereo camera, which consists of two or more lenses with a separate image sensor for each lens. This design allows the camera to simulate human binocular vision and, therefore, to capture three-dimensional images.
(a) Front-view-camera image
(b) LiDAR 3D point cloud. Purple points: points in the field of view of the camera
Figure 2.1: Front-view-camera image and LiDAR 3D point cloud from a sample of the KITTI dataset [20]
Compared with LiDAR, a camera is significantly smaller and cheaper, and has the advantage of better resolution and color. However, by creating a 3D cloud of points, LiDAR is far better at depth perception than the camera, making it more suitable and reliable in 3D recognition and detection tasks.
2.2 Camera Model
Consider the system depicted in Fig. 2.2, which consists of two screens. A small hole has been punched in the first screen, and through this hole, some of the rays of light emitted or reflected by the object pass, forming an inverted image of that object on the second screen. We can directly build a geometric model of the pinhole camera as indicated in Fig. 2.3. It consists of a plane X called the retinal plane in which the image is formed through an operation called a perspective projection: a point C, the optical center, located at a distance f, the focal length of the optical system, is used to form the image m in the retinal plane of the 3D point M as the intersection of the line (C, M) with the plane X.
The optical axis is the line going through the optical center C and perpendicular to X, which it pierces at a point c. Another plane of interest (see Fig. 2.4) is the plane F going through C and parallel to X. It is called the focal plane.
Figure 2.2: Image formation in a pinhole camera
Figure 2.3: The pinhole camera model
Figure 2.4: The optical axis, focal plane, and retinal plane
Figure 2.5: The coordinate system
2.2.1 The Perspective Projection Matrix
We choose the coordinate system (C, x, y, z) for the three-dimensional space and (c, u, v) for the retinal plane as indicated in Fig. 2.5. The coordinate system (C, x, y, z) is called the standard coordinate system of the camera. The relationship between image coordinates and 3-D space coordinates can be written as:
2.2.2 The Intrinsic Parameters and the Normalized Camera
Figure 2.6: Changing coordinate systems in the retinal plane
We go from the old coordinate system, which is centered at the intersection c of the optical axis with the retinal plane and has the same units on both axes, to the new coordinate system, which is centered at a point $c_n$ in the image (usually one of the corners) and will sometimes have different units on both axes due to the electronics of acquisition. For a pixel m we have
$$\overrightarrow{c_n m} = \overrightarrow{c_n c} + \overrightarrow{c m} \tag{2.6}$$
Writing $\overrightarrow{cm} = u_{old}\,\vec{i} + v_{old}\,\vec{j}$ in the old coordinate system and introducing the scaling from the old coordinate system $(\vec{i}, \vec{j})$ to the new $(\vec{I}, \vec{J})$, we have $\vec{i} = s\vec{I}$ and $\vec{j} = s\vec{J}$ with
$$s = \begin{bmatrix} k_u & 0 \\ 0 & k_v \end{bmatrix} \tag{2.7}$$
We can denote $\overrightarrow{c_n c}$ by $t$ in the new coordinate system, and this allows us to rewrite equation 2.6 in projective coordinates as $m_{new} = H m_{old}$ where
$$H = \begin{bmatrix} s & t \\ 0_2^T & 1 \end{bmatrix}$$
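Read together, equations 2.6 and 2.7 say that a pixel is scaled by $k_u$ and $k_v$ and then shifted by $t$. The sketch below illustrates that change of coordinates, assuming $H$ has the affine form [[k_u, 0, t_u], [0, k_v, t_v], [0, 0, 1]]; the function names and the numeric values are illustrative only.

```python
def make_H(ku, kv, tu, tv):
    """3x3 matrix that scales by (ku, kv) and then translates by (tu, tv)."""
    return [[ku, 0.0, tu],
            [0.0, kv, tv],
            [0.0, 0.0, 1.0]]


def apply_H(H, m_old):
    """Map a 2D point through H using projective (homogeneous) coordinates."""
    vec = (m_old[0], m_old[1], 1.0)
    x, y, w = (sum(H[i][j] * vec[j] for j in range(3)) for i in range(3))
    return (x / w, y / w)


# Old coordinates are centred on the optical axis; the new origin is an
# image corner, so the principal point lands at (320, 240) in this example.
H = make_H(ku=2.0, kv=4.0, tu=320.0, tv=240.0)
m_new = apply_H(H, (10.0, -5.0))
```

Writing the transformation as a single matrix is convenient because it can later be composed with the perspective projection by plain matrix multiplication.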
while containing the entire object. Additionally, each produced bounding box contains a label referring to the classification of the object. The object detection problem is partly a regression problem, in terms of finding the location and size of the bounding box, and partly a classification problem, in terms of labeling each of the regressed bounding boxes.
Deep learning approaches are well suited for the task of object detection [54, 23, 53, 67, 61]. Typically, they use a deep neural network, usually taken from the image classification task, to extract features from the input data. Several techniques can then be applied to these output feature maps to regress the bounding boxes together with their predicted classes.
Object detection methods can generally be split into four parts: feature extraction, regional proposals, classification, and regression. All these terms are described in this section, while Section 2.4 presents the basics of how a neural network works.
2.3.1 Feature Extraction
The digital image is the most common input for the object detection task. An image is represented by a 2-dimensional matrix where each element can be a single value (gray-scale image) or a 3-tuple (RGB image). Getting a computer to understand the contents of an image through the matrix representation is a challenging task, especially as image data are very sensitive to external environmental factors such as lighting conditions and weather. The solution is to learn general features from the actual data and find some representation that separates the object of interest from the background and other objects.
Features within an image are typically represented through patches of pixels. In traditional computer vision, descriptors such as Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) look at the pixel values within patches and build features from the direction of greatest change, known as the oriented gradient.
With the evolution of deep learning, feature extractors that use convolutional neural networks are preferred. These networks can assign importance, in the form of weights, to specific properties in patches, creating more abstract feature representations. The networks are trained by being shown examples of inputs and their expected outputs, finding which properties in the input are essential to make the desired decisions. Given enough examples, the network would ideally differentiate one class from another with the help of the learned features.
2.3.2 Regional Proposal
One difficulty of object detection is that the number of objects in an environment can vary, so the output size differs between frames. A classical method to deal with this problem is the sliding window, which consists of sliding a window across the input image, horizontally and vertically, to extract smaller sections, where each section is considered a region of interest that may contain an object. The scale and size of the sliding window are also altered to handle objects of various shapes. The generated region proposals can then be interpreted as unclassified bounding boxes, that is, bounding boxes for the model to explore further to determine whether they contain objects or not.
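The sliding-window procedure described above can be sketched in a few lines. This is a minimal illustration; the function names, the image size, the window sizes, and the stride are hypothetical choices, not settings taken from a specific detector.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x1, y1, x2, y2) windows that fit entirely inside the image."""
    for y in range(0, img_h - win_h + 1, stride):       # slide vertically
        for x in range(0, img_w - win_w + 1, stride):   # slide horizontally
            yield (x, y, x + win_w, y + win_h)


def multi_scale_proposals(img_w, img_h, sizes, stride):
    """Repeat the scan with several window sizes to cover different shapes."""
    boxes = []
    for win_w, win_h in sizes:
        boxes.extend(sliding_windows(img_w, img_h, win_w, win_h, stride))
    return boxes


# An 8x6 image scanned with a square and a wide window (toy values).
proposals = multi_scale_proposals(8, 6, sizes=[(4, 4), (8, 4)], stride=2)
```

Even this toy setting produces eight candidate boxes; on real images the number of windows grows quickly with resolution and with the number of scales, which is the main cost of the approach.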
2.3.3 Object Classification
Each of the aforementioned region proposals (bounding boxes) is subject to the classification task to find which category the bounding box belongs to. The classification is implemented through a separate module or as an integrated part of the model generating the region proposals. The classification yields a probability for each class to be present in the proposed bounding boxes. Typically, an additional class referring to the background is added to discard bounding boxes without objects in them. There are two well-known classifiers for this module: the Support Vector Machine (SVM) [12] and the softmax classifier. The first version of R-CNN [21] used an SVM as the head of the object detector, while the softmax classifier is more commonly used in recent detector architectures.
2.3.4 Object Regression
The regression process determines the object’s position and dimensions. In a one-stage detection architecture, the regression task is integrated with the classification task, while in a two-stage detector, it is designed separately and applied to each object proposal. To reduce the search space of the regression task, many detection methods [54, 38, 53] leverage anchor boxes to regress the object’s bounding box. Anchor boxes are predefined bounding boxes of a certain height and width, with different scales and aspect ratios to capture various types of objects. During detection, the predefined anchor boxes are tiled across the image. Therefore, the network does not directly predict bounding boxes but rather regresses the relative offsets with respect to the tiled anchor boxes.
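The following sketch illustrates anchor generation and offset regression targets. It assumes the center-size offset parameterization popularized by Faster R-CNN [54] (normalized center shifts and log-scale size ratios); the function names, the base size, and the example boxes are illustrative assumptions.

```python
import math


def make_anchors(base_size, scales, ratios):
    """Return (w, h) anchor shapes; ratio = w / h, area = (base_size * scale)^2."""
    anchors = []
    for s in scales:
        for r in ratios:
            area = (base_size * s) ** 2
            w = math.sqrt(area * r)
            anchors.append((w, w / r))
    return anchors


def encode(anchor, gt):
    """Offsets of a ground-truth box relative to an anchor; boxes are (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode(anchor, offsets):
    """Inverse of encode: recover the box from an anchor and predicted offsets."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


anchors = make_anchors(16, scales=[1, 2], ratios=[0.5, 1.0, 2.0])
anchor = (50.0, 50.0) + anchors[1]      # square 16x16 anchor tiled at (50, 50)
gt = (54.0, 46.0, 32.0, 8.0)
offsets = encode(anchor, gt)
restored = decode(anchor, offsets)
```

Because the targets are expressed relative to the anchor, the network only has to learn small, bounded corrections rather than absolute image coordinates.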
2.4 Feed-forward Neural Network
2.4.1 Overview
Neural networks, also known as artificial neural networks (ANNs) or multi-layer perceptrons (MLPs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking how biological neurons signal to one another. Neural networks have become the most popular machine learning model in the past decade. With their tremendous power of approximating functions, they can be used for various pattern recognition tasks, including image recognition and speech recognition.
A neural network can be regarded as a complex function. The simplest form of neural network is the feed-forward neural network. Feed-forward neural networks usually consist of many layers stacked on top of each other; for this reason, they are also called deep neural networks (DNNs). Each layer consists of many neurons; collectively, they take a fixed-size vector as input and generate a fixed-size vector as output. Let $h^{(l-1)} \in \mathbb{R}^m$ be the input to the $l$-th layer and $h^{(l)} \in \mathbb{R}^n$ its output; then the behavior of the layer can be expressed as:
$$h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right) \tag{2.10}$$
In this equation, $W^{(l)} \in \mathbb{R}^{n \times m}$ and $b^{(l)} \in \mathbb{R}^n$ are called the weight matrix and the bias vector, and they are the parameters of the $l$-th layer. $\sigma$ is a non-linear function called the activation function, and it is this non-linearity that gives neural networks the power to approximate functions. Figure 2.7 depicts a neural network with 1 hidden layer.
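Equation 2.10 can be evaluated directly. The sketch below is a minimal illustration for one layer with m = 3 inputs and n = 2 outputs, storing the weight matrix with n rows and m columns so that the product with the input vector is defined; the weights, biases, and inputs are arbitrary toy values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense_forward(W, b, h_prev, activation):
    """Compute activation(W h_prev + b) for one layer.

    W is a list of n rows with m entries each, b has n entries,
    and h_prev has m entries.
    """
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, h_prev)) + b_i)
            for row, b_i in zip(W, b)]


W = [[1.0, -1.0, 0.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.5]
h_prev = [2.0, 2.0, 1.0]
h = dense_forward(W, b, h_prev, sigmoid)
```

Stacking several such calls, each feeding its output to the next, gives the forward pass of a feed-forward network.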
2.4.2 Activation Function
Commonly used non-linear functions include element-wise functions such as the logistic sigmoid function (σ), the hyperbolic tangent function (tanh), and the rectified linear unit function (ReLU), as well as LeakyReLU, Maxout, and ELU. The formulas of the first three are given below:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.11}$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2.12}$$
$$\mathrm{ReLU}(x) = \max(x, 0) \tag{2.13}$$
Figure 2.7: A neural network with 1 hidden layer
2.4.3 Loss Function
The training of a neural network is the procedure of learning the layers’ parameters to minimize a scalar loss function. The loss function is usually a sum or average of the error from each instance of the training data. Denote by x the input of an instance and by t its target output, and let y be the actual output of the network when x is fed into it. The form of the contribution L(y, t) of this instance to the loss function depends on the type of the task; the most common forms are listed in Table 2.1. Given the training data, the loss function L on the entire training corpus can be regarded as a function of the network parameters θ. Many algorithms minimize the loss function; most of them depend on the gradient ∇L(θ) of the loss function with respect to the network parameters. The gradient can be computed using a procedure called error backpropagation [56], which in essence is the repeated application of the chain rule of differentiation. Modern deep learning toolkits, such as Theano [3], TensorFlow [2], and PyTorch [49], can perform error backpropagation automatically, so there is no need to derive formulas of the gradient by hand.
Table 2.1: Output layer activation functions and loss functions suitable for different types of machine learning tasks

| Task | Output layer activation | Loss function | Expression of loss function |
| Regression | Linear | Mean squared error (MSE) | $L(y,t) = \lVert y - t \rVert_2^2$ |
| Binary classification | Sigmoid | Binary cross-entropy | $L(y,t) = -\sum_{i=1}^{n} t_i \log y_i - \sum_{i=1}^{n} (1 - t_i) \log(1 - y_i)$ |
| Multi-class classification | Softmax | Categorical cross-entropy | $L(y,t) = -\sum_{i=1}^{n} t_i \log y_i$ or $L(y,t) = -\log \sum_{i=1}^{n} t_i y_i$ |
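The per-instance losses of Table 2.1 can be written directly in NumPy. This is a small sketch; the function names and the clipping constant `eps` (used only to avoid log(0)) are assumptions added for illustration.

```python
import numpy as np


def mse(y, t):
    """Squared L2 distance between output y and target t."""
    return float(np.sum((y - t) ** 2))


def binary_cross_entropy(y, t, eps=1e-12):
    """Binary cross-entropy; y holds sigmoid outputs in (0, 1)."""
    y = np.clip(y, eps, 1.0 - eps)
    return float(-np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)))


def categorical_cross_entropy(y, t, eps=1e-12):
    """Categorical cross-entropy; y is a softmax output, t a one-hot target."""
    return float(-np.sum(t * np.log(np.clip(y, eps, 1.0))))
```

For a one-hot target the two categorical expressions in the table coincide, because only the term of the true class survives the sum.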
is evaluated on a validation corpus (called a checkpoint); if the performance on the validationcorpus stops improving, the learning rate λ is reduced
The traditional gradient descent algorithm (or batch gradient descent) uses the entire training set at every step, which makes each step slow and computationally expensive on large training sets. Therefore, to accelerate training, stochastic gradient descent (SGD) is often employed in practice. At each step, SGD picks a random instance, or more commonly a small random subset called a mini-batch, of the training data and computes the gradient on it alone, which is much faster because far less data is processed at a time. In this way, the parameters get updated more often, and because each mini-batch offers a slightly different gradient, the parameters are less likely to get stuck in a bad local minimum. One complete pass over the entire training data is called an epoch. It is customary to shuffle the mini-batches to avoid the network learning false knowledge from the order of the mini-batches. The learning rate is also often adjusted after each complete pass over the training data; for example, one checkpoint is applied per epoch.
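The following sketch illustrates mini-batch SGD with per-epoch shuffling on a toy linear regression problem. The model, the squared-error loss, and all hyper-parameter values (`lr`, `batch_size`, `epochs`) are assumptions chosen for illustration, not settings used in this thesis.

```python
import numpy as np


def sgd_linear_regression(X, t, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Fit w so that X @ w approximates t using mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - t[idx]
            # Gradient of the mean squared error over the mini-batch only.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w


# Toy, noise-free data generated from a known weight vector.
data_rng = np.random.default_rng(1)
X_toy = data_rng.normal(size=(64, 2))
w_true = np.array([2.0, -3.0])
w_fit = sgd_linear_regression(X_toy, X_toy @ w_true)
```

Each inner iteration touches only `batch_size` rows of the data, which is what makes a single update much cheaper than a full batch gradient step.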
Convolutional Neural Networks (ConvNet, CNN) are very similar to the ordinary Neural Networks described previously: they are made up of neurons with learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of network parameters.
A convolutional neural network usually consists of convolutional layers, interweaved with
pooling layers The data passed between the layers are in the form of 3-dimensional tensors,each slice of which is called a feature map We denote the pth feature map at the output of the
lth layer by the matrix $H^{(l)}_p$. The parameters of a convolutional layer include a 4-dimensional kernel tensor $W^{(l)}$ and a 3-dimensional bias tensor $B^{(l)}$. Let $W^{(l)}_{pq}$ and $B^{(l)}_p$ be 2-dimensional slices of the kernel and bias tensors; then the behavior of a convolutional layer is described by:
$$H^{(l)}_p = \sigma\Big(\sum_{q} W^{(l)}_{pq} * H^{(l-1)}_q + B^{(l)}_p\Big)$$
where the asterisk stands for 2-dimensional convolution, and σ is a non-linear function.
The behavior of pooling layers is simpler. An m × n pooling layer divides each input feature map into regions of m × n pixels (m × n is called the stride of the pooling layer, which here equals the size of the pooling window) and computes statistics for each region as the output. The most common statistics are the maximum and the average. When applied to image recognition, the neural network only needs to make one prediction for an entire image, represented as 1 (for gray-scale images) or 3 (for color images) input feature maps. The layers are usually arranged so that convolutional layers increase the number of feature maps, and pooling layers reduce the size of the feature maps. When the feature maps are sufficiently small, they are often flattened into one single vector, followed by one or more fully connected layers to make the prediction.
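To make the two layer types concrete, here is a small NumPy sketch of a convolutional layer followed by non-overlapping max pooling. It assumes "valid" convolution without padding, a bias slice with the same shape as each output feature map, and tensor layouts `(maps, height, width)`; these choices and all function names are illustrative assumptions.

```python
import numpy as np


def conv2d_valid(x, k):
    """2-D convolution over the 'valid' region (the kernel is flipped)."""
    kf = k[::-1, ::-1]
    kh, kw = kf.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kf)
    return out


def conv_layer(H_in, W, B, activation=np.tanh):
    """H_in: (Q, h, w) input maps; W: (P, Q, kh, kw) kernels; B: (P, h', w') biases."""
    P, Q = W.shape[:2]
    maps = []
    for p in range(P):
        # Sum the convolutions over all input maps q, then add the bias slice.
        acc = sum(conv2d_valid(H_in[q], W[p, q]) for q in range(Q)) + B[p]
        maps.append(activation(acc))
    return np.stack(maps)


def max_pool(H, m, n):
    """Non-overlapping m x n max pooling applied to every feature map."""
    P, h, w = H.shape
    H = H[:, :h - h % m, :w - w % n]  # drop rows/columns that do not fill a region
    return H.reshape(P, h // m, m, w // n, n).max(axis=(2, 4))


# 2 input maps of 6x6 pixels -> 3 output maps of 4x4 -> pooled to 2x2.
H0 = np.arange(2 * 6 * 6, dtype=float).reshape(2, 6, 6)
H1 = conv_layer(H0, np.ones((3, 2, 3, 3)), np.zeros((3, 4, 4)))
H2 = max_pool(H1, 2, 2)
```

The example follows the usual arrangement described above: the convolution increases the number of feature maps (2 to 3) while the pooling layer shrinks each map (4 × 4 to 2 × 2).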
The benefits of CNNs include shift-invariance and locality. For image recognition, shift-invariance means that the prediction for an image should not change when the object of interest moves within the image. CNNs ensure shift-invariance by applying the same convolution kernel to all parts of the input. Locality means that the network has a sense of which parts of the input are next to each other and which parts are far apart. This is ensured by using neurons that receive information only from neurons representing a neighboring region in the layer below.
Decomposition
2.6.1 Least Squares Problems
The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems (sets of equations in which there are more equations than unknowns), i.e., a system Ax = b in which A is a rectangular m × n matrix with more equations than unknowns (m > n). Historically, the method of least squares was used by Gauss and Legendre to solve problems in astronomy and geodesy.
Legendre first published the method in 1805 in a paper on methods for determining theorbits of comets However, Gauss had already used the method of least squares as early as
1801 to determine the orbit of the asteroid Ceres, and he published a paper about it in 1810after the discovery of the asteroid Pallas Incidentally, it is in that same paper that Gaussianelimination using pivots is introduced More equations than unknowns arise in such problemsbecause repeated measurements are taken to minimize errors This produces an over-determinedand often inconsistent system of linear equations
The idea of the method of least squares is to determine the solution such that it minimizes the sum of the squares of the errors, namely,
$$\lVert Ax - b \rVert_2^2 \qquad (2.17)$$
Such minimizing solutions always exist, and they are given by the square n × n system
$$A^\top A x = A^\top b \qquad (2.18)$$
called the normal equations. Furthermore, when the columns of A are linearly independent, it turns out that $A^\top A$ is invertible, and so x is unique and given by
$$x = (A^\top A)^{-1} A^\top b \qquad (2.19)$$
Note that $A^\top A$ is a symmetric matrix, one of the nice features of the normal equations of a least squares problem. In fact, given any real m × n matrix A, there is always a unique $x^+$ of minimum norm that minimizes $\lVert Ax - b \rVert_2^2$, even when the columns of A are linearly dependent (theorem 13.1 in [18]). The proof also shows that x minimizes $\lVert Ax - b \rVert_2^2$ if and only if:
$$A^\top (b - Ax) = 0, \text{ i.e., } A^\top A x = A^\top b \qquad (2.20)$$
Finally, it turns out that the minimum norm least squares solution $x^+$ can be found in terms of the pseudo-inverse $A^+$ of A, which is itself obtained from the singular value decomposition of A.
2.6.2 Singular Value Decomposition and Pseudo-inverse Matrix
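The singular value decomposition (SVD) is a central matrix decomposition method in linear algebra because it can be applied to all matrices, not only to square matrices, and it always exists [14]. The SVD of a matrix $A \in \mathbb{R}^{m \times n}$ is the factorization of A into the product of three matrices:
$$A = U \Sigma V^\top \qquad (2.21)$$
where the columns of $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthonormal and the matrix $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative real entries (the singular values). Then the pseudo-inverse of A is defined as:
$$A^+ = V \Sigma^{+} U^\top \qquad (2.22)$$
where $\Sigma^{+} \in \mathbb{R}^{n \times m}$ is obtained by transposing Σ and inverting its non-zero diagonal entries; when Σ is square and invertible, $\Sigma^{+} = \Sigma^{-1}$.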
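The relationship between the normal equations and the pseudo-inverse can be checked numerically. The sketch below uses NumPy's `np.linalg.solve` and `np.linalg.svd`; the function names, the tolerance `tol` used to decide which singular values count as zero, and the small example system are assumptions for illustration.

```python
import numpy as np


def lstsq_normal_equations(A, b):
    """Solve A^T A x = A^T b; valid when the columns of A are independent."""
    return np.linalg.solve(A.T @ A, A.T @ b)


def pinv_svd(A, tol=1e-12):
    """Pseudo-inverse V Sigma^+ U^T built from the (thin) SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    nonzero = s > tol * s.max()
    s_inv[nonzero] = 1.0 / s[nonzero]  # invert only the non-zero singular values
    return Vt.T @ (s_inv[:, None] * U.T)


# Overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])
x_ne = lstsq_normal_equations(A, b)
x_pinv = pinv_svd(A) @ b
```

When the columns of A are linearly dependent, `lstsq_normal_equations` fails because $A^\top A$ is singular, whereas `pinv_svd(A) @ b` still returns the minimum norm solution.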
In recent years, the idea that objects can be represented as a set of points has been gaining popularity in the 2D object detection field, so that the detection task can be thought of as a keypoint estimation problem. CornerNet introduced this approach using paired keypoints [30]. As the name suggests, an object is represented as a pair of keypoints, the top-left corner and the bottom-right corner. A similar idea is explored in Objects as Points [67], also known as CenterNet. In this work, the authors detect the center point of a bounding box using a heat map. Other properties, such as the size of the bounding box, are predicted directly using regression. Our 3D detection framework is inspired by the center prediction idea of CenterNet. Let us discuss some essential definitions of the keypoint detection network.
2.7.1 Neural Network Layout
The CenterNet architecture comprises four modules: a fully-convolutional encoder-decoder network as the feature extractor, a heatmap head, an offset regression head, and a size head.
C is the number of keypoint types. The output stride downsamples the output prediction by a factor R. A prediction $\hat{Y}_{x,y,c} = 1$ corresponds to a detected keypoint, while $\hat{Y}_{x,y,c} = 0$ is background.
• Offset Head: To recover the discretization error caused by the output stride, the offset head predicts a local offset $\hat{O} \in \mathbb{R}^{W/R \times H/R \times 2}$ for each center point. All classes share the same offset prediction.
• Size Head: Let $(x^{(k)}_1, y^{(k)}_1, x^{(k)}_2, y^{(k)}_2)$ be the bounding box of object k with category $c_k$. Its center point lies at pk = (x
2.7.2 Loss Function
The training method and loss definitions of CenterNet are designed as follows:
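• Heatmap Head: For each ground truth keypoint $p \in \mathbb{R}^2$ of class c, a low-resolution equivalent $\tilde{p} = \lfloor p / R \rfloor$ is computed. Then, they splat all ground truth keypoints onto a heatmap $Y \in [0, 1]^{W/R \times H/R \times C}$ using a Gaussian kernel $Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$, where $\sigma_p$ is an object size-adaptive standard deviation. If two Gaussians of the same class overlap, they take the element-wise maximum. The training objective is a penalty-reduced pixel-wise logistic regression with focal loss [35]:
$$L_k = -\frac{1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}) & \text{otherwise} \end{cases} \qquad (2.23)$$
where α and β are hyper-parameters of the focal loss and N is the number of keypoints in the image.
• Offset Head: The offset is trained with an L1 loss at the keypoint locations:
$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right| \qquad (2.24)$$
where $\hat{O}$ is the predicted offset.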
• Size Head: They use an L1 loss at the center point for size regression:
$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right| \qquad (2.25)$$
where $\hat{S}$ is the predicted size.
The overall training objective is written as follows:
$$L_{det} = L_k + \gamma_{size} L_{size} + \gamma_{off} L_{off} \qquad (2.26)$$
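The ground truth construction for the heatmap and offset heads can be sketched in a few lines of NumPy. This is an illustrative reading of the description above, not CenterNet's actual code: the function names, the `[x, y, c]` array layout, and the example values are assumptions, and the standard deviation is passed in directly rather than derived from the object size.

```python
import numpy as np


def splat_keypoints(keypoints, sigmas, classes, W, H, R, C):
    """Build a (W/R, H/R, C) heatmap with one Gaussian per ground truth keypoint."""
    heat = np.zeros((W // R, H // R, C))
    xs = np.arange(W // R)[:, None]
    ys = np.arange(H // R)[None, :]
    for p, sigma, c in zip(keypoints, sigmas, classes):
        p_low = np.floor(np.asarray(p, dtype=float) / R)  # low-resolution keypoint
        g = np.exp(-((xs - p_low[0]) ** 2 + (ys - p_low[1]) ** 2) / (2.0 * sigma ** 2))
        # Overlapping Gaussians of the same class: keep the element-wise maximum.
        heat[:, :, c] = np.maximum(heat[:, :, c], g)
    return heat


def offset_target(p, R):
    """Discretization error p / R - floor(p / R) that the offset head regresses."""
    p = np.asarray(p, dtype=float)
    return p / R - np.floor(p / R)


# One keypoint at pixel (10, 6) in a 32 x 16 image with output stride R = 4.
heat = splat_keypoints([(10, 6)], [1.0], [0], W=32, H=16, R=4, C=1)
off = offset_target((10, 6), 4)
```

In this example the keypoint falls into low-resolution cell (2, 1), where the heatmap peaks at 1, and the offset target (0.5, 0.5) is exactly the sub-cell position lost by the downsampling.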
KITTI consists of 7481 training images and 7518 test images for each task. Although there are eight different classes, only the classes Car, Cyclist, and Pedestrian are evaluated in the benchmark, as only those classes have enough labeled instances for a comprehensive evaluation. The dataset's sub-folders are structured as follows:
• image_02: contains the left color camera images (png)
• label_02: contains the left color camera label files (plain text files)
• calib: contains the calibration for all four cameras (plain text file)
The label files contain 15 columns for each object:
• type: Describes the type of object: ‘Car’, ‘Van’, ‘Truck’, ‘Pedestrian’, ‘Person sitting’, ‘Cyclist’, ‘Tram’, ‘Misc’ or ‘DontCare’
• truncated: Float from 0 (non-truncated) to 1 (truncated), where truncated refers to the object leaving image boundaries
• occluded: Integer (0, 1, 2, 3) indicating occlusion state: 0 means fully visible, 1 means partly occluded, 2 means largely occluded, and 3 means unknown
• α: Observation angle of object, ranging in [−π, π]
• bounding box: 2D bounding box of object in the image (0-based index): contains left,top, right, bottom pixel coordinates
• dimensions: 3D object dimensions: height, width, length (in meters)
• location: 3D object location x,y,z in camera coordinates (in meters)
Figure 3.1: Examples from the KITTI dataset (from left color camera)[19]
• ry: Rotation $r_y$ around the Y-axis in camera coordinates, ranging in [−π, π]
• score (only for results): Float, indicating confidence in detection, needed for precision/recall curves; higher is better
The coordinates in the camera coordinate system can be projected into the image by using the 3 × 4 projection matrix in the calibration folder; for the left color camera, for which the images are provided, P2 must be used [19]. The difference between $r_y$ and α is that $r_y$ is directly given in camera coordinates, while α also considers the vector from the camera center to the object center, to compute the relative orientation of the object with respect to the camera. For example, a car that is facing along the X-axis of the camera coordinate system corresponds to $r_y = 0$, no matter where it is located in the Oxz plane (bird's eye view), while α is zero only when this object is located along the Z-axis of the camera. When moving the car away from the Z-axis, the observation angle will change.
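A short sketch shows how one label line maps to the fields listed above and how a 3D location is projected with a 3 × 4 matrix. The function names, the example label line, and the matrix `P_example` are hypothetical values made up for illustration (a real P2 must be read from the calibration file). The relation `alpha = ry - arctan2(x, z)` is a commonly used conversion that is consistent with the description above; it is stated here as an assumption rather than quoted from the dataset documentation.

```python
import numpy as np


def parse_kitti_label_line(line):
    """Split one whitespace-separated label line into named fields."""
    v = line.split()
    return {
        "type": v[0],
        "truncated": float(v[1]),
        "occluded": int(v[2]),
        "alpha": float(v[3]),
        "bbox": [float(x) for x in v[4:8]],          # left, top, right, bottom
        "dimensions": [float(x) for x in v[8:11]],   # height, width, length
        "location": [float(x) for x in v[11:14]],    # x, y, z in camera coordinates
        "rotation_y": float(v[14]),
        "score": float(v[15]) if len(v) > 15 else None,  # present only in results
    }


def project_to_image(location, P):
    """Project a 3D point in camera coordinates with a 3 x 4 projection matrix."""
    u, v, w = P @ np.append(np.asarray(location, dtype=float), 1.0)
    return u / w, v / w


def alpha_from_rotation_y(rotation_y, location):
    """Observation angle from the global yaw and the viewing ray (assumed relation)."""
    x, _, z = location
    alpha = rotation_y - np.arctan2(x, z)
    return (alpha + np.pi) % (2.0 * np.pi) - np.pi  # wrap into [-pi, pi)


# Hypothetical label line and projection matrix, for illustration only.
example = "Car 0.00 0 -1.57 100.0 150.0 200.0 250.0 1.5 1.6 4.0 2.0 1.5 20.0 -1.47"
obj = parse_kitti_label_line(example)
P_example = np.array([[700.0, 0.0, 600.0, 0.0],
                      [0.0, 700.0, 180.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
u, v = project_to_image(obj["location"], P_example)
```

With this relation, an object with `rotation_y = 0` has `alpha = 0` only when x = 0, that is, when it lies on the Z-axis of the camera, which matches the behavior described in the paragraph above.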
3.2.1 Fundamental terminologies
Before diving into the various metrics used to evaluate the results of the object detection task, we briefly introduce some important related concepts. The definitions below follow [48], which illustrates the fundamental concepts of the object detection task.
3.2.1.1 Confidence score
The confidence score is the probability that an anchor box contains an object It is usuallypredicted by a classifier
Figure 3.2: Recording platform of KITTI dataset[19]
3.2.1.2 Intersection Over Union (IOU)
Intersection over Union (IoU) is a measure based on the Jaccard Index that evaluates the overlap between two bounding boxes. By applying the IoU to a ground truth bounding box and a predicted bounding box, we can tell if a detection is valid (True Positive) or not (False Positive). IoU is given by the overlapping area between the predicted bounding box $B_p$ and the ground truth bounding box $B_{gt}$ divided by the area of union between them:
$$IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \qquad (3.1)$$
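For axis-aligned 2D boxes, Equation 3.1 reduces to a few comparisons. The sketch below assumes boxes given as (left, top, right, bottom) in continuous pixel coordinates; the function name and this box convention are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (left, top, right, bottom)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 × 2 boxes shifted by one pixel in each direction overlap in a 1 × 1 region, giving an IoU of 1 / (4 + 4 − 1) = 1/7.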
3.2.1.3 True Positive, False Positive, False Negative and True Negative
Some basic concepts used by the metrics:
• True Positive (TP): A correct detection Detection with IoU ≥ threshold
• False Positive (FP): A wrong detection Detection with IoU < threshold
• False Negative (FN): A ground truth not detected
• True Negative (TN): Does not apply. It would represent a correctly rejected detection. In the object detection task there are many possible bounding boxes that should not be detected within an image. Thus, TN would be all possible bounding boxes that were correctly not detected (a very large number of boxes within an image). That is why it is not used by the metrics.
• Threshold: depending on the metric, it is usually set to 50%, 70% or 95%.
3.2.1.4 Precision
Precision is the ability of a model to identify only the relevant objects It is the percentage
of correct positive predictions and is given by:
$$Precision = \frac{TP}{TP + FP} = \frac{TP}{\text{all detections}} \qquad (3.2)$$
3.2.1.5 Recall
Recall is the ability of a model to find all the relevant cases (all ground truth boundingboxes) It is the percentage of true positive detected among all relevant ground truths and isgiven by:
$$Recall = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground truths}} \qquad (3.3)$$
3.2.2 Definitions of various metrics
3.2.2.1 Precision × Recall curve
The Precision × Recall curve is a good way to evaluate the performance of an object detector
as the confidence is changed by plotting a curve for each object class An object detector of aparticular class is considered good if its precision stays high as recall increases, which meansthat if you vary the confidence threshold, the precision and recall will still be high Anotherway to identify a good object detector is to look for a detector that can identify only relevantobjects (low False Positives means high precision), finding all ground truth objects (low FalseNegatives means high recall)
A poor object detector needs to increase the number of detected objects (increasing False Positives means lower precision) in order to retrieve all ground truth objects (high recall). In consequence, the Precision × Recall curve normally starts with high precision values, decreasing as recall increases.
• 11-point interpolation
The 11-point interpolation tries to summarize the shape of the Precision × Recall curve by averaging the precision at a set of eleven equally spaced recall levels [0, 0.1, 0.2, ..., 1]:
$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} \rho_{interp}(r)$$
with
$$\rho_{interp}(r) = \max_{\tilde{r} : \tilde{r} \ge r} \rho(\tilde{r})$$
where $\rho(\tilde{r})$ is the measured precision at recall $\tilde{r}$.
Instead of using the precision observed at each point, the AP is obtained by interpolating the precision only at the 11 levels r, taking the maximum precision whose recall value is greater than or equal to r.
• Interpolating all points
Instead of interpolating only at the 11 equally spaced points, we could interpolate through all points n in such a way that:
$$\sum_{n=0} (r_{n+1} - r_n)\, \rho_{interp}(r_{n+1}), \quad \text{with} \quad \rho_{interp}(r_{n+1}) = \max_{\tilde{r} : \tilde{r} \ge r_{n+1}} \rho(\tilde{r})$$
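The metrics above can be sketched end to end: detections are sorted by confidence, precision and recall are accumulated, and the two interpolation schemes are applied. The function names and the toy detections are assumptions; each detection is assumed to be already marked as TP or FP by an IoU test against the ground truth.

```python
import numpy as np


def precision_recall_curve(scores, is_tp, num_gt):
    """Accumulate precision and recall over detections sorted by confidence."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=bool)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(~tp)
    return cum_tp / (cum_tp + cum_fp), cum_tp / num_gt


def ap_11_point(precision, recall):
    """Average of the interpolated precision at recall levels 0, 0.1, ..., 1."""
    total = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r - 1e-9  # small tolerance against rounding of r
        total += precision[mask].max() if mask.any() else 0.0
    return total / 11.0


def ap_all_points(precision, recall):
    """Area under the interpolated precision curve using every recall change."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])  # maximum precision at any higher recall
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


# Toy example: 4 detections, 2 ground truth objects.
prec, rec = precision_recall_curve([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
ap11 = ap_11_point(prec, rec)
ap_all = ap_all_points(prec, rec)
```

The two schemes generally give slightly different values for the same curve, so reported AP numbers are only comparable when the same interpolation is used.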
4.1.1 LiDAR-based 3D object detection
4.1.1.1 PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
PointNet [50] is the first model to consume an unordered list of points for classification and semantic segmentation. PointNet processes each point individually and identically, and then feeds the results to a symmetric function. PointNet directly takes an unordered set of points and produces m × n scores, one for each of the n points and m semantic categories. This semantic segmentation is then post-processed with connected component analysis and used for 3D object detection. The overall architecture of PointNet is illustrated in Fig 4.1.
4.1.1.2 F-PointNet: Frustum PointNets for 3D Object Detection from RGB-D Data
Frustum PointNets [51] is a 3D object detection framework based on RGB-D data. It leverages both 2D features and the 3D point cloud by creating a 2D proposal bounding box using a 2D detector and then extracting the corresponding 3D bounding frustum from the point cloud data. The network takes the frustum point cloud as input and performs a 3D instance segmentation task. Finally, the segmented object points are processed by a box regression PointNet together with a preprocessing transformer network to estimate the object's amodal oriented 3D bounding box.
4.1.1.3 PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud
PointRCNN [58] directly generates high-quality 3D proposals from point clouds and leverages ROI-pooling to extract features for box refinement. In the first stage, the network performs semantic segmentation into foreground and background. This result is fed into the bounding box proposal head. Each point in the foreground is responsible for generating one bounding box proposal. The bounding boxes are put through Non-maximum Suppression before being fed to the second stage. The second stage of PointRCNN performs canonical 3D box refinement. The 3D bounding box refinement takes advantage of the box proposals and refines the box coordinates with robust bin-based losses.
Figure 4.1: Network architecture of PointNet[50]
4.1.1.4 VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
VoxelNet [68] is the pioneering work in the voxel-based framework for 3D object detection.This work divides a point cloud into equally spaced 3D voxels and transforms a group of pointswithin each voxel into a unified feature representation The high-level feature is extracted frompoint voxels by 3D convolution Then, it converts the voxel features to dense 4D feature maps.Finally, the Region Proposal Network takes dense feature maps as input to perform classificationand bounding box regression tasks
4.1.1.5 PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
PV-RCNN [57] deeply integrates a 3D voxel convolutional neural network (CNN) and PointNet-based set abstraction to learn more discriminative point cloud features. It takes advantage of the high-quality proposals of the 3D voxel CNN and the flexible receptive field of PointNet-based networks. As described in Fig 4.2, this method consists of two principal steps. First, feature encoding and proposal generation with the 3D voxel convolutional neural network are performed in parallel with voxel-to-keypoint encoding. Then, the keypoint features are aggregated to the RoI-grid points to refine the proposals and predict the object confidence.
4.1.2 Monocular 3D object detection using representation transformation
To remedy the lack of depth information, there is a line of work that investigates how to represent the monocular image. Common approaches are generating point clouds based on the depth estimation of RGB images and converting perspective images to bird's-eye-view (BEV) images.
4.1.2.1 Orthographic Feature Transform for Monocular 3D Object Detection
Orthographic Feature Transform (OFT) [55] maps the 2D feature map to the bird's-eye view by an orthographic feature transformation. A ResNet-18 [24] is used to extract perspective image features. Then voxel-based features are generated by accumulating image-based features over the projected voxel area. The voxel features are then collapsed along the vertical dimension to yield orthographic ground plane features. Finally, another ResNet-like network is used to refine the BEV map.
Figure 4.2: PV-RCNN[57] proposed architecture
4.1.2.2 Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
Another perspective transformation approach is BirdGAN [59], which uses a Generative Adversarial Network (GAN) to perform the BEV transformation as an image-to-image translation task. This work generates BEV images directly from a single RGB image by designing a high fidelity GAN architecture and by carefully curating a training mechanism, which includes selecting minimally noisy data for training. There are three main modules in this pipeline. The first network is the GAN-based network; it takes the RGB image as input and outputs the 3-channel BEV image. The three channels of the BEV image, in this case, are the height, density, and intensity of the points. The second network reconstructs a 3D model using the RGB image as input. The 3D reconstruction network takes the three-channel RGB image as input and generates either the point clouds or their voxelized version as the 3D model. The generated 3D model is then used to obtain the ground estimation for constructing the 3D bounding boxes around the detected objects. Finally, it uses the default BirdNet [4] for estimating 3D bounding boxes.
4.1.2.3 Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving
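Pseudo-LiDAR [63] uses a state-of-the-art depth estimation network to generate a pseudo point cloud and then directly applies LiDAR-based 3D object detection algorithms. Instead of incorporating the depth D as multiple additional channels to the RGB images, the 3D location (x, y, z) of each pixel (u, v) in the left camera's coordinate system could be derived as follows:
$$z = D(u, v), \qquad x = \frac{(u - c_U)\, z}{f_U}, \qquad y = \frac{(v - c_V)\, z}{f_V}$$
where $(c_U, c_V)$ is the pixel location of the camera center and $f_U$, $f_V$ are the horizontal and vertical focal lengths.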
Figure 4.3: CaDDN[52] proposed architecture
4.1.2.4 Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving
The authors of Pseudo-LiDAR followed up with Pseudo-LiDAR++ [64]. They adapt the stereo network architecture and loss function to be more aligned with accurate depth estimation of faraway objects. Then, they leverage extremely sparse but cheaper LiDAR sensors to de-bias the depth estimation and correct the resulting point cloud representation.
4.1.2.5 Task-Aware Monocular Depth Estimation for 3D Object Detection
Another remarkable work on pseudo-lidar is ForeSeE [62] They claim that foreground andbackground have different depth distributions, so they estimate the foreground and backgrounddepth using separate optimization objectives and decoders This method significantly improvesthe depth estimation performance on foreground objects
4.1.2.6 CaDDN: Categorical Depth Distribution Network for Monocular 3D Object Detection
Recently, [52] introduced a new method following the representation transformation approach, called CaDDN. The general idea is to project the perspective image to a BEV representation and then perform object detection. They use a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space. However, instead of separating depth estimation from 3D detection during the training phase like previous methods, they perform end-to-end training to focus on generating an interpretable depth representation for monocular 3D object detection.
Overall, representing a monocular image as a pseudo point cloud or BEV image significantly improves the performance of monocular object detection. However, these approaches require many complex computing stages, which makes them unsuitable for real-time processing and deployment on embedded devices.
Figure 4.4: Predefined 3D anchors in M3D-RPN[6]
Figure 4.5: M3D-RPN[6] proposed architecture and the depth aware convolution
4.1.3 Monocular 3D object detection using anchor-based detector
4.1.3.1 M3D-RPN: Monocular 3D Region Proposal Network for Object Detection
M3D-RPN [6] can be considered the pioneer in exploiting the 3D anchor box for monocular 3D detection. They introduce a single-stage network with the concept of a 2D-3D anchor box (Fig 4.4) to predict 2D and 3D boxes simultaneously. The anchor box consists of parameters from both spaces, including $[h, w]_{2D}$, $z_P$, and $[h, w, l, \theta]_{3D}$. These parameters are statistically pre-computed over the dataset and serve as a strong initial prior for the 3D regression task. They propose to use row-wise convolution (depth-aware convolution), where the feature maps are divided into b bins and each bin has a separate kernel weight (Fig 4.5), to improve the spatial awareness of high-level features, as the depth is largely correlated with rows in autonomous driving scenes. Furthermore, to improve the accuracy of orientation regression, they apply an offline 3D-2D optimization process that iterates through different orientation values θ and computes the error between the projection of the 3D bounding box onto the image plane and the predicted 2D box to find the best configuration.
4.1.3.2 Learning Depth-Guided Convolutions for Monocular 3D Object Detection
[15] extends the idea of depth-aware convolution from [6] and introduces the Depth-guided Dynamic-Depthwise-Dilated local convolution network (D4LCN), which takes RGB images and an additional input, a depth map (Fig 4.6). The D4LCN module generates different convolutional kernels with different receptive fields (dilation rates) for different pixels and channels of different images. The use of the depth image helps to compensate for the limitations of 2D convolution and to narrow the gap between 2D convolutions and point cloud-based 3D operators.
Figure 4.6: D4LCN[15] proposed architecture
Figure 4.7: Ground aware convolution from [39]
4.1.3.3 Ground-aware Monocular 3D Object Detection for Autonomous Driving
Inspired by the way humans perceive objects in driving scenarios, [39] leverages ground plane priors to enhance the performance of the anchor-based detector. First, utilizing the fact that most objects of interest should be on the ground plane, they propose an anchor pre-processing method to eliminate anchors far away from the ground. Second, they introduce the ground aware convolution (Fig 4.7), which extracts geometric priors and features from the pixels beneath and then concatenates and merges them into the original feature maps to enhance the localization results. This convolution mimics how humans utilize the ground plane in depth perception.
4.1.4 Monocular 3D object detection using center-based detector
4.1.4.1 SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation
SMOKE [41] proposes a monocular 3D detector based on the CenterNet [67] architecture. Their architecture consists of two main branches: a keypoint classification branch and a 3D box regression branch. The keypoint classification branch follows a design similar to [67], such that each object is represented by one specific keypoint. Instead of identifying the center of a 2D bounding box, they choose the projected 3D center of the object on the image plane. They demonstrate that representing an object with only 3D information is more robust and noise-insensitive than a 2D-3D representation. The regression branch predicts the essential 3D properties to construct a 3D bounding box for each keypoint on the heatmap. These properties are encoded as a tuple of 8 parameters $\tau = [\delta_z, \delta_{x_c}, \delta_{y_c}, \delta_h, \delta_w, \delta_l, \sin(\alpha), \cos(\alpha)]$, where $\delta_z$ denotes the depth offset, $\delta_{x_c}$, $\delta_{y_c}$ is the discretization offset due to downsampling, $\delta_h$, $\delta_w$, $\delta_l$ denote the residual dimensions, and $\sin(\alpha)$, $\cos(\alpha)$ is the vectorial representation of the rotational angle α. These variables are learned in residual representation with the statistical priors taken from the dataset to reduce the learning interval and speed up the training task.
Figure 4.8: Network structure of SMOKE[41]
They also introduce a multi-step disentangling approach for 3D bounding box regression. This transformation separates the contribution of each parameter group to the final loss and improves training convergence and accuracy. SMOKE obtains a remarkably low inference time thanks to its simple architecture and direct 3D box regression approach.
4.1.4.2 RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
RTM3D [34] also uses a CenterNet-like structure to regress the object's 2D location and 3D properties, including depth, dimension, and orientation. Furthermore, they detect the 2D projection of the 8 cuboid vertices and the cuboid center (Fig 4.9) for every object. Instead of directly reconstructing the object's 3D pose like [41], they propose an offline optimization method that utilizes the 2D-3D geometric constraints. The 3D bounding box is refined by minimizing the error between the 2D projection of the regressed 3D box and the nine predicted keypoints. They formulate it as a nonlinear least squares optimization problem and solve it via the Gauss-Newton or Levenberg-Marquardt algorithm.
4.1.4.3 KM3D-Net: Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
KM3D-Net [33], from the same authors as RTM3D [34], proposes a single-shot, anchor-free, and keypoint-based framework. They replace the offline optimization step in RTM3D with a differentiable 3D geometry reasoning module. It helps to reduce the running time while maintaining the consistency of model outputs in an end-to-end fashion (Fig 4.10). It also claims that the number of keypoints could be reduced in the training process while preserving the model's accuracy. Its ablation study shows that the performance is reasonable with only two keypoints, and that there is no apparent improvement beyond four keypoints.
Figure 4.9: Proposed architecture of RTM3D[34]
Figure 4.10: KM3D-Net[33] proposed architecture
4.1.5 Summary
In summary, monocular 3D object detection is an ill-posed problem due to the lack of depthinformation Therefore, directly regressing properties of 3D bounding boxes is such a challenge.One solution is to utilize virtual keypoints based on 2D bounding boxes and leverage geometricconstraints on 2D-3D relationship to optimize the predictions Another approach uses monoc-ular depth estimation to generate additional input depth maps to narrow the representation gapbetween 3D point clouds and 2D images
Chapter 5
GAC3D
The proposed framework aims to improve the accuracy of the 3D object detection task fromthe monocular image By introducing the depth adaptive convolution layer, we can leverage theprediction result of detection heads We also present a new module that utilizes the object’spseudo-position inference for enhancing the 3D bounding box regression results To demon-strate this idea, we describe the details of our geometric ground-guide module (GGGM), whichinfers the final location, orientation, and the 3D bounding boxes This module utilizes the inter-mediate output of the detection network and the pseudo-position value to recover the object’s3D pose via 2D-3D geometric transformation
This section clarifies every critical component of our monocular 3D object detection framework given in Fig 5.1:
• We provide an insight into the base detection network's architecture
• We introduce the concept of depth adaptive convolution and how we leverage it to improve the performance of the base detection network
• We detail the mechanism and function of our 2D-3D transformation module
• We present an end-to-end learning strategy that significantly improves the execution efficiency while keeping the accuracy of our framework unchanged
In our framework, the base detection network aims to estimate the projected location onthe 2D image plane of 3D keypoints, the observation angle, the values of the object’s dimension(height, length, width), and the confidence score of a detected object All of the predicted valuesfrom the base detection network are later employed in the 2D-3D transformation module toretrieve the final object’s 3D pose
5.1.1 Overview architecture
Fig 5.2 illustrates the overview of the base detection network. Our detection network follows the idea of the CenterNet [67] architecture, which consists of a backbone for feature extraction followed by multiple detection heads. The detection network takes a monocular RGB image $I \in \mathbb{R}^{H \times W \times 3}$, where H and W are the image height and width, to produce the intermediate prediction