
3D object pose detection from image


DOCUMENT INFORMATION

Basic information

Title: 3D Object Pose Detection From Image
Author: Bui Viet Minh Quan
Supervisors: Dr. Nguyen Duc Dung, Dr. Pham Hoang Anh
University: Ho Chi Minh City University of Technology
Major: Computer Science
Document type: Graduate thesis
Year of publication: 2021
City: Ho Chi Minh City
Number of pages: 96
File size: 11.03 MB

Contents

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING

GRADUATE THESIS

3D OBJECT POSE DETECTION FROM IMAGE

Major: Computer Science
Council: Computer Science 02
Supervisor: Dr. Nguyen Duc Dung
Reviewer: MSc. Luu Quang Huan
Student: Bui Viet Minh Quan (1710259)

Ho Chi Minh City, 08/2021

Declaration

We hereby declare that this thesis, titled "3D Object Pose Detection From Image", and the work presented in it are our own. We confirm that:

• This work was done wholly or mainly while in candidature for a degree at this university.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this university or any other institution, this has been clearly stated.
• Where we have consulted the published work of others, this is always clearly attributed.
• Where we have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely our own work.
• We have acknowledged all main sources of help.
• Where the thesis is based on work done jointly with others, we have made clear exactly what was done by others and what we contributed ourselves.

Acknowledgments

First of all, we would like to express our greatest respect and gratitude to our supervisors, Dr. Nguyen Duc Dung and Dr. Pham Hoang Anh, for their profound trust, support, and guidance. They have spent a great deal of time helping us revise this work rigorously, given us much good advice, and strengthened our motivation when we were struggling; their enthusiasm and energy have greatly inspired our work, and we surely could not have completed this thesis without them. We also thank our beloved colleagues, friends, and all the people who have encouraged us during this stage of our lives, and HCMC University of Technology and the Faculty of Computer Science and Engineering for giving us this wonderful experience.

Abstract

Monocular 3D object detection has recently become prevalent in autonomous driving and navigation applications because it is cost-efficient and easy to embed in existing vehicles. The most challenging task in monocular vision is estimating a reliable location for each object, because RGB images carry no depth information. Many methods tackle this ill-posed problem by directly regressing the object's depth, or by taking an estimated depth map as a supplementary input to enhance the model's results; however, their performance then relies heavily on the quality of that depth map, which is biased toward its training data. In this work, we first propose depth-adaptive convolution to replace the traditional 2D convolution and handle the divergent context of the image's features, which leads to significant improvements in both training convergence and testing accuracy. Second, we propose a ground-plane model that exploits geometric constraints in the pose-estimation process. With the resulting method, named GAC3D, we achieve better detection results. We demonstrate our approach on the KITTI 3D Object Detection benchmark, where it outperforms existing monocular methods; benefiting from its simple structure, our method is also much faster than many state-of-the-art methods and enables real-time inference.
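The depth-adaptive convolution mentioned in the abstract reweights each pixel's receptive field by a depth guidance map, so that features from surfaces at very different depths do not mix. The thesis builds this on pixel-adaptive convolution (PacConv2d) [60]; the snippet below is only a minimal sketch of the idea under a simplifying assumption (a fixed Gaussian affinity on depth differences; the function name depth_adaptive_conv and the sigma parameter are illustrative, not the thesis's implementation):

```python
import torch
import torch.nn.functional as F

def depth_adaptive_conv(x, depth, weight, bias=None, sigma=1.0):
    """Convolution whose kernel is modulated per-pixel by depth affinity.

    x:      [B, C, H, W] feature map
    depth:  [B, 1, H, W] depth guidance map
    weight: [C_out, C, k, k] ordinary conv weights (k odd)
    """
    B, C, H, W = x.shape
    k = weight.shape[-1]
    pad = k // 2
    # Gather each pixel's k*k neighborhood of features and depths.
    x_unf = F.unfold(x, k, padding=pad).view(B, C, k * k, H, W)
    d_unf = F.unfold(depth, k, padding=pad).view(B, 1, k * k, H, W)
    # Gaussian affinity: neighbors at a similar depth keep their weight,
    # neighbors across a depth discontinuity are suppressed.
    affinity = torch.exp(-0.5 * ((d_unf - depth.unsqueeze(2)) / sigma) ** 2)
    # Apply the shared weights to the depth-reweighted neighborhoods.
    out = torch.einsum('bckhw,ock->bohw',
                       x_unf * affinity,
                       weight.view(weight.shape[0], C, k * k))
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    return out
```

A drop-in usage would be `out = depth_adaptive_conv(feat, est_depth, conv.weight, conv.bias)`; note that with the affinity fixed to 1 this reduces exactly to the standard convolution it replaces.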
Table of contents

1 Introduction
  1.1 Introduction
2 Background knowledge
  2.1 Input Sensors
    2.1.1 LiDAR
    2.1.2 Camera
  2.2 Camera Model
    2.2.1 The Perspective Projection Matrix
    2.2.2 The Intrinsic Parameters and the Normalized Camera
  2.3 Object Detection
    2.3.1 Feature Extraction
    2.3.2 Regional Proposal
    2.3.3 Object Classification
    2.3.4 Object Regression
  2.4 Feed-forward Neural Network
    2.4.1 Overview
    2.4.2 Activation Function
    2.4.3 Loss Function
    2.4.4 Gradient Descent
  2.5 Convolutional Neural Network
  2.6 Least Squares Problems and Singular Value Decomposition
    2.6.1 Least Squares Problems
    2.6.2 Singular Value Decomposition and Pseudo-inverse Matrix
  2.7 CenterNet: Object Detector by Keypoints
    2.7.1 Neural Network Layout
    2.7.2 Loss Function
3 Datasets and Metrics
  3.1 Datasets
  3.2 Metrics
    3.2.1 Fundamental terminologies
    3.2.2 Definitions of various metrics
4 Related Work
  4.1 Literature review
    4.1.1 LiDAR-based 3D object detection
    4.1.2 Monocular 3D object detection using representation transformation
    4.1.3 Monocular 3D object detection using anchor-based detector
    4.1.4 Monocular 3D object detection using center-based detector
    4.1.5 Summary
5 GAC3D
  5.1 Base Detection Network
    5.1.1 Overview architecture
    5.1.2 Backbone
    5.1.3 Center Head
    5.1.4 Keypoints Head
    5.1.5 Pseudo-contact Point Head
    5.1.6 Orientation Head
    5.1.7 Dimension Head
    5.1.8 3D Confidence Head
  5.2 Detection Network With Depth Adaptive Convolution
    5.2.1 Depth Adaptive Convolution
    5.2.2 Depth Adaptive Detection Head
    5.2.3 Variant Guidance For Depth Adaptive Convolution
  5.3 End-to-End Depth Guidance
  5.4 Losses
  5.5 Geometric Ground-Guide Module
    5.5.1 Object's Pseudo-position
    5.5.2 2D-3D Transformation
  5.6 Summary
6 GAC3D Implementation
  6.1 Image Preprocessing and Data Augmentation
  6.2 Network Implementation
7 Performance Analysis
  7.1 Quantitative Results
  7.2 Qualitative Results
  7.3 Ablation Study
    7.3.1 Accumulated impact of our proposed methods
    7.3.2 Evaluation on Depth Adaptive Convolution
    7.3.3 Evaluation on the impact of depth estimation quality on Adaptive Convolution
  7.4 Abnormal detection cases of KITTI dataset
8 Conclusion
9 More qualitative results
10 Some failure cases
11 Implementation of the detection networks

List of tables

2.1 Output layer activation functions and loss functions
6.1 Software specification of the training machine
7.1 KITTI Object Detection benchmark of our GAC3D method
7.2 KITTI Object Detection benchmark of our GAC3D-E2E-Lite method
7.3 KITTI Object Detection benchmark of our GAC3D-E2E method
7.4 Comparative results on the KITTI 3D object detection test set of the Car category
7.5 Comparative results of the Pedestrian and Cyclist on KITTI test set
7.6 Evaluation on accumulated improvement of our proposed methods on KITTI val set
7.7 Impact of Geometric Ground-Guide Module for 3D object detection on the KITTI val set
7.8 Comparisons of different depth estimation quality for 3D object detection on KITTI val set
11.1 The implementation of the detection network with ResNet-18 backbone
11.2 The implementation of the detection network with DLA-34 backbone

List of figures

1.1 An example of occluded object in driving scenario
2.1 Front-view-camera image and LiDAR 3D point cloud from a sample of KITTI dataset [20]
2.2 Image formation in a pinhole camera
2.3 The pinhole camera model
2.4 The optical axis, focal plane, and retinal plane
2.5 The coordinate system
2.6 Changing coordinate systems in the retinal plane
2.7 A neural network with hidden layer
3.1 Examples from the KITTI dataset (from left color camera) [19]
3.2 Recording platform of KITTI dataset [19]
4.1 Network architecture of PointNet [50]
4.2 PV-RCNN [57] proposed architecture
4.3 CaDDN [52] proposed architecture
4.4 Predefined 3D anchors in M3D-RPN [6]
4.5 M3D-RPN [6] proposed architecture and the depth aware convolution
4.6 D4LCN [15] proposed architecture
4.7 Ground aware convolution from [39]
4.8 Network structure of SMOKE [41]
4.9 Proposed architecture of RTM3D [34]
4.10 KM3D-Net [33] proposed architecture
5.1 Overview of the proposed network
5.2 Overview of the baseline network
5.3 The layers configuration of a residual block
5.4 The ground truth heatmap of 2D bounding box's center
5.5 Illustration of keypoints regression
5.6 The projected location on 2D image plane of the pseudo-contact point
5.7 Equal egocentric angles (left) and equal allocentric angles (right)
5.8 The moving car has the same egocentric (global) orientation (moving forward) in frames while the allocentric (local) orientation with respect to the camera is changed
5.9 Egocentric (green color) and allocentric (orange color) angles in the bird's-eye-view. Red arrow indicates the heading of the car while blue arrow is the ray between the origin and the car's center
5.10 The orientation decomposition. We decompose the observation angle into three components: axis classification, heading classification, and relative offset regression
5.11 The default dimension setting of KITTI dataset and the impact of visual appearance of object on dimension regression
5.12 Illustration of 3D confidence outputs. Green: prediction, red: ground truth. The further car with less accurate 3D pose estimation yields lower 3D confidence score
5.13 Depth Adaptive Convolution Detection Head
5.14 Illustration of depth estimations guiding the depth adaptive convolution
5.15 Depth Adaptive Convolution Detection Head With Variant Guidance
5.16 The architecture of our detection network with end-to-end depth estimation
5.17 Illustration of ground truth (red color) and predicted (blue color) 3D bounding box in bird's-eye-view
5.18 The depth of the ground plane generated from the extrinsic information of camera
5.19 Object's pseudo-position P and related terms in the inference process using camera model and ground plane model
6.1 Color jittering augmentation
6.2 Horizontal flipping augmentation
6.3 Scaling and shifting augmentation
7.1 Illustration of unlabeled cases in KITTI val set
7.2 Visualization of the objects' 2D centers heatmap from the center head of the detection network. Left: the monocular image from KITTI val split. Right: the corresponding detected heatmap of objects' 2D centers for class Car
7.3 Visualization of the depth branch's output. Left: the monocular image from KITTI val split. Right: the corresponding end-to-end depth estimation from the depth branch
7.4 Visualization of the 2D projected objects' keypoints prediction. Top: the cropped region of cars from monocular image. Bottom: the nine projected keypoints estimated from the keypoints head
7.5 Detailed visualization of our 3D detection result
7.6 Qualitative illustration of our monocular 3D detection results (left: val set, right: test set). Green color: our predictions, red color: ground truth, dot: projected 3D center, diagonal cross: heading of object
7.7 Illustration of multi-class detection on KITTI val set. Green: Car, Yellow: Cyclist, Purple: Pedestrian
7.8 Visualization of the impact of pseudo-position for refining object's position. Red: ground-truth z-position, green: predicted z-position
7.9 Visualization of the impact of pseudo-position for refining object's position. Red: ground-truth z-position, green: predicted z-position
7.10 Trajectories of the optimization process for each detection head with standard convolution and depth adaptive convolution operation
7.11 Abnormal detection cases from KITTI val set. Left: ground truth labels, right: our predictions
9.1 Visualization of the detection result for image '000422' of KITTI val set
9.2 Visualization of the detection result for image '000527' of KITTI val set
10.1 Failure case: objects in congested situation
10.2 Failure case: very distant object
10.3 Failure case: highly occluded and truncated objects

Conclusion

• Investigating the stereo vision and adapting the framework to run on both monocular and binocular images

Chapter 9: More qualitative results

Figure 9.1: Visualization of the detection result for image '000422' of KITTI val set
Figure 9.2: Visualization of the detection result for image '000527' of KITTI val set

Chapter 10: Some failure cases

Figure 10.1: Failure case: objects in congested situation
Figure 10.2: Failure case: very distant object
Figure 10.3: Failure case: highly occluded and truncated objects
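Sections 5.5.1-5.5.2 and Figures 5.18-5.19 above describe recovering an object's pseudo-position by intersecting the viewing ray through the predicted pseudo-contact point with a ground-plane model. The sketch below shows only the textbook version of that geometry, assuming an ideal pinhole camera with zero pitch over flat ground; the function name and arguments are illustrative, not the thesis's API (KITTI's camera sits roughly 1.65 m above the road, and the thesis derives the plane from the camera extrinsics rather than assuming zero pitch):

```python
import numpy as np

def pseudo_position_from_contact(u, v, K, cam_height=1.65):
    """Back-project a pseudo-contact pixel (u, v) onto the ground plane.

    K is the 3x3 camera intrinsic matrix. In the KITTI camera frame the
    y-axis points down, so a flat ground is the plane y = cam_height.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # The viewing ray through (u, v) meets y = cam_height at depth z:
    # v = fy * y / z + cy  =>  z = fy * cam_height / (v - cy)
    assert v > cy, "contact point must lie below the horizon"
    z = fy * cam_height / (v - cy)
    x = z * (u - cx) / fx
    return np.array([x, cam_height, z])
```

The appeal of this constraint is that depth falls out of the known camera height instead of being regressed directly, which is exactly the ill-posed quantity in monocular detection.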
Chapter 11: Implementation of the detection networks

Table 11.1: The implementation of the detection network with ResNet-18 backbone. The table lists, for each of the 94 layers (0_conv1, kernel [3, 64, 7, 7], through 93_prob.Conv2d_2), its kernel shape, output shape, and parameter count: the ResNet-18 backbone (conv1 through layer4), three ConvTranspose2d upsampling stages, a 1x1 transfer layer, the depth_head and depth_net branches, and five detection heads (hm, hps, rot, dim, prob), each a 3x3 PacConv2d (128 to 64 channels) followed by ReLU and a 1x1 Conv2d producing 3, 20, 6, 3, and 1 output channels respectively, all at an output resolution of 96x320.

Table 11.2: The implementation of the detection network with DLA-34 backbone, in the same layer / kernel shape / output shape / parameter-count format.
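Read column-wise, the five detection heads at the bottom of each table share one layout: a 3x3 guided convolution from 128 to 64 channels, a ReLU, and a 1x1 convolution to the head's output channels. A minimal PyTorch sketch of that layout, with a plain Conv2d standing in for the PacConv2d the tables list:

```python
import torch.nn as nn

def make_head(in_ch=128, mid_ch=64, out_ch=1):
    # A plain Conv2d stands in here for the guided PacConv2d of the thesis.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1),
    )

# Output channels per head, read off Tables 11.1/11.2:
# hm 3, hps 20, rot 6, dim 3, prob 1.
heads = nn.ModuleDict({name: make_head(out_ch=c)
                       for name, c in [('hm', 3), ('hps', 20),
                                       ('rot', 6), ('dim', 3), ('prob', 1)]})
```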
References

[1] Mayavi: 3D scientific data visualization and plotting in Python. https://docs.enthought.com/mayavi/mayavi/.
[2] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[3] Rami Al-Rfou et al. Theano: A Python framework for fast computation of mathematical expressions. CoRR, abs/1605.02688, 2016.
[4] J. Beltrán, C. Guindel, F. M. Moreno, D. Cruzado, F. García, and A. De La Escalera. BirdNet: A 3D Object Detection Framework from LiDAR Information. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3517–3523, 2018.
[5] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins, 2020.
[6] Garrick Brazil and Xiaoming Liu. M3D-RPN: Monocular 3D Region Proposal Network for Object Detection. pages 9286–9295, 2019.
[7] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3D object detection in monocular video. In Proceedings of the European Conference on Computer Vision, Virtual, 2020.
[8] Andrew Brock, Soham De, Samuel L. Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization, 2021.
[9] Hansheng Chen, Yuyao Huang, Wei Tian, Zhong Gao, and Lu Xiong. MonoRUn: Monocular 3D object detection by reconstruction and uncertainty propagation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[10] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G. Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
[11] Y. Chen, L. Tai, K. Sun, and M. Li. MonoPair: Monocular 3D object detection using pairwise spatial relationships. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12090–12099, 2020.
[12] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[13] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable Convolutional Networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.
[14] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020.
[15] Mingyu Ding, Y. Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping Luo. Learning Depth-Guided Convolutions for Monocular 3D Object Detection. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4306–4315, 2020.
[16] M. Everingham, L. Gool, C. K. Williams, J. Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303–338, 2009.
[17] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
[18] J. H. Gallier and J. Quaintance. Linear Algebra and Optimization with Applications to Machine Learning - Volume I: Linear Algebra for Computer Vision, Robotics, and Machine Learning. World Scientific Publishing Company Pte Limited, 2020.
[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR), 2013.
[20] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[21] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[22] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure Aware Single-Stage 3D Object Detection From Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[23] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems, 25, 2012.
[26] Jason Ku, Alex D. Pon, and Steven L. Waslander. Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11859–11868, 2019.
[27] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. GrooMeD-NMS: Grouped mathematically differentiable NMS for monocular 3D object detection. In IEEE Computer Vision and Pattern Recognition, Nashville, TN, June 2021.
[28] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR (Poster). OpenReview.net, 2017.
[29] Alex H. Lang, Sourabh Vora, H. Caesar, Lubing Zhou, J. Yang, and Oscar Beijbom. PointPillars: Fast Encoders for Object Detection From Point Clouds. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12689–12697, 2019.
[30] Hei Law and Jia Deng. CornerNet: Detecting Objects as Paired Keypoints. International Journal of Computer Vision, 128, 2020.
[31] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation, 2020.
[32] Buyu Li, Wanli Ouyang, Lu Sheng, X. Zeng, and X. Wang. GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1019–1028, 2019.
[33] Peixuan Li. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training, 2020.
[34] Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving. pages 644–660, 2020.
[35] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017.
[36] Tsung-Yi Lin, M. Maire, Serge J. Belongie, James Hays, P. Perona, D. Ramanan, Piotr Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[37] Lijie Liu, Jiwen Lu, Chunjing Xu, Qi Tian, and J. Zhou. Deep Fitting Degree Scoring Network for Monocular 3D Object Detection. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1057–1066, 2019.
[38] W. Liu, Dragomir Anguelov, D. Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and A. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[39] Y. Liu, Y. Yuan, and M. Liu. Ground-aware Monocular 3D Object Detection for Autonomous Driving. IEEE Robotics and Automation Letters, 2021.
[40] Yuxuan Liu, Lujia Wang, and Liu Ming. YOLOStereo3D: A step back to 2D for efficient stereo 3D detection. In 2021 International Conference on Robotics and Automation (ICRA). IEEE, 2021.
[41] Zechen Liu, Zizhang Wu, and R. Tóth. SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4289–4298, 2020.
[42] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, and Wanli Ouyang. Rethinking Pseudo-LiDAR Representation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[43] Xinzhu Ma, Zhihui Wang, Haojie Li, Wanli Ouyang, and Pengbo Zhang. Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6850–6859, 2019.
[44] Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, and Wanli Ouyang. Delving into localization errors for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
[45] F. Manhardt, W. Kehl, and A. Gaidon. ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2064–2073, 2019.
[46] A. Naiden, Vlad Paunescu, Gyeongmo Kim, ByeongMoon Jeon, and M. Leordeanu. Shift R-CNN: Deep Monocular 3D Object Detection With Closed-Form Geometric Constraints. 2019 IEEE International Conference on Image Processing (ICIP), pages 61–65, 2019.
[47] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. volume 9912, pages 483–499, 2016.
[48] R. Padilla, S. L. Netto, and E. A. B. da Silva. A survey on performance metrics for object-detection algorithms. In 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), pages 237–242, 2020.
[49] Adam Paszke et al. PyTorch: An imperative style, high-performance deep learning library. pages 8024–8035. Curran Associates, Inc., 2019.
[50] C. Qi, H. Su, Kaichun Mo, and L. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85, 2017.
[51] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
[52] Cody Reading, Ali Harakeh, Julia Chae, and Steven L. Waslander. Categorical Depth Distribution Network for Monocular 3D Object Detection. CVPR, 2021.
[53] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. ArXiv, abs/1804.02767, 2018.
[54] Shaoqing Ren, Kaiming He, Ross B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149, 2015.
[55] Thomas Roddick, Alex Kendall, and Roberto Cipolla. Orthographic Feature Transform for Monocular 3D Object Detection, 2018.
[56] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-propagating Errors. Nature, 323(6088):533–536, 1986.
[57] Shaoshuai Shi, Chaoxu Guo, L. Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10526–10535, 2020.
[58] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[59] Siddharth Srivastava, Frederic Jurie, and Gaurav Sharma. Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles, 2019.
[60] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz. Pixel-adaptive convolutional neural networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11158–11167, 2019.
[61] M. Tan, R. Pang, and Q. V. Le. EfficientDet: Scalable and Efficient Object Detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10778–10787, 2020.
[62] Xinlong Wang, Wei Yin, Tao Kong, Yuning Jiang, Lei Li, and Chunhua Shen. Task-Aware Monocular Depth Estimation for 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
[63] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger. Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. pages 8437–8445, 2019.
[64] Yurong You, Yan Wang, Wei-Lun Chao, Divyansh Garg, Geoff Pleiss, Bharath Hariharan, Mark Campbell, and Kilian Q. Weinberger. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving, 2020.
[65] F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep Layer Aggregation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018.
[66] Dingfu Zhou, Xibin Song, Yuchao Dai, Junbo Yin, Feixiang Lu, Jin Fang, Miao Liao, and Liangjun Zhang. IAFA: Instance-aware Feature Aggregation for 3D Object Detection from a Single Image, 2021.
[67] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as Points. CoRR, abs/1904.07850, 2019.
[68] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. CoRR, abs/1711.06396, 2017.
for 3D object detection The overall architecture of PointNet is illustrated in Fig 4.1 4.1.1.2 F-PointNet: Frustum PointNets for 3D Object Detection from RGB-D Data Frustum PointNets [51] is a 3D

