HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF ELECTRICAL AND
Hà Nội, December 22, 2022
CONTENTS
LIST OF SIGNS AND ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
CHAPTER 2 RELATED WORK
2.1 Single Sensor Detection
2.2 Multi-Sensor Detection
CHAPTER 3 SYSTEM ARCHITECTURE
CHAPTER 4 EXPERIMENT
4.1 Experiment Method
4.2 Experiment on KITTI
4.3 Experiment on HYYdata Dataset
CONCLUSIONS
LIST OF SIGNS AND ABBREVIATIONS
CNN Convolutional Neural Network
KITTI Karlsruhe Institute of Technology and Toyota Technological Institute
RGB Red, Green, Blue
LIST OF FIGURES
Figure 3.1 YOLOv3 network architecture
Figure 3.2 Late-stage fusion strategy
Figure 4.1 Precision-Recall curves
Figure 4.2 Detection results of different models at different fusion levels
LIST OF TABLES
Table 4.1 RESULTS OF DIFFERENT LAYER FUSION DETECTION
Table 4.2 RESULTS OF DIFFERENT LAYER FUSION DETECTION
ABSTRACT

Lidar and optical cameras are common sensors in the sensor layer of an autopilot system. Lidar can use depth data to obtain accurate relative distance and contour information of obstacles, and it is not easily affected by external lighting conditions. An optical camera can obtain rich object and environment semantic information through high-resolution images, and the technology is relatively mature. The two sensors are highly complementary, and previous studies show that the fusion of laser point cloud and image data can greatly improve object detection performance in outdoor environments. In this paper, a deep convolutional neural network detection model based on the layered fusion of lidar and image features is studied. We try different fusion depths in the CNN model to seek the best solution according to the detection performance. Experimental results on the KITTI dataset show that the detection accuracy of the fusion model based on YOLOv3 is 1.08 percent higher than that of the original model. Another small-scale experiment with our own self-driving platform in a local area also shows that the final fusion model can achieve better detection accuracy in real road conditions.

Index Terms—object detection; convolutional neural network; autonomous vehicle; feature fusion
CHAPTER 1 INTRODUCTION
Object detection technology is an important module in the automatic driving of vehicles [1]. It is also widely used in industrial robots, intelligent monitoring, quality monitoring and inspection, garbage sorting, and other fields. The application of this technology has important practical significance for protecting people in daily life and production. In the automatic driving system of a vehicle, the vision module plays a key role in the autonomous driving process. It actively gathers information about obstacles in the vehicle's driving environment and provides key input for decision making in the subsequent driving process. Thanks to the extensive application of deep learning, object detection algorithms have developed rapidly.
Object detection technology is commonly applied in the visual perception module of the autopilot system, which is built from a combination of different sensors [2], including lidar, cameras, and millimeter-wave radar, chosen according to different perception demands. The camera can capture abundant semantic information and clearly identify the specific types of obstacles. Lidar can acquire abundant depth and reflectivity information and can sense the specific positions of obstacles. In the sensing system, the object detection algorithm primarily relies on lidar and camera equipment. By analyzing and identifying the data captured by the sensors, information about obstacles in the surrounding environment can be obtained. Most existing object detection solutions rely on a single sensor, but because of the inherent limitations of any single sensor's operating principle, they cannot achieve highly accurate results. Considering the perception characteristics of different sensors, fusing their views of the surrounding environment compensates for the differences between them by combining the semantic information of the camera with the depth information of the lidar. Combined with deep learning algorithms, this approach can effectively improve the vehicle's awareness of surrounding obstacles on the road.
As the depth of a deep learning network increases, it can extract more and more complex features. For the same object detection algorithm, we investigate the influence of fusion at different network levels on the performance of the detection model. By proposing multiple neural network fusion strategies, we explore the feature fusion detection strategy with the best detection results. Based on the YOLO detection layer structure, five different levels of feature fusion strategies are designed and tested. The input of the model consists of up-sampled lidar point cloud data and image data. By designing different experiments, a fusion detection strategy with optimal performance is obtained without changing the object detection framework. The strategy is applied to the YOLOv3 [3], YOLOv3-Tiny, and Mobile-Yolo algorithms, and the corresponding testing experiments are carried out. The results show that the deeper the fusion level, the better the detection results. Moreover, the detection results of YOLOv3 were compared with the results published for the KITTI dataset [4], and the detection accuracy improved obviously.
The contributions of this paper are as follows:
• A detection method based on CNN layered fusion of lidar and image features is proposed, and the optimal feature fusion strategy is obtained through designed experiments, achieving the highest detection accuracy when deployed and tested on multiple networks.
• Based on the existing sensor combination, an object detection dataset covering open road scenes and an underground parking lot scene is constructed, and the proposed model is also tested on this dataset.
The rest of this paper is organized as follows. In Section II, we review related research from two aspects: single-sensor detection and multi-sensor detection methods. In Section III, we introduce the layered fusion approach used in this article, along with other system details. In Section IV, the real dataset used in this paper is described and the proposed fusion strategy is shown to improve the detection results. A brief summary is given at the end.
CHAPTER 2 RELATED WORK
2.1 Single Sensor Detection
Most existing object detection schemes are based on a single data format. MPF [5] is a multi-view-based crowd movement detection framework for capturing target populations in a fixed and narrow space. This method can use different optical image information to characterize the structural characteristics of pedestrians with different behaviors and to automatically determine the number of pedestrians in each behavior class. Shiyu et al. [6] proposed monocular 3D localization to realize fast and high-precision 3D object detection. They tried to learn the 3D features of objects from moving structures and achieved high positioning accuracy in both close and distant views of the KITTI dataset. Multispectral aggregated channel features were introduced in [7], [8] for road pedestrian detection. When tested on different pedestrian detection tasks, MPF had an average miss rate 15 percent lower than a single ACF-based approach.
Meng et al. [9] proposed a scheme for small target detection. The original image is partitioned and fed into a VGG-16 network with feature map fusion added, and by building an image pyramid all detection results are projected back onto an image of the same size as the original as far as possible. The authors also proposed a new dataset for autonomous vehicles, which modifies the target categories and labeling criteria, together with a data augmentation method. Experimental results on the new dataset show that the performance of the proposed small target detection method is greatly improved when small targets are detected in large images. The method adapts well to complex environmental conditions and is of great significance for the perception and planning of autonomous vehicles.
The DATMO network [10] mainly uses lidar to solve the problem of environmental perception. The authors propose a moving object detection and tracking algorithm based on belief theory, which is used to detect and track targets such as vehicles, bicycles, cars, and trucks.
Scheidegger et al. [2] fill the application gap of the monocular camera in the field of object detection and tracking. By developing a multi-target tracking algorithm that takes images as input, trajectories of the detected objects are generated in the world coordinate system. A trained deep neural network detects targets and estimates their distance from a single input image. Experimental results show that the algorithm can accurately track targets and correctly handle data association. Even in the case of large object overlap in the image, it is among the best algorithms on the KITTI object tracking benchmark. In addition, the average speed of the algorithm is close to 20 frames per second.
2.2 Multi-Sensor Detection
Quite a lot of work has been done on object detection using CNNs [11], but these networks basically take only RGB data as input, and some papers have studied the effect of adding depth data on detection performance [12], [13]. Methods based on more traditional part-model techniques also combine depth information [8] and compare their detection results with CNNs. Building on this work, HHA features extracted from up-sampled Velodyne lidar data were combined into several experimental CNN topologies. Gupta et al. described an R-CNN-based method that assigns horizontal disparity, height, and angle (HHA) to all image pixels, providing additional channels for use in the network. Schlosser et al. [14] explored the benefit of lidar and image fusion for pedestrian detection in the context of convolutional neural networks. In their work, the lidar point cloud is first projected into a dense depth map through an up-sampling operation, and then the HHA image (horizontal disparity, height above ground, and angle) is extracted to represent three different groups of features of the 3D environment; these three groups of features are then used as additional image channels for joint detection. The authors draw two conclusions: 1) the fusion of HHA features and RGB images is better than RGB images alone, even when the network is not fine-tuned and large datasets are used; and 2) the fusion of RGB and HHA achieves more stable results when performed at a later stage.
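To make the up-sampling step concrete, the following is a minimal Python sketch that projects lidar points onto the image plane and crudely densifies the resulting sparse depth map with a maximum filter. It assumes the points have already been transformed into the camera frame with known intrinsics K; the function name and the max-filter densification are illustrative choices, not the exact procedure of [14].

```python
import numpy as np
from scipy.ndimage import maximum_filter

def lidar_to_dense_depth(points_cam, K, img_hw, kernel=7):
    """Project lidar points (Nx3, already in the camera frame) onto the image
    plane and densify the sparse depth map with a simple maximum filter."""
    H, W = img_hw
    pts = points_cam[points_cam[:, 2] > 0.1]        # keep points in front of the camera
    uvw = (K @ pts.T).T                             # pinhole projection with 3x3 intrinsics K
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    sparse = np.zeros((H, W), dtype=np.float32)
    sparse[v[inside], u[inside]] = pts[inside, 2]   # depth in metres at hit pixels
    return maximum_filter(sparse, size=kernel)      # crude densification of the sparse map
```

Height and angle channels can be rasterized the same way, after which the three maps play the role of the HHA channels described above.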
MV3D [15] proposed a multi-view 3D network based on sensor fusion. The network takes the lidar point cloud and the RGB image as input at the same time and predicts oriented 3D bounding boxes. The sparse 3D point cloud is encoded and presented in multiple views. The network consists of two parts, one for generating 3D candidate targets and the other for fusing features from multiple views. MV3D can effectively generate 3D candidate boxes from the bird's-eye view of the point cloud. In addition, MV3D designed a deep fusion scheme to combine region-level features from multiple views and to support interaction between the intermediate layers of the different branches. Experimental results on the KITTI benchmark show that the method is around 25-30 percent more accurate than the state of the art in 3D localization and 3D detection. Compared with the best lidar-based method, it also achieves a 14.9 percent improvement in accuracy for 2D detection.

Our investigation found that perception modeling with data from multiple different sensors [16] (especially lidar and camera) can improve perception performance. However, processing the additional data not only increases the computational load but also makes the model architecture more complex. Besides, problems such as sensor calibration and data synchronization need to be considered. Sensor data differ considerably from image data, so corresponding feature extraction modules must be designed according to the data characteristics to suit the end-to-end self-learning of features and labels. In addition, there is still no unified quantitative framework for comparing data fusion strategies and methods.
CHAPTER 3 SYSTEM ARCHITECTURE
For most deep convolutional neural networks, fused laser point cloud and image data has an obvious advantage in detection performance over image data or laser point cloud data alone as the model input. For different sensor combinations, the detection results under the same network also differ. In this chapter, a road object detection dataset is constructed to discuss the influence of radar and camera on fusion detection, and the layered feature fusion detection method is presented. Based on the existing YOLOv3 network, we explore at which level of the deep network the feature maps of the laser point cloud and the image data should be fused in order to obtain the best detection performance.
Layer-based Fusion Method:
This section designs five different fusion strategies based on the YOLOv3 network, as shown in Fig. 3.2. Based on these strategies, this chapter discusses at which network level RGB data and lidar data are best fused. We divide feature extraction into three stages: low-level feature extraction for edge information (fusion strategies a and b), mid-level feature extraction for pattern features or low-level contour features (fusion strategies c and d), and high-level feature extraction for advanced features (fusion strategy e). At which stage should feature fusion be performed to obtain the optimal detection results? To answer this question, we carried out experiments based on a multi-layered fusion framework.
As shown in Fig. 3.1, the network hierarchy is divided according to the different feature map dimensions. Within the same framework, the two subnet architectures remain completely consistent. We designed a variety of network architectures to test the effect of fusing the data features at different levels.
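The following PyTorch sketch illustrates the idea of two identical subnets whose feature maps are concatenated at a configurable depth and then processed jointly. The class name, block widths, and the use of plain strided convolutions are assumptions made for illustration only, not the actual YOLOv3-based implementation evaluated in this work; a small `fuse_level` roughly corresponds to strategies a/b, a larger one to strategies c-e.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # one feature-extraction "level": strided conv + BN + LeakyReLU
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1))

class LayeredFusionBackbone(nn.Module):
    """Two identical subnets (camera image and up-sampled lidar image); their
    feature maps are concatenated at `fuse_level` and processed jointly after."""
    def __init__(self, fuse_level=2, widths=(32, 64, 128, 256)):
        super().__init__()
        rgb, lidar, joint = [], [], []
        c_rgb = c_lidar = 3                        # both inputs are 3-channel images
        for i, w in enumerate(widths):
            if i < fuse_level:                     # separate branches before fusion
                rgb.append(conv_block(c_rgb, w)); c_rgb = w
                lidar.append(conv_block(c_lidar, w)); c_lidar = w
            elif i == fuse_level:                  # first joint layer after concatenation
                joint.append(conv_block(c_rgb + c_lidar, w)); c_joint = w
            else:                                  # remaining joint layers
                joint.append(conv_block(c_joint, w)); c_joint = w
        self.rgb, self.lidar = nn.Sequential(*rgb), nn.Sequential(*lidar)
        self.joint = nn.Sequential(*joint)

    def forward(self, rgb_img, lidar_img):
        fused = torch.cat([self.rgb(rgb_img), self.lidar(lidar_img)], dim=1)
        return self.joint(fused)

# e.g. fuse after two down-sampling stages (mid-level fusion)
net = LayeredFusionBackbone(fuse_level=2)
features = net(torch.randn(1, 3, 416, 416), torch.randn(1, 3, 416, 416))
```

In this toy setup the fusion depth is a single hyperparameter, which mirrors how the five strategies differ only in where the two subnets are joined.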
As shown in Fig. 3.2, fusion strategy a directly fuses the up-sampled lidar point cloud image with the camera data at the lowest level. In this framework, the lidar point cloud of each frame is up-sampled to obtain three image maps, one for each of three quantities extracted from the lidar (depth, height, and angle). The resolution of each map is (W×H×1), where W is the width of the image, H is the height, and 1 is the number of channels. The three grayscale maps are synthesized into a three-channel color map through an image processing function; the resolution of this color map is (W×H×3), the same as the camera image of the corresponding frame. After this up-sampled synthetic map is obtained, it is concatenated with the camera image data along the channel (image depth) dimension to produce a 6-channel input.
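A minimal sketch of this channel-wise splicing is shown below, assuming the three lidar maps have already been up-sampled to the camera resolution; the function name is hypothetical.

```python
import numpy as np

def make_six_channel_input(rgb, lidar_depth, lidar_height, lidar_angle):
    """Stack the camera image (HxWx3) with the three up-sampled lidar maps
    (each HxW) along the channel axis to obtain the (HxWx6) fused input."""
    assert rgb.shape[:2] == lidar_depth.shape, "all maps must share the image resolution"
    lidar_map = np.stack([lidar_depth, lidar_height, lidar_angle], axis=-1)   # HxWx3
    return np.concatenate([rgb.astype(np.float32), lidar_map.astype(np.float32)], axis=-1)
```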