Doctoral Dissertation: 3-D Object Detection and Recognition Assisting Visually Impaired People in Daily Activities
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

LE VAN HUNG

3-D OBJECT DETECTIONS AND RECOGNITIONS: ASSISTING VISUALLY IMPAIRED PEOPLE

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

SUPERVISORS:
Dr. Vu Hai
Assoc. Prof. Dr. Nguyen Thi Thuy

Hanoi - 2018

DECLARATION OF AUTHORSHIP

I, Le Van Hung, declare that this dissertation titled "3-D Object Detections and Recognitions: Assisting Visually Impaired People in Daily Activities" and the work presented in it are my own. I confirm that:

This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
Where any part of this thesis has previously been submitted for a degree or any other qualification at Hanoi University of Science and Technology or any other institution, this has been clearly stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this dissertation is entirely my own work.
I have acknowledged all main sources of help.
Where the dissertation is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, November 2018
PhD Student
Le Van Hung

SUPERVISORS
Dr. Vu Hai
Assoc. Prof. Dr. Nguyen Thi Thuy

ACKNOWLEDGEMENT

This dissertation was written during my doctoral course at the International Research Institute Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank all the people who supported me in completing this work.

First, I would like to express my sincere gratitude to my advisors, Dr. Hai Vu and Assoc. Prof. Dr. Thi Thuy Nguyen, for their continuous support, their patience, motivation, and immense knowledge. Their guidance helped me throughout the research and the writing of this dissertation. I could not imagine better advisors and mentors for my Ph.D. study.

Besides my advisors, I would like to thank Assoc. Prof. Dr. Thi-Lan Le, Assoc. Prof. Dr. Thanh-Hai Tran, and the members of the Computer Vision Department at MICA Institute. These colleagues assisted me a great deal in my research and are co-authors of the published papers. Moreover, attending scientific conferences has always been a great experience for me and a source of many useful comments.

During my Ph.D. course, I have received much support from the Management Board of MICA Institute. My sincere thanks go to Prof. Yen Ngoc Pham, Prof. Eric Castelli, and Dr. Son Viet Nguyen, who gave me the opportunity to join research works and permission to join the laboratory at MICA Institute. Without their precious support, it would have been impossible to conduct this research.

As a Ph.D. student of the 911 program, I would like to thank this programme for its financial support. I also gratefully acknowledge the financial support for attending conferences from the Nafosted-FWO project (FWO.102.2013.08) and the VLIR project (ZEIN2012RIP19). I would like to thank the College of Statistics for their support over the years, both in my work and outside of it.

Special thanks to my family, particularly to my mother and father, for all of the sacrifices that they have made on my behalf. I also would like to thank my beloved wife for everything she has supported me with.

Hanoi, November 2018
Ph.D. Student
Le Van Hung
CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES

LITERATURE REVIEW
1.1 Aided-systems for supporting visually impaired people
1.1.1 Aided-systems for navigation services
1.1.2 Aided-systems for obstacle detection
1.1.3 Aided-systems for locating the interested objects in scenes
1.1.4 Discussions
1.2 3-D object detection and recognition from point cloud data
1.2.1 Appearance-based methods
1.2.1.1 Discussion
1.2.2 Geometry-based methods
1.2.3 Datasets for 3-D object recognition
1.2.4 Discussions
1.3 Fitting primitive shapes
1.3.1 Linear fitting algorithms
1.3.2 Robust estimation algorithms
1.3.3 RANdom SAmple Consensus (RANSAC) and its variations
1.3.4 Discussions

POINT CLOUD REPRESENTATION AND THE PROPOSED METHOD FOR TABLE PLANE DETECTION
2.1 Point cloud representations
2.1.1 Capturing data by a Microsoft Kinect sensor
2.1.2 Point cloud representation
2.2 The proposed method for table plane detection
2.2.1 Introduction
2.2.2 Related work
2.2.3 The proposed method
2.2.3.1 The proposed framework
2.2.3.2 Plane segmentation
2.2.3.3 Table plane detection and extraction
2.2.4 Experimental results
2.2.4.1 Experimental setup and dataset collection
2.2.4.2 Table plane detection evaluation method
2.2.4.3 Results
2.3 Separating the interested objects on the table plane
2.3.1 Coordinate system transformation
2.3.2 Separating the table plane and the interested objects
2.3.3 Discussions

PRIMITIVE SHAPES ESTIMATION BY A NEW ROBUST ESTIMATOR USING GEOMETRICAL CONSTRAINTS
3.1 Fitting primitive shapes by GCSAC
3.1.1 Introduction
3.1.2 Related work
3.1.3 The proposed new robust estimator
3.1.3.1 Overview of the proposed robust estimator (GCSAC)
3.1.3.2 Geometrical analyses and constraints for qualifying good samples
3.1.4 Experimental results of the robust estimator
3.1.4.1 Evaluation datasets of the robust estimator
3.1.4.2 Evaluation measurements of the robust estimator
3.1.4.3 Evaluation results of the new robust estimator
3.1.5 Discussions
3.2 Fitting objects using the context and geometrical constraints
3.2.1 The proposed method of finding objects using the context and geometrical constraints
3.2.1.1 Model verification using contextual constraints
3.2.2 Experimental results of finding objects using the context and geometrical constraints
3.2.2.1 Descriptions of the datasets for evaluation
3.2.2.2 Evaluation measurements
3.2.2.3 Results of finding objects using the context and geometrical constraints
3.2.3 Discussions

DETECTION AND ESTIMATION OF A 3-D OBJECT MODEL FOR A REAL APPLICATION
4.1 A comparative study on 3-D object detection
4.1.1 Introduction
4.1.2 Related work
4.1.3 Three different approaches for 3-D object detection in a complex scene
4.1.3.1 Geometry-based Primitive Shape detection Method (PSM)
4.1.3.2 Combination of Clustering objects and Viewpoint Feature Histogram, GCSAC for estimating 3-D full object models (CVFGS)
4.1.3.3 Combination of Deep Learning and GCSAC for estimating 3-D full object models (DLGS)
4.1.4 Experiments
4.1.4.1 Data collection
4.1.4.2 Evaluation method
4.1.4.3 Setup parameters in the evaluations
4.1.4.4 Evaluation results
4.1.5 Discussions
4.2 Deploying an aided-system for visually impaired people
4.2.1 Environment and material setup for the evaluation
4.2.2 Pre-built script
4.2.3 Performances of the real system
4.2.3.1 Evaluation of finding 3-D objects
4.2.4 Evaluation of usability and discussion

CONCLUSION AND FUTURE WORKS
5.1 Conclusion
5.2 Future works

Bibliography
PUBLICATIONS
ABBREVIATIONS

API: Application Programming Interface
CNN: Convolutional Neural Network
CPU: Central Processing Unit
CVFH: Clustered Viewpoint Feature Histogram
FN: False Negative
FP: False Positive
FPFH: Fast Point Feature Histogram
fps: frames per second
GCSAC: Geometrical Constraint SAmple Consensus
GPS: Global Positioning System
GT: Ground Truth
HT: Hough Transform
ICP: Iterative Closest Point
ISS: Intrinsic Shape Signatures
JI: Jaccard Index
KDES: Kernel DEScriptors
KNN: K Nearest Neighbors
LBP: Local Binary Patterns
LMNN: Large Margin Nearest Neighbor
LMS: Least Mean of Squares
LO-RANSAC: Locally Optimized RANSAC
LRF: Local Receptive Fields
LSM: Least Squares Method
MAPSAC: Maximum A Posteriori SAmple Consensus
MLESAC: Maximum Likelihood Estimation SAmple Consensus
MS: Microsoft
MSAC: M-estimator SAmple Consensus
MSI: Modified Plessey
MSS: Minimal Sample Set
NAPSAC: N-Adjacent Points SAmple Consensus
NARF: Normal Aligned Radial Features
NN: Nearest Neighbor
NNDR: Nearest Neighbor Distance Ratio
OCR: Optical Character Recognition
OpenCV: Open Source Computer Vision Library
PC: Personal Computer
PCA: Principal Component Analysis
PCL: Point Cloud Library
PROSAC: PROgressive SAmple Consensus
QR code: Quick Response Code
RAM: Random Access Memory
RANSAC: RANdom SAmple Consensus
RFID: Radio-Frequency IDentification
R-RANSAC: Recursive RANdom SAmple Consensus
SDK: Software Development Kit
SHOT: Signature of Histograms of OrienTations
SIFT: Scale-Invariant Feature Transform
SQ: SuperQuadric
SURF: Speeded Up Robust Features
SVM: Support Vector Machine
TN: True Negative
TP: True Positive
TTS: Text To Speech
UPC: Universal Product Code
URL: Uniform Resource Locator
USAC: A Universal Framework for Random SAmple Consensus
VFH: Viewpoint Feature Histogram
VIP: Visually Impaired Person
VIPs: Visually Impaired People

LIST OF TABLES

Table 2.1 The number of frames of each scene
Table 2.2 The average result of detected table plane on our own dataset (%)
Table 2.3 The average result of detected table plane on the dataset [117] (%)
Table 2.4 The average result of detected table plane of our method with different down-sampling factors on our dataset
Table 3.1 The characteristics of the generated cylinder, sphere, and cone dataset (synthesized dataset)
Table 3.2 The average evaluation results of the synthesized datasets. The synthesized datasets were repeated 50 times for statistically representative results
Table 3.3 Experimental results on the 'second cylinder' dataset. The experiments were repeated 20 times, then the errors are averaged
Table 3.4 The average evaluation results on the 'second sphere' and 'second cone' datasets. The real datasets were repeated 20 times for statistically representative results
Table 3.5 Average results of the evaluation measurements using GCSAC and MLESAC on three datasets. The fitting procedures were repeated 50 times for statistical evaluations
Table 4.1 The average result of detecting spherical objects on two stages
Table 4.2 The average results of detecting the cylindrical objects at the first stage in both the first and second datasets
Table 4.3 The average results of detecting the cylindrical objects at the second stage in both the first and second datasets
Table 4.4 The average processing time of detecting cylindrical objects in both the first and second datasets
Table 4.5 The average results of 3-D queried objects detection

Algorithm 3.2: GCSAC's implementation for fitting a cylindrical object from the point cloud
Input: 3-D points with normal vectors Un, Unn; wt = 0.1
Output: Estimated parameters of the cylinder
Step 1: Initialization: number of iterations K = ∞
Step 2: While (k < K) {
  2.1 k++; randomly draw two points P = {p1, p2} from Un;
  2.2 Un* = ∅;
  2.3 if (Un* != ∅) estimate model Mk from P, else go to 2.1;
  2.4 Compute wk;
  2.5 if (wk ≥ wt) and (wk > wm) { search p2* by Eq. 3.5; update Un* = {p1, p2*}; wm = wk; }
  2.6 Re-estimate Mk from Un*;
  2.7 Compute Ad = ∠(γc, nt);
  2.8 if (Ad < At) compute −L, else go to 2.1;
  2.9 if (−L < Lt) { choose the best model Mb; re-compute K; } else go to 2.1;
}

The estimated model is verified by the angle constraint between the table plane and the model's axis. We compute the deviation angle Ad = ∠(γc, nt), where nt is the normal vector of the detected table plane. At each iteration, Ad is verified against the threshold At, as illustrated in Fig. 3.24.

Figure 3.22 Illustrations of a correct (a) and an incorrect (b) estimation without using the verification scheme. In each sub-figure, the left panel shows the point cloud data, the middle panel the normal vector of each point, and the right panel the estimated model.
Figure 3.23 (a) The histogram of deviation angles with the x-axis (1, 0, 0) for the real dataset in the bottom panel of Fig. 3.22; (b) the histogram of deviation angles with the x-axis (1, 0, 0) for the generated cylinder dataset in the top panel of Fig. 3.22.
Figure 3.24 Illustration of the deviation angle between the estimated cylinder's axis and the normal vector of the table plane.
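To make the model-estimation steps of Algorithm 3.2 (steps 2.3 and 2.6) concrete, the sketch below recovers a cylinder's axis and radius from the two-point minimal sample set with Eigen. It is a simplified illustration under the assumption of clean, correctly oriented normals, not the author's GCSAC implementation; the sample qualification of Eq. 3.5, the verification against At, and the MLESAC-style scoring are omitted.

```cpp
#include <Eigen/Dense>
#include <optional>

struct Cylinder {
  Eigen::Vector3d axis_point;  // a point on the cylinder axis
  Eigen::Vector3d axis_dir;    // unit axis direction (gamma_c in the text)
  double radius;
};

// p1, p2: points on the cylinder surface; n1, n2: their unit normals.
std::optional<Cylinder> estimateCylinder(const Eigen::Vector3d& p1, const Eigen::Vector3d& n1,
                                         const Eigen::Vector3d& p2, const Eigen::Vector3d& n2) {
  // Both surface normals are perpendicular to the axis, so the axis is parallel to n1 x n2.
  Eigen::Vector3d axis = n1.cross(n2);
  if (axis.norm() < 1e-9) return std::nullopt;  // degenerate sample: parallel normals
  axis.normalize();

  // Project the points and normals onto the plane orthogonal to the axis.
  auto proj = [&](const Eigen::Vector3d& v) -> Eigen::Vector3d { return v - axis * axis.dot(v); };
  Eigen::Vector3d q1 = proj(p1), q2 = proj(p2);
  Eigen::Vector3d m1 = proj(n1), m2 = proj(n2);

  // In that plane the two normal lines q1 + s*m1 and q2 + t*m2 meet on the axis;
  // solve [m1 -m2][s t]^T = q2 - q1 in the least-squares sense.
  Eigen::Matrix<double, 3, 2> A;
  A.col(0) = m1;
  A.col(1) = -m2;
  Eigen::Vector2d st = A.colPivHouseholderQr().solve(q2 - q1);

  Cylinder c;
  c.axis_dir = axis;
  c.axis_point = q1 + st(0) * m1;         // intersection point lies on the axis
  c.radius = (q1 - c.axis_point).norm();  // distance from a surface point to the axis
  return c;
}
```

The geometrical cue used here (the normals of a cylinder are orthogonal to its axis) is the same one GCSAC exploits when it searches for a better second sample p2*.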
3.2.2 Experimental results of finding objects using the context and geometrical constraints

Our framework is implemented in C++ using the PCL 1.7 library on a PC with a Core i5 processor and 8 GB of RAM. The program runs sequentially as a single thread. The performance of the proposed algorithm is evaluated in experiments for grasping cylindrical objects based on the fitting results of point clouds. We evaluated on three datasets, including a public one and our own preparations.

3.2.2.1 Descriptions of the datasets for evaluation

The first dataset is constructed from a public one used in [117]. It contains calibrated RGB-D data of 111 indoor scenes collected by a MS Kinect sensor. To adapt it to this study, only scenes that contain cylindrical structures are manually selected. Some instances are illustrated in Fig. 3.25(a), (b).

Figure 3.25 Some examples of scenes with cylindrical objects [117] collected in the first dataset.

The second dataset is published in [68]. It is captured with a MS Kinect sensor and consists of 14 scenes containing furniture (chair, coffee table, sofa, table) and a set of cylinder-like objects such as bowls, cups, coffee mugs, and soda cans. In this dataset, we only evaluated on scenes number 2, 4, 9, and 11, where the cylinder-like objects appear. Each scene has around 800 frames, and each frame contains more than one cylindrical object on the table. In this dataset, the radii of the coffee mugs, bowls, and soda cans are 3.75 cm, 5 cm, and 2.5 cm, respectively; their heights are 10 cm, 7 cm, and 10 cm. As in the example of Fig. 3.25(c), a cylinder is specified by a line connecting two selected points on the top of the interested object. It is noticed that the ground-truths of the cylindrical objects in the second and third datasets are manually prepared using a visualization tool of the PCL library. We only evaluate the estimated models whose point clouds are separable from the original scenes.

The third dataset, called 'MICA3D', is collected by ourselves in indoor environments (e.g., a cafeteria, a sharing room) where cylindrical objects (e.g., coffee cups, bottles) are placed on a table. There are six types of cylinder-like objects, as shown in Fig. 3.26. Their radii are in a range from 3.5 cm to 4.5 cm, with various heights (from 6.0 cm to 20 cm). A MS Kinect sensor is mounted on the chest of a person who moves around a table. The experimental dataset consists of several scenarios, and each scenario includes about 200 frames. In addition, we put some contaminating objects, such as boxes (10.0 cm × 30.0 cm), beside the cylindrical objects. This dataset was built following the context of a practical application in which VIPs find objects of interest.

Figure 3.26 Illustration of the six types of cylindrical objects in the third dataset.

3.2.2.2 Evaluation measurements

To evaluate the performance of the proposed method, some features of the cylindrical objects, such as the radius R and the position (or main axis direction γ), can be used. We denote the ground-truth and the estimated cylindrical objects by Ct and Ce, respectively. It is noticed that the height of a cylinder is normally calculated in an additional step; for example, it is determined by the maximal distance between two projected points in [47]. In this study, the height is set to a fixed value. For simple quantitative indexes, we used the following three evaluation measures:

Let Ea (in degrees) denote the angular difference between the estimated cylinder's axis γc and the normal vector nt of the table plane. Let Er (%) denote the relative error between the radius of the estimated cylinder (Re) and the ground-truth one (Rg):

E_r = \frac{|R_e - R_g|}{R_g} \times 100\%    (3.18)

The processing time is measured in milliseconds (ms) per cylindrical object; the smaller it is, the faster the algorithm. The proposed method is compared with the MLESAC algorithm. In these evaluations, for all datasets, the smaller the indexes (e.g., Ea, Er) are, the better the estimated models. To evaluate the role of the context's constraints (as described in Section 3.2.1.1), the quantitative indexes are measured both with and without the proposed constraints.
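Assuming the estimated and ground-truth parameters of a cylinder are available, the two quality indexes above reduce to a few lines of code. The following helper is only an illustration; the variable names and example values are not taken from the author's implementation.

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <cstdio>

// Deviation angle Ea (degrees) between the estimated axis gamma_c and the table-plane normal n_t.
double deviationAngleDeg(const Eigen::Vector3d& gamma_c, const Eigen::Vector3d& n_t) {
  // The orientation (sign) of the axis is irrelevant, so use the absolute cosine.
  double c = std::min(1.0, std::abs(gamma_c.normalized().dot(n_t.normalized())));
  return std::acos(c) * 180.0 / M_PI;
}

// Relative radius error Er in percent, following Eq. (3.18).
double relativeRadiusError(double Re, double Rg) {
  return std::abs(Re - Rg) / Rg * 100.0;
}

int main() {
  Eigen::Vector3d gamma_c(0.05, 0.99, 0.02);  // estimated cylinder axis (illustrative values)
  Eigen::Vector3d n_t(0.0, 1.0, 0.0);         // table-plane normal
  std::printf("Ea = %.2f deg, Er = %.2f %%\n",
              deviationAngleDeg(gamma_c, n_t), relativeRadiusError(0.042, 0.0375));
  return 0;
}
```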
For setting the parameters, we fixed the thresholds of the estimators at T = 0.01 (i.e., 1 cm), wt = 0.1, and Ad = 20 (or 70 < At < 110) degrees when fitting a cylinder. The threshold T was determined based on the method of Hartley et al. [55] on the dataset captured from the environment. We also experimented with several Ad thresholds and found that Ad = 20 degrees gives the best results. If Ad is about 0° to 20°, the estimated primitive is standing upright on the table; otherwise, it is lying inclined on the table. For cones lying inclined on the table, the opening angle is used for evaluating the estimated cone. T is a distance threshold that decides whether a data point is an inlier or an outlier. For fair evaluations, T is set equally for both fitting methods.
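The role of T can be stated compactly: a point is counted as an inlier of a candidate cylinder when its distance to the estimated surface is below T. The small sketch below illustrates this test; the function names are illustrative, not the author's code.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <cstddef>
#include <vector>

// Distance from a point to the cylinder surface: |dist(point, axis) - radius|.
double distanceToCylinder(const Eigen::Vector3d& p, const Eigen::Vector3d& axis_point,
                          const Eigen::Vector3d& axis_dir, double radius) {
  Eigen::Vector3d d = p - axis_point;
  double dist_to_axis = (d - axis_dir * axis_dir.dot(d)).norm();
  return std::abs(dist_to_axis - radius);
}

// Count the inliers of a candidate model for a given distance threshold T (e.g. 0.01 m).
std::size_t countInliers(const std::vector<Eigen::Vector3d>& cloud,
                         const Eigen::Vector3d& axis_point, const Eigen::Vector3d& axis_dir,
                         double radius, double T) {
  std::size_t inliers = 0;
  for (const Eigen::Vector3d& p : cloud)
    if (distanceToCylinder(p, axis_point, axis_dir, radius) < T) ++inliers;
  return inliers;
}
```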
3.2.2.3 Results of finding objects using the context and geometrical constraints

It is noticed that the table planes in the scenes are detected in a pre-processing step. Figure 3.27 illustrates a result of table plane detection, in which the table plane is marked in green points (Fig. 3.27(b)). The point cloud data above the table plane are retained for further fitting, as shown in Fig. 3.27(c).

Figure 3.27 Result of the table plane detection in a pre-processing step using the methods in our previous publication and [33]. (a) RGB image of the current scene; (b) the detected table plane marked in green points; (c) the point clouds above the table plane, located and marked in red.

Figure 3.28 shows some fitting results from the second and third datasets. For the comparative evaluations, Table 3.5 compares the performances of the proposed method (GCSAC) and MLESAC. In this table, Ea and Er are averaged over all fitting results from the three datasets. Compared with MLESAC, the objects fitted by the GCSAC algorithm are more accurate. The largest differences between GCSAC and MLESAC can be observed in the fitting results for the first and second datasets: while MLESAC always obtains angle deviations Ea of 45° to 47°, GCSAC yields an Ea that is lower by about 10° for the first dataset and about 2° for the third dataset. The computational times of GCSAC and MLESAC are also clearly different. However, Ea and Er still show large errors, even for the fitting results obtained with GCSAC. This is illustrated in Fig. 3.29: the radii of the blue (Fig. 3.29(a)) and green (Fig. 3.29(b)) objects are much larger than the ground-truth data.

Table 3.5 Average results of the evaluation measurements using GCSAC and MLESAC on the three datasets (without the context's constraint). The fitting procedures were repeated 50 times for statistical evaluations.
Dataset          Method   Ea (deg.)   Er (%)   Time (ms)
First dataset    MLESAC   46.47       92.85    18.10
First dataset    GCSAC    36.17       81.01    13.51
Second dataset   MLESAC   47.56       50.78    25.89
Second dataset   GCSAC    40.68       38.29    18.38
Third dataset    MLESAC   45.32       48.48    22.75
Third dataset    GCSAC    43.06       46.90    17.14

Figure 3.28 (a) Results of estimating the cylindrical objects in the 'MICA3D' dataset; (b) results of estimating the cylindrical objects in the dataset of [68]. In these scenes there is more than one cylindrical object; they are marked in red, green, blue, yellow, and so on. The estimated cylinders include the radius, the position (the center of the cylinder), and the main axis direction. The height can be computed using a normalization of the y-values of the estimated object.

Figure 3.29 (a) The green estimated cylindrical object has a relative radius error Er = 111.08%; (b) the blue estimated cylindrical object has a relative radius error Er = 165.92%.

It is noticed that the evaluation results reported in Table 3.5 come from implementations in which GCSAC is deployed without using the context's constraints to verify the estimated model. The effectiveness of the context's constraints is shown in Fig. 3.30: by using the context's constraints, estimated objects with a large angle error can be eliminated. This verification step also suggests a solution to the problem of estimating the inlier threshold T, which is a common issue of RANSAC-based algorithms.

Figure 3.30 Angle errors Ea of the fitting results using GCSAC with and without the context's constraint.

For the third dataset, we not only evaluate the fitting quality (as shown in Tab. 3.5) but also the average processing time to locate the interested objects. The proposed system takes 1.04 s/frame. This computational cost includes collecting the RGB and depth data from a MS Kinect, table detection, object fitting, and object localization. In these procedures, we do not down-sample the data. Figure 3.31 shows snapshots from a one-minute video taken from a common scene in an indoor environment of the third dataset. This complete video and a video of scene 4 of the second dataset are available at http://mica.edu.vn/perso/Le-Van-Hung/videodemo/index.html. There are four cylindrical objects on a table, and the proposed method successfully locates them in almost all scenes. As a consequence, we can deploy the proposed method as an aided-service supporting VIPs in detecting common objects. The collected scenes and the fitting results of the third dataset are made publicly available.

Figure 3.31 Extracting the fitting results of the video on scene 1 of the first dataset.

3.2.3 Discussions

In this work, we proposed a new framework, named GCSAC, for estimating primitive shape objects in the scene. We proposed to use geometrical constraints for selecting good samples in the proposed algorithms. In addition to the geometrical constraints, contextual constraints were proposed for verifying the estimated model. In the experimental results, GCSAC is evaluated by the quality of the estimated cylinders, spheres, and cones of various sizes in different practical scenarios. It is compared with common robust estimators such as MLESAC, PROSAC, NAPSAC, and so on. The performance of the proposed robust estimator GCSAC was confirmed: it can estimate primitive shape objects from point clouds contaminated by noise and outliers. The average processing time of the proposed method is acceptable for deploying a real application. Therefore, it suggests deploying a real application as an aided-service for visually impaired/blind people.

The main results of this chapter are presented in the following publications:
1. Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2017), "Fitting Spherical Objects in 3-D Point Cloud Using the Geometrical Constraints", Journal of Science and Technology, Section in Information Technology and Communications, No. 11, 4/2018, ISSN: 1859-0209, pp. 5-17.
2. Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2018), "Acquiring Qualified Samples for RANSAC Using Geometrical Constraints", Pattern Recognition Letters, Vol. 102, ISSN: 0167-8655, pp. 58-66 (ISI).
3. Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2018), "GCSAC: Geometrical Constraint SAmple Consensus for Primitive Shapes Estimation in 3-D Point Cloud", International Journal of Computational Vision and Robotics (SCOPUS), accepted.
4. Hai Vu, Van-Hung Le, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2019), "Fitting Cylindrical Objects in 3-D Point Cloud Using the Context and Geometrical Constraints", Journal of Information Science and Engineering, ISSN: 1016-2364, Vol. 35, No. 1 (ISI).

CHAPTER 4
DETECTION AND ESTIMATION OF A 3-D OBJECT MODEL FOR A REAL APPLICATION

After separating the interested objects from the table plane, the next task is to label them and estimate the full object model from the point cloud data. This task should be addressed in the context of 3-D object labeling and fitting, because 3-D information truly expresses the object's shape in the real environment. Consequently, the location information (the object's position) and the object's description are fully available for guiding VIPs (e.g., a safe direction to grasp an object). In this chapter, we first argue for the suitable approaches for labeling or recognizing a 3-D object. The proposed GCSAC is then used to estimate the full object model, so that descriptions of the queried objects, such as their position, radius, and main direction, are provided.

The first section of this chapter is a comparative study on three different approaches for recognizing 3-D objects. It is organized as follows: Sub-section 4.1.1 introduces the main approaches for 3-D object recognition. Sub-section 4.1.2 presents related works. Sub-section 4.1.3 describes the comparative study with three main approaches: (1) using primitive shape techniques for directly finding and estimating the full 3-D object model; (2) using learned local and global 3-D feature descriptors for labeling objects; (3) using recent advances of deep learning techniques on 2-D images. The two latter methods (Method #2 and Method #3) utilize GCSAC for estimating the full object model. Sub-section 4.1.4 presents the experimental datasets and evaluation results. Finally, sub-section 4.1.5 discusses the results of the 3-D object detection techniques.

The second section of this chapter presents the complete system that deploys the above steps for supporting VIPs in querying an interested object. The proposed system is evaluated in a lab environment with several subjects and in different scenarios. The experimental results confirm that the proposed framework is a feasible solution.

4.1 A comparative study on 3-D object detection

4.1.1 Introduction

Labeling (or recognizing) 3-D objects in a complex scene is a fundamental problem in the fields of computer vision and robotics. It has been applied widely in aided-systems for VIPs. However, this task still has many challenges, especially when the scene is complex and consists of contaminated data. In addition, many objects are occluded in the practical context. In Chapter 3, to estimate the position of a coffee cup, we formulated the problem as extracting a primitive shape (e.g., a cylinder) from a 3-D point cloud collected by a MS Kinect sensor.
This formulation works well for simple cases in which there is only one object on the table. However, in real situations there are many objects placed on a table plane; moreover, they may be touching or occluding each other. In particular, they could be different objects with the same geometrical structure: a cylindrical structure could be a soda can, a coffee mug, or a bottle. In this work, we exploit three different approaches to recognize 3-D primitive objects in such complex cases through a comparative study.

To label/detect the queried objects, there are two common approaches: appearance-based and geometry-based methods [69], as given in the surveys in Sec. 1.2.2 of Chapter 1. An appearance-based method utilizes a registration technique to align a candidate to a template, utilizes matching point pairs [51], or extracts 3-D features [121]. After learning the 3-D features, the appearance-based methods apply classifiers such as Support Vector Machine (SVM), AdaBoost, or Random Forest to classify objects. Once the objects change, one needs to re-train the models or re-prepare the object templates/gallery. The performance of these approaches is not high because they are based on hand-designed features; they also require a large processing time because the calculations are implemented on the point cloud data. Recently, due to the development of computer hardware and the advantages of Convolutional Neural Networks (CNNs), the results of object detection and recognition have been significantly improved. These networks combine many features; in particular, a CNN can exploit deep features at multiple levels when training an object model.

An approach using the geometry-based method to detect simple objects is stable with respect to the object's appearance and is independent of changes in the environment. In other words, geometry-based techniques do not need to learn or re-train an object model when the system is applied to a new scene. However, the performance of this approach is not high enough when the depth data contain much noise. Therefore, we study a combination of the geometry-based and appearance-based methods to build an aided-system that utilizes both the RGB and depth data collected by a MS Kinect sensor.

In this section, we exploit the two conventional approaches above besides the proposed scheme, which takes advantage of a neural network, namely YOLO [114], [115], and a robust estimator, the GCSAC algorithm. YOLO is selected because it achieves state-of-the-art performance for object detection in RGB images. After an object is detected in the RGB image, the queried objects are projected into the point cloud data (3-D data). GCSAC is then utilized to generate the full model describing the queried objects. The proposed method is compared to two baselines: using primitive shape detection, and using hand-designed features for 3-D object recognition. The comparisons are evaluated on a self-collected dataset and on public datasets. In the evaluations, the interested objects are placed on the table plane in cluttered scenes. They have simple geometric structures (e.g., coffee mugs, jars, bottles, and soda cans are cylindrical; soccer balls are spherical).

4.1.2 Related Work

Detecting/labeling 3-D queried objects in a complex scene has been widely attempted in the fields of robotics and computer vision. Many relevant works addressing this task are listed in Section 1.2 of Chapter 1. In this research context, i.e., a cluttered scene with many occluded objects, we survey the related works focusing on: (1) segmentation-based approaches; (2) 3-D object recognition; (3) 3-D shape fitting techniques.
Regarding the first group, many works on point cloud segmentation have been proposed. As presented in two comprehensive surveys [11], [120], the techniques for segmenting a point cloud can be edge-based, region-based, attribute-based, model-based, or graph-based. The authors present and analyze the advantages and disadvantages of each point cloud segmentation technique. For example, utilizing a graph-based method, Aleksey et al. [6] propose a min-cut based segmentation technique. This method creates a graph of the point cloud with the k-nearest neighbors algorithm and then utilizes a penalty function that encourages a smooth segmentation in which the point cloud of each object is weakly connected to the background. This function is minimized using a min-cut algorithm. The technique depends on the density of the point cloud: if the point density between two objects changes little, they cannot be segmented.

Regarding the model-based method, Schnabel et al. [131] utilize the RANSAC algorithm to estimate the parameters of the interested model and define criteria for qualified samples from a 3-D point cloud. Each estimated model includes the points belonging to the object, and these points are extracted from the point cloud; at each iteration, the primitive with the maximal score is searched using the RANSAC paradigm, and a new candidate is created from the remaining point cloud after the points of each detected object have been extracted.

For the region-based method, Rusu et al. [122] use a Kd-tree structure and the k-nearest neighbors algorithm [15] to search for the neighbor points [109] of a node in the tree, together with a Euclidean distance threshold among points, for clustering regions in a point cloud. This module is already integrated in the Point Cloud Library (PCL) [107]. Qingming et al. [110] propose a point cloud segmentation algorithm based on colorimetrical similarity and spatial proximity. It contains region growing, region merging, and refinement processes. A point cloud is segmented using the Kd-tree structure and a k-nearest neighbours algorithm based on the normal vectors of points. After clustering, similar point cloud regions are merged and refined on the basis of their colorimetrical and spatial relations.

For appearance-based methods, Rusu et al. [125], [111] and Alexandre et al. [7] compute features on the point cloud that are based on the normal vectors of points, such as the Point Feature Histogram (PFH), Point Feature Histogram RGB (PFH-RGB), Fast Point Feature Histogram (FPFH), and Viewpoint Feature Histogram (VFH) [125]. These features are computed on the clustered point clouds of individual objects. The training phase calculates these features on the point cloud set of pre-prepared objects; the testing phase matches the extracted features of the queried objects against the set of features learnt in the training phase.

In recent years, along with the development of computer hardware, deep learning has become an efficient tool for object detection, recognition, and segmentation on both RGB and depth images. Faster R-CNN [116] is a combination of Fast R-CNN [50] and a Region Proposal Network (RPN). The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for the detection task. Fast R-CNN reduces the running time of the detection network, which exposes the region-proposal computation as the remaining bottleneck.
In addition, the You Only Look Once (YOLO) network achieves a balance between performance and computational cost: the accuracy of object detection on PASCAL VOC 2007 is 73.2% at 5 fps for Faster R-CNN, while YOLO [115] reaches 76.8% at 67 fps. In particular, the challenges of detecting and recognizing objects in COCO [84] and ImageNet [61] (500,000 images only for training and 200 object categories) have been addressed by YOLO.

A 3-D object recognition task using a Convolutional Neural Network (CNN) on point data always requires a large processing time. Pang et al. [105] use a CNN for 2-D detection on the image and then project the bounding boxes of the detected objects from the image to 3-D (the point cloud). However, this method requires a large training dataset and a powerful computer configuration; in particular, the training phase needs a GPU for faster computation.

Regarding the third category, 3-D shape fitting algorithms, the related works aim to estimate the full models of objects based on their point cloud data. Schnabel et al. [131] propose a method to estimate primitive shapes. To locate and describe objects, the full model of an object needs to be built based on its point cloud data. Schnabel et al. [131] and the authors of [47] use RANSAC and RANSAC variations for estimating primitive shapes from the point clouds of objects. The performance of RANSAC variations is evaluated and presented by Choi et al. [26]. Recently, we have proposed a RANSAC variation named 'Geometrical Constraint SAmple Consensus' (GCSAC), as described in Chapter 3. This algorithm focuses on choosing good samples for estimating primitive shapes and utilizes geometrical constraints for estimating each primitive shape.

4.1.3 Three different approaches for 3-D object detection in a complex scene

This evaluation study aims to answer the question of which technique is appropriate for detecting simple 3-D objects in a complex scene. Because the evaluation context consists of cluttered scenes, the comparative study also answers the question of whether clustering/segmenting an object from the point cloud data is necessary or not. Therefore, the comparative study consists of three different approaches, as follows:

1. The proposed method, named DLGS. It is a combination of deep learning (YOLO) for object detection in the RGB image and GCSAC for estimating a full object model from the corresponding point cloud of the detected object (a sketch of the 2-D-to-3-D hand-over is given after this list).
2. The second is an approach that does not require clustering the point cloud. We select the Primitive Shape detection Method (PSM) proposed by [131]. This is a state-of-the-art technique that directly estimates the model of the interested object from the point cloud.
3. The last one, named CVFGS, is a combination of clustering objects and computing 3-D features (Viewpoint Feature Histogram, VFH) to detect 3-D objects, and GCSAC to estimate the full model of the interested objects.
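As a concrete illustration of the DLGS hand-over from 2-D to 3-D, the sketch below crops the organized Kinect point cloud with a detector bounding box; the cropped cluster would then be passed to GCSAC for model fitting. The Box structure and the function name are hypothetical, and the code assumes the cloud keeps its original sensor organization.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <algorithm>
#include <cmath>

struct Box { int x, y, width, height; };  // detector output in image coordinates

pcl::PointCloud<pcl::PointXYZRGB>::Ptr
cropByBox(const pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr& cloud, const Box& box) {
  pcl::PointCloud<pcl::PointXYZRGB>::Ptr cropped(new pcl::PointCloud<pcl::PointXYZRGB>);
  // The Kinect cloud is organized, so point (u, v) corresponds to pixel (u, v).
  const int u0 = std::max(box.x, 0), v0 = std::max(box.y, 0);
  const int u1 = std::min<int>(box.x + box.width, cloud->width);
  const int v1 = std::min<int>(box.y + box.height, cloud->height);
  for (int v = v0; v < v1; ++v)
    for (int u = u0; u < u1; ++u) {
      const pcl::PointXYZRGB& p = cloud->at(u, v);
      if (std::isfinite(p.z))  // skip pixels with missing depth
        cropped->push_back(p);
    }
  return cropped;
}
```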
To conduct the comparative study, we assume a real scenario in which a VIP comes to a kitchen or sharing-room to look for common objects such as coffee mugs, jars, or balls. These are simple geometrical objects that are commonly used in daily living. In the related study [37], coffee mugs are usually located on the table and appear in many object-finding problems. In this work, we are interested in simple geometric objects placed on the table, with the challenges of occlusion and cluttered scenes. The evaluated datasets consist of objects made of porcelain or plastic. To avoid issues of missing depth data, the objects' color should not be black, because black absorbs the infra-red light emitted by the MS Kinect sensor. The objects have a geometric structure that does not change across different viewpoints of the sensor. The details of each method are presented in the next sections.

4.1.3.1 Geometry-based Primitive Shape detection Method (PSM)

We adopt the Primitive Shape Method (PSM) developed by Schnabel et al. [131]. For instance, when the VIP wants to find a spherical object, PSM is used to directly detect and estimate a sphere in the point cloud. Intuitively, using this method to separate two objects with the same geometry is very difficult (e.g., a coffee cup and a soda can both have a cylindrical structure). This method is illustrated in Fig. 4.1.

Figure 4.1 Top panel: the procedure of the PSM method. Bottom panel: the result of each step.

4.1.3.2 Combination of Clustering objects and Viewpoint Feature Histogram, GCSAC for estimating 3-D full object models (CVFGS)

This approach aims to evaluate the performance (or role) of a combination of a clustering technique and a feature-based learning technique for 3-D object detection in a complex scene. We deploy this scheme with two conventional steps: the first is a clustering technique, and the second is a 3-D feature-based learning technique. In this work, once the table plane is segmented in the collected data, as presented in Sec. 2.3.2, the point cloud data of the interested objects are clearly separated. In a cluttered scene, the point clouds of objects are segmented by a technique proposed by Rusu et al. [121]. In common situations, similar point data could be clustered by directly using a Euclidean distance [122]; however, this requires the Euclidean distance between two objects to be large enough, which is not suitable for cluttered or occluded scenes. To avoid this issue, we deploy a region growing technique that utilizes colorimetrical similarity and spatial proximity, as described by Qingming et al. [110]. The details of these techniques are as follows. First, the point cloud is segmented using the Kd-tree structure and the KNN (K-Nearest Neighbours) algorithm [15], [109]. A refinement step of the region growing algorithm is then applied based on colorimetrical similarity. This step is performed on the segmented regions: the color difference between regions is computed from the average color values of the points in each region, and if the color difference between a region and its neighboring region is less than a threshold, the two regions are merged. Figure 4.2 illustrates a result of clustering a point cloud using the method of Qingming et al. [110]. In practice, this is an efficient clustering algorithm because it is easy to implement and achieves high accuracy when the threshold t is estimated correctly; however, the number of clustered regions strongly depends on the threshold t.

Figure 4.2 A result of object clustering using the method of Qingming et al. [110]: (a) RGB image; (b) the result of object clustering projected into the image space.

To recognize 3-D objects, we compute the Viewpoint Feature Histogram (VFH) features as described in [125, 121]. VFH is a global descriptor [53], [8] for 3-D object recognition. It combines the FPFH (Fast Point Feature Histograms) [123] descriptor with the viewpoint variance. This descriptor retains scale invariance and satisfies both the speed and the discriminativeness requirements.
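Before going into the descriptor's internals, the following condensed sketch shows how the clustering and per-cluster VFH steps of CVFGS can be chained with the PCL API. It is only an outline: the parameter values are illustrative, and plain Euclidean clustering stands in for the colour-based region growing for brevity.

```cpp
#include <pcl/common/io.h>                    // pcl::copyPointCloud
#include <pcl/features/normal_3d.h>
#include <pcl/features/vfh.h>
#include <pcl/search/kdtree.h>
#include <pcl/segmentation/extract_clusters.h>
#include <vector>

void describeClusters(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr& objects_cloud,
                      std::vector<pcl::PointCloud<pcl::VFHSignature308>>& descriptors) {
  pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);

  // 1. Split the points above the table plane into object candidates.
  std::vector<pcl::PointIndices> clusters;
  pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
  ec.setClusterTolerance(0.02);   // 2 cm neighbourhood
  ec.setMinClusterSize(200);
  ec.setSearchMethod(tree);
  ec.setInputCloud(objects_cloud);
  ec.extract(clusters);

  // 2. Compute one global VFH descriptor per cluster for matching against the trained gallery.
  for (const pcl::PointIndices& idx : clusters) {
    pcl::PointCloud<pcl::PointXYZ>::Ptr cluster(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::copyPointCloud(*objects_cloud, idx, *cluster);

    pcl::search::KdTree<pcl::PointXYZ>::Ptr ctree(new pcl::search::KdTree<pcl::PointXYZ>);
    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
    pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
    ne.setInputCloud(cluster);
    ne.setSearchMethod(ctree);
    ne.setKSearch(30);
    ne.compute(*normals);

    pcl::PointCloud<pcl::VFHSignature308> vfh;
    pcl::VFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::VFHSignature308> vfh_est;
    vfh_est.setInputCloud(cluster);
    vfh_est.setInputNormals(normals);
    vfh_est.setSearchMethod(ctree);
    vfh_est.compute(vfh);
    descriptors.push_back(vfh);
  }
}
```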
A viewpoint component is computed by collecting a histogram of the angles between the central viewpoint direction and each normal vector. The FPFH of a point Pq is computed from two components: the Point Feature Histogram (PFH) [124] descriptor and the neighboring SPFH values, as defined by Eq. 4.1:

FPFH(P_q) = SPFH(P_q) + \frac{1}{k}\sum_{i=1}^{k}\frac{1}{d_k} SPFH(p_k)    (4.1)

where k is the number of neighbors of Pq, dk is the Euclidean distance from Pq to each neighbor point, and SPFH(Pq) of each point Pq is computed from the parameters α, φ, θ. These parameters describe the relationship between a pair of points (Pq, Pki), as shown in Fig. 4.3(a). Their definitions are given in Eq. 4.2:

\alpha = v \cdot n_t, \quad \phi = u \cdot \frac{(p_t - p_s)}{d}, \quad \theta = \arctan(w \cdot n_t,\; u \cdot n_t)    (4.2)

where d is the Euclidean distance between ps and pt. The relative difference of the two points ps, pt and their normal vectors ns, nt is expressed through the local frame computed by Eq. 4.3:

u = n_s, \quad v = u \times \frac{(p_t - p_s)}{\| p_t - p_s \|_2}, \quad w = u \times v    (4.3)
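For reference, the pair-wise quantities of Eqs. (4.2) and (4.3) can be computed directly as below. This is a self-contained illustration of the formulas only; PCL ships an equivalent routine (pcl::computePairFeatures) that is used inside its PFH/FPFH estimators.

```cpp
#include <Eigen/Dense>
#include <cmath>

struct PairFeature { double alpha, phi, theta, d; };

// ps, ns: source point and normal; pt, nt: target point and normal (unit normals assumed).
PairFeature computePairFeature(const Eigen::Vector3d& ps, const Eigen::Vector3d& ns,
                               const Eigen::Vector3d& pt, const Eigen::Vector3d& nt) {
  Eigen::Vector3d diff = pt - ps;
  double d = diff.norm();            // Euclidean distance between the pair

  // Local (Darboux) frame of Eq. (4.3).
  Eigen::Vector3d u = ns;
  Eigen::Vector3d v = u.cross(diff / d);
  Eigen::Vector3d w = u.cross(v);

  // Quantities of Eq. (4.2).
  PairFeature f;
  f.d = d;
  f.alpha = v.dot(nt);
  f.phi = u.dot(diff) / d;
  f.theta = std::atan2(w.dot(nt), u.dot(nt));
  return f;
}
```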