Audio and visual perceptions for mobile robot

AUDIO AND VISUAL PERCEPTIONS FOR MOBILE ROBOT

FENG GUAN (BEng, MEng)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2006

Acknowledgements

I would like to thank all the people who have helped me achieve the final outcome of this work; only a few can be mentioned here. In particular, I am deeply grateful to my main supervisor, Professor Ai Poh Loh. Because of her constant, patient and instructive guidance, I was able to achieve success, little by little, in my academic work since the start of my research in 2001. In our discussions, she always listened to my reports and thought carefully, analyzed critically and gave her feedback and ideas creatively. She inspired me to concentrate on this research work in a systematic, deep and complete manner. I also thank her for her kind consideration of a student's daily life.

I would like to express my appreciation to my co-supervisor, Professor Shuzhi Sam Ge, who provided the directions of my research work. He also gave me many opportunities to learn new things systematically, to work creatively and to gain valuable experience. Owing to his technical insight and patient training, I was able to experience the research process, to gain confidence through hard work and to enjoy what I do. He has imparted much to me through his philosophy and past experiences. For this and much more, I am grateful.

I wish also to acknowledge all the members of the Mechatronics and Automation Lab at the National University of Singapore. In particular, Dr Jin Zhang, Dr Zhuping Wang, Dr Fan Hong, Dr Zhijun Chao, Dr Xiangdong Chen, Professor Yungang Liu and Professor Yuzhen Wang shared kind and instructive discussions with me. I would also like to thank other members of this lab, such as Mr Chee Siong Tan and Dr Kok Zuea Tang, who provided the necessary support in all my experiments. Thanks to Dr Jianfeng Cheng at the Institute for Infocomm Research, who demonstrated the performance of a two-microphone system. I am also very grateful for the support provided by the final-year student, Mr Yun Kuan Lee, in the experiment on mask diffraction.

Last in sequence but not least in importance, I would like to acknowledge the National University of Singapore for providing the research scholarship and the necessary facilities for my research work.

Contents

Acknowledgements ii
Contents iv
Summary ix
List of Figures xi
List of Tables xvi

Introduction
   1.1 Motivation
   1.2 Previous Research
      1.2.1 Sound Localization Cues
      1.2.2 Smart Acoustic Sensors
      1.2.3 Microphone Arrays
      1.2.4 Multiple Sound Localization
      1.2.5 Monocular Detection
      1.2.6 Face Detection 10
   1.3 Research Aims and Objectives 11
   1.4 Research Methodologies 12
   1.5 Contributions 13
   1.6 Thesis Organization 14
Sound Localization Systems 16
   2.1 Propagation Properties of a Sound Signal 16
   2.2 ITD 18
      2.2.1 ITD Measurement 20
      2.2.2 Practical Issue Related to ITD 22
   2.3 Two Microphone System 24
      2.3.1 Localization Capability 25
   2.4 Three Microphone System 26
      2.4.1 Localization Capability 29
   2.5 Summary 33

Sound Localization Based on Mask Diffraction 35
   3.1 Introduction 35
   3.2 Mask Design 37
   3.3 Sound Source in the Far Field 39
      3.3.1 Sound Source at the Front 39
      3.3.2 Sound Source at the Back 45
   3.4 ITD and IID Derivation 46
   3.5 Process of Azimuth Estimation 51
   3.6 Sound Source in the Near Field 54
   3.7 Summary 57

3D Sound Localization Using Movable Microphone Sets 59
   4.1 Introduction 59
   4.2 Three-microphone System 60
      4.2.1 Rotation in Both Azimuth and Elevation 62
   4.3 Two-Microphone System 66
   4.4 One-microphone System 67
   4.5 Simulation Study 69
   4.6 Experiments 72
      4.6.1 Experimental Environment 73
      4.6.2 Experimental Results 74
   4.7 Continuous Multiple Sampling 78
   4.8 Summary 83

Sound Source Tracking and Motion Estimation 85
   5.1 Introduction 85
   5.2 A Distant Moving Sound Source 86
   5.3 Localization of a Nearby Source Without Camera Calibration 94
      5.3.1 System Setup 95
      5.3.2 Localization Mechanism 97
      5.3.3 Neural Network 101
   5.4 Localization of a Nearby Moving Source With Camera Calibration 103
      5.4.1 Position Estimation 105
      5.4.2 Sensitivity to Acoustic Measurements 110
      5.4.3 Velocity and Acceleration Estimation 113
   5.5 Simulation 116
   5.6 Experiments 117
      5.6.1 Experimental Setup 118
      5.6.2 Experimental Results 118
   5.7 Summary 126

Image Feature Extraction 127
   6.1 Intrinsic Structure Discovery 129
      6.1.1 Neighborhood Linear Embedding (NLE) 129
      6.1.2 Clustering 139
   6.2 Simulation Studies 142
   6.3 Summary 147

Robust Human Detection in Variable Environments 150
   7.1 Vision System 151
      7.1.1 System Description 152
      7.1.2 Geometry Relationship for Stereo Vision 153
   7.2 Stereo-based Human Detection and Identification 158
      7.2.1 Scale-adaptive Filtering 158
      7.2.2 Human Body Segmentation 163
      7.2.3 Human Verification 169
   7.3 Thermal Image Processing 175
   7.4 Human Detection by Fusion 178
      7.4.1 Extrinsic Calibration 178
   7.5 Experimental Results 183
      7.5.1 Human Detection Using Stereo Vision Alone 183
      7.5.2 Human Detection Using Both Stereo and Infrared Thermal Cameras 186
      7.5.3 Human Detection in the Presence of Human-like Objects 187
   7.6 Summary 191

Conclusions and Future Work 193
   8.1 Conclusions 193
   8.2 Future Work 196

Appendix A 198
   Calibration of Camera 198

Author's Publications 202

Bibliography 204

Summary

In this research, audio and visual perception for mobile robots are investigated, covering passive sound localization using mainly acoustic sensors and robust human detection using multiple visual sensors. Passive sound localization refers to the estimation of the motion parameters (position, velocity) of a sound source, e.g., a speaker, in 3D space using spatially distributed passive sensors such as microphones. Robust human detection relies on information from multiple visual sensors, such as stereo and thermal cameras, to detect humans in variable environments.

Since a mobile platform requires the sensor structure to be compact and small, a conflict arises between miniaturization and the estimation of higher-dimensional motion parameters in audio perception. Thus, in this research, systems with few microphones (two- and three-microphone systems) are mainly investigated in an effort to enhance their localization capabilities. Several strategies are proposed and studied, including multiple localization cues, multiple sampling and multiple sensor fusion.

Due to the mobility of a robot, the surrounding environment varies. To detect humans robustly in such a variable 3D space, we use stereo and thermal cameras. Information fusion of these two kinds of cameras makes it possible to detect humans robustly and to discriminate humans from human-like objects.
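For orientation, the primary localization cue studied in the sound localization chapters is the interaural time difference (ITD) between a pair of microphones. The following is a minimal sketch of the standard far-field relation, with d the microphone spacing, θ the source azimuth measured from broadside and c the speed of sound; the individual chapters derive the exact geometry for each sensor configuration:

```latex
% Far-field ITD cue for one microphone pair (illustrative sketch only).
\mathrm{ITD} = \frac{d \sin\theta}{c},
\qquad
\hat{\theta} = \arcsin\!\left(\frac{c \cdot \mathrm{ITD}}{d}\right),
\qquad
|\mathrm{ITD}| \le \frac{d}{c}.
```

The bound on |ITD| is the physical limit set by the microphone spacing, which is why miniaturization directly constrains localization capability.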
Furthermore, we propose an unsupervised learning algorithm, Neighborhood Linear Embedding (NLE), to extract visual features such as human faces from an image in a straightforward manner.

In summary, this research provides several practical solutions to the conflict between miniaturization and localization capability in sound localization systems, together with robust human detection methods for visual systems.

Author's Publications

[...] Intelligence, Robotics and Autonomous Systems (CIRAS 2003), Singapore, 15-18 December 2003.

S. S. Ge, A. P. Loh, and F. Guan, "Sound Localization Based on Mask Diffraction," Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1972-1977, Taipei, Taiwan, September 14-19, 2003.

F. Guan, A. P. Loh, and S. S. Ge, "3D Sound Localization Using Movable Microphone Sets," Proceedings of the Fourth International Conference on Industrial Automation, Montreal, Canada, June 9-11, 2003.

S. S. Ge, A. P. Loh, and F. Guan, "3D Sound Localization Based on Audio and Video," Proceedings of the Fourth International Conference on Control and Automation, pp. 168-172, Montreal, Canada, June 10-12, 2003.

Bibliography

[1] J. Y. Weng and K. Y. Guentchev, "Three-dimensional sound localization from a compact noncoplanar array of microphones using tree-based learning," Journal of the Acoustical Society of America, vol. 110, no. 1, pp. 310-323, 2001.

[2] Q. H. Wang, T. Ivanov, and P. Aarabi, "Acoustic robot navigation using distributed microphone arrays," Information Fusion, vol. 5, pp. 131-140, June 2004.

[3] J. Tabrikian and H. Messer, "Three-dimensional source localization in a waveguide," IEEE Transactions on Signal Processing, vol. 44, pp. 1-13, January 1996.

[4] S. M. Griebel, A Microphone Array System for Speech Source Localization, Denoising, and Dereverberation. PhD thesis, Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, April 2002.

[5] A. Mahajan and M. Walworth, "3D position sensing using the differences in the time of flights from a wave source to various receivers," IEEE Transactions on Robotics and Automation, vol. 17, no. 1, pp. 91-94, 2001.

[6] H. Wang and P. Chu, "Voice source localization for automatic camera pointing system in videoconferencing," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Munich, Germany, pp. 187-190, April 21-24, 1997.

[7] J. Vermaak, M. Gangnet, A. Blake, and P. Perez, "Sequential Monte Carlo fusion of sound and vision for speaker tracking," in IEEE International Conference on Computer Vision, vol. 1, Los Alamitos, CA, pp. 741-746, July 7-14, 2001.

[8] J. Huang, T. Supaongprapa, I. Terakura, N. Ohnishi, and N. Sugie, "Mobile robot and sound localization," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 2, Grenoble, France, pp. 683-689, September 7-11, 1997.

[9] H. Buchner and W. Kellermann, "An acoustic human-machine interface with multi-channel sound reproduction," in Proceedings of the IEEE Fourth Workshop on Multimedia Signal Processing, Cannes, France, pp. 359-364, October 3-5, 2001.

[10] J. G. Desloge, W. Rabinowitz, and P. Zurek, "Microphone-array hearing aids with binaural output - Part I: Fixed-processing systems," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 6, pp. 529-542, 1997.
[11] D. Welker, J. Greenberg, J. Desloge, and P. Zurek, "Microphone-array hearing aids with binaural output - Part II: A two-microphone adaptive system," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 6, pp. 543-551, 1997.

[12] B. G. Ferguson, L. G. Criswick, and K. W. Lo, "Locating far-field impulsive sound sources in air by triangulation," Journal of the Acoustical Society of America, vol. 111, no. 1, pp. 104-116, 2002.

[13] J. W. Strutt, "On our perception of sound direction," Philosophical Magazine, vol. 13, pp. 214-232, 1907.

[14] P. R. Cook, Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics. Cambridge, Massachusetts: The MIT Press, 1999.

[15] S. Handel, Listening: An Introduction to the Perception of Auditory Events. Cambridge, Massachusetts: The MIT Press, 1989.

[16] B. G. Shinn-Cunningham, S. Santarelli, and N. Kopco, "Tori of confusion: Binaural localization cues for sources within reach of a listener," Journal of the Acoustical Society of America, vol. 107, no. 3, pp. 1627-1636, 2000.

[17] F. Palmieri, M. Datum, A. Shah, and A. Moiseff, "Sound localization with a neural network trained with the multiple extended Kalman algorithm," in Proceedings of the International Joint Conference on Neural Networks, vol. 1, pp. 125-131, November 1991.

[18] S. Carlile, Virtual Auditory Space: Generation and Applications. Neuroscience Intelligence Unit, Austin, TX: R.G. Landes, 1996.

[19] C. J. Pu, J. Harris, and J. C. Principe, "A neuromorphic microphone for sound localization," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Computational Cybernetics and Simulation, vol. 2, Orlando, USA, pp. 1469-1474, October 1997.

[20] M. Rucci, G. M. Edelman, and J. Wray, "Adaptation of orienting behavior: From the barn owl to a robotic system," IEEE Transactions on Robotics and Automation, vol. 15, no. 1, pp. 96-110, 1999.

[21] J. Huang, N. Ohnishi, and N. Sugie, "A biomimetic system for localization and separation of multiple sound sources," IEEE Transactions on Instrumentation and Measurement, vol. 44, pp. 733-738, 1995.

[22] J. Huang, N. Ohnishi, and N. Sugie, "Sound localization in reverberant environment based on the model of the precedence effect," IEEE Transactions on Instrumentation and Measurement, vol. 46, pp. 842-846, 1997.

[23] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications. New York: Springer, 2001.

[24] B. G. Ferguson and K. W. Lo, "Passive ranging errors due to multipath distortion of deterministic transient signals with application to the localization of small arms fire," Journal of the Acoustical Society of America, vol. 111, no. 1, pp. 117-128, 2002.

[25] S. Gazor and Y. Grenier, "Criteria for positioning of sensors for a microphone array," IEEE Transactions on Speech and Audio Processing, vol. 3, pp. 294-303, July 1995.

[26] M. S. Brandstein, J. E. Adcock, and H. F. Silverman, "A closed-form location estimator for use with room environment microphone arrays," IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 45-50, January 1997.

[27] M. S. Brandstein, A Framework for Speech Source Localization Using Sensor Arrays. PhD thesis, Brown University, May 1995.

[28] V. Katkovnik and A. B. Gershman, "A local polynomial approximation based beamforming for source localization and tracking in nonstationary environments," IEEE Signal Processing Letters, vol. 7, pp. 3-5, January 2000.

[29] S. Affès, S. Gazor, and Y. Grenier, "An algorithm for multisource beamforming and multitarget tracking," IEEE Transactions on Signal Processing, vol. 44, pp. 1512-1522, June 1996.
[30] J. O. Smith and J. S. Abel, "Closed-form least-squares source location estimation from range-difference measurements," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-35, pp. 1661-1669, December 1987.

[31] Y. T. Chan and K. C. Ho, "A simple and efficient estimator for hyperbolic location," IEEE Transactions on Signal Processing, vol. 42, pp. 1905-1915, August 1994.

[32] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, pp. 276-280, March 1986.

[33] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, Massachusetts: MIT Press, 1990.

[34] A. Belouchrani, K. Abed-Meraim, and J.-F. Cardoso, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, pp. 434-444, February 1997.

[35] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 320-327, May 2000.

[36] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Transactions on Neural Networks, vol. 10, pp. 684-697, May 1999.

[37] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, vol. 34, pp. 334-352, August 2004.

[38] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 809-830, August 2000.

[39] N. Friedman and S. Russell, "Image segmentation in video sequences: A probabilistic approach," in 13th Conference on Uncertainty in Artificial Intelligence, Brown University, Rhode Island, USA, pp. 1-3, August 1-3, 1997.

[40] C. Ridder, O. Munkelt, and H. Kirchner, "Adaptive background estimation and foreground detection using Kalman filtering," in International Conference on Recent Advances in Mechatronics, Istanbul, Turkey, pp. 193-199, August 14-16, 1995.

[41] W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, "Using adaptive tracking to classify and monitor activities in a site," in IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, California, pp. 22-31, June 23-25, 1998.

[42] S. J. McKenna, S. Jabri, Z. Duric, and A. Rosenfeld, "Tracking groups of people," Computer Vision and Image Understanding, vol. 80, pp. 42-56, 2000.

[43] G. L. Foresti, L. Marcenaro, and C. S. Regazzoni, "Automatic detection and indexing of video-event shots for surveillance applications," IEEE Transactions on Multimedia, vol. 4, pp. 459-471, December 2002.

[44] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Upper Saddle River, New Jersey: Prentice Hall, 1998.

[45] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, and O. Hasegawa, "A system for video surveillance and monitoring," Tech. Rep. CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May 2000.

[46] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in 4th IEEE Workshop on Applications of Computer Vision, Princeton, New Jersey, pp. 8-14, October 19-21, 1998.
[47] J. Emmerton, "The pigeon's discrimination of movement patterns (Lissajous figures) and contour-dependent rotational invariance," Perception, vol. 15, pp. 573-588, September 1986.

[48] R. Cutler and L. S. Davis, "Robust real-time periodic motion detection, analysis and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 781-796, August 2000.

[49] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 780-785, July 1997.

[50] R. L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, "Face detection in color images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 696-706, May 2002.

[51] E. Hjelmås and B. K. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, pp. 236-274, September 2001.

[52] M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 34-58, January 2002.

[53] R. Brunelli and T. Poggio, "Face recognition: Features versus templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 1042-1052, October 1993.

[54] X. Li and N. Roeder, "Face contour extraction from front-view images," Pattern Recognition, vol. 28, pp. 1167-1179, August 1995.

[55] G. Yang and T. S. Huang, "Human face detection in a complex background," Pattern Recognition, vol. 27, pp. 53-63, January 1994.

[56] J. Yang and A. Waibel, "A real-time face tracker," in 3rd IEEE Workshop on Applications of Computer Vision, Sarasota, Florida, pp. 142-147, December 2-4, 1996.

[57] C. H. Lee, J. S. Kim, and K. H. Park, "Automatic human face location in a complex background using motion and color information," Pattern Recognition, vol. 29, pp. 1877-1889, November 1996.

[58] Y. Dai and Y. Nakano, "Face-texture model based on SGLD and its application in face detection in a color scene," Pattern Recognition, vol. 29, pp. 1007-1017, June 1996.

[59] S.-H. Jeng, H. Y. M. Liao, C. C. Han, M. Y. Chern, and Y. T. Liu, "Facial feature detection using geometrical face model: An efficient approach," Pattern Recognition, vol. 31, pp. 273-282, March 1998.

[60] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

[61] M. Propp and A. Samal, "Artificial neural network architecture for human face detection," Intelligent Engineering Systems Through Artificial Neural Networks, vol. 2, pp. 535-540, 1992.

[62] B. Moghaddam and A. Pentland, "Probabilistic visual learning for object representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 696-710, July 1997.

[63] R. Féraud, O. Bernier, and D. Collobert, "A constrained generative model applied to face detection," Neural Processing Letters, vol. 5, pp. 73-81, 1997.

[64] S.-H. Lin, S.-Y. Kung, and L.-J. Lin, "Face recognition/detection by probabilistic decision-based neural network," IEEE Transactions on Neural Networks, vol. 8, pp. 114-132, January 1997.

[65] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 130-136, June 17-19, 1997.
[66] H. Schneiderman and T. Kanade, "A statistical method for 3D object detection applied to faces and cars," in IEEE Conference on Computer Vision and Pattern Recognition, South Carolina, pp. 746-751, June 13-15, 2000.

[67] N. Kakuta, S. Yokoyama, and K. Mabuchi, "Human thermal models for evaluating infrared images," IEEE Engineering in Medicine and Biology Magazine, vol. 21, pp. 65-72, November-December 2002.

[68] N. Nandhakumar and J. K. Aggarwal, "Integrated analysis of thermal and visual images for scene interpretation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, pp. 469-480, July 1988.

[69] W. L. Chan, A. T. P. So, and L. L. Lai, "Three-dimensional thermal imaging for power equipment monitoring," IEE Proceedings - Generation, Transmission and Distribution, vol. 147, no. 6, pp. 355-360, 2000.

[70] X. P. V. Maldague, Infrared Methodology and Technology, Nondestructive Testing Monographs and Tracts. Switzerland: Gordon and Breach Science Publishers, 1994.

[71] T. Tsuji, H. Hattori, M. Watanabe, and N. Nagaoka, "Development of night-vision system," IEEE Transactions on Intelligent Transportation Systems, vol. 3, pp. 203-209, September 2002.

[72] M. M. Trivedi, S. Y. Cheng, E. M. C. Childers, and S. J. Krotosky, "Occupant posture analysis with stereo and thermal infrared video: Algorithms and experimental evaluation," IEEE Transactions on Vehicular Technology, vol. 53, pp. 1698-1712, November 2004.

[73] B. C. Arrue, A. Ollero, and J. R. M. de Dios, "An intelligent system for false alarm reduction in infrared forest-fire detection," IEEE Intelligent Systems, vol. 15, pp. 64-73, May-June 2000.

[74] J. H. A. Wright, "Time-domain analysis of broad-band refraction and diffraction," Journal of the Acoustical Society of America, vol. 46, no. 3, pp. 661-666, 1969.

[75] T. Gustafsson, B. D. Rao, and M. Trivedi, "Source localization in reverberant environments: Modeling and statistical analysis," IEEE Transactions on Speech and Audio Processing, vol. 11, pp. 791-803, November 2003.

[76] M. Tanaka and Y. Kaneda, "Performance of sound source direction estimation methods under reverberant conditions," Journal of the Acoustical Society of Japan (E), vol. 14, no. 4, pp. 291-292, 1993.

[77] B. Champagne, S. Bedard, and A. Stephenne, "Performance of time-delay estimation in the presence of room reverberation," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 2, pp. 148-152, 1996.

[78] M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower-spectrum phase based technique," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, Adelaide, South Australia, pp. II/273-II/276, April 1994.

[79] M. S. Brandstein, J. E. Adcock, and H. F. Silverman, "A practical time-delay estimator for localizing speech sources with a microphone array," Computer Speech and Language, vol. 9, no. 2, pp. 153-169, 1995.

[80] M. S. Brandstein and H. F. Silverman, "A practical methodology for speech source localization with microphone arrays," Computer Speech and Language, vol. 11, no. 2, pp. 91-126, 1997.

[81] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-24, no. 4, pp. 320-327, 1976.

[82] D. T. Blackstock, Fundamentals of Physical Acoustics. New York: Wiley, 2000.

[83] S. S. Ge, T. H. Lee, and C. J. Harris, Adaptive Neural Network Control of Robotic Manipulators. River Edge, NJ: World Scientific, 1998.
[84] S. S. Ge, C. C. Hang, T. H. Lee, and T. Zhang, Stable Adaptive Neural Network Control. Norwell, USA: Kluwer Academic, 2001.

[85] M. R. Spiegel, Mathematical Handbook of Formulas and Tables. Singapore: McGraw-Hill, 1990.

[86] J.-Y. Bouguet, Camera Calibration Toolbox for Matlab, October 2004. http://www.vision.caltech.edu/bouguetj/calib_doc/.

[87] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.

[88] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, December 2000.

[89] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, February 1990.

[90] Y. G. Zhang, C. S. Zhang, and S. J. Wang, "Clustering in knowledge embedded space," in Proceedings of the 14th European Conference on Machine Learning (ECML 2003), Lecture Notes in Computer Science, vol. 2837, Cavtat, Croatia, pp. 480-491, September 2003.

[91] Y. G. Zhang, C. S. Zhang, and D. Zhang, "Distance metric learning by knowledge embedding," Pattern Recognition, vol. 37, pp. 161-163, January 2004.

[92] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-2323, December 2000. http://isomap.stanford.edu/datasets.html.

[93] J. Kovačević and W. Sweldens, "Wavelet families of increasing order in arbitrary dimensions," IEEE Transactions on Image Processing, vol. 9, pp. 480-496, March 2000.

[94] M. Unser and T. Blu, "Wavelet theory demystified," IEEE Transactions on Signal Processing, vol. 51, pp. 470-483, February 2003.

[95] H. Choi and R. G. Baraniuk, "Multiscale image segmentation using wavelet-domain hidden Markov models," IEEE Transactions on Image Processing, vol. 10, pp. 1309-1321, September 2001.

[96] Z. Xiong, K. Ramchandran, and M. T. Orchard, "Space-frequency quantization for wavelet image coding," IEEE Transactions on Image Processing, vol. 6, pp. 677-693, May 1997.

[97] T. Aach, A. Kaup, and R. Mester, "On texture analysis: Local energy transforms versus quadrature filters," Signal Processing, vol. 45, no. 2, pp. 173-181, 1995.

[98] B.-L. Zhang, H. Zhang, and S. S. Ge, "Face recognition by applying wavelet subband representation and kernel associative memory," IEEE Transactions on Neural Networks, vol. 15, pp. 166-177, January 2004.

[99] P. Maragos, R. W. Schafer, and M. A. Butt, eds., Mathematical Morphology and Its Applications to Image and Signal Processing. Boston: Kluwer Academic, 1996.

[100] D. A. Socolinsky, L. B. Wolff, J. D. Neuheisel, and C. K. Eveland, "Illumination invariant face recognition using thermal infrared imagery," in IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, pp. 527-534, December 11-13, 2001.

[101] T. Coleman and Y. Y. Li, "An interior trust region approach for nonlinear minimization subject to bounds," SIAM Journal on Optimization, vol. 6, no. 2, pp. 418-445, 1996.
[...] more friendly intelligent world, where humanoid robots enter the domestic home as helpers, ushers and so on. To fulfill their tasks, robots must be able to sense the environment around them, especially humans. Audio and visual perceptions are the first requirement of this operation. In this thesis, audio and visual perceptions for mobile robots are investigated for the purpose of sensing the environment around [...]

[...] beings and animals take these capabilities of audio and visual perception for granted. Machines, however, have no such capability, and training them becomes a great challenge. It is not surprising, therefore, that audio and visual perception have attracted much attention in the literature [2-7], owing to their wide applications including robotic perception [8], human-machine interfaces [9], handicappers' [...]

[...] will make use of multiple visual sensors in this thesis, which will provide sufficient information for human identification.

1.3 Research Aims and Objectives

On the basis of what we have reviewed, ITD-based microphone arrays and multiple cameras, such as stereo cameras, are chosen for the audio and visual perception of mobile robots, respectively. Microphone arrays consist of multiple microphones at different spatial [...]

[...] the output power of a steered beamformer is maximized. In the simplest type, known as the delay-and-sum beamformer, the various sensor outputs are delayed and then summed. Thus, for a single target, the average power at the output of the delay-and-sum beamformer is maximized when it is steered towards the target. Though beamforming is extensively used in speech-array applications for voice capture, it has rarely [...]

[...] to a half horizontal plane [20]. On the other hand, mobile platforms require sensor structures to be compact and small, which limits the number of microphones and subsequently reduces the localization domain of the platforms. Besides the problem of audio perception for mobile robots, the challenge associated with visual perception is that vision-based human detection may not [...]

List of Figures (fragment)
   [...] waveforms for sound source at the front 48
   3.7 Computed waveforms for sound source at the back 48
   3.8 The onset and amplitude for a sound source at the front 49
   3.9 ITD and IID derivation from computed waveforms 50
   3.10 ITD and IID response at the front 51
   3.11 ITD and IID response at the back 51
   3.12 Front-back [...]

[...] vision-based human detection may not be robust in variable environments. It requires a more reliable visual perception system that not only detects humans robustly, but also discriminates humans from human-like objects. The ultimate objective of this work is thus to investigate audio and visual perceptions for mobile robots, which includes the analysis of the localization strategies of systems with a limited [...]

[...] human candidates [67-73]. Robust human detection may then be achieved.

1.5 Contributions

In this thesis, we investigate audio and visual perception for mobile robots. It includes the study of sound localization systems with a limited number of microphones, such as 3 or 2 microphones, and of visual human detection in variable environments. The main contributions made in this thesis are summarized as follows: [...]
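The delay-and-sum operation described in the excerpt above is easy to make concrete. The following is a minimal sketch, not the thesis implementation; the linear-array geometry, 16 kHz sampling rate, white-noise source and steering grid are all assumptions invented for the demo:

```python
# Minimal delay-and-sum beamformer sketch. Each microphone output is advanced
# by its plane-wave delay for a candidate steering angle, the outputs are
# summed, and the steering angle that maximizes the average output power is
# taken as the source direction.
import numpy as np

C = 343.0    # speed of sound in air (m/s), room-temperature assumption
FS = 16000   # sampling rate (Hz), chosen for the demo

def frac_delay(sig, delay_s):
    """Delay a signal by a possibly fractional number of samples using an
    FFT phase shift (circular, which is acceptable for this demo)."""
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / FS)
    spec = np.fft.rfft(sig) * np.exp(-2j * np.pi * freqs * delay_s)
    return np.fft.irfft(spec, len(sig))

def steered_power(signals, mic_x, azimuth_deg):
    """Average output power of a delay-and-sum beamformer steered to a
    far-field azimuth, measured from broadside of a linear array."""
    theta = np.radians(azimuth_deg)
    out = sum(frac_delay(s, -x * np.sin(theta) / C)   # undo the assumed delay
              for x, s in zip(mic_x, signals))
    out /= len(mic_x)
    return float(np.mean(out ** 2))

# Demo: 4-microphone line array (5 cm spacing), broadband source at 30 degrees.
rng = np.random.default_rng(0)
mic_x = np.array([0.0, 0.05, 0.10, 0.15])             # microphone positions (m)
src = rng.standard_normal(FS)                         # 1 s of white noise
signals = np.stack([frac_delay(src, x * np.sin(np.radians(30.0)) / C)
                    for x in mic_x])

est = max(range(-90, 91), key=lambda az: steered_power(signals, mic_x, az))
print("steered-response peak at", est, "deg")         # expect 30
```

Scanning the steering angle and picking the power maximum is exactly the steered-response idea the excerpt describes; its drawbacks for broadband, intermittent speech are what the text turns to next.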
[...] fact that it is less efficient and less satisfactory as compared to other methods. Moreover, the steered response of a conventional beamformer is highly dependent on the spectral content of the source signal, such as the radio frequency (RF) waveform. Therefore, beamforming is mainly used in radar, sonar, wireless communications and geophysical exploration. In order to enable a beamformer to respond to an unknown [...]

[...] Localization

Multiple sound source localization and separation methods have been developed in the field of antennas and propagation [32]. However, different techniques have to be developed for sound, e.g., human speech, as it varies dynamically in amplitude and contains numerous silent portions. In [21], ITD candidates were calculated for each frequency component and mapped into a histogram. The number of peaks [...]
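The histogram idea attributed to [21] in the excerpt can likewise be sketched. The following is a hedged illustration rather than the exact method of [21]: per-frequency-bin ITD candidates are derived from the phase of the two-microphone cross-power spectrum and pooled into a histogram, and each prominent peak suggests one source. The two-source test signal, microphone spacing and thresholds are assumptions for the demo:

```python
# Per-frequency ITD histogram for counting/locating multiple sources.
import numpy as np

C, FS, D = 343.0, 16000, 0.2       # speed of sound (m/s), sample rate, mic spacing (m)

def itd_histogram(left, right, n_bins=81):
    """Pool per-bin ITD candidates, taken from the cross-power-spectrum
    phase of two microphone signals, into a histogram over the physically
    admissible ITD range [-D/C, D/C]."""
    itd_max = D / C
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    freqs = np.fft.rfftfreq(len(left), d=1.0 / FS)
    cross = L * np.conj(R)                      # phase = 2*pi*f*ITD at each bin
    mag = np.abs(cross)
    strong = (mag > 0.01 * mag.max()) & (freqs > 50.0)  # energetic bins only
    itd = np.angle(cross[strong]) / (2.0 * np.pi * freqs[strong])
    itd = itd[np.abs(itd) <= itd_max]           # drop phase-wrapped candidates
    return np.histogram(itd, bins=n_bins, range=(-itd_max, itd_max))

# Demo: two spectrally disjoint sources with different (assumed) ITDs.
t = np.arange(FS) / FS                          # 1 s of signal
itd1, itd2 = 2.0e-4, -3.5e-4                    # assumed source ITDs (s)
left = (sum(np.sin(2 * np.pi * f * t) for f in (350, 440, 520))
        + sum(np.sin(2 * np.pi * f * t) for f in (880, 1000, 1150)))
right = (sum(np.sin(2 * np.pi * f * (t - itd1)) for f in (350, 440, 520))
         + sum(np.sin(2 * np.pi * f * (t - itd2)) for f in (880, 1000, 1150)))

hist, edges = itd_histogram(left, right)
centers = 0.5 * (edges[:-1] + edges[1:])
peaks = centers[(hist > 0) & (hist >= 0.5 * hist.max())]
print("histogram peaks near (s):", peaks)       # expect clusters near itd1, itd2
```

With spectrally disjoint sources, each source dominates its own frequency bins, so the histogram shows one cluster per source; overlapping broadband sources and reverberation would blur these peaks, which is part of what makes speech a harder case than the antenna-array setting.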
