Road traffic control gesture recognition using Microsoft Kinect
University of Engineering and Technology (Trường Đại học Công Nghệ)
Entry for the 2011 "Student Scientific Research" Award
Project title: Road traffic control gesture recognition using Microsoft Kinect
Team: Lê Quốc Khánh (male), Phạm Chính Hữu (male)
Class: K53CA. Faculty: Information Technology
Supervisor: Dr. Lê Thanh Hà
Hanoi, March 2012

Abstract

Our study concentrates on building an intelligent system for smart vehicles. Specifically, the system identifies the traffic control commands of a police officer and proposes the right decision to the driver; it enables a smart vehicle to detect and recognize a traffic officer on the road. Technically, we use the built-in depth sensor of the Microsoft Kinect to capture images for the recognition system. Unlike an RGB camera, a depth image provides depth (i.e., 3D) information and is invariant to color and texture. By incorporating spatio-temporal invariance into the geometric features and applying a machine learning classifier, we are able to predict the traffic control command from the captured depth information. The feature vector is constructed from the relative angles between the human body parts that can be extracted from the Kinect. We present experimental results on a test set of more than 30,000 frames covering six kinds of traffic commands. Using both K-means and a Support Vector Machine (SVM) for classification, the better result, about 99.8%, is obtained by the SVM classifier. Moreover, the application runs steadily in real time.

1. Problem statement

Human traffic control is preferred in developing nations because of the relatively fewer cars, the few major intersections, and the low cost of human traffic controllers [3]. In a human traffic control environment, drivers must follow the directions given by the traffic police officer in the form of body gestures. To improve the safety of drivers, our research team is developing a novel method to automatically recognize traffic control gestures.

There have been a few methods developed for traffic control gesture recognition in the literature. Fan Guo et al. [6] recognized police gestures from the corresponding body parts on the color image plane. The detection results of this method were heavily affected by the background and outdoor illumination, because the traffic police officer in a complex scene is detected by extracting his reflective traffic vest using color thresholding. Yuan Tao et al. [23] fixed on-body sensors to the back of each hand of the police officer to extract gesture data. Although this accelerometer-based sensor may output accurate hand positions, it imposes an extra hindrance on the officer and requires a unique communication protocol for vehicles. Meghna Singh et al. [11] used the Radon transform to recognize air marshals' hand gestures for steering aircraft on the runway. However, since a relatively stationary background in the video sequence is required, this method is not practical for traffic scenes.

Human gesture recognition for traffic control can be related to that for human-robot interaction. Bauer et al. [6] presented an interaction system in which a robot asks a human for directions and then interprets the given directions. This system includes a vision component in which the full body pose is inferred from a stereo image pair. However, this fitting process is rather slow and does not work in real time.
Waldherr et al. [5] presented a template-based hand gesture recognition system for a mobile robot, with gestures for the robot to stop or follow, and rudimentary pointing. As the gesture system is based on a color-based tracker, several limitations are imposed on the types of clothing and the contrast with the background. In [16], Van den Bergh et al. introduced a real-time hand gesture interaction system based on a Time-of-Flight (ToF) camera. Both the depth images from the ToF camera and the color images from the RGB camera are used for a Haarlet-based hand gesture classification. Similar ToF-based systems are also described in the literature [18][5][21]. The use of a ToF camera allows for a recognition system that is robust to all colors of clothing, to background noise, and to other people standing around. However, ToF cameras are expensive and suffer from a very low resolution and a narrow angle of view. M. V. Bergh et al. [13] implemented a pointing hand gesture recognition algorithm based on the Kinect sensor to tell a robot where to go. Although this system can be used for real-time robot control, it cannot be applied directly to the traffic control situation, because the meaningful gestures are limited to the pointing of hands.

In the Vietnamese traffic control system, a human traffic controller assesses the traffic in visual range around the intersection. Based on his observation, he makes decisions and gives traffic signals, in the form of his arms' directions and movements, to all incoming vehicle drivers. In this research, we only consider the directions of the arms for classifying traffic control commands. Based on observations at real traffic intersections in Vietnam, we categorize the control commands into three types, as shown in Table 1.

Type | Command | Human arm directions
1 | Stop all vehicles in every road direction. | Left/right arm raised straight up
2 | Stop all vehicles in front of and behind the traffic police officer. | Left/right arm raised to the left/right
3 | Stop all vehicles on the right of and behind the traffic police officer. | Left/right arm raised to the front
Table 1. Three types of traffic control command

From these control command types, six traffic gestures can be constructed. Each traffic gesture is a combination of the arms' directions, as listed in Table 2.

Gesture | Human arm directions | Command type
1 | left hand raised straight up | 1
2 | right hand raised straight up | 1
3 | left hand raised to the left | 2
4 | right hand raised to the right | 2
5 | left hand raised to the front | 3
6 | right hand raised to the front | 3
Table 2. The six traffic gestures defined

As described in Section 2, human body parts, including arm directions, can be represented by a skeleton model consisting of 15 joints, namely: head, neck, torso, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, left hip, right hip, left knee, right knee, left foot, and right foot. Therefore, the recognition of traffic gestures can be done using the skeleton model. Figure 1 depicts two examples of traffic gestures and their skeletal joints. Since the skeleton model represents human parts simply by a set of relative joints, it offers a significant recognition advantage over raw depth and color information. Therefore, instead of recognizing human parts directly from depth and color images, we perform skeleton recognition after preprocessing the Kinect's depth images using the OpenNI library.
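Table 2 reduces each gesture to a pair (which arm, which direction). Purely as an illustration, and not the learned classifier actually used in this work (described in Section 3), such a mapping could be coded as in the sketch below; the function name, coordinate convention, and thresholds are our own assumptions:

```python
import numpy as np

def arm_direction(shoulder, hand):
    """Classify the direction of one arm from two 3D joint positions.

    Coordinates are assumed to be (x, y, z) in the camera frame, with
    x pointing right, y pointing up, and z pointing away from the sensor.
    """
    v = np.asarray(hand, dtype=float) - np.asarray(shoulder, dtype=float)
    v /= np.linalg.norm(v)
    if v[1] > 0.8:          # arm nearly vertical: raised straight up
        return "up"
    if abs(v[2]) > 0.7:     # arm along the optical axis: raised to the front
        return "front"
    if abs(v[0]) > 0.7:     # arm nearly horizontal: raised to the side
        return "side"
    return "none"           # arm lowered or between poses

# The six gestures of Table 2 as (arm, direction) pairs.
TABLE_2 = {
    ("left", "up"): 1, ("right", "up"): 2,
    ("left", "side"): 3, ("right", "side"): 4,
    ("left", "front"): 5, ("right", "front"): 6,
}

# Example: a left hand held straight above the left shoulder -> gesture 1.
print(TABLE_2[("left", arm_direction((0.2, 1.4, 2.0), (0.2, 2.0, 2.0)))])
```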
Figure 1. Traffic gestures and skeletal joints

In this research, we distinguish two types of gesture for recognition: static and dynamic gestures. Based on the descriptions in Table 1, the commands of the traffic officer are clearly static gestures. We have completed a system for recognizing static gestures, and we are pursuing an on-going approach to dynamic gesture recognition in order to cover a wider variety of human gestures. Our completed approach is a real-time human body gesture recognition method for road traffic control. In this method, the six body gestures used by a police officer to control the flow of vehicles at a common intersection are recognized using the Microsoft Kinect. To recognize the defined gestures, a depth sensor is installed and used to generate a depth map of the scene where the traffic police officer stands. Then a skeleton representation of the officer's body is computed, and a feature vector is created from the joints of the skeleton model.

2. Related work

2.1 Human body parts recognition using Microsoft Kinect

The approach of using RGB images or video for human detection and recognition faces challenging problems due to variations in pose, clothing, lighting conditions, and the complexity of backgrounds. These result in a drop in detection and recognition accuracy or an increase in computational cost. Therefore, the approach of using 3D reconstruction information obtained from depth cameras has received much attention recently [22][10][9][24]. Depth images have several advantages over 2D intensity images: range images are robust to changes in color and illumination, and they are simple representations of 3D information. However, earlier range sensors were expensive and difficult to use in human environments because they relied on lasers.

a. Microsoft Kinect for obtaining depth images

Recently, Microsoft launched the Kinect, a peripheral designed as a video-game controlling device for the Microsoft X-Box console. Despite its initial purpose, it facilitates research in human detection, tracking, and activity analysis thanks to its combination of high capability and low cost. The sensor provides a depth resolution similar to that of ToF cameras, but at a cost several times lower. To obtain the depth information, the device uses PrimeSense's Light Coding technology [19], in which infra-red (IR) light is projected as a dot pattern onto the scene. This projected light pattern creates textures that help find the correspondence between pixels, even on shiny or texture-less objects or under harsh lighting conditions. In addition, because the pattern is fixed, there is no time-domain variation other than the movements of the objects in the camera's field of view. This ensures a precision similar to ToF, but PrimeSense's mounted IR receiver is a standard CMOS sensor, which reduces the price of the device drastically.

Figure 2. Block diagram of the PrimeSense reference design [20]

Figure 2 depicts the block diagram of the reference design used by the Kinect sensor [20]. The sensor is composed of one IR emitter, responsible for projecting the light pattern onto the scene, and a depth sensor responsible for capturing the emitted pattern. It is also equipped with a standard RGB sensor that records the scene in visible light. Both the depth and RGB sensors have a resolution of 640x480 pixels. The calibration and matching between the depth and RGB pixels, as well as the 3D reconstruction, are handled at chip level.
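For completeness, this is roughly what grabbing such 640x480 depth frames looks like in code. The sketch below uses the primesense Python bindings for OpenNI2, a later API than the OpenNI 1.x library used in this work, so treat it as an approximation rather than our actual capture code:

```python
import numpy as np
from primesense import openni2  # Python bindings for OpenNI2 (assumed installed)

# On some systems initialize() needs the path to the OpenNI2 redistributable,
# e.g. openni2.initialize("/path/to/OpenNI2/Redist").
openni2.initialize()
dev = openni2.Device.open_any()           # first compliant sensor, e.g. a Kinect
depth_stream = dev.create_depth_stream()
depth_stream.start()

frame = depth_stream.read_frame()         # one 640x480 depth frame
depth = np.frombuffer(frame.get_buffer_as_uint16(), dtype=np.uint16)
depth = depth.reshape(frame.height, frame.width)  # per-pixel depth in millimeters

depth_stream.stop()
openni2.unload()
```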
b. Human body pose recognition using depth images

For human body pose recognition, PrimeSense has created an open-source library, Open Natural Interaction (OpenNI) [15], to promote natural interaction. OpenNI provides several algorithms for using PrimeSense-compliant depth cameras, including the Microsoft Kinect, in natural interaction applications. Some of these algorithms provide the extraction and tracking of a skeleton model of the user who is interacting with the device. The kinematic model is a full skeleton model of the body consisting of 15 joints, as shown in Figure 3. The algorithms provide the 3D positions and orientations of every joint and update them at a rate of 30 fps. Additionally, they provide the confidence of these measures and are able to track up to four simultaneous skeletons.

Figure 3. OpenNI's kinematic model of the human body

Other research using the MS Kinect for human pose estimation has also been reported. In [7], J. Charles et al. proposed a method for learning and recognizing 2D articulated human pose models from a single depth image obtained from the Microsoft Kinect. Although poses are substantially well recognized, the 2D representation of the articulated human pose models makes human activity recognition more difficult than with the 3D representation of OpenNI. In [14], L. M. Adolfo et al. presented a method for upper-body pose estimation with online initialization of pose and anthropometric profile. A likelihood evaluation is implemented to allow the system to run in real time. Although the method in [14] performs better than OpenNI in cases of limb self-occlusion, its upper-body-only representation is suitable only for a small range of recognition applications. For these reasons, we choose OpenNI to preprocess the depth images from the MS Kinect and obtain the human skeleton models.

2.2 Traffic gesture recognition

[6] presents an approach to recognizing traffic gestures in Chinese traffic. The Chinese traffic police gesture system is defined and regulated by the Chinese Ministry of Public Security. Figure 4 shows 2 of the 10 gesture types.

Figure 4. Chinese traffic gestures

The idea of this recognition system is based on rotation joint angles. As can be seen in Figure 4, these gestures require the upper and lower arms to keep certain angles to the vertical direction by rotating around the shoulder or elbow joints, so the rotation joint angles are used to recognize the gestures, which makes it easy to add a new gesture without changing the existing angles. Since the gestures may not be performed perfectly in a real situation, the angles are set within a certain range rather than at a fixed value. Let θ_i (i = 1, ..., 4) denote the rotation angle related to each arm for the gestures; information about θ_i is provided in Table 3.
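To make this range-based test concrete, a minimal sketch is given below. This is not code from [6], and the angle values are placeholders, since Table 3 is not reproduced here:

```python
import math

def angle_to_vertical(joint_a, joint_b):
    """Angle in degrees between the limb a->b and the upward vertical,
    with joints given as (x, y, z) and the y axis pointing up
    (0 degrees = limb pointing straight up)."""
    vx, vy, vz = (b - a for a, b in zip(joint_a, joint_b))
    norm = math.sqrt(vx * vx + vy * vy + vz * vz)
    return math.degrees(math.acos(vy / norm))

# Placeholder ranges (degrees) for theta_1..theta_4 of one gesture;
# the real values would come from Table 3 of [6].
GESTURE_RANGES = {
    "stop": [(0.0, 20.0), (0.0, 20.0), (70.0, 110.0), (70.0, 110.0)],
}

def matches(gesture, thetas):
    """Accept the gesture when every measured angle lies in its range."""
    return all(lo <= t <= hi
               for (lo, hi), t in zip(GESTURE_RANGES[gesture], thetas))
```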
[...]

2.4 Hand tracking and gesture recognition

The work of Cristina Manresa et al. [4] aims at controlling video games through hand gesture recognition. They propose a new algorithm to track and recognize hand gestures for interacting with a video game. This algorithm is based on three steps: hand segmentation, hand tracking, and gesture recognition from the hand posture. For the hand [...]

[...] To collect training data samples, we captured a traffic gesture database from a group of five persons. Each person performs a traffic gesture at different locations and angles to the Kinect sensor. For each traffic gesture of a performing person, we record about 1,000 frames of depth images. Then the coordinates of all 15 skeletal joints in each frame are calculated and stored in the traffic gesture database. In total, the number of training vectors in our traffic gesture database is 30,509, and each vector is labeled with its gesture number.

The Weka tool [23] is used to train and test the human pose recognition accuracy with K-means clustering and a C-SVM classifier with C = 1.0 and an RBF kernel. The data set includes the 30,509 samples labeled with the six defined gestures. The test mode is 10-fold cross-validation, which [...] The true positive (TP) rate is the proportion of examples which were classified as gesture x among all examples which are truly labeled as gesture x. The false positive (FP) rate is the proportion of examples which were classified as gesture x, but labeled as a different gesture, among all examples which are not labeled as gesture x. Precision is the proportion of examples which are truly labeled as x among all those which were classified as gesture x. The experiments [...]

3.3 Testbed system

A real-time testbed system for traffic control gesture recognition has been built, and the diagram of data flow in the system is presented in Figure 6. Generally, the system is divided into two parts: training and prediction. In the training part, the whole traffic gesture database, described in the previous section, is used to train the classifier model.
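The training above is performed in Weka; the following sketch reproduces the same configuration (C-SVM, RBF kernel, C = 1.0, 10-fold cross-validation) with scikit-learn, using random placeholder data in place of the 30,509 recorded feature vectors:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Random stand-in for the recorded database: one joint-angle feature
# vector per frame and a gesture label in 1..6 (the real set has
# 30,509 labeled vectors; a smaller array keeps this sketch fast).
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 8))
y = rng.integers(1, 7, size=3000)

clf = SVC(C=1.0, kernel="rbf")               # C-SVM with an RBF kernel
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
print("mean accuracy: %.4f" % scores.mean())
```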
In the predicting part, depth images captured from the Kinect sensor at a rate of 15 fps are processed with the OpenNI library, and the skeletal models are obtained. Each skeletal model is then fed to the trained classifier to predict a gesture number. Because of misclassification, especially at the border between two human poses, the predicted gesture number for the same human pose may vary rapidly. Therefore, we choose the gesture number with the maximum occurrence as the current human gesture of the traffic police officer and signal it to the vehicle driver. Figure 7 shows an example of our proposed system GUI and a result of gesture recognition.

Figure 6. Diagram of information flow in our system

Figure 7. The proposed system GUI and the result of gesture recognition

4. Conclusion

We have presented an algorithm for recognizing the gestures of the traffic police based [...] for training and predicting skeletal human traffic gestures in real-time applications. [...] the algorithm by using dynamic gestures, enabling it to handle more gestures of the traffic police.

5. Future work

5.1 Dynamic gesture recognition problem

In our previous work, we succeeded in building a system that recognizes static gestures. This application achieves a significant result in Vietnamese traffic control command identification. Nevertheless, our system works properly only with static gestures, that is, gestures that belong [...]

[...] dataset allows the classifier to estimate body parts invariantly to pose, body shape, clothing, etc. Finally, they generate confidence-scored 3D proposals of several body joints by reprojecting the [...]

References

[4] [...] F. J. Perales, "Hand tracking and gesture recognition for Human-Computer Interaction," Computer Vision and Image Analysis, 2000.
[5] E. Kollorz, J. Penne, J. Hornegger, and A. Barke, "Gesture recognition with a time-of-flight camera," International Journal of Intelligent Systems Technologies and Applications, 5(3/4), 334-343, 2008.
[6] Fan Guo, Zixing Cai, and Jin Tang, "Chinese Traffic Police Gesture Recognition in Complex [...]
[...] range scan data," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2006.
[10] H. P. Jain and A. Subramanian, "Real-time upper-body human pose estimation using a depth camera," HP Technical Reports, HPL-2010-190, 2010.
[11] Meghna Singh, Mrinal Mandal, and Anup Basu, "Visual gesture recognition for ground air traffic control using the Radon transform," IEEE/RSJ IROS, 2005.
[12] Meinard Müller, [...]