“I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no materials previously published or written by another person except where due reference or acknowledgement is made.”
“I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Master of Computer Science degree at the University of Engineering and Technology.”
I would like to express my sincere gratitude to my advisor, Dr. Le Thanh Ha, University of Engineering and Technology, Vietnam National University, Hanoi, for his enthusiastic guidance, warm encouragement and useful research experiences. I am grateful to all the teachers of the University of Engineering and Technology, VNU, for the extremely valuable knowledge they gave me during my master's course. I would like to thank all my friends and lab mates in the Human Machine Interaction Laboratory for their helpful discussions about my research topic. I sincerely acknowledge the 2012 basic research project in natural science of the National Foundation for Science & Technology Development (Nafosted), Vietnam (102.01-2012.36, Coding and communication of multiview video plus depth for 3D Television Systems) for financially supporting my master's study. Last, but not least, my family is the biggest motivation behind me. I would like to thank my parents and my brother for supporting me spiritually throughout the writing of this thesis. I would like to send them my gratitude and love.
Hanoi, October 20th, 2015 Tran Nguyen Le
Table of Contents
Abbreviations
List of Figures
List of Tables
Chapter 1 INTRODUCTION
1.1 Motivation
1.2 Objectives
1.3 Methodology
1.4 Thesis’s outline
Chapter 2 RELATED WORK
2.1 Infrared laser tracking devices for presentation
2.2 Distance transform based hand gesture recognition
2.3 Body tracking-based hand gesture recognition using Microsoft Kinect
3.1 Image sequence preprocess
3.1.1 Motion extraction from depth images
3.1.2 Noise reduction
3.1.3 Initial hand detection
3.2 Hand localization
3.2.1 Hand tracking
3.2.2 Hand region segmentation
    Updating the hand point position
    Using depth threshold from the depth value of the hand point
    Using blob detection to detect hand region from others
    Reducing noise from hand area using hand point position
3.2.3 Hand contour extraction
3.3 Hand gesture recognition
3.3.1 Sample gesture definition
3.3.2 Feature vector selection
    Hand posture
    Dynamic hand gesture
3.3.3 Training and classifying
    Hand posture
    Dynamic hand gesture
3.4 Presentation controller
3.4.1 System requirements
3.4.2 Workflow of controlling presentation
3.4.3 Presentation controller interface
4.1 Data collection
4.1.1 Hand posture database
4.1.2 Dynamic hand gesture database
4.2 Test-bed system and results
4.2.1 Accuracy of hand posture recognition
4.2.2 Accuracy of dynamic hand gesture recognition
4.2.3 Presentation controller performance
Chapter 5 CONCLUSION
References
Abbreviations
TV Television
RGB Red Green Blue
SDK Software Development Kit
PC Personal Computer
List of Figures
Figure 2.1: Tracked skeleton joints of the user’s body [9]
Figure 3.1: Abstract layered view of proposed system
Figure 3.2: The process of generating the motion image
Figure 3.3: (a) The opening operation, (b) The erosion operation, (c) The dilation operation
Figure 3.4: (a) The original motion image, (b) The reduced noise motion image
Figure 3.5: Motion clustering with hand size: (a) Before applying the threshold of hand size (b) After applying the threshold of hand size
Figure 3.6: Motion history image and motion template procedure. Motion history at time (a) t, (b) t+1, (c) t+2, (d) Depth motion history image
Figure 3.7: The direction of cluster
Figure 3.8: Result of the initial hand detection
Figure 3.9: Hand tracking using Kalman filter
Figure 3.10: The result of hand region extraction using blob detection: (a) Detected blobs (b) Extracted blob including hand point
Figure 3.11: The result of hand segmentation
Figure 3.12: Binary image including hand area
Figure 3.13: Contour tracing using Moore-Neighbor tracing algorithm
Figure 3.14: Hand contour extraction using Moore-Neighbor tracing algorithm
Figure 3.15: Hand postures definition
Figure 3.16: Dynamic hand gesture definition
Figure 3.17: Computation of angle relation
Figure 3.18: Workflow of gesture recognition process
Figure 3.19: Workflow of controlling presentation
Figure 3.20: Presentation controller interface
Figure 4.1: Result with Logistic Regression classifier
List of Tables
Table 4.1: The result of classifying next/previous and grasp/release gesture
Table 4.2: The result of classifying next and previous gesture
Table 4.3: The result of classifying grasp and release gesture
Chapter 1 INTRODUCTION
1.1 Motivation
With present-day technology, slideshow presentation applications such as PowerPoint are becoming more popular and play an important role in many areas, especially business and education. However, among the various presentation controls, the most common and widely used tools are still the standard mouse and keyboard, which can be inconvenient for presenters during their speech. For example, when the projection plane is far away from the computer, presenters have to walk back and forth a long distance between the computer and the screen whenever they want to point at something on a slide, which interrupts the presentation. On the other hand, staying close to the computer reduces body language and eye contact with the listeners. Another popular presentation tool nowadays is the laser pointer. Nevertheless, the laser point is hard for the audience to follow because of its fast movement and unpredictable trajectory. As technology progresses, more natural and easy-to-use presentation techniques are being developed to overcome the above disadvantages and deliver a good experience to presenters as well as listeners.
To address this demand, one of the most studied research topics at present is hand gesture recognition. In recent years, hand gesture recognition systems have gained great attention because of their ability to interact with computers effectively. The use of gestures makes the interaction between human and computer easy, convenient and interesting. Hand gestures are already used to control various applications such as robot control, smart TV control and gaming. Along with the strong development of such systems, more and more new devices in the area of gesture recognition are becoming popular and successful. One of them is the Kinect sensor [1], an input device for motion sensing developed by Microsoft. This sensor allows users to control and interact with an application using real gestures. Its low price, its ability to work with traditional computer hardware and the existence of developer tools for Kinect application development have made Kinect popular compared to other existing motion tracking sensors. Therefore, this thesis puts forth the idea of controlling slides during a presentation with a hand gesture recognition system using Kinect.
1.2 Objectives
The main objective of our thesis is to propose an architectural design of an intelligent presentation system using a contour-based hand gesture recognition method [2]. Our system contains four major components: the image sequence preprocess, hand localization, hand gesture recognition and the presentation controller. Unlike other hand gesture recognition systems based on visual color methods [3,4], which are highly affected by the illumination condition, the proposed system should be able to work in a low-light environment, which is the common condition of a presentation, by using depth image data captured from the Kinect sensor. In addition, it must ensure the accuracy and real-time performance of the hand gesture recognition method.
1.3 Methodology
In this research, the first component of the proposed system detects the initial hand by a motion-based algorithm. Then, after detecting and tracking the hand region, the hand localization unit extracts hand contours and describes them by an illumination-, rotation- and scale-invariant feature vector. In the third major component, logistic regression and multilayer perceptron classifiers are employed for hand posture and dynamic hand gesture recognition, respectively. Finally, in the presentation controller module, the recognized hand gestures are transformed into a visual command in order to move a slide forward or backward.
1.4 Thesis’s outline
The remainder of this thesis is organized as follows. Chapter 2 describes the related hand gesture recognition methods and existing systems for intelligent presentation. Chapter 3 presents the proposed hand localization and hand gesture recognition method as well as the way to control a PowerPoint presentation. Chapter 4 shows the experimental results of our prototype application. Finally, Chapter 5 concludes the thesis.
Chapter 2 RELATED WORK
There are many existing hand gesture recognition solutions for presentation control. Gesture-controlled solutions for presentation control are usually based on motion-sensing devices such as cameras, data gloves, infrared sensors and other similar devices. Some of these solutions are described below.
2.1 Infrared laser tracking devices for presentation
In [5], a system for large display interaction using an infrared laser tracking device is presented. The authors address the challenge of a natural interaction system by hiding the cursor and laser pointer, not requiring clicking, and using hotspots and gestures. Hotspots are areas around objects that are highlighted with a colored background when the pointer enters them. This provides a mechanism for selecting objects without clicking. Technically, to select an object, the user moves the laser pointer towards it. When the pointer moves inside the object, the system detects that the laser beam has crossed the boundary. The system then reacts to this crossing and highlights the object, while the laser pointer stops at the center of the object. This is ideal because people tend to point towards the center of an object rather than its edges. When the user points away from the object, it reverts to its original appearance. Gestures are natural movements of the hand (as indicated by the path traced by the pointer) that the system recognizes, allowing an action to be performed. Such gestures are used successfully in modern web browsers such as Mozilla and Opera. The idea here is to use gestures to select an object by circling around it, or to navigate a piece of information, for example moving forward with a left-to-right sweeping gesture or backward with a right-to-left gesture.
These ideas are demonstrated by an add-in module for Microsoft PowerPoint using the NaturalPoint™ Smart-Nav™ tracking device. However, two noteworthy limitations have been recognized. Smart-Nav is designed for use by individuals at a distance of less than approximately 2 meters, and it has a low resolution of 256 x 256 pixels. This is possibly sufficient for the application on a large display at a distance of around 3 meters. The more significant issue is that the camera has difficulty tracking small objects smoothly. In some cases, the camera may lose tracking altogether, often because of bursts of frames between periods of inactivity.
2.2 Distance transform based hand gesture recognition
In [6], Ram Rajesh J et al. suggest two techniques to control the slides of a PowerPoint presentation in a device-free manner, without any markers or gloves. Using a bare hand, the gesture is given as input to the webcam connected to the PC. Then, using an algorithm that counts the number of active fingers, the gesture is recognized and the slideshow is controlled. The number of active fingers is found using two methods, namely circular profiling and distance transform.
Circular profiling
The finger count is determined as in [7]. The centroid of the segmented binary image of the hand is computed. After that, the length of the longest active finger is found by drawing the bounding box of the hand. A circle is then drawn with the centroid as its center and the length of the longest finger multiplied by 0.7 as its radius [8], so that it intersects the active fingers of the hand. If a finger is active, it crosses the circle. A graph is used to count the number of transitions from white to dark areas along the circle. This number gives the number of active fingers, from which the gesture can be determined. If a value less than 0.7 is used, the circle encloses only the palm region. If a value greater than 0.7 is used, the circle doesn’t intersect the thumb.
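The circular profiling idea can be sketched as follows. This is a minimal illustration, not the implementation from [6] or [7]: the function name `count_active_fingers`, the sampling density, and the use of the farthest hand pixel from the centroid as a stand-in for the "length of the longest finger" are all assumptions made for the example. It also assumes the wrist and forearm have already been cropped out of the binary mask.

```python
import math

def count_active_fingers(mask, radius_factor=0.7, samples=360):
    """Count fingers crossing a circle centred on the hand centroid.

    mask is a 2D list of 0/1 values (1 = hand pixel).
    """
    pixels = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not pixels:
        return 0
    # Centroid of the segmented hand.
    cy = sum(r for r, _ in pixels) / len(pixels)
    cx = sum(c for _, c in pixels) / len(pixels)
    # Assumption: the longest finger length is approximated by the
    # distance from the centroid to the farthest hand pixel.
    longest = max(math.hypot(r - cy, c - cx) for r, c in pixels)
    radius = radius_factor * longest
    # Sample the mask along the circle to build a 0/1 profile.
    profile = []
    for k in range(samples):
        angle = 2.0 * math.pi * k / samples
        r = int(round(cy + radius * math.sin(angle)))
        c = int(round(cx + radius * math.cos(angle)))
        inside = 0 <= r < len(mask) and 0 <= c < len(mask[0])
        profile.append(1 if inside and mask[r][c] else 0)
    # Each white-to-dark transition along the (cyclic) profile is one finger.
    return sum(1 for k in range(samples)
               if profile[k - 1] == 1 and profile[k] == 0)
```

With a palm-only mask the circle stays inside the hand and no transition is counted; every stretched finger that reaches past the circle adds one transition.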
However, the disadvantage of this method is that the hand must be properly placed with respect to the webcam so that the whole hand region is captured to draw the circle. If the hand is not placed properly, the gesture is not recognized correctly. Gestures in this technique involve only one hand, which reduces the number of gestures compared to what could be made using both hands. Additionally, the response time is very high.
Distance transform
The distance transform method gives the Euclidean distance of each pixel from the nearest boundary pixel. The distance from the boundary to a pixel in the hand area increases as the pixel gets farther from the boundary. Using this distance value, the centroid of the palm area can be computed. The number of fingers used in the gesture is found by drawing a line along the major axis of each segmented finger area. The number of lines drawn is equal to the number of active fingers. This value is used to control the PowerPoint slides.
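To make the palm-centre step concrete, the sketch below computes a brute-force Euclidean distance transform and takes the hand pixel farthest from the background as the palm centre. It is only an illustration under stated assumptions: the names `distance_transform` and `palm_centre` are hypothetical, distances are measured to the nearest background pixel, and the quadratic-time search is only suitable for small masks (a production system would use an optimized library routine).

```python
import math

def distance_transform(mask):
    """Euclidean distance of every hand pixel to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    if not background:
        raise ValueError("mask has no background pixels")
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                dist[r][c] = min(math.hypot(r - br, c - bc)
                                 for br, bc in background)
    return dist

def palm_centre(mask):
    """Return (row, col, radius) of the pixel deepest inside the hand."""
    dist = distance_transform(mask)
    best = (0, 0, 0.0)
    for r, row in enumerate(dist):
        for c, d in enumerate(row):
            if d > best[2]:
                best = (r, c, d)
    return best
```

Fingers are thin, so their pixels have small distance values; the maximum therefore falls inside the palm, and its value gives a rough palm radius.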
The efficiency of the distance transform method diminishes when the hand is far from the focus of the camera. Improper gestures and gestures made quickly without a pause also reduce the accuracy. The effectiveness decreases if the background contains objects such as wall hangings or furniture with colors similar to skin color. Problems also occur if the fingers are not stretched properly while making a gesture.
2.3 Body tracking-based hand gesture recognition using Microsoft Kinect
Many researchers use the Microsoft Kinect device to capture both RGB and depth data. In [9], the authors developed algorithms that identify humans in a scene, perform full body tracking and predict a person’s skeletal structure in real time.
Figure 2.1: Tracked skeleton joints of the user’s body [9]
Using the Microsoft Kinect SDK, three streams of information can be obtained: RGB, depth and skeleton data streams. The RGB data stream gives the color information of each pixel, while the depth data stream gives the distance between each pixel and the sensor. The skeleton data stream gives the positions of various skeleton joints of the users that are in the range of the sensor. The tracked skeleton joints of the user’s body are shown in Figure 2.1. The skeleton data stream is created by processing the depth stream data. For gesture detection, the authors used the skeleton data stream. As the pixel colors aren’t required, the RGB data stream wasn’t used.
The characteristics of the swipe left gesture that the authors observed are:
The x-axis coordinate values are decreasing as the gesture is executed;
The y-axis coordinate values have nearly equal values as the gesture is executed;
The length of the line formed as a sum of the lengths between the points of the gesture has to exceed some previously defined value;
The time spent between the first and the last tracked point of the gesture has to be in the previously defined allowed range.
The characteristics of the swipe right gesture are the same, except for the first one, where the x-axis coordinate values are increasing (not decreasing) as the gesture is executed.
The parameters that the authors introduced are:
Xmax: maximal threshold value of the x-axis between two consecutive hand joint data, expressed in meters, for a recognized gesture
Ymax: maximal threshold value of the y-axis between hand joint data, expressed in meters, for a recognized gesture
Lmin: minimal length of the recognized swipe gesture, expressed in meters
Tmin: minimal duration of the recognized swipe gesture expressed in
hand joint data is added to the queues
The last two skeleton joint data entries are checked against the parameters Xmax and Ymax for both gestures. If these parameters aren’t satisfied, the data in the corresponding queue is erased. If they are satisfied, the other three parameters, Lmin, Tmin and Tmax, are additionally checked. If they are also satisfied, a swipe gesture is detected.
When a gesture is identified, the corresponding keyboard button press is simulated. The left swipe gesture is mapped to the left arrow key, and the right swipe gesture to the right arrow key.
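The queue-based check described above can be sketched as follows. This is an illustrative reading of the description, not the code from [9]: the class name `SwipeDetector`, the `(x, y, t)` sample format, the use of seconds for the time stamps, and the interpretation of Ymax as a limit on the y change between consecutive samples are assumptions, and the threshold values in the usage note are arbitrary.

```python
import math
from collections import deque

class SwipeDetector:
    """Detect a swipe from hand joint samples (x, y in meters, t assumed in seconds)."""

    def __init__(self, direction, x_max, y_max, l_min, t_min, t_max):
        self.direction = direction  # -1 for swipe left, +1 for swipe right
        self.x_max, self.y_max = x_max, y_max
        self.l_min, self.t_min, self.t_max = l_min, t_min, t_max
        self.queue = deque()

    def add(self, x, y, t):
        """Add one hand joint sample; return True when a swipe is detected."""
        if self.queue:
            px, py, _ = self.queue[-1]
            step = (x - px) * self.direction
            # Check the last two entries against Xmax and Ymax;
            # erase the queue if the movement does not fit the gesture.
            if not (0 < step <= self.x_max and abs(y - py) <= self.y_max):
                self.queue.clear()
        self.queue.append((x, y, t))
        pts = list(self.queue)
        length = sum(math.hypot(b[0] - a[0], b[1] - a[1])
                     for a, b in zip(pts, pts[1:]))
        duration = pts[-1][2] - pts[0][2]
        if length >= self.l_min and self.t_min <= duration <= self.t_max:
            self.queue.clear()
            return True
        return False
```

One detector per direction would be fed with every new hand joint sample, for example `SwipeDetector(-1, x_max=0.1, y_max=0.05, l_min=0.28, t_min=0.2, t_max=1.0)` for the left swipe; a `True` result is where the arrow key press would be simulated.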
This method detects the location of the hand effectively and robustly only when the prediction of the person’s skeletal structure is good. The accuracy of this approach depends heavily on the human body posture. Hence, our project uses a hand detection method that does not rely on human skeletal information, in order to improve the performance of the recognition system.
This chapter describes our proposed hand gesture recognition system for presentation control. We define an abstract layered view of the intelligent presentation system, as illustrated in Figure 3.1.
Figure 3.1: Abstract layered view of proposed system
Each layer of the proposed system represents an integral element. The bottom layer is the Kinect hardware device, which captures the scene with depth data. The Kinect depth sensing unit consists of an infrared emitter and an infrared depth sensor. The infrared emitter looks like a camera from the outside, but it is an infrared projector that emits infrared light in a “pseudo-random dot” pattern over everything in front of it. These dots are normally invisible to us, but their depth information can be captured using the infrared depth sensor. The dotted light reflects off different objects; the depth sensor reads the reflected dots and converts them into depth information by measuring the distance between the sensor and the object from which each infrared dot was read. The middle layer includes three major modules that process the depth image sequences from the Kinect sensor and automatically recognize the hand gestures. The top layer represents the module that implements natural interaction for presentation control.
3.1 Image sequence preprocess
3.1.1 Motion extraction from depth images
After receiving depth image data from the Kinect sensor, the image sequence preprocess module extracts a motion image, which is later used to detect the hand point position. The Kinect sensor captures approximately 30 depth frames per second. However, our method uses only 5 consecutive frames at a time to create a motion image. The process of generating the motion image is shown in Figure 3.2. First, the difference image is obtained by subtracting the previous frame ($i_{t-1}$) from the current frame ($i_t$).
Then we apply a threshold to each difference image to generate a binary difference image. Finally, the accumulation of these binary difference images is the motion image. The accumulated image represents all movement of the human body, hands and objects, as well as noise.
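The three steps above (frame differencing, thresholding, accumulation) can be sketched in a few lines. This is a minimal illustration on plain 2D lists: the function name `motion_image`, the use of the absolute difference, and accumulation by logical OR are assumptions, since the text does not fix these details.

```python
def motion_image(frames, threshold):
    """Accumulate thresholded differences of consecutive depth frames.

    frames is a list of 2D lists of depth values (for example 5 frames);
    the result is a binary 2D list where 1 marks a pixel that moved.
    """
    h, w = len(frames[0]), len(frames[0][0])
    motion = [[0] * w for _ in range(h)]
    for prev, cur in zip(frames, frames[1:]):
        for r in range(h):
            for c in range(w):
                # Binary difference image: depth changed by more than threshold.
                if abs(cur[r][c] - prev[r][c]) > threshold:
                    motion[r][c] = 1  # accumulate with logical OR
    return motion
```

Small depth fluctuations below the threshold are ignored, while any pixel that changed strongly in at least one of the frame pairs is kept in the motion image.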
Figure 3.2: The process of generating the motion image
3.1.2 Noise reduction
Before detecting the hand point position from the motion image, we first need to remove noise from it in order to increase the accuracy of the method. Spatial filtering and morphological processing are used for noise reduction. We use a median filter with a 5x5 aperture for spatial filtering. The median filter replaces each pixel value with the median value of the sub-image covered by the aperture [10]. The median filter is good at removing salt-and-pepper noise, so it is very effective in this case, because the noise pattern of the motion image is very similar to salt-and-pepper noise.
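As an illustration of the spatial filtering step, the sketch below applies a median filter with a configurable aperture to a 2D list. The function name `median_filter` and the border handling (windows are simply truncated at the image edge) are assumptions; library implementations may treat borders differently.

```python
def median_filter(image, aperture=5):
    """Replace each pixel by the median of its aperture x aperture neighbourhood."""
    h, w = len(image), len(image[0])
    k = aperture // 2
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Assumption: the window is truncated at the image border.
            window = sorted(image[rr][cc]
                            for rr in range(max(0, r - k), min(h, r + k + 1))
                            for cc in range(max(0, c - k), min(w, c + k + 1)))
            out[r][c] = window[len(window) // 2]
    return out
```

An isolated bright pixel (salt) or an isolated hole inside a moving region (pepper) is outvoted by its neighbours, which is why this filter suits the noise pattern of the motion image.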
The morphological processing consists of three operations: opening, erosion and dilation [10]. The opening operation is an erosion followed by a dilation: it first shrinks the outer shape of an object and then expands it again. Generally, this operation smooths the outline, splits narrow regions and removes thin surrounding areas. Thus, it removes noise and makes the original image smooth. The erosion and dilation operations are opposites: erosion removes irrelevant pixels and eliminates small noise components from the image, while dilation returns the eroded objects to their original size and can make objects in the image bigger than before. These operations are highly effective for depth image noise reduction. Figure 3.3 illustrates the opening, erosion and dilation operators in a simple way. Figure 3.3.a presents the opening of the dark-blue square by a disk, resulting in the light-blue square with round corners. Figure 3.3.b presents the erosion of the dark-blue square by a disk, resulting in the light-blue square. Figure 3.3.c presents the dilation of the dark-blue square by a disk, resulting in the light-blue square with rounded corners.
Figure 3.3: (a) The opening operation, (b) The erosion operation, (c) The dilation operation
The original motion image and the result of applying spatial filtering and morphological processing to it are shown in Figure 3.4.a and Figure 3.4.b, respectively.
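The morphological operations can be sketched on a binary 2D list as shown below. This is an illustrative version only: it uses a square structuring element of side 2k+1 rather than the disk shown in Figure 3.3, treats pixels outside the image as background, and the function names are hypothetical.

```python
def _window(img, r, c, k):
    """Values in the (2k+1) x (2k+1) neighbourhood; outside pixels count as 0."""
    h, w = len(img), len(img[0])
    return [img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
            for rr in range(r - k, r + k + 1)
            for cc in range(c - k, c + k + 1)]

def erode(img, k=1):
    """Keep a pixel only if its whole neighbourhood is foreground."""
    return [[1 if all(_window(img, r, c, k)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

def dilate(img, k=1):
    """Set a pixel if any pixel in its neighbourhood is foreground."""
    return [[1 if any(_window(img, r, c, k)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

def opening(img, k=1):
    """Erosion followed by dilation: removes small specks, keeps large shapes."""
    return dilate(erode(img, k), k)
```

Applied to a motion image, `opening` deletes components smaller than the structuring element while a large region such as the hand is restored to roughly its original extent.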
Figure 3.4: (a) The original motion image, (b) The reduced noise motion image
3.1.3 Initial hand detection
In this section, motion regions are clustered to detect the position of the hand region. First, the connected components are extracted from the motion image and clustered. These clusters can be either real motion or noise, and one of them is the hand. Noise clusters are usually small, so a cluster whose size is smaller than a threshold is identified as noise and removed.
To decide the size threshold, the polynomial regression method is applied [11]. First, hand size data are collected over a wide range of distances. Then, polynomial regression is used to fit a curve to the dataset. The fitted curve is used to choose the hand size threshold. Figure 3.5.a shows the result of motion clustering before applying the hand size threshold. Figure 3.5.b shows the result of motion clustering after applying the hand size threshold. The hand cluster is then searched for among the remaining clusters in the hand detection process.
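The clustering and size filtering can be sketched as a connected-component search followed by a size test. This is a minimal illustration: the function names, the 8-connectivity and the fixed `min_size` argument are assumptions; in the described system the threshold would come from the curve fitted to hand size versus distance.

```python
from collections import deque

def find_clusters(motion):
    """Group foreground pixels of a binary image into 8-connected clusters."""
    h, w = len(motion), len(motion[0])
    seen = [[False] * w for _ in range(h)]
    clusters = []
    for r in range(h):
        for c in range(w):
            if motion[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, cluster = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    cr, cc = queue.popleft()
                    cluster.append((cr, cc))
                    for nr in range(cr - 1, cr + 2):
                        for nc in range(cc - 1, cc + 2):
                            if (0 <= nr < h and 0 <= nc < w
                                    and motion[nr][nc] and not seen[nr][nc]):
                                seen[nr][nc] = True
                                queue.append((nr, nc))
                clusters.append(cluster)
    return clusters

def remove_small_clusters(motion, min_size):
    """Drop clusters with fewer than min_size pixels (treated as noise)."""
    return [cl for cl in find_clusters(motion) if len(cl) >= min_size]
```

The surviving clusters are the candidates among which the hand is then searched.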
Figure 3.5: Motion clustering with hand size: (a) Before applying the threshold of hand size (b) After applying the threshold of hand size
To find the hand, a hand wave condition is set, which consists of a side-to-side motion sequence. First, the directions of the cluster movements are detected using a motion template [12, 13]. The motion template is an effective method for tracking general movement and is especially useful for gesture recognition. To use the motion template, a segmented cluster is needed, shown as the white rectangle in Figure 3.6.a. This image is referred to as the motion history image. When the rectangle moves, a new cluster is calculated from the current motion image and stored in the motion history image. The white rectangle represents the new cluster, while the clusters of older motions become darker, as shown in Figure 3.6.b and 3.6.c. The darkest rectangle is the oldest motion. These successively changing rectangles represent the movement of the clusters. Figure 3.6.d shows the motion history image in depth space.
Figure 3.6: Motion history image and motion template procedure. Motion history at time (a) t, (b) t+1, (c) t+2, (d) Depth motion history image
From the motion history image, the gradient is taken to represent the direction. The gradient can be calculated with the Sobel gradient function [13]. In some situations, gradients from the motion history image are invalid, because non-movement regions have zero gradients and the outer edges of a cluster have large gradients. Once the time between frames is defined, the valid range of gradients can be calculated and the invalid gradients removed. Finally, the global gradient is assigned as the direction. Figure 3.7 shows the direction of the clusters. The line in the circle shows the direction in which the clusters are moving.
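The motion history image and its global direction can be sketched as below. This is a simplified illustration rather than the motion template implementation of [12, 13]: the function names are hypothetical, time stamps are assumed to be positive so that 0 can mean "no motion", and invalid gradients are rejected with a simple rule (any 3x3 neighbourhood that touches a zero pixel is skipped) instead of an explicit gradient range.

```python
import math

def update_mhi(mhi, silhouette, timestamp, duration):
    """Stamp moving pixels with the current time and forget old motion."""
    for r, row in enumerate(silhouette):
        for c, moving in enumerate(row):
            if moving:
                mhi[r][c] = timestamp
            elif mhi[r][c] < timestamp - duration:
                mhi[r][c] = 0.0

def global_direction(mhi):
    """Angle (radians, image coordinates) of the summed Sobel gradients."""
    h, w = len(mhi), len(mhi[0])
    sum_x = sum_y = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            n = [[mhi[r + i][c + j] for j in (-1, 0, 1)] for i in (-1, 0, 1)]
            # Skip non-movement regions and outer cluster edges, whose
            # gradients are invalid (zero or artificially large).
            if any(v == 0 for row in n for v in row):
                continue
            gx = (n[0][2] + 2 * n[1][2] + n[2][2]) - (n[0][0] + 2 * n[1][0] + n[2][0])
            gy = (n[2][0] + 2 * n[2][1] + n[2][2]) - (n[0][0] + 2 * n[0][1] + n[0][2])
            sum_x += gx
            sum_y += gy
    if sum_x == 0 and sum_y == 0:
        return None  # no valid gradient, direction undefined
    return math.atan2(sum_y, sum_x)
```

Because newer motion carries larger time stamps, the gradient points from old motion towards recent motion, so the returned angle is the direction the cluster is moving in (0 for a move towards increasing column index, with the y axis pointing down).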