Crane Gesture Recognition Using Pseudo 3-D Hidden Markov Models

Stefan Müller, Stefan Eickeler, Gerhard Rigoll
Gerhard-Mercator-University Duisburg
Department of Computer Science, Faculty of Electrical Engineering
47057 Duisburg, Germany
e-mail: {stm,eickeler,rigoll}@fb9-ti.uni-duisburg.de

Abstract

This paper presents a recognition technique based on novel pseudo 3-D Hidden Markov Models, which can integrate spatially as well as temporally derived features. The approach can recognize dynamic gestures, such as waving hands, as well as static gestures, such as standing in a particular pose. Pseudo 3-D Hidden Markov Models (P3DHMMs) extend the pseudo 2-D case, which has been used successfully for image classification and face recognition. In a P3DHMM, the so-called superstates contain P2DHMMs, so these models can generate whole image sequences. The approach has been evaluated on a crane signal database, which consists of 12 different predefined gestures for maneuvering cranes.

1. Introduction

Many recent publications report on the use of Hidden Markov Models (HMMs) to recognize human actions in image sequences. Yamato et al. [1] were probably the first to address this problem. They use discrete HMMs, and therefore sequences of VQ labels, to recognize six classes of tennis strokes. Their approach applies several preprocessing steps to each image of a sequence, including low-pass filtering, background subtraction and binarization. The result is a two-level image in which the pose of the person is roughly extracted. Before the features are calculated, the binarized image is size-normalized and centered. The features themselves are the numbers of black pixels in the cells of a mesh, i.e., a subsampled image arranged as a feature vector.
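The mesh feature described for [1] can be illustrated with a minimal Python sketch. The sketch assumes that a binarized frame is stored as a NumPy array in which foreground (black) pixels are encoded as nonzero values. The mesh size and the nearest-neighbor quantizer are illustrative assumptions, not details taken from [1]. `mesh_features` and `vector_quantize` are hypothetical helper names, and codebook training (for example by k-means) is left out.

```python
import numpy as np

def mesh_features(binary_img: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Count foreground pixels (encoded as nonzero) in each cell of a rows x cols mesh."""
    h, w = binary_img.shape
    # Integer cell boundaries; linspace also handles sizes not divisible by rows/cols.
    r_edges = np.linspace(0, h, rows + 1).astype(int)
    c_edges = np.linspace(0, w, cols + 1).astype(int)
    counts = [
        np.count_nonzero(binary_img[r_edges[i]:r_edges[i + 1], c_edges[j]:c_edges[j + 1]])
        for i in range(rows)
        for j in range(cols)
    ]
    # Row-major order: the subsampled image flattened into one feature vector.
    return np.array(counts, dtype=float)

def vector_quantize(feature: np.ndarray, codebook: np.ndarray) -> int:
    """Return the VQ label, i.e. the index of the nearest codebook vector (Euclidean)."""
    return int(np.argmin(np.sum((codebook - feature) ** 2, axis=1)))
```

Applying this frame by frame turns an image sequence into the label sequence that a discrete HMM consumes. The list comprehension `[vector_quantize(mesh_features(f, 8, 8), codebook) for f in frames]` does this for a hypothetical 8 × 8 mesh.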
These features are vector quantized, so the image sequence becomes a sequence of VQ labels. A discrete HMM, the preferred modeling technique at that time, can then process this label sequence.

Schuster and Rigoll [2] also applied discrete HMMs to image sequence recognition. Their approach uses much simpler preprocessing, which makes real-time operation possible. The color images of a sequence are subsampled separately for each RGB plane, and horizontal or vertical stripes are fed directly into a vector quantizer. Alternatively, the same steps are applied to a difference image sequence. This real-time system has been evaluated on a ten-class database consisting of gestures such as nod-no, nod-yes, kotow and clapping.

The system above was later improved with continuous HMMs combined with geometric moments calculated on difference images. According to [3], the improved system classifies 24 gestures with a recognition accuracy of 90%. Starner et al. [4] also combine continuous HMMs with moments. Their system recognizes American Sign Language by extracting a person's hands from the images and performing a second-moment analysis on the extracted blobs. The feature vector contains components derived from the extracted hand shapes and also dynamic features, such as the change of position between frames.

Most of these systems rely heavily on motion or moving body parts, because they calculate features such as moments on difference images. To overcome this limitation, we propose pseudo 3-D HMMs. These models can integrate features derived from both temporal and spatial information, and they can also perform elastic matching on the individual images.
The earlier approaches do not perform elastic matching on the image itself. They either assign VQ labels to whole images ([1, 2]) or calculate global features ([3, 4]). Elastic matching should also make gesture recognition position-invariant.

The paper is organized as follows. Section 2 introduces pseudo 3-D HMMs and describes the feature extraction used in the experiments. Section 3 presents experimental results, and Section 4 gives a summary.

2. Pseudo 3-D HMMs for the Stochastic Modeling of Three-Dimensional Data

Hidden Markov Models are finite non-deterministic state machines. They have been applied successfully to continuous speech recognition [5] and online handwriting recognition [6]. An HMM has a fixed number of states. Each state $S_j$ has an associated output probability density function (pdf) $b_j(\mathbf{x}) = p(\mathbf{x} \mid q_t = S_j)$, and the states are connected by transition probabilities $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$. Here $q_t$ denotes the actual state at time $t$, $S_j$ is a distinct state and $\mathbf{x}$ denotes a feature vector.

Large feature vectors with inhomogeneous components are often divided into statistically independent streams (see, e.g., [7]). For $S$ streams $\mathbf{x}_1, \dots, \mathbf{x}_S$ with given stream weights $\gamma_s$, the pdf of state $S_j$ can then be calculated as

$$b_j(\mathbf{x}) = \prod_{s=1}^{S} \left[ b_{js}(\mathbf{x}_s) \right]^{\gamma_s} \qquad (1)$$

For every stream $s$, the pdfs are usually given by finite Gaussian mixtures of the form

$$b_{js}(\mathbf{x}_s) = \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(\mathbf{x}_s; \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}) \qquad (2)$$

where $c_{jsm}$ is the mixture coefficient of the $m$-th mixture component in stream $s$, and $\mathcal{N}(\cdot; \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm})$ is a multivariate Gaussian density with mean vector $\boldsymbol{\mu}_{jsm}$ and covariance matrix $\boldsymbol{\Sigma}_{jsm}$. With streams, features derived from temporal and spatial data can be integrated into a single model, and the stream weights adjust the relative influence of the temporal and spatial features. An HMM $\lambda$ with $N$ states is fully described by three components:
- the $N \times N$ transition matrix $\mathbf{A} = [a_{ij}]$,
- the $N$-dimensional output pdf vector $\mathbf{B} = [b_j(\mathbf{x})]$,
- the initial state distribution vector $\boldsymbol{\pi}$, which consists of the probabilities $\pi_i = P(q_1 = S_i)$.
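The stream-weighted emission density of Eqs. (1) and (2) can be sketched in Python, computed in the log domain. In the log domain the exponent γ_s of each stream becomes a multiplicative weight on that stream's log-likelihood. For simplicity, the sketch assumes diagonal covariance matrices; the text itself does not restrict the covariances. The function names are hypothetical.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """log N(x; mean, diag(var)); diagonal covariance is an assumption of this sketch."""
    return -0.5 * (np.sum(np.log(2.0 * np.pi * var)) + np.sum((x - mean) ** 2 / var))

def log_stream_pdf(x_s, mix_weights, means, variances):
    """Eq. (2) in the log domain: log sum_m c_jsm * N(x_s; mu_jsm, Sigma_jsm)."""
    comps = np.array([
        np.log(c) + log_gaussian_diag(x_s, mu, var)
        for c, mu, var in zip(mix_weights, means, variances)
    ])
    # logaddexp.reduce sums the mixture terms without underflow.
    return float(np.logaddexp.reduce(comps))

def log_state_pdf(streams, stream_weights, stream_params):
    """Eq. (1) in the log domain: log b_j(x) = sum_s gamma_s * log b_js(x_s).

    streams        -- list of per-stream feature vectors x_s
    stream_weights -- list of stream weights gamma_s
    stream_params  -- per stream, a tuple (mix_weights, means, variances) of state j
    """
    return sum(
        g * log_stream_pdf(x_s, *params)
        for x_s, g, params in zip(streams, stream_weights, stream_params)
    )
```

Setting the weight of one stream to zero, for example the dynamic one, makes a state's score depend only on the remaining stream. This shows how the weights trade off the two information sources.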
After the model $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ has been trained with the Baum-Welch algorithm, feature sequences $\mathbf{X} = \mathbf{x}_1, \dots, \mathbf{x}_T$ can be scored according to

$$P(\mathbf{X} \mid \lambda) = \sum_{Q} \pi_{q_1} b_{q_1}(\mathbf{x}_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} \, b_{q_t}(\mathbf{x}_t) \qquad (3)$$

where the sum runs over all state sequences $Q = q_1, \dots, q_T$. This likelihood is usually estimated with the Viterbi algorithm, which approximates it using only the most likely state sequence: $P(\mathbf{X} \mid \lambda) \approx \max_{Q} P(\mathbf{X}, Q \mid \lambda)$. For recognition, an unknown pattern $\mathbf{X}$ is assigned to the class $\hat{p}$ that satisfies

$$\hat{p} = \arg\max_{p} P(\mathbf{X} \mid \lambda_p) \qquad (4)$$

where $\lambda_p$ is the model of class $p$. Rabiner [5] gives a very detailed explanation of the HMM framework.

HMMs can be applied successfully not only to time series problems but also to pattern recognition problems in which the pattern varies in space rather than in time. For this reason, HMMs have recently been applied to image recognition problems with promising results [8, 9]. Both publications use pseudo 2-D HMMs, also known as planar HMMs. A P2DHMM extends the one-dimensional HMM paradigm to model two-dimensional data. These models are called pseudo because the state alignment of each column is calculated independently of the neighboring columns. P2DHMMs are stochastic state machines whose states are arranged in two dimensions, as outlined in Fig. 1. The states in the horizontal direction are called superstates, and each superstate consists of a one-dimensional HMM in the vertical direction.

After features have been extracted, the P2DHMM shown in Fig. 1 can be trained from data with the segmental k-means algorithm. Once a model has been trained for each class, recognition calculates, for each class, the probability that the corresponding HMM generated the (unclassified) data. This can be done with the doubly embedded Viterbi algorithm proposed by Kuo and Agazzi [8]. Alternatively, Samaria shows in [10] that a P2DHMM can be transformed into an equivalent one-dimensional HMM by inserting special start-of-line states and features. Fig. 2 shows an augmented P2DHMM with start-of-line states (indicated by a cross). These states emit start-of-line features with high probability. With the structure in Fig. 2, the value of the start-of-line feature must differ from all possible ordinary feature values. These equivalent HMMs can be trained with the standard Baum-Welch algorithm, and recognition can use the standard Viterbi algorithm.

Extending the two-dimensional case naturally leads to the pseudo 3-D HMM shown in Fig. 3, in which each superstate consists of a P2DHMM. We implemented this structure by applying Samaria's technique twice, i.e., by additionally inserting special start-of-image states and features. As a result, the P3DHMM shown in Fig. 3 can be trained from data with standard HMM techniques.

The feature extraction used throughout this paper is based on the discrete cosine transform (DCT). A sampling window scans each image of a sequence from top to bottom and from left to right. The pixels $f(x, y)$ in the sampling window of size $K \times K$ are transformed with the DCT according to

$$C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{K-1} \sum_{y=0}^{K-1} f(x, y) \cos\left[\frac{(2x+1)u\pi}{2K}\right] \cos\left[\frac{(2y+1)v\pi}{2K}\right] \qquad (5)$$

with $\alpha(0) = \sqrt{1/K}$ and $\alpha(u) = \sqrt{2/K}$ for $u > 0$. A triangle-shaped mask extracts the first 15 coefficients $C(u, v)$, which are arranged in a vector. These DCT coefficients are calculated on the individual images of a sequence (static feature component) and on the difference images (dynamic feature component). Within the HMM framework, both components can be integrated as feature streams, and stream weights control the influence of each stream (see Eq. 1).

3. Experiments and Results

To evaluate the P3DHMM approach in detail, we performed experiments on a crane signal database consisting of 12 classes.
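The DCT feature extraction of Section 2 can be sketched in Python. The sketch computes block DCT features for one frame and then builds the static and dynamic (difference-image) streams for a sequence. Several parameters are illustrative assumptions rather than values given in the text:
- the window size (8 × 8);
- the step of 4 pixels;
- the reading of the triangle-shaped mask as the 15 coefficients with u + v ≤ 4;
- the use of signed frame differences;
- the ordering of windows column by column, so that each column can be aligned with one superstate.

All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: D[u, x] = alpha(u) * cos((2x + 1) u pi / (2n))."""
    x = np.arange(n)
    u = x[:, None]
    d = np.cos((2 * x[None, :] + 1) * u * np.pi / (2 * n))
    d[0, :] *= np.sqrt(1.0 / n)
    d[1:, :] *= np.sqrt(2.0 / n)
    return d

def triangle_indices(order=4):
    """(u, v) pairs with u + v <= order; order = 4 yields 15 low-frequency coefficients."""
    return [(u, s - u) for s in range(order + 1) for u in range(s + 1)]

def block_dct_features(img, n=8, step=4, order=4):
    """Scan windows column by column (left to right), each column top to bottom."""
    d = dct_matrix(n)
    idx = triangle_indices(order)
    h, w = img.shape
    columns = []
    for x0 in range(0, w - n + 1, step):
        col = []
        for y0 in range(0, h - n + 1, step):
            block = img[y0:y0 + n, x0:x0 + n].astype(float)
            c = d @ block @ d.T  # 2-D DCT; u indexes vertical, v horizontal frequency
            col.append([c[u, v] for u, v in idx])
        columns.append(col)
    return np.array(columns)  # shape: (num_columns, windows_per_column, 15)

def sequence_features(frames, **kw):
    """Static stream from each frame, dynamic stream from signed difference images."""
    static = [block_dct_features(f, **kw) for f in frames[1:]]
    dynamic = [
        block_dct_features(frames[t].astype(float) - frames[t - 1].astype(float), **kw)
        for t in range(1, len(frames))
    ]
    return static, dynamic
```

The static stream starts at the second frame so that both streams have one feature set per time step. At each time step, the two streams would supply x_1 and x_2 in Eq. (1).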
Crane signals are a well-defined set of gestures used to maneuver a crane around obstacles or in problematic environments (see also [11]). Fig. 4 shows the 12 classes: slew left (right), travel to (from) me, extend (retract) jib, jib up (down), hoist, lower, stop and emergency stop. The last two classes are examples of static gestures involving hardly any movement. Five individuals performed each of the 12 gestures several times. Two repetitions of each gesture formed the training set, and the remaining repetitions were used for testing. Fig. 5 shows the two classes jib up and jib down in the upper and lower row, respectively, taken from the stm set. Table 1 lists the recognition accuracies achieved in the experiments. For comparison, it also gives results on the crane signal task obtained with one-dimensional HMMs and geometric moments as described in [3]. In the experiments, the P3DHMMs were configured with four superstates, each containing a P2DHMM. The P3DHMM approach achieves a slightly higher recognition accuracy than the one-dimensional approach. However, there are two more important reasons for using P3DHMMs. First, static and dynamic gestures can now be mixed and handled within a single recognition paradigm. Second, the warping capabilities of the P3DHMM allow elastic matching on the individual images, which results in position- and size-invariant gesture recognition.

[Fig. 4: The 12 crane signal classes: slew left (right), travel to (from) me, extend (retract) jib, jib up (down), hoist, lower, stop, emergency stop.]

4. Summary

This paper has presented image sequence recognition based on novel pseudo three-dimensional Hidden Markov Models. The modeling technique integrates spatially and temporally derived features in an elegant way. It can also recognize static gestures that involve hardly any body movement.
On a 12-class crane signal task, the P3DHMMs achieved a slightly better recognition accuracy than an approach based on one-dimensional HMMs and geometric moments. The warping capabilities of the P3DHMMs make the proposed approach position-independent. However, this property has not yet been fully evaluated, and the present publication mainly demonstrates that the modeling approach is feasible.

Table 1: Recognition accuracies of one-dimensional HMMs with geometric moments [3] and of P3DHMMs on the crane signal task.

Set       1D HMM    P3DHMM
ste       100%      88.6%
stm       85.3%     91.2%
ank       100%      100%
bw        88.2%     94.1%
jmr       80.5%     80.5%
average   90.74%    90.88%

[Fig. 5: jib up (upper row) and jib down (lower row), stm set.]

References

[1] J. Yamato, J. Ohya, and K. Ishii, “Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model”, In Proc. IEEE Int. Conference on Computer Vision and Pattern Recognition, 1992, pp. 379–385.
[2] M. Schuster and G. Rigoll, “Fast Online Video Image Sequence Recognition with Statistical Methods”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Atlanta, 1996, pp. 3450–3453.
[3] G. Rigoll and A. Kosmala, “New Improved Feature Extraction Methods for Real-Time High Performance Image Sequence Recognition”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Munich, 1997, pp. 3373–3376.
[4] T. Starner, J. Weaver, and A. Pentland, “Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 20, No. 12, Dec. 1998, pp. 1371–1375.
[5] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. of the IEEE, Vol. 77, No. 2, Feb. 1989, pp. 257–285.
[6] K. S. Nathan, J. R. Bellegarda, D. Nahamoo, and E. J. Bellegarda, “On-line Handwriting Recognition Using Continuous Parameter Hidden Markov Models”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Minneapolis, 1993, Vol. 5, pp. 121–124.
[7] V. N. Gupta, M. Lenning, and P. Mermelstein, “Integration of Acoustic Information in a Large Vocabulary Word Recognizer”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Dallas, 1997, pp. 697–700.
[8] S. Kuo and O. Agazzi, “Keyword Spotting in Poorly Printed Documents Using Pseudo 2-D Hidden Markov Models”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, No. 8, 1994, pp. 842–848.
[9] S. Eickeler, S. Müller, and G. Rigoll, “High Quality Face Recognition in JPEG Compressed Images”, In Proc. IEEE Int. Conference on Image Processing, Kobe, 1999.
[10] F. S. Samaria, “Face Recognition Using Hidden Markov Models”, Ph.D. Thesis, Cambridge University, 1994.
[11] A. Parrish, “Mechanical Engineer's Reference Book”, Butterworth, London, 1980.
