Automated Posture Segmentation in Continuous Finger Spelling Recognition

Nhat Thanh Nguyen, The Duy Bui
Human Machine Interaction Laboratory, College of Technology, Vietnam National University, Hanoi

Abstract: Recognizing continuous finger spelling plays an important role in understanding sign language. Recognition has two major phases: posture segmentation and posture recognition. In the former, a continuous gesture sequence is decomposed into segments, which the latter then uses to identify the corresponding characters. Besides valid postures corresponding to characters, the segments include many movement epentheses, which appear between pairs of postures to move the hand from the end of one posture to the beginning of the next. In this paper, we propose a framework that splits a continuous movement sequence into segments and identifies valid postures and movement epentheses. Using a filter based on velocity and signing rate, we obtain both high recall and high precision.

Keywords: finger spelling recognition, posture segmentation, velocity filter, signing rate filter, maximal matching

I. INTRODUCTION

Sign language, a non-verbal language, is a primary means of communication in the deaf community. Unlike speech, sign language uses finger spelling and gestures to convey information. Automatic sign language recognition and interpretation aim to understand human signs and translate them into text or speech, which might help overcome the difficulties in communication between deaf people and the rest of the world. These systems are usually developed with one of two approaches: vision based or device based. Correspondingly, the time-serial input data comes in two different formats. The vision-based approach uses video
cameras to capture the gestures of users, while the device-based approach depends on sensing gloves to obtain hand parameters such as joint angles and hand position.

Sign language is presented as a sequence of gestures, some of which carry information while others are movement epentheses. Movement epentheses are movements added between two consecutive valid signs to move the hand from the end of one sign to the beginning of the next. The question that arises is how to identify and locate valid gestures in the time-serial data. Segmentation is one solution to this problem, and it is considered a critical phase that determines the quality of the later processing in sign language recognition and interpretation systems.

The way to differentiate between meaningful signs and movement epentheses depends on whether the signing consists of gestures or finger spelling. In gestures, a valid segment is one where the hand posture (hand shape, hand position, and hand orientation), together with the movement trajectory of one or both hands, forms a meaningful word or phrase. A movement epenthesis is where the hands move from the end point of one sign to the starting point of the next. Many studies use hand velocity as the cue for gesture segmentation. Tanibata et al. [10] separated valid signs from motion epentheses under the assumption that valid signs have small velocities while motion epentheses have large velocities. In addition, large changes in hand motion were treated as cues for borders. Sagawa and Takeuchi [6] proposed a similar approach. They also considered the noise caused by unstable gestures, which they excluded by comparing the sum of the maximum velocities of two adjacent candidates to predefined thresholds. Meaningful gestures and transitions are separated by acceleration, with meaningful gestures having the minimum acceleration. This method was applied to 100 words of Japanese Sign Language and achieved 80.2% accuracy. Another approach [14] uses time-varying parameters (TVPs) as cues, detecting correct postures where the number of TVPs drops below a threshold. Gaolin Fang et al. [4] proposed a more effective method. A simple recurrent network (SRN) was used to classify gestures into three output units: the left boundary, the right boundary, and the interior of a segment. Using the SRN alone, the segmentation accuracy was 87%. Hence, self-organizing feature maps (SOFM) were added as a feature extraction network providing inputs to the SRN. The network determines the left and right boundaries, which are used as constraints in the segmentation. With this method, segmentation recall reaches 98.8%.

978-1-4244-7570-4/10/$26.00 ©2010 IEEE

Besides gestures, finger spelling plays an important role in sign language. In finger spelling, finger postures corresponding to letters of the alphabet are presented sequentially and follow the spelling rules to form words. Segmentation for finger spelling concentrates on marking the points where valid postures occur. Some approaches require unnatural performance to mark the valid segments, such as inserting marks [2] or holding a posture for about one minute before it is recognized [16]. Harling et al. [12] calculated hand tension and used it as a cue to detect valid segments, since transitions have less tension than valid postures. The transition from one intentional posture to the next goes through a relaxed hand state, which is used to detect letter borders. However, this method has only been tested on a small scale and is suitable only for device-based systems. Wu et al. [8] used the differences between frames to separate moving segments from steady ones. H. Birk et al. [5] used two motion cues to solve the temporal segmentation. In addition, they considered the case where the same letter is repeated, e.g. the letter l in hello; the third cue is based on the observation that there is a small amount of motion between the repeated letters. The three cues are combined with an AND
operation to obtain the final decision. R. Erenshteyn et al. [13] recognized letters in real time and then used two filters to segment dynamic signing. The first, a low-pass filter, relies on the difference between frames. Derivative analysis provides the foundation for the second filter, in which the end point of a letter is where the recognition results vary most and an additional minimum proximity heuristic is met. Recognition is performed at the midpoints of the segments. The segmentation accuracies of the two filters are 87.8% and 92.3%, respectively. Nevertheless, the first filter leaves many redundant segments, and the second deletes extra middle points.

We propose in this paper a framework that splits a continuous movement sequence in finger spelling into segments and identifies valid postures and movement epentheses. In our framework, a number of techniques are applied sequentially to identify valid segments in the time-serial data. First, hand velocity is calculated to find the stable candidates where velocities fall under a certain threshold. Then, we apply a filter based on the signing rate, characterized by the posturing duration, to remove redundant segments. A representative value of each valid segment is calculated as the input to a letter recognition system. After that, words are segmented from the sequence of recognized letters by maximal matching against predefined words in a Vietnamese dictionary. With our framework, a sentence presented by finger spelling can be segmented quickly and correctly. By combining segmentation techniques, our framework achieves both high precision and high recall. The signing rate filter based on posturing duration works well in eliminating superfluous segments. The problem of two adjacent letters referring to the same value (e.g. “hello”, “litter”), as mentioned in [5], is solved by this filter.

The rest of the paper is organized as follows. In Section II, we propose the segmentation framework and describe the related techniques in detail. Section III presents the experimental results and discussion.

II. SEGMENTATION FRAMEWORK

The segmentation framework focuses on separating letters and words from the time-serial data of a sentence presented by finger spelling. First, hand velocity is calculated to find the stable candidates where velocities fall under a predefined threshold. However, many redundant segments are found together with the valid ones because this technique is very sensitive to noise. Therefore, in the next step, we apply a filter based on the signing rate, characterized by the posturing duration, to remove superfluous candidates. The letter value and the average of the weighted sum of values in a selected segment are used for recognition. After that, words need to be recognized from the sequence of recognized letters. Word segmentation is not required if the signer is forced to place a special sign after each word; however, that is not a natural way of signing. To separate words automatically from a sequence of characters, we use the maximal matching approach with a Vietnamese dictionary. This approach is also used to correct misrecognized letters.

A. Letter segmentation by hand velocity

This technique is based on the nature of finger spelling, in which letters are signed sequentially following the spelling grammar. Each letter is presented by a posture described by hand shape and palm orientation. Postures have to be stable to be recognized. Therefore, postures correspond to segments with low hand velocity, while transitions have high hand velocity, as mentioned in [5], [6] and [10]. Based on the hand parameters, hand velocity is calculated and compared with a predefined threshold to find the candidates for the next step.

B. Letter segmentation by signing rate filter

Hand velocity techniques detect most valid segments. However, subtle changes of hand velocity in unstable postures or noise produce many superfluous candidates. Fortunately, a signer has to hold a posture long enough for people to recognize it. Therefore, to remove superfluous candidates, we propose a letter signing rate filter. Letter signing rate refers to the duration of signing a posture. At each segment candidate, we calculate the letter signing rate and compare it to experimentally chosen low and high thresholds. A valid segment is one whose posturing duration lies between the two thresholds (see Figure 1).

Figure 1. The posture is held for a suitable duration, between the signing rate low threshold and the signing rate high threshold.

C. Letter recognition

At each valid segment, we calculate the representative value of the segment, which is used for recognition. This value is calculated as the average of the sum of corresponding values in a selected number of segments. We applied the classification method described in [17]. Twenty-three letters (A, B, C, D, Đ, E, G, H, I, K, L, M, N, O, P, Q, R, S, T, U, V, X, and Y) of the Vietnamese Sign Language alphabet are recognized with high recognition accuracy. In this paper, we have not considered letters with diacritical signs (e.g. Â, Ă, Ô, Ơ, Ê, Ư) or tones (e.g. level, high rising, low (falling), dipping rising, high rising glottalized, and low glottalized) of the Vietnamese alphabet. Each diacritical sign is presented by an independent sign that follows a particular letter to form another letter. Each tone is formed by a sign combined with a motion. Therefore, segmentation for them is carried out after the recognition phase and needs additional techniques.

III. EXPERIMENT AND DISCUSSION

A. Data collection and pre-processing

We used the 5DT Data Glove [3] as the input device for our system. The data glove has 18 sensors corresponding to ten positions on the fingers (thumb near, thumb far, index near, index far, middle near, middle far, ring near, ring far, little near, little far), four positions between fingers
(thumb/index, index/middle, middle/ring, ring/little), and a position on the back of the hand. The sensors measure and return values for finger flexure, finger spread, and the pitch and roll of the hand. After calibration and normalization, the sensor values fall within a fixed range (see Figure 2). We performed the experiment on 594 samples of 23 letters of the Vietnamese alphabet. In the pre-processing phase, data received from the sensing glove is smoothed by a Gaussian filter (see Figure 3) based on the 1-D Gaussian distribution (Equation 1):

$$G(x) = \frac{1}{\delta\sqrt{2\pi}}\, e^{-\frac{x^{2}}{2\delta^{2}}} \qquad (1)$$

where δ is the standard deviation of the distribution.

Figure 2. Raw data.

Figure 3. Data smoothed with the Gaussian filter.

Figure 4. Valid segments of the word “CHUA”.

B. Segmentation with hand velocity and signing rate filter

With the data collected from the Data Glove, the hand velocity was calculated by Equation 2 at every frame:

$$v(t) = \frac{1}{N}\sum_{i=1}^{N} \left| P(i,t) - P(i,t-1) \right| \qquad (2)$$

where P(i, t) is the value of sensor i at frame t, and N is the number of sensors. Adjacent frames form a candidate if the velocities at these frames are lower than a threshold. Almost all segments containing postures are detected by this technique. However, the number of superfluous segments is rather large. The reason is that unstable postures (e.g. the third segment in Figure 5) or slowly moving parts of the hand (e.g. the fourth segment in Figure 5) create invalid segments (noise). We found that noise is often slighter and shorter-lived than a posture. Segments that are too long are also abnormal. In most cases, they result from incorrect velocity segmentation rather than from the signer, so they are usually redundant (see Figure 6). We calculate the posturing duration at each segment candidate. In this experiment, a segment is rejected if its posturing duration is lower than the empirically determined threshold of 150 ms or higher than the empirically determined threshold of 1500 ms.

In addition, Vietnamese Sign Language also faces the problem of two adjacent identical letters, similar to the two l's in hello in American Sign Language. As analyzed in [5], there is a small amount of motion between them. Therefore, the signing rate filter, which is based on posturing duration, can solve the problem (Figure 7).

Figure 5. Five segments are detected by hand velocity, and two invalid segments are eliminated by the signing rate filter.

Figure 6. The invalid segment that is too long is eliminated by the signing rate filter.

Figure 7. The problem of two adjacent letters referring to the same value (the word “CO OI” in this example) is solved by the signing rate filter. Legend: velocity segmentation; velocity and signing rate segmentation.

C. Results

Precision and recall are calculated by Equation 3 and Equation 4, respectively:

$$\text{Precision} = \frac{\text{NumberOfValidSegments}}{\text{NumberOfDetectedSegments}} \times 100\% \qquad (3)$$

$$\text{Recall} = \frac{\text{NumberOfValidSegments}}{\text{NumberOfActualSegments}} \times 100\% \qquad (4)$$

The hand velocity threshold was tested with five values: 0.02, 0.05, 0.10, 0.15, and 0.20. Thresholds lower than 0.02 or higher than 0.20 did not give good results, so we do not include them here. The results are shown in Table 1 and Table 2. The segmentation achieved the highest precision and recall with two velocity thresholds (0.05 and 0.10). In the experiment, if the lowest velocity threshold (0.02) is chosen, many valid segments are missed. On the other hand, the higher velocity thresholds (0.15 and 0.20) combine adjacent segments into one, which leads to low recall and precision.

Table 1. Velocity segmentation with different velocity thresholds

Velocity threshold    Precision (%)    Recall (%)
0.02                  57.08            83.50
0.05                  68.95            96.46
0.10                  65.81            94.95
0.15                  60.87            77.78
0.20                  57.14            60.60

Table 2. Velocity and signing rate segmentation with different velocity thresholds

Velocity threshold    Precision (%)    Recall (%)
0.02                  86.26            83.50
0.05                  93.78            96.46
0.10                  95.27            94.95
0.15                  87.50            77.78
0.20                  81.08            60.60

Using only velocity segmentation, we could detect most of the correct segments in the time-serial data (recall as high as 96.46%); however, precision was rather low (68.95% at a recall of 96.46%). Applying the signing rate filter afterward, we kept the recall and increased the precision considerably (95.27% at a recall of 94.95%, and 93.78% at a recall of 96.46%). This shows that the combination of techniques gives better segmentation results than velocity alone. The unchanged recall indicates that the chosen signing rate thresholds did not discard valid segments.

Figure 8. Precision rate of the two segmentation techniques (precision versus velocity threshold; series: hand velocity filter, hand velocity and signing rate filter).

Figure 9. Recall rate of the two segmentation techniques (recall versus velocity threshold).

IV. CONCLUSION

We have proposed in this paper a framework to separate signing postures from time-serial data. We applied a number of techniques sequentially to identify meaningful segments. Hand velocity is calculated first, and stable candidates with low velocity are selected. A filter based on the signing rate is then applied to remove superfluous segments. Finally, recognized letters are grouped together to form words based on a dictionary. With our framework, we obtained high recall and precision (94.95% and 95.27%, respectively) in separating valid segments from a continuous stream of data when testing with Vietnamese Sign Language.

In the future, we want to carry out experiments on our framework together with different recognition techniques to see the overall recognition rate. We also want to apply our framework to the vision-based approach.

REFERENCES

[1] Durell Bouchard (2006), “Automated Time Series Segmentation for Human Motion Analysis”, http://cg.cis.upenn.edu/hms/research/RIVET/AutomatedTimeSeriesSegmentation.pdf
[2] D. Rubine (1991), “Specifying Gestures by Example”, Computer Graphics, pp. 329-337.
[3] Fifth Dimension Technologies (2004), “5DT Data Glove Ultra Series, User’s Manual”, http://www.5DT.com
[4] Gaolin Fang, Wen Gao, Xilin Chen, Chunli Wang, and Jiyong Ma (2001), “Signer-independent Continuous Sign Language Recognition Based on SRN/HMM”, Lecture Notes in Computer Science, vol. 2298, pp. 76-85.
[5] H. Birk, T.B. Moeslund, and C.B. Madsen (1997), “Real-Time Recognition of Hand Alphabet Gestures Using Principal Component Analysis”, Proc. Scandinavian Conf. Image Analysis, pp. 261-268.
[6] H. Sagawa and M. Takeuchi (2000), “A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence”, Proc. Fourth IEEE International Conf. on Automatic Face and Gesture Recognition, pp. 434-439.
[7] J. Karamer and L. Leifer (1978), “The Talking Glove: An Expressive and Receptive Verbal Communication Aid for the Deaf, Deaf-Blind, and Nonvocal”, Proc. Third Ann. Conf. Computer Technology, Special Education, Rehabilitation, pp. 335-340.
[8] J. Wu and W. Gao (2001), “The Recognition of Finger-Spelling for Chinese Sign Language”, Proc. Gesture Workshop, pp. 96-100.
[9] N. Chaimanonart and D.J. Young (2006), “Remote RF powering system for wireless MEMS strain sensors”, IEEE Sensors Journal, vol. 6-2, pp. 484-489.
[10] N. Tanibata and N. Shimada (2002), “Extraction of Hand Features for Recognition of Sign Language Words”, Proc. Int’l Conf. Vision Interface, pp. 391-398.
[11] Peter Vamplew and Anthony Adams (1998), “Recognition of sign language gestures using neural networks”, Australian Journal of Intelligent Information Processing Systems, pp. 94-102.
[12] Philip A. Harling and Alistair D.N. Edwards (1996), “Hand tension as a gesture segmentation cue”, Proc. of Gesture Workshop on Progress in Gestural Interaction, pp. 75-88.
[13] R. Erenshteyn, P. Laskov, R. Foulds, L. Messing, and G. Stem (1996), “Recognition Approach to Gesture Language Understanding”, Proc. Int’l Conf. Pattern Recognition, vol. 3, pp. 431-435.
[14] Rung-Huei Liang and Ming Ouhyoung (1998), “A Real-time Continuous Gesture Recognition System for Sign Language”, Proc. Third IEEE International Conf. on Automatic Face and Gesture Recognition, pp. 558-567.
[15] Sylvie C.W. Ong and Surendra Ranganath (2005), “Automatic Sign Language Analysis: A Survey and the Future beyond Lexical Meaning”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 873-891.
[16] T. Takahashi and F. Kishino (1991), “Hand Gesture Coding Based on Experiments using a Hand Gesture Interface Device”, SIGCHI Bulletin, pp. 67-73.
[17] The Duy Bui and Thang Long Nguyen (2007), “Recognizing postures in Vietnamese Sign Language with MEMS accelerometers”, IEEE Sensors Journal, vol. 7-5, pp. 707-712.
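
As an illustration of the letter segmentation described in Section III.B, the following Python sketch computes the per-frame hand velocity, groups adjacent low-velocity frames into candidates, and then applies the signing rate filter with the 150 ms and 1500 ms thresholds reported in the experiment. It is a minimal sketch under stated assumptions and not the authors' implementation: the function names, the `frame_ms` parameter (the frame period is not given in the paper), the use of the absolute sensor difference in the velocity, and the convention that the first frame has velocity zero are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def hand_velocity(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute change of the N sensor values between consecutive frames.

    Assumption: the first frame has no predecessor, so its velocity is set to 0.
    """
    if not frames:
        return []
    velocities = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        n = len(cur)
        velocities.append(sum(abs(c - p) for c, p in zip(cur, prev)) / n)
    return velocities


def velocity_candidates(velocities: Sequence[float], threshold: float) -> List[Segment]:
    """Group adjacent frames whose velocity is below the threshold into candidates."""
    segments: List[Segment] = []
    start = None
    for t, v in enumerate(velocities):
        if v < threshold:
            if start is None:
                start = t  # a new low-velocity run begins here
        elif start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(velocities)))
    return segments


def signing_rate_filter(
    segments: Sequence[Segment],
    frame_ms: float,
    low_ms: float = 150.0,
    high_ms: float = 1500.0,
) -> List[Segment]:
    """Keep candidates whose posturing duration lies between the two thresholds."""
    kept: List[Segment] = []
    for start, end in segments:
        duration_ms = (end - start) * frame_ms
        if low_ms <= duration_ms <= high_ms:
            kept.append((start, end))
    return kept


if __name__ == "__main__":
    # Toy single-sensor sequence: a held posture, a transition with a brief
    # pause, and a second held posture. frame_ms=50 is an assumed frame period.
    frames = [[0.0]] * 5 + [[0.3], [0.6], [0.6], [0.9], [1.0]] + [[1.0]] * 4
    candidates = velocity_candidates(hand_velocity(frames), threshold=0.05)
    print(candidates)
    print(signing_rate_filter(candidates, frame_ms=50.0))
```

In this toy sequence the brief pause inside the transition becomes a candidate under the velocity criterion alone and is then dropped by the duration check, which mirrors the role the signing rate filter plays in raising precision without lowering recall.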
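
The word segmentation step of Section II relies on maximal matching against a Vietnamese dictionary. The sketch below shows one common form of that idea, greedy forward longest-match, on a string of recognized letters. It is only an assumed reading of the technique: the paper does not specify the matching direction or how unmatched letters are handled, the three-word dictionary is a hypothetical stand-in built from words that appear in the figures, and the correction of misrecognized letters mentioned in the paper is not covered.

```python
from typing import Iterable, List


def maximal_matching(letters: str, dictionary: Iterable[str]) -> List[str]:
    """Split a letter sequence into words by greedy forward longest-match.

    Assumption: a position where no dictionary word starts is emitted as a
    single-letter token so that the scan can continue.
    """
    words = set(dictionary)
    max_len = max((len(w) for w in words), default=1)
    result: List[str] = []
    i = 0
    while i < len(letters):
        # Try the longest candidate first and shrink until a word matches.
        for length in range(min(max_len, len(letters) - i), 0, -1):
            candidate = letters[i:i + length]
            if candidate in words:
                break
        else:
            # No dictionary word starts at position i.
            candidate = letters[i]
        result.append(candidate)
        i += len(candidate)
    return result


if __name__ == "__main__":
    toy_dictionary = {"CO", "OI", "CHUA"}  # hypothetical mini dictionary
    print(maximal_matching("COOI", toy_dictionary))
    print(maximal_matching("CHUACO", toy_dictionary))
```

Greedy longest-match is simple and fast, but it can choose a wrong split when a long word is a prefix of a sequence that should be divided differently, so a real system would likely need a fallback or a scoring step for ambiguous cases.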