ON MERGING HIDDEN MARKOV MODELS WITH DEFORMABLE TEMPLATES

Ram R. Rao and Russell M. Mersereau

School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332
rr@eedsp.gatech.edu

This work is supported by the U.S. Army Research Office, Contract DAAL03-92-G-0068.

ABSTRACT

Hidden Markov modeling has proven extremely useful for statistical analysis of speech signals. There are, however, inherent problems in two-dimensional extensions to HMM's, one of which is the exponential complexity associated with fully 2-D HMM's. In this paper, we propose a new 2-D HMM-like structure obtained by embedding states within regions of a deformable template structure. With this state-embedded deformable template (SEDT), each region of a deformable template has an underlying observation probability distribution. This structure allows for computation of $P[\text{image} \mid \text{template}]$. The template that maximizes this probability provides an optimal segmentation of the image. This segmentation capability will be demonstrated in facial analysis applications.

1. INTRODUCTION

Facial analysis is a difficult problem which has many potential applications. Robust facial analysis systems are an integral part of any model-based coding, facial recognition [1], or visual speech recognition system [2]. Many researchers are attempting to provide a standard framework for tackling these image analysis tasks. Two of the more interesting analysis approaches are deformable templates and hidden Markov modeling. Both of these approaches have advantages and shortcomings.

Deformable templates [3] have been used to model the eyes, lips, and face for applications such as visual speech recognition and face recognition. These templates have certain structural characteristics, such as associating the head with an ellipse, or the lips with four parabolas. They also have energy functions which are often the sum of an image-related energy term and an internal energy term. The image-related term is usually a function of the edge, peak, and valley fields derived from the image. The internal energy is often heuristically designed to keep template parameters within acceptable ranges. Minimization of the energy function yields the template which best matches the image. The main problem with deformable templates is that the energy functions are experimentally designed, and they do not statistically segment the image.

There is strong motivation for statistically modeling the pixel values which occur in an image. Since there is a difference between "skin" colors and background colors in head and shoulder images [4], one would like to model these distributions and use this information to segment the image.

Hidden Markov models [5] provide a strong statistical framework for analyzing one-dimensional random processes. The key concept behind HMM's is a set of states which have probabilistic output distributions. Two-dimensional HMM's are not as tractable. Fully two-dimensional HMM's have been shown to have exponential complexity [6]. One practical solution to this has been to use pseudo-2D HMM's [7]. Essentially, one-dimensional HMM's operate on the rows of the image, and these HMM's are nested in another HMM. Pseudo-2D HMM's, however, cannot incorporate any shape constraints since each row is analyzed independently.

2. STATE-EMBEDDED DEFORMABLE TEMPLATES

Since deformable templates provide a good framework for structurally analyzing an image, and HMM's provide a good framework for statistically analyzing an image, it makes sense to capitalize on the benefits of both. Our solution entails associating a state with each region of a deformable template. These states have observation probability density functions which reflect the probability of observing a particular pixel value while in the state.
For example, the head can be modeled by an ellipse with a foreground state and a background state (Figure 1a). This makes intuitive sense, since the face normally has different statistical characteristics than the background, especially when using color.

Figure 1: (a) SEDT used for facial extraction, $\lambda = (x_1, x_2, y_1, y_2)$. (b) SEDT used for lip tracking, $\lambda = (x_1, x_2, y_1, y_2, y_3)$.

Proceedings of the 1995 International Conference on Image Processing (ICIP '95), 0-8186-7310-9/95 $4.00 © 1995 IEEE

Our SEDT's are specified as follows:

• The variable $\lambda = (\lambda_1, \ldots, \lambda_K)$ parameterizes a deformable template structure. For example, if the template were a rectangle, $K = 4$, and $\lambda$ could be the $x$ and $y$ coordinates of the upper-left and lower-right corners of the rectangle.

• The template divides the image into $N$ regions $R_0, \ldots, R_{N-1}$. In the case of $N = 2$, we have an image divided into foreground and background regions. Each region has an associated observation probability density function, $b_i(O)$, where $O$ is a (possibly multidimensional) pixel value. $b_i(O)$ can be any parameterized pdf such as a Gaussian or Gaussian mixture.

Treating the pixel values as conditionally independent given the template, it follows that:

$$P[I \mid \lambda] = \prod_{i=0}^{N-1} \prod_{(x,y) \in R_i} b_i(I(x,y)) \qquad (1)$$

where $I$ is the image, and $I(x,y)$ is the (possibly multidimensional) pixel value at location $(x,y)$.

Figure 2: (a) Original image with the initial and final position of the template (foreground: $\mu = 200$, $\sigma^2 = 100$; background: $\mu = 200$, $\sigma^2 = 10$); (b) points for which $P[\text{pixel} \in \text{foreground}] > P[\text{pixel} \in \text{background}]$.

Maximizing $P[I \mid \lambda]$ over $\lambda$ yields the optimal template. Equivalently, we can minimize $-\log P[I \mid \lambda]$. Looking at SEDT's from a deformable template perspective, we can think of $-\log P[I \mid \lambda]$ as our energy function. Alternatively, looking at our solution from an HMM perspective, we can think of the optimal template as being analogous to the optimal state sequence partitioning.
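As a minimal sketch of how the energy $-\log P[I \mid \lambda]$ of Equation (1) might be evaluated, the Python below handles the two-region case ($N = 2$) with a rectangular template and one scalar Gaussian pdf per region. The function names, the parameter ordering `(x1, y1, x2, y2)`, and the list-of-rows grayscale image format are assumptions made for illustration; they do not describe the system used in the experiments.

```python
import math

def gaussian_log_pdf(value, mean, var):
    """Log of a scalar Gaussian density evaluated at `value`."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)

def in_rectangle(x, y, lam):
    """True if pixel (x, y) lies inside the rectangle lam = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = lam
    return x1 <= x <= x2 and y1 <= y <= y2

def energy(image, lam, fg, bg):
    """Return -log P[I | lam] for a two-region rectangular template.

    `image` is a list of rows of grayscale values (an assumed format);
    `fg` and `bg` are (mean, variance) pairs for the foreground and
    background states.
    """
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            # Each pixel is scored by the pdf of the region it falls in.
            mean, var = fg if in_rectangle(x, y, lam) else bg
            total -= gaussian_log_pdf(value, mean, var)
    return total
```

Because the energy is a sum of per-pixel terms, moving one edge of the rectangle changes only the terms of the pixels that switch regions, which is what makes a local search over $\lambda$ practical.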
The analog of Viterbi training would be to partition the data using the optimal templates, reestimate the output probability distribution functions given the partitioned data, and repeat until convergence.

3. SYNTHETIC EXAMPLE

The first test of our template was to find an arbitrarily sized rectangle within an image. The rectangle had pixels with intensity specified by a Gaussian with mean and variance $\mu_f$ and $\sigma_f^2$, respectively. Likewise, the background had intensity specified by a Gaussian with $\mu_b$ and $\sigma_b^2$. Our template was a rectangle specified by the coordinates of its upper-left and lower-right corners.

Starting with an initial template, estimates of the foreground and background pdf's were made. A steepest descent minimization algorithm was then used to minimize $-\log P[I \mid \lambda]$ over $\lambda$. This new template was then used to reestimate the foreground and background pdf's, and the process was repeated until convergence.

It was seen that this process is sensitive to the initial placement of the template. Good results were obtained when the initial template completely covered the unknown rectangle, or when it was contained within the unknown rectangle. These template choices work well because either the foreground pdf or the background pdf is reliably estimated initially. Since we did not know the position of the rectangle, our system was always started with a rectangle that covered a majority of the input image (Figure 2).

There is a problem with this procedure. Consider the case where the foreground has a lower variance than the background, and they both have equal means. A large initial template would likely contain pixels from both the foreground and background. Thus, the estimate of the variance of the foreground pdf would approach the variance of the background pdf. When there is a large overlap between the two pdf's, the system will not work well. This can be remedied by altering the reestimation procedure to ensure that there is adequate separation between the two pdf's.

Figure 3: Initialization procedure. (a) Region used to estimate facial distribution for "Chris"; (b) result of applying this distribution to "Haluk" and applying threshold; (c) probability of pixel being part of face for "Haluk" using distribution derived from (a) (dark region = high probability); (d) "Haluk" image, with initial template position.

4. FACIAL EXTRACTION

One of our main objectives was to find a robust procedure for extracting the boundary of a person's head in a full-color head and shoulders video sequence. The head was modeled as an ellipse with no rotation, and the foreground and background pdf's were modeled as Gaussian mixtures. Each mixture contained two Gaussians with full covariance matrices.

Figure 4: Facial extraction. (a) Original image with initial and final placement of template; (b) pixels for which $P_{\text{head}} > P_{\text{background}}$; (c) and (d) probability of head and background, respectively (dark = high probability).

In the development of our system, a number of facts became clear. First, if the foreground and background pdf's are available, minimizing the energy function, $-\log P[I \mid \lambda]$, would successfully segment the face from the background. However, since these distributions are unknown in the initial frame, they must somehow be estimated. Second, if a point on the person's face could be located, a region around this point could be used to estimate the foreground pdf. Assuming everything outside this region was background, we could also estimate a background pdf. The facial border could then be found by iterating between minimizing the energy function and reestimating foreground and background distributions.
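The alternation between reestimation and minimization can be sketched as follows, here for the simpler rectangular template and scalar Gaussian pdf's of the synthetic example rather than the ellipse and Gaussian mixtures used for faces. This is an illustrative sketch only: a greedy one-pixel coordinate search stands in for the steepest descent minimization, and the variance floor `min_var`, the fallback for an empty region, the parameter ordering `(x1, y1, x2, y2)`, and all function names are hypothetical choices.

```python
import math

def log_pdf(value, mean, var):
    """Log of a scalar Gaussian density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)

def inside(x, y, lam):
    """True if pixel (x, y) lies in the rectangle lam = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = lam
    return x1 <= x <= x2 and y1 <= y <= y2

def energy(image, lam, fg, bg):
    """-log P[I | lam], with (mean, variance) pairs fg and bg."""
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            mean, var = fg if inside(x, y, lam) else bg
            total -= log_pdf(value, mean, var)
    return total

def mean_var(values, min_var):
    """ML mean and variance; the floor and empty-region fallback are assumptions."""
    if not values:
        return 0.0, min_var
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, min_var)

def reestimate(image, lam, min_var=1.0):
    """Partition the pixels with the current template and refit both pdf's."""
    fg_vals, bg_vals = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            (fg_vals if inside(x, y, lam) else bg_vals).append(value)
    return mean_var(fg_vals, min_var), mean_var(bg_vals, min_var)

def minimize(image, lam, fg, bg):
    """Greedy coordinate search over lam (a stand-in for steepest descent)."""
    height, width = len(image), len(image[0])
    best, best_e = tuple(lam), energy(image, lam, fg, bg)
    improved = True
    while improved:
        improved = False
        for k in range(4):
            for step in (-1, 1):
                cand = list(best)
                cand[k] += step
                x1, y1, x2, y2 = cand
                if not (0 <= x1 <= x2 < width and 0 <= y1 <= y2 < height):
                    continue  # skip inverted or out-of-image rectangles
                e = energy(image, cand, fg, bg)
                if e < best_e:
                    best, best_e, improved = tuple(cand), e, True
    return best

def fit(image, lam, max_iters=20):
    """Alternate reestimation and minimization until the template stops moving."""
    lam = tuple(lam)
    for _ in range(max_iters):
        fg, bg = reestimate(image, lam)
        new_lam = minimize(image, lam, fg, bg)
        if new_lam == lam:
            break
        lam = new_lam
    return lam
```

Calling `fit(image, initial_lam)` with an initial rectangle that covers the unknown region mirrors the initialization that worked well in Section 3. The variance floor is one simple way to keep a noise-free region from producing a degenerate pdf; it is not the separation constraint on the two pdf's suggested in the text.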
One important task was to develop a subsystem which could locate a point on a person's face. This could be done by first developing a general "face" pdf. Ideally, one would like to collect a large database of faces under varying lighting conditions to estimate a general "face" pdf, but we did not have such a database. We chose instead to use the facial distribution of one person as an approximation of the facial distribution for a different person. A point in the face was found by applying this pdf to the input image. A threshold was applied to the resulting probability image to find all points which had probability within a certain range of the pixel with maximum probability. The point given by the median $x$ and $y$ values of these pixels would be located in the person's face. The median operation works much better than averaging, and also works better than attempting to find an $n$ by $n$ square of pixels whose joint probability is greatest. It also seems to implicitly use the fact that, for the most part, the face of interest is near the center of the image. This procedure is shown in Figure 3.

Figure 4 shows the convergence of the template to the final head border. Image-specific distributions for the foreground and background are estimated using the initial template. A steepest descent minimization algorithm is then used to minimize $-\log P[I \mid \lambda]$. This process is repeated until convergence. Comparing Figure 4(c) and Figure 3(c) shows the difference between using a general facial distribution and one matched to the actual image. Notice how the facial region is much darker in Figure 4, indicating a higher probability.

Figure 5: Results of lip tracking algorithm (top); pixels for which $P_{\text{lip}} > P_{\text{background}}$ (bottom).

5. LIP TRACKING

Another goal of our research is to develop a robust lip analysis system.
As a first step, we wanted to test the ability of SEDT's to track the border of the lips through a video sequence. Our template is shown in Figure 1(b). The template has two parabolas which are embedded in a rectangle. There are a total of five parameters: four for the rectangle, and one to specify the vertical position of the intersection of the parabolas. Our test consisted of manually placing the template in frame 1 of the video sequence and estimating the foreground and background distributions. These distributions were applied to successive frames, and a minimization algorithm was run to find the optimal template. As shown in Figure 5, the results are very promising. Likewise, the inner contour of the lips can be tracked by estimating the distribution of the mouth opening and considering the lips themselves to be background.

6. CONCLUSION

In this paper, we have presented an extension to deformable templates which allows for statistical segmentation of images. The system performed well on many foreground/background segmentation tasks, including facial extraction and lip tracking. Our method capitalizes on the statistical segmentation properties of HMM's and incorporates the shape coherence properties of deformable templates. Work remains in finding automatic methods for initializing the templates, particularly for the lip tracking algorithm. It is also necessary to assess which color spaces and parameter sets work best and which ones are most invariant to varying lighting conditions and differing speakers.

7. REFERENCES

[1] R. Chellappa, C. Wilson, and S. Sirohey, "Human and machine recognition of faces: A survey," Proceedings of the IEEE, vol. 83, pp. 705-740, May 1995.

[2] M. Hennecke, K. Prasad, and D. Stork, "Using deformable templates to infer visual speech dynamics," in Proceedings of the 28th Annual Asilomar Conference on Signals, Systems, and Computers, (Pacific Grove, CA), November 1994.

[3] A. Yuille, P. Hallinan, and D. Cohen, "Feature extraction from faces using deformable templates," International Journal of Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.

[4] H. M. Hunke, "Locating and tracking of human faces with neural networks," Tech. Rep. CMU-CS-94-155, Carnegie Mellon University, August 1994.

[5] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[6] E. Levin and R. Pieraccini, "Dynamic planar warping for optical character recognition," in Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. III-149 - III-152, 1992.

[7] O. Agazzi and S. Kuo, "Hidden Markov model based optical character recognition in the presence of deterministic transformations," Pattern Recognition, vol. 26, no. 12, pp. 1813-26, 1993.