Realistic Face Animation for Speech
Gregor A. Kalberer
Computer Vision Group, ETH Zürich, Switzerland
kalberer@vision.ee.ethz.ch
Luc Van Gool
Computer Vision Group, ETH Zürich, Switzerland
ESAT / VISICS, Kath. Univ. Leuven, Belgium
vangool@vision.ee.ethz.ch
Keywords:
face animation, speech, visemes, eigen space, realism
Abstract
Realistic face animation is especially hard as we are all experts in the perception and interpretation
of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on
first observing the visible 3D dynamics, extracting the basic modes, and putting these together according
to the required performance. This is the strategy followed by the paper, which focuses on speech. The
approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking
face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals
of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal
component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The
result is two-fold. On the one hand, the face can be animated, in our case it can be made to speak new
sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance
capture.
Introduction
Realistic face animation is a hard problem. Humans will typically focus on faces and are incredibly
good at spotting the slightest glitch in the animation. On the other hand, there is probably no shape more
important for animation than the human face. Several applications come immediately to mind, such as
games, special effects for the movies, avatars, virtual assistants for information kiosks, etc. This paper
focuses on the realistic animation of the mouth area for speech.
Face animation research dates back to the early 1970s. Since then, the level of sophistication has
increased dramatically. For example, the human face models used in Pixar's Toy Story had several
thousand control points each [1]. Methods can be distinguished mainly by two criteria. On the one hand,
there are image and 3D model based methods. The method proposed here uses 3D face models. On the
other hand, the synthesis can be based on facial anatomy, i.e. both interior and exterior structures of a
face can be brought to bear, or the synthesis can be purely based on the exterior shape. The proposed
method only uses exterior shape. By now, several papers have appeared for each of these strands. A
complete discussion is not possible, so the sequel rather focuses on a number of contributions that are
particularly relevant for the method presented here.
So far, for reaching photorealism one of the most effective approaches has been the use of 2D mor-
phing between photographic images [2, 3, 4]. These techniques typically require animators to specify
carefully chosen feature correspondences between frames. Bregler et al. [5] used morphing of mouth
regions to lip-synch existing video to a novel sound-track. This Video Rewrite approach works largely
automatically and directly from speech. The principle is the re-ordering of existing video frames. It is
of particular interest here as the focus is on detailed lip motions, incl. co-articulation effects between
phonemes. But still, a problem with such 2D image morphing or re-ordering techniques is that they
do not allow much freedom in the choice of face orientation or compositing the image with other 3D
objects, two requirements of many animation applications.
In order to achieve such freedom, 3D techniques seem the most direct route. Chen et al. [6] applied
3D morphing between cylindrical laser scans of human heads. The animator must manually indicate a
number of correspondences on every scan. Brand [7] generates full facial animations from expressive
information in an audio track, but the results are not photo-realistic yet. Very realistic expressions have
been achieved by Pighin et al. [8]. They present face animation for emotional expressions, based on
linear morphs between 3D models acquired for the different expressions. The 3D models are created
by matching a generic model to 3D points measured on an individual’s face using photogrammetric
techniques and interactively indicated correspondences. Though this approach is very convincing for
expressions, it would be harder to implement for speech, where higher levels of geometric detail are
required, certainly on the lips. Hai Tao et al. [9] applied 3D facial motion tracking based on a piecewise
Bézier volume deformation model and manually defined action units to track and subsequently synthesize
visual speech. This approach is also less convincing around the mouth, probably because only a
few specific feature points are tracked and used for all the deformations. In contrast, L. Reveret et al. [10]
have applied a sophisticated 3D lip model, which is represented as a parametric surface guided by 30
control points. Unfortunately the motion around the lips, which is also very important for increased
realism, was tracked by only 30 markers on one side of the face and finally mirrored. Since most
people speak with spatially asymmetric mouth motions, the chosen approach results in a very symmetric
and not very detailed animation.
Here, we present a face animation approach that is based on the detailed analysis of 3D face shapes
during speech. To that end, 3D reconstructions of faces have been generated at temporal sampling rates
of 25 reconstructions per second. A PCA analysis of the displacements of a selection of control points
yields a compact 3D description of visemes, the visual counterparts of phonemes. With 38 points on the
lips themselves and a total of 124 on the larger part of the face that is influenced by speech, this analysis
is quite detailed. By directly learning the facial deformations from real speech, their parameterisation in
terms of principal components is a natural and perceptually relevant one. This seems less the case for
anatomically based models [11, 12]. Concatenation of visemes yields realistic animations. In addition,
the results yield a robust face tracker for performance capture, that works without special markers.
The structure of the paper is as follows. The first section describes how the 3D face shapes observed
during speech are acquired and how these data are used to analyse the space of corresponding face
deformations. The second section uses these results in the context of performance capture, and the
third section discusses speech-based animation, both of a face for which 3D lip dynamics have
been learned and of faces to which the learned dynamics were copied. A last section concludes the
paper.
The Space of Face Shapes
Our performance capture and speech-based animation modules both make use of a compact
parameterisation of real face deformations during speech. This section describes the extraction and analysis of
the real, 3D input data.
Face Shape Acquisition
When acquiring 3D face data for speech, a first issue is the actual part of the face to be measured.
The results of Munhall and Vatikiotis-Bateson [13] provide evidence that lip and jaw motions affect the
entire facial structure below the eyes. Therefore, we extract 3D data for the area between the eyes and
the chin, to which we fit a topological model or ‘mask’, as shown in fig. 1.
This mask consists of 124 vertices, the 34 standard MPEG-4 vertices and 90 additional vertices for
increased realism. Of these vertices, 38 are on the lips and 86 are spread over the remaining part of the
mask. The remainder of this section explores the shapes that this mask takes on if it is fitted to the face
of a speaking person. The shape of a talking face was extracted at a temporal sampling rate of 25 3D
snapshots per second (video). We have used Eyetronics’ ShapeSnatcher system for this purpose [14].
It projects a grid onto the face, and extracts the 3D shape and texture from a single image. By using a
video camera, a quick succession of 3D snapshots can be gathered. The ShapeSnatcher yields several
thousand points for every snapshot, as a connected, triangulated and textured surface. The problem is
that these 3D points correspond to projected grid intersections, not to corresponding physical points of the
face. We have simplified the problem by putting markers on the face for each of the 124 mask vertices,
as shown in fig. 2.
The 3D coordinates of these 124 markers (actually of the centroids of the marker dots) were measured
for each 3D snapshot, through linear interpolation of the neighbouring grid intersection coordinates.
This yielded 25 subsequent mask shapes for every second. One such mask fit is also shown in fig. 2.
The markers were extracted automatically, except for the first snapshot, where the mask vertices were
fitted manually to the markers. Thereafter, the fit of the previous frame was used as an initialisation for
the next, and it was usually sufficient to move the mask vertices to the nearest markers. In cases where
there were two nearby candidate markers the situation could almost without exception be disambiguated
by first aligning the vertices with only one such candidate.
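As a rough illustration of this frame-to-frame fitting step, the sketch below simply moves every mask vertex from its previous position to the nearest detected marker centroid. It is a minimal sketch under stated assumptions: the marker centroids are assumed to be available as 3D points, the function and variable names are ours, and the disambiguation of nearby candidate markers described above is not reproduced.

```python
import numpy as np
from scipy.spatial import cKDTree

def fit_mask_to_markers(prev_vertices, marker_centroids):
    """Move each mask vertex (taken from the previous frame's fit) onto the
    nearest marker centroid detected in the current 3D snapshot.

    prev_vertices    : (124, 3) mask vertex positions at frame k-1
    marker_centroids : (M, 3)   marker centroids extracted at frame k
    Returns a (124, 3) array with the updated vertex positions.
    """
    tree = cKDTree(marker_centroids)        # fast nearest-neighbour lookup
    _, nearest = tree.query(prev_vertices)  # index of the closest marker per vertex
    return marker_centroids[nearest]
```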
Before the data were extracted, it had to be decided what the test person would say during the acquisi-
tion. It was important that all relevant visemes would be observed at least once, i.e. all visually distinct
mouth shape patterns that occur during speech. Moreover, these different shapes should be observed in
as short a time as possible, in order to keep processing time low. The subject was asked to pronounce
a series of words, one directly after the other as in fluent speech, where each word was targeting one
viseme. These words are given in the table of fig 5. This table will be discussed in more detail later.
Face Shape Analysis
The 3D measurements yield different shapes of the mask during speech. A Principal Component
Analysis (PCA) was applied to these shapes in order to extract the natural modes. The recorded data
points represent 372 degrees of freedom (124 vertices with three displacements each). Because only 145
3D snapshots were used for training, at most 144 components could be found. This poses no problem as
98% of the total variance was found to be represented by the first 10 components or ‘eigenmasks’, i.e.
the eigenvectors with the 10 highest eigenvalues of the covariance matrix for the displacements. This
leads to a compact, low-dimensional representation in terms of eigenmasks. It has to be added that so
far we have experimented with the face of a single person. Work on automatically animating faces of
people for whom no dynamic 3D face data are available is planned for the near future. Next, we describe
the extraction of the eigenmasks in more detail.
The extraction of the eigenmasks follows traditional PCA, applied to the displacements of the 124
selected points on the face. This analysis cannot be performed on the raw data, however. First, the mask
position is normalised with respect to the rigid rotation and translation of the head. This normalisation
is carried out by aligning the points that are not affected by speech, such as the points on the upper side
of the nose and the corners of the eyes. After this normalisation, the 3D positions of the mask vertices
are collected into a single vector $m_k$ for every frame $k = 1, \ldots, N$, with $N = 145$ in this case
$$m_k = (x_{k1}, y_{k1}, z_{k1}, \ldots, x_{k124}, y_{k124}, z_{k124})^T \qquad (1)$$
where $T$ stands for the transpose. Then, the average mask $\bar{m}$,
$$\bar{m} = \frac{1}{N} \sum_{k=1}^{N} m_k \; ; \quad N = 145 \qquad (2)$$
is subtracted to obtain displacements with respect to the average, denoted as $\Delta m_k = m_k - \bar{m}$. The
covariance matrix $\Sigma$ for the displacements is obtained as
$$\Sigma = \frac{1}{N-1} \sum_{k=1}^{N} \Delta m_k \, \Delta m_k^T \; ; \quad N = 145 \qquad (3)$$
Upon decomposing this matrix as the product of a rotation, a scaling and the inverse rotation,
$$\Sigma = R \Lambda R^T \qquad (4)$$
one obtains the PCA decomposition, with $\Lambda$ the diagonal scaling matrix with the eigenvalues $\lambda$ sorted
from the largest to the smallest magnitude, and the columns of the rotation matrix $R$ the corresponding
eigenvectors. The eigenvectors with the highest eigenvalues characterize the most important modes of
face deformation. Mask shapes can be approximated as a linear combination of the 144 modes
$$m_j = \bar{m} + R w_j \qquad (5)$$
The weight vector $w_j$ describes the deviation of the mask shape $m_j$ from the average mask $\bar{m}$ in
terms of the eigenvectors, coined eigenmasks for this application. By varying $w_j$ within reasonable
bounds, realistic mask shapes are generated. As already mentioned at the beginning of this section, it
was found that most of the variance (98%) is represented by the first 10 modes, hence further use of the
eigenmasks is limited to linear combinations of the first 10. They are shown in fig. 3.
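The eigenmask computation is plain PCA on the stacked, rigidly normalised mask vectors. The sketch below mirrors equations (1)-(5); the array names and the use of NumPy are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def compute_eigenmasks(masks, n_modes=10):
    """masks : (N, 372) array; each row is a normalised mask vector m_k as in eq. (1).
    Returns the average mask, the n_modes dominant eigenmasks and their eigenvalues."""
    m_bar = masks.mean(axis=0)                    # average mask, eq. (2)
    dm = masks - m_bar                            # displacements Delta m_k
    cov = dm.T @ dm / (len(masks) - 1)            # covariance matrix, eq. (3)
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric eigendecomposition, eq. (4)
    order = np.argsort(eigvals)[::-1][:n_modes]   # keep the largest eigenvalues
    return m_bar, eigvecs[:, order], eigvals[order]

def reconstruct_mask(m_bar, R, w):
    """eq. (5): rebuild a mask shape from a low-dimensional weight vector w."""
    return m_bar + R @ w
```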
Performance Capture
A face tracker has been developed that can serve as a performance capture system for speech. It fits
the face mask to subsequent 3D snapshots, but now without markers. Again, 3D snapshots taken with
the ShapeSnatcher at 1/25 second intervals are the input. The face tracker decomposes the 3D motions
into rigid motions and motions due to the visemes.
The tracker first adjusts the rigid head motion and then adapts the weight vector $w_j$ to fit the remaining
motions, mainly those of the lips. A schematic overview is given in fig. 4(a). Such performance capture
can e.g. be used to drive a face model at a remote location, by only transmitting a few face animation
parameters: 6 parameters for rigid motion and 10 components of the weight vectors.
For the very first frame, the system has no clue where the face is and where to try fitting the mask. In
this special case, it starts by detecting the nose tip. It is found as a point with particularly high curvature
in both horizontal and vertical direction:
$$n(x, y) = \{(x, y) \mid \min(\max(0, k_x), \max(0, k_y)) \text{ is maximal}\} \qquad (6)$$
where $k_x$ and $k_y$ are the two curvatures, which are in fact averaged over a small region around the
points in order to reduce the influence of noise. The curvatures are extracted from the 3D face data
obtained with the ShapeSnatcher. After the nose tip vertex of the mask has been aligned with the nose
tip detected on the face, and with the mask oriented upright, the rigid transformation can be fixed by
aligning the upper part of the mask with the corresponding part of the face. After the first frame, the
previous position of the mask is normally close enough to directly home in on the new position with the
rigid motion adjustment routine alone.
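The nose-tip rule of eq. (6) can be sketched as follows: clamp the two directional curvatures at zero, take the smaller of the two at each grid point, average over a small neighbourhood to suppress noise, and return the point where this score is maximal. The grid layout, the neighbourhood size and the assumption that curvature maps are already available are ours.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def detect_nose_tip(points, k_x, k_y, window=5):
    """points   : (H, W, 3) grid of 3D surface points from one snapshot
    k_x, k_y : (H, W) curvature estimates in the horizontal / vertical direction
    Returns the 3D coordinates of the nose-tip candidate according to eq. (6)."""
    score = np.minimum(np.maximum(0.0, k_x), np.maximum(0.0, k_y))
    score = uniform_filter(score, size=window)       # average over a small region
    iy, ix = np.unravel_index(np.argmax(score), score.shape)
    return points[iy, ix]
```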
The rigid motion adjustment routine focuses on the upper part of the mask as this part hardly deforms
during speech. The alignment is achieved by minimizing distances
between the vertices of this part of the mask and the face surface. In order not to spend too much
time on extracting the true distances, the cost $E_o$ of a match is simplified. Instead, the distances are
summed between the mask vertices $x$ and the points $p$ where lines through these vertices and parallel to
the viewing direction of the 3D acquisition system hit the 3D face surface:
$$E_o = \sum_{i \in \{\text{upper part}\}} d_i \; ; \quad d_i = \| p_i - x_i(w) \| \qquad (7)$$
Note that the sum is only over the vertices in the upper part of the mask. The optimization is performed
with the downhill simplex method [15], with 3 rotation angles and 3 translation components as parame-
ters. Fig. 4 gives an example where the mask starts from an initial position (b) and is iteratively rotated
and translated to end up in the rigidly adjusted position (c).
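The rigid adjustment can be pictured as a six-parameter downhill-simplex search (three rotation angles, three translations) over the cost of eq. (7), evaluated on the upper-mask vertices only. The distance lookup along the viewing direction is abstracted into a caller-supplied function, and SciPy's Nelder-Mead routine stands in for the simplex optimiser of [15]; both are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def rigid_align(upper_vertices, distance_to_surface, x0=None):
    """upper_vertices      : (K, 3) vertices of the rigid upper part of the mask
    distance_to_surface : callable mapping (K, 3) points to (K,) distances, measured
                          along the viewing direction to the acquired face surface
    Returns the optimised pose (rx, ry, rz, tx, ty, tz)."""
    def cost(params):
        angles, t = params[:3], params[3:]
        moved = Rotation.from_euler('xyz', angles).apply(upper_vertices) + t
        return np.sum(distance_to_surface(moved))      # eq. (7), upper part only
    x0 = np.zeros(6) if x0 is None else x0             # e.g. the previous frame's pose
    return minimize(cost, x0, method='Nelder-Mead').x  # downhill simplex search
```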
Once the rigid motion has been canceled out, a fine-registration step deforms the mask in order to
precisely fit the instantaneous 3D facial data due to speech. To that end the components of the weight
vector w are optimised. Just as is the case with face spaces [16], PCA also here brings the advantage
that the dimensionality of the search space is kept low. Again, a downhill simplex procedure is used
to minimize a cost function for subsequent frames j. This cost function is of the same form as eq. (7),
with the difference that now the distance for all mask vertices is taken into account (i.e. also for the
non-rigidly moving parts). Each time starting from the previous weight vector $w_{j-1}$ (for the first frame
starting with the average mask shape, i.e. $w_{j-1} = 0$), an updated vector $w_j$ is calculated for the frame at
hand. These weight vectors have dimension 10, as only the eigenmasks with the 10 largest eigenvalues
are considered (see the section on face shape analysis). Fig. 4(d) shows the fine registration for this example.
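The fine registration keeps the pose fixed and searches the 10 weight components instead, with the same kind of cost summed over all mask vertices. A minimal sketch under the same assumptions (hypothetical distance helper, eigenmask matrix from the PCA step):

```python
import numpy as np
from scipy.optimize import minimize

def fit_weights(m_bar, R10, distance_to_surface, w_prev=None):
    """m_bar : (372,) average mask;  R10 : (372, 10) matrix of the first ten eigenmasks
    distance_to_surface : callable mapping (124, 3) vertices to (124,) distances
    Returns the updated 10-dimensional weight vector w_j."""
    def cost(w):
        vertices = (m_bar + R10 @ w).reshape(124, 3)   # deformed mask, eq. (5)
        return np.sum(distance_to_surface(vertices))   # cost over all mask vertices
    w0 = np.zeros(10) if w_prev is None else w_prev    # w_{j-1}, or 0 for the first frame
    return minimize(cost, w0, method='Nelder-Mead').x
```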
The sequence of weight vectors – i.e. mask shapes – extracted in this way can be used as a performance
capture result, to animate the face and reproduce the original motion. This reproduced motion still
contains some jitter, due to sudden changes in the values of the weight vector's components. Therefore,
these components are smoothed with B-splines (of degree 3). These smoothed mask deformations are
used to drive a detailed 3D face model, which has many more vertices than the mask. For the animation
of the face vertices between the mask vertices a lattice deformation was used (Maya, deformer type
'wrap').
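Jitter removal on the tracked weight sequence can be illustrated with a cubic smoothing spline fitted independently to each of the 10 components; SciPy's splrep/splev is used here as a stand-in for whatever spline routine was actually used, and the smoothing factor is an arbitrary assumption.

```python
import numpy as np
from scipy.interpolate import splrep, splev

def smooth_weight_tracks(weights, smoothing=0.5):
    """weights : (N, 10) weight vectors sampled at 1/25 s intervals.
    Returns an array of the same shape, with every component smoothed
    by a degree-3 B-spline."""
    t = np.arange(len(weights)) / 25.0                 # time stamps in seconds
    smoothed = np.empty_like(weights, dtype=float)
    for c in range(weights.shape[1]):                  # one spline per component
        tck = splrep(t, weights[:, c], k=3, s=smoothing)
        smoothed[:, c] = splev(t, tck)
    return smoothed
```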
Fig. 8 shows some results. The first row (A) shows different frames of the input video sequence. The
person says “Hello, my name is Jlona”. The second row (B) shows the 3D ShapeSnatcher output, i.e. the
input for the performance capture. The third row (C) shows the extracted mask shapes for the same time
instances. The fourth row (D) shows the reproduced expressions of the detailed face model as driven by
the tracker.
Animation
The use of performance capture is limited, as it only allows a verbatim replay of what has been
observed. This limitation can be lifted if one can animate faces based on speech input, either as an audio
track or text. Our system deals with both types of input.
Animation of speech has much in common with speech synthesis. Rather than composing a sequence
of phonemes according to the laws of co-articulation to get the transitions between the phonemes right,
the animation generates sequences of visemes. Visemes correspond to the basic, visual mouth expres-
sions that are observed in speech. Whereas there is a reasonably strong consensus about the set of
phonemes, there is less unanimity about the selection of visemes. Approaches aimed at realistic anima-
tion of speech have used any number from as few as 16 [2] up to about 50 visemes [17]. This number
is by no means the only parameter in assessing the level of sophistication of different schemes. Much
also depends on the addition of co-articulation effects. There certainly is no simple one-to-one relation
between the 52 phonemes and the visemes, as different sounds may look the same and therefore this
mapping is rather many-to-one. For instance \b\ and \p\ are two bilabial stops which differ only in
the fact that the former is voiced while the latter is voiceless. Visually, there is hardly any difference in
fluent speech.
We based our selection of visemes on the work of Owens [18] for consonants. We use his consonant
groups, except for two of them, which we combine into a single \k,g,n,l,ng,h,y\ viseme. The
groups are considered as single visemes because they yield the same visual impression when uttered.
We do not consider all the possible instances of different neighboring vowels that Owens distinguishes,
however. In fact, we only consider two cases for each cluster: rounded and widened, that represent
the instances farthest from the neutral expression. For instance, the viseme associated with \m\ differs
depending on whether the speaker is uttering the sequence omo or umu vs. the sequence eme or imi.
In the former case, the \m\ viseme assumes a rounded shape, while the latter assumes a more widened
shape. Therefore, each consonant was assigned to these two types of visemes. For the visemes that
correspond to vowels, we used those proposed by Montgomery and Jackson [19].
As shown in fig. 5, the selection contains a total of 20 visemes: 12 representing the consonants (boxes
with red 'consonant' title), 7 representing the monophthongs (boxes with title 'monophthong') and one
representing the neutral pose (box with title 'silence'), where diphthongs (box with title 'diphthong') are
divided into two separate monophthongs and their mutual influence is taken care of as a co-articulation
effect. The boxes with the smaller title 'allophones' can be disregarded by the reader for the moment. The
table also contains example words producing the visemes when they are pronounced.
This viseme selection differs from others proposed earlier. It contains more consonant visemes than
most, mainly because the distinction between the rounded and widened shapes is made systematically.
For the sake of comparison, Ezzat and Poggio [2] used 6 (only one for each of Owens' consonant
groups, while also combining two of them), Bregler et al. [5] used 10 (same clusters, but they subdivided
the cluster \t,d,s,z,th,dh\ into \th,dh\ and the rest, and \k,g,n,l,ng,h,y\ into \ng\,
\h\, \y\, and the rest, making an even more precise subdivision for this cluster), and Massaro [20]
used 9 (but this animation was restricted to cartoon-like figures, which do not show the same complexity
as real faces). We feel that our selection is a good compromise between the number of visemes needed
in the animation and the realism that is obtained.
Animation can then be considered as navigating through a graph where each node represents one
of $N_V$ visemes, and the interconnections between nodes represent the $N_V^2$ viseme transformations
(co-articulation). From an animator's perspective, the visemes represent key masks, and the transformations
represent a method of interpolating between them. As a preparation for the animation, the visemes were
mapped into the 10-dimensional eigenmask space. This yields one weight vector $w_{vis}$ for every viseme.
The advantage of performing the animation as transitions between these points in the eigenmask space
is that interpolated shapes all look realistic. As was the case for tracking, point-to-point navigation in the
eigenmask space as a way of concatenating visemes yields jerky motions. Moreover, when generating
the temporal samples, these may not precisely coincide with the pace at which visemes change. Both
problems are solved through B-spline fitting to the different components of the weight vectors $w_{vis}(t)$
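To make the concatenation step concrete, the sketch below places the viseme key weights at their (assumed) timings, fits a cubic B-spline through them per component, and resamples the curve at the 25 Hz animation rate; the timing source, the names and the spline settings are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

def viseme_trajectory(key_times, key_weights, fps=25):
    """key_times   : (M,) times in seconds at which the visemes are reached (M >= 4)
    key_weights : (M, 10) eigenmask weight vectors w_vis of those visemes
    Returns the frame times and a (T, 10) array of interpolated weight vectors."""
    frame_times = np.arange(key_times[0], key_times[-1], 1.0 / fps)
    spline = make_interp_spline(key_times, key_weights, k=3, axis=0)  # cubic, per component
    return frame_times, spline(frame_times)
```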
[...] the mask deformations are determined. The mask then drives the detailed face model. Fig. 8 (E) shows
a few snapshots of the animated head model, for the same sentence as used for the performance capture
example. Row (F) shows a detail of the lips for another viewing angle. It is of course interesting at this
point to test what the result would be of verbatim copying of the visemes onto another face. If successful,
[...] lip dynamics have to be captured for that face and much time and effort could be saved. Such results
are shown in fig. 7. Although these static images seem reasonable, the corresponding sequences are not
really satisfactory.
Conclusions
Realistic face animation is still a hard nut to crack. We have tried to attack this problem via the acquisition
and analysis of exterior, 3D face measurements. With 38 points [...] the face that are influenced by speech,
it seems that this analysis is more detailed than earlier ones. Based on a proposed selection of visemes,
speech animation is approached as the concatenation of 3D mask deformations, expressed in a compact
space of 'eigenmasks'. Such an approach was also demonstrated for performance capture. This work still
has to be extended in a number of ways. First, the current animation [...] only supports animation of the
face of the person for whom the 3D snapshots were acquired. Although we have tried to transplant visemes
onto other people's faces, it became clear that a really realistic animation requires visemes that are adapted
to the shape or 'physiognomy' of the face at hand. Hence one cannot simply copy the deformations that
have been extracted from one face to a novel face. It is [...]
Figure 1. Left: example of 3D input for one snapshot; Right: the mask used for tracking the facial motions during speech.
Figure 2. Left: markers put on the face, one for each of the 124 mask vertices; Right: 3D mask fitted by matching the mask vertices with face markers.
Figure 3. Average mask (0) and the 10 dominant 'eigenmasks' for visual speech [...]
References
[...]
[5] [...] Video rewrite: driving visual speech with audio. In SIGGRAPH, pages 353–360, 1997.
[6] D. Chen and A. State. Interactive shape metamorphosis. In Symposium on Interactive 3D Graphics (SIGGRAPH '95 Conference Proceedings), pages 43–44, 1995.
[7] M. Brand. Voice puppetry. In SIGGRAPH, 1999.
[8] F. Pighin, J. Hecker, D. Lischinsky, R. Szeliski, and D.H. Salesin. Synthesizing realistic facial expressions [...]
[9] [...] Bézier volume deformation model. In Proc. CVPR, 1999.
[10] L. Reveret, G. Bailly, and P. Badin. Mother, a new generation of talking heads providing a flexible articulatory control for videorealistic speech animation. In Proc. ICSL'2000, 2000.
[11] S. King, R. Parent, and L. Olsafsky. An anatomically-based 3D parameter lip model to support facial animation and synchronized speech. In Proc. Deform Workshop, pages 1–19, 2000.
[12] K. Waters and J. Frisbie. A coordinated muscle model for speech animation. In Graphics Interface, pages 163–170, 1995.
[13] K.G. Munhall and E. Vatikiotis-Bateson. The moving face during speech communication. In Ruth Campbell, Barbara Dodd, and Denis Burnham, editors, Hearing by Eye, volume 2, chapter 6, pages 123–139. Psychology [...]
[...]
[15] [...] pages 308–312, 1965.
[16] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH, pages 187–194, 1999.
[17] K.C. Scott, D.S. Kagels, S.H. Watson, H. Rom, J.R. Wright, M. Lee, and K.J. Hussey. Synthesis of speaker facial movement to match selected speech sequences. In Proceedings of the Fifth Australian Conference on Speech Science and Technology, volume 2, pages 620–625, 1994.
[18] [...] normal-hearing adult viewers. In Jour. Speech and Hearing Research, volume 28, pages 381–393, 1985.
[19] A. Montgomery and P. Jackson. Physical characteristics of the lips underlying vowel lipreading performance. In Jour. Acoust. Soc. Am., volume 73, pages 2134–2144, 1983.
[20] D.W. Massaro. Perceiving Talking Faces. MIT Press, 1998.
[21] C. Traber. SVOX: The Implementation of a Text-to-Speech System. PhD thesis, Computer [...]