Improving the 3D talking head for using in an avatar of virtual meeting room

VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

ANH DUC NGUYEN

IMPROVING THE 3D TALKING HEAD FOR USING IN AN AVATAR OF VIRTUAL MEETING ROOM

Branch: Information Technology
Code: 1.01.10

MASTER THESIS

Supervisor: Dr The Duy Bui

Hanoi, November 2006

Contents

List of Figures
Chapter 1 - Introduction
1.1 The avatar in the virtual meeting room
1.2 Structure of this thesis
Chapter 2 - The 3D animated talking head
2.1 A muscle based 3D face model
2.2 Combination of facial movements on a 3D talking head
2.3 From emotions to emotional facial expressions
2.4 Conclusion
Chapter 3 - OpenGL and JOGL overview
3.1 OpenGL overview
3.1.1 Immediate Mode and Retained Mode (Scene Graphs)
3.1.2 OpenGL history
3.1.3 How does OpenGL work?
3.1.4 OpenGL as a state machine
3.1.5 Drawing geometry
3.2 JOGL overview
3.2.1 Introduction
3.2.2 Developing with JOGL
3.2.3 Using JOGL
3.3 Conclusion
Chapter 4 - Improving lip-sync ability
4.1 Introduction
4.2 Previous work
4.3 FreeTTS and Mbrola
4.3.1 FreeTTS
4.3.2 Mbrola
4.4 The improved lip model
4.5 Conclusion
Chapter 5 - Adding the hair and eyelashes models
5.1 Introduction
5.2 The Hair model
5.2.1 Introduction to VRML
5.2.2 Our hair model
5.3 The Eyelashes model
5.4 Conclusion
Chapter 6 - Implementation and illustrations
6.1 Implementing the face model
6.1.1 Structure of the system
6.1.2 Some improvements
6.2 Face model illustrations
Chapter 7 - Conclusion
Future research
References

List of Figures

2.1: The original 3D face model: (a) the face mesh with muscles; (b) the face after rendering
2.2: System overview
2.3: Combination of two movements in the same channel
2.4: The activity of Zygomatic Major and Orbicularis Oris before (top) and after (bottom) applying the combination algorithm
2.5: The emotion-to-expression system
2.6: Membership functions for emotion intensity (a) and muscle contraction level (b)
2.7: Basic emotions: Neutral, Sadness, Happiness, Anger, Fear, Disgust, Surprise (from left to right)
3.1: Software implementation of OpenGL
3.2: Hardware implementation of OpenGL
3.3: A simplified version of the OpenGL pipeline
3.4: The structure of an application using JOGL
4.1: FreeTTS architecture
5.1: Dividing a polygon (a) into triangles (b)
5.2: Importing the hair model: (a) the original head; (b) the head with the imported hair model; (c) the head with the imported and fine-tuned hair model
5.3: Some other imported and fine-tuned hair models
5.4: The open (a) and closed (b) eyes without and with eyelashes
5.5: The face without (a) and with (b), (c) the hair and eyelashes models
6.1: The main interface of our program
6.2: The face model displays Happiness emotion with maximum intensity
6.3: The face model displays Surprise emotion with maximum intensity
6.4: The combination of two emotions: Happiness and Surprise
6.5: The effect of the left Zygomatic Major muscle's contraction at maximum level on the face model
6.6: The face model from different viewpoints
6.7: Increasing surprise
6.8: The hair model after being imported
6.9: The hair model after being fine-tuned
6.10: Some other hair models
6.11: Closing the eyes
6.12: The face model attached to the body
6.13: Our face model embedded into another project

Chapter 1 - Introduction

1.1 The avatar in the virtual meeting room

Virtual Meeting Rooms (VMRs) are 3D virtual simulations of meeting rooms in which modalities such as speech, gaze, distance, gestures and facial expressions can be controlled (a VMR project in Twente). Rapid developments in computer graphics and embodied conversational agents allow the creation of VMRs and make them useful for various purposes. These purposes can be divided into the following three categories [24]. First, they can be used as a virtual environment for teleconferencing, a real-time
communication means for remote meeting participants [18]. Using VMRs helps to reduce the amount of data that needs to be sent to and displayed on the screens of remote clients. In addition, they can overcome some features that are problematic in real meetings or in traditional video-based conferences. For example, participants can adapt the virtual environment to their own preferences without disturbing other people, or they can choose a view from any seat in the VMR and feel comfortable during the meeting [17].

Second, VMRs are used to simulate the content of a recorded meeting in different ways or to present multimedia information about it. Information can be recorded directly from the participants' behavior in real meetings (e.g. tracking of head or body movements, voice). These presentations can be used as a 3D summary of the real meetings or for evaluating the annotations and results obtained by machine learning methods.

Third, because virtual environments allow various independent factors (voice, gaze, distance, gestures and facial expressions) to be controlled, these factors can be used to study their influence on features of social interaction and social behavior. Conversely, the effect of social interaction on these factors can also be studied in virtual environments.

In the VMR environment, each participant is represented by an avatar. An avatar is an embodied conversational agent that simulates all behaviors and movements of the participant. The avatar typically contains a talking head, which is able to speak and to display lip movements during speech, emotional facial expressions and conversational signals, and a body, which is able to display the gestures of the participant. Most importantly, the avatar of each participant must be believable to the other participants. The avatar will be believable if it can simulate the appearance and express the characteristics of the participant, and if its actions and reactions
can be as true to life as those of the person it is representing.

The talking head model plays an important role in the creation of a believable avatar. It is used not only to display facial movements and expressions but also to distinguish the avatar from other avatars and to express the personality of the participant. To create a talking head model that is suitable for avatars in VMRs, several problems need to be dealt with. First, the talking head must be simple enough to allow real-time animation but still produce realistic, high-quality facial expressions. Second, the talking head must not only be able to create facial movements such as conversational signals, emotional expressions, etc., but also combine them and resolve the conflicts between them. Third, the talking head must look like a real head, which means that other models such as a hair model, a tongue model and an eyelashes model must be attached to it.

In this thesis, we choose the talking head model from [3] to improve and then use for avatars in VMRs. We study the model carefully to discover its advantages as well as its disadvantages. The advantages are inherited, while missing functions are supplemented and disadvantages are improved. We change the rendering method of the head to a new one to improve the animation speed. The synchronization between audible and visible speech is also improved. We add hair and eyelashes models to make the head look more realistic. The improved model can be used not only for avatars in the VMR environment but can also be embedded into other projects.

1.2 Structure of this thesis

In Chapter 2, we introduce the 3D animated talking head [3] on which our work is based. This head is able to produce realistic facial expressions and real-time animation on a personal computer. It can display several types of facial movements, such as eye blinking, head rotation and lip movement, at once and, most importantly, it can generate
emotional facial expressions from emotions. We briefly introduce the way this muscle based 3D face model is created, the techniques it uses for producing animation, the combination of facial movements, and how emotional facial expressions are generated from emotions.

In Chapter 3, we present an overview of OpenGL and JOGL (Java bindings for OpenGL). OpenGL is an industry standard and a premier environment for developing 2D and 3D graphics applications. Its capabilities allow developers to display compelling graphics and produce applications that require maximum performance (OpenGL project). JOGL is a new OpenGL interface for the Java platform. It is an open source, clean and minimalist API among the bindings available.

In Chapter 4, we give an overview of FreeTTS and Mbrola. FreeTTS is a robust text-to-speech system that we use to get phonemes and timing information from a text. This phoneme string is used to generate lip movements during speech. FreeTTS supports Mbrola, a speech synthesizer based on the concatenation of diphones. We use Mbrola as an output thread of FreeTTS to produce synthetic audible speech. We also present the method used to improve the lip-sync capability. The original head can speak, but under some conditions the speech from the speaker is not synchronized with the movements of the lips on the screen. Besides, we may want the head to express various emotions depending on the sentence currently being spoken, so we need to know exactly when the sentence is spoken in order to generate suitable emotions.

The original head does not have a hair model or eyelashes. We supply these parts to make it look like a real head and become more attractive. In Chapter 5, we present the method for applying a hair model to the head and the way we draw eyelashes for the eyes. Available hair models can be attached to the head model without much human intervention during the process. In addition, the eyelashes are a small part of the face, but without them the eyes may not look
real. The eyelashes also help to improve the emotional expressiveness of the eyes when the eyes flutter. We describe some problems in creating the eyelashes and how to fix them to the eyelids so that they move with the eyelids when the eyes close or open.

In Chapter 6, we describe the implementation of the face using Java and JOGL. We also describe our improvement to the rendering method of the talking head, which uses the new methods and mechanism introduced in OpenGL 1.5. This method increases the animation speed significantly. Some illustrations of our 3D talking head model are also presented in this chapter.

Chapter 2 - The 3D animated talking head

2.1 A muscle based 3D face model

The face model consists of a polygonal face mesh and a B-spline surface for the lips. The face mesh data was first obtained from a 3D scanner and was then processed to improve the animation performance while keeping the high quality of the model. The process contains two phases. In the first phase, the number of vertices and polygons was reduced in the non-expressive parts but maintained in the expressive parts, which are the areas around the eyes, the nose, the mouth and the forehead. At the end of this phase, the face mesh contains 2,468 vertices and 4,746 polygons. This is small enough to allow real-time animation but still preserves the high level of detail in the expressive parts of the face. In the second phase, the face model was divided into eleven regions. The five regions on the left part are the left lower face, left middle face, left lower eyelid, left upper eyelid and left upper face. There are five corresponding regions on the right part, and the last region is at the back of the head. This division not only prevents unwanted artifacts caused by the displacement of vertices in regions that should not be affected by a muscle contraction, but also increases the animation speed.

The lip model is a B-spline surface with a 24 x control point grid. The lip is deformed by
moving the control points, and the B-spline surface is polygonalized to connect with the face mesh for rendering. The B-spline surface has the advantage of producing a smooth surface, but it cannot produce wrinkles and needs to be polygonalized before rendering. If the number of control points is too large, it requires heavy computation. Given this advantage and these disadvantages, a B-spline surface is suitable for modeling a small part of the face such as the lips.

Almost all of the 19 muscles used on the face to generate animation are vector muscles, except the Orbicularis Oris, which drives the mouth, and the Orbicularis Oculi, which drives the eye. The vector muscle of the face is an improved version of the vector muscle model from [28]. In addition, a mechanism to generate wrinkles and bulges is added to increase the realism of the facial expressions, and a technique to reduce the computation is introduced to enhance the animation performance. The Orbicularis Oris muscle is parameterization-based and is adopted from [12]. The Orbicularis Oculi has two parts: the Pars Palpebralis, which opens and closes the eyelid, is adopted from [22], and the Pars Orbitalis, which squeezes the eye, is adopted from [28]. The jaw and eyeball rotation algorithms are improved versions of the ones proposed in [22]. The mouth now has a natural oval look, and the eyes can track a target. Eye movement is independent of facial muscle movements, and the eyes cannot rotate to impossible positions.

All muscles have an intensity range from 0 to 1, and the step value between two adjacent muscle contraction levels is 0.2. This step value was determined by trial-and-error experiments. It is small enough to ensure that the facial animations are smooth and large enough to reduce the computation time. Figure 2.1 shows the original face from [3].

Figure 2.1: The original 3D face model: (a) the face mesh with muscles; (b) the face after rendering

2.2 Combination of facial movements on a 3D
talking head

The system takes marked-up text as input. Each facial movement (except lip movement while talking) is defined as a group of muscle contractions that share the same function, start time, onset, offset and duration. Lip movement is generated separately inside the system based on the phonemic representation of the input text.

Chapter 6 - Implementation and illustrations

... model by using the algorithms described in previous chapters and renders them on the screen using JOGL.
+ MMuscle defines the muscles and controls their contractions on the face model.
+ MVertex and MPolygon hold the coordinates of each vertex and the vertex indexes of each polygon, respectively.
+ MFaceMesh creates the face mesh from all vertices and polygons of the face model.
+ MPhoneme and MViseme create the movements of the lips when talking.
controller: the ControlPanel class creates the graphical user interface that controls the whole system.

6.1.2 Some improvements

The vertex-at-a-time method is the original method used to draw the geometry of the head. It is slow and not suitable for drawing an object with thousands of vertices and polygons such as this head. Thus, we replace it with the vertex array drawing method and apply a mechanism that permits vertex array data to be stored in server-side memory. The result is good: the rendering speed is now much higher than with the old method. In our experiments on a computer with a Pentium IV 3.0 GHz processor, 1 GB of RAM and an ATI Radeon 6600 graphics card with 128 MB of memory, the average frame rate is about 90 frames per second with the original method and about 200 frames per second with our new drawing method.

Another problem with the original head is that it displays arbitrary movements when it runs on some computers. The reason is that the face data is used in both the rendering process and the data activating process, and these are separate processes. The interval between two data activating processes is also fixed. Thus, if the face data is being activated and the activation hasn’t
finished yet but the data is still used for rendering, the displayed results are arbitrary. On most computers this problem is not obvious, but on a powerful computer the user can notice that the head flickers. So we stop using the interval between two adjacent face activations and combine these two processes into one. This combination slows the faster process down to the speed of the slower one, but the head no longer flickers.

6.2 Face model illustrations

Figure 6.1 shows the main interface of our project. The panel on the left allows the user to control all parts of the face. The panel on the right displays the face model with a neutral emotion.

Figure 6.1: The main interface of our program

Figures 6.2 and 6.3 show the face model expressing the Happiness and Surprise emotions, respectively. The other basic emotions that the face model can express are Anger, Fear, Disgust and Sadness. Figure 6.4 displays the combination of two emotions, Happiness and Surprise. We can recognize the expression of Happiness at the mouth and the expression of Surprise at the forehead, eyes and eyebrows.
Figure 6.2: The face model displays Happiness emotion with maximum intensity

References

... TR-2002-114, 2002.
[28] Waters, K.: A muscle model for animating three-dimensional facial expression. In Stone, M. C., editor, Computer Graphics (SIGGRAPH '87 Proceedings), pages 17-24, 1987.
[29] Waters, K., Levergood, T. M.: DECface: An automatic lip-synchronization algorithm for synthetic faces. ACM Multimedia, San Francisco, October 1994.
[30] Wikipedia Foundation, JOGL definition, http://en.wikipedia.org/wiki/Jogl
[31] Wikipedia Foundation, Radial Basis Function definition, http://en.wikipedia.org/wiki/Radial_basis_function
[32] Wright, J.R.S., Lipchak, B.: OpenGL SuperBible, Third Edition. Sams Publishing, June 30, 2004.
[33] Yahara, H., Fukui, Y., Nishihara, S., Mochimaru, M., Kouchi, M.: Generation of 3D Face Model using Free-Form Deformation. In Proceeding of Modeling and Simulation 2005.
[34] Yang, G., Huang, Z.: A method of human short hair modeling and real time animation. In Proceedings Pacific Graphics (PG 2002).

... using this method, participants of the virtual meeting room can customize the head of the avatar by themselves. They can choose a hair model they want and...
... face of avatar. They can benefit from verbal and non-verbal communication and have a new way to find the points of interest in the meeting. One important thing is that the face can help the avatar...

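As a rough illustration of the vertex array approach described in Section 6.1.2, the sketch below packs a polygon mesh into the flat coordinate and index buffers that array-based drawing consumes. It is not the thesis code: the class `MeshPacker`, its methods and the array-based mesh representation are hypothetical, polygons are assumed to be convex so that a simple triangle fan is valid, and the OpenGL calls themselves are left out so that the example runs without JOGL.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

/** Hypothetical helper (not from the thesis): packs a polygon mesh into flat buffers. */
final class MeshPacker {
    private MeshPacker() {}

    /** Copies the x, y, z coordinates of every vertex into one direct float buffer. */
    static FloatBuffer packVertices(float[][] vertices) {
        FloatBuffer buf = ByteBuffer.allocateDirect(vertices.length * 3 * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (float[] v : vertices) {
            buf.put(v[0]).put(v[1]).put(v[2]);
        }
        buf.flip(); // make the written data readable from position 0
        return buf;
    }

    /** Splits each polygon (assumed convex) into a triangle fan and stores the indices. */
    static IntBuffer packTriangleIndices(int[][] polygons) {
        int triangles = 0;
        for (int[] p : polygons) {
            triangles += Math.max(0, p.length - 2); // an n-gon yields n - 2 triangles
        }
        IntBuffer buf = ByteBuffer.allocateDirect(triangles * 3 * Integer.BYTES)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();
        for (int[] p : polygons) {
            // Fan around the first vertex: (p0, p1, p2), (p0, p2, p3), ...
            for (int i = 1; i + 1 < p.length; i++) {
                buf.put(p[0]).put(p[i]).put(p[i + 1]);
            }
        }
        buf.flip();
        return buf;
    }
}
```

In an OpenGL program, buffers like these would typically be handed to the vertex array functions (`glVertexPointer`, `glDrawElements`) instead of issuing one call per vertex, or uploaded once into a buffer object (`glBufferData`, available since OpenGL 1.5) so that the data stays in server-side memory. Direct buffers in native byte order are a common choice for data passed to native libraries.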