The 2015 IEEE RIVF International Conference on Computing & Communication Technologies - Research, Innovation, and Vision for the Future (RIVF)

A Vietnamese 3D Talking Face for Embodied Conversational Agents

Thi Duyen Ngo, The Duy Bui
University of Engineering and Technology, Vietnam National University, Hanoi
Email: {duyennt,duybt}@vnu.edu.vn

Abstract—Conversational agents are receiving significant attention from the multi-agent and human-computer interaction research communities. Many techniques have been developed to enable these agents to behave in a human-like manner. To do so, they are simulated with communicative channels similar to those of humans; moreover, they are also simulated with emotion and personality. In this work, we focus on the issue of expressing emotions for embodied agents. We present a three-dimensional face with the ability to speak emotional Vietnamese speech and to naturally express emotions while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words and, at the same time, show emotional facial expressions while speaking. The face's architecture consists of three parts: a Vietnamese Emotional Speech Synthesis module, an Emotions to Facial Expressions module, and a Combination module, which creates lip movements when pronouncing Vietnamese emotional speech and combines these movements with emotional facial expressions. We have tested the face in the football supporter domain in order to confirm its naturalness. The face is simulated as the face of a football supporter agent which experiences emotions and expresses them in his voice as well as on his face.

Keywords—Conversational Agent, Vietnamese 3D Talking Face, Emotional Speech, Emotional Facial Expression

I. INTRODUCTION

One particularity of humans is to have emotions; this makes people different from all other animals. Emotions have been studied for a long time, and the results show that they play an important role in human cognitive functions. Picard summarized this in her "Affective Computing" [1], and the view has also been supported by many other scientists [2][3]. Recognizing the importance of emotions to human cognitive functions, Picard [1] concluded that if we want computers to be genuinely intelligent, to adapt to us, and to interact naturally with us, then they will need the ability to recognize and express emotions, to model emotions, and to show what has come to be called "emotional intelligence".

Conversational agents have become more and more common in the multimedia worlds of films, educative applications, e-business, etc. In order to make these agents more believable and friendly, they are simulated with emotion and personality as well as with communicative channels such as voice and facial expression. As early as the 1930s, traditional character animators incorporated emotion into animated characters to make audiences "believe in characters, whose adventures and misfortunes make people laugh - and even cry" [4]. The animators believe that emotion, appropriately timed and clearly expressed, is one of the keys to the quality of animated films. In the area of computational synthetic agents, emotions have received much attention for their influence in creating believable characters, e.g. [5][6].

According to [7], facial expressions are one of the most important sources of information about a person's emotional state. Psychologists and other researchers have long recognized the importance of facial displays for judging emotions, and facial displays have probably received as much attention as all other expressive channels combined. The second most important expressive channel for judging emotions is speech; "much of the variation in vocal behavior can be attributed to the level of arousal" [7]. Therefore, in our work we focus on these two channels in addressing the issue of expressing emotions for embodied conversational agents.

In this paper, we propose a talking face system which is a combination of our previous works. We present a three-dimensional face with the ability to speak emotional Vietnamese speech and to naturally express emotions on the face while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words, and at the same time it can show emotional facial expressions while speaking. To our knowledge, no such face has been proposed before. The face is built on two systems that we have presented previously: the rule-based system for synthesizing Vietnamese emotional speech presented in [8], and the system providing a mechanism for simulating continuous emotional states of conversational agents proposed in [9]. In addition, this work introduces a module for creating lip movements when pronouncing Vietnamese emotional speech and combining these movements with emotional facial expressions. We have tested the face in the football supporter domain in order to confirm its naturalness. The face is simulated as the face of a football supporter agent which experiences emotions and expresses them in his voice as well as on his face.

The rest of the paper is organized as follows. First, we present the face's architecture in Section II; the construction and operation of the three main modules of the face are described in three subsections. We then test the face in the football supporter domain in Section III. Finally, the conclusion is presented in Section IV.

II. SYSTEM ARCHITECTURE

The talking face is built on the two systems that we have proposed before [8][9]. An overview of the face's architecture can be seen in Figure 1. The face takes as input neutral speech with a corresponding phoneme list with temporal information, together with a series of Emotion State Vectors (ESV) over time. Ideally, one part of the input would be text instead of neutral speech with a phoneme list; however, our work does not focus on text-to-speech, only on synthesizing Vietnamese emotional speech from neutral speech. We therefore assume that a Vietnamese text-to-speech system returns neutral speech with a corresponding phoneme list from the text, and we use this as one part of the input to our system.

Figure 1: The face's architecture

There are three main modules in the talking face system: the Vietnamese Emotional Speech Synthesis (VESS) module, the Emotions to Facial Expressions (EFE) module, and the Combination module. The VESS module uses the system in [8] to convert Vietnamese neutral speech to Vietnamese emotional speech according to the corresponding emotional style. The EFE module uses the system in [9] to simulate continuous emotional facial expressions from the series of ESVs. The Combination module creates lip movements when pronouncing Vietnamese emotional speech from the list of phonemes (with temporal information) and combines these movements with emotional facial expressions. Finally, the facial expressions and movements are displayed on a 3D face, synchronized with the emotional speech. In our system, we use the muscle-based 3D face model presented in [10]; this face model is able to produce both realistic facial expressions and real-time animation on standard personal computers. The construction and operation of the system's components are described in the following subsections.
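To make the data flow of Figure 1 concrete, the sketch below outlines how the three modules could be chained. It is a minimal illustration, not the actual implementation: the names vess, efe, combine, Phoneme, and ESV are placeholders of ours, and the stub bodies merely stand in for the systems of [8], [9] and the Combination module described in Section II-C.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Phoneme:
    symbol: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class ESV:
    t: float                  # time stamp (seconds)
    intensities: List[float]  # one value per basic emotion

# Placeholder stubs standing in for the real components: the VESS system [8],
# the EFE system [9], and the Combination module of Section II-C.
def vess(neutral_speech: bytes, phonemes: List[Phoneme], style: str) -> bytes:
    return neutral_speech                       # would return emotional speech

def efe(esv_series: List[ESV]) -> List[List[float]]:
    return [e.intensities for e in esv_series]  # would return FMCVs

def combine(phonemes: List[Phoneme],
            fmcvs: List[List[float]]) -> List[Tuple[str, List[List[float]]]]:
    return [(p.symbol, fmcvs) for p in phonemes]  # would return blended keyframes

def talking_face(neutral_speech: bytes, phonemes: List[Phoneme],
                 esv_series: List[ESV], style: str):
    """Overall data flow: VESS and EFE run on their own inputs, and the
    Combination module merges lip movements with emotional expressions."""
    emotional_speech = vess(neutral_speech, phonemes, style)
    fmcvs = efe(esv_series)
    animation = combine(phonemes, fmcvs)
    return emotional_speech, animation

speech, anim = talking_face(b"...", [Phoneme("a", 0.0, 0.2)],
                            [ESV(0.0, [0.7, 0, 0, 0, 0, 0])], style="happiness")
```

In the real system, the lip-movement channel produced by the Combination module is given priority over the expression channel when muscles conflict, as discussed in Section II-C.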
A. Vietnamese Emotional Speech Synthesis (VESS) module

The VESS module uses our previous work [8] to convert Vietnamese neutral speech into Vietnamese emotional speech according to the corresponding emotional style. The module takes Vietnamese neutral speech with a corresponding phoneme list as input and produces Vietnamese emotional speech as output. In [8], we presented a framework used to simulate four basic emotional styles of Vietnamese speech by means of acoustic feature conversion techniques applied to neutral utterances. Results of perceptual tests showed that the emotional styles were well recognized. The work in [8] is described in more detail below.

First, we carried out analyses of the acoustic features of Vietnamese emotional speech in order to find the relations between acoustic feature variations and emotional states in Vietnamese speech. The analyses were performed on a speech database consisting of Vietnamese utterances of 19 sentences, produced by one male and one female professional Vietnamese artist. The two actors were asked to produce utterances in five different styles: neutral, happiness, cold anger, sadness, and hot anger; each sentence had one utterance in each of the five styles, for each speaker. From the database, we then extracted acoustic cues related to emotional speech: the F0 contour, the power envelope, and the spectrum were calculated using STRAIGHT [11], while time durations were specified manually with the partial support of WaveSurfer [12]. At the utterance level, a total of 14 acoustic parameters were calculated and analysed; average pitch and average power at the syllable level were also examined. For each utterance, from the extracted F0 information, the highest pitch (HP), average pitch (AP), and pitch range (PR) were measured, and the average pitch of the syllables was also examined. From the extracted power envelope, the considered acoustic parameters were maximum power (HPW), average power (APW), and power range (PWR), together with the average power of the syllables. Next, for duration, the time segmentation information of each utterance was measured manually, including the phoneme number, time (ms), and vowel. The durations of all phonemes, both consonants and vowels, as well as of pauses, were then specified. From these, the mean of pause lengths (MPAU), total length (TL), consonant length (CL), and the ratio between consonant length and vowel length (RCV) were measured. Finally, from the extracted spectrum, the formants (F1, F2, F3) and spectral tilt (ST) were examined. Formant measures were taken approximately at the midpoint of the vowels using LPC order 12; spectral tilt was calculated as H1-A3, where H1 is the amplitude of the first harmonic and A3 is the amplitude of the strongest harmonic in the third formant.

After this extraction phase, for each of the 190 utterances we had a set of 14 values corresponding to the 14 acoustic parameters at the utterance level. From these 190 sets, the variation coefficients with respect to the baseline (the neutral style) were calculated, yielding 152 sets of 14 variation coefficients: 19 sets for each of the four emotional styles, for each of the two speakers. For each group of 19 sets, clustering was then carried out and the cluster containing the largest number of sets was chosen. Finally, from the chosen cluster, the mean values of the variation coefficients corresponding to the 14 parameters of each emotional style were calculated. At the syllable level, the mean variation coefficients of the mean F0, average power, and mean duration of the syllables belonging to the word/compound word at the beginning and at the end of the sentence were also calculated. We were interested in these syllables because, when the emotional state changes, acoustic features vary more in some syllables of a phrase rather than changing uniformly across the whole phrase; when analyzing the database, we found that the syllables belonging to the word/compound word at the beginning and at the end of the sentence varied more than the other syllables. The mean variation coefficients at the utterance level as well as at the syllable level were used to form rules for converting neutral speech into emotional speech.
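As a rough illustration of how such conversion rules could be derived, the snippet below computes per-utterance variation coefficients of two (invented) acoustic parameters relative to the neutral baseline and averages them into a rule for one emotional style. It is only a sketch under simplifying assumptions: the parameter values are made up, and the plain averaging step stands in for the clustering-then-averaging procedure used in the actual analysis.

```python
from statistics import mean
from typing import Dict, List

# Each utterance is a dict of utterance-level parameters, e.g.
# {"AP": 210.0, "PR": 95.0, "TL": 2.31, ...}; the values below are invented.
def variation_coefficients(emotional: Dict[str, float],
                           neutral: Dict[str, float]) -> Dict[str, float]:
    """Ratio of each acoustic parameter to its neutral-style baseline."""
    return {k: emotional[k] / neutral[k] for k in emotional}

def style_rule(emo_utts: List[Dict[str, float]],
               neu_utts: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean variation coefficient per parameter over the sentence pairs.

    The original analysis clusters the 19 coefficient sets and averages the
    largest cluster; for brevity this stand-in simply averages all sets.
    """
    coeff_sets = [variation_coefficients(e, n) for e, n in zip(emo_utts, neu_utts)]
    params = coeff_sets[0].keys()
    return {p: mean(cs[p] for cs in coeff_sets) for p in params}

# Toy example with two of the 14 parameters (average pitch, average power).
neutral = [{"AP": 200.0, "APW": 60.0}, {"AP": 190.0, "APW": 58.0}]
happy   = [{"AP": 240.0, "APW": 66.0}, {"AP": 231.0, "APW": 65.0}]
print(style_rule(happy, neutral))   # roughly {'AP': 1.208, 'APW': 1.110}
```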
In our work, we used a speech morphing technique to produce Vietnamese emotional speech. The speech morphing process is presented in Figure 2. First, STRAIGHT [11] was used to extract the F0 contour, power envelope, and spectrum of the neutral speech signal, while the segmentation information was measured manually. Then, the acoustic features in terms of F0 contour, power envelope, spectrum, and duration were modified based on morphing rules inferred from the variation coefficients obtained in the analysis stage. These modifications take into account the variation of the acoustic features at the syllable level: the syllables belonging to the words/compound words at the beginning and the end of the utterance are modified more. Finally, the emotional speech is synthesized from the modified F0 contour, power envelope, spectrum, and duration using STRAIGHT. The modifications are carried out according to the flow presented in Figure 3.

Figure 2: Speech morphing process using STRAIGHT

Figure 3: Acoustic feature modification process
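The following sketch shows one way the utterance-level and syllable-level pitch rules could be applied to an extracted F0 contour. The coefficient values, the frame layout, and the decision to simply scale voiced frames are illustrative assumptions; in the actual system the modified features are fed back into STRAIGHT for resynthesis, which is not shown here.

```python
import numpy as np

def morph_f0(f0: np.ndarray, syllable_spans: list, ap_coeff: float,
             boundary_syllables: set, boundary_coeff: float) -> np.ndarray:
    """Scale an F0 contour by the utterance-level coefficient, applying a
    stronger syllable-level coefficient to the first/last word's syllables.

    f0                 : frame-wise F0 values in Hz (0 for unvoiced frames)
    syllable_spans     : list of (start_frame, end_frame) per syllable
    ap_coeff           : variation coefficient for average pitch (from rules)
    boundary_syllables : indices of syllables in the first/last (compound) word
    boundary_coeff     : syllable-level coefficient for those syllables
    """
    out = f0.copy()
    voiced = out > 0
    out[voiced] *= ap_coeff                            # utterance-level rule
    for i, (s, e) in enumerate(syllable_spans):
        if i in boundary_syllables:
            seg = out[s:e]
            seg[seg > 0] *= boundary_coeff / ap_coeff  # replace by stronger rule
    return out

# Toy usage: two syllables of 5 frames each; the first belongs to the
# sentence-initial word and is therefore modified more.
f0 = np.array([200, 205, 210, 0, 0, 180, 185, 190, 195, 0], dtype=float)
print(morph_f0(f0, [(0, 5), (5, 10)], ap_coeff=1.2,
               boundary_syllables={0}, boundary_coeff=1.35))
```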
B. Emotions to Facial Expressions (EFE) module

The EFE module uses the system we proposed in [9] to simulate continuous emotional facial expressions. The module takes a series of Emotion State Vectors (ESV) over time as input and produces a corresponding series of Facial Muscle Contraction Vectors (FMCV) as output. In [9], a scheme providing a mechanism for simulating the continuous emotional states of a conversational agent was introduced, based on the temporal patterns of the facial activities of the six basic emotions. These temporal patterns were the result of an analysis of a spontaneous video database consisting of video sequences selected from three databases, namely MMI [13], FEEDTUM [14], and DISFA [15]. We used facial expression recognition techniques to analyze the database and then extracted general temporal patterns for the facial expressions of the six basic emotions.

The analysis process was performed in four steps. First, for each frame of the input video, the Face Detector module used the Viola-Jones algorithm [16] to detect the face and return its location. Then the ASM Fitting module extracted feature points from the detected face using the ASM fitting algorithm [17]; an ASM shape of the face containing the locations of 68 feature points was returned. From this shape, the Face Normalization module normalized the shape to a common size (the distance between the centers of the eyes was used as the standard distance). Finally, the AU Intensity Extractor module used the normalized feature point locations to calculate the intensities of the Action Units (AUs) related to the emotion style of the input video. (Action Units were defined by Ekman and Friesen, who developed the Facial Action Coding System (FACS) [18] to identify all visually distinguishable facial movements. FACS identifies the various facial muscles that individually or in groups cause changes in facial behavior; these changes in the face, together with the underlying muscles that cause them, are called Action Units (AUs). For each emotion, there is a set of related AUs that distinguishes it from the others.)
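The four analysis steps can be sketched as below. This is only an illustration: the face detection and ASM fitting stages of [16] and [17] are replaced by stubs, the landmark indices follow a 68-point numbering convention we assume here, and the AU12 measure is our own toy proxy rather than the extractor actually used in [9].

```python
import numpy as np

def detect_face(frame: np.ndarray):
    """Stub for the Viola-Jones face detector [16]; returns a dummy box."""
    h, w = frame.shape[:2]
    return (0, 0, w, h)

def fit_asm(frame: np.ndarray, box) -> np.ndarray:
    """Stub for the ASM fitting step [17]; returns 68 made-up landmarks."""
    rng = np.random.default_rng(0)
    return rng.uniform(0.0, 1.0, size=(68, 2))

def normalize_shape(points: np.ndarray, left_eye: int, right_eye: int) -> np.ndarray:
    """Scale the 68-point shape so the inter-ocular distance equals 1."""
    d = np.linalg.norm(points[right_eye] - points[left_eye])
    return (points - points.mean(axis=0)) / d

def au12_intensity(points: np.ndarray, mouth_corner: int, eye_outer: int,
                   neutral_dist: float) -> float:
    """Toy proxy for AU12 (lip-corner puller): how far the mouth corner has
    moved toward the outer eye corner relative to a neutral reference."""
    d = float(np.linalg.norm(points[mouth_corner] - points[eye_outer]))
    return max(0.0, (neutral_dist - d) / neutral_dist)

def analyze_frame(frame: np.ndarray, neutral_dist: float) -> float:
    box = detect_face(frame)                                    # step 1
    shape = fit_asm(frame, box)                                 # step 2
    norm = normalize_shape(shape, left_eye=36, right_eye=45)    # step 3
    return au12_intensity(norm, mouth_corner=48, eye_outer=36,  # step 4
                          neutral_dist=neutral_dist)

print(analyze_frame(np.zeros((120, 160)), neutral_dist=0.9))
```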
For a video of each emotion, we obtained a temporal series of intensity values for each AU. Each series was extracted and graphed. By observing these graphs and the videos, we formed the hypothesis that facial expressions occur in a series of decreasing intensity when a corresponding emotion is triggered. We therefore proposed pre-defined temporal patterns for the facial expressions of the six basic emotions (Figure 4). In these patterns, the solid-line part is always present, while the dashed-line part may be absent. As the patterns show, although the internal emotional state may keep a constant, sufficient intensity for a long time, the corresponding facial expressions are not always at the same intensity throughout that duration. Rather, the facial expressions appear with an intensity corresponding to the intensity of the emotion, stay in this state for a while, and then fall back near the initial state. We call this process a cycle. We define a cycle of facial expressions as

E = (P, Ts, Te, Do, Dr)

where P defines the target intensity of the expressions; Ts and Te are the starting time and the ending time of the cycle; and Do and Dr are the onset duration and offset duration of the expressions, respectively. The process in which the expressions occur within a cycle is described as a function of time, where φ+ and φ− are the functions that describe the onset and offset phases of the expressions:

φ+(x, Do) = exp((ln 2 / Do) · x) − 1
φ−(x, Dr) = exp(ln 2 − ((ln 2 − ln(Pa + 1)) / Dr) · x) − 1

Figure 4: (a) Temporal pattern for facial expressions of happiness and sadness; (b) temporal pattern for facial expressions of fear, anger, disgust, and surprise.
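Taking the onset and offset functions in the reconstructed form above, the sketch below assembles one expression cycle E = (P, Ts, Te, Do, Dr). The way the sustain segment is placed between onset and offset, and the choice Pa = 0 for the resting level, are our own assumptions for illustration rather than the exact formulation of [9].

```python
import math

def phi_onset(x: float, do: float) -> float:
    """Onset phase: rises from 0 at x = 0 to 1 at x = Do."""
    return math.exp(math.log(2.0) / do * x) - 1.0

def phi_offset(x: float, dr: float, pa: float = 0.0) -> float:
    """Offset phase: falls from 1 at x = 0 to Pa at x = Dr."""
    k = (math.log(2.0) - math.log(pa + 1.0)) / dr
    return math.exp(math.log(2.0) - k * x) - 1.0

def cycle_intensity(t: float, p: float, ts: float, te: float,
                    do: float, dr: float, pa: float = 0.0) -> float:
    """Expression intensity over one cycle E = (P, Ts, Te, Do, Dr):
    onset, sustain at P, then offset back toward the resting level Pa."""
    if t < ts or t > te:
        return pa
    x = t - ts
    if x <= do:                          # onset phase
        return p * phi_onset(x, do)
    if x >= (te - ts) - dr:              # offset phase
        return p * phi_offset(x - ((te - ts) - dr), dr, pa)
    return p                             # apex / sustain

# Example: a 4 s happiness cycle with a 0.5 s onset and a 1.5 s offset.
for t in (0.0, 0.25, 0.5, 2.0, 3.5, 4.0):
    print(t, round(cycle_intensity(t, p=0.8, ts=0.0, te=4.0, do=0.5, dr=1.5), 3))
```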
In order to verify the reasonableness of the pre-defined temporal patterns, we performed a fitting task on the temporal AU profiles. With the distance between the centers of the two eyes normalized to 1, the sum of squares due to error (SSE) of one such fit was 0.0207; performing the fitting task for all temporal AU profiles, we found that the average SSE was 0.055 with a standard deviation of 0.078. These values showed that the above temporal patterns and the fitting function were reasonable.

Based on the temporal patterns, we proposed the scheme illustrated in Figure 5 to improve the conversion of the continuous emotional states of an agent into facial expressions. The idea is that facial expressions occur in a series of decreasing intensity when a corresponding emotion is triggered. For example, when an event triggers a person's happiness, he or she will not smile at full intensity for as long as the happiness lasts; instead, he or she will express a series of smiles of decreasing intensity. Thus, emotional facial expressions appear only when there is a significant stimulus that changes the emotional state; otherwise, the expressions on the face are kept at a low level, displaying moods rather than emotions, even when the intensities of the emotions are high. The emotional expressions do not stay on the face for a long time even though the emotions decay slowly, whereas the expressions of moods can last much longer on the face.

Figure 5: The scheme to convert continuous emotional states of an agent to facial expressions

The Expression Mode Selection adjusts the series of ESVs over time so that the corresponding facial expressions happen temporally in a way similar to the temporal patterns. This module determines whether an emotional facial expression should be generated to express the current emotional state, or whether the expressions on the 3D face should be kept at a low level, displaying moods rather than emotions. It first checks whether there has been a significant increase in the intensity of any emotion i during the last Ti seconds (the duration of an emotional expression cycle), that is, whether

e_i(x) − e_i(x−1) > θ for some x with t − Ti ≤ x ≤ t,

where t is the current time and θ is the threshold for activating emotional facial expressions. (According to the analytic results on the video database, Ti has a value of about 3.5 for happiness, 5.3 for sadness, 3.6 for disgust, and 2.7 for surprise; corresponding values were obtained for anger and fear.) If there is a significant change, the ESV is converted directly to an FMCV using the fuzzy rule-based system proposed in [19], and the cycle counter cycle-tag_i is initialized: it is set to one value for happiness and sadness and to another for fear, anger, surprise, and disgust. If not, the ESV is normalized as follows, where t'_i is the time at which the most recent cycle ended and t is the current time:

• if cycle-tag_i indicates that a follow-up cycle remains and t falls within the next window of length Ti × 0.8 after t'_i, then e_i(t) is scaled to e_i(t) × 0.8 and cycle-tag_i is decreased;
• if cycle-tag_i indicates the subsequent cycle and t falls within the following window of length Ti × 0.6 after t'_i, then e_i(t) is scaled to e_i(t) × 0.6 and cycle-tag_i is decreased;
• otherwise, e_i(t) is normalized to a lower intensity.

In this way, the emotions are displayed as moods - the low-intensity, long-lasting state of emotions. After being normalized, the ESV is converted to an FMCV using the same fuzzy rule-based system [19].

C. Combination module

The Combination module creates lip movements when pronouncing Vietnamese emotional speech from the list of phonemes with temporal information and combines these movements with the emotional facial expressions produced by the EFE module.

Visemes for Vietnamese phonemes. In order to create lip movements for spoken Vietnamese, we first need a set of visemes for the face corresponding to the Vietnamese phonemes. As in our previous work [20], we follow the rules in [21] and [22] to specify the visemes corresponding to individual Vietnamese phonemes. According to [21], Vietnamese phonemes are divided into two types: vowels and consonants. The visemes of the vowels are distinguished and expressed according to three main factors: the position of the tongue, the degree of openness of the mouth, and the shape of the lips. By the openness of the mouth, the vowels are divided into four categories: close vowels (i), semi-close vowels (ê), semi-open vowels (e), and open vowels (a); the narrow-wide property of a vowel is determined by how widely the mouth opens. By the shape of the lips, the vowels are separated into two types: round-lip vowels (o, ô) and unround-lip vowels (ơ); the round or unround property is determined by the shape of the lips. Figure 6 shows the relationships between the vowels and these two properties: the horizontal lines express the degree of openness of the mouth, while the vertical lines express the shape of the lips, with the left part showing the unround-lip vowels and the right part the round-lip vowels. The visemes of the consonants are distinguished and expressed according to two main factors: where and how the phonemes are pronounced. According to the first factor, consonants are divided into three types: lip consonants (b, p, v, ph), tongue consonants (đ, ch, c, k), and fauces consonants (h).

Figure 6: Vowel trapezium

Because the 3D face model [10] that we use simulates the effects of vector muscles, a sphincter muscle, and jaw rotation, it can display facial movements during Vietnamese speech. The openness of the mouth corresponds to the amount of jaw rotation, and the roundness of the lips depends on the muscles that affect the lips. For simplicity, vowels that are fairly similar are merged into one group. To create the vowel visemes, the amount of jaw rotation and the contraction degrees of the muscles affecting the lips are initially determined based on the vowel trapezium; these values are then refined manually by comparing the vowel visemes of the 3D face with the vowel visemes of a real human face. To create the visemes for consonants, we consider only the position at which a phoneme is pronounced. According to this factor, we divide the consonants into three types: lip-lip consonants, lip-tooth consonants, and a last type comprising the remaining consonants. We follow the rules in [21] and [22] to create the initial visemes for the consonants, and afterwards we also refine them manually, as we do with the vowel visemes.
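As an illustration of how the vowel-trapezium properties map onto the two face-model controls mentioned above, the sketch below assigns each vowel an openness value (driving jaw rotation) and a rounding value (driving the lip muscles), and groups consonants by place of articulation. All numeric values are invented placeholders; in the actual system the initial values come from the trapezium and are then hand-tuned against a real face.

```python
from dataclasses import dataclass

@dataclass
class Viseme:
    jaw_rotation: float   # 0 = closed .. 1 = fully open
    lip_rounding: float   # 0 = unrounded .. 1 = fully rounded

# Openness classes from the vowel trapezium (close / semi-close / semi-open / open);
# all numbers are illustrative placeholders.
OPENNESS = {"i": 0.15, "ê": 0.35, "e": 0.55, "a": 0.85, "o": 0.55, "ô": 0.35, "ơ": 0.45}
ROUNDING = {"i": 0.0, "ê": 0.0, "e": 0.0, "a": 0.1, "o": 0.8, "ô": 0.9, "ơ": 0.0}

# Consonants grouped by where they are articulated.
LIP_LIP   = {"b", "p"}
LIP_TOOTH = {"v", "ph"}

def vowel_viseme(v: str) -> Viseme:
    return Viseme(jaw_rotation=OPENNESS[v], lip_rounding=ROUNDING[v])

def consonant_viseme(c: str) -> Viseme:
    if c in LIP_LIP:
        return Viseme(jaw_rotation=0.05, lip_rounding=0.2)   # lips pressed together
    if c in LIP_TOOTH:
        return Viseme(jaw_rotation=0.15, lip_rounding=0.1)   # lower lip to teeth
    return Viseme(jaw_rotation=0.30, lip_rounding=0.0)       # remaining consonants

print(vowel_viseme("ô"), consonant_viseme("b"))
```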
Combination of lip movements when talking. Human speech consists of paragraphs, sentences, or a few words; these comprise a sequence of phonemes, and several phonemes form a word. For each single phoneme we already have a specific viseme. The requirement now is to make the movement from one viseme (e.g., V1) to another viseme (e.g., V2) gradual and smooth, so that the lip movement during speech looks realistic. The simplest way is to create intermediate visemes between V1 and V2 by averaging their corresponding parameter values. However, this is not a good choice, because the articulation of a speech segment is not self-contained: it depends on the preceding and upcoming segments. In our approach, we apply the dominance model [23] to create a coarticulation effect on the lip movements while talking. Coarticulation is the blending effect that surrounding phonemes have on the current phoneme. In [23], the lip movement corresponding to a speech segment is represented as a viseme segment. Each viseme segment has a dominance over the vocal articulators which increases and decreases over time during articulation; this dominance function specifies how closely the lips come to reaching the target values of the viseme. A blending over time of the articulations is created by overlapping the dominance functions of adjacent movements corresponding to articulatory commands. Each movement has a set of dominance functions, one for each parameter, and different dominance functions can overlap for a given movement. The weighted average of all co-occurring dominance functions produces the final lip shape.
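The sketch below illustrates this dominance-model idea with a negative-exponential dominance function and a weighted average of the co-occurring viseme targets for a single lip parameter. The exact functional form and the α, θ values are illustrative choices in the spirit of [23], not the parameters used in our system.

```python
import math
from typing import List, Tuple

def dominance(t: float, center: float, alpha: float, theta: float,
              c: float = 1.0) -> float:
    """Negative-exponential dominance of a viseme segment: it peaks at the
    segment's centre time and decays on both sides."""
    return alpha * math.exp(-theta * abs(t - center) ** c)

def blended_parameter(t: float,
                      segments: List[Tuple[float, float, float, float]]) -> float:
    """Weighted average of all co-occurring viseme targets for one lip parameter.

    Each segment is (centre_time, target_value, alpha, theta).
    """
    num = den = 0.0
    for center, target, alpha, theta in segments:
        d = dominance(t, center, alpha, theta)
        num += d * target
        den += d
    return num / den if den > 0 else 0.0

# Toy example: lip-rounding targets for /b/ (unrounded) followed by /ô/ (rounded).
segments = [(0.10, 0.2, 1.0, 12.0),   # /b/
            (0.30, 0.9, 1.0, 8.0)]    # /ô/
for t in (0.05, 0.15, 0.25, 0.35):
    print(round(t, 2), round(blended_parameter(t, segments), 3))
```

Because the dominance of the upcoming rounded vowel already overlaps the consonant, the rounding value rises before /ô/ is reached, which is exactly the coarticulation effect the simple viseme-averaging approach cannot produce.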
Combination of emotional expression and facial movement during speech. In order to combine the emotional facial expressions (the output of the EFE module) with the above lip movements for spoken Vietnamese, we apply the approach proposed in [24]. The authors divided facial movements into groups, called channels, according to their type, such as emotion displays, lip movements when talking, etc. They then presented schemes for combining movements within one channel and for combining movements across different channels. These schemes resolve potentially conflicting muscles in order to eliminate unnatural facial animations: at any given time, when there is a conflict between parameters in different animation channels, the parameters involved in the movement with the higher priority dominate those with the lower priority. In our talking face, we give the higher priority to the lip movements when talking. The final facial animations resulting from the combination are displayed on the 3D talking face, synchronized with the synthesized emotional speech.

III. EVALUATION

In order to test our talking face, we use ParleE, an emotion model for a conversational agent [25], and place the face in the football supporter domain [26]. ParleE is a quantitative, flexible, and adaptive model of emotions in which the appraisal of events is based on learning and a probabilistic planning algorithm. ParleE also models personality and motivational states and their roles in determining the way the agent experiences emotion. The model was developed to enable conversational agents to respond to events with appropriate expressions of emotions at different intensities. We place the face in the domain of a football supporter [26] because football is an emotional game: many events in the game trigger emotions not only in the players but also in the coaches, supporters, etc. Testing the face in the football supporter domain gives us the chance to test many emotions as well as the dynamics of emotions, because the actions in a football match happen fast. Our talking face plays the role of the face of a football (soccer) supporter agent. The agent is watching a football match in which a team that he supports is playing. The agent can experience different emotions by appraising events based on his goals, standards, and preferences; these emotions are then shown on the face and in the voice of our talking face. In short, the purpose of using ParleE and the football supporter domain is to provide good input for testing our talking face.

Figure 7: A picture of the talking face

Figure 7 shows a picture of our talking face. We performed an experiment to collect evaluations of its ability to express continuous emotional states. Following Katherine Isbister and Patrick Doyle [27], we selected the user test method for evaluating experiments related to emotions and facial expressions. To obtain the users' assessments, we showed them a clip of the talking face and then asked them to answer some questions. The face was tested with 20 users (10 males and 10 females) aged between 15 and 35, with an average age of 27 years. Each user test session took about 17 minutes. Sessions began with a brief introduction to the experiment process and the talking face. The user then watched a short clip of the face. Finally, each user was interviewed separately about his/her assessment of the face. We asked a total of four questions, as shown in Figure 8. According to the users' assessments, the talking face was able to express emotions on the face and in the voice quite naturally.

Figure 8: Summary of interview results from the user test

IV. CONCLUSION

We have presented a three-dimensional face with the ability to speak emotional Vietnamese speech and to naturally express emotions while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words, and at the same time it can show emotional facial expressions while speaking. We have tested the face in the football supporter domain in order to confirm its naturalness: the face was simulated as the face of a football supporter which experiences emotions and expresses them in his voice as well as on his face. The experimental results show that our talking face is able to express emotions on the face and in the voice quite naturally.
REFERENCES

[1] R. Picard, Affective Computing. MIT Press, Cambridge, MA, 1997.
[2] D. H. Gelernter, The Muse in the Machine. Free Press, New York, 1994.
[3] A. R. Damasio, Descartes' Error: Emotion, Reason, and the Human Brain. G. P. Putnam, New York, 1994.
[4] F. Thomas and O. Johnston, The Illusion of Life. Abbeville Press, New York, 1981.
[5] C. Pelachaud, "Modelling multimodal expression of emotion in a virtual agent," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1535, pp. 3539-3548, 2009.
[6] M. C. Prasetyahadi, I. R. Ali, A. H. Basori, and N. Saari, "Eye, lip and crying expression for virtual human," International Journal of Interactive Digital Media, vol. 1, no. 2, 2013.
[7] G. Collier, Emotional Expression. Lawrence Erlbaum Associates, New Jersey, 1985.
[8] T. D. Ngo, M. Akagi, and T. D. Bui, "Toward a rule-based synthesis of Vietnamese emotional speech," in Proc. of the Sixth International Conference on Knowledge and Systems Engineering (KSE 2014), pp. 129-142.
[9] T. D. Ngo, T. H. N. Vu, V. H. Nguyen, and T. D. Bui, "Improving simulation of continuous emotional facial expressions by analyzing videos of human facial activities," in Proc. of the 17th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2014), pp. 222-237.
[10] T. D. Bui, D. Heylen, and A. Nijholt, "Improvements on a simple muscle-based 3D face for realistic facial expressions," in Proc. CASA 2003, 2003, pp. 33-40.
[11] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, pp. 187-207, 1999.
[12] WaveSurfer: http://www.speech.kth.se/wavesurfer/index.html
[13] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 317-321, 2005.
[14] F. Wallhoff, "The facial expressions and emotions database homepage (FEEDTUM)," www.mmk.ei.tum.de/~waf/fgnet/feedtum.html
[15] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, "DISFA: A spontaneous facial action intensity database," IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151-160, 2013.
[16] P. Viola and M. Jones, "Robust real-time object detection," Cambridge Research Laboratory Technical Report Series, no. 2, 2001.
[17] T. Cootes, C. Taylor, D. Cooper, and J. Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, 1995.
[18] P. Ekman and W. V. Friesen, Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press, 1978.
[19] T. D. Bui, D. Heylen, M. Poel, and A. Nijholt, "Generation of facial expressions from emotion using a fuzzy rule based system," in Australian Joint Conference on Artificial Intelligence (AI 2001), Lecture Notes in Computer Science, Springer, Berlin, 2001, pp. 83-95.
[20] T. D. Ngo, N. L. Tran, Q. K. Le, C. H. Pham, and L. H. Bui, "An approach for building a Vietnamese talking face," Journal on Information and Communication Technologies, no. 6(26), 2011.
[21] X. T. Đỗ and H. T. Lê, Giáo trình tiếng Việt. Nhà xuất bản Đại học Sư phạm, 2007.
[22] T. L. Nguyễn and T. H. Nguyễn, Tiếng Việt (Ngữ âm - Phong cách học). Nhà xuất bản Đại học Sư phạm, 2007.
[23] M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation, Springer, 1993, pp. 139-156.
[24] T. D. Bui, D. Heylen, and A. Nijholt, "Combination of facial movements on a 3D talking head," in Proc. of Computer Graphics International, 2004, pp. 284-290.
[25] ——, "ParleE: An adaptive plan-based event appraisal model of emotions," in Proc. KI 2002: Advances in Artificial Intelligence, pp. 129-143.
[26] ——, "Building embodied agents that experience and express emotions: A football supporter as an example," in Proc. of the 17th Annual Conference on Computer Animation and Social Agents (CASA 2004), 2004.
[27] K. Isbister and P. Doyle, "Design and evaluation of embodied conversational agents: a proposed taxonomy," in Proc. of the AAMAS 2002 Workshop on Embodied Conversational Agents: Let's Specify and Evaluate Them!, Bologna, Italy, 2002.