How Vietnamese attitudes can be recognized and confused: Cross-cultural perception and speech prosody analysis

Dang-Khoa Mac, Eric Castelli
International Research Center MICA, HUST-CNRS/UMI 2954, Grenoble INP
Hanoi, Vietnam
{dang-khoa.mac, eric.castelli}@mica.edu.vn

Véronique Aubergé (1), Albert Rilliard (2)
(1) Laboratory of Informatics of Grenoble (LIG), Grenoble, France
(2) LIMSI CNRS, Orsay, France
(1) veronique.auberge@imag.fr, (2) albert.rilliard@limsi.fr

Abstract - Prosodic attitudes, or social affects, are a main part of face-to-face interaction and are linked to language through culture. This paper presents a study on prosodic attitudes in Vietnamese, a tonal language. Perception experiments on 16 Vietnamese attitudes were carried out with Vietnamese and French participants. The results revealed perception differences between native and non-native listeners. Because attitudinal expressions are partly carried by speech prosody, a prosodic analysis was also carried out to better understand why these attitudes are recognized or confused, and to bring out some prosodic characteristics of Vietnamese social affects.

Keywords - Vietnamese, attitude, perception, prosodic analysis

I. INTRODUCTION

During communication between humans, speech is an important information channel for expressing mental, intentional, attitudinal and emotional states. According to some theoretical models of affect [1], affective expression in speech communication may be controlled at different levels of cognitive processing, from the involuntarily controlled expressions of emotion to the intentionally, voluntarily controlled expressions of attitudes. Attitudes and emotions can therefore be distinguished by the nature of the control exerted by the speaker (voluntary vs. involuntary) [2]. Some types of expressivity may be expressed as either an attitude or an emotion.
For example, “surprise” can be considered an attitude when it is expressed through a voluntary process; otherwise it can be considered an emotion. Attitude expression carries the intentions and points of view of the speaker (e.g. surprise, confirmation, politeness, etc.) [3]. Attitudes are constructed for each language and each culture, and they need to be learned by children or by second-language students [5].

Because attitudinal expressions are constructed for a particular language and culture, they can differ between languages. Some attitudes can be expected to have a universal value (e.g. “surprise”), but attitudes specific to one language may not be recognized, or may be ambiguous, in another language [7]. The understanding of this phenomenon may benefit from cross-cultural studies [3,6,7].

The important role of prosody in emotional and attitudinal expression has been shown in many studies [4,9]. According to [4], some emotions can be characterized by the mean level and the range of F0. That study also showed that different emotions have different contour shapes. In a tonal language such as Vietnamese, the acoustic parameters involved in the linguistic and affective functions of prosody (F0, intensity, timing) also play an important role at the phonemic level for lexical access. Moreover, the Vietnamese tones use voice quality settings such as creaky voice [12], which are used in the morphology of attitudes and emotions in some other languages [11].

After presenting the corpus, we describe the perceptual experiment with Vietnamese and French participants. Its results show the differences in attitude perception between native and non-native listeners. Then, a prosodic analysis is presented and discussed to provide some explanations for the results of the perception test. The paper concludes with a discussion.

II. EXPERIMENTS

A. The corpus

In studies of social affects in different languages [5,10], the attitudes were selected on the basis of the foreign-language teaching (didactics) literature. Unfortunately, Vietnamese is an under-resourced language and there are few studies of Vietnamese expressive speech. We found only one study [12], which describes 16 Vietnamese attitudes (cf. Table I). These attitudes were selected and recorded audio-visually, performed by a male native speaker from Hanoi (standard pronunciation of Vietnamese). However, for the purpose of prosodic analysis, this paper addresses only the audio information of the Vietnamese attitudes.

TABLE I. SELECTION OF 16 VIETNAMESE ATTITUDES, WITH THEIR ABBREVIATIONS

Attitude                           Abbr.   Attitude                  Abbr.
Declaration                        DEC     Irritation                IRR
Interrogation                      INT     Sarcastic irony           SAR
Exclamation of neutral surprise    EXo     Scorn                     SCO
Exclamation of positive surprise   EXp     Politeness                POL
Exclamation of negative surprise   EXn     Admiration                ADM
Obviousness                        OBV     Infant-directed speech    IDS
Doubt-Incredulity                  DOU     Seduction                 SED
Authority                          AUT     Colloquial                COL

B. Perception tests

The perception test was carried out to study how native and non-native listeners recognize and confuse the 16 Vietnamese attitudes. To examine the influence of sentence length, three sentences, of one, two and five syllables, were chosen from the corpus. To control for a possible effect of Vietnamese tone on the perception of attitudes, all syllables were produced with tone 1 (the level tone). The perception test therefore comprises 48 stimuli (3 sentences * 16 attitudes).

2011 International Conference on Asian Language Processing 978-0-7695-4554-7/11 $26.00 © 2011 IEEE DOI 10.1109/IALP.2011.39

Forty listeners participated in this experiment: 20 Vietnamese listeners (10 men and 10 women) who speak the same dialect as the speaker, and 20 French listeners (10 men and 10 women) who had never been exposed to the Vietnamese language. The test interface gave them the labels and the definitions of the 16 attitudes (in the native language of the listeners).
No listener expressed any difficulty in understanding the concepts of these 16 attitudes. All subjects listened to each stimulus only once. After each stimulus, they were asked to indicate the perceived attitude among the 16 presented ones.

C. Result analysis

Effect of factors: First, a repeated-measures ANOVA was carried out to evaluate the relative importance of the following factors on the listeners’ perception: the sentence length (number of syllables), the listeners’ linguistic background (native or non-native) and the listeners’ gender. The ANOVA shows that the listeners’ linguistic background has a significant effect on perception (p<0.01): Vietnamese and French listeners do not perceive these expressions in the same way. In contrast, sentence length (number of syllables) and the listeners’ gender have no significant effect on perception at the 1% level (p>0.01).

TABLE II. THE OUTPUT OF ANOVA IN PERCENT OF GOOD ANSWERS. SIGNIFICANT EFFECTS AT THE 1% LEVEL ARE MARKED WITH AN ASTERISK.

Factors                               df   F          p
Attitude                              15   28.700     0.000 *
Listener (Vietnamese or French)       1    1286.772   0.000 *
Gender of listener                    1    3.754      0.053
Sentence length (num. of syllables)   2    1.376      0.253

Attitude recognition: Figure 1 presents the recognition rates (in percent) of the 16 attitudes for both groups of listeners. Overall, most of the attitudes were recognized above chance level, and native listeners had higher recognition scores than foreign ones. Some attitudes were well recognized by both Vietnamese and French listeners: DEC, AUT, IRR, SAR, SED.

Figure 1. Recognition rate of 16 attitudes by Vietnamese and French listeners. The dashed line indicates the chance level (6.25%).

Some other attitudes received low recognition scores (POL) or were not recognized by either group of listeners (ADM). The SCO and IDS attitudes were well recognized by Vietnamese listeners but hardly recognized at all by the French listeners.
Conversely, the EXn attitude was recognized by the French listeners, but not by the Vietnamese ones.

Attitude confusion: The analysis of the confusions between attitudes gives interesting details on the perceptual proximity between the 16 expressive labels. From the confusion matrices, confusion graphs (cf. Figure 2) were built, reporting all the confusions higher than twice the chance level (i.e. 12.5%). For both Vietnamese and French listeners, ADM was not recognized: it was mixed with COL and EXo by Vietnamese listeners, and with COL and IDS by French listeners. Vietnamese listeners did not recognize the EXn attitude and mixed it with EXo and DOU. French listeners did not recognize IDS and mixed it with SAR or DOU. Vietnamese listeners made reciprocal confusions between some pairs or groups of attitudes: SAR and SCO; POL and DEC; SED and COL; EXo, EXn and DOU. French listeners made reciprocal confusions between AUT and IRR; DEC and OBV; DOU and EXn; DOU and EXo.

Figure 2. Confusion graphs (in percentage of recognition) for Vietnamese (top) and French (bottom) listeners. The reciprocal confusions are in bold.

Some similarities can be found in the confusions made by Vietnamese and French listeners. Both groups made reciprocal confusions between EXn, DOU and EXo. They strongly confused EXp with EXn (>30%) and IRR with AUT (about 25%). They also confused POL, COL, INT and EXn with DEC.

However, there are some differences between the two groups. SED was strongly confused with COL (33% of confusion) by Vietnamese listeners, but not by French listeners. For Vietnamese listeners, SAR and SCO show strong reciprocal confusions, while the French listeners show no confusion between these two attitudes.

III. PROSODIC ANALYSIS

A prosodic analysis was carried out to give some acoustic explanations for the recognition and confusion of the 16 Vietnamese attitudes. According to the ANOVA (cf. Table II), sentence length has no influence on the perception of attitudes. Of the three sentence types, only the five-syllable sentence has the complete structure of a Vietnamese sentence (Subject-Verb-Object). The five-syllable sentence also allows us to analyze the variations of the prosodic parameters in the different parts of the sentence (first, middle and last part). Therefore, and to save space, the prosodic analysis was carried out only on the five-syllable sentence.

A. Principal Component Analysis (PCA)

The audio signals of the 16 attitudes were phonetically segmented by hand. Three acoustic parameters were extracted automatically: F0 (in semitones calculated with 1 Hz as the reference value), syllabic duration (in seconds), and intensity (in dB). We calculated the mean values of F0 and intensity over each sentence (F0_mean, Int_mean), the slope over the last syllable (F0_final_slope, Int_Final_slope) and the slope over the whole sentence (i.e., the mean value of the last syllable minus the mean value of the first syllable: F0_slope, Int_slope). For the syllabic duration, the mean (dur_mean) and the length of the final syllable (final_length) were calculated.

Using the parameters described above as features, separate Principal Component Analyses were carried out (for F0, intensity and duration) to see how well these acoustic parameters distinguish the 16 attitudes (Figure 3).

In the PCA based on the F0 parameters, F0 slope separates the 16 attitudes into two groups: attitudes with a rising F0 contour (EXp, IRR, EXo, DOU, EXn, ADM, OBV) and the others, with a falling F0 contour. The F0 final slope shows that the attitudes ADM, EXn, DOU, INT and DEC have a rising F0 on the last syllable, while OBV, AUT and IDS have a falling F0 on the last syllable. IRR and EXp are characterized by a high F0 mean and a high positive F0 slope. OBV and AUT are distinguished from the other attitudes by a very low, negative F0 final slope.
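The summary parameters listed in this section can be illustrated with a short sketch. It is not the authors' extraction script: the function names, the per-syllable data layout, the assumption of evenly spaced F0 samples, and the use of a least-squares fit for the final-syllable slope are choices made here for illustration only. What comes from the text is the semitone scale with a 1 Hz reference and the definition of the sentence slope as the last-syllable mean minus the first-syllable mean. The toy values are invented and are not data from the corpus.

```python
import math
from statistics import mean


def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Convert a frequency in Hz to semitones relative to ref_hz (1 Hz in the paper)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def ls_slope(times, values):
    """Least-squares slope of values against times (value units per second)."""
    t_bar, v_bar = mean(times), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def prosodic_features(syllables):
    """Summarize one sentence.

    syllables: list of (start_s, end_s, f0_hz) tuples in temporal order, where
    f0_hz holds at least two F0 samples assumed evenly spaced in the syllable
    (a hypothetical layout, not the paper's file format).
    """
    f0_st = [[hz_to_semitones(f) for f in f0] for _, _, f0 in syllables]
    durations = [end - start for start, end, _ in syllables]

    # Sample times for the last syllable, placed at the centre of equal-width bins.
    start, end, last_f0 = syllables[-1]
    step = (end - start) / len(last_f0)
    last_times = [start + (i + 0.5) * step for i in range(len(last_f0))]

    return {
        "F0_mean": mean(v for syl in f0_st for v in syl),
        # Sentence slope as defined in the text: last-syllable mean minus first-syllable mean.
        "F0_slope": mean(f0_st[-1]) - mean(f0_st[0]),
        # Assumption: final slope taken as a linear fit over the last syllable.
        "F0_final_slope": ls_slope(last_times, f0_st[-1]),
        "dur_mean": mean(durations),
        "final_length": durations[-1],
    }


# Toy two-syllable sentence (invented values).
toy = [(0.00, 0.20, [100.0, 100.0, 100.0]), (0.20, 0.50, [200.0, 200.0, 200.0])]
features = prosodic_features(toy)
```

The intensity parameters (Int_mean, Int_slope, Int_Final_slope) would follow the same pattern with dB values in place of semitones, and the resulting feature vectors, one per attitude, are the kind of input a PCA would take.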
In the PCA based on intensity, the mean intensity parameter brings out some attitudes with very low intensity (ADM, COL, SED, SCO, POL). AUT, IDS and EXp have the highest mean intensity and a positive final slope. The Int_Slope parameter is important for distinguishing IRR (highest positive slope) and SED (lowest negative slope).

With the duration parameters, IDS, SCO and SAR are separated from the other attitudes by a high mean duration. IDS is further distinguished by high values of both the mean duration and the length of the last syllable.

B. Prosodic contours comparison

For all attitudes, the F0 contours were extracted (in semitones calculated with 1 Hz as the reference value) to examine the similarities and the specific shapes of the intonation contours. Figure 4 shows the F0 contours (in semitones) of the five-syllable sentence for the 16 Vietnamese attitudes. Overall, most attitudes last from 0.8 to 1 s. However, three attitudes (SCO, SAR, IDS) are about twice as long as the others.

Figure 3. Two main dimensions of PCA for 16 attitudes, based on F0 (top), Intensity (middle) and Duration (bottom)

For most attitudes, the F0 curves in the middle of the sentence (from the second syllable to the next-to-last syllable) are nearly the same. The F0 contours of the attitudes differ mostly on the first and the last syllables. Studies of other languages have also shown the informative weight of the first syllable [8]. In the case of Vietnamese, the attitudes AUT, IRR, OBV and EXp have a first syllable with a long duration and a rising F0. Among them, IRR has a level contour on the last syllable, while EXp, OBV and AUT have a falling contour on the last syllable.

The F0 contours of DEC, POL and ADM are nearly the same, with a flat shape on all syllables. That may explain why they were confused in the perception test. INT, EXn and DOU have the same shape on the last syllable (slightly rising), which may cause some confusion between them.
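The reasoning in this subsection links perceptual confusion to F0 contours that look alike, a judgment made by eye from Figure 4. One possible way to put a number on "similar shape" is sketched below. It is an illustration added here, not a measure used in the study. It assumes each contour is a list of F0 values in semitones, resamples both contours onto the same normalized time axis, removes each contour's mean so that overall register does not count, and reports the RMS difference. The function names, the 20-point resolution and the toy contours are all arbitrary, hypothetical choices.

```python
import math


def resample(contour, n_points=20):
    """Linearly interpolate a contour onto n_points equally spaced on normalized time."""
    if len(contour) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    out = []
    for i in range(n_points):
        pos = i * (len(contour) - 1) / (n_points - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1.0 - frac) + contour[hi] * frac)
    return out


def shape_distance(a_st, b_st, n_points=20):
    """RMS difference (in semitones) between two mean-centred, time-normalized contours."""
    a = resample(a_st, n_points)
    b = resample(b_st, n_points)
    a_mean = sum(a) / n_points
    b_mean = sum(b) / n_points
    squared = [((x - a_mean) - (y - b_mean)) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(squared) / n_points)


# Invented contours in semitones: two flat contours in different registers, one rising.
flat_low = [80.0, 80.0, 80.0]
flat_high = [90.0, 90.0, 90.0, 90.0]
rising = [80.0, 82.0, 84.0, 86.0]

d_flat = shape_distance(flat_low, flat_high)   # same shape, different register
d_rise = shape_distance(flat_low, rising)      # different shape
```

Under this measure, two flat contours score as identical in shape even when one is higher than the other, which matches the kind of similarity the text describes for DEC, POL and ADM. Other choices (keeping the register, weighting the first and last syllables more heavily) would be equally defensible.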
According to the perception test, Vietnamese listeners recognized the SAR and SCO attitudes, but with a strong reciprocal confusion. This result can also be explained by the similar shapes of their F0 contours. Both attitudes have a long overall duration, due to a marked lengthening of their first and last syllables. Their F0 contours rise rapidly from the first syllable and fall after the second syllable. EXp and OBV have a distinctive shape on the last syllable, which rises at the beginning but falls rapidly at the end. IDS can also be distinguished from the other attitudes by having the longest duration.

IV. DISCUSSION AND CONCLUSIONS

Using a cross-cultural perception test, 16 Vietnamese attitudes were evaluated by native and non-native listeners. The experimental results do not show any significant effect of the listener’s gender or of sentence length. In contrast, there are clear differences between the perception of native and non-native listeners. Some attitudes, such as DEC, AUT, IRR, SAR and SED, were well recognized by both Vietnamese and French listeners. One can suppose that the concepts and the expressions of these attitudes are similar in the two languages and the two cultures. Other attitudes are recognized by native listeners but hardly recognized at all by non-native ones (SCO and IDS). Such attitudes are presumably encoded using different strategies by Vietnamese and French speakers.

The fact that some attitudes were not recognized by the Vietnamese listeners, by the French listeners, or by both groups may be explained by the assumption that such attitudes cannot be distinguished satisfactorily from others on the basis of audio information alone, outside any pertinent interaction context: the listeners may need more information, particularly visual information from the face or from gestures, to distinguish such attitudes.
This raises interesting questions for future research on audio-visual perception and on the analysis of facial parameters. This is particularly the case for the EXn attitude, which is not recognized by natives while non-natives do recognize it: the subtle variations of prosody may not be sufficient when they also have to compete with the sentence’s meaning, a problem that non-native listeners do not have.

The prosodic analysis offered some reasonable explanations for the recognition and confusion of these 16 attitudes. It also gives us some basic characteristics of Vietnamese attitudes. These are the basic results for our future work on modeling Vietnamese prosodic attitudes. However, this analysis was limited to three prosodic parameters (F0, intensity and duration). Future work will also deal with voice quality analysis and visual parameter analysis, in order to provide a more complete description of Vietnamese social affects. Future work will also explore the importance of the tonal system in the production and the perception of Vietnamese attitudes, not only for native speakers but also for foreign speakers without any linguistic knowledge of a tonal language: will they be able to separate tonal from attitudinal information?

Figure 4. The F0 contours of 5 syllables-length sentences for 16 Vietnamese attitudes

REFERENCES

[1] K. R. Scherer and H. Ellgring, “Multimodal Expression of Emotion: Affect Programs or Componential Appraisal Patterns?”, Emotion, 7(1), pp. 158-171, 2007.
[2] V. Aubergé, “A Gestalt Morphology of Prosody Directed by Functions: the Example of a Step by Step Model Developed at ICP”, Speech Prosody, 2002.
[3] F. Danes, “Involvement with language and in language”, Journal of Pragmatics, 22, pp. 251-264, 1994.
[4] T. Banziger and K. R. Scherer, “The role of intonation in emotional expressions”, Speech Communication, 46(3-4), pp. 252-267, 2005.
[5] P. Delattre, “Les dix intonations de base du français”, The French Review, 40(1), pp. 1-14, 1966.
[6] S. Shigeno, “Cultural similarities and differences in the recognition of audio-visual speech stimuli”, ICSLP98, 1998.
[7] K. R. Scherer, R. Banse and H. G. Wallbott, “Emotion inferences from vocal expression correlate across languages and cultures”, Journal of Cross-Cultural Psychology, 32(1), pp. 76-92, 2001.
[8] V. Aubergé, T. Grépillat and A. Rilliard, “Can we perceive attitudes before the end of sentences? The gating paradigm for prosodic contours”, 5th Eurospeech, 1997.
[9] S. Mozziconacci, “Prosody and Emotion”, Speech Prosody, 2002.
[10] M. L. Diaféria, “Les Attitudes de l’Anglais : Premiers Indices Prosodiques”, Master thesis, INP Grenoble, France, 2002.
[11] C. Gobl and A. Ni Chasaide, “The role of voice quality in communicating emotion, mood and attitude”, Speech Communication, 40(1-2), pp. 189-212, 2003.
[12] T. X. Le, “Etude contrastive de l’intonation expressive en français et en vietnamien”, PhD thesis of Linguistic and Phonetic, Université Paris 3, 1989.