DSpace at VNU: A study on prosody of Vietnamese emotional speech

2012 Fourth International Conference on Knowledge and Systems Engineering

A Study on Prosody of Vietnamese Emotional Speech

Thi Duyen Ngo, The Duy Bui
Human Machine Interaction Laboratory
University of Engineering and Technology
Vietnam National University, Hanoi, Vietnam
duyennt@vnu.edu.vn

978-0-7695-4760-2/12 $26.00 © 2012 IEEE
DOI 10.1109/KSE.2012.19

Abstract

This paper describes analyses of the prosody of Vietnamese emotional speech, carried out to find the relations between prosodic variations and emotional states in Vietnamese speech. These relations were obtained by comparing the prosodic features of Vietnamese emotional speech with those of neutral speech. The analyses were performed on a multi-style emotional speech database consisting of Vietnamese sentences uttered in different styles. Specifically, four emotional styles were considered: happiness, sadness, cold anger, and hot anger. Speech data in the neutral style were also collected, and the prosodic differences of each style with respect to this neutral baseline were quantified. The acoustic features related to prosody that were investigated were fundamental frequency, power, and duration. From the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotions into Vietnamese synthesized speech, making it more natural.

Keywords: Vietnamese, Prosody, Acoustic Feature, Emotional Speech

1 Introduction

Speech is one of the most convenient and important ways that humans use to communicate with each other. We do not use only linguistic meaning to convey our intentions and feelings; we also, consciously or unconsciously, inject our emotions into speech. Emotion plays an extremely important role in communication. For this reason, researchers have been trying to bring emotions into virtual or simulated worlds in order to improve their naturalness. To realize a more familiar human interface for a spoken dialogue system, attempts to add emotions to synthesized speech are needed. To carry out these attempts, on the acoustic side, we need detailed knowledge of how the acoustic characteristics of the voice are related to emotions.

The literature shows that two types of acoustic cues have great influence on the emotional state conveyed in speech: one is related to prosody and the other to voice quality. A prosodic change in an utterance leads to a change in the perception of emotional speech [6]. Prosody is therefore an important factor to investigate when looking for acoustic feature variations related to emotional states in speech. Voice quality is another acoustic cue that has received much attention from researchers. In this study we focus on prosody; voice quality will be examined in later work.

In this paper, we describe analyses of the prosody of Vietnamese emotional speech, carried out to find the relations between prosodic variations and emotional states in Vietnamese speech. Specifically, a Vietnamese emotional speech database was recorded and analysed to verify the correlations and to quantify, for each emotional style, the prosodic feature variations with respect to the neutral style. The database consisted of Vietnamese sentences uttered in five different styles: neutral, happiness, sadness, cold anger, and hot anger. From the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotions into Vietnamese synthesized speech, making it more natural.

The rest of the paper is organized as follows. Section 2 presents a summary of related works. The composition and acquisition of the emotional speech database are described in Section 3. In Section 4, we describe the prosodic feature extraction phase and the analysis results. Finally, Section 5 presents the conclusion and future works.

2 Related Works

In speech, prosody is essentially a collection of factors that control the pitch, loudness, and rate of speaking. Variations of intonation, rhythm, and stress pattern belong to what we call the prosody of a sentence. Depending on the emotional state of the speaker, a sentence can be uttered with different prosodic characteristics. The prosodic variations in an utterance therefore have great influence on the emotions expressed in speech [6]. This is why prosody is an important factor to investigate when looking for acoustic feature variations related to emotional states in speech. On the acoustic side, the cues considered significant for prosody are largely extracted from fundamental frequency (F0), power, and duration.

Fundamental frequency (F0). Physically, F0 reflects the pitch perceived by listeners. The F0 contour, which represents the change of F0 over time, provides information about the accent and intonation of an utterance. Such information has great influence on the perception of emotional states in speech. In emotional speech research, F0 is therefore the acoustic cue that has been studied earliest and most frequently. Erickson [2] summarized previous research on which types of acoustic cues are related to emotional states in speech. Most works found that the F0 contour had a strong effect on emotional states in speech, regardless of the data collection method and the language used.

Power. Power, which is determined by the volume of air flow sent out by the lungs, primarily reflects the loudness perceived by listeners. Like the F0 contour, the power envelope also affects emotional states in speech. Power can vary widely when the speaker is in different emotional states. The relationship between the power envelope and emotional states in speech has been reported in a number of studies (e.g. [3, 10, 14]).

Duration. Duration primarily reflects the time-related factors of sound that listeners perceive, such as pause length and the total length of utterances. The same word or sentence uttered with different lengths can be perceived differently. In emotional speech research, several works have shown an effect of duration on emotional states in speech in different languages, namely Japanese, English, and Italian [3, 12, 13, 14].

As a monosyllabic and tonal language, Vietnamese has particular prosodic features in comparison with those of European (polysyllabic) languages. Vietnamese prosody is related to rhythm (word duration) between the words in a word group or in a compound word, while intonation (raising or lowering the tone by increasing or reducing the amplitude and/or the frequency of all words) has a global effect on the whole sentence. There is no need to change the intonation of a word in order to highlight it, because each word has its own meaning thanks to one of the six accents. Moreover, unlike in polysyllabic languages, there is no question of emphasizing a syllable within a Vietnamese word, because each word has only one syllable. It is not necessary to pronounce one word in a sentence more strongly than the others, except when the speaker wants to express a special intention (e.g. one can pronounce some words more strongly and lower than the others in order to make them more important).

Up to now, there have been some works on the prosody of Vietnamese speech. Le [9] proposed and verified five hypotheses about duration in Vietnamese speech, based on an analysis of 36 files totalling 20,815 words read by broadcasters from several distinct regions of Vietnam. According to [15], the factors that affect the duration of a Vietnamese phonetic unit are the position, the pitch, and the structure of that unit. In [4], Vietnamese compounds and phrasal constructions were investigated for phonetic correlates of lexical stress; acoustic and perceptual characteristics of Vietnamese compound words and their phrasal counterparts were reported. In [7, 8], Le described results of research on acoustic features of Vietnamese speech that support synthesizing Vietnamese speech from text. Mac [11] presented a study on audio-visual prosodic attitudes in Vietnamese; it showed the relative contribution of audio, visual, and audio-visual information to attitude perception, and how native and non-native listeners recognize and confuse the attitudes. An analysis of speech prosody was also carried out to further validate the results of the perception experiments and to bring out some prosodic characteristics of Vietnamese social affects. Almost all of these studies have focused on Vietnamese neutral speech; very few have focused on Vietnamese emotional speech.

3 Emotional Speech Database

The emotional speech database used for investigating Vietnamese prosodic features consisted of Vietnamese utterances produced by two professional Vietnamese actors, one male and one female. The two actors were asked to produce utterances in five different styles. They uttered 19 sentences in four emotional styles: happiness, cold anger, sadness, and hot anger. In addition, they recorded the same 19 sentences in a neutral way. Consequently, each sentence has one utterance in each of the five styles, for both the male and the female voice. The database therefore contains a total of 190 utterances, half for the male voice and half for the female voice.

Sentences were about [...] words long and well representative of the Vietnamese phonetic alphabet. Most of them were nonsense sentences with no semantic emotional content, so they could not push the actors toward any particular emotional attitude. During the recording sessions, the actors simulated each of the emotional styles in turn, and a director was always present to control their pronunciation and prosody and to avoid emphatic performances. Signals were recorded in a sound-proof room with a high-quality microphone and digital acquisition equipment, and the waveforms were digitally acquired with the parameters specified in Table 1.

Table 1. Specifications of voice data

Item                 Value
Sampling frequency   22050 Hz
Quantization         16 bit
Sentences            19

4 Prosodic Feature Extraction and Analysis Results

In this section, we describe the prosodic feature extraction phase and the analysis results. The acoustic features investigated were fundamental frequency (F0), power, and duration. The F0 contour and the power envelope were calculated using STRAIGHT [5] with an FFT length of 1024 points and a frame period of 1 ms. The sampling frequency was 22050 Hz. Time durations were specified manually with partial support from WaveSurfer [1]. A total of nine acoustic parameters were calculated and analysed in order to find the relations between prosodic variations and emotional states in Vietnamese speech: three for F0, namely highest pitch (HP), average pitch (AP), and pitch range (PR); three for the power envelope, namely power range (PWR), average power (APW), and maximum power (HPW); and three for duration, namely total length (TL), consonant length (CL), and mean of pause lengths (MPAU). For each of these parameters and each emotional style, the mean value of the variation coefficients with respect to the baseline (neutral style) was calculated. These values are reported and discussed in the next subsections.

4.1 Fundamental Frequency - F0

For each utterance, the F0 information was first extracted using STRAIGHT [5]. From this information, acoustic parameters related to the F0 contour were then measured: highest pitch (HP), average pitch (AP), and pitch range (PR). The mean variation values of these parameters for both the male and the female voice are reported in Table 2.

Table 2. Mean variations of F0 parameters for the four emotional styles with respect to the neutral one

Voice    Parameter   happy     sad       cold angry   hot angry
Male     HP           7.70%    -3.11%     6.00%       15.90%
Male     AP           6.88%    -3.36%     5.34%       16.51%
Male     PR          33.14%    19.10%    39.97%       41.90%
Female   HP          10.56%    -0.47%     7.63%       14.42%
Female   AP           7.25%    -0.35%     5.66%       13.01%
Female   PR          49.35%    31.26%    41.65%       56.89%

The analysis results showed that three of the four emotional styles, namely happiness, cold anger, and hot anger, showed increases with respect to the neutral case for all parameters and for both speakers. Among them, the hot angry style had the largest variations in the F0 contour: all three F0 parameters had their biggest increases in this style. In the sad style, on the other hand, the F0 parameters varied in a different way. Specifically, AP and HP decreased while PR increased with respect to the neutral case, for both the male and the female voice.

The results also differed between the two speakers' voices. With the female voice, the increases of all three parameters in the happy style were larger than those in the cold angry style. With the male voice, however, AP and HP increased more in the happy style than in the cold angry style, while PR increased less. Another difference was that in the sad style, the decreases of AP and HP for the female voice were much smaller than those for the male voice; for the female voice these two parameters decreased almost imperceptibly. By contrast, the increase of PR in the sad style was considerably larger for the female voice than for the male voice. These differences arose because the speakers expressed emotions in different ways and with different intensities.

4.2 Power

The power envelope was measured in a way similar to that for the F0 contour. Power information was first extracted using STRAIGHT [5], and then acoustic parameters related to the power envelope were calculated. The acoustic parameters considered were maximum power (HPW), average power (APW), and power range (PWR). Table 3 presents the mean variation values of these parameters for both the male and the female voice.

Table 3. Mean variations of power parameters for the four emotional styles with respect to the neutral one

Voice    Parameter   happy     sad        cold angry   hot angry
Male     APW         20.81%    -16.38%     4.38%       11.48%
Male     HPW         18.87%     -6.95%     8.53%       14.60%
Male     PWR          8.38%     -6.42%    14.37%       20.45%
Female   APW         28.82%    -13.15%    38.82%       48.58%
Female   HPW         14.84%     -8.95%    25.98%       34.31%
Female   PWR         11.32%     -7.97%    18.11%       25.70%

With respect to the neutral style, all three parameters increased in the happy, cold angry, and hot angry styles, while they decreased in the sad style. The high-activation styles were characterized by larger variation values, and significant power peaks sometimes also occurred in the final parts of the sentences. As with the F0 contour, there were some differences in variation values between the two speakers' voices. For example, with the male voice, APW and HPW increased more in the happy style than in the hot angry style. By contrast, with the female voice, these two parameters increased less in the happy style than in the hot angry style. The reasons for these differences are the same as those for the differences in the F0 analysis results.

4.3 Time Duration

For each utterance, the time segmentation was first measured manually. The measurement recorded, for each phoneme, its number, its time (ms), and whether it is a vowel. The durations of all phonemes, both consonants and vowels, as well as pauses, were specified manually with partial support from WaveSurfer [1]. Figure 1 illustrates an example of time segmentation. In the segmentation table, the first row indicates the phonemes; the second row gives the order of the phonemes, with -1 noted before the first phoneme; the third row indicates the start time of the next phoneme; the fourth row marks each phoneme as a vowel or a consonant.

Figure 1. An example of time segmentation

Based on this table of time segmentation, the following parameters related to duration were measured: mean of pause lengths (MPAU), total length (TL), and consonant length (CL). The mean variation values of these parameters are reported in Table 4, for both the male and the female voice.

Table 4. Mean variations of duration parameters for the four emotional styles with respect to the neutral one

Voice    Parameter   happy      sad       cold angry   hot angry
Male     MPAU        -26.57%    50.81%     57.83%       62.83%
Male     CL          -16.25%    15.49%    -17.82%      -24.57%
Male     TL           16.09%    19.62%     14.03%       19.26%
Female   MPAU        -18.01%    49.11%    -20.49%       53.03%
Female   CL          -13.65%    13.29%    -15.80%      -23.44%
Female   TL           11.65%    20.32%     11.47%       25.78%

The total utterance length increased in all four emotional styles, while the consonant length increased in the sad style and decreased in the other styles. For the mean of pause lengths, there was a difference between the male voice and the female voice: in the cold angry style, this parameter increased for the male voice but decreased for the female voice. The reason for this difference is that the two speakers expressed the cold angry emotion in different ways.

5 Conclusion and future works

This paper has presented the results of analyses of the prosody of Vietnamese emotional speech. The analyses were performed on an emotional speech database consisting of Vietnamese sentences uttered in five different styles. The relations between prosodic variations and emotional states in Vietnamese speech were obtained by comparing the prosodic features of Vietnamese emotional speech with those of neutral speech. Acoustic parameters related to fundamental frequency, power, and duration were measured and analysed for all utterances in the database. From the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotions into Vietnamese synthesized speech, making it more natural. Further studies are necessary to find the relations between acoustic spectrum variations and emotional states in Vietnamese speech. In the future, we will carry out this work and then use the results to construct a Vietnamese emotional speech synthesis system.

Acknowledgement

This work is supported by the project Towards a Model of an "Intelligent Office Environment", No. QGTD.10.23.

References

[1] Wavesurfer: http://www.speech.kth.se/wavesurfer/index.html
[2] D. Erickson. Expressive speech: Production, perception and application to speech synthesis. Acoust. Sci. & Tech., 26:317–325, 2005.
[3] G. L. Huttar. Relations between prosodic variables and emotions in normal American English utterances. Journal of Speech and Hearing Research, 11:481–487.
[4] J. Ingram and T. Nguyen. Stress, tone and word prosody in Vietnamese compounds. Proceedings of the 11th Australian International Conference on Speech Science & Technology, pages 193–198, 2006.
[5] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:187–207, 1999.
[6] R. D. Kent and C. Read. Acoustic Analysis of Speech. San Diego: Singular Publishing Group, 1992.
[7] H. M. Le and K. H. Le. Analysis and synthesis for duration feature of Vietnamese. The 6th National Conference in Information Technology, Thainguyen, Vietnam, 2003.
[8] H. M. Le and T. N. Quach. Some results in phonetic analysis to Vietnamese text-to-speech synthesis based on rules. Journal on Information and Communication Technology, 2006.
[9] T. H. Le, A. V. Nguyen, V. H. Truong, V. H. Bui, and D. Le. A study on Vietnamese prosody. New Challenges for Intelligent Information and Database Systems, 351:63–73, 2011.
[10] L. Leinonen. Expression of emotional-motivational connotations with a one-word utterance. J. Acoust. Soc. Am., 102:1853–1863, 1997.
[11] D. K. Mac, E. Castelli, V. Aubergé, and A. Rilliard. How Vietnamese attitudes can be recognized and confused: Cross-cultural perception and speech prosody analysis. International Conference on Asian Language Processing, pages 220–223, 2011.
[12] K. Maekawa. Phonetic and phonological characteristics of paralinguistic information in spoken Japanese. Proc. Int. Conf. Spoken Language Processing, pages 635–638, 1998.
[13] M. D. Pell. Influence of emotion and focus location on prosody in matched statements and questions. J. Acoust. Soc. Am., 109:1668–1680, 2001.
[14] K. R. Scherer, R. Banse, H. G. Wallbott, and T. Goldbeck. Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15:123–148, 1991.
[15] D. D. Tran, E. Castelli, J.-F. Serignat, and V. B. Le. Analysis and modeling of syllable duration for Vietnamese speech synthesis. O-COCOSDA, 2007.
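The mean F0 variations reported for each emotional style can be illustrated with a small computation. The sketch below is not the authors' implementation: the paper does not state the exact formula, so it assumes that a variation coefficient is the relative change (emotional minus neutral, divided by neutral) expressed in percent, that pitch range is the maximum minus the minimum voiced F0, and that unvoiced frames are stored as 0 in the contour. The function names f0_parameters, variation_percent, and mean_variation are hypothetical, and the contours are toy data rather than values from the database.

```python
from statistics import mean


def f0_parameters(f0_contour):
    """Return (HP, AP, PR) in Hz for one utterance's F0 contour."""
    # Assumption: frames with F0 == 0 are unvoiced and are ignored.
    voiced = [f for f in f0_contour if f > 0]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    highest = max(voiced)
    average = mean(voiced)
    # Assumption: pitch range is max minus min of the voiced frames.
    pitch_range = highest - min(voiced)
    return highest, average, pitch_range


def variation_percent(emotional, neutral):
    """Relative change of one parameter with respect to the neutral value."""
    return 100.0 * (emotional - neutral) / neutral


def mean_variation(emotional_contours, neutral_contours):
    """Average HP/AP/PR variation over sentence-matched utterance pairs."""
    per_pair = []
    for emo, neu in zip(emotional_contours, neutral_contours):
        emo_params = f0_parameters(emo)
        neu_params = f0_parameters(neu)
        per_pair.append(
            [variation_percent(e, n) for e, n in zip(emo_params, neu_params)]
        )
    hp, ap, pr = (mean(column) for column in zip(*per_pair))
    return {"HP": hp, "AP": ap, "PR": pr}


# Toy contours (Hz) for two sentences, each in a neutral and an emotional style.
neutral = [[0, 100, 120, 110, 0], [0, 200, 220, 0]]
emotional = [[0, 110, 150, 130], [210, 250, 0]]
result = mean_variation(emotional, neutral)
print({name: round(value, 2) for name, value in result.items()})
```

Under the same assumptions, the power and duration parameters could be handled by swapping f0_parameters for a function that returns (HPW, APW, PWR) or (TL, CL, MPAU) for an utterance.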
