Database Formulation of Emotional Speech


An emotion database is created in the laboratory, since no readily available database suits this study. Currently available databases contain utterances with exaggerated expressions because actors and actresses are used as speakers. Prepared written texts are used in collecting the emotion data, and this results in acted emotions. On the other hand, databases of real emotional speech recorded in real-life environments present serious ethical and moral problems, since the contents of the emotional utterances may reveal personal details about the speakers. To avoid these problems, the recordings are made in a laboratory environment and non-professional speakers are selected; they are not asked to use prepared written texts. Although emotions can be recognized in utterances at the word level, it is better to analyse utterances of complete sentences in order to study pauses and sighs. These specific characteristics of emotion appear only in sentences and are rarely present at the word level.

The first goal of formulating this database is to study intra-speaker variations caused by emotions. The second goal is to study emotion classification in text-independent and speaker-dependent modes. As discussed in Section 2.3, human vocal emotions are not strongly influenced by cultural responses [99]. However, no detailed experimental analysis of the relation between culture and emotion has been made in the area of emotional speech research, and such an analysis is needed. Therefore, the third goal is to study the cross-cultural aspects of emotion experimentally. For this purpose, the emotion database is formulated using native Burmese and Mandarin speakers. Utterances in the Burmese and Mandarin languages are used because native speakers of these languages are immediately available.

The database includes short utterances covering the six archetypal emotions, namely Anger, Disgust, Fear, Joy, Sadness and Surprise. A total of six native Burmese speakers (3 males and 3 females) and six native Mandarin speakers (3 males and 3 females) are employed to generate 720 utterances. The speakers are university staff members and postgraduate and undergraduate students from the National University of Singapore and Nanyang Technological University. Detailed profiles of the speakers are given in Table 3.1.

Table 3.1: Gender and age of the speakers who contributed to the emotion database.

             Burmese                        Mandarin
  Speakers   Gender   Age       Speakers   Gender   Age
  Speaker1   Male     32        Speaker1   Male     50
  Speaker2   Male     27        Speaker2   Male     28
  Speaker3   Male     21        Speaker3   Male     25
  Speaker4   Female   31        Speaker4   Female   45
  Speaker5   Female   32        Speaker5   Female   23
  Speaker6   Female   33        Speaker6   Female   22

In order to satisfy the requirements of the database described above, the database is recorded according to the following procedure. Before recording, the speakers are given a brief introduction to the recording process and to emotion classification research. A sample list of sentences is prepared, with 15 sentences carrying the appropriate emotional meaning for each of the six emotions. The purpose of the list is to help the speakers explore the emotional feelings before recording.

The speakers are told that they can use these sentences if the sentences agree with their emotional feelings; otherwise, they are asked to use sentences or phrases of their own choice. All speakers prepare their own sentences before recording. In many cases there is significant overlap among the sentences of different speakers, since they often draw on the sample emotion sentence list. Since the speakers' databases are treated separately and this study is limited to speaker-dependent emotion classification, this overlap introduces no bias. The sets of emotion sentences, translated into English, are presented in Appendix A for both the Burmese and Mandarin databases.

Recording is done in a noise-free laboratory room, since undistorted speech signals without background noise are required for feature parameter analysis. The speakers are alone throughout the recording session, because, not being professional actors, they may have difficulty producing emotional utterances in the presence of others. The speakers are instructed in how to use the ‘Cool Edit’ audio recording software and in the format in which the speech files are to be stored on the computer. The speakers are asked to utter each sentence for a given emotion in a way that reflects that emotion. Some examples are presented to the speakers, but they are allowed to express the emotions in their own ways. A mouthpiece microphone is used in order to fix the distance between the mouth and the microphone throughout the recording session.

After recording each emotion sentence, the speakers are asked to play back and listen to the utterance. If they feel that the utterance does not express the intended emotion, they record another trial until they are satisfied. For each speaker, sixty utterances, ten for each emotional mode, are recorded. All speech data are coded at 16 bits per sample and sampled at 22 kHz. Statistics of the durations of the utterances for each of the six emotion categories are given in Table 3.2.

Table 3.2: Lengths of sample speech utterances for Burmese and Mandarin speakers (sec)

             Anger   Disgust   Fear   Joy    Sadness   Surprise
  Burmese    0.33    0.5       0.51   0.4    0.54      0.31
             0.66    1.17      1.28   0.64   1.31      0.84
             1.44    1.83      1.86   1.57   1.97      1.31
             1.72    2.28      2.45   2.85   2.33      1.85
  Mandarin   0.28    0.43      0.46   0.41   0.51      0.42
             0.64    1.49      1.73   1.75   1.25      1.37
             1.26    1.99      2.33   2.1    2.43      2.2
             1.98    2.68      3.1    2.64   3.04      3.49
  x̄          1.04    1.55      1.72   1.55   1.67      1.47
  σ          0.65    0.81      0.93   0.97   0.92      0.99

The durations of the utterances are widely spread within each of the six emotion categories (large σ), so the effect of utterance length as a cue for classification is minimal. The six emotions of Anger, Disgust, Fear, Joy, Sadness and Surprise are universal, as explained in Chapter 2. For most educated persons, the meanings of these emotions are not difficult to grasp. Therefore, the speakers should be able to elicit the specified emotions without confusion. To further confirm that the recorded utterances convey emotion independently of their semantic content, the emotions are identified by listeners with different language backgrounds.
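The mean and standard deviation rows of Table 3.2 can be reproduced directly from the recorded speech files. The Python sketch below assumes a hypothetical file layout in which each utterance is stored as `<language>_<speaker>_<emotion>_<index>.wav` under an `ESMBS/` directory, and that the nominal 22 kHz sampling rate corresponds to 22 050 Hz; it verifies the recording format and computes the per-emotion duration statistics:

```python
import glob
import os
import statistics
import wave
from collections import defaultdict

EMOTIONS = ["Anger", "Disgust", "Fear", "Joy", "Sadness", "Surprise"]

def utterance_duration(path):
    """Duration of a WAV file in seconds, with a check on the expected
    recording format (16 bit/sample, nominally 22 kHz, i.e. 22 050 Hz)."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expected 16-bit samples"
        assert wav.getframerate() == 22050, "expected 22 kHz sampling rate"
        return wav.getnframes() / wav.getframerate()

# Hypothetical naming convention, e.g. ESMBS/burmese_speaker1_Anger_03.wav
durations = defaultdict(list)
for path in glob.glob("ESMBS/*_*_*_*.wav"):
    emotion = os.path.basename(path).split("_")[2]   # third field: emotion label
    durations[emotion].append(utterance_duration(path))

for emotion in EMOTIONS:
    lengths = durations[emotion]
    if len(lengths) >= 2:                            # need at least two utterances
        print(f"{emotion:9s} mean = {statistics.mean(lengths):.2f} s, "
              f"sigma = {statistics.stdev(lengths):.2f} s")
```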

Hereafter, this emotion database is referred to as ESMBS (Emotional Speech of Mandarin and Burmese Speakers).

The size of the ESMBS emotion database is small compared with that of the commercial SUSAS stressed speech database. However, emotion classification using the ESMBS database is included in this thesis as exploratory research into capturing the emotion information embedded in speech.

3.2.1 Preliminary Subjective Evaluation Assessments

Subjective assessment of the emotional speech corpus by human subjects is carried out. The first objective of the subjective classification is to assess whether the utterances contain naturally expressed emotions. The second objective is to determine the listeners’ ability to correctly classify the emotional modes of the utterances and then to compare the results with machine classification performance.

Listeners with different language backgrounds are engaged for the subjective tests. Four normal-hearing listeners participate in this experiment: two subjects are native Burmese speakers and the other two are native Sinhala speakers. Each subject understands either Burmese or Sinhala, and none of them understands both languages. All subjects participating in this study are in the age group of 25 to 55 years. The Burmese subjects are asked to classify the Mandarin emotional utterances and the Sinhala subjects are asked to classify the Burmese emotional utterances, so the language of the utterances presented to each subject is one that he or she does not understand. Hence, judgment is made on the basis of the perceived emotional content rather than the semantic meaning of the utterances.

There is no training session for the listeners before the test, and they are not given any feedback during the test session. The utterances are presented via headphones, played back twice each and in random order, and the subjects are requested to indicate which one of the six emotional modes is portrayed. They can choose any one of the expressions listed in the response table for each utterance. Before each utterance is played, the listener has to indicate (by raising a hand) that he or she is ready to hear the next utterance. Listeners are not allowed to compare the present utterance with the previous one. The detailed classification performance of the human evaluators is summarized in Table 3.3 and Table 3.4.
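For illustration, the forced-choice protocol can be sketched in Python as follows. The `play_audio` helper and the list of utterance paths are hypothetical placeholders (in the actual test the utterances were simply played back over headphones); the sketch only shows the randomized order, the two presentations per utterance, the readiness prompt and the forced choice among the six emotions, with no feedback given:

```python
import random

EMOTIONS = ["Anger", "Disgust", "Fear", "Joy", "Sadness", "Surprise"]

def play_audio(path):
    """Hypothetical placeholder for playing a speech file over headphones."""
    raise NotImplementedError

def run_listening_test(utterance_paths):
    """Present utterances in random order, twice each, and collect one
    forced-choice emotion label per utterance; no feedback is given."""
    order = list(utterance_paths)
    random.shuffle(order)                       # randomized presentation order
    responses = {}
    for path in order:
        input("Indicate when you are ready for the next utterance: ")
        play_audio(path)                        # first presentation
        play_audio(path)                        # second presentation
        choice = ""
        while choice not in EMOTIONS:           # forced choice among the six modes
            choice = input(f"Portrayed emotion {EMOTIONS}: ").strip().capitalize()
        responses[path] = choice                # no comparison with earlier items
    return responses
```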

Table 3.3: Average accuracy of human classification (%)

  Speaker             Burmese Average   Mandarin Average
                      Performance       Performance
  Speaker1 (Male)     75                61.7
  Speaker2 (Male)     65                58.3
  Speaker3 (Male)     61.7              66.7
  Speaker4 (Female)   76.7              63.3
  Speaker5 (Female)   60                66.7
  Speaker6 (Female)   71.7              61.7
  Mean                68.3              63.1
  Overall mean                65.7

Table 3.4: Human classification performance (%) by emotion categories

                   Anger   Disgust   Fear   Joy    Sadness   Surprise   Mean
  Burmese (α_B)    98.3    63.3      45     53.3   85        65         68.3 (x̄_B)
  Mandarin (α_M)   96.7    55        45     41.7   91.7      48.3       63.1 (x̄_M)
  Mean             97.5    59.15     45     47.5   88.35     56.65      65.7

The Sinhala listeners classify the Burmese emotional utterances with an accuracy of 68.3%, and the Burmese subjects recognize the Mandarin emotional utterances with an accuracy of 63.1%. These results are in line with the accuracy rates reported in the previous study [99]. Both groups of listeners recognize emotions with similar accuracy on average (x̄_B ≈ x̄_M) as well as within each emotion category (α_B for each emotion is approximately equal to the corresponding α_M). The recognition rates are also well above the chance accuracy of 1/6 (about 16.7%) for a six-way forced choice. This may suggest that common acoustic characteristics are used by native speakers of Mandarin and native speakers of Burmese. The listeners also recognize the emotions by listening to the emotional content of the utterances without understanding their semantic meaning. Therefore, it can be concluded that the utterances convey emotional meaning.
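The per-category accuracies in Table 3.4 and their comparison with the chance level follow directly from the listeners’ responses. A minimal sketch is given below, assuming the responses are available as (true emotion, chosen emotion) pairs; with six equally likely categories the chance accuracy is 1/6, i.e. about 16.7%:

```python
from collections import Counter

EMOTIONS = ["Anger", "Disgust", "Fear", "Joy", "Sadness", "Surprise"]
CHANCE_ACCURACY = 100.0 / len(EMOTIONS)          # 1/6 of 100%, about 16.7%

def per_emotion_accuracy(pairs):
    """Classification accuracy (%) per emotion category, computed from
    (true_emotion, chosen_emotion) response pairs."""
    totals = Counter(true for true, _ in pairs)
    correct = Counter(true for true, chosen in pairs if true == chosen)
    return {e: 100.0 * correct[e] / totals[e] for e in EMOTIONS if totals[e]}

# Hypothetical example responses; the real test uses 60 utterances per listener.
pairs = [("Anger", "Anger"), ("Anger", "Anger"), ("Joy", "Sadness"),
         ("Fear", "Surprise"), ("Sadness", "Sadness"), ("Joy", "Joy")]
for emotion, accuracy in per_emotion_accuracy(pairs).items():
    status = "above" if accuracy > CHANCE_ACCURACY else "not above"
    print(f"{emotion:9s} {accuracy:5.1f}%  ({status} the {CHANCE_ACCURACY:.1f}% chance level)")
```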

Furthermore, the results show that high accuracy is observed in the classification of Anger and Sadness, as they have the most acoustically distinct features, whereas Joy and Fear are classified with lower accuracy than the other emotions. According to De Silva [100], Joy may be easier to detect from the smiling face, that is, it is more easily detected visually. Anger is the most accurately detected emotion in the speech utterances. Ohala [121] suggested that the Anger emotion is associated with vocal tract lengthening, thereby signaling a larger sound source. Therefore, more of the information needed to detect Anger is contained in the speech sound than in the facial expression.

The above subjective assessments provide the accuracy of human classification on the recorded emotion database and an insight into the relation between two Asian cultures and their expressions of emotion. In the following section, the process of preparing noisy speech utterances is presented.
