Emotional Speech by Educated Speakers of Singapore English: An Acoustic Study
1 CHAPTER ONE PRELIMINARIES AND MOTIVATIONS FOR THIS STUDY 1.1 Introduction We communicate our attitude towards all utterances, even if this is to indicate as far as possible that we have no attitude. (Crystal, 1969:289) The voice is a powerful source of information. There is surely an abundance of affective information in the voice; it conveys a wide variety of paralinguistic sources of information. In fact, vocalisation may be much more contagious than facial or bodily expressions (Lewis, 2000a). If you have ever watched a show on the television with the sound turned off, you will find that it is hardly an engaging experience. While you may be able to follow the overall plot, you will probably miss the nuances of the emotions portrayed by the characters, because they have been, literally, muted. The important role of vocal expression in the communication of emotion has been recognised since antiquity. In his classic work on the expression of emotion in animals and human beings, Darwin (1872/1965) attributed primary importance to the voice as a carrier of emotional cues. As Scherer (1989:233) points out, “the use of the voice for emotional expression is such a pervasive phenomenon that it has been frequently commented upon since the beginning of systematic scientific interest in human expressive behavior.” Vocal expressions are extremely powerful and may have the ability to elicit similar emotional states in others. Despite this, however, there is little systematic knowledge about the details of the auditory cues which are actually responsible for the expression and perception of emotion in the voice. Studies regarding emotional speech have been done in the last few decades, but few researchers actually agree on how to define the phonetic quality of expressed emotions. 2 Among the various vocal cues of emotions that have been studied, intonation is the most common, and many researchers have shown that intonation is an effective function of expressing the speaker’s emotion (Williams & Stevens, 1972; O’Connor & Arnold, 1973; Abe, 1980; Bolinger, 1986; Chung, 1995) – the same word or phrase, when spoken using varying intonation, can reflect very different emotions or attitudes which are easily recognisable to listeners. Few studies, however, involve an analysis of the vowels and consonants in emotional speech, in spite of the fact that these segments of speech are affected by emotion (Williams & Stevens, 1972; Scherer, 1986; Chung, 1995; Hirose et al, 1997). Hence, one of the aims of this study is to examine the segmental features – and the role they play – in the expression of emotional English. Because this study is conducted in Singapore, it is also interesting to look at certain features of the local variety of English. There have been many studies on Singapore English (henceforth known as SE) in the past few decades, which have progressed from identifying the structural mistakes of SE (Elliott, 1980) to establishing SE as a standard form of English and describing what its features include (Tongue, 1974; Platt et al, 1984; Gupta, 1992; Brown, 1999; Zhu, 2003). Recent researchers have generally agreed on the existence of certain features of SE, and are turning their attention to ethnic variations of these features since Singapore is a multi-ethnic society (Ho, 1999; Poedjosoedarmo, 2000; Lim, 2001). This study aims to approach SE research from a new angle by looking at the relationship between certain SE features and emotions. 
It is hoped that the findings of this study on vowel and consonantal qualities will find support for the position that these are significant vocal cues in emotional speech which deserve more attention in this area of research, and also provide a deeper understanding of how SE is used in natural, emotional conversation. 3 1.2 Emotional speech Each aspect of consideration in a study of emotional speech is rather complex in itself. There is a wide variety of possible vocal cues to look at, and an even wider range of emotions under the different kinds of classifications. It is therefore necessary to explain the choices of emotions and vocal cues examined in this study. The following sections provide a background to emotions and their categories, followed by a discussion on the relationship between emotions and the voice, and how the decision is made on which emotions to examine. 1.2.1 Emotion labels and categories One of the first difficulties a researcher on emotion faces is having to sieve through and choose from a myriad of emotion labels in order to decide on which emotions to study. The number of emotion labels is virtually unlimited, for when it comes to labelling emotions, the tendency has been to include almost any adjective or noun remotely expressive of affect. After all, “the most obvious approach to describing emotion is to use the category labels that are provided by everyday language.” (Cowie, 2000:2) According to an estimation made by Crystal (1969), between the two studies by Schubiger (1958) and O’Connor & Arnold (1973), nearly 300 different labels are used to describe affect. It seems that the only bounds imposed here are those of the English lexicon. Thus, in the face of such a multitude of labels, some kind of systematisation, in order to constrain the labels introduced, is indispensable. However, grouping emotions into categories is also a difficult issue. As mentioned, there are thousands of emotion labels, and the similarity between them is a matter of degree. If so, no natural boundaries exist that separate discrete clusters of emotions. As a consequence, there are many reasonable ways to group emotion labels 4 together, and because there has never been a commonly accepted approach to categorising emotional states, it is no surprise that researchers on emotions differ on the number of categories and the kinds of categories to use. The following sections will highlight the ways in which some researchers have categorised emotions. 1.2.1.1 Biological approach Panksepp (1994), who looks at emotions from a neurophysiological point of view, suggests that affective processes can be divided into three conceptual categories. The researcher points out that while most models accept fear, anger, sadness, and joy as major species of emotions, it is hard to agree on emotions such as surprise, disgust, interest, love, guilt, and shame, and harder to explain why strong feelings such as hunger, thirst, and lust should be excluded. Panksepp therefore tries to include all affective processes in his three categories. Category One – “the Reflexive Affects” – consists of affective states which are organised in quite low regions of the brainstem, such as pain, startle reflex, and surprise. Category Two – “the Blue-Ribbon, Grade-A Emotions” – consists of emotions produced by a set of circuits situated in intermediate areas of the brain which orchestrate coherent behavioural, physiological, cognitive, and affective consequences. 
Emotions like fear, anger, sadness, joy, affection, and interest fall under this category. Lastly, Category Three – “the Higher Sentiments” – consists of the emotional processes that emerge from the recent evolutionary expansion of the forebrain, such as the more subtle social emotions including shame, guilt, contempt, envy, and empathy. However, because the concerns of a biological-neurophysiological study are vastly different from that of a linguistic study, this method of categorisation is not commonly referred to by linguistic researchers of emotional speech. Instead, the 5 question is more often “whether emotions are better thought of as discrete systems or as interrelated entities that differ along global dimensions” ( Keltner & Ekman, 2000:237). Linguistic researchers who take the stand that emotions are discrete systems would study a small number of emotions they take to be primary emotions (the mixing of which produces multiple secondary emotions), while researchers who follow the dimensional approach study a much greater number of emotions (viewed as equally important), placing them along a continuum based on the vocal cues they examine. The next two sections will briefly cover these different views on emotions. 1.2.1.2 Discrete-emotions approach The more familiar emotion theories articulate a sort of “dual-phase model of emotion that begins with ‘primary’ biological affects and then adds ‘secondary’ cultural or cognitive processes” (White, 2000:32). Cowie (2000:2) states that “probably the best known theoretical idea in emotion research is that certain emotion categories are primary, others are secondary.” Cornelius (1996) summarises six basic or primary emotion categories, calling them the “big six”: fear, anger, happiness, sadness, surprise, and disgust. Similarly, Plutchik’s (1962) theory points towards eight basic emotions. He views primary emotions as adaptive devices that have played a rule in individual survival. According to this comprehensive theory, the basic prototype dimensions of adaptive behaviour and the emotions related to them are as follows: (1) incorporation (acceptance), (2) rejection (disgust), (3) destruction (anger), (4) protection (fear), (5) reproduction (joy), (6) deprivation (sorrow), (7) orientation (surprise), and (8) exploration (expectation). The interaction of these eight primary emotions in various intensities produces the different emotions observed in everyday life. 6 This issue is discussed more fully in Plutchik (1980). He points out that emotions vary in intensity (e.g. annoyance is less intense than rage), in similarity (e.g. depression and misery are more similar than happiness and surprise), and in polarity (e.g. joy is the opposite of sadness). In his later work (Plutchik, 1989), he reiterates the concept that the names for the primary emotions are based on factor-analytic evidence, similarity scaling studies, and certain evolutionary considerations, and that emotions designated as primary should reflect the properties of intensity, similarity, and polarity. Therefore, “if one uses the ordinary subjective language of affects, the primary emotions may be labelled as joy and sadness, anger and fear, acceptance and disgust, and surprise and anticipation.” Lewis (2000a, 2000b) also presents a model for emotional development which involves basic or primary emotions. In his model, the advent of the metarepresentation of the idea of me, or the consciousness, plays a central role. 
He lists joy, fear, anger, sadness, disgust, and surprise as the six primary emotions, which are the emotional expressions we observe in a person in his first six months of life. These early emotions are transformed in the middle of the second year of life as the idea of me, or the meta-representation, is acquired and matures. Lewis calls this transformation “an additive model” because it allows for the development of new emotions. He stresses that the acquisition of the meta-representation does not transform the basic emotions; rather, it utilises them in an additive fashion, thereby creating new emotions. The primary emotions are transformed but not lost, and therefore the process is additive. 7 1.2.1.3 Dimensional approach The dimensional perspective is more common among those who view emotions as being socially learned and culturally variable (Keltner & Ekman, 2000). This approach argues that emotions are not discrete and separate, but are better measured and conceptualised as differing only in degree on one or another dimension, such as valence, activity, or approach or withdrawal (Schlosberg, 1954; Ekman et al, 1982; Russell, 1997). One way of representing these dimensions is by classifying emotions along bipolar continua such as tense – calm, elated – depressed, and Happy – Sad. Interestingly, this method of classification is similar to Plutchik’s (1980) concept of polarity variation as mentioned above. In an example of bipolar continua, Uldall (1972) sets 14 pairs of opposed adjectives that are placed at the two ends of a sevendegree scale, such as Bored extremely quite slightly Neutral slightly quite extremely Interested The other pairs of adjectives are polite – rude; timid – confident; sincere – insincere; tense – relaxed; disapproving – approving; deferential – arrogant; impatient – patient; emphatic – unemphatic; agreeable – disagreeable; authoritative – submissive; unpleasant – pleasant; genuine – pretended; weak – strong. A more systematic method structurally represents categories and dimensions by converging multidimensional scaling and factor analyses of emotion-related words and situations, such that categories are placed within a two- or three-dimensional space, like that shown in Figure 1.1. The implication of such a structure is that a particular instance is not typically a member of only one category (among mutually exclusive categories), but of several categories, albeit to varying degrees (Russell & Bullock, 1986; Russell & Fehr, 1994). 8 Figure 1.1: A circumplex structure of emotion concepts. Figure taken from Russell & Lemay (2000:497). Figure 1.2: Multidimensional scaling of emotion-related words. Figure taken from Russell (1980). 9 Russell (1989) also suggests the use of multidimensional scaling, and using distance in a space to represent similarity, to better represent the interrelationships among emotion labels. Figure 1.2 (Russell, 1980) shows the scaling of 28 emotionrelated words, based empirically on subjects’ judgments of how the words are interrelated. He claims that such a model is a better solution to the categorisation of emotions as it reflects the continuous variation of the emotions. It also asserts the correlation between emotions – the closer the emotions are placed together, the more likely that an emotional state can be classified as both of the emotions. While the “prototypical” emotion categories are not placed in the outset of the space like those in Figure 1.1, they are generally in similar positions in relation to the other emotion labels. 
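To make the dimensional idea concrete, the short sketch below recovers a two-dimensional emotion space of the kind shown in Figures 1.1 and 1.2 from a matrix of pairwise dissimilarity judgments. It is only an illustration of the technique: the emotion labels and ratings are invented, not data from Russell (1980) or from this study, and the scikit-learn MDS routine stands in for whatever scaling procedure a given researcher actually used.

```python
# Minimal sketch: multidimensional scaling of emotion labels.
# The dissimilarity matrix is hypothetical, for illustration only.
import numpy as np
from sklearn.manifold import MDS

labels = ["happy", "excited", "calm", "sad", "angry", "afraid"]

# Averaged pairwise dissimilarity judgments (0 = judged identical).
D = np.array([
    [0.0, 1.0, 3.0, 6.0, 5.0, 5.5],
    [1.0, 0.0, 4.0, 6.5, 4.0, 4.5],
    [3.0, 4.0, 0.0, 3.5, 6.0, 6.0],
    [6.0, 6.5, 3.5, 0.0, 4.0, 3.5],
    [5.0, 4.0, 6.0, 4.0, 0.0, 2.0],
    [5.5, 4.5, 6.0, 3.5, 2.0, 0.0],
])

# Place the labels in two dimensions; nearby points represent similar emotions.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

for label, (x, y) in zip(labels, coords):
    print(f"{label:>8}: ({x:+.2f}, {y:+.2f})")
```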
Multidimensional scaling diagrams like Figures 1.1 and 1.2 have been derived for numerous languages besides English, such as that spoken in the Pacific (e.g. Lutz (1982) on Ifaluk; Gerber (1985) on Samoa; White (2000) on Solomon Islands) and Asia (e.g. Heider (1991) on Indonesia; Romney et al (1997) on Japan). There are other approaches and models which represent how emotions may be conceptualised, but it is beyond the scope of this study to explain each and every one of them. It is likely that no single theory or model is the “correct” one as some psychologists would like to think, and that each one really paints a partial picture, and highlights different properties, of emotion concepts. They are highly interrelated (though some have received much more attention than others) and are not competing accounts (Keltner & Ekman, 2000). 1.2.2 Researchers’ choices There is considerable evidence that different emotional states induce physiological changes, which can directly change voice quality (Ohala, 1981; Johnstone et al, 1995). Examples of changes that can affect the voice are dryness in the 10 mouth or larynx, accelerated breathing rate, and muscle tension. Scherer (1979), for instance, points out that arousal of the sympathetic nervous system, which characterises emotional states, increases muscle tonus – generally meaning an increase in fundamental frequency – and also affects the coordination and the rhythms of reciprocal inhibition and excitation of muscle groups. The latter effect “will affect pitch and loudness variability, stress patterns and intonation contours, speech rate… and many other speech parameters.” (Scherer, 1979:501f) Thus recent linguistic research has looked at a wide range of emotions. Among the many emotions researchers have studied, anger, sadness, and happiness / joy are the most common (van Bezooijen, 1984; Scherer et al, 1991; Chung, 1995; Johnstone et al, 1995; Klasmeyer & Sendlmeier, 1995; Laukkanen et al, 1995; McGilloway et al, 1995; Mozziconacci, 1995, 1998; Nushikyan, 1995; Tosa & Nakatsu, 1996; Hirose et al, 1997; Nicholson et al, 2000). A fair number of studies also examine fear and / or boredom (Scherer et al, 1991; Johnstone et al, 1995; Klasmeyer & Sendlmeier, 1995; McGilloway et al, 1995; Mozziconacci, 1995, 1998; Nushikyan, 1995; Tosa & Nakatsu, 1996; Nicholson et al, 2000). Disgust is another relatively popular choice of study (van Bezooijen, 1984; Scherer et al, 1991; Johnstone et al, 1995; Klasmeyer & Sendlmeier, 1995; Tosa & Nakatsu, 1996; Nicholson et al, 2000). Other emotions which some researchers study include despair, indignation, interest, shame, and surprise. The majority of these studies involving emotional speech tend to include neutral as an emotion. While – strictly speaking – neutral is not an emotion, one can understand the necessity of non-emotional data, which serves to bring out the innate values of the speech sounds so that emotional data has a basis of comparison (Cruttenden, 1997). 11 It should be noted that for the emotion anger, some researchers make the distinction between hot and cold anger (Johnstone et al, 1995; Hirose et al, 1997; Pereira, 2000). Hirose et al (1997) explain that anger can be expressed straight or suppressed, resulting in different prosodic features, and this is supported by their results, which show two opposite cases for samples of anger. 
1.2.3 Choices of emotions for this study Because natural conversation is recorded (in the form of anecdotal narratives of recollections of emotional events) for data for this study, fear and boredom were not chosen since people do not normally recall events which made them feel fearful or bored and still speak with traces of the emotions felt at the time of the events. Disgust is also not an option as people do not usually sustain their tone of disgust throughout significant lengths of their narrative. (Hot) anger, sadness, happiness, and neutral are chosen as the four emotions for this study. According to the theories formulated by Lewis (2000a, 2000b) and Plutchik (1962, 1980, 1989), anger, sadness, and happiness are considered primary emotions. In other words, they are distinguishable from one another and none of them is a derivative of another. They are also shown to be in different quadrants of the multidimensional scaling diagrams, Figure 1.1 (Keltner & Ekman, 2000) and Figure 1.2 (Russell, 1980). This means that anger, sadness, and happiness are dimensionally dissimilar from one another. Neutral is necessary because, as mentioned, it serves as a basis of comparison for the data of the other emotions. Furthermore, if neutral had a place in the multidimensional scaling diagrams, it would probably be close to calm, and hence would be in the fourth quadrant, reasonably different from anger, sadness, and happiness. 12 It is also useful to note that despite their being four relatively distinct emotions, they can be grouped into two opposite pairs of ‘active’ and ‘passive’ emotions, as suggested by the vertical axis of the circumplex structure in Figure 1.1. Active emotions – in the case of this study, Angry and Happy – are represented by a heightened sense of activity or adrenaline, while passive emotions – Sad and Neutral– are represented by low levels of physical or physiological activity. (Neutral is taken to be similar to calm in Figure 1.1 and also At ease and Relaxed in Figure 1.2, and is also found by Chung (1995) to be similar to other passive emotions in terms of pitch, duration and intonation.) Such pairing is useful for it provides one more way by which to compare and contrast the four emotions. 1.3 Two major traditions of research There are two major traditions of research in the area of emotional speech: encoding and decoding studies (Scherer, 1989). Encoding studies attempt to identify the acoustic (or sometimes phonatory-articulatory) features of recordings of a person’s vocal utterances while he is in different emotional states. In the majority of encoding studies, these emotional states are not real but are mimicked by actors. In contrast, decoding studies are not as concerned with the acoustic features, but with the ability of judges to correctly recognise or infer affect state or attitude from voice samples. Due to the large number of speech sounds examined in the different number of emotions, this study will mainly be an encoding study. However, a preliminary identifying test will be conducted using some of the speech extracts from which the vocal cues used for analysis are taken. The purpose of the short listening test is to support my assumption that certain particular segments of conversation are 13 representative of the emotions portrayed. This test will be explained in further detail in Chapter 2. 
Before stating the motivations for this study and its aims, the following sections will provide a summary of the past research done on the aspects of voice which are cues to emotion, as well as a brief description of the segmental features of Singapore English. 1.4 Past research on emotional speech A glance at the way language is used shows us that emotions are expressed in countless different ways, and Lieberman & Michaels (1962) note that speakers may favour different acoustic parameters in transmitting emotions (just as listeners may rely on different acoustic parameters in identifying emotions). In the last few decades, researchers have studied a wide range of acoustic cues, looking for the ones which play a role in emotive speech. 1.4.1 Intonation … speakers rarely if ever objectify the choice of an intonation patter; they do not stop and ask themselves “Which form would be here for my purpose?”… Instead, they identify the feeling they wish to convey, and the intonation is triggered by it. (Bolinger, 1986:27) It is an undisputed fact that intonation has an important role to play in the expression of emotions. In fact, it is generally recognised that the use of intonation to express emotions is universal (Nushikyan, 1995), i.e. there are tendencies in the repetition of intonational forms in different languages (Bolinger, 1980:475-524). This explains why there is more literature regarding intonational patterns in emotional speech than any other speech sounds. Because pitch is the feature most centrally 14 involved in intonation (Cruttenden, 1997), studies on intonation tend to focus on pitch variations. Mozziconacci (1995) focuses exclusively on the role of pitch in English and shows that pitch register varies systematically as a function of emotion. Her acoustic analysis shows that neutral and boredom have low pitch means and narrow pitch ranges, while joy, anger, sadness, fear, and indignation have wider ranges, with the latter two emotions having the largest means. However, there is a low emotionidentification performance in her perception test. She attributes that to the fact that no characteristics other than pitch had been manipulated, which implies that pitch is not the only feature involved differentiating emotions in speech. Chung (1995) observes that in Korean and French, the pitch contour seems to carry a large part of emotional information, and anger and joy have a wider pitch range than sorrow or tenderness. Likewise, McGilloway et al (1995) find that in English utterances, happiness, compared to fear, anger, sadness, and neutral, has the widest pitch range, longer pitch falls, faster pitch rises, and produces a pitch duration that is shorter and of a narrower range. Kent et al (1996) generalise that large intonation shifts usually accompany states of excitement, while calm and subdued states tend to manifest a narrow range of intonation variations. However, the system of intonation to convey affective meaning is not the only means of communicating emotions. According to Silverman et al (1983), certain attitudes are indistinguishable on the basis of intonation. Uldall’s (1972) results demonstrate that some of the attitudes (e.g. the adjective pair “genuine – pretended”) are apparently rarely expressed by intonation. Therefore, while intonation is a significant means of conveying expressive meaning, it is not the only one and there are certainly other equally important phenomena. 
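Since the studies cited in this section quantify intonation mainly through pitch, a minimal sketch of how per-utterance pitch statistics (mean F0 and F0 range) might be extracted is given below. The Parselmouth interface to Praat is used purely as an illustration; this thesis does not specify that tool, and the file name and default analysis settings are assumptions.

```python
# Hedged sketch: per-utterance pitch statistics via Parselmouth (Praat).
# "utterance.wav" is a placeholder for one speech extract.
import parselmouth

snd = parselmouth.Sound("utterance.wav")
pitch = snd.to_pitch()                      # default Praat pitch analysis
f0 = pitch.selected_array["frequency"]      # F0 in Hz; 0 where unvoiced
f0 = f0[f0 > 0]                             # keep voiced frames only

print(f"mean F0:  {f0.mean():.1f} Hz")
print(f"F0 range: {f0.min():.1f}-{f0.max():.1f} Hz "
      f"(span {f0.max() - f0.min():.1f} Hz)")
```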
15 1.4.2 Other vocal cues It seems obvious that intonation is not the only means of differentiating emotions, and that “other aspects such as duration and voice quality must also be taken into consideration” ( Mozziconacci, 1995:181). Cruttenden (1997) points out that there are a number of emotions, like joy, anger, fear, sorrow, which are not usually associated directly with tones, but may be indicated by a combination of factors like accent range, key, register, overall loudness, and tempo. Murray & Arnott (1993) note that the most commonly referenced vocal parameters are pitch, duration, intensity, and voice quality (the last term was not clearly defined though). Nevertheless, there are fewer studies done on any one of these aspects than on intonation. Furthermore, these cues and parameters mentioned by Cruttenden (1997) and Murray & Arnott (1993) are still examples of prosodic features, and no mention is made by them of the role of segmental features. The study on Korean and French by Chung (1995) suggests that the vowel duration of the last syllable differs according to the emotions: it is very short in anger and long in joy and tenderness. Consonantal duration, however, is less regular; lengthening tends to occur on stressed words. (No mention is made, however, of whether these words are sentence-final or sentence-medial.) Hirose et al (1997) find speech rate to be higher in emotional speech as compared to non-emotional speech, and Chung (1995) elaborates that it is high in anger and joy but low in sorrow and tenderness (in both studies, speech rate, while not explicitly defined, is measured over a sentence). With regard to intensity, the most obvious result of research is that it increases with anger (Williams & Stevens, 1972; Scherer, 1986; Chung, 1995; Leinonen et al, 16 1997). Other findings include intensity being significantly higher in joy than in sadness and tenderness (Chung, 1995; Hirose et al, 1997). Klasmeyer & Sendlmeier (1995) study glottis movement by analysing the glottis pulse shape in emotional speech data. Laukkanen et al (1995) also examine the glottis and the role of glottal airflow waveform in identification of emotions in speech, but the study is, unfortunately, inconclusive, because, as admitted by the researchers, since the glottal waveform was studied only at F0 maximum, it remains uncertain whether the relevance of the voice quality in their samples was related to the glottal waveform or to a pitch synchronous change in it. 1.4.3 Comparing between genders There is little written literature on the comparison between male and female speech, much less the comparison between male and female emotional speech. This is probably due to the fact that early work in phonetics focused mainly on the adult male speaker, mostly for social and technical reasons (Kent & Read, 2002:53). But it is an undeniable fact that the genders differ acoustically in speech. A classical portrayal of gender acoustic diversity is shown by Peterson & Barney (1952), who, from a sample of 76 men, women, and children speakers asked to utter several vowels, derive F1-F2 frequencies which falls within three distinct (but overlapping) clusters (men, women, and children). 
Likewise, Tosa (2000) discovers, after running preliminary (artificial intelligence) training tests with data from males and females, that two separate recognition systems – one for male speakers and another for female speakers – are needed, as the emotional expressions of males and females are different and cannot be handled by the same program model. However, further research has not been done to find out the reason behind the gender difference.

We do know that the differences are due in part to biological factors: women have a shorter membranous length of the vocal folds, which results in higher fundamental frequency (F0), and greater mean airflow (Titze, 1989). Women's voices are also physiologically conditioned to have a higher and wider pitch range than men's, particularly when they are excited (Brend, 1975; Abe, 1980). In an experiment in which 3rd, 4th, and 5th grade children were asked to retell a story, Key (1972) observed that the girls used a very expressive intonation (i.e. highly varied throughout speech), while the boys toned down intonational features even to the point of monotony.

1.5 Singapore English

Singapore is a multi-ethnic society whose resident population of four million is made up of 76.8% Chinese, 13.9% Malays, 7.9% Indians, and 1.4% of other races (Leow, 2001). While the official languages of the three main ethnic groups are Mandarin, Malay, and Tamil, respectively, English is the primary working language, used in education and administration. Because of this multi-ethnolinguistic situation, the variety of English spoken in Singapore is distinctive and most interesting to study.

There has been much interest in two particular ways of studying SE. One describes the nature and characteristics of SE, documenting the semantic, syntactic, phonological, and lexical categories of SE. The other is to provide an account of the emergence of certain linguistic features of SE; for example, some studies show evidence of the influence of the ethnic languages on SE.

SE generally has two varieties: Standard Singapore English (SSE) and Colloquial Singapore English (CSE) (Gupta, 1994). SSE is very similar to most Standard Englishes, while CSE differs from Standard Englishes in terms of pronunciation, syntax, etc. But many educated Singaporeans of today speak a mixture of both varieties: the morphology, lexicon, and syntax are those of SSE but the pronunciation system is that of CSE (Lim, 1999). Since the interest of this study lies in the relationship between emotions and the articulation of segmental features, a brief description will be given of the phonological phenomenon of vowel and consonant conflation in SE.

1.5.1 Vowels

It is commonly agreed by researchers that one of the most distinctive features of SE pronunciation is the conflation of vowel pairs. Much research has been done on this phenomenon, and some researchers focus on the conflation of pairs of short and long vowels such as [ɪ]/[iː], [ʌ]/[ɑː], [ɒ]/[ɔː], and [ʊ]/[uː] (Brown, 1992; Poedjosoedarmo, 2000; Gupta, 2001). Brown (1992) finds that Singaporeans make no distinction within the above-mentioned long and short vowel pairs, i.e. each pair is conflated. Poedjosoedarmo (2000), in her study on standard SE, studies only the vowel pair [ɪ]/[iː], which she takes as representative of the phenomenon of vowel conflation in SE, and finds that the pair is indeed conflated.
Gupta (2001), however, finds that the standard SE vowel pairs, placed in descending order of how often they are conflated, are: [ɛ]/[æ], [ʊ]/[uː], [ɒ]/[ɔː], [ʌ]/[ɑː], [ɪ]/[iː]. In other words, most Singaporeans make the distinction between the vowels in [ɪ]/[iː], but few do so for the vowels in [ɛ]/[æ]. Other studies further include [ɛ]/[æ], analysing the differences in the positions of the tongue for the vowels in recorded standard SE, spontaneous or otherwise (Lim, 1992; Loke, 1993; Ong, 1993). Lim (1992) plots a formant chart for the vowel pairs [ɪ]/[iː], [ɛ]/[æ], [ʌ]/[ɑː], [ɒ]/[ɔː], and [ʊ]/[uː] and finds the members of each pair statistically similar in terms of formant frequency (i.e. the vowel pairs are conflated). Likewise, Loke (1993) finds conflation of the vowel pairs [ɪ]/[iː], [ɛ]/[æ], [ʌ]/[ɑː], and [ʊ]/[uː] by examining vowel formants in spectrographs. Ong (1993), on the other hand, finds that the vowel pairs [ɪ]/[iː], [ɛ]/[æ], and [ɒ]/[ɔː] do conflate but that "there is no clear evidence" of conflation of [ʊ]/[uː], while the vowel pair [ʌ]/[ɑː] appears to conflate only in terms of tongue height.

However, less attention is paid to the conflation of the final pair of Received Pronunciation (RP) monophthong vowels, [ə]/[ɜː], though they have been found to be conflated in standard SE (Deterding, 1994; Hung, 1995; Bao, 1998). While most studies that omit this vowel pair from their list of vowel pairs examined do not provide reasons for the omission, Brown (1988) provides one: the distinction between [ə]/[ɜː] is primarily one of length rather than tongue positioning, which may be solely related to stress, where [ɜː] appears in stressed syllables while [ə] appears in unstressed ones. He reasons that since SE rhythm is typically not stress-based, he does not consider the distinction between these two vowels in his study. Another possible problem with studying the conflation of [ə]/[ɜː] is that, in natural or spontaneous speech, words containing these vowels in the stressed syllable occur much less frequently (Kent & Read, 2002).

To recapitulate, Singaporeans generally conflate the vowel pairs [ɪ]/[iː], [ɛ]/[æ], [ʌ]/[ɑː], [ɒ]/[ɔː], [ʊ]/[uː], and [ə]/[ɜː], which are normally distinguished in RP. Table 1.1 shows how the vowels are conflated. Certain diphthongs are also shortened (to monophthongs) in SE (Bao, 1998; Gupta, 2001), but because monophthongs are the vowels focused on in this study, this section will not cover that aspect of Singaporean vowel conflation.

Table 1.1: Vowels of RP and SE (adapted from Bao, 1998:158)

RP    SE    Example        RP    SE    Example
ɪ     i     bit            ɒ     ɔ     cot
iː    i     beat           ɔː    ɔ     caught
ɛ     ɛ     bet            ʊ     u     book
æ     ɛ     bat            uː    u     boot
ʌ     ɑ     stuff          ə     ə     about
ɑː    ɑ     staff          ɜː    ə     bird

1.5.2 Consonants

In most cases, standard SE consonants are pronounced much as they are in most other varieties of English. However, the dental fricatives [θ] and [ð] tend to be replaced by the corresponding alveolar stops [t] and [d], at least in initial and medial positions. This, like the conflation of vowels, is a pervasive SE conflation and is noted by many researchers of SE (Tongue, 1979; Platt & Weber, 1980; Brown, 1992; Deterding & Hvitfeldt, 1994; Poedjosoedarmo, 2000; Gupta, 2001). Position-final [θ] and [ð] are commonly replaced by [f] and [v] (Brown, 1992; Bao, 1998; Deterding & Poedjosoedarmo, 1998; Poedjosoedarmo, 2000).

In final position, stops (especially voiceless stops) usually appear as glottal stops [ʔ], and consonant clusters tend to be simplified, often by the omission of the final stop (e.g. tact pronounced as [tɛʔ] or [tɛk]; lift as [lif]) (Gupta, 2001). This is more common in informal speech, but speakers are actually able to produce the appropriate stops or consonant clusters in careful speech.

Also, standard SE speakers do not distinguish voiced from voiceless position-final stops, fricatives, and affricates (Gupta, 2001). The contrast between voiced and voiceless obstruents is neutralised, such that all obstruents are voiceless and fortis, with no shortening of the vowel before them. For example, edge [ɛdʒ] is pronounced by SE speakers as [ɛtʃ], and rice [raɪs] and rise [raɪz] are pronounced identically as [raɪs]. According to Gupta (2001), this conflation apparently occurs even in the careful speech of most SE speakers.

1.6 Motivations for this study

It is fascinating that expressing and identifying emotions come so naturally to us, and yet are so difficult to define. Despite the fact that the voice is an important indicator of emotional states, research on vocal expression of emotion lags behind the study of facial emotion expression. This is perhaps because of the overwhelming number of emotions – or more precisely, emotion labels – and the many different possible vocal cues to study, such that researchers take their pick of emotions and vocal cues in a seemingly random fashion. This makes it difficult to view the studies collectively in order to determine the distinctions between emotions. This study attempts to go back to the basics, so to speak, starting with emotions that are "more primary", less subtle, and most dissimilar from one another, and the vocal cues that are most basic to any language – the vowels and consonants.

And because this study is conducted in Singapore, it is an excellent opportunity to examine SE from a different angle, applying what is known about SE segments – in this case, vowel conflation – to an area of research in which the study of SE is completely new (i.e. emotional speech), in the hope of providing a deeper understanding of conversational SE and the way its features interact with affect, which is ever present in natural conversations. Intuitively, one would expect vowel conflation to be affected by emotions, because vowels are conflated in duration (Brown, 1992; Poedjosoedarmo, 2000; Gupta, 2001) and / or tongue position (Lim, 1992; Loke, 1993; Ong, 1993; Nihalani, 1995), and both duration and tongue position are variables of the voice affected by physiology, which in turn is affected by emotional states (Scherer, 1979; Ohala, 1981; Johnstone et al, 1995). In short, emotional speech involves physiological changes which affect the degree to which vowel pairs conflate. Hence this study hopes to discover the relationship between emotional states and vowel conflation, i.e. whether vowels conflate more often in a certain emotion, and if so, which vowel pairs and how they conflate.

1.7 Aims of this study

The main aim is to determine the vocal cues that distinguish emotions from one another when expressed in English, and how they serve to do so. This study also aims to discover any relationship between emotions and SE vowel conflation, as well as to determine the difference in the expression of emotions between males and females. Vocal cues of four different emotions are examined, the four emotions being anger, sadness, happiness, and neutral. The vocal cues fall under two general categories: vowels and consonants.
Twelve vowels – [ɪ], [iː], [ɛ], [æ], [ʌ], [ɑː], [ɒ], [ɔː], [ʊ], [uː], [ə], [ɜː] – as well as eight obstruents – [p], [t], [k], [f], [θ], [s], [ʃ], [tʃ] – will be analysed. The variables examined are vowel and consonantal duration and intensity, as well as vowel fundamental frequency. Formant measurements will be taken of the vowels in order to compare vowel quality, and VOT and spectral measurements will be taken of the obstruents (depending on the manner of articulation) in order to compare them within their classes. The vocal production and the method of measurement of the specific cues examined are elaborated on in Chapter 3.

The measurements of the vocal cues of anger, sadness, and happiness are compared with those of neutral to determine how these emotions are expressed through these cues. The quality of the vowel pairs (as mentioned in the earlier section on SE vowels) will also be compared across emotions to find out if there is a relationship between vowel conflation and emotional expression. Also, the average measurements of all vocal cues of males are compared with those of females.

In short, the research questions of this study are:
I. whether segmental aspects of natural speech can distinguish emotions, and if so, by which vocal cues (e.g. intensity, duration, spectra, etc.);
II. whether a relationship exists between emotions and the vowel conflation that is pervasive in Singapore English; and
III. whether there is a marked difference in emotional expression between males and females.

CHAPTER TWO
RESEARCH DESIGN AND METHODOLOGY

2.1 The phonetics study

The analysis of the sounds of a language can be done in two ways: by auditory or instrumental means. In this study, the choice of speech extracts (i.e. passages taken from the anecdotal narratives) from which data will be obtained for analysis is based on auditory judgment, at the researcher's own discretion. The data is then analysed instrumentally. However, since auditory perception is subjective, a perception test – using short utterances taken from the speech extracts chosen by the researcher – is conducted in order to verify that the researcher's choices are relatively accurate and representative of general opinion. The perception test is elaborated on in later sections.

2.2 Subjects

According to Tay & Gupta (1981:4), an educated speaker of SE would have the following characteristics:
a) He comes from an English-speaking home where English is used most if not all the time. (It can be added that he would use mostly English in his interaction with friends as well.)
b) He has studied English as a first language in school up to at least GCE 'A' level and very possibly, university.
c) He uses English as his predominant or only language at work.

Two decades later, despite sociological change in Singapore, the criteria have not changed very much; Lim & Foley's (to appear) general description of speakers who are considered native speakers of SE is as follows:
i. They are Singaporean, having been born in Singapore and having lived all, if not most, of their life in Singapore.
ii. They have been educated in English as a first language, with educational qualifications ranging from Cambridge GCE 'A' level (General Certificate of Education Advanced level) to a bachelor's degree at the local university, and English is used as the medium of instruction at every level in all schools.
iii. They use English as their main language at home, with friends, at school or at work; at the same time, most also speak other languages at home, at work, and with friends.

The six subjects for this study, consisting of three males and three females since gender is an independent variable in this study, fulfil all these criteria, and thus can be said to be educated speakers of SE. All subjects are Chinese Singaporeans between 22 and 27 years of age, and are either students or graduates of the National University of Singapore (NUS) or La Salle (a Singapore arts college). Those who have graduated are presently employed. All of them have studied English as a first language in school, and speak predominantly in English to family, friends, and colleagues. The subjects are all close friends or family of the researcher and thus are comfortable with relating personal anecdotes to the researcher on a one-to-one basis.

2.3 Data

This study compares the vowels and consonants expressed in four emotions, namely anger, sadness, happiness, and neutral. These emotions are chosen because, as mentioned before, they are the most commonly used emotions in research on emotional speech, and because they are relatively distinct from one another.

The vowels examined in this study are the vowel pairs [ɪ] and [iː], [ɛ] and [æ], [ʌ] and [ɑː], [ɒ] and [ɔː], [ʊ] and [uː], [ə] and [ɜː]. These vowel pairs are commonly conflated in SE, and one of the aims of this study is to examine the relationship between emotional speech and vowel conflation in SE. The consonants examined are all voiceless obstruents: the stops [p], [t], [k], the fricatives [f], [θ], [s], [ʃ], and the affricate [tʃ]. Voiceless rather than voiced obstruents are examined because voiceless obstruents tend to have greater aspiration and frication. The consonantal conflations that occur specifically in final position in SE (as described in the earlier chapter) will not be examined, because consonants in final position are excluded from the analysis altogether: certain stops and fricatives – such as the stops [p] and [t], and the fricatives [θ] and [ʃ] – drop in intensity when placed in final position (Kent & Read, 2002).

To recapitulate, the research aims of this study are to (i) determine which vocal cues distinguish emotions, (ii) discover if there is a relationship between emotions and SE vowel conflation, and (iii) determine the difference in emotional expression between males and females. The following subsections explain how data is collected for the purpose of this study.

2.3.1 Data elicitation

Many studies on emotional speech tend to rely on professional or amateur actors to mimic emotions. There are advantages in this practice, such as control over the data obtained, ease of obtaining data, and the ability to ensure clarity of recording, which in turn allows greater ease and accuracy in the analysis of the recorded data. However, actor portrayals may be attributable to theatre conventions or cultural display rules, and reproduce stereotypes which stress the obvious cues but miss the more subtle ones which further differentiate discrete emotions in natural expression (Kramer, 1963; Scherer, 1986; Pittam & Scherer, 1993). Hence, this study intends to obtain data from spontaneous speech rather than actor simulation. A long-term research programme (Rimé et al, 1998) has shown that most people tend to share their emotions by talking about their emotional experiences to others.
This means that most people will engage in emotional speech while recounting emotional experiences and therefore such recounts should have an abundance of speech segments uttered emotionally. Thus, data for this study is elicited by engaging the subjects in natural conversation and asking them to recall personal emotional experiences pertaining to the emotions examined in this study, which are anger, sadness, and happiness. With regard to neutral, subjects are asked about their average day at work or school (depending on which applies to them) and perhaps also asked to explain the manner of their job or schoolwork. 2.3.2 Mood-setting tasks Each subject was required to have only one recording session with the researcher to record the Angry, Sad, Happy anecdotes and Neutral descriptions. This was in order to ensure that the recording environment and conditions of each subject were kept constant as far as possible for all of his or her anecdotes and descriptions. Since the subjects had to attempt to naturally express diverse emotions in the span of just a few hours, they were given mood-setting tasks to complete before they recorded each emotional anecdote. These tasks aimed to set the mood – and possibly to prepare 28 the subjects mentally and emotionally – for the following emotional experiences which the subjects were about to relate. They also helped to smoothen the transition between the end of an emotional anecdote and the beginning of another in a completely different emotion, making it less abrupt and awkward for both the researcher and subject. The mood-setting task to be completed before relating the Angry anecdote was to play a personal computer (PC) game, called Save Them Goldfish!, supplied by the researcher on a diskette. The game was played on either a nearby PC or, if there was no PC in the immediate vicinity of the recording, the researcher’s notebook. The game was simple, engaging, and most importantly, its pace became more frantic the longer it ran, thereby causing the subject to be more tensed and excited. The task ended either when the game finally got too quick for the subject and the subject lost, or – if the subject proved to be very adept at it – at the end of five minutes. The stress-inducing task aimed to agitate the subject so that by the end of it, regardless of whether the subject had actually enjoyed playing it, the subject’s adrenaline had increased and he or she was better able to relate the Angry anecdote with feeling than if he or she was casually asked to do so. For the Sad anecdote, the preceding task involved reading a pet memorial found from a website, as well as a tribute to the firemen who perished in the collapse of the United States World Trade Center on September 11, 2001, taken from the December 2001 issue of Reader’s Digest. Because all the recordings were done between July and October, 2002, the memories of the September 11 tragedy were still vivid and the relevance of the tribute was possibly renewed since it was around the time of the first anniversary of the tragedy. The researcher allowed the subject to read in silence for as long as it took, after which the subject was asked which article he or 29 she related to better, and to explain the choice. The purpose of asking the subject to talk about the article which affected him or her more was to attempt to make the topic and the tragedy of the depicted situation more personal for the subject, thereby setting a subdued mood necessary for the Sad anecdote. 
The mood-setting task for the Happy anecdote was simply to engage in idle humorous chatter for a few minutes. Since all the subjects are close friends and family of the researcher, the researcher knew which topics were close to the hearts of the subjects and could easily lighten the mood. There was no mood-setting task for Neutral. The subjects were just asked about their average day at work or school, and, if they had little to say about their average day, asked to explain the manner of their job or schoolwork. It should be noted that the researcher changed her tone of voice in her task instructions and conversations with the subjects in order to suit each task and the following anecdote. This also served to set the mood for each anecdotal recording. 2.4 Procedure The subjects were approached (for their consent to be recorded) months before the researcher’s estimated dates of recordings, and when they agreed to be recorded, they were asked to think of personal experiences which had caused them to be Angry, Sad, and Happy. They were not told of the specific research aims of this study, only that each anecdote should take about five to ten minutes to relate, but if they could not think of a single past event significant enough to take five to ten minutes to talk about, they could relate several short anecdotes. The subjects were not asked to avoid rehearsing their stories as if each was a story-telling performance, because the researcher assumed – correctly – that the subjects would not even attempt to do so due 30 to their own busy schedules. In fact, in one case, the subject even decided on his anecdotes no more than an hour before the actual recording session. For the recording, the subjects could pick any place of recording in which they felt most comfortable, provided the surroundings were quiet with minimal interruptions. Five of the subjects were recorded in their own homes while one was recorded in the Research Scholar’s Room at university. The subjects could sit or rest anywhere during the recording as long as they did not move about too much while they were being recorded. They could also have props or memoirs if they felt that the objects would be helpful and necessary. A sensitive, unobtrusive PZM microphone (model: Sound Grabber II) was placed between the subject and the researcher, and the recording was done on a Sony mini-disc recorder (model: MZ-R55). Before the start of the recording, the subjects were assured that they were not being interviewed and did not need to feel awkward or stressed; they were merely conversing with the researcher as they normally do and just had some personal stories to tell. They were reminded to speak in English and to avoid using any other languages as far as possible. Ample time was given for them to relax so that they would speak as naturally as possible, and they were told they did not have to watch their language and could use expletives if they wanted to. The order of the anecdotes told by each subject was fixed: Neutral, Sad, Angry, then Happy. This order seemed to work because it was easy (on both the researcher and the subject) to start a recording by asking the subject to describe a day at work. Furthermore, subjects seemed to be able to talk at length when describing the nature of their (career or school) work because they wanted to be clearly understood, and this period of time taken was useful for the subjects to get accustomed to speaking in the presence of a microphone, no matter how inconspicuous. 
It was noticed that the 31 subjects quickly learned to ignore the microphone and could engage in natural conversation with the researcher for most of the recording. In fact, majority of the subjects were comfortable enough to become rather caught up, emotionally, in telling their anecdotes; one subject – a close friend of the researcher – even broke down during her Sad anecdote, and then was animatedly annoyed during her Angry anecdote 45 minutes later. After the end of each anecdote and before the mood-setting task of the next, the subjects were always asked if they wanted to take a break, since their anecdotes could sometimes be rather lengthy. On average, each subject took about two hours to complete his or her recording of anecdotes. 2.5 A pilot recording A pilot recording was conducted to test and improve on the effectiveness of the mood-setting tasks and the general format of a recording session. Despite a couple of minor flaws in the initial recording design, which are described in the following paragraphs, the pilot recording is included as data because the subject was very open and honest with her emotions while she was relating her various personal experiences. For Neutral data elicitation, the plan was originally to ask subjects to describe their surroundings. However, the pilot recording revealed that the subject would speak slowly and end with rising intonation for each observation she made, as if she was reciting a list, which did not sound natural. But when the subject came to a jigsaw puzzle of a Van Gogh painting on her wall and was asked more about it, her speech flowed naturally (and in a neutral tone) as she explained in detail the history of the painting and the painter. Because the subject has a strong interest in Art and is also a qualified Art teacher, it was realised that it was more effective to ask subjects to 32 explain something which was familiar to them rather than to ask for a visual description of the surroundings. Hence the prompt for Neutral was changed to asking subjects about their average day and possibly asking them to elaborate on their work. The mood-setting task for the Sad recording initially consisted of two articles on the September 11, 2001 tragedy: a tribute to the firemen, and a two-page article on several families who had exchanged last words with their loved ones on Flight 93 – both of which were taken from the December 2001 issue of Reader’s Digest. Subjects were then supposed to be asked what they thought was most regrettable about the tragedy. The subject for the pilot recording ended up expressing her political opinion, but as mentioned, she was emotionally honest when she related her Sad personal experience (to the extent of weeping at certain points of her tale), and thus her recording was still suitable for use as data despite the fact that the task did not serve its purpose. Following the suggestion of the subject, the longer article was replaced by a pet memorial, which would be an effective mood-setting task for subjects who are animal lovers, or who have or have had pets. 2.6 Perception test As mentioned at the beginning of this chapter, data for analysis is chosen from sections in the recordings which the researcher feels are more emotionally expressive. In order to verify that the researcher’s choices are relatively accurate and representative of general opinion, a perception test was conducted using short utterances taken from the segments chosen by the researcher. 
The perception test was taken by 15 males and 15 females, all students of NUS and between 22 and 25 years of age. The listeners were given listening test sheets on which all the utterances were written out – without indication of who the speakers 33 were – so that they could read while they listened, in case they could not make out the words in the utterances. The listeners were told that they would hear 72 utterances from different speakers, and the utterances were pre-recorded in a random order but in the sequence as printed on the test sheets. They were given clear instructions that they would hear each utterance only once, after which they would have approximately ten seconds to decide whether it sounded Angry, Sad, Happy, or Neutral, and they had to indicate their choice by ticking the appropriate boxes corresponding to the emotions. The listeners were also reminded to judge the utterances based on the manner – instead of content – of expression. 2.6.1 Extracts for the test The speech extracts for the perception test were taken from all the recordings of the three male and three female subjects. Three utterances were taken from each of the Angry, Sad, Happy, and Neutral recordings of each subject, making a total of 72 utterances for the entire perception test. Two of the three utterances were extracted from the sections of the recordings which were considered very expressive, and one was extracted from the sections which were considered somewhat expressive (cf. Chapter Three section 3.2.1.1 regarding segmenting recordings according to expressiveness). The utterances were randomly chosen from the expressive sections by the researcher. All the utterances started and ended with a breath pause, indicating the start and end of a complete and meaningful expression. It was ensured that they consisted only of clear-sounding speech, and lasted at least two full seconds. This was so that the listeners could discern what was being said, and that the utterances were not too short for the listeners to perceive anything. 34 The test lasted about 15 minutes. However, it was felt that the length of time for the test was sufficient as increasing the number of utterances for each emotion per subject would result in having too many utterances fro subjects to listen to. 2.6.2 Test results With 30 listeners judging three utterances from each of the six subjects, the total number of possible matches for each of the four emotions is 540. (A match occurs when the emotion perceived by the listener is the same as the intended emotion of the anecdote from which the utterance was extracted.) A breakdown of the results is shown in the table below, and illustrated by the bar chart following the table. It should be stressed that the perception test is done to verify that the researcher is able to identify utterances which are representative of general opinion (on the emotion perceived), not to determine the specific utterances from which tokens for analysis are later taken. The a priori cut-off for acceptance that the researcher’s choices are accurate is set at 60%, which means that as long as there are more than 60% matches in an emotion, it is concluded that the researcher is able to accurately pick utterances which listeners in general feel are representative of that emotion. This in turn means that it will thus be acceptable that the researcher rates the expressiveness of the sections of recordings and also (randomly) selects the tokens for analysis. 
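Before the results are presented, the sketch below illustrates how the match and non-match counts reported in Table 2.1 can be tallied from listener responses. The data structure is hypothetical (the thesis does not state that scoring was automated); with 30 listeners each judging 18 utterances per emotion (three utterances from each of six subjects), every intended emotion yields the 540 possible matches referred to above.

```python
# Hedged sketch: tallying perception-test matches per intended emotion.
# "responses" is a hypothetical list of (intended, perceived) label pairs,
# one entry per listener judgment.
from collections import Counter, defaultdict

responses = [
    ("Angry", "Angry"), ("Angry", "Neutral"), ("Sad", "Sad"),
    ("Happy", "Neutral"), ("Neutral", "Neutral"),   # ...540 per emotion in full
]

tallies = defaultdict(Counter)
for intended, perceived in responses:
    tallies[intended][perceived] += 1

for intended, counts in tallies.items():
    total = sum(counts.values())
    matches = counts[intended]
    print(f"{intended}: {matches}/{total} matches ({100 * matches / total:.2f}%)")
```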
However, if fewer than 60% matches are made for an emotion in the perception test, it means that the researcher’s perception of that emotion is not similar to that of listeners in general, and independent raters of the expressiveness of the sections of the recordings will thus be needed before the researcher can select tokens for analysis from those sections.

Table 2.1: Results of perception test

Intended emotion    Matches           Non-matches
Angry               519 (96.11%)      13 Neutral (2.41%); 8 Happy (1.48%)
Sad                 439 (81.30%)      99 Neutral (18.33%); 2 Angry (0.37%)
Happy               371 (68.70%)      138 Neutral (25.56%); 22 Angry (4.07%); 9 Sad (1.67%)
Neutral             488 (90.37%)      9 Angry (1.67%); 20 Happy (3.70%); 23 Sad (4.26%)

Figure 2.1: Bar chart of results of perception test (percentage of matches and non-matches, by intended emotion of anecdote)

As can be seen from the table and chart of the results, there is a high accuracy of recognition for Angry, Neutral, and Sad, and a percentage of matches large enough for Happy, such that it can be concluded that the researcher’s perception of emotions is an accurate reflection of that of listeners in general. One possible reason for the large number of matches across emotions is that the test only required listeners to choose from four emotions which were relatively dissimilar from one another. Taking the dimensional approach, it can be explained that when listeners are asked to recognise emotions with relatively different positions in the underlying dimensional space, they only have to infer approximate positions on the dimension in order to make accurate discriminations (Pittam & Scherer, 1993). Another reason could be that, despite the reminder from the researcher to judge based on manner of expression, the semantics of the utterances might still have played a part in affecting the decisions of the listeners. However, these reasons do not discount the fact that listeners are generally able to infer emotions from voice samples, regardless of the verbal content spoken, with a degree of accuracy that largely exceeds chance (Johnstone & Scherer, 2000:228).

It is interesting to note that Neutral forms a large fraction of the misidentifications of the Angry, Sad, and Happy utterances. This is probably due to the fact that people are not extremely expressive when recounting past experiences. Days, months, or even years might have passed since the event itself, and hence the emotions expressed are possibly watered down to some extent. It is therefore understandable that when these expressions are extracted in the form of short utterances and judged without the help of context, they can sound like Neutral utterances. However, while the absolute differences between the emotions might be smaller because they may all be closer to Neutral, the relative differences between the emotions are still accurate representations of the relative differences between fully expressed emotions.

Generally, the results of this perception test show that a large percentage of listeners could identify the intended emotion of the utterances (likewise perceived by the researcher as expressively uttered in the respective emotion). Another possible interpretation of the results is that a large percentage of the utterances chosen for the perception test could be correctly identified by the emotion expressed.
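As a quick arithmetic check, the percentages in Table 2.1 can be re-derived from the raw match counts. The short Python sketch below is purely illustrative (it is not part of the analysis procedure of this study) and assumes only the counts reported above.

```python
# Each emotion has 30 listeners x 3 utterances x 6 subjects = 540 possible matches.
possible = 30 * 3 * 6

matches = {"Angry": 519, "Sad": 439, "Happy": 371, "Neutral": 488}
for emotion, count in matches.items():
    print(f"{emotion}: {count}/{possible} = {100 * count / possible:.2f}%")
# Angry 96.11%, Sad 81.30%, Happy 68.70%, Neutral 90.37% -- all above the 60% a priori cut-off.
```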
It can thus be concluded that the researcher’s choices of data are generally accurate and representative of general opinion, and that the researcher can therefore rate the expressiveness of the sections of the recordings from which tokens of sound segments are analysed.

CHAPTER THREE
PRE-ANALYSIS DISCUSSION

3.1 Speech sounds explained

Before the presentation and analysis of data, it is necessary to briefly explain how the speech sounds that are examined in this study are produced in general.

3.1.1 Vowels

In the course of speech, the source of a sound produced during phonation consists of energy at the fundamental frequency and its harmonics. The sound energy from this source is then filtered through the supralaryngeal vocal tract (Lieberman & Blumstein, 1988:34ff; Kent & Read, 2002:18). The articulators such as the tongue and the lips are responsible for the production of different vowels (Kent & Read, 2002:24). The oral cavity changes its shape according to the tongue position and lip rounding when we speak, and the cavity is shaped differently for different vowels. It is when the air in each uniquely shaped cavity resonates at different frequencies simultaneously that its characteristic sounds are produced (Ladefoged, 2001:171). These frequencies then appear as dark bands of energy – known as formants – at various frequencies on a spectrogram. The lowest of these bands is known as the first formant or F1, and the subsequent bands are numbered accordingly (2001:173).

The first and second formants (F1 and F2) are most commonly used to exemplify the articulatory-acoustic relationship in speech production, especially that of vowels (see Figure 3.1). F1 varies inversely with vowel height: the lower the F1, the higher the vowel. F2 is generally related to the degree of backness of the vowel (Kent & Read, 2002:92): F2 is lower for back vowels than for front vowels. However, the degree of backness correlates better with the distance between F1 and F2, i.e. F2-F1 (Ladefoged, 2001:177): its value is higher for front vowels and lower for back vowels.

Figure 3.1: A schematic representation of the articulatory-acoustic relationships. Figure adapted from Ladefoged (2001:200).

3.1.2 Consonants

Consonants differ significantly among themselves in their acoustic properties, so it is easier to discuss them in groups that are distinctive in their acoustic properties (Kent & Read, 2002:105). In this study, the groups of consonants examined are the stop, fricative, and affricate. The following sub-sections briefly explain these groups of consonants articulatorily and acoustically so as to provide a general understanding of their differences and why they cannot be treated simply as one large class.

3.1.2.1 Stops

A stop consonant is formed by a momentary blockage of the vocal tract, followed by a release of the pressure. When the vocal tract is obstructed, little or no acoustic energy is produced. But upon the release, a burst of energy is created as the impounded air escapes. In English, the blockage occurs at one of three sites: bilabial, alveolar, or velar (the glottal is usually considered separate from the rest) (Kent & Read, 2002:105-6). The stops examined in this study are the voiceless bilabial [p], alveolar [t], and velar [k]. Stops are typically classified as either syllable-initial prevocalic or syllable-final postvocalic (2002:106).
Syllable-initial prevocalic stops are produced by, first, a blockage of the vocal tract (stop gap), followed by a release of the pressure (noise burst), and finally, formant transitions. Syllable-final postvocalic stops begin with formant transitions, followed by the stop gap, and finally, an optional noise burst. Only syllable-initial prevocalic stops are examined in this study, since syllable-final postvocalic stops do not always have a noise burst and are therefore not reliable cues.

The stop gap is an interval of minimal energy because little or no sound is produced, and for voiceless stops, the stop gap is virtually silent (2002:110). Silent segments can sometimes be pauses instead of stop gaps, and thus stop gaps are relatively difficult to identify and quantify in a spectrogram, especially when a stop follows a pause. The noise burst can be identified in a spectrogram by a short spike of energy usually lasting no longer than 40 milliseconds (2002:110). Formant transitions are the shift of formant frequencies between a vowel and an adjacent consonant. In the case of the syllable-initial prevocalic stops examined in this study, formant transitions are the shift of formant frequencies from their values for the stop to those for the vowel. Considering that formant transitions are mentioned as part of the acoustic properties of stops, it was decided that formant transitions should be included in the measurements of the voiceless stops in this study. The manner of inclusion of formant transition values will be explained in the later section on consonant measurements.

There are a couple of acoustic properties of stops which are commonly measured, one of which is the spectrum of the stop burst, which varies with the place of articulation (Halle et al, 1957; Blumstein & Stevens, 1979; Forrest et al, 1988). Kent & Read (2002:112-5) give a brief overview of some of the studies that have been done on the identification of stops from their bursts, and surmise that correct identification of stops is possible if several features are examined, namely the spectrum at burst onset, the spectrum at voice onset, and the time of voice onset relative to burst onset (VOT). VOT, or voice onset time, is another acoustic property commonly associated with the measurement of stops. It is the time interval “between the articulatory release of the stop and the onset of vocal fold vibrations” (2002:108). When the voicing onset precedes the stop release (usually the case for voiced stops), the VOT has a negative value. A positive value is obtained when the onset of voicing slightly lags the articulatory release (usually so for voiceless stops). These acoustic properties will be referred to in the later section on consonant measurements, where the choice of acoustic properties measured, as well as the methods by which the measurements are made, are explained.

3.1.2.2 Fricatives

Fricative consonants are formed by air passing through a narrow constriction maintained at a certain place in the vocal tract, which then generates turbulence noise (Kent & Read, 2002:121-2). A fricative can be identified in a spectrogram by its relatively long period of turbulence noise. And as with stops, formant transitions join fricatives to preceding and / or following vowels, reflecting the movement of the tongue and jaw (Hayward, 2000:190). Fricatives may be classified into stridents and non-stridents, the main difference between them being that strident fricatives have much greater noise energy than non-stridents.
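As a rough illustration of this energy difference, the overall level of a frication noise interval can be quantified as an RMS amplitude expressed in decibels. The Python sketch below is only an illustration under the assumption that the frication interval has already been excerpted as an array of samples; it is not the measurement procedure used in this study, which relies on the Kay CSL analysis software.

```python
import numpy as np

def rms_level_db(samples, ref=1.0):
    """Overall level of a waveform segment (e.g. a frication interval),
    computed as the RMS amplitude and expressed in dB re. an arbitrary reference."""
    samples = np.asarray(samples, dtype=float)
    rms = np.sqrt(np.mean(samples ** 2))
    return 20.0 * np.log10(rms / ref)

# A strident such as [s] would typically yield a markedly higher value than a
# non-strident such as [f] excerpted from the same recording under the same conditions.
```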
In this study, the voiceless fricatives are grouped into the stridents [s] and [ʃ], and the non-stridents [f] and [θ]. In addition to the intensity of the noise energy, fricatives can be differentiated from one another by comparing various features of their spectra, such as spectral shape, length, prominent peaks, and slope. Evers et al (1998) provide a comprehensive discussion of the role of spectral slope in classifying stridents. A good method by which to obtain spectral readings for the comparison of fricatives is to average several spectra taken over the course of each fricative and then compare the averages; however, not all speech analysis software provides this averaging function. Where it is unavailable, the best alternative is to compare single readings of narrow band spectra (Hayward, 2000:190). The measurement of spectra is not the focus of this study, so to explain spectrum bandwidth simply without going into too much detail, it must first be mentioned that the number of points of analysis (always in powers of two, i.e. 16, 32, 64, 128, etc.) is inversely proportional to the bandwidth of the analysing filters: generally, the greater the number of points, the narrower the bandwidth (Hayward, 2000:75-6). A 64-point analysis is taken as an example of an acceptable narrow band spectrum by Hayward (2000), and hence this was the researcher’s default choice in this study when narrow band spectral analyses were done.

Another acoustic measurement that can be made of fricatives is rise time, which is “the time from the onset of the friction to its maximum amplitude” (Hayward, 2000:195), meaning that the reading is taken by observing the waveform, and not the spectrogram. Rise time is one of the acoustic cues that distinguish fricatives from affricates (Howell & Rosen, 1983) – it is longer in fricatives because the rise of frication energy is more gradual in fricatives and more rapid in affricates. The mean rise time measured by Howell & Rosen (1983) for affricates was 33 ms, while that for fricatives was 76 ms.

3.1.2.3 Affricates

The affricate involves a sequence of stop and fricative articulations. The production of affricates begins with a period of complete obstruction of the vocal tract, like stops, and is followed by a period of frication, like fricatives. In English, there are only two affricates: [dʒ] and [tʃ], the latter being the one examined in this study since it is voiceless. Since the production of affricates involves both stop and fricative articulations, it follows that the acoustic measurements mentioned in the earlier sub-sections on stops and fricatives are also applicable to affricates. Hence, as will be explained in the later section on consonant measurements, the acoustic properties the researcher chooses to measure for the stops and fricatives in this study will also be measured for the voiceless affricate, so that comparisons can be made between the affricate and the former two classes of obstruents.

3.2 Method of analysis

The following sub-sections explain the criteria for the selection of data and the details of the measurements of the vocal cues.

3.2.1 Selecting data for analysis

One of the difficulties of using natural speech is the selection of the “more ideal” tokens (in this case, of specific vowels and consonants) for analysis. In a study of speech segments, the phonological environment of the segments examined – and even of the words in which the segments appear – should ideally be kept constant so as to allow fair comparisons of the tokens.
This can be done by recording scripted speech, but it is virtually impossible to achieve when natural speech is involved. In the case of selecting segments from natural speech for data analysis, what is feasible is to determine a set of conditions for the phonological environment of the segments, and then find tokens which satisfy most – if not all – of those conditions. Below is the list of conditions for the tokens for analysis in this study, separated into two sub-sections.

3.2.1.1 Segmenting the recordings

1. Tokens should preferably be from sections in the recordings which the researcher determines are more emotionally expressive.

It is never the case that a whole narrative is related at one high level of expressiveness; the level of expressiveness always varies throughout. For example, a narrative involving an Angry anecdote may consist of Neutral sections (such as at the introduction of the narrative), and sections narrated in various degrees of anger. Therefore, since this study is of the role of segmental features in emotional speech, it is imperative that the tokens for analysis be extracted from the expressive sections. Hence, for the Angry, Sad, and Happy anecdotes, the researcher marks out the sections which are very expressive and somewhat expressive, ignoring those which are hardly expressive or not expressive at all. (Sections are considered very expressive when the subject sounds very emotional while speaking for a period of at least three seconds. When the subject sounds expressive, but perceivably less emotional than in the very expressive sections, and does so for at least three seconds, the section is marked as somewhat expressive. All other sections, in which the subject sounds neutral, are ignored.)

Six random tokens of each vowel and consonant variable are taken from the very expressive sections, and four from the somewhat expressive ones (the expressiveness of the sections of the recordings having been rated earlier by the researcher). This is to prevent the selection of data that is stereotypical of high levels of the emotions. (When fewer than six tokens can be found in the very expressive sections, more tokens are obtained from the somewhat expressive sections.) For the Neutral anecdotes, sections in which the subject sounds expressive (such as when a joke is shared between the subject and the researcher) are ignored, and ten random tokens of the variables are selected from the remaining sections.

2. The first minute of each emotional recording is ignored.

Because some subjects took a short while to get used to being recorded, tokens are not taken from the first minute of each emotional recording. While this may not be a large concern in the Angry, Sad, and Happy anecdotes, because the first minute is usually ignored anyway on the basis that the subject sounds non-expressive, it is a valid consideration for the Neutral recording, especially since the subjects began their recording session with Neutral (followed by Sad, Angry, and Happy).

3.2.1.2 Phonological environment

1. Tokens of consonants should preferably be followed by a vowel, and are never taken from word-final position.

This condition ensures that word-final consonants are not selected as data (there should be no compromising this condition) because certain voiceless obstruents drop in intensity when placed in final position (Kent & Read, 2002). Furthermore, in Singapore English, consonants in the final position are sometimes deleted, and consonant clusters are sometimes simplified.
Besides, if formant transitions are not to be ignored in the acoustic measurements of the obstruents, the transitions (by definition) have to be from consonant to vowel or vice versa, and not consonant to consonant.

2. Tokens of vowels should preferably be followed by a voiceless consonant.

Because vowels tend to lengthen in open syllables or when followed by a voiced consonant (Kent & Read, 2002:109; Ladefoged, 2001:83), tokens of vowels should be followed by a voiceless consonant, as far as possible.

3. Tokens should preferably be taken from monosyllabic words.

The duration of a segment tends to shorten when more elements are added to a single sound string (Kent & Read, 2002:146; Ladefoged, 2001:83). Therefore tokens should preferably be taken only from monosyllabic words to ensure that the data is systematic. It was found that the subjects tended to use simple words in their recordings, so there is a relatively large number of monosyllabic words from which to select tokens for analysis. However, in the event that not enough satisfactory tokens of a variable can be found in monosyllabic words, the search is expanded to disyllabic words, and if that still fails to provide enough tokens, trisyllabic words are considered. If such tokens are kept to a minimum, this should not greatly affect the results of the analysis, because all values obtained for each variable are averaged.

4. Tokens should be taken from words which are in the middle of a phrase, with the phrase in the middle of a clause, a sentence, or a stretch between two breath pauses.

The last stressable syllable in a major syntactic phrase or clause tends to be lengthened (Kent & Read, 2002:150), and this phenomenon – called phrase-final lengthening – is particularly marked in Singapore English (Low & Grabe, 1999). Therefore, to avoid selecting data with phrase-final lengthening, the placement of the words from which the tokens are taken should satisfy this condition.

3.2.2 Measurements of tokens

There are some similarities in the variables measured and calculated for the vowel and consonant tokens (60 for each speech segment): the intensity and duration (and also the fundamental frequency, or F0, in the case of vowels) of each token are measured, from which the means of the minimum, maximum, range, and average values are calculated. Using the Kay CSL software, these values are easily obtained as long as the speech segment for measurement is clearly marked for analysis. Within every emotion, the measurements are separated according to gender. For each gender, the mean minimum value is an average of the smallest reading from each subject, the mean maximum is an average of the largest, and the mean average is simply the average of all the values from all subjects of that gender. However, the mean range is not an average of ranges, but the range between the mean maximum and mean minimum values.

The researcher has decided that formant transitions between the vowels and adjacent consonants should also be taken into consideration. As mentioned in the earlier section detailing the acoustic properties of the various classes of obstruents, formant transitions are always mentioned as part of the acoustic properties of stops, fricatives, and affricates, indicating that they are part of the consonants themselves. However, because they are transitions, they are not just a part of the consonant but also a part of the vowel from or towards which the formants are moving.
Thus it was decided that, for the purpose of this study, the midpoint of each transition is marked and transitions are viewed as two halves: the half nearer the consonant is considered part of the consonant, and the half nearer the vowel is considered part of the vowel; in this way the transitions are included in the measurements of both the vowels and the consonants. Before explaining the details of the vowel and consonant measurements, it should be mentioned that all final values measured in decibels are rounded off to the nearest 0.01 dB; those measured in seconds are rounded off to the nearest 0.001 sec; and those measured in Hertz are rounded off to the nearest 0.01 Hz.

3.2.2.1 Vowel measurements

Vowels are identified in the spectrograms by their horizontal dark bands of formant frequencies. As mentioned earlier, formant transitions are considered part of the vowel. Hence, in the case that there are consonants before and after the vowel, the midpoints of the formant transitions are marked as the respective beginning and end of the vowel. The duration of the vowel is obtained from between these marks. The intensity and F0 readings are obtained by calling up the intensity and pitch contours respectively, and noting the mean intensity and F0 (statistically calculated by the Kay CSL analysis program).

The two other variables measured for each vowel are the F1 and F2 values. The values are observed from the spectrogram and taken at the midpoint of the vowel. The F2-F1 value is then calculated for every vowel uttered by subtracting the F1 value from the F2 value. For each emotion, the values are separated according to gender. The averages of the F1 values and also of the F2-F1 values are then found for each of the vowels for both speakers of each sex. The final values are rounded off to the nearest integer before tabulation.

3.2.2.2 Consonant measurements

To recapitulate, the consonants examined in this study are all voiceless obstruents falling under three groups: stop, fricative, and affricate. The measurements taken are of the intensity and duration. However, consonants differ significantly among themselves in their acoustic properties, so while measuring across the entire production of a vowel for intensity, duration, and F0 is feasible, one cannot measure across the entire production of the different obstruents – stops with their stop gap and release burst, fricatives throughout their noise segments, and the affricate with its stop gap followed by a noise segment (Kent & Read, 2002) – and expect to compare them fairly. Since the presence of a frication interval is common to all the obstruents – at the release burst of stops and the noise segments of fricatives and the affricate – it was decided that the intensity and duration measurements of the obstruents would be taken across their frication interval, in the same manner as described for measuring vowel intensity and duration. In the case of the stops and the affricate, the onset of the stop gap is marked as the beginning of the obstruent, and the end is marked at the first half of the formant transition between the obstruent and the following vowel. With fricatives, the beginning and end of the obstruent are marked by the midpoints of the formant transitions.

In the interest of comparing the obstruents within their separate classes, further measurements (besides intensity and duration) are taken, based on the relevance of the variable to the class of the obstruent.
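Before turning to these class-specific variables, the aggregation described at the start of section 3.2.2 can be made concrete with a small sketch. The Python code below uses entirely hypothetical subject labels and readings (the figures in this study were derived from Kay CSL measurements, not computed in Python) and shows how the mean minimum, mean maximum, mean average, and mean range of one cue would be obtained for one gender within one emotion.

```python
from statistics import mean

def aggregate(readings_by_subject):
    """readings_by_subject: {subject: [per-token readings]} for one gender,
    one emotion, and one cue (e.g. vowel intensity in dB)."""
    mean_min = mean(min(v) for v in readings_by_subject.values())  # average of each subject's smallest reading
    mean_max = mean(max(v) for v in readings_by_subject.values())  # average of each subject's largest reading
    mean_avg = mean(x for v in readings_by_subject.values() for x in v)  # average of all readings from all subjects
    mean_range = mean_max - mean_min  # range between mean maximum and mean minimum, not an average of ranges
    return mean_min, mean_max, mean_avg, mean_range

# Hypothetical intensity readings (dB) for three female subjects:
print(aggregate({
    "subject A": [62.1, 65.4, 60.8],
    "subject B": [58.9, 63.2, 61.7],
    "subject C": [66.0, 64.3, 59.5],
}))
```

The intra-class variables chosen for each group of obstruents are described next.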
For stops, as explained in an earlier section, the two acoustic cues commonly measured are the spectrum and voice onset time (VOT). Kent & Read (2002) note that spectral values at both burst onset and voice onset should be taken for the best comparison. But since stops are just one of the classes of obstruents and are not the main focus of this study, and also due to time constraints and the need for simplicity and economy of measurement, spectral values are not chosen as the intra-class variable. Hence the variable with which to compare the stops within their obstruent class is VOT. Because VOT is by definition the period between the onset of the stop release and the onset of voicing, the stop gap and formant transition are excluded from the time measurement, and hence there is no confusion between the VOT measurements and the duration measurements of the stops, despite their having the same unit of measurement.

The two acoustic cues with which to compare fricatives among themselves are the spectrum and rise time. Rise time is not feasible in this study simply because, while it is easy to measure in recordings of clear read speech, it is virtually impossible in recordings of natural emotional speech to distinguish where the rise time of a fricative starts and ends. And since the literature (Manrique & Massone, 1981; Kent et al, 1996; Evers et al, 1998; Hayward, 2000; Kent & Read, 2002) has shown that a single spectral feature can be used as a comparison within fricatives, it was decided that the Fast Fourier Transform (FFT) power spectral peak is the variable with which to compare the fricatives. It was discovered that the analysis program used for this study does not have the function of averaging several spectra taken over the course of a fricative, so, as suggested by Hayward (2000), a single reading of a narrow band spectrum is taken at the midpoint of each fricative, using a 64-point analysis. The method involves calling up the FFT power spectrum at the midpoint of the fricative, such that frequencies of up to 10 kHz are displayed, and noting the frequency at which the intensity is highest.

Finally, with the voiceless affricate, both VOT and the FFT power spectrum at the midpoint of the noise are measured, because the affricate involves both stop and fricative articulations. The affricate is then compared separately against the stops and the fricatives.

3.2.3 Method of comparison

For all classes of vocal cues mentioned above, the values obtained for each emotion are compared to those for Neutral, to determine if the sound segment contributes to the expression of emotion, and statistical tests are applied to determine if the differences are significant. Besides the statistical significance of the differences in values, the just-noticeable difference (perceptible to the human auditory system) of the vocal cues is also taken into consideration when comparing the values. Hence, a few points with regard to the perception of F0, intensity, and duration are addressed in the following sub-sections, followed by a brief explanation of the statistical tests which are applied in this research.

3.2.3.1 Fundamental frequency (F0)

The average values for F0 in conversational speech in European languages are about 200 Hz for men, 300 Hz for women, and 400 Hz for children (Kent & Read, 2002). F0 perception operates by intervals, such that the difference between one pair of F0 values sounds the same as the difference between another pair if the ratio of each pair is the same (Kent et al, 1996:233).
In other words, the difference between 200 and 100 Hz is considered perceptually equivalent to the difference between 500 and 250 Hz, or to other differences in which the ratio is 2:1. In terms of absolute values, the just-noticeable difference for F0 perception is about 1 Hz, in the span of 80 to 160 Hz (Flanagan, 1957:534; Kent et al, 1996:233).

3.2.3.2 Intensity

The human auditory system is very sensitive to the intensity of sounds, and can cope with a wide range of intensities (Laver, 1994). A normal conversation is conducted at a level of around 70 dB, a quiet conversation at around 50 dB, and a soft whisper at around 30 dB (Moore, 1982:8). The just-noticeable difference in intensity has a value of about 0.5 to 1 dB (Rodenburg, 1972) within the range of 20 to 100 dB (Miller, 1947).

3.2.3.3 Duration

According to Lehiste (1972), the human auditory system is psychophysically able to register minute temporal differences of duration under favourable experimental conditions. The psychophysical threshold for a just-noticeable difference in duration between two sounds is approximately 10 to 40 msec (1972:226), which is 0.01 to 0.04 sec.

3.2.3.4 Statistical tests explained

In this study, the comparisons of the measurements of vocal cues are always made between only two sets of means. For example, the mean intensity values of all the [$] uttered by females are averaged within each emotion, giving four values of intensity means of female [$]: Angry, Sad, Happy, and Neutral. The Angry, Sad, and Happy intensity means are then individually compared with the Neutral intensity mean, and a statistical test is necessary to determine if the difference between the means (i.e. the difference between the intensity means of Neutral [$] and of [$] in another emotion) is significant. Because of the random nature of the sample values, the resulting difference between these means can be either positive or negative. Therefore, the two-tailed t-test is the appropriate statistical analysis for the sample values of vocal cues in this study.

There are two distinct t-tests, which have slightly different mathematical formulae: differences in means versus mean of the differences. The former t-test is used when there are two independent samples, while the latter is used when the samples are not independent but are in fact paired comparisons. For example, the t-test analysing the differences in means is used when vocal cues are compared across genders, since the sample values are taken from independent samples (male versus female). Comparisons of means within each gender (across emotions) are analysed using the t-test of the mean of the differences, since the comparisons are between pairs of values taken from the same group of subjects (e.g. Angry-Neutral pairs of [$] intensity; [£]-[$] pairs of F1 values from Sad recordings). Therefore, it is evident that both t-tests are necessarily used in this study.

With t-tests, the essential steps are:
1. Establishing the hypothesis in terms of the null hypothesis and the alternative hypothesis
2. Specifying the level of significance, and thereby determining the critical region using a t table (easily found in any statistics text)
3. Applying the appropriate test statistic to determine the t-value
4. Drawing a conclusion of rejecting or accepting the null hypothesis based on whether the calculated t-value lies within the critical region

In this study, for all comparisons of values of vocal cues, the null hypothesis is that the means of the two cues are equal (i.e.
difference = 0), and the alternative hypothesis is that the means are not equal and that the difference is significant (as opposed to being simply due to chance variations). The level of significance is commonly set at 0.05 for most statistical analyses, while stricter tests set the level of significance at 0.01. To have a significance level of 0.05 means that the probability that significant results will occur due to chance variations is 5%. In this study, both levels of significance are taken into consideration – results that are not significant at a level of significance of 0.01 are not automatically dismissed as insignificant; they are checked to determine if they are significant at a level of 0.05. If the latter is found to be true, then the observation is noted that the difference is significant only at a level of 5%, not 1%.

A spreadsheet program (Microsoft Excel 2002) is used to apply fully automated t-tests without the researcher having to do any calculations. Given two sets of values, the program is able to find the means of, and apply t-tests to, both sets of values. The outcome is a number representing the probability that significant results will occur due to chance variations. In the rest of this dissertation, p is used to indicate this number. In other words, when p is smaller than the chosen level of significance (0.05 or 0.01), the null hypothesis is rejected and the difference is taken to be significant at that level.
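The same two-tailed comparisons can also be reproduced outside a spreadsheet. The sketch below uses SciPy with made-up values, purely to illustrate the paired (mean of the differences) versus independent (differences in means) t-tests described above; the numbers are assumptions for illustration and are not data from this study.

```python
import numpy as np
from scipy import stats

# Hypothetical mean-intensity readings (dB) for one vowel, from the Neutral and
# Angry recordings of the same three female subjects (one value per subject).
neutral = np.array([60.2, 58.7, 61.5])
angry = np.array([66.8, 63.1, 67.4])

# Paired comparison (same subjects across emotions): t-test of the mean of the differences.
t_paired, p_paired = stats.ttest_rel(angry, neutral)

# Independent comparison (e.g. male versus female values of the same cue):
# t-test of the differences in means; the same arrays are re-used here only for illustration.
t_indep, p_indep = stats.ttest_ind(angry, neutral)

for label, p in [("paired", p_paired), ("independent", p_indep)]:
    if p < 0.01:
        verdict = "significant at the 1% level"
    elif p < 0.05:
        verdict = "significant only at the 5% level"
    else:
        verdict = "not significant"
    print(f"{label}: p = {p:.4f} ({verdict})")
```

Both SciPy functions return two-tailed p values by default, matching the two-tailed tests used in this study.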