CHAPTER ONE
PRELIMINARIES AND MOTIVATIONS FOR THIS STUDY
1.1  Introduction
We communicate our attitude towards all utterances,
even if this is to indicate as far as possible that we have
no attitude.
(Crystal, 1969:289)
The voice is a powerful source of information. There is surely an abundance of
affective information in the voice; it conveys a wide variety of paralinguistic sources
of information. In fact, vocalisation may be much more contagious than facial or
bodily expressions (Lewis, 2000a). If you have ever watched a television show with
the sound turned off, you will know that it is hardly an engaging experience. While
you may be able to follow the overall plot, you will probably miss the nuances of the
emotions portrayed by the characters, because they have been, literally, muted.
The important role of vocal expression in the communication of emotion has
been recognised since antiquity. In his classic work on the expression of emotion in
animals and human beings, Darwin (1872/1965) attributed primary importance to the
voice as a carrier of emotional cues. As Scherer (1989:233) points out, “the use of the
voice for emotional expression is such a pervasive phenomenon that it has been
frequently commented upon since the beginning of systematic scientific interest in
human expressive behavior.” Vocal expressions are extremely powerful and may have
the ability to elicit similar emotional states in others. Despite this, however, there is
little systematic knowledge about the details of the auditory cues which are actually
responsible for the expression and perception of emotion in the voice. Studies
regarding emotional speech have been done in the last few decades, but few
researchers actually agree on how to define the phonetic quality of expressed emotions.
Among the various vocal cues of emotions that have been studied, intonation is
the most common, and many researchers have shown that intonation is an effective
means of expressing the speaker’s emotion (Williams & Stevens, 1972; O’Connor &
Arnold, 1973; Abe, 1980; Bolinger, 1986; Chung, 1995) – the same word or phrase,
when spoken using varying intonation, can reflect very different emotions or attitudes
which are easily recognisable to listeners. Few studies, however, involve an analysis of
the vowels and consonants in emotional speech, in spite of the fact that these segments
of speech are affected by emotion (Williams & Stevens, 1972; Scherer, 1986; Chung,
1995; Hirose et al, 1997). Hence, one of the aims of this study is to examine the
segmental features – and the role they play – in the expression of emotional English.
Because this study is conducted in Singapore, it is also interesting to look at
certain features of the local variety of English. There have been many studies on
Singapore English (henceforth known as SE) in the past few decades, which have
progressed from identifying the structural mistakes of SE (Elliott, 1980) to establishing
SE as a standard form of English and describing what its features include (Tongue,
1974; Platt et al, 1984; Gupta, 1992; Brown, 1999; Zhu, 2003). Recent researchers
have generally agreed on the existence of certain features of SE, and are turning their
attention to ethnic variations of these features since Singapore is a multi-ethnic society
(Ho, 1999; Poedjosoedarmo, 2000; Lim, 2001). This study aims to approach SE
research from a new angle by looking at the relationship between certain SE features
and emotions.
It is hoped that the findings of this study on vowel and consonantal qualities
will support the position that these are significant vocal cues in emotional
speech which deserve more attention in this area of research, and also provide a deeper
understanding of how SE is used in natural, emotional conversation.
1.2  Emotional speech
Each aspect of consideration in a study of emotional speech is rather complex
in itself. There is a wide variety of possible vocal cues to look at, and an even wider
range of emotions under the different kinds of classifications. It is therefore necessary
to explain the choices of emotions and vocal cues examined in this study. The
following sections provide a background to emotions and their categories, followed by
a discussion on the relationship between emotions and the voice, and how the decision
is made on which emotions to examine.
1.2.1  Emotion labels and categories
One of the first difficulties a researcher on emotion faces is having to sieve
through and choose from a myriad of emotion labels in order to decide on which
emotions to study. The number of emotion labels is virtually unlimited, for when it
comes to labelling emotions, the tendency has been to include almost any adjective or
noun remotely expressive of affect. After all, “the most obvious approach to describing
emotion is to use the category labels that are provided by everyday language.” (Cowie,
2000:2) According to an estimation made by Crystal (1969), between the two studies
by Schubiger (1958) and O’Connor & Arnold (1973), nearly 300 different labels are
used to describe affect. It seems that the only bounds imposed here are those of the
English lexicon. Thus, in the face of such a multitude of labels, some kind of
systematisation, in order to constrain the labels introduced, is indispensable.
However, grouping emotions into categories is also a difficult issue. As
mentioned, there are thousands of emotion labels, and the similarity between them is a
matter of degree. If so, no natural boundaries exist that separate discrete clusters of
emotions. As a consequence, there are many reasonable ways to group emotion labels
together, and because there has never been a commonly accepted approach to
categorising emotional states, it is no surprise that researchers on emotions differ on
the number of categories and the kinds of categories to use. The following sections will
highlight the ways in which some researchers have categorised emotions.
1.2.1.1 Biological approach
Panksepp (1994), who looks at emotions from a neurophysiological point of
view, suggests that affective processes can be divided into three conceptual categories.
The researcher points out that while most models accept fear, anger, sadness, and joy
as major species of emotions, it is hard to agree on emotions such as surprise, disgust,
interest, love, guilt, and shame, and harder to explain why strong feelings such as
hunger, thirst, and lust should be excluded. Panksepp therefore tries to include all
affective processes in his three categories. Category One – “the Reflexive Affects” –
consists of affective states which are organised in quite low regions of the brainstem,
such as pain, startle reflex, and surprise. Category Two – “the Blue-Ribbon, Grade-A
Emotions” – consists of emotions produced by a set of circuits situated in intermediate
areas of the brain which orchestrate coherent behavioural, physiological, cognitive, and
affective consequences. Emotions like fear, anger, sadness, joy, affection, and interest
fall under this category. Lastly, Category Three – “the Higher Sentiments” – consists
of the emotional processes that emerge from the recent evolutionary expansion of the
forebrain, such as the more subtle social emotions including shame, guilt, contempt,
envy, and empathy.
However, because the concerns of a biological-neurophysiological study are
vastly different from that of a linguistic study, this method of categorisation is not
commonly referred to by linguistic researchers of emotional speech. Instead, the
question is more often “whether emotions are better thought of as discrete systems or
as interrelated entities that differ along global dimensions” ( Keltner & Ekman,
2000:237). Linguistic researchers who take the stand that emotions are discrete
systems would study a small number of emotions they take to be primary emotions
(the mixing of which produces multiple secondary emotions), while researchers who
follow the dimensional approach study a much greater number of emotions (viewed as
equally important), placing them along a continuum based on the vocal cues they
examine. The next two sections will briefly cover these different views on emotions.
1.2.1.2 Discrete-emotions approach
The more familiar emotion theories articulate a sort of “dual-phase model of
emotion that begins with ‘primary’ biological affects and then adds ‘secondary’
cultural or cognitive processes” (White, 2000:32). Cowie (2000:2) states that
“probably the best known theoretical idea in emotion research is that certain emotion
categories are primary, others are secondary.” Cornelius (1996) summarises six basic
or primary emotion categories, calling them the “big six”: fear, anger, happiness,
sadness, surprise, and disgust. Similarly, Plutchik’s (1962) theory points towards eight
basic emotions. He views primary emotions as adaptive devices that have played a role
in individual survival. According to this comprehensive theory, the basic prototype
dimensions of adaptive behaviour and the emotions related to them are as follows: (1)
incorporation (acceptance), (2) rejection (disgust), (3) destruction (anger), (4)
protection (fear), (5) reproduction (joy), (6) deprivation (sorrow), (7) orientation
(surprise), and (8) exploration (expectation). The interaction of these eight primary
emotions in various intensities produces the different emotions observed in everyday
life.
This issue is discussed more fully in Plutchik (1980). He points out that
emotions vary in intensity (e.g. annoyance is less intense than rage), in similarity (e.g.
depression and misery are more similar than happiness and surprise), and in polarity
(e.g. joy is the opposite of sadness). In his later work (Plutchik, 1989), he reiterates the
concept that the names for the primary emotions are based on factor-analytic evidence,
similarity scaling studies, and certain evolutionary considerations, and that emotions
designated as primary should reflect the properties of intensity, similarity, and polarity.
Therefore, “if one uses the ordinary subjective language of affects, the primary
emotions may be labelled as joy and sadness, anger and fear, acceptance and disgust,
and surprise and anticipation.”
Lewis (2000a, 2000b) also presents a model for emotional development which
involves basic or primary emotions. In his model, the advent of the meta-representation of the idea of me, or the consciousness, plays a central role. He lists joy,
fear, anger, sadness, disgust, and surprise as the six primary emotions, which are the
emotional expressions we observe in a person in his first six months of life. These
early emotions are transformed in the middle of the second year of life as the idea of
me, or the meta-representation, is acquired and matures. Lewis calls this
transformation “an additive model” because it allows for the development of new
emotions. He stresses that the acquisition of the meta-representation does not
transform the basic emotions; rather, it utilises them in an additive fashion, thereby
creating new emotions. The primary emotions are transformed but not lost, and
therefore the process is additive.
1.2.1.3 Dimensional approach
The dimensional perspective is more common among those who view emotions
as being socially learned and culturally variable (Keltner & Ekman, 2000). This
approach argues that emotions are not discrete and separate, but are better measured
and conceptualised as differing only in degree on one or another dimension, such as
valence, activity, or approach or withdrawal (Schlosberg, 1954; Ekman et al, 1982;
Russell, 1997). One way of representing these dimensions is by classifying emotions
along bipolar continua such as tense – calm, elated – depressed, and Happy – Sad.
Interestingly, this method of classification is similar to Plutchik’s (1980) concept of
polarity variation as mentioned above. In an example of bipolar continua, Uldall
(1972) sets 14 pairs of opposed adjectives that are placed at the two ends of a seven-degree scale, such as
Bored   extremely – quite – slightly – Neutral – slightly – quite – extremely   Interested
The other pairs of adjectives are polite – rude; timid – confident; sincere – insincere;
tense – relaxed; disapproving – approving; deferential – arrogant; impatient – patient;
emphatic – unemphatic; agreeable – disagreeable; authoritative – submissive;
unpleasant – pleasant; genuine – pretended; weak – strong.
A more systematic method structurally represents categories and dimensions by
converging multidimensional scaling and factor analyses of emotion-related words and
situations, such that categories are placed within a two- or three-dimensional space,
like that shown in Figure 1.1. The implication of such a structure is that a particular
instance is not typically a member of only one category (among mutually exclusive
categories), but of several categories, albeit to varying degrees (Russell & Bullock,
1986; Russell & Fehr, 1994).
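As a purely illustrative aside (not part of the studies cited here), the kind of multidimensional scaling that underlies structures like Figures 1.1 and 1.2 can be sketched in a few lines of Python. The emotion words and the dissimilarity values below are invented placeholders, and the scikit-learn MDS implementation is assumed.

```python
# Illustrative sketch only: recovering a two-dimensional emotion space from
# pairwise dissimilarity judgments, in the spirit of multidimensional scaling.
# The words and dissimilarity values are invented placeholders, not data.
import numpy as np
from sklearn.manifold import MDS

words = ["happy", "excited", "calm", "sad", "angry", "afraid"]
# Symmetric dissimilarity matrix (0 = identical, 1 = maximally dissimilar).
D = np.array([
    [0.0, 0.2, 0.5, 0.9, 0.8, 0.8],
    [0.2, 0.0, 0.7, 0.9, 0.6, 0.6],
    [0.5, 0.7, 0.0, 0.5, 0.9, 0.8],
    [0.9, 0.9, 0.5, 0.0, 0.6, 0.5],
    [0.8, 0.6, 0.9, 0.6, 0.0, 0.4],
    [0.8, 0.6, 0.8, 0.5, 0.4, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)  # one (x, y) point per emotion word

for word, (x, y) in zip(words, coords):
    print(f"{word:8s} {x:+.3f} {y:+.3f}")
```

Plotting the resulting coordinates would give a layout analogous to Figure 1.2, with words judged similar lying close together.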
Figure 1.1: A circumplex structure of emotion concepts. Figure taken from Russell &
Lemay (2000:497).
Figure 1.2: Multidimensional scaling of emotion-related words. Figure taken from
Russell (1980).
Russell (1989) also suggests the use of multidimensional scaling, and using
distance in a space to represent similarity, to better represent the interrelationships
among emotion labels. Figure 1.2 (Russell, 1980) shows the scaling of 28 emotion-related words, based empirically on subjects’ judgments of how the words are
interrelated. He claims that such a model is a better solution to the categorisation of
emotions as it reflects the continuous variation of the emotions. It also asserts the
correlation between emotions – the closer the emotions are placed together, the more
likely that an emotional state can be classified as both of the emotions. While the
“prototypical” emotion categories are not placed at the outer edge of the space like those in
Figure 1.1, they are generally in similar positions in relation to the other emotion
labels. Multidimensional scaling diagrams like Figures 1.1 and 1.2 have been derived
for numerous languages besides English, such as those spoken in the Pacific (e.g. Lutz
(1982) on Ifaluk; Gerber (1985) on Samoa; White (2000) on Solomon Islands) and
Asia (e.g. Heider (1991) on Indonesia; Romney et al (1997) on Japan).
There are other approaches and models which represent how emotions may be
conceptualised, but it is beyond the scope of this study to explain each and every one
of them. It is likely that no single theory or model is the “correct” one as some
psychologists would like to think, and that each one really paints a partial picture, and
highlights different properties, of emotion concepts. They are highly interrelated
(though some have received much more attention than others) and are not competing
accounts (Keltner & Ekman, 2000).
1.2.2  Researchers’ choices
There is considerable evidence that different emotional states induce
physiological changes, which can directly change voice quality (Ohala, 1981;
Johnstone et al, 1995). Examples of changes that can affect the voice are dryness in the
mouth or larynx, accelerated breathing rate, and muscle tension. Scherer (1979), for
instance, points out that arousal of the sympathetic nervous system, which
characterises emotional states, increases muscle tonus – generally meaning an increase
in fundamental frequency – and also affects the coordination and the rhythms of
reciprocal inhibition and excitation of muscle groups. The latter effect “will affect
pitch and loudness variability, stress patterns and intonation contours, speech rate…
and many other speech parameters.” (Scherer, 1979:501f) Thus recent linguistic
research has looked at a wide range of emotions.
Among the many emotions researchers have studied, anger, sadness, and
happiness / joy are the most common (van Bezooijen, 1984; Scherer et al, 1991;
Chung, 1995; Johnstone et al, 1995; Klasmeyer & Sendlmeier, 1995; Laukkanen et al,
1995; McGilloway et al, 1995; Mozziconacci, 1995, 1998; Nushikyan, 1995; Tosa &
Nakatsu, 1996; Hirose et al, 1997; Nicholson et al, 2000). A fair number of studies
also examine fear and / or boredom (Scherer et al, 1991; Johnstone et al, 1995;
Klasmeyer & Sendlmeier, 1995; McGilloway et al, 1995; Mozziconacci, 1995, 1998;
Nushikyan, 1995; Tosa & Nakatsu, 1996; Nicholson et al, 2000). Disgust is another
relatively popular choice of study (van Bezooijen, 1984; Scherer et al, 1991; Johnstone
et al, 1995; Klasmeyer & Sendlmeier, 1995; Tosa & Nakatsu, 1996; Nicholson et al,
2000). Other emotions which some researchers study include despair, indignation,
interest, shame, and surprise. The majority of these studies involving emotional speech
tend to include neutral as an emotion. While – strictly speaking – neutral is not an
emotion, one can understand the necessity of non-emotional data, which serves to
bring out the innate values of the speech sounds so that emotional data has a basis of
comparison (Cruttenden, 1997).
It should be noted that for the emotion anger, some researchers make the
distinction between hot and cold anger (Johnstone et al, 1995; Hirose et al, 1997;
Pereira, 2000). Hirose et al (1997) explain that anger can be expressed straight or
suppressed, resulting in different prosodic features, and this is supported by their
results, which show two opposite cases for samples of anger.
1.2.3  Choices of emotions for this study
Because natural conversation (in the form of anecdotal narratives recalling
emotional events) is recorded as data for this study, fear and boredom are not
chosen, since people do not normally recall events which made them feel fearful or
bored and still speak with traces of the emotions felt at the time of the events. Disgust
is also not an option as people do not usually sustain their tone of disgust throughout
significant lengths of their narrative.
(Hot) anger, sadness, happiness, and neutral are chosen as the four emotions for
this study. According to the theories formulated by Lewis (2000a, 2000b) and Plutchik
(1962, 1980, 1989), anger, sadness, and happiness are considered primary emotions. In
other words, they are distinguishable from one another and none of them is a derivative
of another. They are also shown to be in different quadrants of the multidimensional
scaling diagrams, Figure 1.1 (Keltner & Ekman, 2000) and Figure 1.2 (Russell, 1980).
This means that anger, sadness, and happiness are dimensionally dissimilar from one
another. Neutral is necessary because, as mentioned, it serves as a basis of comparison
for the data of the other emotions. Furthermore, if neutral had a place in the
multidimensional scaling diagrams, it would probably be close to calm, and hence
would be in the fourth quadrant, reasonably different from anger, sadness, and
happiness.
It is also useful to note that although these are four relatively distinct emotions,
they can be grouped into two opposite pairs of ‘active’ and ‘passive’ emotions, as
suggested by the vertical axis of the circumplex structure in Figure 1.1. Active
emotions – in the case of this study, Angry and Happy – are represented by a
heightened sense of activity or adrenaline, while passive emotions – Sad and Neutral –
are represented by low levels of physical or physiological activity. (Neutral is taken to
be similar to calm in Figure 1.1 and also At ease and Relaxed in Figure 1.2, and is also
found by Chung (1995) to be similar to other passive emotions in terms of pitch,
duration and intonation.) Such pairing is useful for it provides one more way by which
to compare and contrast the four emotions.
1.3  Two major traditions of research
There are two major traditions of research in the area of emotional speech:
encoding and decoding studies (Scherer, 1989). Encoding studies attempt to identify
the acoustic (or sometimes phonatory-articulatory) features of recordings of a person’s
vocal utterances while he is in different emotional states. In the majority of encoding
studies, these emotional states are not real but are mimicked by actors. In contrast,
decoding studies are not as concerned with the acoustic features, but with the ability of
judges to correctly recognise or infer affect state or attitude from voice samples.
Due to the large number of speech sounds examined across the different
emotions, this study will mainly be an encoding study. However, a preliminary
identification test will be conducted using some of the speech extracts from which the
vocal cues used for analysis are taken. The purpose of the short listening test is to
support my assumption that certain particular segments of conversation are
representative of the emotions portrayed. This test will be explained in further detail in
Chapter 2.
Before stating the motivations for this study and its aims, the following sections
will provide a summary of the past research done on the aspects of voice which are
cues to emotion, as well as a brief description of the segmental features of Singapore
English.
1.4  Past research on emotional speech
A glance at the way language is used shows us that emotions are expressed in
countless different ways, and Lieberman & Michaels (1962) note that speakers may
favour different acoustic parameters in transmitting emotions (just as listeners may rely
on different acoustic parameters in identifying emotions). In the last few decades,
researchers have studied a wide range of acoustic cues, looking for the ones which play
a role in emotive speech.
1.4.1  Intonation
… speakers rarely if ever objectify the choice of an
intonation pattern; they do not stop and ask themselves
“Which form would be here for my purpose?”…
Instead, they identify the feeling they wish to convey,
and the intonation is triggered by it.
(Bolinger, 1986:27)
It is an undisputed fact that intonation has an important role to play in the
expression of emotions. In fact, it is generally recognised that the use of intonation to
express emotions is universal (Nushikyan, 1995), i.e. there are tendencies in the
repetition of intonational forms in different languages (Bolinger, 1980:475-524). This
explains why there is more literature regarding intonational patterns in emotional
speech than regarding any other aspect of speech. Because pitch is the feature most centrally
involved in intonation (Cruttenden, 1997), studies on intonation tend to focus on pitch
variations.
Mozziconacci (1995) focuses exclusively on the role of pitch in English and
shows that pitch register varies systematically as a function of emotion. Her acoustic
analysis shows that neutral and boredom have low pitch means and narrow pitch
ranges, while joy, anger, sadness, fear, and indignation have wider ranges, with the
latter two emotions having the largest means. However, emotion-identification performance in her perception test is low. She attributes this to the fact that no
characteristics other than pitch had been manipulated, which implies that pitch is not
the only feature involved in differentiating emotions in speech. Chung (1995) observes
that in Korean and French, the pitch contour seems to carry a large part of emotional
information, and anger and joy have a wider pitch range than sorrow or tenderness.
Likewise, McGilloway et al (1995) find that in English utterances, happiness,
compared to fear, anger, sadness, and neutral, has the widest pitch range, longer pitch
falls, faster pitch rises, and produces a pitch duration that is shorter and of a narrower
range. Kent et al (1996) generalise that large intonation shifts usually accompany
states of excitement, while calm and subdued states tend to manifest a narrow range of
intonation variations.
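As a minimal illustration of the pitch statistics these studies report (means and ranges of F0), the sketch below computes both from an invented F0 contour; the numbers are placeholders, not data from any cited study.

```python
# Minimal illustration of pitch mean and pitch range for one utterance.
# The F0 contour below is an invented sequence of Hz values (0 = unvoiced frame).
import numpy as np

f0_contour = np.array([180, 195, 230, 260, 240, 210, 190, 0, 185, 175], dtype=float)
voiced = f0_contour[f0_contour > 0]        # keep only voiced frames

pitch_mean = voiced.mean()                 # Hz
pitch_range = voiced.max() - voiced.min()  # Hz
print(f"mean F0 = {pitch_mean:.0f} Hz, F0 range = {pitch_range:.0f} Hz")
```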
However, the system of intonation to convey affective meaning is not the only
means of communicating emotions. According to Silverman et al (1983), certain
attitudes are indistinguishable on the basis of intonation. Uldall’s (1972) results
demonstrate that some of the attitudes (e.g. the adjective pair “genuine – pretended”)
are apparently rarely expressed by intonation. Therefore, while intonation is a
significant means of conveying expressive meaning, it is not the only one and there are
certainly other equally important phenomena.
1.4.2  Other vocal cues
It seems obvious that intonation is not the only means of differentiating
emotions, and that “other aspects such as duration and voice quality must also be taken
into consideration” (Mozziconacci, 1995:181). Cruttenden (1997) points out that there
are a number of emotions, such as joy, anger, fear, and sorrow, which are not usually
associated directly with tones, but may be indicated by a combination of factors like
accent range, key, register, overall loudness, and tempo. Murray & Arnott (1993) note
that the most commonly referenced vocal parameters are pitch, duration, intensity, and
voice quality (though the last term is not clearly defined). Nevertheless, there are
fewer studies done on any one of these aspects than on intonation. Furthermore, these
cues and parameters mentioned by Cruttenden (1997) and Murray & Arnott (1993) are
still examples of prosodic features, and no mention is made by them of the role of
segmental features.
The study on Korean and French by Chung (1995) suggests that the vowel
duration of the last syllable differs according to the emotions: it is very short in anger
and long in joy and tenderness. Consonantal duration, however, is less regular;
lengthening tends to occur on stressed words. (No mention is made, however, of
whether these words are sentence-final or sentence-medial.) Hirose et al (1997) find
speech rate to be higher in emotional speech as compared to non-emotional speech,
and Chung (1995) elaborates that it is high in anger and joy but low in sorrow and
tenderness (in both studies, speech rate, while not explicitly defined, is measured over
a sentence).
With regard to intensity, the most obvious result of research is that it increases
with anger (Williams & Stevens, 1972; Scherer, 1986; Chung, 1995; Leinonen et al,
1997). Other findings include intensity being significantly higher in joy than in sadness
and tenderness (Chung, 1995; Hirose et al, 1997).
Klasmeyer & Sendlmeier (1995) study glottis movement by analysing the
glottis pulse shape in emotional speech data. Laukkanen et al (1995) also examine the
glottis and the role of glottal airflow waveform in identification of emotions in speech,
but the study is, unfortunately, inconclusive: as admitted by the researchers,
since the glottal waveform was studied only at F0 maximum, it remains uncertain
whether the relevance of the voice quality in their samples was related to the glottal
waveform or to a pitch synchronous change in it.
1.4.3  Comparing between genders
There is little written literature on the comparison between male and female
speech, much less the comparison between male and female emotional speech. This is
probably due to the fact that early work in phonetics focused mainly on the adult male
speaker, mostly for social and technical reasons (Kent & Read, 2002:53). But it is an
undeniable fact that the genders differ acoustically in speech. A classical portrayal of
gender acoustic diversity is shown by Peterson & Barney (1952), who, from a sample
of 76 speakers (men, women, and children) asked to utter several vowels, derive F1-F2
frequencies which fall within three distinct (but overlapping) clusters (men, women,
and children). Likewise, Tosa (2000) discovers, after running preliminary (artificial
intelligence) training tests with data from males and females, that two separate recognition
systems – one for male speakers and another for female speakers – are needed, as the
emotional expressions between males and females are different and cannot be handled
by the same program model. However, further research has not been done to find out
the reason behind the gender difference.
We do know that the differences are due in part to biological factors: women
have shorter membranous length of the vocal folds, which results in higher
fundamental frequency (F0), and greater mean airflow (Titze, 1989). Women’s voices
are also physiologically conditioned to have a higher and wider pitch range than men,
particularly when they are excited (Brend, 1975; Abe, 1980). In an experiment in
which 3rd, 4th, and 5th grade children were asked to retell a story, Key (1972) observed
that the girls used a very expressive intonation (i.e. highly varied throughout speech),
while the boys toned down intonational features even to the point of monotony.
1.5  Singapore English
Singapore is a multi-ethnic society whose resident population of four million
is made up of 76.8% Chinese, 13.9% Malays, 7.9% Indians, and 1.4% other races
(Leow, 2001). While the official languages of the three main ethnic groups are
Mandarin, Malay, and Tamil, respectively, English is the primary working language,
used in education and administration. Because of this multi-ethnolinguistic situation,
the variety of English spoken in Singapore is distinctive and most interesting to study.
There has been much interest in two particular ways of studying SE. One describes the
nature and characteristics of SE, documenting the semantic, syntactic, phonological,
and lexical categories of SE. The other way is to provide an account for the emergence
of certain linguistic features of SE; for example, some studies show evidence of the
influence of the ethnic languages on SE.
SE generally has two variations: Standard Singapore English (SSE) and
Colloquial Singapore English (CSE) (Gupta, 1994). SSE is very similar to most
Standard Englishes, while CSE differs from Standard Englishes in terms of
pronunciation, syntax, etc. But many educated Singaporeans of today speak a mixture
of both varieties: the morphology, lexicon, and syntax are those of SSE but the
pronunciation system is that of CSE (Lim, 1999).
Since the interest of this study lies in the relationship between emotions and the
articulation of segmental features, a brief description will be given of the phonological
phenomenon of vowel- and consonant-conflation in SE.
1.5.1  Vowels
It is commonly agreed by researchers that one of the most distinctive features
of SE pronunciation is the conflation of vowel pairs. Much research has been done on
this phenomenon, and some researchers focus on the conflation of pairs of short and
long vowels such as [ɪ]/[iː], [ʌ]/[ɑː], [ɒ]/[ɔː], [ʊ]/[uː] (Brown, 1992; Poedjosoedarmo,
2000; Gupta, 2001). Brown (1992) finds that Singaporeans have no distinction
between the abovementioned long and short vowel pairs, i.e. each pair is conflated.
Poedjosoedarmo (2000), in her study on standard SE, studies only the vowel pair
[ɪ]/[iː], which she takes as representative of the phenomenon of vowel conflation in SE,
and finds that the pair is indeed conflated. Gupta (2001), however, finds that the
standard SE vowel pairs placed in descending order of how often they are conflated
are: [ɛ]/[æ], [ʊ]/[uː], [ɒ]/[ɔː], [ʌ]/[ɑː], [ɪ]/[iː]. In other words, most Singaporeans make
the distinction between the vowels in [ɪ]/[iː], but few do so for the vowels in [ɛ]/[æ].
Other studies further include [ɛ]/[æ], analysing the differences in the positions
of the tongue for the vowels in recorded standard SE, spontaneous or otherwise (Lim,
1992; Loke, 1993; Ong, 1993). Lim (1992) plots a formant chart for the vowel pairs
[ɪ]/[iː], [ɛ]/[æ], [ʌ]/[ɑː], [ɒ]/[ɔː], and [ʊ]/[uː] and finds the vowel pairs statistically
similar in terms of frequency (i.e. the vowel pairs are conflated). Likewise, Loke
(1993) finds the conflation of the vowel pairs [ɪ]/[iː], [ɛ]/[æ], [ʌ]/[ɑː], and [ʊ]/[uː] by
examining vowel formants in spectrographs. Ong (1993), on the other hand, finds that
the vowel pairs [ɪ]/[iː], [ɛ]/[æ], and [ɒ]/[ɔː] do conflate but “there is no clear evidence”
of conflation of [ʊ]/[uː], while the vowel pair [ʌ]/[ɑː] appears to conflate only in terms
of tongue height.
However, less attention is paid to the conflation of the final pair of Received
Pronunciation (RP) monophthong vowels, [ə]/[ɜː], though they have been found to be
conflated in standard SE (Deterding, 1994; Hung, 1995; Bao, 1998). While most
studies that omit this vowel pair from their list of vowel pairs examined do not provide
reasons for the omission, Brown (1988) provides one: the distinction between [ə]/[ɜː]
is primarily one of length rather than tongue positioning, which may be solely related
to stress, where [ɜː] appears in stressed syllables while [ə] appears in unstressed
ones. He reasons that since SE rhythm is typically not stress-based, he does not
consider the distinction between these two vowels in his study. Another possible problem
with studying the conflation of [ə]/[ɜː] is that, in the case of natural or spontaneous
speech, words containing these vowels in the stressed syllable occur much less
frequently (Kent & Read, 2002).
To recapitulate, Singaporeans generally conflate the vowel pairs [ɪ]/[iː], [ɛ]/[æ],
[ʌ]/[ɑː], [ɒ]/[ɔː], [ʊ]/[uː], and [ə]/[ɜː], which are normally distinguished in RP. Table 1.1
shows how the vowels are conflated. Certain diphthongs are also shortened (to
monophthongs) in SE (Bao, 1998; Gupta, 2001), but because this study focuses on
monophthongs, this section will not cover that aspect of Singaporean
vowel conflation.
Table 1.1: Vowels of RP and SE (adapted from Bao, 1998:158)

RP    SE    Example        RP    SE    Example
ɪ     i     bit            ɒ     ɔ     cot
iː    i     beat           ɔː    ɔ     caught
ɛ     ɛ     bet            ʊ     u     book
æ     ɛ     bat            uː    u     boot
ʌ     ɑ     stuff          ə     ə     about
ɑː    ɑ     staff          ɜː    ə     bird

1.5.2  Consonants
In most cases, standard SE consonants are pronounced much as they are in
most other varieties of English. However, the dental fricatives [θ] and [ð] tend to be
commonly replaced by the corresponding alveolar stops [t] and [d], at least in initial
and medial positions. This, like the conflation of vowels, is a pervasive SE conflation
and is noted by many researchers of SE (Tongue, 1979; Platt & Weber, 1980; Brown,
1992; Deterding & Hvitfeldt, 1994; Poedjosoedarmo, 2000; Gupta, 2001). Position-final [θ] and [ð] are commonly replaced by [f] and [v] (Brown, 1992; Bao, 1998;
Deterding & Poedjosoedarmo, 1998; Poedjosoedarmo, 2000).
In final position, stops (especially voiceless stops) usually appear as glottal
stops [ʔ] and consonant clusters tend to be simplified, often by the omission of the
final stop (e.g. tact pronounced as [tɛʔ] or [tɛk]; lift as [lif]) (Gupta, 2001). This is
more common in informal speech, but speakers are actually able to produce the
appropriate stops or consonant clusters in careful speech.
Also, standard SE speakers do not distinguish voiced from voiceless position-final stops, fricatives, and affricates (Gupta, 2001). The contrast between the voiced
and voiceless obstruents is neutralised, such that all obstruents are voiceless and fortis,
with no shortening of the vowel before them. For example, edge [ɛdʒ] is pronounced
by SE speakers as [ɛtʃ], and rice [raɪs] and rise [raɪz] are pronounced identically as
[raɪs]. According to Gupta (2001), this conflation apparently occurs even in careful
speech in most SE speakers.
1.6  Motivations for this study
It is fascinating that expressing and identifying emotions come so naturally to
us, and yet are so difficult to define. Despite the fact that the voice is an important
indicator of emotional states, research on vocal expression of emotion lags behind the
study of facial emotion expression. This is perhaps because of the overwhelming
number of emotions – or more precisely, emotion labels – and the many different
possible vocal cues to study, such that researchers take their pick of emotions and
vocal cues in a seemingly random fashion. This makes it difficult to view the studies
collectively in order to determine the distinction between emotions.
This study attempts to go back to the basics, so to speak, starting with emotions
that are “more primary”, less subtle, and most dissimilar from one another, and the
vocal cues that are most basic to any language – the vowels and consonants.
And because this study is conducted in Singapore, it is an excellent opportunity
to examine SE from a different angle, applying that which is known about SE
segments – in this case, vowel conflation – to an area of research in which the study of
SE is completely new (i.e. emotional speech), in the hope of providing a deeper
understanding of conversational SE and the way its features interact with affect, which
is ever present in natural conversations. Intuitively, one would expect vowel conflation
to be affected by emotions, because vowels are conflated by duration (Brown, 1992;
Poedjosoedarmo, 2000; Gupta, 2001) and / or tongue position (Lim, 1992; Loke, 1993;
Ong, 1993; Nihalani, 1995) and both duration and tongue position are variables of the
voice affected by physiology which in turn are affected by emotional states (Scherer,
1979; Ohala, 1981; Johnstone et al, 1995). In short, emotional speech involves
physiological changes which affect the degree to which vowel pairs conflate. And
hence this study hopes to discover the relationship between emotional states and vowel
conflation, i.e. whether vowels conflate more often in a certain emotion, and if so,
which vowel pairs and how they conflate.
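Purely as a sketch of one way such a relationship could be quantified (the procedure actually used in this study is described in Chapter 3), the distance between the members of a vowel pair in F1-F2 space could be compared across emotions, with a smaller distance indicating greater conflation. All values in the sketch below are invented placeholders.

```python
# Hedged sketch (not the thesis procedure): quantify conflation of a vowel
# pair as the Euclidean distance between the pair's mean (F1, F2) values.
# The formant values below are invented placeholders.
import math

def formant_distance(token_a, token_b):
    """Euclidean distance between two (F1, F2) measurements in Hz."""
    return math.dist(token_a, token_b)

# Hypothetical mean (F1, F2) values for the pair [ɪ]/[iː] in two emotions.
pair_by_emotion = {
    "Neutral": ((400.0, 2000.0), (330.0, 2250.0)),
    "Angry":   ((380.0, 2100.0), (360.0, 2150.0)),
}

for emotion, (short_vowel, long_vowel) in pair_by_emotion.items():
    d = formant_distance(short_vowel, long_vowel)
    print(f"{emotion:8s} [ɪ]-[iː] distance = {d:.0f} Hz")
# A smaller distance in one emotion than in another would suggest that the
# pair is more strongly conflated when that emotion is expressed.
```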
1.7  Aims of this study
The main aim is to determine the vocal cues that distinguish emotions from one
another when expressed in English, and how they serve to do so. This study also aims
to discover any relationship between emotions and SE vowel conflation, as well as to
determine the difference in expression of emotions between males and females.
Vocal cues of four different emotions are examined, the four emotions being
anger, sadness, happiness, and neutral. The vocal cues fall under two general
categories: vowels and consonants. 12 vowels – [ɪ], [iː], [ɛ], [æ], [ʌ], [ɑː], [ɒ], [ɔː], [ʊ],
[uː], [ə], [ɜː] – as well as eight obstruents – [p], [t], [k], [f], [θ], [s], [ʃ], [tʃ] – will be
analysed. The variables examined are vowel and consonantal duration and intensity, as
well as vowel fundamental frequency. Formant measurements will be taken of the
vowels in order to compare vowel quality, and VOT and spectral measurements will be
taken of the obstruents (depending on the manner of articulation) in order to compare
them within their classes. The vocal production and the method of measurement of the
specific cues examined are elaborated on in Chapter 3.
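By way of illustration only, the sketch below shows how the variables named above (duration, intensity, fundamental frequency, and formants) might be measured for one segmented vowel token. It assumes the praat-parselmouth Python interface to Praat and a hypothetical file vowel_token.wav; neither is a detail taken from this study.

```python
# Illustrative sketch only: measuring duration, mean intensity, mean F0, and
# F1/F2 at the midpoint of a single segmented vowel token. Assumes the
# praat-parselmouth package; "vowel_token.wav" is a hypothetical file.
import parselmouth

snd = parselmouth.Sound("vowel_token.wav")

duration = snd.get_total_duration()               # seconds

intensity = snd.to_intensity()
mean_intensity = intensity.values.mean()          # dB

pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                   # drop unvoiced frames
mean_f0 = f0.mean() if f0.size else float("nan")  # Hz

formants = snd.to_formant_burg()
midpoint = duration / 2                           # measure at the vowel midpoint
f1 = formants.get_value_at_time(1, midpoint)      # Hz
f2 = formants.get_value_at_time(2, midpoint)      # Hz

print(f"dur={duration:.3f} s  intensity={mean_intensity:.1f} dB  "
      f"F0={mean_f0:.0f} Hz  F1={f1:.0f} Hz  F2={f2:.0f} Hz")
```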
The measurements of the vocal cues of anger, sadness, and happiness are
compared with those of neutral to determine how these emotions are expressed through
these cues. The quality of the vowel pairs (as mentioned in the earlier section on SE
vowels) will also be compared across emotions to find out if there is a relationship
between vowel conflation and emotional expression. Also, the average measurements
of all vocal cues of males are compared with those of females.
In short, the research questions of this study are:
I. whether segmental aspects of natural speech can distinguish emotions, and if so, by which vocal cues (e.g. intensity, duration, spectra, etc.);
II. whether a relationship exists between emotions and the vowel conflation that is pervasive in Singapore English; and
III. whether there is an obvious or great difference in emotional expression between males and females.
CHAPTER TWO
RESEARCH DESIGN AND METHODOLOGY
2.1  The phonetics study
The analysis of the sounds of a language can be done in two ways: by auditory
or instrumental means. In this study, the choice of speech extracts (i.e. passages taken
from the anecdotal narratives) from which data will be obtained for analysis is based
on auditory judgment, at the researcher’s own discretion. The data is then analysed
instrumentally. However, since auditory perception is subjective, a perception test –
using short utterances taken from the speech extracts chosen by the researcher – is
conducted in order to verify that the researcher’s choices are relatively accurate and
representative of general opinion. The perception test will be elaborated on in further
sections.
2.2  Subjects
According to Tay & Gupta (1981:4), an educated speaker of SE would have the
following characteristics:
a) He comes from an English-speaking home where English is used most if
not all the time. (It can be added that he would use mostly English in his
interaction with friends as well.)
b) He has studied English as a first language in school up to at least GCE ‘A’
level and very possibly, University.
c) He uses English as his predominant or only language at work.
Two decades later, despite the sociological changes in Singapore, the criteria
have not changed very much; Lim & Foley’s (to appear) general description of
speakers who are considered native speakers of SE is as such:
i. They are Singaporean, having been born in Singapore and having lived all, if not most, of their life in Singapore.
ii. They have been educated in English as a first language with educational
qualifications ranging from Cambridge GCE ‘A’ level (General Certificate
of Education Advanced level) to a bachelor degree at the local university,
and English is used as the medium of instruction at every level in all
schools.
iii. They use English as their main language at home, with friends, at school or
at work; at the same time, most also speak other languages at home, at
work, and with friends.
The six subjects for this study, who consist of three males and three females
(gender being an independent variable in this study), fulfil all these criteria, and thus
can be said to be educated speakers of SE.
All subjects are Chinese Singaporeans between 22 and 27 years of age, and are
either students or graduates of the National University of Singapore (NUS) or La Salle
(a Singapore college of Arts). Those who have graduated are presently employed. All
of them have studied English as a first language in school, and speak predominantly in
English to family, friends, and colleagues.
The subjects are all close friends or family of the researcher and thus are
comfortable with relating personal anecdotes to the researcher on a one-to-one basis.
2.3  Data
This study compares the vowels and consonants expressed in four emotions,
namely anger, sadness, happiness, and neutral. These emotions are chosen because, as
mentioned before, they are the most commonly used emotions in research on
emotional speech, and because they are relatively distinct from one another.
The vowels examined in this study are the vowel pairs [ɪ] and [iː], [ɛ] and [æ], [ʌ]
and [ɑː], [ɒ] and [ɔː], [ʊ] and [uː], [ə] and [ɜː]. These vowel pairs are commonly
conflated in SE and one of the aims of this study is to examine the relationship
between emotional speech and vowel conflation in SE.
The consonants examined are all voiceless obstruents: stops [p], [t], [k],
fricatives [f], [θ], [s], [ʃ], and affricate [tʃ]. Voiceless instead of voiced obstruents are
examined because voiceless obstruents tend to have greater aspiration and frication.
The consonantal conflations that occur specifically in final position in SE (as described
in the earlier chapter) will not be examined. This is because no consonants in final
position will be examined, since certain stops and fricatives – such as stops [p] and
[t], and fricatives [θ] and [ʃ] – drop in intensity when placed in final position (Kent &
Read, 2002).
To recapitulate, the research aims of this study are to (i) determine which vocal
cues distinguish emotions, (ii) discover if there is a relationship between emotions and
SE vowel conflation, and (iii) determine the difference in emotional expression
between males and females. The following subsections explain how data is collected
for the purpose of this study.
2.3.1  Data elicitation
Many studies on emotional speech tend to rely on professional or amateur
actors to mimic emotions. There are advantages in this practice, such as control of data
obtained, ease of obtaining data, and ability to ensure clarity of recording which in turn
allows greater ease and accuracy in the analysis of the recorded data. However, actor
portrayals may be attributable to theatre conventions or cultural display rules, and
reproduce stereotypes which stress the obvious cues but miss the more subtle ones
which further differentiate discrete emotions in natural expression (Kramer, 1963;
Scherer, 1986; Pittam & Scherer, 1993). Hence, this study intends to obtain data from
spontaneous speech rather than actor simulation.
A long-term research programme (Rimé et al, 1998) has shown that most
people tend to share their emotions by talking about their emotional experiences to
others. This means that most people will engage in emotional speech while recounting
emotional experiences and therefore such recounts should have an abundance of
speech segments uttered emotionally. Thus, data for this study is elicited by engaging
the subjects in natural conversation and asking them to recall personal emotional
experiences pertaining to the emotions examined in this study, which are anger,
sadness, and happiness. With regard to neutral, subjects are asked about their average
day at work or school (depending on which applies to them) and perhaps also asked to
explain the manner of their job or schoolwork.
2.3.2  Mood-setting tasks
Each subject was required to have only one recording session with the
researcher to record the Angry, Sad, Happy anecdotes and Neutral descriptions. This
was in order to ensure that the recording environment and conditions of each subject
were kept constant as far as possible for all of his or her anecdotes and descriptions.
Since the subjects had to attempt to naturally express diverse emotions in the span of
just a few hours, they were given mood-setting tasks to complete before they recorded
each emotional anecdote. These tasks aimed to set the mood – and possibly to prepare
the subjects mentally and emotionally – for the following emotional experiences which
the subjects were about to relate. They also helped to smoothen the transition between
the end of an emotional anecdote and the beginning of another in a completely
different emotion, making it less abrupt and awkward for both the researcher and
subject.
The mood-setting task to be completed before relating the Angry anecdote was
to play a personal computer (PC) game, called Save Them Goldfish!, supplied by the
researcher on a diskette. The game was played on either a nearby PC or, if there was
no PC in the immediate vicinity of the recording, the researcher’s notebook. The game
was simple, engaging, and most importantly, its pace became more frantic the longer it
ran, thereby causing the subject to become more tense and excited. The task ended either
when the game finally got too quick for the subject and the subject lost, or – if the
subject proved to be very adept at it – at the end of five minutes. The stress-inducing
task aimed to agitate the subject so that by the end of it, regardless of whether the
subject had actually enjoyed playing it, the subject’s adrenaline had increased and he
or she was better able to relate the Angry anecdote with feeling than if he or she was
casually asked to do so.
For the Sad anecdote, the preceding task involved reading a pet memorial
found on a website, as well as a tribute to the firemen who perished in the collapse
of the United States World Trade Center on September 11, 2001, taken from the
December 2001 issue of Reader’s Digest. Because all the recordings were done
between July and October, 2002, the memories of the September 11 tragedy were still
vivid and the relevance of the tribute was possibly renewed since it was around the
time of the first anniversary of the tragedy. The researcher allowed the subject to read
in silence for as long as it took, after which the subject was asked which article he or
she related to better, and to explain the choice. The purpose of asking the subject to
talk about the article which affected him or her more was to attempt to make the topic
and the tragedy of the depicted situation more personal for the subject, thereby setting
a subdued mood necessary for the Sad anecdote.
The mood-setting task for the Happy anecdote was simply to engage in idle
humorous chatter for a few minutes. Since all the subjects are close friends and family
of the researcher, the researcher knew which topics were close to the hearts of the
subjects and could easily lighten the mood.
There was no mood-setting task for Neutral. The subjects were just asked about
their average day at work or school, and, if they had little to say about their average
day, asked to explain the manner of their job or schoolwork.
It should be noted that the researcher changed her tone of voice in her task
instructions and conversations with the subjects in order to suit each task and the
following anecdote. This also served to set the mood for each anecdotal recording.
2.4  Procedure
The subjects were approached (for their consent to be recorded) months before
the researcher’s estimated dates of recordings, and when they agreed to be recorded,
they were asked to think of personal experiences which had caused them to be Angry,
Sad, and Happy. They were not told of the specific research aims of this study, only
that each anecdote should take about five to ten minutes to relate, but if they could not
think of a single past event significant enough to take five to ten minutes to talk about,
they could relate several short anecdotes. The subjects were not asked to avoid
rehearsing their stories as if each was a story-telling performance, because the
researcher assumed – correctly – that the subjects would not even attempt to do so due
to their own busy schedules. In fact, in one case, the subject even decided on his
anecdotes no more than an hour before the actual recording session.
For the recording, the subjects could pick any place of recording in which they
felt most comfortable, provided the surroundings were quiet with minimal
interruptions. Five of the subjects were recorded in their own homes while one was
recorded in the Research Scholar’s Room at university. The subjects could sit or rest
anywhere during the recording as long as they did not move about too much while they
were being recorded. They could also have props or mementos if they felt that the
objects would be helpful and necessary. A sensitive, unobtrusive PZM microphone
(model: Sound Grabber II) was placed between the subject and the researcher, and the
recording was done on a Sony mini-disc recorder (model: MZ-R55).
Before the start of the recording, the subjects were assured that they were not
being interviewed and did not need to feel awkward or stressed; they were merely
conversing with the researcher as they normally do and just had some personal stories
to tell. They were reminded to speak in English and to avoid using any other languages
as far as possible. Ample time was given for them to relax so that they would speak as
naturally as possible, and they were told they did not have to watch their language and
could use expletives if they wanted to.
The order of the anecdotes told by each subject was fixed: Neutral, Sad, Angry,
then Happy. This order seemed to work because it was easy (on both the researcher
and the subject) to start a recording by asking the subject to describe a day at work.
Furthermore, subjects seemed to be able to talk at length when describing the nature of
their (career or school) work because they wanted to be clearly understood, and this
period of time taken was useful for the subjects to get accustomed to speaking in the
presence of a microphone, no matter how inconspicuous. It was noticed that the
subjects quickly learned to ignore the microphone and could engage in natural
conversation with the researcher for most of the recording. In fact, the majority of the
subjects were comfortable enough to become rather caught up, emotionally, in telling
their anecdotes; one subject – a close friend of the researcher – even broke down
during her Sad anecdote, and then was animatedly annoyed during her Angry anecdote
45 minutes later.
After the end of each anecdote and before the mood-setting task of the next, the
subjects were always asked if they wanted to take a break, since their anecdotes could
sometimes be rather lengthy. On average, each subject took about two hours to
complete his or her recording of anecdotes.
2.5  A pilot recording
A pilot recording was conducted to test and improve on the effectiveness of the
mood-setting tasks and the general format of a recording session. Despite a couple of
minor flaws in the initial recording design, which are described in the following
paragraphs, the pilot recording is included as data because the subject was very open
and honest with her emotions while she was relating her various personal experiences.
For Neutral data elicitation, the plan was originally to ask subjects to describe
their surroundings. However, the pilot recording revealed that the subject would speak
slowly and end with rising intonation for each observation she made, as if she was
reciting a list, which did not sound natural. But when the subject came to a jigsaw
puzzle of a Van Gogh painting on her wall and was asked more about it, her speech
flowed naturally (and in a neutral tone) as she explained in detail the history of the
painting and the painter. Because the subject has a strong interest in Art and is also a
qualified Art teacher, it was realised that it was more effective to ask subjects to
explain something which was familiar to them rather than to ask for a visual
description of the surroundings. Hence the prompt for Neutral was changed to asking
subjects about their average day and possibly asking them to elaborate on their work.
The mood-setting task for the Sad recording initially consisted of two articles
on the September 11, 2001 tragedy: a tribute to the firemen, and a two-page article on
several families who had exchanged last words with their loved ones on Flight 93 –
both of which were taken from the December 2001 issue of Reader’s Digest. Subjects
were then supposed to be asked what they thought was most regrettable about the
tragedy. The subject for the pilot recording ended up expressing her political opinion,
but as mentioned, she was emotionally honest when she related her Sad personal
experience (to the extent of weeping at certain points of her tale), and thus her
recording was still suitable for use as data despite the fact that the task did not serve its
purpose. Following the suggestion of the subject, the longer article was replaced by a
pet memorial, which would be an effective mood-setting task for subjects who are
animal lovers, or who have or have had pets.
2.6  Perception test
As mentioned at the beginning of this chapter, data for analysis is chosen from
sections in the recordings which the researcher feels are more emotionally expressive.
In order to verify that the researcher’s choices are relatively accurate and
representative of general opinion, a perception test was conducted using short
utterances taken from the segments chosen by the researcher.
The perception test was taken by 15 males and 15 females, all students of NUS
and between 22 and 25 years of age. The listeners were given listening test sheets on
which all the utterances were written out – without indication of who the speakers
were – so that they could read while they listened, in case they could not make out the
words in the utterances. The listeners were told that they would hear 72 utterances
from different speakers, and that the utterances had been pre-recorded in a random order
that matched the sequence printed on the test sheets. They were given clear instructions that they
would hear each utterance only once, after which they would have approximately ten
seconds to decide whether it sounded Angry, Sad, Happy, or Neutral, and they had to
indicate their choice by ticking the appropriate boxes corresponding to the emotions.
The listeners were also reminded to judge the utterances based on the manner – instead
of content – of expression.
2.6.1  Extracts for the test
The speech extracts for the perception test were taken from all the recordings of
the three male and three female subjects. Three utterances were taken from each of the
Angry, Sad, Happy, and Neutral recordings of each subject, making a total of 72
utterances for the entire perception test. Two of the three utterances were extracted
from the sections of the recordings which were considered very expressive, and one
was extracted from the sections which were considered somewhat expressive (cf.
Chapter Three section 3.2.1.1 regarding segmenting recordings according to
expressiveness). The utterances were randomly chosen from the expressive sections by
the researcher.
All the utterances started and ended with a breath pause, indicating the start and
end of a complete and meaningful expression. It was ensured that they consisted only
of clear-sounding speech, and lasted at least two full seconds. This was so that the
listeners could discern what was being said, and that the utterances were not too short
for the listeners to perceive anything.
The test lasted about 15 minutes. This length was felt to be sufficient, as
increasing the number of utterances for each emotion per subject would have resulted
in too many utterances for the listeners to judge.
2.6.2 Test results
With 30 listeners judging three utterances from each of the six subjects, the
total number of possible matches for each of the four emotions is 540. (A match occurs
when the emotion perceived by the listener is the same as the intended emotion of the
anecdote from which the utterance was extracted.) A breakdown of the results is
shown in the table below, and illustrated by the bar chart following the table.
It should be stressed that the perception test is done to verify that the researcher
is able to identify utterances which are representative of general opinion (on the
emotion perceived), not to determine the specific utterances from which tokens for
analysis are later taken. The a priori cut-off for acceptance that the researcher’s choices
are accurate is set at 60%, which means that as long as there are more than 60%
matches in an emotion, it is concluded that the researcher is able to accurately pick
utterances which listeners in general feel are representative of that emotion. This in
turn means that it will thus be acceptable that the researcher rates the expressiveness of
the sections of recordings and also (randomly) selects the tokens for analysis.
However, if fewer than 60% of matches are made for an emotion in the perception test, it
means that the researcher’s perception of emotions is not similar to that of listeners in
general, and independent raters of the expressiveness of the sections of recordings would
thus be needed before tokens for analysis could be selected from those sections.
Table 2.1: Results of perception test

Emotion    Matches          Non-matches
Angry      519 (96.11 %)    13 Neutral (2.41 %), 8 Happy (1.48 %)
Sad        439 (81.30 %)    99 Neutral (18.33 %), 2 Angry (0.37 %)
Happy      371 (68.70 %)    138 Neutral (25.56 %), 22 Angry (4.07 %), 9 Sad (1.67 %)
Neutral    488 (90.37 %)    9 Angry (1.67 %), 20 Happy (3.70 %), 23 Sad (4.26 %)
Figure 2.1: Bar chart of results of perception test (percentage of matches and
non-matches, by intended emotion of anecdote)
As can be seen from the table and chart of the results, there is a high accuracy
of recognition for Angry, Neutral, and Sad, and a percentage of matches large enough
for Happy, such that it can be concluded that the researcher’s perception of emotions is
an accurate reflection of that of listeners in general.
One possible reason for the large number of matches across emotions is that the
test only required listeners to choose from four emotions which were relatively
dissimilar from one another. Taking the dimensional approach, it can be explained that
when listeners are asked to recognise emotions with relatively different positions in the
underlying dimensional space, they only have to infer approximate positions on the
dimension in order to make accurate discriminations (Pittam & Scherer, 1993).
Another reason could be that despite the reminder from the researcher to judge based
on manner of expression, the semantics of the utterances might still have played a part
in affecting the decisions of the listeners. However, these reasons do not discount the
fact that listeners are generally able to infer emotions from voice samples, regardless
of the verbal content spoken, with a degree of accuracy that largely exceeds chance
(Johnstone & Scherer, 2000:228).
It is interesting to note that Neutral forms a large fraction of the
misidentifications of the Angry, Sad, and Happy utterances. This is probably due to the
fact that people are not extremely expressive when recounting past experiences. Days,
months, or even years might have passed since the event itself, and hence the emotions
expressed are possibly watered-down to some extent. It is therefore understandable
that when these expressions are extracted in the form of short utterances and judged
without the help of context, they can sound like Neutral utterances. However, while the
absolute differences between the emotions might be smaller because they may all be
closer to Neutral, the relative differences between the emotions are still accurate
representations of the relative differences between fully-expressed emotions.
Generally, the results show that a large percentage of listeners could identify the
intended emotion of the utterances (which the researcher had likewise perceived as
expressively uttered in the respective emotion).
Another possible interpretation of the results is that a large percentage of the utterances
chosen for the perception test could be correctly identified by the emotion expressed. It
can thus be concluded that the researcher’s choices of data are generally accurate and
representative of general opinion, and thus the researcher can rate the expressiveness
of the sections of recordings from which tokens of sound segments are analysed.
CHAPTER THREE
PRE-ANALYSIS DISCUSSION
3.1 Speech sounds explained
Before the presentation and analysis of data, it is necessary to briefly explain
how the speech sounds that are examined in this study are produced in general.
3.1.1 Vowels
In the course of speech, the source of a sound produced during phonation
consists of energy at the fundamental frequency and its harmonics. The sound energy
from this source is then filtered through the supralaryngeal vocal tract (Lieberman &
Blumstein, 1988:34ff; Kent & Read, 2002:18). The articulators such as the tongue and
the lips are responsible for the production of different vowels (Kent & Read, 2002:24).
The oral cavity changes its shape according to the tongue position and lip rounding
when we speak, and the cavity is shaped differently for different vowels. It is when the
air in each uniquely shaped cavity resonates at different frequencies simultaneously
that its characteristic sounds are produced (Ladefoged, 2001:171). These frequencies
then appear as dark bands of energy – known as formants – at various frequencies on
a spectrogram. The lowest of these bands is known as the first formant or F1, and the
subsequent bands are numbered accordingly (2001:173).
The first and second formants (F1 and F2) are most commonly used to
exemplify the articulatory-acoustic relationship in speech production, especially that of
vowels (see Figure 3.1). F1 varies inversely with vowel height: the lower the F1, the
higher the height of the vowel. F2 is generally related to the degree of backness of the
vowel (Kent & Read, 2002:92). F2 is lower for the back vowels than for the front
vowels. However, the degree of backness correlates better with the distance between
F1 and F2, i.e. F2-F1 (Ladefoged, 2001:177): its value is higher for front vowels and
lower for back vowels.
Figure 3.1: A schematic representation of the articulatory-acoustic relationships.
Figure adapted from Ladefoged (2001:200).
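As a purely numerical illustration (using approximate values of the kind reported in
classic studies of American English vowels for male speakers, not values from the
present data): a front vowel such as [i] might have an F1 of about 270 Hz and an F2 of
about 2290 Hz, giving an F2-F1 of about 2020 Hz, whereas a back vowel such as [u]
might have an F1 of about 300 Hz and an F2 of about 870 Hz, giving an F2-F1 of
about 570 Hz. The front vowel thus shows the markedly larger F2-F1 value, as expected.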
3.1.2 Consonants
Consonants differ significantly among themselves in their acoustic properties,
so it is easier to discuss them in groups that are distinctive in their acoustic properties
(Kent & Read, 2002:105). In this study, the groups of consonants examined are the
stop, fricative, and affricate. The following sub-sections briefly explain these groups of
consonants articulatorily and acoustically so as to provide a general understanding of
their differences and why they cannot be treated simply as one large class.
3.1.2.1 Stops
A stop consonant is formed by a momentary blockage of the vocal tract,
followed by a release of the pressure. When the vocal tract is obstructed, little or no
acoustic energy is produced. But upon the release, a burst of energy is created as the
impounded air escapes. In English, the blockage occurs at one of three sites: bilabial,
alveolar, or velar (the glottal is usually considered separate from the rest) (Kent &
Read, 2002:105-6). The stops examined in this study are the voiceless bilabial [p],
alveolar [t], and velar [k].
Stops are typically classified as either syllable-initial prevocalic or syllable-final postvocalic (2002:106). Syllable-initial prevocalic stops are produced by, first, a
blockage of the vocal tract (stop gap), followed by a release of the pressure (noise
burst), and finally, formant transitions. Syllable-final postvocalic stops begin with
formant transitions, followed by the stop gap, and finally, an optional noise burst. Only
syllable-initial prevocalic stops are examined in this study since syllable-final
postvocalic stops do not always have a noise burst and are therefore not reliable cues.
The stop gap is an interval of minimal energy because little or no sound is
produced, and for voiceless stops, the stop gap is virtually silent (2002:110). Silent
segments could sometimes be pauses instead of stop gaps, and thus stop gaps are
relatively difficult to identify and quantify in a spectrogram, especially when a stop
follows a pause.
The noise burst can be identified in a spectrogram by a short spike of energy
pattern usually lasting no longer than 40 milliseconds (2002:110).
Formant transitions are the shift of formant frequencies between a vowel and
consonant adjacent to each other. In the case of syllable-initial prevocalic stops
examined in this study, formant transitions are the shift of formant frequencies from
their values for the stop to those for the vowel. Considering that formant transitions are
mentioned as part of the acoustic properties of stops, it was decided that formant
transitions should be included in the measurements of the voiceless stops in this study.
The manner of inclusion of formant transition values will be explained in the later
section on consonant measurements.
There are a couple of acoustic properties of stops which are commonly
measured, of which one is the spectrum of the stop burst, which varies with the place
of articulation (Halle et al, 1957; Blumstein & Stevens, 1979; Forrest et al, 1988).
Kent & Read (2002:112-5) give a brief overview of some of the studies that have been
done on the identification of stops from their bursts, and surmise that correct
identification of stops is possible if several features are examined, namely spectrum at
burst onset, spectrum at voice onset, and time of voice onset relative to burst onset
(VOT).
VOT, or voice onset time, is another acoustic property commonly associated
with the measurement of stops. It is the time interval “between the articulatory release
of the stop and the onset of vocal fold vibrations” (2002:108). When the voicing onset
precedes the stop release (usually the case for voiced stops), the VOT has a negative
value. A positive value is obtained when the onset of voicing slightly lags the
articulatory release (usually so for voiceless stops).
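By way of illustration, a minimal sketch of this sign convention is given below, using
hypothetical timestamps rather than any value measured in this study (in the study
itself the readings are taken from the Kay CSL display).

    def voice_onset_time(burst_release_s, voicing_onset_s):
        # VOT in seconds: positive if voicing lags the release, negative if it precedes it
        return voicing_onset_s - burst_release_s

    # Hypothetical timestamps (in seconds) read off a waveform or spectrogram
    print(round(voice_onset_time(0.512, 0.565), 3))   # 0.053: voicing lags the release (voiceless stop)
    print(round(voice_onset_time(0.512, 0.480), 3))   # -0.032: voicing precedes the release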
These acoustic properties will be referred to in the later section on consonant
measurements, where the choice of acoustic properties measured, as well as the
methods by which the measurements are made, are explained.
3.1.2.2 Fricatives
Fricative consonants are formed by air passing through a narrow constriction
maintained at a certain place in the vocal tract, which then generates turbulence noise
(Kent & Read, 2002:121-2). A fricative can be identified in a spectrogram by its
relatively long period of turbulence noise. And as with stops, formant transitions join
fricatives to preceding and/or following vowels, reflecting the movement of the
tongue and jaw (Hayward, 2000:190).
Fricatives may be classified into stridents and non-stridents, the main
difference between them being that strident fricatives have much greater noise energy
than non-stridents. In this study, the voiceless fricatives are grouped into stridents [s]
and [ʃ], and non-stridents [f] and [θ].
In addition to intensity of noise energy, fricatives can be differentiated from
one another by comparing various features of their spectra, such as spectral shape,
length, prominent peaks, and slope. Evers et al (1998) provide a comprehensive
discussion of the role of spectral slope in classifying stridents. A good method of
obtaining spectral readings for comparing fricatives is to average several spectra taken
over the course of each fricative and then compare the averages; however, not all
speech analysis software provides this averaging function. Where it is unavailable, the
best alternative is to compare single readings of narrow band spectra (Hayward,
2000:190). The measurement of spectra is not the focus of this study, so to explain
spectrum bandwidth simply without going into too much detail, it must first be
mentioned that the number of points of analysis (always a power of two, e.g. 16, 32,
64, 128) is inversely proportional to the bandwidth of the analysing filters;
generally, the greater the number of points, the narrower the bandwidth (Hayward,
2000:75-6). 64-point analysis is taken as an example of acceptable narrow band spectra
by Hayward (2000), and hence this was the researcher’s default choice in this study
when narrow band spectral analyses were done.
Another acoustic measurement that can be made of fricatives is rise time,
which is “the time from the onset of the friction to its maximum amplitude” (Hayward,
2000:195), meaning that the reading is taken by observing the waveform, and not the
spectrogram. Rise time is one of the acoustic cues that distinguish fricatives from
affricates (Howell & Rosen, 1983) – it is longer in fricatives because the rise of
frication energy is more gradual in fricatives but more rapid in affricates. The mean
rise time measured by Howell & Rosen (1983) for affricates was 33 ms while that for
fricatives was 76 ms.
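For illustration only, a rough sketch of how rise time could be read off a waveform is
given below; it assumes the frication portion has already been excised into an array of
samples and uses a synthetic noise signal, since rise time is not in fact measured in this
study (see the later section on consonant measurements).

    import numpy as np

    def rise_time_s(frication_samples, sample_rate):
        # Time from the onset of frication to its maximum absolute amplitude
        peak_index = int(np.argmax(np.abs(frication_samples)))
        return peak_index / sample_rate

    # Synthetic frication noise whose amplitude ramps up over roughly 50 ms
    fs = 16000
    t = np.arange(0, 0.1, 1 / fs)
    noise = np.random.randn(t.size) * np.minimum(t / 0.05, 1.0)
    print(round(rise_time_s(noise, fs) * 1000, 1), "ms")   # typically somewhat above 50 ms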
3.1.2.3 Affricates
The affricate involves a sequence of stop and fricative articulations. The
production of affricates begins with a period of complete obstruction of the vocal tract,
like stops, and is followed by a period of frication, like fricatives. In English, there are
only two affricates: [dʒ] and [tʃ], the latter being the one examined in this study since
it is voiceless.
Since the production of affricates involves both stop and fricative articulations,
it follows that the acoustic measurements mentioned in the earlier sub-sections on
stops and fricatives are also applicable to affricates. Hence, as will be explained in the
later section on consonant measurements, the acoustic properties the researcher
chooses to measure of stops and fricatives in this study will also be measured of the
voiceless affricate so that comparisons can be made between the affricate and the
former two classes of obstruents.
3.2 Method of analysis
The following sub-sections explain the criteria for the selection of data and the
details of the measurements of the vocal cues.
3.2.1 Selecting data for analysis
One of the difficulties of using natural speech is the selection of the “more
ideal” tokens (in this case, of specific vowels and consonants) for analysis. In a study
of speech segments, the phonological environment of the segments examined – and
even of the words in which the segments appear – should ideally be kept constant so as
to allow fair comparisons of the tokens. This can be done by recording scripted speech,
but is virtually impossible to achieve when natural speech is involved. In the case of
selecting segments from natural speech for data analysis, what is feasible is to
determine a set of conditions for the phonological environment of the segments, and
then find tokens which satisfy most – if not all – of those set conditions. Below is the
list of conditions for the tokens for analysis in this study, separated into two sub-sections.
3.2.1.1 Segmenting the recordings
1. Tokens should preferably be from sections in the recordings which the researcher
determines are more emotionally expressive.
It is never the case that a whole narrative is related in one high level of
expressiveness; the level of expressiveness always varies throughout. For example, a
narrative involving an Angry anecdote may consist of Neutral sections (such as at the
introduction of the narrative), and sections narrated in various degrees of anger.
Therefore, since this study is of the role of segmental features in emotional speech, it is
imperative that the tokens for analysis be extracted from the expressive sections.
Hence for the Angry, Sad, and Happy anecdotes, the researcher marks out the sections
which are very expressive and somewhat expressive, ignoring those which are hardly
expressive or not expressive at all. (A section is considered very expressive when the
subject sounds very emotional while speaking for a period of at least three seconds.
When the subject sounds expressive, but perceivably less emotional than in the very
expressive sections, for at least three seconds, the section is marked as somewhat
expressive. All other sections, in which the subject sounds neutral, are ignored.)
Six random tokens of each vowel and consonant variable are taken from the
very expressive sections, and four from the somewhat expressive ones (where the
expressiveness of the sections of recordings was earlier rated by the researcher). This
is to prevent the selection of data that is stereotypical of high levels of the emotions.
(When fewer than six tokens can be found in the very expressive sections, more tokens
are obtained from the somewhat expressive sections.) For the Neutral anecdotes,
sections in which the subject sounds expressive (such as when a joke is shared between
the subject and researcher) are ignored, and ten random tokens of the variables are
selected from the remaining sections.
2. The first minute of each emotional recording is ignored.
Because some subjects took a short while to get used to being recorded, tokens
are not taken from the first minute of each emotional recording. While this may not be
a large concern in the Angry, Sad, and Happy anecdotes because the first minute is
usually ignored anyway on the basis that the subject sounds non-expressive, it is a
valid consideration for the Neutral recording, especially since the subjects began their
recording session with Neutral (then followed by Sad, Angry, and Happy).
3.2.1.2 Phonological environment
1. Tokens of consonants should preferably be followed by a vowel and are never taken
from word-final position.
This condition ensures that word-final consonants are not selected as data
(there should be no compromising this condition) because certain voiceless obstruents
drop in intensity when placed in final position (Kent & Read, 2002). Furthermore, in
Singapore English, consonants in the final position are sometimes deleted, and
consonant clusters are sometimes simplified. Besides, if formant transitions are not to
be ignored in the acoustic measurements of the obstruents, the transitions (by
definition) have to be from consonant to vowel or vice versa, and not consonant to
consonant.
2. Tokens of vowels should preferably be followed by a voiceless consonant.
Because vowels tend to lengthen in open syllables or when followed by a
voiced consonant (Kent & Read, 2002:109; Ladefoged, 2001:83), tokens of vowels
should be followed by a voiceless consonant, as far as possible.
3. Tokens should preferably be taken from monosyllabic words.
The duration of a segment tends to shorten when more elements are added to a
single sound string (Kent & Read, 2002:146; Ladefoged, 2001:83). Therefore tokens
should preferably be taken from only monosyllabic words to ensure that data is
systematic. It was found that the subjects tended to use simple words in their
recordings, so there is a relatively large number of monosyllabic words from which to
select tokens for analysis. However, in the event that not enough satisfactory tokens of
a variable can be found in monosyllabic words, the search is expanded to disyllabic
words, and if that fails to provide enough tokens still, trisyllabic words are considered.
If such cases are kept to a minimum, they should not greatly affect the results of the
analysis, because all values obtained for each variable are averaged.
4. Tokens should be taken from words which are in the middle of a phrase, and the
phrase in the middle of a clause, sentence, or between two breath pauses.
The last stressable syllable in a major syntactic phrase or clause tends to be
lengthened (Kent & Read, 2002:150), and this phenomenon – called phrase-final
lengthening – is particularly marked in Singapore English (Low & Grabe, 1999).
Therefore, to avoid selecting data with phrase-final lengthening, the placement of the
words from which the tokens are taken should satisfy this condition.
3.2.2 Measurements of tokens
Some variables are measured and calculated in the same way for the vowel and
consonant tokens (60 for each speech segment): the intensity and duration (and also
the fundamental frequency, or F0, in the case of vowels) of each token are measured,
and from these the mean minimum, maximum, range, and average values are
calculated. Using the Kay CSL software, these values are easily obtained as long as
the speech segment for measurement is clearly marked for analysis.
Within every emotion, the measurements are separated according to gender.
For each gender, the mean minimum value is an average of the smallest reading from
each subject, the mean maximum is an average of the largest, and the mean average is
simply the average of all the values from all subjects of the gender. However, the mean
range is not an average of ranges, but the range between the mean maximum and mean
minimum values.
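A minimal sketch of how these summary values could be computed is given below; the
token readings are hypothetical, since the actual values in this study are obtained from
the Kay CSL software as described above.

    # Hypothetical intensity readings (dB) for one vocal cue in one emotion and one gender;
    # each inner list holds the token values measured for one subject.
    tokens_by_subject = [
        [62.1, 65.4, 63.8, 66.0],
        [59.7, 61.2, 64.5, 60.3],
        [63.0, 62.2, 65.1, 61.8],
    ]

    mean_minimum = sum(min(s) for s in tokens_by_subject) / len(tokens_by_subject)
    mean_maximum = sum(max(s) for s in tokens_by_subject) / len(tokens_by_subject)
    all_values = [v for s in tokens_by_subject for v in s]
    mean_average = sum(all_values) / len(all_values)
    mean_range = mean_maximum - mean_minimum   # range between the mean maximum and mean minimum

    print(round(mean_minimum, 2), round(mean_maximum, 2),
          round(mean_average, 2), round(mean_range, 2))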
The researcher has decided that formant transitions between the vowels and
adjacent consonants should also be taken into consideration. As mentioned in the
earlier section detailing the acoustic properties of the various classes of obstruents,
formant transitions are always mentioned as part of the acoustic properties of stops,
fricatives, and affricates, indicating that they are part of the consonants themselves.
However, because they are transitions, they are not just a part of the consonant but also
a part of the vowel from or towards which the formants are moving. Thus it was
decided that, for the purpose of this study, the midpoint of each transition is marked
and the transition is viewed as two halves: the half nearer the consonant is treated as
part of the consonant, and the half nearer the vowel as part of the vowel. In this way,
the transitions are included in the measurements of both the vowels and the consonants.
Before explaining the details of vowel and consonant measurements, it should
be mentioned that all final values measured in decibels are rounded off to the nearest
0.01 dB; those measured in seconds are rounded off to the nearest 0.001 sec; those
measured in Hertz are rounded off to the nearest 0.01 Hz.
3.2.2.1 Vowel measurements
Vowels are identified in the spectrograms by their horizontal dark bands of
formant frequencies. As mentioned earlier, formant transitions are considered as part
of the vowel. Hence in the case that there are consonants before and after the vowel,
the midpoints of formant transitions are marked as the respective beginning and end of
the vowel. The duration of the vowel is obtained from between these marks. The
intensity and F0 readings are obtained by calling up the intensity and pitch contours
respectively, and noting the mean intensity and F0 (statistically calculated by the Kay
CSL analysis program).
The two other variables measured for each vowel are the F1 and F2 values. The
values are observed from the spectrogram and taken from the midpoint of the vowel.
The F2-F1 values are then calculated for every vowel uttered by subtracting the F1
value from the F2 value. For each emotion, the values are separated according to
gender. The averages of the F1 values and also of the F2-F1 values are then found for
each of the vowels for the speakers of each sex. The final values are rounded off to
the nearest integer before tabulation.
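A minimal sketch of this calculation is given below, using hypothetical midpoint
formant readings; in the study itself the readings are taken from the Kay CSL
spectrogram.

    # Hypothetical (F1, F2) midpoint readings in Hz for one vowel, one gender, across tokens
    readings = [(310.0, 2210.0), (295.0, 2250.0), (330.0, 2180.0)]

    f1_values = [f1 for f1, _ in readings]
    f2_minus_f1_values = [f2 - f1 for f1, f2 in readings]

    mean_f1 = round(sum(f1_values) / len(f1_values))                   # rounded to the nearest integer
    mean_f2_minus_f1 = round(sum(f2_minus_f1_values) / len(f2_minus_f1_values))

    print(mean_f1, mean_f2_minus_f1)   # 312 and 1902 for the hypothetical values above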
3.2.2.2 Consonant measurements
To recapitulate, the consonants examined in this study are all voiceless
obstruents falling under three groups: stop, fricative, and affricate. The measurements
taken are of the intensity and duration. However, consonants differ significantly among
themselves in their acoustic properties, so while measuring across the entire
production of a vowel for intensity, duration, and F0 is feasible, one cannot measure
across the entire production of the different obstruents – stops with their stop gap and
release burst, fricatives throughout their noise segments, and the affricate with its stop
gap followed by noise segment (Kent & Read, 2002) – and expect to compare them
fairly. Since the presence of a frication interval is common to all obstruents – at the
release burst of stops and noise segments of fricatives and affricate – it is decided that
the intensity and duration measurements of the obstruents will be taken across their
frication interval, in the same manner that has been described of measuring vowel
intensity and duration. In the case of the stops and affricate, the onset of the stop gap is
marked as the beginning of the obstruent, and the end is marked at the first half of the
formant transition between the obstruent and the following vowel. With fricatives, the
beginning and end of the obstruent are marked by the midpoints of the formant
transitions.
In the interest of comparing the obstruents within their separate classes, further
measurements (besides intensity and duration) are taken, based on the relevance of the
variable to the class of the obstruent.
For stops, as explained in an earlier section, the two acoustic cues commonly
measured are spectrum and voice onset time (VOT). Kent & Read (2002) note that
spectra values at both burst onset and voice onset should be taken for best comparison.
But since stops, which are just one of the classes of obstruents, are not the main focus
of this study, and also due to time constraints and the need for simplicity and economy
of measurement, spectra values are not chosen as the intra-class variable. Hence the
variable with which to compare the stops within their obstruent class is VOT. Because
VOT is by definition the period between the onset of stop release and the onset of
voicing, the stop gap and formant transition are excluded from the time measurement,
and hence there is no confusion between the VOT measurements and the duration
measurements of the stops, despite their having the same unit of measurement.
The two acoustic cues with which to compare fricatives among themselves are
spectrum and rise time. Rise time is not feasible in this study because, while it is easy
to measure in recordings of clear reading voices, it is virtually impossible in recordings
of natural emotional speech to determine where the rise time of a fricative begins and
ends. Since the literature (Manrique & Massone, 1981; Kent et al, 1996; Evers et al,
1998; Hayward, 2000; Kent & Read, 2002) has shown that a single feature of the
spectrum can be used for comparison among fricatives, it was decided
that the Fast Fourier Transform (FFT) power spectral peak is the variable with which
to compare the fricatives. It was discovered that the analysis program used for this
study does not have the function of averaging several spectra taken over the course of a
fricative, so as suggested by Hayward (2000), a single reading of narrow band
spectrum is done at the midpoint of each fricative, using a 64-point analysis. The
method involves calling up the FFT power spectrum at the midpoint of the fricative,
such that frequencies of up to 10 kHz are displayed, and noting the frequency at which
the intensity is highest.
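A minimal sketch of this kind of peak-picking is given below, assuming the 64 samples
around the fricative midpoint are already available as an array; the test signal is
synthetic, and the study itself uses the Kay CSL spectrum display rather than any such
script.

    import numpy as np

    def fft_peak_hz(window_samples, sample_rate, max_hz=10000):
        # Frequency (Hz) of the strongest spectral component in the window, up to max_hz
        spectrum = np.abs(np.fft.rfft(window_samples))
        freqs = np.fft.rfftfreq(len(window_samples), d=1.0 / sample_rate)
        in_range = freqs <= max_hz
        return float(freqs[in_range][np.argmax(spectrum[in_range])])

    # Synthetic 64-point window centred on a fricative midpoint, sampled at 22050 Hz
    fs = 22050
    t = np.arange(64) / fs
    window = np.sin(2 * np.pi * 4500 * t) + 0.2 * np.random.randn(64)
    print(round(fft_peak_hz(window, fs)))   # close to 4500 Hz, within one analysis bin (~345 Hz)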
Finally, with the voiceless affricate, both VOT and FFT power spectrum at
midpoint of noise are measured, because the affricate involves both stop and fricative
articulations. The affricate is then compared separately against the stops and fricatives.
3.2.3 Method of comparison
For all classes of vocal cues mentioned above, the values obtained for each
emotion are compared to those for Neutral, to determine if the sound segment
contributes to the expression of emotion, and statistical tests are applied to determine if
the differences are significant. Besides the statistical significance of the differences in
values, the just-noticeable difference (perceptible to the human auditory system) of the
vocal cues is also taken into consideration when comparing the values. Hence, a few
points with regard to the perception of F0, intensity, and duration are addressed in the
following sub-sections, followed by a brief explanation of the statistical tests which are
applied in this research.
3.2.3.1 Fundamental frequency (F0)
The average values for F0 in conversational speech in European languages are
about 200 Hz for men, 300 Hz for women, and 400 Hz for children (Kent & Read,
2002). F0 perception operates by intervals, such that the difference between one pair of
F0 values sounds the same as the difference between another pair if the ratio of each
pair is the same (Kent et al, 1996:233). In other words, the difference between 200 and
100 Hz is considered perceptually equivalent to the difference between 500 and 250
Hz or to other differences in which the ratio is 2:1. In terms of absolute values, the
just-noticeable difference for F0 perception is about 1 Hz, in the span of 80 to 160 Hz
(Flanagan, 1957:534; Kent et al, 1996:233).
3.2.3.2 Intensity
The human auditory system is very sensitive to the intensity of sounds, and can
cope with a wide range of intensities (Laver, 1994). A normal conversation is
conducted at a level around 70 dB, a quiet conversation at around 50 dB, and a soft
whisper at around 30 dB (Moore, 1982:8). The just-noticeable difference in intensity
has a value of about 0.5 to 1 dB (Rodenburg, 1972) within the range of 20 to 100 dB
(Miller, 1947).
3.2.3.3 Duration
According to Lehiste (1972), the human auditory system is psychophysically
able to register minute temporal differences of duration under favourable experimental
conditions. The psychophysical threshold for a just-noticeable difference in duration
between two sounds is approximately 10 to 40 msec (1972:226), which is 0.01 to 0.04
sec.
3.2.3.4 Statistical tests explained
In this study, the comparisons of the measurements of vocal cues are always
made between only two sets of means. For example, the mean intensity values of all
the tokens of a given vowel uttered by females are averaged within each emotion,
giving four intensity means for that vowel: Angry, Sad, Happy, and Neutral. The
Angry, Sad, and Happy intensity means are then individually compared with the
Neutral intensity mean, and a statistical test is necessary to determine whether the
difference between the means (i.e. the difference between the Neutral intensity mean
and that of another emotion) is significant. Because of the random nature of the sample
values, the resulting difference between these means can be either positive or negative.
Therefore, the two-tailed t-test is the appropriate statistical analysis for the sample
values of vocal cues in this study.
There are two distinct t-tests, which have slightly different mathematical
formulae: differences in means versus mean of the differences. The former t-test is
used when there are two independent samples, while the latter is used when the
samples are not independent but are in fact paired comparisons. For example, the t-test
analysing the differences in means is used when vocal cues are compared across
genders, since the sample values are taken from independent samples (male versus
female). Comparisons of means within each gender (across emotions) are analysed
using the t-test of mean of the differences, since the comparisons are between pairs of
values taken from same group of subjects (e.g. Angry-Neutral pairs of [$] intensity;
[£]-[$] pairs of F1 values from Sad recordings). Therefore, it is evident that both t-tests
are necessarily used in this study.
With t-tests, the essential steps are:
1. Establishing the hypothesis in terms of the null hypothesis and the alternative hypothesis
2. Specifying the level of significance, and thereby determining the critical region using a t table (easily found in any statistics text)
3. Applying the appropriate test statistic to determine the t-value
4. Drawing a conclusion of rejecting or accepting the null hypothesis based on whether the calculated t-value falls within the critical region
In this study, for all comparisons of values of vocal cues, the null hypothesis is
that the means of the two cues are equal (i.e. difference = 0), and the alternative
hypothesis is that the means are not equal and that the difference is significant (as
opposed to being simply due to chance variations). The level of significance is
commonly set at 0.05 for most statistical analyses, while stricter tests set the level of
significance at 0.01. To have a significance level of 0.05 means that there is a 5% risk
of treating a difference as significant when it is in fact due only to chance variation. In this study, both
levels of significance are taken into consideration – results that are not significant at a
level of significance of 0.01 are not automatically dismissed as insignificant; they will
be checked to determine if they are significant at a level of 0.05. If the latter case is
discovered to be true, then the observation is noted that the difference is significant
only at a level of 5%, not 1%.
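For readers who prefer a scripting approach, a minimal sketch of the two tests is given
below using a general-purpose statistics library and hypothetical values; the
calculations in this study are in fact carried out with a spreadsheet program, as
described next.

    from scipy import stats

    # Hypothetical mean intensity values (dB) for one vowel
    neutral = [60.2, 58.9, 61.5, 59.8, 60.7, 58.4]
    angry   = [66.1, 63.8, 67.2, 64.9, 65.5, 66.8]
    female  = [61.0, 59.4, 62.3, 60.1]
    male    = [57.2, 58.8, 56.9, 59.5]

    # Paired t-test (mean of the differences): same subjects, Angry versus Neutral
    t_paired, p_paired = stats.ttest_rel(angry, neutral)

    # Independent t-test (difference in means): female versus male values
    t_indep, p_indep = stats.ttest_ind(female, male)

    for label, p in [("Angry vs Neutral (paired)", p_paired),
                     ("female vs male (independent)", p_indep)]:
        if p < 0.01:
            print(label, "- significant at the 1% level, p =", round(p, 4))
        elif p < 0.05:
            print(label, "- significant at the 5% level only, p =", round(p, 4))
        else:
            print(label, "- not significant, p =", round(p, 4))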
A spreadsheet program (Microsoft Excel 2002) is used to run the t-tests automatically,
without the researcher having to do any calculations by hand. Given two sets of values,
the program finds their means and applies the appropriate t-test. The outcome is a
number representing the probability of obtaining a difference at least as large as the
one observed if it were due to chance variation alone. In the rest of this dissertation, p is used to indicate
this number. In other words, when p[...]
significantly higher in joy than in sadness and tenderness (Chung, 1995; Hirose et al, 1997) Klasmeyer & Sendlmeier (1995) study glottis movement by analysing the glottis pulse shape in emotional speech data Laukkanen et al (1995) also examine the glottis and the role of glottal airflow waveform in identification of emotions in speech, but the study is, unfortunately, inconclusive, because, as admitted by. .. from spontaneous speech rather than actor simulation A long-term research programme (Rimé et al, 1998) has shown that most people tend to share their emotions by talking about their emotional experiences to others This means that most people will engage in emotional speech while recounting emotional experiences and therefore such recounts should have an abundance of speech segments uttered emotionally ... phonetics study The analysis of the sounds of a language can be done in two ways: by auditory or instrumental means In this study, the choice of speech extracts (i.e passages taken from the anecdotal... data for analysis One of the difficulties of using natural speech is the selection of the “more ideal” tokens (in this case, of specific vowels and consonants) for analysis In a study of speech. .. subjects of the gender However, the mean range is not an average of ranges, but the range between the mean maximum and mean minimum values The researcher has decided that formant transitions