Selection of Stress and Emotion Classification Features


Before classifying stressed or emotional speech, it is important to understand the effect of emotion and stress on the acoustic speech waveform and the acoustic-phonetic differences between different stressed and emotional speech styles. As can be seen from Figure 5.1(a), voiced speech produced under Stress differs significantly from voiced speech produced under the Neutral (Normal) condition in both frequency and intensity. A similar trend can be observed between the Anger and Sadness emotions in Figure 5.1(b). These observations indicate that the spectral structure of speech is altered under emotion or workload stress. One possible measure of the stress and emotional content of speech is therefore the distribution of spectral energy across the speech frequency range.

Figure 5.1: Waveforms of a segment of the speech signal (200 ms duration) produced under (a) Neutral and Anger conditions of the word 'go' by a male speaker and (b) Sadness and Anger emotions spoken by a Burmese female speaker.

The human auditory system is assumed to contain a filtering system in which the entire audible frequency range is partitioned into frequency bands [11]. This system is most conveniently described as having a nonlinear frequency response, and the properties of this nonlinear response are related to critical bands. The critical bands correspond to frequency processing in the cochlea. The cochlea, part of the inner ear, decomposes the spectral content of incident sound waves for hearing [127].

According to Fletcher [128], speech sounds are preprocessed by the peripheral auditory system through a bank of bandpass filters. These auditory filters perform the frequency weighting that underlies the frequency selectivity of the ear. Another important aspect of human auditory perception is loudness: in terms of perceived loudness, speech sounds can be ranked on a scale extending from quiet to loud.
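To make the critical-band notion concrete, the following is a minimal sketch (an illustration, not taken from the cited works) that maps frequency in hertz to the Bark critical-band rate using Zwicker's approximation and prints approximate band edges over an assumed 8 kHz bandwidth:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker's approximation of the Bark critical-band rate."""
    f_hz = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

# Approximate critical-band edges; the first ~17 Bark cover roughly 0-4 kHz,
# the range in which the source/filter model discussed later is valid.
freqs = np.linspace(0, 8000, 8001)
barks = hz_to_bark(freqs)
for z in range(1, int(barks.max()) + 1):
    edge = freqs[np.searchsorted(barks, z)]
    print(f"critical band {z:2d} ends near {edge:6.0f} Hz")
```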

Stress may affect different frequency bands differently, so improved stress classification features could be obtained by analyzing the energy in different frequency bands. Sarikaya [93] also reports that energy migration among subbands can be observed in stressed speech production. By extending the subbands to the fundamental frequency region, fundamental frequency information can also be captured by this feature. By analyzing these feature data with a Hidden Markov Model (HMM) recognizer, the effects of speaking rate and tone variation are also accounted for. Furthermore, an HMM can model a series of changing events; as discussed in Chapter 4, a stress/emotion classifier should be able to model changes in stress attributes.
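As a hedged illustration of how such feature sequences could be modeled (using the third-party hmmlearn package, which is an assumption of this sketch and not a toolkit prescribed by this work), one Gaussian HMM can be trained per stress/emotion class and an utterance assigned to the class whose model yields the highest log-likelihood:

```python
import numpy as np
from hmmlearn import hmm  # third-party package, assumed available

def train_class_hmm(feature_seqs, n_states=3, seed=0):
    """Fit one Gaussian HMM to all training sequences of a single
    stress/emotion class. feature_seqs: list of (T_i, D) feature arrays."""
    X = np.vstack(feature_seqs)
    lengths = [len(s) for s in feature_seqs]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50, random_state=seed)
    model.fit(X, lengths)
    return model

def classify(utterance_feats, class_models):
    """Pick the class whose HMM explains the observation sequence best."""
    scores = {label: m.score(utterance_feats) for label, m in class_models.items()}
    return max(scores, key=scores.get)
```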

Based on all these considerations, a feature based on the distribution of energy in different log-frequency bands is selected. Subband based features, together with an HMM, combine the speech characteristics important for stress detection into one parameter.
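A minimal sketch of such a feature extractor is given below. The band layout (eight log-spaced bands between an assumed 60 Hz lower edge and 4 kHz) and the frame parameters are illustrative assumptions, not the exact configuration used in this work:

```python
import numpy as np

def log_subband_energies(frame, fs, n_bands=8, f_lo=60.0, f_hi=4000.0):
    """Energy of one windowed speech frame in log-spaced frequency bands.

    frame : 1-D array of time-domain samples (already windowed)
    fs    : sampling rate in Hz
    Returns an (n_bands,) vector of log band energies.
    """
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    # Band edges spaced logarithmically between f_lo and f_hi.
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_bands + 1)

    feats = np.empty(n_bands)
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        feats[b] = np.log(power[mask].sum() + 1e-12)  # guard against log(0)
    return feats

# Example: a 25 ms Hamming-windowed frame at an 8 kHz sampling rate.
fs = 8000
frame = np.hamming(200) * np.random.randn(200)   # placeholder for real speech
print(log_subband_energies(frame, fs))
```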

In [93], wavelet based subband features were proposed as important indicators of stressed speech. However, wavelet based subband decomposition provides time-dependent spectral features, which may be more suitable for speech recognition [129] than for stress classification. The reason is that the variation of a specific phoneme sequence in time is important for recognizing words, whereas stressed speech such as Anger cannot be assumed to contain specific sequential events in the signal. For example, if loudness is associated with Anger, there is no fixed time in the utterance at which the loudness must occur: it can appear at the beginning, the middle or the end of the utterance, and as long as loudness occurs, Anger may be considered present [130]. For this reason, DFT (Discrete Fourier Transform) based subband features together with an HMM are more suitable for stress classification, since Fourier analysis preserves linearity in frequency resolution with no time dependency.
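To illustrate the distinction (purely as a sketch using NumPy and the third-party PyWavelets package, neither of which is prescribed by the cited works): a wavelet decomposition of a frame yields coefficient sequences that retain time localization, whereas DFT subbands reduce the frame to single energy values with no time dependency.

```python
import numpy as np
import pywt  # PyWavelets, assumed available

fs = 8000
frame = np.hamming(256) * np.random.randn(256)   # placeholder for a speech frame

# Wavelet decomposition: each subband is still a time series (time-dependent).
coeffs = pywt.wavedec(frame, 'db4', level=4)
wavelet_energies = [float(np.sum(c ** 2)) for c in coeffs]

# DFT subbands: each subband reduces to a single energy value for the frame.
power = np.abs(np.fft.rfft(frame)) ** 2
dft_energies = [float(power[lo:hi].sum())
                for lo, hi in [(0, 32), (32, 64), (64, 96), (96, 129)]]

print("wavelet subband lengths :", [len(c) for c in coeffs])
print("wavelet subband energies:", wavelet_energies)
print("DFT subband energies    :", dft_energies)
```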

Furthermore, FFT power spectrum based speech features have been used extensively in most speech recognition systems because of their immunity to noise [131]. Therefore, FFT based subband features are adopted, since these features are robust to frequency-dependent noise.

In linear acoustic theory, the speech production process is described in terms of a source/filter model [132]. This model includes a volume velocity source representing the glottal signal, a filter associated with the vocal tract, and a radiation component that relates the volume velocity at the lips to the radiated pressure in the far acoustic field.

This model is considered valid for frequencies below 4 to 5 kHz; it assumes plane wave propagation in the vocal tract and neglects nonlinear terms. When stressed speech is produced, increased muscle tension of the vocal cords and vocal tract, changes in vocal fold movement and sub-glottal air pressure, and variation in airflow from the lungs have been observed in studies [82, 133]. Linear acoustic theory therefore suggests that the frequency response of the vocal tract filter and the intensity and duration of the glottal signal may be assumed to change during stressed speech production.
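As an illustrative sketch of the source/filter view (the pulse-train source, the resonance frequencies and bandwidths, and the sampling rate below are arbitrary assumptions chosen for the example, not values from the cited studies), a voiced sound can be synthesized by driving a vocal tract filter with a glottal pulse train:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000          # assumed sampling rate (Hz)
f0 = 120.0         # assumed fundamental frequency of the glottal source (Hz)
duration = 0.2     # 200 ms, matching the segments shown in Figure 5.1

# Source: an impulse train at the glottal pulse rate (a crude glottal model).
n = int(fs * duration)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Filter: an all-pole vocal tract approximation with two resonances (formants),
# here placed near 500 Hz and 1500 Hz with moderate bandwidths (illustrative).
def resonator(freq, bw, fs):
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    return np.array([1.0, -2 * r * np.cos(theta), r ** 2])

a = np.convolve(resonator(500, 100, fs), resonator(1500, 150, fs))
speech_like = lfilter([1.0], a, source)   # source passed through the "vocal tract"

# Under stress, both the glottal source (f0, intensity, timing) and the vocal
# tract filter (resonance locations and bandwidths) are expected to change.
print(speech_like[:10])
```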

Recent research by Cairns [35] proposed that there is a net airflow through the glottis when speech is produced. According to linear acoustic theory, this airflow only produces fricative sounds. However, vortices of air are created in the region of the false vocal folds as the glottal flow propagates through the vocal tract. Sound can be produced by these vortex flow interactions, and such sound production is nonlinear [32, 35, 134]. Speech therefore consists of both a linear acoustic component and a nonlinear component generated by vortex flow interactions. The study by Cairns [35] suggests that the nonlinear component changes appreciably between normal and stressed speech. Therefore, Teager Energy Operator (TEO) [32] based nonlinear features are investigated; the TEO extracts the nonlinear component of the speech signal. In addition, non-linguistic information such as stress and emotion contained in the speech signal is related to prosodic features such as speech energy. The Teager Energy Operator [135] is a very useful tool for analyzing a signal from an energy point of view, and it has several important properties that make it possible to determine the energy function of the fairly complicated signals encountered in speech production [135]. Therefore, TEO based subband energy features are computed from the speech signal.
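For a discrete-time signal x(n), the Teager Energy Operator is commonly defined as Psi[x(n)] = x^2(n) - x(n-1)x(n+1). The sketch below applies this operator to each subband signal; the Butterworth bandpass filter bank and the log-spaced band edges are illustrative assumptions, not necessarily the configuration used in this work.

```python
import numpy as np
from scipy.signal import butter, lfilter

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def teo_subband_features(signal, fs, band_edges):
    """Mean Teager energy per subband; band_edges is a list of (lo, hi) in Hz."""
    feats = []
    for lo, hi in band_edges:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        subband = lfilter(b, a, signal)
        feats.append(float(np.mean(teager_energy(subband))))
    return np.array(feats)

# Example with illustrative log-spaced bands between 60 Hz and 3.8 kHz at fs = 8 kHz.
fs = 8000
signal = np.random.randn(fs // 5)          # placeholder for 200 ms of speech
edges = np.logspace(np.log10(60), np.log10(3800), 9)
bands = list(zip(edges[:-1], edges[1:]))
print(teo_subband_features(signal, fs, bands))
```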
