MODELING OF NON-NATIVE AUTOMATIC
SPEECH RECOGNITION
XIONG YUANTING
(B.Eng.(Hons.)), NTU
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF
COMPUTING
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2011
Acknowledgements
This thesis would not have been possible without the guidance and the help of several
individuals who in one way or another contributed and extended their valuable assistance
in the process of the research.
First and foremost, my utmost gratitude goes to Dr. Sim Khe Chai, Assistant Professor at
the School of Computing (SoC), National University of Singapore (NUS), whose
sincerity and encouragement I will never forget. Dr. Sim has been my inspiration as I
hurdled all the obstacles in completing this research work.
Mr. Li Bo, PhD candidate of the Computer Science Department of the National
University of Singapore, for his unselfish and unfailing advice in implementing my
research project.
Mr. Wang Xuncong, PhD candidate of the Computer Science Department of the National
University of Singapore, who shared valuable suggestions on the fundamental
knowledge of automatic speech recognition systems.
Dr. Mohan S Kankanhalli, Professor of the Department of Computer Science, NUS, who
showed kind concern and gave suggestions regarding my academic requirements.
Last but not least, my husband and my friends, for giving me the strength to press on
with study and research in the Computer Science Department, which is a big challenge for
me, as my bachelor's degree is from the School of Electrical and Electronic Engineering.
Table of Content

TABLE OF CONTENT ........................................................ iii
SUMMARY ................................................................. v
LIST OF FIGURES ......................................................... vi
LIST OF TABLES .......................................................... vii
CHAPTER 1 INTRODUCTION .................................................. 1
CHAPTER 2 BASIC KNOWLEDGE IN ASR SYSTEM ................................. 3
  2.1 Overview of Automatic Speech Recognition (ASR) .................... 3
  2.2 Feature Extraction ................................................ 5
  2.3 Acoustic Models ................................................... 8
    2.3.1 The phoneme and state in the ASR ............................. 9
    2.3.2 The Theory of Hidden Markov Model ............................ 10
    2.3.3 HMM Methods in Speech Recognition ............................ 12
    2.3.4 Artificial Neural Network .................................... 15
    2.3.5 ANN Methods in Speech Recognition ............................ 17
  2.4 Adaptation Techniques in Acoustic Model .......................... 19
    2.4.1 SD and SI Acoustic model ..................................... 19
    2.4.2 Model Adaptation using Linear Transformations (MLLR) ......... 19
    2.4.3 Model adaptation using MAP ................................... 22
  2.5 Lexical Model .................................................... 23
  2.6 Language Model ................................................... 24
CHAPTER 3 LITERATURE REVIEW ............................................. 27
  3.1 Overview of the challenges in ASR for non-native speech .......... 27
  3.2 The solutions for non-native speech challenges ................... 30
CHAPTER 4 METHODS AND RESULTS ........................................... 41
  4.1 Non-Native English Data Collection ............................... 42
  4.2 Project 1: Mixture level mapping for Mandarin acoustic model to English acoustic model ... 43
    4.2.1 Overview of the Method in Project 1 .......................... 43
    4.2.2 Step Details in Project 1 .................................... 46
    4.2.3 Project 1 Results ............................................ 52
  4.3 Project 2: PoE to combine the bilingual NN models ................ 54
    4.3.1 Overview of the Method in Project 2 .......................... 54
    4.3.2 Step Details in Project 2 .................................... 55
    4.3.3 Project 2 Results ............................................ 61
  4.4 Project 3: Training Non-native Speech Lexicon Model .............. 61
    4.4.1 Overview of the Method in Project 3 .......................... 61
    4.4.2 Step Details in Project 3 .................................... 62
    4.4.3 Project 3 Results ............................................ 65
  4.5 Projects Achievement and Problem ................................. 66
    4.5.1 The Achievement and Problem in Acoustic Model ................ 66
    4.5.2 The Achievement and Problem in Lexicon Model ................. 67
CHAPTER 5 CONCLUSION AND RECOMMENDATION ................................. 68
APPENDIX ................................................................ 70
BIBLIOGRAPHY ............................................................ 72
Summary
Heavily accented non-native speech represents a significant challenge for automatic
speech recognition (ASR). Globalization further emphasizes the urgency of research to
address these challenges. An ASR system consists of three parts: acoustic modeling, lexical
modeling and language modeling. In this thesis, the author first gives a brief
introduction to the research topic and the work that has been done in Chapter 1. In Chapter 2,
the author explains the fundamental knowledge of the ASR system; the concepts and
techniques illustrated in this chapter are applied in the following chapters,
especially in Chapter 4. In Chapter 3, the author presents her literature review, which
introduces the current concerns in the natural language processing field, the challenges
involved, and the major approaches to addressing those challenges. In Chapter 4, the
author presents the research she has done so far. Two projects are carried out to improve the
acoustic model for recognizing Mandarin-accented non-native speech. Another project
is targeted at improving the lexicon model of word pronunciations for multi-national
speakers. The project process flows and step details are all covered. In Chapter 5, the
author discusses the achievements and the problems regarding the results from the three
projects, and then gives her conclusions and recommendations for the work she has done
for this thesis.
List of Figures
FIGURE 2.1 AUTOMATIC SPEECH RECOGNITION .................................................................................... 4
FIGURE 2.2 FEATURE EXTRACTION ......................................................................................................... 5
FIGURE 2.3 FEATURE EXTRACTION BY FILTER BANK .............................................................. 7
FIGURE 2.4 MEL FILTER BANK COEFFICIENT ........................................................................................... 8
FIGURE 2.5 HIDDEN MARKOV MODELS................................................................................................ 11
FIGURE 2.6 ASR USING HMM ............................................................................................................... 13
FIGURE 2.7 TRAIN HMM FROM MULTIPLE EXAMPLES ......................................................................... 15
FIGURE 2.8 ILLUSTRATION OF NEURAL NETWORK............................................................................... 15
FIGURE 2.9 SIGMOIDAL FUNCTION FOR NN ACTIVATION NODE ......................................................... 16
FIGURE 2.10 FEEDFORWARD NEURAL NETWORK (LEFT) VS RECURRENT NEURAL NETWORK (RIGHT) ......... 16
FIGURE 2.11 ANN APPLIED IN ACOUSTIC MODEL ................................................................................ 18
FIGURE 2.12 REGRESSION TREE FOR MLLR ........................................................ 22
FIGURE 2.13 DICTIONARY FORMAT IN HTK .......................................................................................... 24
FIGURE 2.14 THE GRAMMAR BASED LANGUAGE MODEL .................................................................... 25
FIGURE 3.1 SUMMARY OF ACOUSTIC MODELING RESULTS ................................................................. 32
FIGURE 3.2 PROCEDURE FOR CONSTRUCTING THE CUSTOM MODEL ................................................. 33
FIGURE 3.3 MAP AND MLLR ADAPTATION ........................................................................................... 35
FIGURE 3.4 PERFORMANCE WITH VARIOUS INTERPOLATION WEIGHTS ............................................. 35
FIGURE 3.5 BEST RESULTS OF VARIOUS SYSTEMS ................................................................................ 36
FIGURE 3.6 DIAGRAM OF HISPANIC-ENGLISH MULTI-PASS RECOGNITION SYSTEM ............................ 37
FIGURE 3.7 EXAMPLE OF THE MATCH BETWEEN STATE TARGET AND SOURCE .................................. 39
FIGURE 4.1 THE VOICE RECORDING COMMAND INPUTS AND OUTPUTS ............................................ 42
FIGURE 4.2 PROJECT 1 PROCESS FLOW ................................................................................................ 45
FIGURE 4.3 THE INPUTS AND OUTPUT FOR THE MFCC FEATURES ABSTRACTION ............................... 47
FIGURE 4.4 THE INPUTS AND OUTPUT OF SPEECH RECOGNITION COMMAND ................................... 48
FIGURE 4.5 THE INPUTS AND OUTPUTS FOR REGRESSION TRESS GENERATION ................................. 48
FIGURE 4.6 INPUTS AND OUTPUTS FOR MLLR ADAPTATION ............................................................... 49
FIGURE 4.7 MODIFIED MODEL ILLUSTRATION ..................................................................................... 50
FIGURE 4.8 THE RESULTS FOR PROJECT 1 ............................................................................................ 53
FIGURE 4.9 PROJECT 2 PROCESS FLOW ................................................................................................ 54
FIGURE 4.10 NN MODEL 1 TRAINING ................................................................................................... 56
FIGURE 4.11 NN MODEL 2 TRAINING (POE) ......................................................................................... 58
FIGURE 4.12 THE INVERSE FUNCTION OF GAUSSIAN ........................................................................... 59
FIGURE 4.13 PROJECT 3 PROCESS FLOW .............................................................................................. 61
FIGURE 4.14 PROJECT 3 STEP 3 PROCESS FLOW .................................................................................. 63
FIGURE 4.15 THE EXAMPLE FOR COMBINE.LIST ................................................................................... 64
FIGURE 4.16 THE EXAMPLE FOR THE RESULT OF FIND_GENERAL.PERL ............................................... 65
FIGURE 4.17 THE EXAMPLE FOR DICTIONARY ...................................................................................... 65
List of Tables
Table 2.1 The 39 CMU Phoneme Set ................................................................................................. 9
Table 2.2 The probability of the pronunciations ............................................................................. 23
Table 3.1 Word error rate for the CSLR Sonic Recognizer ............................................................... 36
Table 3.2 Word Error Rate % by Models ....................................................... 37
Table 4.1 Non-native English collection data .................................................................................. 43
Table 4.2 Mandarin mixture modified empty English model .......................................................... 53
Table 4.3 Results from NN Poe ........................................................................................................ 61
Table 4.4 Results for project 3 ......................................................................................................... 66
Chapter 1
Introduction
The goal of automatic speech recognition (ASR) is to get computers to convert human
speech into text. The ASR system simulates the hearing and language processing ability
of humans. Currently, in real-world applications, a well-trained ASR system can
achieve more than 95% accuracy in a controlled environment. In particular, ASR systems
work best when the acoustic models are speaker dependent, the training and testing
speech data are recorded in a noise-free environment and the speakers have a fluent native
accent.
However, with globalization and the widespread emergence of speech applications, the
need for more flexible ASR has never been greater. A flexible ASR system is one that can
serve multiple users, operate in noisy environments, and even handle non-native speakers.
In this research, the author will focus on improving ASR performance for non-native speakers.
There are many challenges in tackling the problems arising from non-native ASR.
Firstly, there is a lack of non-native speech resources for model training. Some
researchers have attempted to address this problem by adapting a native acoustic
model with limited non-native speech data ([10], [13], [15], [18]). Secondly, non-native
speakers have different nationalities and thus different accents; to address this, some
researchers have tried MLLR, MAP, interpolation and state-level mapping ([9], [10],
[11], [12], [26]). Thirdly, even non-native speakers with the same hometown accent are
at different levels of proficiency in the target language. These problems cause non-native
speech recognition accuracy to drop significantly.
In the research covered in Chapter 4 of this thesis, the author will focus on the two
core parts of the ASR system to improve the accuracy of non-native speech recognition.
Firstly, the author will attempt to improve the acoustic model of the ASR system. Non-native
acoustic modeling is an essential component in many practical ASR systems, and there are
two projects related to this problem. Secondly, the author also explores issues in the
lexicon model. For non-native speakers, the pronunciations of some words differ
from those of native speakers. To make matters worse, due to the immature accents of non-native
speakers, discriminating between words with similar pronunciations becomes difficult. In
this thesis, one project is targeted at this problem.
This thesis uses many technical terms, concepts and techniques from the field of
automatic speech recognition. In order to make them clear to the reader, the
author has collected all the key background knowledge in Chapter 2. In addition, many
previous researchers have attempted to solve these problems with various approaches,
which also give insight into the author's later projects. The development history of ASR
and those researchers' approaches to addressing issues similar to those in this thesis are
included in Chapter 3. Chapter 5 gives the conclusions and recommendations for possible
future work.
Chapter 2
Basic Knowledge in ASR System
In this chapter, the author will describe the basic knowledge behind an ASR system. First, the
author will give an overview of the ASR system. After that, the author will describe
the currently most frequently used acoustic feature format and its corresponding extraction
technique. Then, the author will cover the acoustic model, the lexicon model and the
language model one by one. Understanding the acoustic model is very important
for understanding the ASR system, and the author will give more in-depth
knowledge in this part, including some advanced techniques used in the acoustic model.
2.1 Overview of Automatic Speech Recognition (ASR)
The Automatic Speech Recognition is a system to process a speech waveform file into a
language text format, by which this system converts audio captured by the microphone
into the text format language stored in the computer.
An ASR system generally consists of three models, acoustic model, lexical model and
language model. As illustrated in Figure 2.1, the audio waveform file is first converted
into a feature file, with reduced size and is matched to a particular acoustic model.
Figure 2.1 Automatic Speech Recognition
The acoustic model accepts the feature file as input and produces a phone sequence as
output. The lexical model is the bridge between the acoustic model and the language model:
it models each word in the speech with one or more pronunciations, which is why the
lexical model is often called a dictionary in the field. The language model
accepts the candidate word sequences produced by the lexical model as input and
produces the more common and grammatically plausible sentence. If all the models are
well trained, the ASR system is capable of generating the corresponding sentence in text
form from natural human speech with a low error rate.
Viewed another way, each of the models in the ASR system resolves
ambiguity at one of the language processing levels. For example, the acoustic model
disambiguates one sequence of feature frames from another,
and categorizes them into a sequence of phones. The lexical model disambiguates one
sequence of phones from another, and categorizes them into a
sequence of words. The language model makes the disambiguation done by the lexical
model more accurate by assigning higher probability to more frequently occurring word
sequences, or by integrating knowledge of sentence grammar.
2.2 Feature Extraction
As previously mentioned, the acoustic model requires input in the format of feature file
instead of waveform file. This is because the feature file has much smaller size than the
waveform file, while the feature file still maintains some of the important information
and reduces some redundant and disturbing data, such as noisy.
To abstract a feature file from a waveform file, many parameters have to be
predefined. First, we need to define the sampling rate for the feature file vector, usually
we call this sampling time interval as frame period. For every frame period, a vector
parameter will be generated from the waveform file, and be stored in the feature file.
This vector parameter is based on the magnitude on the frequency spectrum for a piece
of audio waveform around its frame sampling point. Normally, the duration of the piece
of audio waveform is longer than the frame period, and we call it window duration.
Thus, there is some overlap-sampled information for the nearby frames.
Figure 2.2 Feature Extraction
For example (Figure 2.2), if a waveform is sampled at 0.0625 msec per sample (a 16 kHz
sampling rate), the window duration is defined as 25 msec and the frame period as 10 msec,
then neighbouring frames overlap by 15 msec and every window duration covers 400 samples
of the speech waveform.
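To make this framing arithmetic concrete, the short Python sketch below (an illustrative example, not code from the thesis) slices a 16 kHz signal into 25 msec windows with a 10 msec frame period; the 400-sample window, 160-sample shift and 15 msec overlap follow directly from the numbers above.

```python
import numpy as np

SAMPLE_RATE = 16000           # 16 kHz, i.e. 0.0625 msec per sample
WINDOW_MS, FRAME_MS = 25, 10  # window duration and frame period from the text

win_len = int(SAMPLE_RATE * WINDOW_MS / 1000)   # 400 samples per window
hop_len = int(SAMPLE_RATE * FRAME_MS / 1000)    # 160 samples per frame shift
overlap_ms = WINDOW_MS - FRAME_MS               # 15 msec shared by neighbouring frames

def frame_signal(signal: np.ndarray) -> np.ndarray:
    """Return a (num_frames, win_len) matrix of overlapping analysis windows."""
    num_frames = 1 + (len(signal) - win_len) // hop_len
    idx = np.arange(win_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
    return signal[idx]

if __name__ == "__main__":
    one_second = np.random.randn(SAMPLE_RATE)   # stand-in for real speech samples
    frames = frame_signal(one_second)
    print(win_len, hop_len, overlap_ms, frames.shape)  # 400 160 15 (98, 400)
```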
By far, three forms of extracted feature vectors are most widely used. Linear Prediction
Coefficients (LPC) [1] were used very early in speech processing, but are seldom
used now. Mel Frequency Cepstral Coefficients (MFCC) [2] are the default
parameterization for many speech recognition applications, and ASR systems whose
feature files are based on MFCC features have competitive performance. Perceptual
Linear Prediction Coefficients (PLP) [3] were developed more recently. In the following part,
the author will only focus on the MFCC feature extraction technique.
To have a good understanding of MFCC features, we should first understand the
concept of a filterbank. Filter banks are used to extract information from the spectral
magnitude of each window period (Figure 2.3); they are a series of triangular
filters on the frequency spectrum. To implement a filterbank, a window period of
speech data is transformed using a Fourier transform and the magnitude of the frequency
spectrum is obtained. The spectral magnitude is then
multiplied by the corresponding filter gains of the filter bank and the results are integrated.
Therefore, each filterbank channel outputs one feature parameter, which is the integral of
this product along that channel. The length of the
feature vector in Figure 2.2 therefore depends on the number of feature parameters,
in other words, on the number of filter banks defined over the frequency spectrum.
Figure 2.3 Feature extraction by filter bank
Usually, if the triangular filters were spread over the whole frequency range, their number
would be unlimited. Practically, lower and upper frequency cut-offs are defined, for example
sampling information only from the frequency range of 300 Hz to 3400 Hz, so that a certain
number of triangular filter banks are spread over this range.
Mel Frequency Cepstral Coefficients (MFCC) are coefficients calculated from the feature
vector produced by the filter banks. However, the filter banks used for MFCC are not equally
spaced in frequency; such a filterbank is called a Mel filterbank. Humans can identify sounds
better at lower frequencies than at higher frequencies, and practical evidence also
suggests that a non-linearly spaced filter bank is superior to an equally spaced one.
The Mel filterbank is designed to model this non-linear characteristic of human
hearing. The equation below defines the mel scale:

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

The Mel filter banks are positioned according to the mel scale: the triangular filter
locations are equally spaced in mel value rather than in frequency. (Figure 2.4)
(Footnote: K. C. Sim, "Automatic Speech Recognition", Speech Processing, p. 7, 2010.)
Figure 2.4 Mel filter bank coefficient
After obtaining the feature vector from the non-linear filter bank, the Mel-Frequency
Cepstral Coefficients (MFCCs) can be calculated by the following formula:

$$c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} m_k \cos\!\left(\frac{\pi n}{K}\,(k - 0.5)\right), \qquad n = 1, \dots, N$$

where $c_n$ is the nth MFCC coefficient, $m_k$ is the kth feature value from the Mel filter
bank, and K and N are the number of filter banks and the number of MFCC coefficients
respectively.
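The following NumPy sketch illustrates this pipeline for a single analysis window: a mel-spaced triangular filterbank over the 300-3400 Hz range mentioned above, the log of the filterbank outputs, and the cosine sum given by the formula. The number of filters (24) and coefficients (12) are assumed values for illustration, not settings taken from the thesis experiments.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate, f_low=300.0, f_high=3400.0):
    """Triangular filters equally spaced on the mel scale between the cut-offs."""
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for k in range(1, num_filters + 1):
        left, centre, right = bins[k - 1], bins[k], bins[k + 1]
        fbank[k - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[k - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fbank

def mfcc(window, sample_rate=16000, num_filters=24, num_ceps=12):
    n_fft = len(window)
    spectrum = np.abs(np.fft.rfft(window))          # magnitude spectrum of one window
    m = np.log(mel_filterbank(num_filters, n_fft, sample_rate) @ spectrum + 1e-10)
    k = np.arange(1, num_filters + 1)
    n = np.arange(1, num_ceps + 1)[:, None]
    # c_n = sqrt(2/K) * sum_k m_k cos(pi*n/K * (k - 0.5))
    return np.sqrt(2.0 / num_filters) * (np.cos(np.pi * n / num_filters * (k - 0.5)) @ m)

coeffs = mfcc(np.random.randn(400))                 # one 25 msec window at 16 kHz
print(coeffs.shape)                                 # (12,)
```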
MFCCs are the standard compressed representation of the waveform for many speech recognition
applications. They give good discrimination and can be transformed into other forms easily.
However, compressing the waveform into feature format still causes the loss of
some information, and may reduce the robustness of the acoustic model being trained.
On the other hand, for a given utterance the waveform file is about 6 times larger than the
feature file, so training on waveforms also takes much longer than training on features.
That is why feature extraction is so widely preferred in the natural language processing
field.
2.3 Acoustic Models
(Footnote: K. C. Sim, "Automatic Speech Recognition", Speech Processing, p. 8, 2010.)
2.3.1 The phoneme and state in the ASR
In a speech utterance, it is easy to understand what is referred to as a sentence and what is
referred to as a word. The author will explain more about the phoneme and the state in the
natural language processing field.
Speakers and listeners divide words into component sounds, which are phonemes. In
this research, we use the Carnegie Mellon University pronouncing dictionary [71]; the
phoneme set in this dictionary contains 39 phonemes (Table 2.1). The CMU dictionary is
a machine-readable pronunciation dictionary for North American English. The format of
this dictionary is very useful in speech recognition and synthesis applications. The
vowels in the CMU phoneme set carry lexical stress.
Table 2.1 The 39 CMU Phoneme Set

Vowel Phonemes (PHONEME - EXAMPLE):
  aa - odd      ae - at       ah - hut      ao - ought    aw - cow
  ay - hide     eh - Ed       er - wooden   ey - ate      ih - it
  iy - eat      ow - down     oy - toy      uh - hood     uw - two

Consonant Phonemes (PHONEME - EXAMPLE):
  b - baby      ch - chip     d - dog       dh - thee     f - field
  g - game      hh - he       jh - gee      k - cook      l - lamb
  m - monkey    n - nut       ng - ring     p - paper     r - rabbit
  s - sun       sh - ship     t - tap       th - then     v - van
  w - was       y - yield     z - zebra     zh - treasure
Movements of the jaw, tongue and lips determine different phonemes. There are four
major classes of phonemes: voiced phonemes have the vocal folds vibrating; vowels have
no blockage of the vocal tract and no turbulence; consonants are non-vowels; and plosives
involve an "explosion" of air. In addition, all of these phoneme classes, and even finer
classes, can be captured by sound spectrum features. The sound spectrum
changes with the shape of the vocal tract. For example, "sh" has a similar spectral shape
to "s", but "s" has very high frequency content around 4.5 kHz, while "sh" is lower in
frequency because the tongue is further back.
From the previous section, we know that feature extraction gives us the frequency
spectrum information at a rate of one vector every 10 msec (one frame period). One phoneme
can span several to tens of frames, depending on the speaker's rate and the context. In
order to model this flexibility of the phoneme, we further divide a phoneme into several
consecutive states. There is no standard definition of the number of states for a
particular phoneme, and different researchers use slightly different definitions for
different purposes. The author defines about 3 emitting states for a particular phoneme.
Each state captures the distinguishing features of that phoneme over a defined
window period, and a state may be repeated multiple times before the model jumps to the
next state of the phoneme, because people spend different durations pronouncing the
phonemes.
2.3.2 The Theory of Hidden Markov Model
The most widely adopted statistical acoustic models in language processing are the
Hidden Markov Models (HMMs). It is a finite-state transducer and is defined by states
and the transitions between states. (Figure 2.5)
In a regular Markov model, the state is directly visible to the observer, thus the model
only has state transition probabilities parameters. In a Hidden Markov model, the state is
not directly visible, only output, which depends on the state, is visible. Therefore, the
HMM has both distribution model parameters for each state and the state transition
probabilities parameters.
Figure 2.5 Hidden Markov Models
As we can see from Figure 2.5, an HMM consists of a number of states. Each state j
has an associated observation probability distribution $b_j(o_t)$, which models the
probability of generating the observation $o_t$ at a particular time t. The observation $o_t$ is the
feature vector (MFCC) illustrated previously, which is sampled once per
frame period over a window duration. Each pair of states i and j has a modeled transition
probability $a_{ij}$. All these model parameters are obtained by a statistical data-driven
training method. In the experiments, the entry state 1 and the exit state N of an N-state
HMM are non-emitting states.
Figure 2.5 shows a HMM with five states, and the HMM can be used to model a
particular phoneme. The three emitting states (2-4) have output probability distributions
associated with them. Each probability distribution of a state is represented by a mixture
Gaussian density. For example, for state j the probability bj(ot) is given by
$$b_j(o_t) = \prod_{s=1}^{S}\left[\sum_{m=1}^{M_{js}} c_{jsm}\, \mathcal{N}(o_{st};\, \mu_{jsm},\, \Sigma_{jsm})\right]^{\gamma_s}$$

where $M_{js}$ is the number of mixture components in state j for stream s, $c_{jsm}$ is the
weight of the m'th Gaussian component in that stream, and $\mathcal{N}(\cdot;\, \mu, \Sigma)$ is a multivariate
Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$. The following equation shows
the multivariate Gaussian:

$$\mathcal{N}(o;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, \exp\!\left(-\tfrac{1}{2}(o - \mu)^{T} \Sigma^{-1} (o - \mu)\right)$$

where n is the dimensionality of o. The exponent $\gamma_s$ is a stream weight and its default
value is one. Generally, we only use a sum of Gaussian mixtures for a state,
and the stream-level parameters are ignored; practically, there is a lack of training data to
model so many parameters. Therefore, the commonly used output probability distribution is
given by

$$b_j(o_t) = \sum_{m=1}^{M_j} c_{jm}\, \mathcal{N}(o_t;\, \mu_{jm},\, \Sigma_{jm})$$
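As a small illustration of this simplified state output density, the sketch below sums diagonal-covariance Gaussians weighted by the mixture weights $c_{jm}$. The dimensionality, number of components and all parameter values are random placeholders, not parameters from any model in this thesis.

```python
import numpy as np

def gaussian_pdf(o, mean, var):
    """Diagonal-covariance multivariate Gaussian N(o; mean, diag(var))."""
    d = len(o)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
    return norm * np.exp(-0.5 * np.sum((o - mean) ** 2 / var))

def state_output_prob(o, weights, means, variances):
    """b_j(o_t) = sum_m c_jm * N(o_t; mu_jm, Sigma_jm) for one HMM state j."""
    return sum(c * gaussian_pdf(o, mu, var)
               for c, mu, var in zip(weights, means, variances))

rng = np.random.default_rng(0)
dim, num_mix = 39, 4                       # e.g. 39-dimensional MFCC-based features
weights = np.full(num_mix, 1.0 / num_mix)  # mixture weights c_jm, summing to one
means = rng.normal(size=(num_mix, dim))
variances = np.ones((num_mix, dim))
observation = rng.normal(size=dim)
print(state_output_prob(observation, weights, means, variances))
```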
Standard Hidden Markov Model has two important assumptions.
1. Instantaneous first-order transition:
The probability of making a transition to the next state is independent of the historical
states.
2. Conditional independence assumption:
The probability of observing a feature vector is independent of the historical
observations.
2.3.3 HMM Methods in Speech Recognition
In the ASR system, the role of the acoustic model is to decide the most likely phoneme for a
given series of observations. The hidden Markov model outputs the phoneme sequence in the
following way.
Firstly, we should know that each phoneme is associated with a sequence of
observations.
The following maximum a posteriori (MAP) criterion is then used to find the
best phoneme candidate associated with the given series of observations:

$$\hat{w} = \arg\max_{w} P(w \mid O)$$
The posterior probability cannot be modeled directly in the acoustic model, but it can be
calculated indirectly from the likelihood model according to Bayes' rule:

$$P(w \mid O) = \frac{P(O \mid w)\, P(w)}{P(O)}$$
In the above formula, the probability of the observations, P(O), is the same for every
candidate. The prior probability of a particular phoneme sequence, P(w), will be accounted
for by the lexical model or the language model, and can be ignored in the acoustic model.
Thus, the decision rule can be simplified as follows, so that only the likelihood of
the observations given the phoneme sequence matters:

$$\hat{w} = \arg\max_{w} P(O \mid w)$$
The likelihood can be calculated directly from the HMM. For example, consider a
six-state model (Figure 2.6). For a given set of observations, there are many possible
state sequences from the left-most non-emitting state to the right-most non-emitting
state. Take one possible sequence as an example, say X = 1; 2; 2; 3; 4; 4; 5; 6, assumed
for six frames of observations.

Figure 2.6 ASR using HMM
(Footnote: S. Young, G. Evermann, "The HTK Book", version 3.4, Cambridge University Engineering Department, p. 13, 2006.)

In Figure 2.6, when $o_1$ is read into the HMM, the model makes the jump from state 1 to
state 2; $a_{12}$ is equal to 1, and the Gaussian mixture probability in state 2 gives the
likelihood of state 2 given the observation $o_1$, which is $b_2(o_1)$. According to sequence X,
the model stays in state 2 for one more frame, with transition probability $a_{22}$ and the
likelihood of state 2 given the observation $o_2$, which is $b_2(o_2)$. When $o_3$ comes, the
sequence X jumps to the third state with transition probability $a_{23}$ and the likelihood of
state 3 given the observation $o_3$, which is $b_3(o_3)$. The summation of all outgoing
transition probabilities of a particular state is 1; in the case of state 2, the following
equation holds:

$$\sum_{j} a_{2j} = 1$$

Continuing in this way, the sequence X = 1; 2; 2; 3; 4; 4; 5; 6 is processed through the
six-state HMM phoneme model, and the likelihood of the observations belonging to this
particular phoneme model along the sequence X is given by

$$P(O, X \mid M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3) \cdots$$

Because multiple sequences exist in a single phoneme model, the likelihood that a
certain stretch of observations belongs to a phoneme model is the summation of the
likelihoods over all possible state sequences for that model:

$$P(O \mid M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)}$$

where X = x(1), x(2), x(3), ..., x(T).
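In practice the sum over all state sequences is computed with the forward algorithm rather than by enumeration. The sketch below (illustrative only; the transition and output probabilities are placeholders, not values from any model in this thesis) computes P(O|M) for a small left-to-right HMM with non-emitting entry and exit states, echoing the six-state example above.

```python
import numpy as np

def forward_likelihood(A, b):
    """P(O | M) for a left-to-right HMM with non-emitting entry/exit states.

    A : (N, N) transition matrix over all N states (state 0 = entry, N-1 = exit).
    b : (T, N) output probabilities b_j(o_t); columns 0 and N-1 are unused.
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    alpha[0, 1:-1] = A[0, 1:-1] * b[0, 1:-1]          # leave the entry state at t = 1
    for t in range(1, T):
        alpha[t, 1:-1] = (alpha[t - 1, 1:-1] @ A[1:-1, 1:-1]) * b[t, 1:-1]
    return alpha[-1, 1:-1] @ A[1:-1, -1]              # jump into the exit state

# Toy six-state model (entry, four emitting states, exit) and six frames of b_j(o_t).
A = np.array([[0, 1.0, 0,   0,   0,   0  ],
              [0, 0.6, 0.4, 0,   0,   0  ],
              [0, 0,   0.6, 0.4, 0,   0  ],
              [0, 0,   0,   0.6, 0.4, 0  ],
              [0, 0,   0,   0,   0.7, 0.3],
              [0, 0,   0,   0,   0,   0  ]])
b = np.full((6, 6), 0.1)                              # placeholder output probabilities
print(forward_likelihood(A, b))
```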
In word-level recognition, the phoneme model described above can be changed to
a word model. In a word model, the states are the concatenation of the state sequences of
every phoneme in the word, so the MAP search space is much larger.
When a language model is used, the P(w) term in the decision rule assigns a higher
score to frequently occurring words.
In the experiments, the HMM is trained by statistically adjusting the model
parameters to fit the training data. This requires a huge amount of data to train a
good model. Generally, an English acoustic HMM is trained with about 300
hours of speech; a Mandarin acoustic HMM requires even more, due to the larger
phone set of Mandarin.
Figure 2.7 Train HMM from Multiple Examples
(Footnote: K. C. Sim, "Acoustic Modelling Hidden Markov Model", Speech Processing, p. 8, 2010.)
2.3.4 Artificial Neural Network
A neural network, often referred to as an artificial neural network, is composed of artificial
neurons or nodes. When inputs are fed into the model, a weight is assigned to every
input node, and the summation of the scaled inputs is passed through a transfer function
at the activation node; mostly we use a log-sigmoid function, also known as a logistic
function, whose curve is shown in Figure 2.9. The outputs of these functions then become
the input nodes of the next layer, and so on; there can be several intermediate layers in
different neural network systems. In this research, we use a feedforward three-layer
neural network. In any case, every layer is essentially a simple mathematical model
defining a function of its inputs, but the model parameters are also intimately associated
with a particular learning algorithm or learning rule. The back-propagation algorithm is
currently used in many neural network applications.

Figure 2.8 Illustration of neural network
Figure 2.9 Sigmoidal function for NN activation node
A feedforward neural network is an artificial neural network in which connections
between the nodes do not form a directed cycle, unlike in recurrent neural
networks. (Figure 2.10)
Figure 2.10 feedforward neural network (left) vs recurrent neural network (right)
The back-propagation learning algorithm plays an important role in the neural network
training process; it can be divided into two phases: propagation and weight update.
Phase 1: Propagation
Each propagation involves the following steps: forward propagation of a training
pattern's input, so that one layer's output becomes the next layer's input until the final
layer's results are obtained; then backward propagation of the error, in which the final
layer's output is compared with the target, and the error terms of the previous layers are
calculated layer by layer from the current layer's error, ending when the first layer is
reached.
Phase 2: Weight update
For each weight, we multiply its output delta (the difference between the output and the
target output) by its input activation (the input node value) to get the gradient of the
weight. Then we move the weight in the opposite direction of the gradient by subtracting
a fraction of the gradient from the weight.
This fraction influences the speed and quality of learning; we call it the learning rate.
The sign of the gradient indicates the direction in which the error increases, which is why
the weight must be updated in the opposite direction.
Phases 1 and 2 are repeated until the performance of the network is good enough.
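The following compact Python sketch illustrates the two phases just described for a three-layer feedforward network with log-sigmoid activations trained by gradient descent. It is an illustration of the algorithm only, not the QuickNet implementation used later in this thesis; the layer sizes, learning rate and data are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Tiny three-layer network: 4 inputs -> 8 hidden units -> 3 outputs.
W1, b1 = rng.normal(scale=0.5, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 3)), np.zeros(3)
learning_rate = 0.5

X = rng.normal(size=(20, 4))                 # placeholder training patterns
Y = np.eye(3)[rng.integers(0, 3, size=20)]   # one-hot placeholder targets

for epoch in range(200):
    # Phase 1a: forward propagation, each layer's output feeding the next layer.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Phase 1b: back-propagate the error (output delta) towards the first layer.
    delta_out = (y - Y) * y * (1.0 - y)          # sigmoid derivative included
    delta_hid = (delta_out @ W2.T) * h * (1.0 - h)

    # Phase 2: weight update, stepping against the gradient by the learning rate.
    W2 -= learning_rate * h.T @ delta_out / len(X)
    b2 -= learning_rate * delta_out.mean(axis=0)
    W1 -= learning_rate * X.T @ delta_hid / len(X)
    b1 -= learning_rate * delta_hid.mean(axis=0)

print(float(np.mean((y - Y) ** 2)))              # squared error after training
```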
QuickNet is a suite of software that facilitates the use of multi-layer perceptrons
(MLPs) in statistical pattern recognition systems. It is primarily designed for use in
speech processing but may be useful in other areas.
2.3.5 ANN Methods in Speech Recognition
Recently, many research works have also taken advantage of multi-layer perceptron
(MLP) neural networks to train the acoustic model and to recognize speech using tandem
connectionist feature extraction, that is, using the output of a neural network classifier as
the input features for the Gaussian mixture models of a conventional speech recognizer
(HMM), so that the resulting system effectively has two acoustic models in tandem. A
hidden Markov model typically uses Gaussian mixture models to estimate the distributions
of decorrelated acoustic feature vectors that correspond to observations of the states of the
phonemes. In contrast, an artificial neural network model uses discriminative training to
estimate the probability distribution over states given the acoustic observations. The
traditional HMM is faster in training and recognition, and has better time alignment than
the neural network acoustic model, especially when the HMM is a context-dependent
model. On the other hand, the neural network model can capture state boundaries better,
and it is flexible to manipulate and modify for different applications.
The general idea of applying an ANN in language processing is to use the state
probability output of the ANN as the input of a standard hidden Markov model; the final
result can then be recognized directly using the HTK toolkit. In detail, we need to
convert the MFCC feature files into pfiles. The manipulation of pfiles and the neural
network training both require the QuickNet toolkit. The pfiles contain all the MFCC
coefficient information for every frame sample, and they have to be combined into a
single pfile. The neural network is trained towards target output values; therefore, we
first have to prepare a good acoustic model to obtain an alignment of the training data,
and then convert the alignment into the required format, which is an ilab file. (Figure 2.11)
Figure 2.11 ANN applied in acoustic model
(Footnote: The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models.)
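As a rough, hypothetical illustration of the tandem idea in Figure 2.11 (not the actual QuickNet/HTK pipeline used in this thesis), the sketch below turns per-frame state posteriors from a neural network into log, decorrelated features that a GMM-HMM front end could consume. The posterior matrix is random, and the state count and output dimensionality are assumptions.

```python
import numpy as np

def tandem_features(posteriors, num_dims=25):
    """Turn per-frame state posteriors into decorrelated tandem features.

    posteriors : (num_frames, num_states), rows summing to one.
    Returns a (num_frames, num_dims) feature matrix for a GMM-HMM recogniser.
    """
    logp = np.log(posteriors + 1e-10)            # log posteriors are closer to Gaussian
    centred = logp - logp.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:num_dims].T             # PCA projection for decorrelation

# Placeholder posteriors for 100 frames over 120 HMM states (e.g. 40 phones x 3 states).
rng = np.random.default_rng(2)
raw = rng.random((100, 120))
posteriors = raw / raw.sum(axis=1, keepdims=True)
print(tandem_features(posteriors).shape)         # (100, 25)
```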
2.4 Adaptation Techniques in Acoustic Model
2.4.1 SD and SI Acoustic model
With different training datasets, an acoustic model's performance differs.
A speaker-dependent (SD) acoustic model is an acoustic model that has been trained
using speech data from a particular person. An SD model will recognize that
particular person's speech well, but it is not a recommended model for recognizing
speech data from other people. However, an acoustic model trained on many speakers
can gradually move towards a speaker-dependent (SD) model when the adaptation
technique is applied.
A speaker-independent (SI) acoustic model is trained using speech data contributed by
a large population. A speaker-independent acoustic model may not perform as well as a
speaker-dependent acoustic model for a particular speaker, but it has much better
performance for the general population, since the SI model captures the general
characteristics of a group of people. Furthermore, the SD model captures intra-speaker
variability better than the SI model does, while the SI model captures the inter-speaker
variability better than the SD model does.
Usually, the dataset used in training deviates somewhat from the speech data being
tested, and the acoustic model performs worse as the deviation grows, for example,
American-accented versus British-accented English speakers, or native English speakers
versus non-native English speech.
To recognize speech from non-native speakers, we usually adapt a well-trained native
acoustic model using the limited non-native data, because native speech data can be
easily obtained from many open sources, while non-native speech data are rare and
diverse.
2.4.2 Model Adaptation using Linear Transformations (MLLR)
Maximum likelihood linear regression (MLLR) computes transformations that
reduce the mismatch between an original acoustic model and the non-native speech
data being tested. This calculation requires some adaptation data, which is speech data
similar to the test data. The transform matrices are obtained by solving a maximization
problem using the Expectation-Maximization (EM) technique. In detail, MLLR is a
model adaptation technique that estimates one or several linear transformations for the
mean and variance parameters of the Gaussian mixtures in the HMM. By applying the
transformation, the Gaussian mixture means and variances are shifted so that the
modified HMM can perform better on the testing dataset.
The transformation matrix estimated from the adaptation data transforms the
original Gaussian mixture mean vector to a new estimate,

$$\hat{\mu} = W\,\xi$$

where W is the $n \times (n+1)$ transformation matrix and $\xi$ is the original mean vector with
one more bias offset,

$$\xi = [\,w,\ \mu_1,\ \mu_2,\ \dots,\ \mu_n\,]^{T}$$

where w represents the bias offset; in the HTK toolkit it is fixed at 1.
W can be decomposed into

$$W = [\,b\ \ A\,]$$

where A is an $n \times n$ transformation matrix and b is a bias vector. This form of
transformation adapts the original Gaussian mixtures' means to the limited test data,
and the author will continue to explain how the variances in the Gaussian mixtures are
adapted.
Based on the standard in HTK, there are two ways to adapt the variances linearly.
The first is of the form

$$\hat{\Sigma} = B^{T} H\, B$$

where H is the linear transformation matrix to be estimated from the adaptation data and B
is the inverse of the Choleski factor of $\Sigma^{-1}$, which is

$$\Sigma^{-1} = C\,C^{T} \quad \text{and} \quad B = C^{-1}$$

The second way is

$$\hat{\Sigma} = H\,\Sigma\,H^{T}$$

where H is the $n \times n$ covariance transformation matrix. This transformation can be
easily implemented as a transformation of the mean and the features,

$$\mathcal{N}(o;\, \mu,\, H\Sigma H^{T}) = |A|\; \mathcal{N}(A o;\, A\mu,\, \Sigma)$$

where $A = H^{-1}$.
The transformation matrix is obtained by the Expectation-Maximization (EM)
technique, which can make use of the limited adaptation data to move the original
model a long way.
With an increasing amount of adaptation data, more transformation matrices can be
estimated, and each transformation matrix is used for a certain group of mixtures, which
is categorized in the regression class tree. For example, if only a small amount of data is
available, then maybe only one transform can be generated. This transformation is
applied to every Gaussian component in the model set. However, with more adaptation
data, two or even tens of transformations are estimated for different groups of mixtures in
the original acoustic model. Each transformation is then more specific, grouping the
Gaussian mixtures further into broad phone classes: silence, vowels, stops, glides,
nasals, fricatives, etc. Though the classification may not be so accurate when adapting to
non-native speech, since non-native speech is more confusable in phone classification.
Figure 2.12 Regression tree for MLLR
(Footnote: S. Young, G. Evermann, "The HTK Book", version 3.4, Cambridge University Engineering Department, p. 149, 2006.)
Figure 2.12 is a simple example of a binary regression tree with four base classes,
denoted as $\{C_4, C_5, C_6, C_7\}$. A solid arrow and circle mean that there is sufficient data
for a transformation matrix to be generated using the data associated with that class
(the threshold is usually defined by the researcher in the application), while a dotted line
and circle mean that there is insufficient data, for example nodes 5, 6 and 7. Therefore,
transformations are only constructed for nodes 2, 3 and 4, namely $W_2$, $W_3$ and $W_4$. The
data in group 5 will follow the transformation $W_2$, the data in groups 6 and 7 will share
the transformation $W_3$, while the data in group 4 will be transformed by $W_4$.
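To make the mean transform concrete, the sketch below applies regression-class-specific MLLR transforms of the form $W_r = [\,b_r\ A_r\,]$ to a set of Gaussian means, exactly as $\hat{\mu} = A\mu + b$ above. Estimating $W_r$ itself (the EM step) is not shown, and the transforms, class assignments and dimensions here are random placeholders rather than values from the thesis experiments.

```python
import numpy as np

def apply_mllr_mean(mean, W):
    """mu_hat = W xi, with W = [b A] (n x (n+1)) and xi = [1, mu] (bias offset w = 1)."""
    xi = np.concatenate(([1.0], mean))
    return W @ xi

def adapt_model(means, regression_class, transforms):
    """Adapt every Gaussian mean with the transform of its regression base class."""
    return np.array([apply_mllr_mean(mu, transforms[regression_class[i]])
                     for i, mu in enumerate(means)])

rng = np.random.default_rng(3)
n = 39                                             # feature dimensionality
means = rng.normal(size=(10, n))                   # 10 Gaussian components
regression_class = rng.integers(0, 3, size=10)     # component -> one of three transforms
transforms = {r: np.hstack([rng.normal(size=(n, 1)),                      # bias vector b
                            np.eye(n) + 0.01 * rng.normal(size=(n, n))])  # A near identity
              for r in range(3)}
print(adapt_model(means, regression_class, transforms).shape)   # (10, 39)
```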
2.4.3 Model adaptation using MAP
Using limited adaptation data to transfer the original acoustic model also can be
accomplished by the maximum a posteriori (MAP) adaptation technique. Sometimes,
MAP approach is referred as Bayesian adaptation.
MLLR is an example of what is called transformation based adaptation, the
parameters in a certain group of component model are transferred together with a single
transform matrix. In contrast to MLLR, MAP re-estimate the model parameters
individually. Sample mean values are calculated for the adaptation data. An updated
mean is then formed by shifting each of the original value toward the sample value.
In the original acoustic model, the parameters are informative priors generated from
previously seen data; they are speaker-independent model parameters. The update
formula for a single-stream system for state j and mixture component m is

$$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\,\bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\,\mu_{jm}$$

where $\tau$ is a weighting of the original model training data relative to the adaptation data,
$\bar{\mu}_{jm}$ is the mean estimated from the adaptation data only, and $N_{jm}$ is the occupation
likelihood of the adaptation data. If there is insufficient adaptation data for a phone to
reliably estimate a sample mean, the occupation likelihood will approach 0 and little
adaptation is performed.
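A one-function sketch of this MAP mean update follows; the value of $\tau$, the occupation counts and the means are placeholder numbers chosen only to show how the update interpolates between the prior mean and the adaptation-data mean.

```python
import numpy as np

def map_update_mean(prior_mean, adapt_mean, occupation, tau=20.0):
    """mu_hat = N/(N+tau) * adaptation-data mean + tau/(N+tau) * prior mean."""
    w = occupation / (occupation + tau)
    return w * adapt_mean + (1.0 - w) * prior_mean

prior = np.zeros(39)                       # speaker-independent mean (placeholder)
sample = np.full(39, 0.8)                  # mean estimated from adaptation data only
print(map_update_mean(prior, sample, occupation=5.0)[:3])    # little adaptation
print(map_update_mean(prior, sample, occupation=500.0)[:3])  # dominated by the data
```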
2.5 Lexical Model
Lexical Model is the model to form the bridge between the acoustic model and language
model, the lexical model defines the mapping between the words and the phone set.
Usually, it is a pronunciation dictionary. For multiple pronunciations of a word, the
probability of each pronunciation can be modeled too. (Table 2.2)
Table 2.2 The probability of the pronunciations

Word        Pronunciation              Probability
ABANDONED   ah b ae n d ah n d sil     0.5
ABANDONED   ah b ae n d ah n d sp      0.5
ABILITY     ah b ih l ah t iy sp       0.35
ABILITY     ah b ih l ah t iy sil      0.35
ABILITY     ah b ih l ah t iy sp       0.15
ABILITY     ah b ih l ah t iy sil      0.15
Pronouncing dictionaries are a valuable resource; when they are produced manually, they
may require a lot of investment. There are a number of commercial and public domain
dictionaries available; those dictionaries have different formats and use different phone
sets. Normally, we use the CMU dictionary, a machine-readable pronunciation
dictionary for North American English with a phone set of 39 phonemes, as a larger
phone set is currently difficult for machine learning to identify.
The following is an example of a dictionary used in HTK (Figure 2.13). The "</s>" entry
marks the sentence end and "<s>" the sentence start, and the empty output symbol "[]"
means that if the model recognizes a sequence of phones as most likely being the sentence
start or sentence end, it writes nothing to the output. If the acoustic model gives the highest
likelihood score to the phone sequence "ah sp" for the current observations, the recognizer
will output the word "A" in the output file.
Figure 2.13 Dictionary format in HTK
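A toy rendering of such a dictionary as a Python structure is shown below, with per-pronunciation probabilities as in Table 2.2. The ABANDONED entries are copied from the table above; the single-pronunciation entry for "A" is an assumed example added for illustration.

```python
import random

# Pronunciation dictionary: word -> list of (phone sequence, probability) pairs,
# mirroring Table 2.2 (the probabilities for a word sum to one).
LEXICON = {
    "ABANDONED": [("ah b ae n d ah n d sil", 0.5),
                  ("ah b ae n d ah n d sp", 0.5)],
    "A": [("ah sp", 1.0)],          # assumed entry, for illustration only
}

def pronunciations(word):
    """Return the possible phone sequences of a word with their probabilities."""
    return LEXICON.get(word.upper(), [])

def sample_pronunciation(word):
    """Pick one pronunciation at random according to the modelled probabilities."""
    phones, probs = zip(*pronunciations(word))
    return random.choices(phones, weights=probs, k=1)[0]

print(pronunciations("abandoned"))
print(sample_pronunciation("abandoned"))
```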
2.6 Language Model
A statistical language model assigns a probability to a sequence of m words; this models
the grammar of a sentence. The most frequently used language model in HTK
is the n-gram language model, which predicts each word in the sequence given its (n-1)
previous words.
The probability of a sentence can be decomposed as a product of conditional
probabilities:

$$P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$$

The n-gram model approximates each conditional probability so that it depends
only on the previous (n-1) words:

$$P(w_1, w_2, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$
Usually, we model bigrams or trigrams in our experiments.
The conditional probabilities are based on maximum likelihood estimates, that is, they
are obtained by counting the events in context in some given training text:

$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}$$

where C(·) is the count of a given word sequence in the training text.
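A minimal bigram estimator implementing this count ratio on a toy corpus is sketched below (no smoothing and no out-of-vocabulary handling, which the next paragraph touches on; the corpus sentences are invented examples).

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                  # counts of history words
        bigrams.update(zip(words[:-1], words[1:]))   # counts of word pairs
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

corpus = ["the cat sat", "the cat ran", "a dog sat"]
lm = train_bigram(corpus)
print(lm[("the", "cat")])     # 1.0: "cat" always follows "the" in this toy corpus
print(lm[("cat", "sat")])     # 0.5
```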
After the n-gram probabilities are stored in a database for the existing training data,
some words may be mapped to an out-of-vocabulary class (Footnote: when a word sequence
in the test data has not been seen in the training data, we define equivalence classes and
assign the new word sequence a probability from its class).
Sometimes, a more complicated language model can be achieved by constructing a
grammar-based word net:
Figure 2.14 The grammar based language model
(Footnote: K. C. Sim, "Statistical Language Model", Speech Processing, p. 34, 2010.)
In the grammar-based language model, the plural and singular forms, prefixes and so on
are defined manually according to standard English grammar, which is expensive and may
not be useful practically, since in non-native speech the grammar is poorly formed unless
the speech is read speech. In addition, the training data set of the LM should match the
test data set. For example, an LM trained on newspaper text would be a good predictor for
recognizing news reports, while the same LM would be a poor predictor for
recognizing personal conversations or speech in a hotel reservation system.
Chapter 3
Literature Review
3.1 Overview of the challenges in ASR for non-native speech
The history of spoken language technology is marked by the "SpeD" ("Speech and
Dialogue") conferences. After 2000, with the support of Academician Mihai Draganescu,
former President of the Romanian Academy, the organization of a conference in the field of
spoken language technology and human-computer dialogue (the first one was organized
in the 1980s) was resumed.
The 2nd edition of this conference (2003) showed the evolution from speech
technology to spoken language technology. Mihai Draganescu mentioned:
“This had to be foreseen however from the very beginning involving the use of artificial
intelligence, both for natural language processing and for acoustic - phonetic processes
of the spoken language. The language technology is seen today with two great
subdivisions: technologies of the written language and technologies of the spoken
language. These subdivisions have to work together in order to obtain a valuable and
efficient human-computer dialogue” [4].
In 2005, no dramatic changes had occurred in the research domain since the 2nd
conference. However, some trends became more and more obvious, and some new fields of
interest appeared promising for the future. Corneliu Burileanu summarized:
“We were able to identify a constant development of what is called “speech interface
technology” which includes automatic speech recognition, synthetic speech, and natural
language processing. We noticed commercial applications in computer command,
consumer, data entry, speech-to-text, telephone, and voice verification. Robust speaker-independent recognition systems for command and navigation in personal computers
were already available; telephone-based transaction and database inquiry systems using
both speech synthesis and recognition were coming into use” [5].
The 2007 SpeD edition is considered a very interesting and up-to-date analysis of
the achievements in the domain; it also presents the future trends identified at the "IEEE/ACL
Workshop on Spoken Language Technology, Aruba, Dec. 11-13, 2006". The following
research areas are strongly encouraged: spoken language understanding, dialog
management, spoken language generation, spoken document retrieval, information
extraction from speech, question answering from speech, spoken document
summarization, machine translation of spoken language, speech data mining and search,
voice-based human computer interfaces, spoken dialog systems, applications and
standards, multimodal processing, systems and standards, machine learning for spoken
language processing, and speech and language processing in the World Wide Web.
Biing-Hwang Juang and S. Furui present a summary of system-level capabilities for
spoken language translation:
• first dialog demonstration systems: 1989-1993, restricted vocabulary, constrained
speaking style, speed (2-10)x real-time, platform: workstations,
• one-way phrasebooks: 1997-present, restricted vocabulary, constrained speaking style,
speed (1-3)x real-time, handheld devices,
• spontaneous two-way systems: 1993-present, unrestricted vocabulary, spontaneous
speaking style, speed (1-5)x real-time, PCs/handheld devices,
• translation of broadcast news: 2003-present, unrestricted vocabulary, ready-prepared
speech, offline, PCs/PC clusters,
• simultaneous translation of lectures: 2005-present, unrestricted vocabulary,
spontaneous speech, real-time, PCs/laptops.
With recent developments in the spoken language technology domain, the target
research trends have become clearer, and the important challenges in those research
directions have appeared.
At the SpeD conference in 2007, Hermann Ney, from the Computer Science Department
of RWTH Aachen University, Germany, gave the "Closing Remarks: How to
Continue?". The main issue he emphasized in this domain is the interaction
between speech and NLP (natural language processing) people in many areas.
Since 2007, one important challenge in this domain has been "speech-to-speech
translation". The main issue is speech recognition improvement. One aspect of the main
issue is how to improve ASR for spontaneous, conversational speech in multiple
languages. Another aspect is that the translated text must be "speakable" for oral
communication, which means it is not enough to translate content adequately. A further
aspect is the cost-effective development of new languages and domains. The last aspect
is the challenge of translating intonation.
In the 4th SpeD edition (2007), an important field of research is multilingual
spoken language processing, cited from "Multilingual Spoken Language Processing":
“With more than 6,900 languages in the world and the current trend of globalization,
one of the most important challenges in spoken language technologies today is the need
to support multiple input and output languages, especially if applications are intended
for international markets, linguistically diverse user communities, and non-native
speakers. In many cases, these applications have to support multiple languages
simultaneously to meet the needs of a multicultural society. Consequently, new
algorithms and tools are required that support the simultaneous recognition of mixed-language input, the summarization of multilingual text and spoken documents, the
generation of output in the appropriate language, or the accurate translation from one
language to another” [6]
3.2 The solutions for non-native speech challenges
As illustrated in Section 3.1, spoken language technology increasingly emphasizes
improving ASR for spontaneous, conversational speech in multiple
languages. With globalization, foreign accent is an especially crucial problem that
ASR systems must address. Recognizing foreign-accented English speech with
performance similar to that on native-accented English speech is the challenge for the
acoustic model; foreign accents introduce disturbances at both the state level and the phone level.
The first challenge for the non-native acoustic model is that we cannot train it directly
from scratch, due to the difficulty of collecting enough non-native speech
data (about 300 to 500 hours). Usually, there are many open data sources for native
speech, for example the Wall Street Journal, Broadcast News and Newswire
Stories speech databases, and speech corpora from some research programs. Unfortunately,
there is a lack of such broadcast or recording resources for non-native speech. When a native
acoustic model is used to recognize non-native speech, the word error rate is usually about 2 to 3
times that of native speech [7]. Before the non-native acoustic model can be improved
to a high level, the lexical model and language model cannot be expected to perform
well.
Therefore, the author wants to focus her research on improving the acoustic model for
non-native speech, especially on recognizing Mandarin-accented non-native speech with
English as the target language.
In fact, the variation among non-native speakers, even with the same motherland
accent, is very large. Those differences are characterized by different levels of fluency
in pronunciation, different levels of familiarity with the target language, and different
individual mistakes in pronouncing unfamiliar words.
The presence of a multitude of accents among non-native speakers is unavoidable
even if we ignore the levels of proficiency, and this dramatically degrades ASR
performance. As mentioned in Section 3.1, the spoken language technology domain only
advocated research to address these challenges about three years ago, but at the beginning
of the 21st century some research tackling these challenges had already emerged. The most
straightforward approach is to train a standalone non-native acoustic model on non-native
speech data [8]; however, as just explained, non-native speech data is rarely publicly
available. Another approach is to apply general adaptation techniques such as
MLLR and MAP with some testing data, by which the baseline acoustic model is
modified toward a foreign accent [9]. Some researchers are working on multilingual
HMMs for non-native speech [10]. Some researchers find methods to combine the native
and limited non-native speech data, such as interpolation [11]. Some researchers apply
both recognizer combination methods and multilingual acoustic models to non-native
digit recognition [12].
In 2001, Tomokiyo wrote a dissertation, "Recognizing non-native speech:
Characterizing and adapting to non-native usage in LVCSR" [13], which carries out
detailed research to characterize low-proficiency non-native English spoken by Japanese
speakers. Properties such as fluency, vocabulary and pace in read and spontaneous speech
are measured for both general and proficiency-controlled data sets.
Figure 3.1 Summary of acoustic modeling results
Then Tomokiyo explores methods of adapting to non-native speech. A summary of
the individual contributions of each adaptation method is shown in Figure 3.1. Using
data of Japanese-accented English and native Japanese, together with the allophonic decision
tree from the previous characterization step, both MLLR and MAP adaptation
are applied with retraining and interpolation at the end, and a 29% relative WER reduction is
achieved over the baseline.
Tomokiyo's research shows us that non-native speech is very diverse, even when the
research is restricted to a specific source language, proficiency level and mode of speech.
From the characterization results, we see tremendous intra- and inter-speaker variation in the
production of spoken language. The study also shows that non-native speakers
sometimes generate common patterns and sometimes generate unique events that
defy classification. Nevertheless, this dissertation showed that by using a small amount
of non-native speech data, the recognition error for non-native speakers can be
effectively reduced.
Also in 2001, Wu and Chang discuss approaches in which a few test-speaker
sentences are used to modify already-trained SI models into a customized model,
presented in "Cohorts based custom models for rapid speaker and dialect adaptation" [14].
(Footnote: Tomokiyo, L. M. (2001). Recognizing non-native speech: Characterizing and adapting to non-native usage in LVCSR. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.)
It is well known that speaker-dependent acoustic models perform much better than
speaker-independent acoustic models, as inter-speaker variability is reduced; about a 50%
reduction can be achieved with an SD model compared with an SI model. In the paper, Wu
and Chang present an approach that uses as few as three sentences from the test speaker
to select the closest speakers (cohorts) from both the original training set and newly
available training speakers to construct customized models.
Firstly, the parameters of the speaker-adapted model for each "on-call" speaker are
estimated by the MLLR technique described in [15]. The authors adopt only two
criteria that can directly reflect the improvement in system performance to select
cohorts. The first one is the accuracy of the enrollment data (3 testing sentences) in a
syllable recognition task. The second one is the likelihood of the adaptation data after
forced alignment against the true transcriptions, which are the true transcript texts of the 3
testing sentences prepared in advance. With the enrollment data, the speakers are sorted
according to their likelihood, and the top N speakers with the highest likelihood are picked as
the cohorts. The final cohort list is tuned according to both the syllable accuracy and
the likelihood. The data from the speakers in the cohort are used to enhance the model
of the test speaker in many ways, such as retraining, MAP or MLLR.
(Figure 3.2)
Figure 3.2 Procedure for constructing the custom model
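The cohort selection itself is essentially a ranking problem. The following is a minimal sketch of the idea, assuming each candidate training speaker has already been scored by the forced-alignment likelihood of the enrollment data under that speaker's adapted model and by syllable accuracy; the function names and the way the two criteria are combined are illustrative, not Wu and Chang's actual implementation.

    def select_cohorts(candidates, top_n=10):
        """Pick the top-N cohort speakers for a test speaker.

        candidates: list of dicts like
            {"speaker": "spk01", "loglik": -1234.5, "syllable_acc": 0.62}
        where loglik is the forced-alignment log-likelihood of the enrollment
        sentences under that speaker's adapted model.
        """
        # Primary criterion: likelihood of the enrollment data.
        ranked = sorted(candidates, key=lambda c: c["loglik"], reverse=True)
        cohorts = ranked[:top_n]
        # Secondary tuning: keep only cohorts whose syllable accuracy is not
        # far below the best one (the 0.1 margin is an illustrative choice).
        best_acc = max(c["syllable_acc"] for c in cohorts)
        return [c for c in cohorts if c["syllable_acc"] >= best_acc - 0.1]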
The results from Wu and Chang show that the cohort-based custom model can sometimes achieve better performance even than the SA model (MLLR adaptation with 170 sentences), and a relative error reduction of 22% was obtained. A notable inspiration from their approach is that the adaptation scheme can be updated online without reconfiguring any parameters.
In 2003, Zhirong Wang et al. present a paper, "Comparison of acoustic model adaptation techniques on non-native speech" [16], with a thorough comparison of four different adaptation methods on non-native English data from German speakers. The four methods are bilingual models, speaker adaptation, acoustic model interpolation and Polyphone Decision Tree Specialization.
Firstly, they train a native English model and a German-accented English model on limited data. The WER of the native model is 16.2% on native test data and 43.5% on non-native test data, and the WER of the non-native model on non-native test data is 49.3%. When the native and non-native data are pooled together, the pooled model gives a WER of 42.7%.
A bilingual acoustic model had been trained earlier [17] with the English part of Verbmobil (ESST) and the German part of Verbmobil (GSST); ESST is the English speech data collected for the Verbmobil project, a long-term research project aimed at automatic speech-to-speech translation between English, German and Japanese, and GSST is the corresponding German speech data. The bilingual model is designed to improve the robustness of the recognizer against the accent of non-native speakers. The common phone set for English and German is derived by a knowledge-based (IPA) approach. The best WER obtained with the bilingual model on non-native data is 48.7%.
MLLR and MAP adaptation techniques are also investigated in the paper (Figure 3.3). With a sufficient amount of adaptation data (more than about 20 minutes), MAP reduces the WER more than MLLR, and for both techniques better adaptation is achieved by increasing the amount of data per speaker.
Figure 3.3 MAP and MLLR adaptation
Acoustic model interpolation produces a single output by weighted averaging of the PDFs of several models. In the pooled training, the amount of non-native speech data is very small compared with the native data. By adjusting the weight given to the non-native model, the interpolation reaches an optimum WER at a certain weight (Figure 3.4).
Figure 3.4 Performance with various interpolation weights
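As a minimal sketch of the idea, interpolation can be viewed as a per-state convex combination of the output densities of a native and a non-native model; the weight lam below corresponds to the interpolation weight swept in Figure 3.4, and the Gaussian-mixture representation of each model is an illustrative assumption rather than the authors' exact setup.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_likelihood(x, weights, means, covs):
        """Likelihood of observation x under one state's Gaussian mixture."""
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                   for w, m, c in zip(weights, means, covs))

    def interpolated_likelihood(x, native_gmm, nonnative_gmm, lam):
        """Weighted average of the native and non-native output PDFs of a state.
        native_gmm / nonnative_gmm are (weights, means, covs) tuples;
        lam is the weight on the non-native model (0 <= lam <= 1)."""
        return (1.0 - lam) * gmm_likelihood(x, *native_gmm) + \
               lam * gmm_likelihood(x, *nonnative_gmm)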
There is a large mismatch in context between the speech of native speakers and that of non-native speakers: in the decoding process, a context decision tree built from native speech is used to model the context of non-native speech. Zhirong Wang et al. adopt the Polyphone Decision Tree Specialization (PDTS) [18] method for adapting a decision tree to a new language. In this approach, the clustered multilingual polyphone decision tree continues the tree-growing process with a limited amount of adaptation data. The best WER obtained with PDTS is 35.5%.
The results of all the adaptation methods are compared in Figure 3.5. Based on this paper, MAP outperforms MLLR when more than about 20 minutes of adaptation data are available, interpolation outperforms MAP, and even better recognition results can be achieved with PDTS.
Figure 3.5 Best results of various systems
In 2003, Ayako Ikeno et al. published a paper, "Issues in recognition of Spanish-accented spontaneous English" [19], which describes the University of Colorado Large Vocabulary Speech Recognition system Sonic [20], compares the results obtained with different acoustic models and language models, and presents a method to characterize the full, non-schwa vowels of Spanish non-native English speakers.
In their research, in order to obtain Spanish-accented spontaneous English, they design conversational scenarios for data collection in which the responses from the speakers are spontaneous.
Table 3.1 Word error rate for the CSLR Sonic Recognizer
In 2002, Sonic was ported from English to the Spanish, Turkish, and Japanese languages. Sonic is a multi-pass recognition system and has been shown to have recognition accuracy competitive with other recognition systems; its performance is shown in Table 3.1.
During each recognition pass in the Sonic system, a voice activity detector (VAD) is dynamically updated using feedback from the currently adapted acoustic model, and the means and variances of the acoustic model are modified in an unsupervised way (Figure 3.6). The first pass uses gender-dependent acoustic models; the second pass uses vocal-tract-length-normalized models.
In this paper, the authors compare different acoustic models and language models for recognizing Spanish-accented speech. Two sets of acoustic models were used, one trained on the Wall Street Journal corpus and one trained on the accented speech. Three language models were selected: WSJ, Switchboard, and a language model trained on the accented data (Table 3.2).
Figure 3.6 Diagram of Hispanic-English multi-pass recognition system
Table 3.2 Word Error Rate (%) by Models
The paper also presents ways to capture the characteristically different vowel lengths of the non-native speakers. They first determine the average duration of reduced vowels, and then normalize these values by the average vowel duration in accented and native speech respectively. The normalized phoneme durations can later be used to modify the lexicon model so that it captures the full, non-schwa vowels of the Hispanic-English speakers.
In recent years, there has been notable research on non-native speech. Some papers propose transforming or adapting acoustic models based on pronunciation variation [21][22]. Other work proposes adjusting a language model built from native speech to the speaking style of non-native speakers [23]. Combining the two approaches, some hybrid modeling approaches have emerged [24].
The paper published by Yoo Rhee Oh (2009) [25] uses pronunciation variants extracted from non-native speech to modify both the pronunciation model and the acoustic model. The results do not show much improvement, but the paper presents pronunciation variant rules that map each target phoneme to a variant class.
Non-native speakers produce target pronunciations that are heavily biased by their mother-tongue accent. Inspired by this observation, many researchers have devised methods to integrate some source-language information into the acoustic model in order to improve the recognition result.
Some research focuses on mapping the source phones to the target phones; here we take Mandarin as the source language and English as the target language. The phone information of a well-trained Mandarin acoustic model is then added to the mapped English phones. By doing so, a Mandarin-accented pronunciation of an English phone can be accepted by the modified acoustic model.
In 2009, a research group, Qingqing Zhang, Jielin Pan and Yonghong Yan, published a paper in ISNN 2009, "Non-native Speech Recognition Based on Bilingual Model Modification at State Level" [26]. In their research, they first train good Mandarin and English acoustic models, both Hidden Markov Models. Using the initial English model, a phrase error rate of 46.9% is obtained. Pooling, MLLR and MAP adaptation techniques are then applied to the English model and compared; MAP is the best, with a 34.3% phrase error rate. After that, a state-level mapping algorithm is used to modify the adapted English model.
In the algorithm, the (adapted) English and Mandarin acoustic models are both used to recognize the same non-native speech data, yielding two parallel state-level time alignments. By setting a threshold, for example treating a 60% time overlap between a particular Mandarin state and an English state as a match, they count the actual number of matches (Figure 3.7). They then compute the probability that a particular Mandarin state matches a particular English state, and merge the information of the best n matching Mandarin states into the adapted English state. The resulting modified model recognizes the non-native speech with a phrase error rate of 31.6%.
Figure 3.7 Example of the match between state Target and Source
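A minimal sketch of the counting step is given below; it assumes each alignment is available as a list of (state, start_frame, end_frame) tuples and that a Mandarin state is matched to an English state when their frame overlap exceeds a threshold fraction of the Mandarin state's duration. The data layout and the normalization are illustrative, not the authors' implementation.

    from collections import defaultdict

    def state_match_counts(mandarin_align, english_align, threshold=0.6):
        """Count how often each Mandarin state overlaps each English state.

        Each alignment is a list of (state, start_frame, end_frame) tuples
        obtained from the two parallel decodings of the same utterance.
        """
        counts = defaultdict(lambda: defaultdict(int))
        for m_state, m_start, m_end in mandarin_align:
            m_dur = m_end - m_start
            for e_state, e_start, e_end in english_align:
                overlap = min(m_end, e_end) - max(m_start, e_start)
                if m_dur > 0 and overlap / m_dur >= threshold:
                    counts[m_state][e_state] += 1
        return counts

    def match_probabilities(counts):
        """Normalize counts into P(english_state | mandarin_state)."""
        probs = {}
        for m_state, row in counts.items():
            total = sum(row.values())
            probs[m_state] = {e: c / total for e, c in row.items()}
        return probs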
Among the approaches presented here, some research has indeed addressed particular problems in non-native ASR, but most of it does not deviate much from the adaptation techniques of MLLR, MAP or their combination. MAP tends to perform better with a larger amount of adaptation data (about 20 minutes). Although the interpolation method shows better performance than MAP, it requires feedback from the final recognition results, which is not practical in implementation.
Some other methods show even better results, such as Polyphone Decision Tree Specialization (PDTS), but PDTS is complex and needs a large amount of characterizing information.
Wu and Chang (2001) discuss the use of very little speech data from the test speakers to largely improve recognition results, which may be very helpful in realistic applications. The group of Qingqing Zhang (2009) proposes an algorithm to map Mandarin phonemes to English phonemes; a further small improvement over the MAP technique can be achieved, but the dynamic programming step takes a relatively long time to process. The Large Vocabulary Speech Recognition system Sonic from the University of Colorado integrates almost all the cutting-edge techniques in this field.
Chapter 4
Methods and Results
Although English is learned as a second language in many countries, there are many differences between non-native English speech and native English speech.
First, non-native speakers tend to pronounce phonemes differently: they may use a reduced or enlarged phone set, speak at a lower or higher rate, and make mistakes more frequently. As a result, a native acoustic model cannot recognize the phones pronounced by non-native speakers properly. This is what the author addresses in her first and second projects, and it is a crucial step in improving non-native speech recognition results.
Second, non-native speakers may also pronounce a word with a different phoneme sequence, inserting or deleting some phonemes; to address this problem the native lexicon model must be modified. The author modifies the native dictionary in her third project, which is a very difficult part of the whole ASR system.
Third, non-native speakers tend to organize their words with poor grammar, which means that the native English language model will perform poorly on non-native spontaneous speech. The author has not addressed this part yet.
4.1 Non-Native English Data Collection
As mentioned previously, there is a lack of public resource for non-native speech. And
for different group of non-native speakers, the speaking style, pronunciation and accent
are varied with the language familiarity level and the hometown language accent.
However, it requires expensive investment to collect enough non-native speech for
training of an acoustic model. Usually, a good native English acoustic model is trained
by a about 300 to 500 hour speech data, which can be easily obtained from public
resource, such as BBC broadcast, Walk Street Journal. However, we cannot find any
well recorded non-native broadcast or media. Therefore, we usually will collect the nonnative speech in the lab for the research.
The author has collected some Mandarin accented non-native speech for her later
research. The 6 participants (three male, three female) were from China, and all had
studied in English spoken University at least one year and had a median ability to
understand, speak and read English. Each speaker read out the same 630 sentences in a
room with little noise with the same microphone, sound card and software.
To collect the data the author use a speech recorder software designed for the HMM
research purpose, with high definition and will generate the required scripts during
recording. The command used is shown in Appendix Command 1.
The inputs and outputs of this command are shown in Figure 4.1.
Figure 4.1 The voice recording command inputs and outputs
After the recording, some of the utterances (i.e., spoken sentences) were just silence or not properly recorded, so those speech files and their associated transcriptions (the sentence texts given to the speakers to read) were removed. Only 500 of the 630 sentences were selected per speaker. The detailed amount of data is shown in Table 4.1.
Table 4.1 Non-native English collection data

    Speaker      Utterances   Time
    Speaker A    500          1.18 hr
    Speaker B    500          0.85 hr
    Speaker C    500          1.04 hr
    Speaker D    500          0.90 hr
    Speaker E    500          0.93 hr
    Speaker G    500          0.98 hr
    Total Time                5.89 hr
Though the collected data amounts to only about one hour per speaker, the actual recording time was about three hours per speaker, which was very time- and energy-consuming for the participants.
Data collection is a crucial step for the success of the subsequent non-native speech research. First, the amount of data must be large enough to achieve significant results. Second, the speakers must be chosen wisely for the purpose of the research; since the author wants to study non-native speech, she chose non-native participants with similar familiarity and fluency in the target language (the language the ASR system attempts to recognize). Third, carefully monitoring the collected data and removing empty or faulty recordings reduces unnecessary work in later research.
4.2 Project 1: Mixture-level mapping from the Mandarin acoustic model to the English acoustic model
4.2.1 Overview of the Method in Project 1
Project 1 focuses only on reducing the acoustic errors of non-native speakers; the task is to train a better acoustic model for Mandarin-accented non-native test data with limited non-native training data (less than 6 hours). Adaptation techniques therefore have to be applied to modify the native English model with such limited adaptation data.
In this project, we rely on the knowledge that most non-native speakers tend to use phonemes from their mother tongue when they speak the foreign language. Thus, we expect an improvement when information from the mother-tongue acoustic model is integrated into the native English acoustic model. Hence, the source-language acoustic model (the source language being the mother tongue of the non-native speaker) is also used in this project. The limited non-native speech data the author collected in Section 4.1 is used in this project and in the second project: a portion of it serves as adaptation and retraining data, and the remaining non-native speech data is used as test data for recognition.
In this project, there are three parallel processes (Figure 4.2). The first process uses the native English acoustic HMM to recognize the non-native speech data; we then apply the adaptation technique to the native English model and evaluate its recognition result on the non-native speech again; after that, we apply adaptive training to the acoustic model and use the final model to recognize the non-native speech.
The second process inserts Mandarin acoustic model information into the English model. As mentioned in Section 2.3, in an acoustic hidden Markov model (HMM) the top level of the model consists of the phonemes of a specific phone set. Each phoneme consists of multiple emitting states, each capturing the characteristics of a portion of the phoneme. Each state consists of a number of Gaussian mixtures (in this experiment, the author investigates HMMs with 1, 2, 4 and 8 mixtures per state), and the shape formed by the combination of these Gaussian mixtures determines the model of that state. In this project, the author inserts all the mixtures of a native Mandarin model into every state of an empty English model. As the author has native Mandarin models with different numbers of mixtures per state, she obtains five differently modified models in this way. All of these models are then retrained with the adaptation data and used to recognize the test data at the phone level. The best modified model is chosen, and MLLR adaptation is applied to it as the modified HMM model. The author then uses the adapted model to recognize the test data again and records the result. Lastly, adaptive training is applied to the adapted modified HMM model, and the final model is used to recognize the test data, with the result recorded.
Figure 4.2 Project 1 process flow (compare the phone recognition results of the three processes and see the improvement from left to right)
The third process is very similar to the second process. The only difference is that the Mandarin HMM mixtures are not inserted into an empty English HMM, but into the model obtained at the final stage of process 1. At that stage, the native English HMM has been retrained and adapted, so the mixtures in each of its states already contain English acoustic information and have been trained towards the non-native speakers' acoustic features. The author assigns some weight to the mixtures that already exist in each English HMM state and distributes the remaining weight over the mixtures from the whole Mandarin model. After that, all the steps are the same as in process 2: retraining and then adaptive training. The MLLR adaptation is in fact carried out in the first iteration of the adaptive training.
4.2.2 Step Details in Project 1
In this project, the author prepared the following data, models and tools for the implementation of all the processes:
- A well-trained native English HMM model (trained on about 300 hours of speech)
- 5 well-trained native Mandarin HMM models (from 1 mixture per state to 8 mixtures per state)
- The non-native speech data collected in the lab (70% used for adaptation and retraining, 30% used for testing)
- The Hidden Markov Model Toolkit (HTK), a portable open-source toolkit (for non-commercial use) for building and manipulating hidden Markov models
In the first process, the collected speech waveform files first have to be converted into MFCC feature files; the command used is given in Appendix Command 2. The inputs and output of this command are shown in Figure 4.3.
Figure 4.3 The inputs and output of the MFCC feature extraction
"HCopy" is an HTK tool that extracts the feature file for a given speech waveform file according to the specification in a configuration file. In the configuration file, the author has to specify, for example, the window size and the frame period. The script for wav-to-MFCC mapping specifies which waveform file is converted into which feature file, with directory and file name details.
Since the author already has a well-trained native English acoustic HMM model on hand, she can recognize the non-native speech in the test data set directly. The recognition command is given in Appendix Command 3. The inputs and outputs of this command are shown in Figure 4.4.
"HVite" is also a tool from the HTK toolkit. The configuration file here differs from that of the previous command, as it specifies the kind of test feature file at the input and whether context information is taken into account in the recognition process. The phone loop describes the allowed transitions between phones; normally it is constructed so that every phoneme can be followed by any other phoneme. The phone dictionary file is very simple: it just maps each phone symbol to its phoneme model. The monophone list is a list of all the phoneme models in the HMM set. "-s" is the grammar scale factor, which needs to be adjusted for word-level recognition. "-p" is the insertion penalty, which discourages insertion errors in the final recognition result. "-t" is the pruning threshold: lattice paths whose score falls below the threshold are pruned, so a tighter threshold makes recognition faster but lowers the output accuracy. Usually a moderate "-t" value is good enough and does not affect the final result much.
Figure 4.4 The inputs and output of the speech recognition command
After recognition, we can measure the phone-level accuracy or error rate against the known phone-level transcription of the test data. The command is given in Appendix Command 4; it prints the phone accuracy of the output directly on the screen.
Applying the MLLR adaptation technique is a little more complex, as it requires the generation of a regression tree. The command to generate the regression tree is given in Appendix Command 5.
The inputs and outputs of this command are shown in Figure 4.5. The configuration file here specifies the directory of the global configuration file and the maximum number of regression tree terminals. The global configuration file specifies how the model components are classified into groups (base classes). The output regression base-class file gives the index of each group and the model components under it, and the regression tree file labels each of its nodes with a base-class index.
Figure 4.5 The inputs and outputs of regression tree generation
The linear transformation adaptation matrices are generated by the two commands given in Appendix Command 6.
The inputs and outputs of the two commands are shown in Figure 4.6. These two commands are cascaded to generate a transform matrix for every regression-tree terminal that has sufficient data. The "-u" option must be set to the flag "a", which makes the tool HERest estimate linear transforms from the adaptation data so that the final transform matrices can be generated. The mask tells the software which part of the file name identifies the speaker, so that speaker dependence is captured during adaptation and different transform matrices are generated for different speakers.
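Conceptually, each generated transform is an affine transformation of the Gaussian means of one regression class. The sketch below illustrates what applying such a mean transform amounts to; it is a simplified picture (mean-only MLLR with a dense transform) rather than HTK's internal implementation.

    import numpy as np

    def apply_mllr_mean_transform(means, W):
        """Apply an MLLR mean transform to the Gaussian means of one regression class.

        means: array of shape (num_gaussians, dim)
        W:     transform of shape (dim, dim + 1), i.e. W = [b | A] acting on the
               extended mean vector [1, mu].
        Returns the adapted means A @ mu + b for every Gaussian.
        """
        extended = np.hstack([np.ones((means.shape[0], 1)), means])  # [1, mu]
        return extended @ W.T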
Figure 4.6 The inputs and outputs of the MLLR adaptation commands
Recognition with the adapted HMM model only requires some changes to the recognition command (Appendix Command 7).
The first HERest run generates the basic transform matrices for MLLR adaptation, and the second HERest run generates the transform matrices that can be used in the recognition process. In adaptive training we use the transforms generated by the second HERest run: the HMM model parameters are re-estimated on the adaptation data so as to reduce the differences between speakers captured by the transforms. In this experiment, the author restricts the system to updating only the mixture weights. The command is shown in Appendix Command 8.
49
“-u” choses the flag “w” will control the retraining to only update the Gaussian
mixtures weight, thus the mean and variance are kept as the original model. “-w” set as 2
will prevent some weight is decreased too much that the component will be get rid of in
the output HMM file, and the weight cannot decrease any more when it reaches a
threshold. In the adaptive training, after the new HMM model is obtained, the HERest2
is running again for the new HMM model, and then repeat the adaptive training again
using the new HMM model and new transform as the input, and so on so forth. The
recognition process for model after adaptive training is the same as the MLLR adapted
model.
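The weight-only update can be pictured as re-normalising the accumulated mixture occupation counts of each state and flooring them so that no component is pruned away. The following sketch is an illustrative analogue of the "-u w" / "-w 2" behaviour, not HTK's exact update formula; the floor value is an assumption.

    import numpy as np

    def reestimate_mixture_weights(occupancy, floor=1e-5):
        """Weight-only Baum-Welch style update for one state.

        occupancy: accumulated occupation counts gamma(s, m) summed over frames,
                   one entry per Gaussian component of the state.
        The means and variances are left untouched.
        """
        weights = occupancy / occupancy.sum()
        weights = np.maximum(weights, floor)   # keep every component alive
        return weights / weights.sum()         # re-normalise after flooring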
Figure 4.7 Illustration of the modified model
In the second process, the author inserts the Gaussian mixtures from a native Mandarin model into every state of an empty native English model. This is the most delicate part of this research, as it attempts to transfer the acoustic information of the Mandarin HMM structure into the English HMM structure. Figure 4.7 shows an illustration: the left side is the native Mandarin HMM model, with mixtures under each of its states, and the right side is the empty English HMM model, which has no mixtures under its states. The author modifies the empty English HMM model by inserting all the well-trained Gaussian mixtures from the native Mandarin HMM model into every empty English state. The Gaussian mixtures copied from the Mandarin states are assigned a constant weight, which ensures that the sum of all the mixture weights in one English state is 1. This can be seen more clearly from the composition of the likelihood output:
p(o_t \mid s) = \sum_{m=1}^{M} c_{s,m} \, p_m(o_t)

where M is the total number of mixtures taken from the Mandarin acoustic model (for example, there are in total 135 mixtures from the 1-mixture-per-state Mandarin model, i.e. 45 Mandarin phonemes with 3 states per phoneme) and all the initialized weights of the mixtures are c_{s,m} = 1/M. The likelihood of the m-th mixture (expert) of state s, p_m(o_t), is given by a Gaussian distribution:

p_m(o_t) = \mathcal{N}(o_t; \mu_m, \Sigma_m) = \frac{1}{(2\pi)^{d/2} |\Sigma_m|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(o_t - \mu_m)^{\top} \Sigma_m^{-1} (o_t - \mu_m)\Big)
A Perl script written by the author performs this modification.
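The author's actual implementation is a Perl script that edits the HTK model definition file; the Python sketch below only illustrates the underlying operation on a simplified in-memory representation of the models. The optional existing_weight argument corresponds to process 3, where some weight mass is kept for the mixtures already present in the English state; all names and the data layout are illustrative assumptions.

    def insert_mandarin_mixtures(english_states, mandarin_mixtures, existing_weight=0.0):
        """Insert all Mandarin Gaussian components into every English state.

        english_states:    dict state_name -> list of (weight, mean, var) tuples
                           (empty lists for the 'empty' English model of process 2).
        mandarin_mixtures: flat list of (mean, var) tuples gathered from every state
                           of the native Mandarin model (e.g. 135 for the
                           1-mixture-per-state model).
        existing_weight:   total weight kept for the mixtures already in the
                           English state (0.0 reproduces process 2).
        """
        m = len(mandarin_mixtures)
        shared = (1.0 - existing_weight) / m          # constant weight per Mandarin mixture
        for state, mixtures in english_states.items():
            # rescale any existing English mixtures so their weights sum to existing_weight
            old_total = sum(w for w, _, _ in mixtures)
            rescaled = [(w / old_total * existing_weight, mu, var)
                        for w, mu, var in mixtures] if old_total > 0 else []
            # append every Mandarin mixture with the shared uniform weight
            english_states[state] = rescaled + [(shared, mu, var)
                                                for mu, var in mandarin_mixtures]
        return english_states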
Different native Mandarin models are used in the modification process: Mandarin HMMs with 1, 2, 4, 8 and even more mixtures per state are investigated in this project. Every modified model is then retrained with the adaptation data (Appendix Command 9).
In that command, the flag "w" is used for "-u", which means that only the Gaussian mixture weights are updated in the output HMM model file. This retraining is iterated about 7 to 8 times, and the final HMM model is used by HVite to recognize the test data, with the result recorded. This retraining process is the so-called Baum-Welch re-estimation. Note that this retraining command does not include "-w 2", so there is nothing preventing a mixture weight from dropping low enough to be pruned off. Thus, during retraining, the number of mixtures per state keeps shrinking in later iterations; the author calls this the self-shrink mechanism, which helps to remove redundant mixtures automatically.
After the retraining, the modified model with the best recognition result is chosen, and the MLLR adaptation and adaptive training steps are repeated as described in process 1.
The third process differs from the second process only in its first step: the author inserts the Gaussian mixtures from a native Mandarin model into every state of a non-empty native English model, namely the same model used in process 1. In addition, some weight has to be assigned to the existing native English mixtures during the modification (as illustrated by the existing_weight argument in the sketch above). After this, retraining, model selection and adaptive training follow one after another.
4.2.3 Project 1 Results
Figure 4.8 shows the results of Project 1 at each step of every process. The native English acoustic model has the poorest recognition result for non-native speech, a phone error rate of 71.8%. MLLR adaptation and adaptive training each contribute some improvement to recognition performance, in process 1 as well as in process 2. Process 2 has a lower phone error rate than process 1 after the same techniques are applied, which suggests that it is better to use Mandarin model mixtures than English model mixtures to recognize Mandarin-accented speech. Process 3 shows that a further improvement can be achieved by combining the mixtures of both models.
Figure 4.8 The results for Project 1
Table 4.2 shows that the modified (initially empty) English model achieves a lower PER when it is built from a more complex Mandarin model. However, in the experiment the 16- and 32-mixture-per-state Mandarin models could not be processed in practice, because the initial weight assigned to each mixture was too small and the mixtures were easily pruned away during the first training round.
Table 4.2 Mandarin-mixture-modified empty English model

    Mandarin Model       Initial English Model   English Model after retraining
    1-mixture Mandarin   135 mixtures            Eng_M1: 62 mixtures, PER 72.56%
    2-mixture Mandarin   282 mixtures            Eng_M2: 103 mixtures, PER 69.62%
    4-mixture Mandarin   564 mixtures            Eng_M4: 117 mixtures, PER 68.52%
    8-mixture Mandarin   1128 mixtures           Eng_M8: 237 mixtures, PER 67.09%
4.3 Project 2: PoE to combine the bilingual NN models
4.3.1 Overview of the Method in Project 2
This project continues to focus on acoustic model adaptation for non-native speech, which is the most basic and important recognition step in a non-native ASR system: without good phoneme output from this part, the final sentence-level output cannot be recognized properly.
Figure 4.9 Project 2 Process flow
Project 2 implements a neural network acoustic model (Sections 2.3.4, 2.3.5) and uses the product-of-experts technique (Section 4.3.2). In this project we again expect an improvement in recognition results when the Mandarin acoustic model and the English acoustic model, both in neural network (NN) form, are combined.
Figure 4.9 shows the general process flow for this project; the final recognition results of the three processes are compared. The first process develops a neural network model that uses the information from a native English acoustic model; then another, two-layer neural network is developed to further improve the recognition result and to parallel the other two processes. After that, the neural network output is post-processed: the probability p of each state is converted into a distribution position x, i.e., if p = g(x) for a Gaussian function g, then x = g^{-1}(p). Then we pass the distribution values of the test data into an initialized HMM. Finally, the HVite output is the recognition result of the two cascaded neural network models.
The second process differs from the first process only in its first step, in which its first neural network model is developed with the information from a native Mandarin acoustic model. All the other steps repeat those of process 1. One difference is worth mentioning: the Mandarin NN model has 138 output nodes, and these are mapped to the 117 nodes of the second two-layer NN model, which is important so that the output matches the states of the English phone set.
The third process also differs from the first process in its first step: it combines the post-edited results of the first NN models of the previous two processes and then uses the combined results as the input of its second neural network model.
4.3.2 Step Details in Project 2
The prepared material for this project is as follows:
- A well-trained native English model (based on the broadcast data sources TDT2 and TDT3, which are recorded broadcast corpora with speech transcriptions and other information)
- A well-trained Mandarin acoustic model
- The QuickNet toolkit, a suite of software that facilitates the use of multi-layer perceptrons (MLPs) in statistical pattern recognition systems
- The non-native, Mandarin-accented speech collected in Section 4.1, about 6 hours, 50% for training and the rest for testing
In the first step of process 1, the author trains a neural network acoustic model using an ilab target file aligned with a well-trained native English acoustic model. To train the NN acoustic model, the author has to prepare the NN input file in pfile format and the target file in ilab format. Neural network acoustic model training has already been explained in Sections 2.3.4 and 2.3.5, and the method applied here is the same as the one described in Section 2.3.5.
Figure 4.10 NN model 1 training
The details of the first NN model training are shown in Figure 4.10. The input files are converted from the non-native training speech feature files; the pfile is only another format and still contains all the feature values. The pfile conversion is done by a tool in QuickNet, which can be compiled and used easily following the examples. Every feature value in the pfile is sent to the function nodes in the middle layer, each of which computes a non-linear function; the result of each middle node is sent to all the nodes in the final layer, which again compute a non-linear function (usually a sigmoid). From the input layer to the second layer, each connection multiplies its input by a weight, and different arcs from layer 1 to layer 2 have different weights; weights also exist on the arcs between layer 2 and layer 3. One input frame has 39 feature values in this experiment, so there are 39 input nodes per frame. However, multiple frames, e.g. 3, 5, 7, 11 or 13, can be presented at the input together, so that information from neighbouring frames is captured as well. In this research, an 11-frame window (window_size = 11) proved to be optimal.

The final layer has a fixed number of output nodes, equal to the number of states of the corresponding acoustic model: in this research, the English acoustic model has 117 states while the Mandarin acoustic model has 135 states. The final-layer result gives the probability that the input frame (the frame in the middle of the window, e.g. the second of 3 input frames or the sixth of 11 input frames) belongs to each of the output states. As we know, one feature file corresponds to one frame period, and this frame can be classified as a particular state of a particular phoneme; the pfile is a transformation of the feature file and captures the state-level characteristics as well. The final-layer results have a different value for each node (which stands for a state), and the state with the highest probability is the state recognized for that frame by the NN network.

Initially, however, the NN is initialized with identical arc weights and non-linear function parameters, so the final-layer results are very different from what we expect. The ilab file labels only the correct state of each frame as 1 and all other states as 0; it is converted from an alignment file by scripts written by the author. The alignment file is obtained by force-aligning the non-native training speech against its transcription with a native English acoustic model. The NN final-layer results are compared with the correct targets from the ilab file, the difference is fed back to the final layer's functions and arc weights so that they are updated to reduce the error, the error is then propagated back to the second layer's weights in the same way, and so on. The system is trained with the training data and ilab files, and a larger training data size leads to better system performance.
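A minimal sketch of such a frame-classification MLP is given below, using the sizes quoted in the text (39 features per frame, an 11-frame context window, 117 English state outputs); the hidden-layer size and the use of NumPy instead of QuickNet are illustrative assumptions, and no training loop is shown.

    import numpy as np

    FEAT_DIM, WINDOW, NUM_STATES, HIDDEN = 39, 11, 117, 1000  # HIDDEN is an assumption

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.01, size=(WINDOW * FEAT_DIM, HIDDEN))   # layer 1 -> 2 weights
    W2 = rng.normal(scale=0.01, size=(HIDDEN, NUM_STATES))          # layer 2 -> 3 weights

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def frame_posteriors(features, t):
        """State posteriors for frame t given an 11-frame window of 39-dim features.
        features: array of shape (num_frames, 39)."""
        half = WINDOW // 2
        idx = np.clip(np.arange(t - half, t + half + 1), 0, len(features) - 1)
        x = features[idx].reshape(-1)                    # 429-dimensional input
        h = 1.0 / (1.0 + np.exp(-(x @ W1)))              # sigmoid hidden layer
        return softmax(h @ W2)                           # posterior over 117 states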
After the first NN model is trained, its output for the training data contains 117 values per frame, which stand for the likelihood of that frame belonging to each of the 117 states. The post-editing for this part first applies the log function to each likelihood value and then converts the edited results into a pfile to serve as the input of the second NN model training.
Figure 4.11 NN model 2 training (PoE)
The second model is a two-layer NN (Figure 4.11). The ilab target file is the same as in the previous model training. Without the log pre-processing, the output of a two-layer NN depends on a weighted sum of its inputs:

y_j = \sigma\!\Big(\sum_i w_{ij}\, x_i\Big)

After the log editing of the inputs, the output depends on a product of the inputs:

y_j = \sigma\!\Big(\sum_i w_{ij} \log x_i\Big) = \sigma\!\Big(\log \prod_i x_i^{\,w_{ij}}\Big)
This is why the second NN model is called a product of experts: the final result is computed from the product of the input values, each raised to the power of its arc weight. For the English-only process only slightly more information can be learned in this step, and the gain is not significant. However, the step has to be implemented for the Mandarin model training and the combined model training, because in those two processes the number of states per frame at the output of the first model differs from the number of states in the English phone set, and the outputs have to be converted and trained towards English speech in this step.
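The sketch below illustrates the product-of-experts combination of the second stage: the state posteriors produced by the first-stage networks are log-transformed and fed through a single weight layer with a softmax output over the 117 English states. It is an illustrative NumPy analogue of the QuickNet setup rather than the author's exact configuration, and the combined 252-dimensional input corresponds to process 3.

    import numpy as np

    def poe_combine(english_post, mandarin_post, W, b):
        """Second-stage product-of-experts layer.

        english_post:  first-stage English posteriors for one frame (117 values)
        mandarin_post: first-stage Mandarin posteriors for one frame (135 values)
        W, b:          trained weights (252 x 117) and bias (117) of the PoE layer
        """
        eps = 1e-10                                   # avoid log(0)
        x = np.log(np.concatenate([english_post, mandarin_post]) + eps)
        z = x @ W + b                                 # weighted sum of log posteriors,
        z = z - z.max()                               # i.e. log of a product of experts
        e = np.exp(z)
        return e / e.sum()                            # posterior over 117 English states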
After the second NN model is trained, the author has to edit the result further so that it can serve as the input of the initialized HMM; only the result for the test data is of concern here. The first and second steps use the training data to update the model parameters, and the test data has to follow the same format as the training data and pass through the two models again. As mentioned previously, the values in the result stand for the likelihood of the frame belonging to each of the states. After the author obtains these probability values, she converts each probability value into its distribution position (Figure 4.12): the value x is obtained by applying the inverse of the Gaussian function to the probability, and only the positive x value is taken to represent a state for a frame.
Figure 4.12 The inverse function of the Gaussian
The initialized HMM for this application has a feature dimension equal to its number of states. Every state contains only one Gaussian mixture, whose means are 0 in all dimensions; one variance value is relatively small (in this experiment, 1 by default) and all the other variance values are very large (in this experiment, 1000 by default). The position of the small variance identifies that particular state, consistent with the state ordering in the ilab file. Thus, when the distribution file is fed in, the state whose x value is closer to the mean 0 is assigned a higher probability score, and the combined state scores contribute to the final phoneme recognition result.
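A minimal sketch of this post-processing is shown below, assuming the "Gaussian function" is the standard normal density g(x) = (2*pi)^{-1/2} exp(-x^2/2), so that x = g^{-1}(p) is the positive root of -2 ln(sqrt(2*pi) p); clipping probabilities above the density's maximum is an assumption made so the square root stays real. The scoring function drops terms common to all states.

    import numpy as np

    def posterior_to_position(p):
        """Convert a state posterior p into its 'distribution position' x with
        g(x) = p for the standard normal density g, taking the positive root."""
        peak = 1.0 / np.sqrt(2.0 * np.pi)          # maximum of the standard normal pdf
        p = np.clip(p, 1e-12, peak)                # keep the argument of the log valid
        return np.sqrt(-2.0 * np.log(p * np.sqrt(2.0 * np.pi)))

    def state_scores(positions, small_var=1.0, big_var=1000.0):
        """Relative log-likelihood score of each state of the initialized HMM for
        one frame. State s has mean 0 everywhere, variance small_var in dimension
        s and big_var elsewhere; a position near 0 in dimension s favours state s."""
        positions = np.asarray(positions, dtype=float)
        return -0.5 * positions**2 * (1.0 / small_var - 1.0 / big_var)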
The second process differs from the first process in its first two steps. In the first step, a native Mandarin model is used to obtain the alignment for the training data, and the alignment is then converted into an ilab file; hence the number of output states is consistent with the Mandarin states (135). In the second step, the product-of-experts NN is trained: its input pfile has 135 feature values per frame, but it is trained towards an output with 117 nodes, and the ilab file used in this training is aligned with the native English acoustic model. This step transfers the information in the native Mandarin acoustic model into the space of the English acoustic model. After that, the author uses an initialized HMM to decode the test-data results of the second model.
The third process combines the results of the first model of process 1 and the first model of process 2. The former produces a result with 117 features per frame and the latter with 135 features per frame, so the combined result has 252 features per frame. The combined results are the input of the second model of process 3, which is still trained towards the English ilab file, with a fixed 117 output nodes. In this way the native Mandarin and native English acoustic model information is integrated into the two-stage NN model, and a better final result is expected. The last step of the third process is the same as in the first and second processes.
4.3.3 Project 2 Results
The results of Project 2 are shown in Table 4.3. The third process trains the best acoustic model, because it uses the information from both the native Mandarin acoustic model and the native English acoustic model.
Table 4.3 Results from the NN PoE

    Process   Model                                          PER
    1         Native English model after PoE                 63.90%
    2         Native Mandarin model after PoE                64.12%
    3         Merged results of the two models after PoE     61.12%
4.4 Project 3: Training a Non-native Speech Lexicon Model
4.4.1 Overview of the Method in Project 3
In this project, the author focuses on training a lexicon model that captures the vocabulary pronunciation of non-native speakers, which deviates considerably from the standard dictionary used in native English speech recognition.
Figure 4.13 Project 3 process flow: (1) divide the non-native speech into train and test data sets; (2) use a non-native acoustic model to recognize and force-align the train data set; (3) feed the standard dictionary, the train-set recognition output and the alignment into a dictionary-updating system to obtain a new dictionary; (4) keeping the same acoustic model, use the new dictionary and the standard dictionary to recognize the test data set.
Figure 4.13 gives an overview of Project 3. First, the author divides the non-native data into two parts: a train set for non-native lexical model training and a test set.
In order to update the standard lexical model in a statistical, data-driven way, the author developed a system composed of several Perl scripts; more details are given in the next section. The system requires the phone-level recognition result of the train data set and its corresponding alignment, which have to be prepared in advance.
With these three inputs, the recognition output, the alignment and the standard dictionary, the dictionary-updating system generates a new dictionary, which has modified, added or deleted pronunciations for some of the words and assigns probability weights to the pronunciation variants of a word.
The last step is to recognize the test data set at the word level with the standard dictionary and with the updated dictionary individually, and to compare the results.
4.4.2 Step Details in Project 3
The prepared material for this project is as follows:
- A well-trained non-native English model (based on data collected from the author's supervisor's students)
- The non-native speech: multi-national, about 17 speakers, each contributing about 100 to 200 sentences with one word per sentence; 14 speakers for the train data set and the rest for the test data set
- The Hidden Markov Model Toolkit
Steps 1 and 4 are very simple, and the author does not describe them further in this section.
The non-native acoustic model used in step 2 is a triphone HMM, i.e. a context-dependent model, which performs better than a monophone acoustic model. Moreover, since the non-native acoustic model only has to recognize one word per sentence, it does not need to be particularly robust or require a large amount of training data. The recognition process simply follows the steps of Project 1 using HVite; the forced alignment can be obtained by adding the "-a" option and an "-I" label file to the HVite command.
Figure 4.14 Project 3 Step 3 Process flow
Step 3 uses the information from the train data set to update the standard dictionary; this is done by a system of Perl scripts. The system is illustrated in Figure 4.14, where the scripts are in the gray boxes and the inputs and outputs are in the white boxes. Each script plays an important role, and some scripts involve sub-scripts.
The script multiphone.perl and its sub-script multiphone2.perl compare the information from the recognition result and the alignment result: they match each aligned word with the corresponding phone-level recognition result based on their IDs and time frames, and output a list that maps each matched word to a recognized phone sequence.
The script dict_update.perl collects all the pronunciations listed in write.list under their matched words, and it also adds the standard pronunciation from the standard dictionary to each word; for example, the word COMMUNIST has the format shown in Figure 4.15.

    COMMUNIST
      k aa m y ah n ah s t 1
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      sil k ah m y uw n ih t iy z sil
      k ah m y uw n ih t iy z
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k ah m y uw n ih t iy z
      k aa n ah m ih s t

Figure 4.15 An example entry in combine.list
The first pronunciation is from the standard dictionary and is therefore marked with a 1. The following pronunciations are sorted from the write list and are actually samples of the recognition results. The script find_new_pronounce.perl checks for redundancy in the pronunciation samples and makes some modifications. For example, in Figure 4.15, COMMUNIST has the pronunciation sample "sil k ah m y uw n ih t iy z sil"; "sil" and "sp" are silence phonemes and are useless for training the new lexicon model, so they are deleted here.
The script find_general.perl counts the number of occurrences of each pronunciation and aligns every phoneme of a sample pronunciation to the matching phoneme of the standard pronunciation. This script has several sub-scripts; the statistics are counted by the sub-script combine.statistic.perl. To align the positions, we compute a Levenshtein-style alignment between the sample pronunciation and the standard pronunciation, and find the right positions from the maximum score obtained by matching the sample pronunciation against the standard pronunciation. A deletion or insertion scores 0; if two phonemes match exactly, a score of 10 is given; if two phonemes are very similar, a score of 5 is given; and if two phonemes are not similar, a score of 2 is given. The maximum score is computed with dynamic programming, and the alignment of the sample pronunciation is recovered from the score matrix. The similar phonemes are categorized in the script category.perl.
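The alignment itself is a standard dynamic-programming procedure. The sketch below illustrates the scoring scheme described above (10 for an exact match, 5 for similar phonemes, 2 for dissimilar phonemes, 0 for an insertion or deletion); the similar() predicate stands in for the phoneme categories defined in category.perl and is an assumption here, as is the Python rendering of the Perl script.

    def align_score(sample, standard, similar=lambda a, b: False):
        """Score matrix for aligning a sample pronunciation to the standard one.

        sample, standard: lists of phoneme symbols.
        similar: predicate saying whether two different phonemes belong to the
                 same category (as defined by category.perl in the real system).
        Returns the DP score matrix; the best total score is score[-1][-1], and
        the alignment can be recovered by backtracking through the matrix.
        """
        def sub_score(a, b):
            if a == b:
                return 10          # exact phoneme match
            if similar(a, b):
                return 5           # phonemes in the same category
            return 2               # dissimilar phonemes

        n, m = len(sample), len(standard)
        score = [[0] * (m + 1) for _ in range(n + 1)]   # insertions/deletions add 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                score[i][j] = max(score[i - 1][j],                 # delete sample[i-1]
                                  score[i][j - 1],                 # insert standard[j-1]
                                  score[i - 1][j - 1] + sub_score(sample[i - 1],
                                                                  standard[j - 1]))
        return score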
    COMMUNIST
      k aa m y ah n ah s t 1 1
      k aa m y ah n ah s t 9
      k ah m y uw n ih t 3
      k aa   n ah m ih 1

Figure 4.16 An example of the output of find_general.perl
The script update_train_words.perl finds the most popular phoneme at each standard position of a particular word; the phoneme sequence built from the most frequently occurring phoneme at every position is taken as the best pronunciation for that word. Each position may also have a second-best phoneme; updating the single position whose second-best phoneme has a relatively higher rate than the other second-best phonemes yields the second-best pronunciation for that word. The rate of the second-best pronunciation is also written into the new dictionary as its weight relative to the best pronunciation. For example, the final entry written into the dictionary file for the word COMMUNIST has the following format.
    COMMUNIST
      1      k aa m y ah n ah s t
      0.333  k aa m y uw n ah s t

Figure 4.17 An example entry in the updated dictionary
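A condensed sketch of the voting scheme described above is given below; it assumes the sample pronunciations have already been padded to the standard positions by the alignment step, and the way the second-best pronunciation and its weight are chosen is a simplified reading of the description, not the exact Perl implementation.

    from collections import Counter

    def vote_pronunciations(aligned_samples):
        """Derive the best and second-best pronunciation of one word.

        aligned_samples: list of pronunciations, each a list of phonemes already
        aligned to the standard positions (same length for every sample).
        Returns (best_pron, second_pron, second_weight).
        """
        positions = list(zip(*aligned_samples))            # phonemes per position
        counts = [Counter(p) for p in positions]
        best = [c.most_common(1)[0][0] for c in counts]

        # find the position whose runner-up phoneme is relatively most frequent
        second, weight, pos = None, 0.0, None
        for i, c in enumerate(counts):
            ranked = c.most_common(2)
            if len(ranked) > 1:
                rate = ranked[1][1] / sum(c.values())
                if rate > weight:
                    weight, pos, second = rate, i, ranked[1][0]

        second_pron = list(best)
        if pos is not None:
            second_pron[pos] = second
        return best, second_pron, weight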
4.4.3 Project 3 Results
The result of Project 3 is shown in Table 4.4; the percentages are word-level accuracies of the recognition process. With the same acoustic model, the updated dictionary shows a 2-percentage-point improvement for one non-native speaker, whose pronunciation is far below average, but does not show any improvement for the other two non-native speakers.
Table 4.4 Results for Project 3

    Dictionary      Speaker A   Speaker B   Speaker C   All
    Standard dict   80.4%       84.43%      38%         71.50%
    Updated dict    80.4%       84.43%      40%         71.97%
4.5 PROJECTS' ACHIEVEMENTS AND PROBLEMS
4.5.1 THE ACHIEVEMENTS AND PROBLEMS IN THE ACOUSTIC MODEL
Project 1 and Project 2 both show promising results for the combined model. In Project 1, the result of process 3 is 1.27% better than that of process 2 and 3.68% better than that of process 1. In Project 2, the result of process 3 is 2.78% better than that of process 1 and 3.0% better than that of process 2. Both projects obtained the expected improvements and both achieved their goal of showing that the information contained in the Mandarin model and the English model can be combined to give a better recognition result for Mandarin-accented English speech.
In addition, the results obtained in the two projects are quite promising given that the training data is very limited: the first project uses about 4.2 hours of non-native speech data for training, while the second project uses only 3 hours. With enough funding to obtain more research data, for example 20 hours of Mandarin-accented English speech, the results would be much better, because all the training and adaptation processes are data-driven.
However, neither project achieved a good baseline: a phone error rate of 54.55% is the best overall result for Project 1, and a phone error rate of 61.12% is the best overall result for Project 2. This is due to the limited training data; non-native speech research is normally carried out with about 20 hours of non-native training data.
The work in Project 1 (Section 4.2) shows an unexpected result: the modified model (the empty English model with Mandarin mixtures inserted) performs better than the native English model. In the final process results, the modified-model process has a PER of 55.82%, while the native English model process has a PER of 58.23%. We know that non-native speakers' pronunciation is heavily biased by their mother-tongue accent, but their speech is still English rather than Mandarin, so it is surprising that the acoustic model built from Mandarin mixtures performs better than the one built from English mixtures. In Project 2, however, the model trained with Mandarin acoustic information performs worse than the model trained with English acoustic information: the end result for process 2 of Project 2 is 64.12%, while that for process 1 is 63.9%. The different outcomes may be due to the different training data distributions.
4.5.2 THE ACHIEVEMENTS AND PROBLEMS IN THE LEXICON MODEL
The results in Project 3 show that the updated dictionary works for one speaker whose pronunciation is too far from standard to be captured by the standard dictionary.
However, the results also show no improvement in the recognition output of the other two speakers. This is mainly due to two causes. First, the other two speakers' recognition outputs contain few mistakes caused by the lexicon model; their errors come from phonemes that are recognized wrongly at the acoustic model stage. Second, the training data is insufficient: each word has only about 10 to 14 samples, which is too little to capture the pronunciation variation among non-native speakers. In addition, the speakers in this project are multi-national, and it is difficult to train a good lexicon model to capture so many pronunciation variations.
Chapter 5
CONCLUSION AND RECOMMENDATION
In this thesis, the author first introduced the basic knowledge of automatic speech recognition (ASR), including the concepts of feature extraction and the acoustic, lexicon and language models in an ASR system.
After that, she summarized the history of the natural language processing field and highlighted its major challenges, which show that research on improving non-native speech recognition is in great demand. Then some prior research aimed at improving non-native speech recognition systems was reviewed, in some cases in detail.
In Chapter 4, the author presented the research work finished so far. It includes three projects: the first two focus on addressing the issues of the acoustic model for non-native speech, and the third focuses on the issues of the lexicon model for non-native speech. The detailed achievements and problems of these projects are given in Section 4.5.
If any researcher intends to carry out similar research in this field, the author recommends the following tips for improving the final research result.
- If the research targets a specific kind of non-native speech, for example data recorded only from Mandarin-accented speakers, about 10 to 20 hours of data should be collected to be sufficient for training. If the research uses data recorded from multi-national non-native speakers, the data size should preferably be doubled (20 to 40 hours).
- If the researcher trains an NN acoustic model, the alignment used for the ilab file should be as accurate as possible for the training data; to create the alignment, it is better to force-align the training data with a context-dependent HMM.
- If the research aims to improve the lexical model for non-native speakers, it is better to work with a well-trained acoustic model, because a poor acoustic model cannot recognize the phonemes correctly in the first stage, and it is difficult to assess the performance of the lexicon model on such poorly recognized phoneme sequences.
- For the lexicon model, the author only focused on improving the dictionary in a data-driven way and only obtained an improvement on single-word sentences. Given enough data, the researcher can also look into discriminative training to exclude unnecessary pronunciation updates for the recognition of long, multi-word sentences.
Currently, there are still many issues in non-native automatic speech recognition systems; the most problematic part is that the acoustic model cannot be improved well enough. Only when this issue is fixed can the issues of the lexicon model and language model be addressed. Therefore, for non-native speech it is highly desirable that future researchers develop a novel acoustic model, that a more robust way to extract and use audio features emerges, or that methods appear which make use of visual information (facial expressions, gestures, etc.).
APPENDIX
Command 1:
>java –jar
Command 2:
>HCopy -T 1 -C -S < script for wav to mfcc mapping>
Command 3:
>HVite -A -D -V -X rec -T 1 -C -H -t 250 –s 0
–p -10 –w -o M -i -S < mono phone list>
Command 4:
>HResults –I
Command 5:
>HHEd –A –D –V –T 1 –H -M
< configuration file>
Command 6 (HERest, estimation of adaptation transforms):
>HERest -A -D -V -T 1 -C <configuration file> -C <adaptation configuration file> -S <script file> -I <transcription file> -H <HMM model file> -u a -K <output transform directory> -J <regression class directory> -h <speaker name mask> <HMM list>
>HERest -A -D -V -T 1 -C <configuration file> -C <adaptation configuration file> -S <script file> -I <transcription file> -H <HMM model file> -u a -J <input transform directory> -K <output transform directory> -J <regression class directory> -h <speaker name mask> <HMM list>
Command 7 (HVite, recognition using the estimated transforms):
>HVite -H <HMM model> -S <test file list> -J <transform directory> -h <speaker name mask> -k -i <recognition output> -w <word network> -J <regression tree and base class directory> -C <model configuration file> -t 250 -p -10 -s 0 <dictionary> <mono phone list>
Command 8 (HERest, re-estimation of mixture weights using the input transforms):
>HERest -A -D -V -T 1 -u w -w 2 -C <model configuration file> -J <transform directory> -J <regression tree and base class directory> -h <speaker name mask> -S <script file> -I <adaptation transcription> -a -H <input HMM model> -M <output HMM model> <HMM list>
Command 9 (HERest, re-estimation of mixture weights on the adaptation data):
>HERest -C <configuration file> -u w -S <script file> -I <adaptation data transcription> -H <input HMM model> -M <output HMM model> <HMM list>