MODELING OF NON-NATIVE AUTOMATIC
SPEECH RECOGNITION
XIONG YUANTING
(B.Eng.(Hons.)), NTU
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF
COMPUTING
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2011
Acknowledgements
This thesis would not have been possible without the guidance and the help of several
individuals who in one way or another contributed and extended their valuable assistance
in the process of the research.
First and foremost, my utmost gratitude goes to Dr. Sim Khe Chai, Assistant Professor at
the School of Computing (SoC), National University of Singapore (NUS), whose
sincerity and encouragement I will never forget. Dr. Sim has been my inspiration as I
hurdled all the obstacles in completing this research work.
Mr. Li Bo, PhD candidate of the Computer Science Department of the National
University of Singapore, for his unselfish and unfailing advice in implementing my
research project.
Mr. Wang Xuncong, PhD candidate of the Computer Science Department of the National
University of Singapore, who shared valuable suggestions on the fundamental
knowledge of automatic speech recognition systems.
Dr. Mohan S Kankanhalli, Professor of the Department of Computer Science, NUS, who
showed kind concern and gave suggestions regarding my academic requirements.
Last but not least, my husband and my friends, for giving me the strength to press on
with study and research in the Computer Science Department, which is a big challenge for
me, as my bachelor's degree is from the School of Electrical and Electronic Engineering.
Table of Content

TABLE OF CONTENT ........................................................ iii
SUMMARY ................................................................. v
LIST OF FIGURES ......................................................... vi
LIST OF TABLES .......................................................... vii
CHAPTER 1 INTRODUCTION .................................................. 1
CHAPTER 2 BASIC KNOWLEDGE IN ASR SYSTEM ................................. 3
  2.1 Overview of Automatic Speech Recognition (ASR) .................... 3
  2.2 Feature Extraction ................................................ 5
  2.3 Acoustic Models ................................................... 8
    2.3.1 The phoneme and state in the ASR ............................. 9
    2.3.2 The Theory of Hidden Markov Model ............................ 10
    2.3.3 HMM Methods in Speech Recognition ............................ 12
    2.3.4 Artificial Neural Network .................................... 15
    2.3.5 ANN Methods in Speech Recognition ............................ 17
  2.4 Adaptation Techniques in Acoustic Model .......................... 19
    2.4.1 SD and SI Acoustic model ..................................... 19
    2.4.2 Model Adaptation using Linear Transformations (MLLR) ......... 19
    2.4.3 Model adaptation using MAP ................................... 22
  2.5 Lexical Model .................................................... 23
  2.6 Language Model ................................................... 24
CHAPTER 3 LITERATURE REVIEW ............................................. 27
  3.1 Overview of the challenges in ASR for non-native speech .......... 27
  3.2 The solutions for non-native speech challenges ................... 30
CHAPTER 4 METHODS AND RESULTS ........................................... 41
  4.1 Non-Native English Data Collection ............................... 42
  4.2 Project 1: Mixture level mapping for Mandarin acoustic model to English acoustic model ... 43
    4.2.1 Overview of the Method in Project 1 .......................... 43
    4.2.2 Step Details in Project 1 .................................... 46
    4.2.3 Project 1 Results ............................................ 52
  4.3 Project 2: PoE to combine the bilingual NN models ................ 54
    4.3.1 Overview of the Method in Project 2 .......................... 54
    4.3.2 Step Details in Project 2 .................................... 55
    4.3.3 Project 2 Results ............................................ 61
  4.4 Project 3: Training Non-native Speech Lexicon Model .............. 61
    4.4.1 Overview of the Method in Project 3 .......................... 61
    4.4.2 Step Details in Project 3 .................................... 62
    4.4.3 Project 3 Results ............................................ 65
  4.5 Projects Achievement and Problem ................................. 66
    4.5.1 The Achievement and Problem in Acoustic Model ................ 66
    4.5.2 The Achievement and Problem in Lexicon Model ................. 67
CHAPTER 5 CONCLUSION AND RECOMMENDATION ................................. 68
APPENDIX ................................................................ 70
BIBLIOGRAPHY ............................................................ 72
Summary
Heavily accented non-native speech represents a significant challenge for automatic
speech recognition (ASR). Globalization further emphasizes the urgency of research to
address these challenges. An ASR system consists of three parts: acoustic modeling, lexical
modeling and language modeling. In this thesis, the author first gives a brief
introduction to the research topic and the work that has been done in Chapter 1. In Chapter 2,
the author explains the fundamental knowledge of the ASR system; the concepts and
techniques illustrated in this chapter are applied in the following chapters,
especially in Chapter 4. In Chapter 3, the author presents her literature review, which
introduces the current concerns in the natural language processing field, the challenges
involved, and the major approaches to addressing those challenges. In Chapter 4, the
author presents the research she has done so far. Two projects are carried out to improve the
acoustic model for recognizing Mandarin-accented non-native speech. Another project
is targeted at improving the lexicon model of word pronunciations for multi-national
speakers. The project process flows and step details are all covered. In Chapter 5, the
author discusses the achievements and the problems regarding the results from the three
projects, and then gives her conclusions and recommendations for the work she has done
for this thesis.
List of Figures
FIGURE 2.1 AUTOMATIC SPEECH RECOGNITION .................................................................................... 4
FIGURE 2.2 FEATURE EXTRACTION ......................................................................................................... 5
FIGURE 2.3 FEATURE EXTRACTION BY FILTER BANK .............................................................. 7
FIGURE 2.4 MEL FILTER BANK COEFFICIENT ........................................................................................... 8
FIGURE 2.5 HIDDEN MARKOV MODELS................................................................................................ 11
FIGURE 2.6 ASR USING HMM ............................................................................................................... 13
FIGURE 2.7 TRAIN HMM FROM MULTIPLE EXAMPLES ......................................................................... 15
FIGURE 2.8 ILLUSTRATION OF NEURAL NETWORK............................................................................... 15
FIGURE 2.9 SIGMOIDAL FUNCTION FOR NN ACTIVATION NODE ......................................................... 16
FIGURE 2.10 FEEDFORWARD NEURAL NETWORK (LEFT) VS RECURRENT NEURAL NETWORK (RIGHT) ......... 16
FIGURE 2.11 ANN APPLIED IN ACOUSTIC MODEL ................................................................................ 18
FIGURE 2.12 REGRESSION TREE FOR MLLR ........................................................ 22
FIGURE 2.13 DICTIONARY FORMAT IN HTK .......................................................................................... 24
FIGURE 2.14 THE GRAMMAR BASED LANGUAGE MODEL .................................................................... 25
FIGURE 3.1 SUMMARY OF ACOUSTIC MODELING RESULTS ................................................................. 32
FIGURE 3.2 PROCEDURE FOR CONSTRUCTING THE CUSTOM MODEL ................................................. 33
FIGURE 3.3 MAP AND MLLR ADAPTATION ........................................................................................... 35
FIGURE 3.4 PERFORMANCE WITH VARIOUS INTERPOLATION WEIGHTS ............................................. 35
FIGURE 3.5 BEST RESULTS OF VARIOUS SYSTEMS ................................................................................ 36
FIGURE 3.6 DIAGRAM OF HISPANIC-ENGLISH MULTI-PASS RECOGNITION SYSTEM ............................ 37
FIGURE 3.7 EXAMPLE OF THE MATCH BETWEEN STATE TARGET AND SOURCE .................................. 39
FIGURE 4.1 THE VOICE RECORDING COMMAND INPUTS AND OUTPUTS ............................................ 42
FIGURE 4.2 PROJECT 1 PROCESS FLOW ................................................................................................ 45
FIGURE 4.3 THE INPUTS AND OUTPUT FOR THE MFCC FEATURES ABSTRACTION ............................... 47
FIGURE 4.4 THE INPUTS AND OUTPUT OF SPEECH RECOGNITION COMMAND ................................... 48
FIGURE 4.5 THE INPUTS AND OUTPUTS FOR REGRESSION TRESS GENERATION ................................. 48
FIGURE 4.6 INPUTS AND OUTPUTS FOR MLLR ADAPTATION ............................................................... 49
FIGURE 4.7 MODIFIED MODEL ILLUSTRATION ..................................................................................... 50
FIGURE 4.8 THE RESULTS FOR PROJECT 1 ............................................................................................ 53
FIGURE 4.9 PROJECT 2 PROCESS FLOW ................................................................................................ 54
FIGURE 4.10 NN MODEL 1 TRAINING ................................................................................................... 56
FIGURE 4.11 NN MODEL 2 TRAINING (POE) ......................................................................................... 58
FIGURE 4.12 THE INVERSE FUNCTION OF GAUSSIAN ........................................................................... 59
FIGURE 4.13 PROJECT 3 PROCESS FLOW .............................................................................................. 61
FIGURE 4.14 PROJECT 3 STEP 3 PROCESS FLOW .................................................................................. 63
FIGURE 4.15 THE EXAMPLE FOR COMBINE.LIST ................................................................................... 64
FIGURE 4.16 THE EXAMPLE FOR THE RESULT OF FIND_GENERAL.PERL ............................................... 65
FIGURE 4.17 THE EXAMPLE FOR DICTIONARY ...................................................................................... 65
List of Tables
Table 2.1 The 39 CMU Phoneme Set ................................................................................................. 9
Table 2.2 The probability of the pronunciations ............................................................................. 23
Table 3.1 Word error rate for the CSLR Sonic Recognizer ............................................................... 36
Table 3.2 Word Error Rate % by Models ....................................................... 37
Table 4.1 Non-native English collection data .................................................................................. 43
Table 4.2 Mandarin mixture modified empty English model .......................................................... 53
Table 4.3 Results from NN Poe ........................................................................................................ 61
Table 4.4 Results for project 3 ......................................................................................................... 66
Chapter 1
Introduction
The goal of automatic speech recognition (ASR) is to get computers to convert human
speech into text. The ASR system simulates the hearing and language processing ability
of humans. Currently, in real-world applications, a well-trained ASR system can
achieve more than 95% accuracy in a controlled environment. In particular, ASR systems
work best when the acoustic models are speaker dependent, the training and testing
speech data are recorded in a noise-free environment and the speakers have a fluent native
accent.
However, with globalization and the widespread emergence of speech applications, the
need for more flexible ASR has never been greater. A flexible ASR system is one that can
serve multiple users, operate in noisy environments, and even handle non-native speakers.
In this research, the author will focus on improving ASR performance for non-native speakers.
There are many challenges in tackling the problems arising from non-native ASR.
Firstly, there is a lack of non-native speech resources for model training. Some
researchers have attempted to address this problem by adapting a native acoustic
model with limited non-native speech data ([10], [13], [15], [18]). Secondly, non-native
speakers have different nationalities and thus different accents; to address this, some
researchers have tried MLLR, MAP, interpolation and state-level mapping ([9], [10],
[11], [12], [26]). Thirdly, even non-native speakers with the same hometown accent are
at different levels of proficiency in the target language. These problems cause non-native
speech recognition accuracy to drop significantly.
In the research covered in Chapter 4 of this thesis, the author will focus on the two
core parts of the ASR system to improve the accuracy of non-native speech recognition.
Firstly, the author will attempt to improve the acoustic model of the ASR system. Non-native
acoustic modeling is an essential component in many practical ASR systems, and there are
two projects related to this problem. Secondly, the author also explores issues in the
lexicon model. For non-native speakers, the pronunciations of some words differ
from those of native speakers. To make matters worse, due to the immature accents of non-native
speakers, discriminating between words with similar pronunciations becomes difficult. In
this thesis, one project is targeted at this problem.
This thesis uses many technical terms, concepts and techniques from the field of
automatic speech recognition. In order to make them clear to the reader, the
author has collected all the key background knowledge in Chapter 2. In addition, many
previous researchers have attempted to solve these problems with various approaches,
which also give insight into the author's later projects. The development history of ASR
and those researchers' approaches to addressing issues similar to those in this thesis are
included in Chapter 3. Chapter 5 gives the conclusions and recommendations for possible
future work.
Chapter 2
Basic Knowledge in ASR System
In this chapter, the author will describe the basic knowledge behind an ASR system. First, the
author will give an overview of the ASR system. After that, the author will describe
the currently most frequently used acoustic feature format and its corresponding extraction
technique. Then, the author will cover the acoustic model, the lexicon model and the
language model one by one. Understanding the acoustic model is very important
for understanding the ASR system, and the author will give more in-depth
knowledge in this part, including some advanced techniques used in the acoustic model.
2.1 Overview of Automatic Speech Recognition (ASR)
The Automatic Speech Recognition is a system to process a speech waveform file into a
language text format, by which this system converts audio captured by the microphone
into the text format language stored in the computer.
An ASR system generally consists of three models, acoustic model, lexical model and
language model. As illustrated in Figure 2.1, the audio waveform file is first converted
into a feature file, with reduced size and is matched to a particular acoustic model.
Figure 2.1 Automatic Speech Recognition
The acoustic model accepts the feature file as input and produces a phone sequence as
output. The lexical model is the bridge between the acoustic model and the language model:
it models each word in the speech with one or more pronunciations, which is why the
lexical model is often called a dictionary in the field. The language model
accepts the candidate word sequences produced by the lexical model as input and
produces the more common and grammatically plausible sentence. If all the models are
well trained, the ASR system is capable of generating the corresponding sentence in text
form from natural human speech with a low error rate.
Viewed another way, each of the models in the ASR system resolves
ambiguity at one of the language processing levels. For example, the acoustic model
disambiguates one sequence of feature frames from another,
and categorizes them into a sequence of phones. The lexical model disambiguates one
sequence of phones from another, and categorizes them into a
sequence of words. The language model makes the disambiguation done by the lexical
model more accurate by assigning higher probability to more frequently occurring word
sequences, or by integrating knowledge of sentence grammar.
2.2 Feature Extraction
As previously mentioned, the acoustic model requires input in the format of feature file
instead of waveform file. This is because the feature file has much smaller size than the
waveform file, while the feature file still maintains some of the important information
and reduces some redundant and disturbing data, such as noisy.
To abstract a feature file from a waveform file, many parameters have to be
predefined. First, we need to define the sampling rate for the feature file vector, usually
we call this sampling time interval as frame period. For every frame period, a vector
parameter will be generated from the waveform file, and be stored in the feature file.
This vector parameter is based on the magnitude on the frequency spectrum for a piece
of audio waveform around its frame sampling point. Normally, the duration of the piece
of audio waveform is longer than the frame period, and we call it window duration.
Thus, there is some overlap-sampled information for the nearby frames.
Figure 2.2 Feature Extraction
For example (Figure 2.2), if a waveform is sampled at 0.0625 msec per sample (a 16 kHz
sampling rate), the window duration is defined as 25 msec and the frame period as 10 msec,
then neighbouring frames overlap by 15 msec and every window duration covers 400 samples
of the speech waveform.
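To make this framing arithmetic concrete, the short Python sketch below (an illustrative example, not code from the thesis) slices a 16 kHz signal into 25 msec windows with a 10 msec frame period; the 400-sample window, 160-sample shift and 15 msec overlap follow directly from the numbers above.

```python
import numpy as np

SAMPLE_RATE = 16000           # 16 kHz, i.e. 0.0625 msec per sample
WINDOW_MS, FRAME_MS = 25, 10  # window duration and frame period from the text

win_len = int(SAMPLE_RATE * WINDOW_MS / 1000)   # 400 samples per window
hop_len = int(SAMPLE_RATE * FRAME_MS / 1000)    # 160 samples per frame shift
overlap_ms = WINDOW_MS - FRAME_MS               # 15 msec shared by neighbouring frames

def frame_signal(signal: np.ndarray) -> np.ndarray:
    """Return a (num_frames, win_len) matrix of overlapping analysis windows."""
    num_frames = 1 + (len(signal) - win_len) // hop_len
    idx = np.arange(win_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
    return signal[idx]

if __name__ == "__main__":
    one_second = np.random.randn(SAMPLE_RATE)   # stand-in for real speech samples
    frames = frame_signal(one_second)
    print(win_len, hop_len, overlap_ms, frames.shape)  # 400 160 15 (98, 400)
```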
By far, three forms of extracted feature vectors are most widely used. Linear Prediction
Coefficients (LPC) [1] were used very early in speech processing, but are seldom
used now. Mel Frequency Cepstral Coefficients (MFCC) [2] are the default
parameterization for many speech recognition applications, and ASR systems whose
feature files are based on MFCC features have competitive performance. Perceptual
Linear Prediction Coefficients (PLP) [3] were developed more recently. In the following part,
the author will only focus on the MFCC feature extraction technique.
To have a good understanding of MFCC features, we should first understand the
concept of a filterbank. Filter banks are used to extract information from the spectral
magnitude of each window period (Figure 2.3); they are a series of triangular
filters on the frequency spectrum. To implement a filterbank, a window period of
speech data is transformed using a Fourier transform and the magnitude of the frequency
spectrum is obtained. The spectral magnitude is then
multiplied by the corresponding filter gains of the filter bank and the results are integrated.
Therefore, each filterbank channel outputs one feature parameter, which is the integral of
this product along that channel. The length of the
feature vector in Figure 2.2 therefore depends on the number of feature parameters,
in other words, on the number of filter banks defined over the frequency spectrum.
Figure 2.3 Feature extraction by filter bank
Usually, if the triangular filters were spread over the whole frequency range, their number
would be unlimited. Practically, lower and upper frequency cut-offs are defined, for example
sampling information only from the frequency range of 300 Hz to 3400 Hz, so that a certain
number of triangular filter banks are spread over this range.
Mel Frequency Cepstral Coefficients (MFCC) are coefficients calculated from the feature
vector produced by the filter banks. However, the filter banks used for MFCC are not equally
spaced in frequency; such a filterbank is called a Mel filterbank. Humans can identify sounds
better at lower frequencies than at higher frequencies, and practical evidence also
suggests that a non-linearly spaced filter bank is superior to an equally spaced one.
The Mel filterbank is designed to model this non-linear characteristic of human
hearing. The equation below defines the mel scale:

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

The Mel filter banks are positioned according to the mel scale: the triangular filter
locations are equally spaced in mel value rather than in frequency. (Figure 2.4)
(Footnote: K. C. Sim, "Automatic Speech Recognition", Speech Processing, p. 7, 2010.)
Figure 2.4 Mel filter bank coefficient
After obtaining the feature vector from the non-linear filter bank, the Mel-Frequency
Cepstral Coefficients (MFCCs) can be calculated by the following formula:

$$c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} m_k \cos\!\left(\frac{\pi n}{K}\,(k - 0.5)\right), \qquad n = 1, \dots, N$$

where $c_n$ is the nth MFCC coefficient, $m_k$ is the kth feature value from the Mel filter
bank, and K and N are the number of filter banks and the number of MFCC coefficients
respectively.
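The following NumPy sketch illustrates this pipeline for a single analysis window: a mel-spaced triangular filterbank over the 300-3400 Hz range mentioned above, the log of the filterbank outputs, and the cosine sum given by the formula. The number of filters (24) and coefficients (12) are assumed values for illustration, not settings taken from the thesis experiments.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate, f_low=300.0, f_high=3400.0):
    """Triangular filters equally spaced on the mel scale between the cut-offs."""
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for k in range(1, num_filters + 1):
        left, centre, right = bins[k - 1], bins[k], bins[k + 1]
        fbank[k - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[k - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fbank

def mfcc(window, sample_rate=16000, num_filters=24, num_ceps=12):
    n_fft = len(window)
    spectrum = np.abs(np.fft.rfft(window))          # magnitude spectrum of one window
    m = np.log(mel_filterbank(num_filters, n_fft, sample_rate) @ spectrum + 1e-10)
    k = np.arange(1, num_filters + 1)
    n = np.arange(1, num_ceps + 1)[:, None]
    # c_n = sqrt(2/K) * sum_k m_k cos(pi*n/K * (k - 0.5))
    return np.sqrt(2.0 / num_filters) * (np.cos(np.pi * n / num_filters * (k - 0.5)) @ m)

coeffs = mfcc(np.random.randn(400))                 # one 25 msec window at 16 kHz
print(coeffs.shape)                                 # (12,)
```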
MFCCs are the standard compressed representation of the waveform for many speech recognition
applications. They give good discrimination and can be transformed into other forms easily.
However, compressing the waveform into feature format still causes the loss of
some information, and may reduce the robustness of the acoustic model being trained.
On the other hand, for a given utterance the waveform file is about 6 times larger than the
feature file, so training on waveforms also takes much longer than training on features.
That is why feature extraction is so widely preferred in the natural language processing
field.
2.3 Acoustic Models
(Footnote: K. C. Sim, "Automatic Speech Recognition", Speech Processing, p. 8, 2010.)
2.3.1 The phoneme and state in the ASR
In a speech utterance, it is easy to understand what is referred to as a sentence and what is
referred to as a word. The author will explain more about the phoneme and the state in the
natural language processing field.
Speakers and listeners divide words into component sounds, which are phonemes. In
this research, we use the Carnegie Mellon University pronouncing dictionary [71]; the
phoneme set in this dictionary contains 39 phonemes (Table 2.1). The CMU dictionary is
a machine-readable pronunciation dictionary for North American English. The format of
this dictionary is very useful in speech recognition and synthesis applications. The
vowels in the CMU phoneme set carry lexical stress.
Table 2.1 The 39 CMU Phoneme Set

Vowel Phonemes (PHONEME - EXAMPLE):
  aa - odd      ae - at       ah - hut      ao - ought    aw - cow
  ay - hide     eh - Ed       er - wooden   ey - ate      ih - it
  iy - eat      ow - down     oy - toy      uh - hood     uw - two

Consonant Phonemes (PHONEME - EXAMPLE):
  b - baby      ch - chip     d - dog       dh - thee     f - field
  g - game      hh - he       jh - gee      k - cook      l - lamb
  m - monkey    n - nut       ng - ring     p - paper     r - rabbit
  s - sun       sh - ship     t - tap       th - then     v - van
  w - was       y - yield     z - zebra     zh - treasure
Movements of the jaw, tongue and lips determine different phonemes. There are four
major classes of phonemes: voiced phonemes have the vocal folds vibrating; vowels have
no blockage of the vocal tract and no turbulence; consonants are non-vowels; and plosives
involve an "explosion" of air. In addition, all of these phoneme classes, and even finer
classes, can be captured by sound spectrum features. The sound spectrum
changes with the shape of the vocal tract. For example, "sh" has a similar spectral shape
to "s", but "s" has very high frequency content around 4.5 kHz, while "sh" is lower in
frequency because the tongue is further back.
From the previous section, we know that feature extraction gives us the frequency
spectrum information at a rate of one vector every 10 msec (one frame period). One phoneme
can span several to tens of frames, depending on the speaker's rate and the context. In
order to model this flexibility of the phoneme, we further divide a phoneme into several
consecutive states. There is no standard definition of the number of states for a
particular phoneme, and different researchers use slightly different definitions for
different purposes. The author defines about 3 emitting states for a particular phoneme.
Each state captures the distinguishing features of that phoneme over a defined
window period, and a state may be repeated multiple times before the model jumps to the
next state of the phoneme, because people spend different durations pronouncing the
phonemes.
2.3.2 The Theory of Hidden Markov Model
The most widely adopted statistical acoustic models in language processing are the
Hidden Markov Models (HMMs). It is a finite-state transducer and is defined by states
and the transitions between states. (Figure 2.5)
In a regular Markov model, the state is directly visible to the observer, thus the model
only has state transition probabilities parameters. In a Hidden Markov model, the state is
not directly visible, only output, which depends on the state, is visible. Therefore, the
HMM has both distribution model parameters for each state and the state transition
probabilities parameters.
Figure 2.5 Hidden Markov Models
As we can see from Figure 2.5, an HMM consists of a number of states. Each state j
has an associated observation probability distribution $b_j(o_t)$, which models the
probability of generating the observation $o_t$ at a particular time t. The observation $o_t$ is the
feature vector (MFCC) illustrated previously, which is sampled once per
frame period over a window duration. Each pair of states i and j has a modeled transition
probability $a_{ij}$. All these model parameters are obtained by a statistical data-driven
training method. In the experiments, the entry state 1 and the exit state N of an N-state
HMM are non-emitting states.
Figure 2.5 shows a HMM with five states, and the HMM can be used to model a
particular phoneme. The three emitting states (2-4) have output probability distributions
associated with them. Each probability distribution of a state is represented by a mixture
Gaussian density. For example, for state j the probability bj(ot) is given by
$$b_j(o_t) = \prod_{s=1}^{S}\left[\sum_{m=1}^{M_{js}} c_{jsm}\, \mathcal{N}(o_{st};\, \mu_{jsm},\, \Sigma_{jsm})\right]^{\gamma_s}$$

where $M_{js}$ is the number of mixture components in state j for stream s, $c_{jsm}$ is the
weight of the m'th Gaussian component in that stream, and $\mathcal{N}(\cdot;\, \mu, \Sigma)$ is a multivariate
Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$. The following equation shows
the multivariate Gaussian:

$$\mathcal{N}(o;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, \exp\!\left(-\tfrac{1}{2}(o - \mu)^{T} \Sigma^{-1} (o - \mu)\right)$$

where n is the dimensionality of o. The exponent $\gamma_s$ is a stream weight and its default
value is one. Generally, we only use a sum of Gaussian mixtures for a state,
and the stream-level parameters are ignored; practically, there is a lack of training data to
model so many parameters. Therefore, the commonly used output probability distribution is
given by

$$b_j(o_t) = \sum_{m=1}^{M_j} c_{jm}\, \mathcal{N}(o_t;\, \mu_{jm},\, \Sigma_{jm})$$
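As a small illustration of this simplified state output density, the sketch below sums diagonal-covariance Gaussians weighted by the mixture weights $c_{jm}$. The dimensionality, number of components and all parameter values are random placeholders, not parameters from any model in this thesis.

```python
import numpy as np

def gaussian_pdf(o, mean, var):
    """Diagonal-covariance multivariate Gaussian N(o; mean, diag(var))."""
    d = len(o)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
    return norm * np.exp(-0.5 * np.sum((o - mean) ** 2 / var))

def state_output_prob(o, weights, means, variances):
    """b_j(o_t) = sum_m c_jm * N(o_t; mu_jm, Sigma_jm) for one HMM state j."""
    return sum(c * gaussian_pdf(o, mu, var)
               for c, mu, var in zip(weights, means, variances))

rng = np.random.default_rng(0)
dim, num_mix = 39, 4                       # e.g. 39-dimensional MFCC-based features
weights = np.full(num_mix, 1.0 / num_mix)  # mixture weights c_jm, summing to one
means = rng.normal(size=(num_mix, dim))
variances = np.ones((num_mix, dim))
observation = rng.normal(size=dim)
print(state_output_prob(observation, weights, means, variances))
```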
Standard Hidden Markov Model has two important assumptions.
1. Instantaneous first-order transition:
The probability of making a transition to the next state is independent of the historical
states.
2. Conditional independence assumption:
The probability of observing a feature vector is independent of the historical
observations.
2.3.3 HMM Methods in Speech Recognition
In the ASR system, the role of the acoustic model is to decide the most likely phoneme for a
given series of observations. The hidden Markov model outputs the phoneme sequence in the
following way.
Firstly, we should know that each phoneme is associated with a sequence of
observations.
The following maximum a posteriori (MAP) criterion is then used to find the
best phoneme candidate associated with the given series of observations:

$$\hat{w} = \arg\max_{w} P(w \mid O)$$
The posterior probability cannot be modeled directly in the acoustic model, but it can be
calculated indirectly from the likelihood model according to Bayes' rule:

$$P(w \mid O) = \frac{P(O \mid w)\, P(w)}{P(O)}$$
In the above formula, the probability of the observations, P(O), is the same for every
candidate. The prior probability of a particular phoneme sequence, P(w), will be accounted
for by the lexical model or the language model, and can be ignored in the acoustic model.
Thus, the decision rule can be simplified as follows, so that only the likelihood of
the observations given the phoneme sequence matters:

$$\hat{w} = \arg\max_{w} P(O \mid w)$$
The likelihood can be calculated directly from the HMM. For example, consider a
six-state model (Figure 2.6). For a given set of observations, there are many possible
state sequences from the left-most non-emitting state to the right-most non-emitting
state. Take one possible sequence as an example, say X = 1; 2; 2; 3; 4; 4; 5; 6, assumed
for six frames of observations.

Figure 2.6 ASR using HMM
(Footnote: S. Young, G. Evermann, "The HTK Book", version 3.4, Cambridge University Engineering Department, p. 13, 2006.)

In Figure 2.6, when $o_1$ is read into the HMM, the model makes the jump from state 1 to
state 2; $a_{12}$ is equal to 1, and the Gaussian mixture probability in state 2 gives the
likelihood of state 2 given the observation $o_1$, which is $b_2(o_1)$. According to sequence X,
the model stays in state 2 for one more frame, with transition probability $a_{22}$ and the
likelihood of state 2 given the observation $o_2$, which is $b_2(o_2)$. When $o_3$ comes, the
sequence X jumps to the third state with transition probability $a_{23}$ and the likelihood of
state 3 given the observation $o_3$, which is $b_3(o_3)$. The summation of all outgoing
transition probabilities of a particular state is 1; in the case of state 2, the following
equation holds:

$$\sum_{j} a_{2j} = 1$$

Continuing in this way, the sequence X = 1; 2; 2; 3; 4; 4; 5; 6 is processed through the
six-state HMM phoneme model, and the likelihood of the observations belonging to this
particular phoneme model along the sequence X is given by

$$P(O, X \mid M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3) \cdots$$

Because multiple sequences exist in a single phoneme model, the likelihood that a
certain stretch of observations belongs to a phoneme model is the summation of the
likelihoods over all possible state sequences for that model:

$$P(O \mid M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)}$$

where X = x(1), x(2), x(3), ..., x(T).
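In practice the sum over all state sequences is computed with the forward algorithm rather than by enumeration. The sketch below (illustrative only; the transition and output probabilities are placeholders, not values from any model in this thesis) computes P(O|M) for a small left-to-right HMM with non-emitting entry and exit states, echoing the six-state example above.

```python
import numpy as np

def forward_likelihood(A, b):
    """P(O | M) for a left-to-right HMM with non-emitting entry/exit states.

    A : (N, N) transition matrix over all N states (state 0 = entry, N-1 = exit).
    b : (T, N) output probabilities b_j(o_t); columns 0 and N-1 are unused.
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    alpha[0, 1:-1] = A[0, 1:-1] * b[0, 1:-1]          # leave the entry state at t = 1
    for t in range(1, T):
        alpha[t, 1:-1] = (alpha[t - 1, 1:-1] @ A[1:-1, 1:-1]) * b[t, 1:-1]
    return alpha[-1, 1:-1] @ A[1:-1, -1]              # jump into the exit state

# Toy six-state model (entry, four emitting states, exit) and six frames of b_j(o_t).
A = np.array([[0, 1.0, 0,   0,   0,   0  ],
              [0, 0.6, 0.4, 0,   0,   0  ],
              [0, 0,   0.6, 0.4, 0,   0  ],
              [0, 0,   0,   0.6, 0.4, 0  ],
              [0, 0,   0,   0,   0.7, 0.3],
              [0, 0,   0,   0,   0,   0  ]])
b = np.full((6, 6), 0.1)                              # placeholder output probabilities
print(forward_likelihood(A, b))
```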
In word-level recognition, the phoneme model described above can be changed to
a word model. In a word model, the states are the concatenation of the state sequences of
every phoneme in the word, so the MAP search space is much larger.
When a language model is used, the P(w) term in the decision rule assigns a higher
score to frequently occurring words.
In the experiments, the HMM is trained by statistically adjusting the model
parameters to fit the training data. This requires a huge amount of data to train a
good model. Generally, an English acoustic HMM is trained with about 300
hours of speech; a Mandarin acoustic HMM requires even more, due to the larger
phone set of Mandarin.
Figure 2.7 Train HMM from Multiple Examples
(Footnote: K. C. Sim, "Acoustic Modelling Hidden Markov Model", Speech Processing, p. 8, 2010.)
2.3.4 Artificial Neural Network
A neural network, often referred to as an artificial neural network, is composed of artificial
neurons or nodes. When inputs are fed into the model, a weight is assigned to every
input node, and the summation of the scaled inputs is passed through a transfer function
at the activation node; mostly we use a log-sigmoid function, also known as a logistic
function, whose curve is shown in Figure 2.9. The outputs of these functions then become
the input nodes of the next layer, and so on; there can be several intermediate layers in
different neural network systems. In this research, we use a feedforward three-layer
neural network. In any case, every layer is essentially a simple mathematical model
defining a function of its inputs, but the model parameters are also intimately associated
with a particular learning algorithm or learning rule. The back-propagation algorithm is
currently used in many neural network applications.

Figure 2.8 Illustration of neural network
Figure 2.9 Sigmoidal function for NN activation node
A feedforward neural network is an artificial neural network in which connections
between the nodes do not form a directed cycle, unlike in recurrent neural
networks. (Figure 2.10)
Figure 2.10 feedforward neural network (left) vs recurrent neural network (right)
The back-propagation learning algorithm plays an important role in the neural network
training process; it can be divided into two phases: propagation and weight update.
Phase 1: Propagation
Each propagation involves the following steps: forward propagation of a training
pattern's input, so that one layer's output becomes the next layer's input until the final
layer's results are obtained; then backward propagation of the error, in which the final
layer's output is compared with the target, and the error terms of the previous layers are
calculated layer by layer from the current layer's error, ending when the first layer is
reached.
Phase 2: Weight update
For each weight, we multiply its output delta (the difference between the output and the
target output) by its input activation (the input node value) to get the gradient of the
weight. Then we move the weight in the opposite direction of the gradient by subtracting
a fraction of the gradient from the weight.
This fraction influences the speed and quality of learning; we call it the learning rate.
The sign of the gradient indicates the direction in which the error increases, which is why
the weight must be updated in the opposite direction.
Phases 1 and 2 are repeated until the performance of the network is good enough.
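The following compact Python sketch illustrates the two phases just described for a three-layer feedforward network with log-sigmoid activations trained by gradient descent. It is an illustration of the algorithm only, not the QuickNet implementation used later in this thesis; the layer sizes, learning rate and data are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Tiny three-layer network: 4 inputs -> 8 hidden units -> 3 outputs.
W1, b1 = rng.normal(scale=0.5, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 3)), np.zeros(3)
learning_rate = 0.5

X = rng.normal(size=(20, 4))                 # placeholder training patterns
Y = np.eye(3)[rng.integers(0, 3, size=20)]   # one-hot placeholder targets

for epoch in range(200):
    # Phase 1a: forward propagation, each layer's output feeding the next layer.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Phase 1b: back-propagate the error (output delta) towards the first layer.
    delta_out = (y - Y) * y * (1.0 - y)          # sigmoid derivative included
    delta_hid = (delta_out @ W2.T) * h * (1.0 - h)

    # Phase 2: weight update, stepping against the gradient by the learning rate.
    W2 -= learning_rate * h.T @ delta_out / len(X)
    b2 -= learning_rate * delta_out.mean(axis=0)
    W1 -= learning_rate * X.T @ delta_hid / len(X)
    b1 -= learning_rate * delta_hid.mean(axis=0)

print(float(np.mean((y - Y) ** 2)))              # squared error after training
```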
QuickNet is a suite of software that facilitates the use of multi-layer perceptrons
(MLPs) in statistical pattern recognition systems. It is primarily designed for use in
speech processing but may be useful in other areas.
2.3.5 ANN Methods in Speech Recognition
Recently, many research works have also taken advantage of multi-layer perceptron
(MLP) neural networks to train the acoustic model and to recognize speech using tandem
connectionist feature extraction, that is, using the output of a neural network classifier as
the input features for the Gaussian mixture models of a conventional speech recognizer
(HMM), so that the resulting system effectively has two acoustic models in tandem. A
hidden Markov model typically uses Gaussian mixture models to estimate the distributions
of decorrelated acoustic feature vectors that correspond to observations of the states of the
phonemes. In contrast, an artificial neural network model uses discriminative training to
estimate the probability distribution over states given the acoustic observations. The
traditional HMM is faster in training and recognition, and has better time alignment than
the neural network acoustic model, especially when the HMM is a context-dependent
model. On the other hand, the neural network model can capture state boundaries better,
and it is flexible to manipulate and modify for different applications.
The general idea of applying an ANN in language processing is to use the state
probability output of the ANN as the input of a standard hidden Markov model; the final
result can then be recognized directly using the HTK toolkit. In detail, we need to
convert the MFCC feature files into pfiles. The manipulation of pfiles and the neural
network training both require the QuickNet toolkit. The pfiles contain all the MFCC
coefficient information for every frame sample, and they have to be combined into a
single pfile. The neural network is trained towards target output values; therefore, we
first have to prepare a good acoustic model to obtain an alignment of the training data,
and then convert the alignment into the required format, which is an ilab file. (Figure 2.11)
Figure 2.11 ANN applied in acoustic model
(Footnote: The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models.)
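As a rough, hypothetical illustration of the tandem idea in Figure 2.11 (not the actual QuickNet/HTK pipeline used in this thesis), the sketch below turns per-frame state posteriors from a neural network into log, decorrelated features that a GMM-HMM front end could consume. The posterior matrix is random, and the state count and output dimensionality are assumptions.

```python
import numpy as np

def tandem_features(posteriors, num_dims=25):
    """Turn per-frame state posteriors into decorrelated tandem features.

    posteriors : (num_frames, num_states), rows summing to one.
    Returns a (num_frames, num_dims) feature matrix for a GMM-HMM recogniser.
    """
    logp = np.log(posteriors + 1e-10)            # log posteriors are closer to Gaussian
    centred = logp - logp.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:num_dims].T             # PCA projection for decorrelation

# Placeholder posteriors for 100 frames over 120 HMM states (e.g. 40 phones x 3 states).
rng = np.random.default_rng(2)
raw = rng.random((100, 120))
posteriors = raw / raw.sum(axis=1, keepdims=True)
print(tandem_features(posteriors).shape)         # (100, 25)
```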
2.4 Adaptation Techniques in Acoustic Model
2.4.1 SD and SI Acoustic model
With different training datasets, an acoustic model's performance differs.
A speaker-dependent (SD) acoustic model is an acoustic model that has been trained
using speech data from a particular person. An SD model will recognize that
particular person's speech well, but it is not a recommended model for recognizing
speech data from other people. However, an acoustic model trained on many speakers
can gradually move towards a speaker-dependent (SD) model when the adaptation
technique is applied.
A speaker-independent (SI) acoustic model is trained using speech data contributed by
a large population. A speaker-independent acoustic model may not perform as well as a
speaker-dependent acoustic model for a particular speaker, but it has much better
performance for the general population, since the SI model captures the general
characteristics of a group of people. Furthermore, the SD model captures intra-speaker
variability better than the SI model does, while the SI model captures the inter-speaker
variability better than the SD model does.
Usually, the dataset used in training deviates somewhat from the speech data being
tested, and the acoustic model performs worse as the deviation grows, for example,
American-accented versus British-accented English speakers, or native English speakers
versus non-native English speech.
To recognize speech from non-native speakers, we usually adapt a well-trained native
acoustic model using the limited non-native data, because native speech data can be
easily obtained from many open sources, while non-native speech data are rare and
diverse.
2.4.2 Model Adaptation using Linear Transformations (MLLR)
Maximum likelihood linear regression (MLLR) computes transformations that
reduce the mismatch between an original acoustic model and the non-native speech
data being tested. This calculation requires some adaptation data, which is speech data
similar to the test data. The transform matrices are obtained by solving a maximization
problem using the Expectation-Maximization (EM) technique. In detail, MLLR is a
model adaptation technique that estimates one or several linear transformations for the
mean and variance parameters of the Gaussian mixtures in the HMM. By applying the
transformation, the Gaussian mixture means and variances are shifted so that the
modified HMM can perform better on the testing dataset.
The transformation matrix estimated from the adaptation data transforms the
original Gaussian mixture mean vector to a new estimate,

$$\hat{\mu} = W\,\xi$$

where W is the $n \times (n+1)$ transformation matrix and $\xi$ is the original mean vector with
one more bias offset,

$$\xi = [\,w,\ \mu_1,\ \mu_2,\ \dots,\ \mu_n\,]^{T}$$

where w represents the bias offset; in the HTK toolkit it is fixed at 1.
W can be decomposed into

$$W = [\,b\ \ A\,]$$

where A is an $n \times n$ transformation matrix and b is a bias vector. This form of
transformation adapts the original Gaussian mixtures' means to the limited test data,
and the author will continue to explain how the variances in the Gaussian mixtures are
adapted.
Based on the standard in HTK, there are two ways to adapt the variances linearly.
The first is of the form

$$\hat{\Sigma} = B^{T} H\, B$$

where H is the linear transformation matrix to be estimated from the adaptation data and B
is the inverse of the Choleski factor of $\Sigma^{-1}$, which is

$$\Sigma^{-1} = C\,C^{T} \quad \text{and} \quad B = C^{-1}$$

The second way is

$$\hat{\Sigma} = H\,\Sigma\,H^{T}$$

where H is the $n \times n$ covariance transformation matrix. This transformation can be
easily implemented as a transformation of the mean and the features,

$$\mathcal{N}(o;\, \mu,\, H\Sigma H^{T}) = |A|\; \mathcal{N}(A o;\, A\mu,\, \Sigma)$$

where $A = H^{-1}$.
The transformation matrix is obtained by the Expectation-Maximization (EM)
technique, which can make use of the limited adaptation data to move the original
model a long way.
With an increasing amount of adaptation data, more transformation matrices can be
estimated, and each transformation matrix is used for a certain group of mixtures, which
is categorized in the regression class tree. For example, if only a small amount of data is
available, then maybe only one transform can be generated. This transformation is
applied to every Gaussian component in the model set. However, with more adaptation
data, two or even tens of transformations are estimated for different groups of mixtures in
the original acoustic model. Each transformation is then more specific, grouping the
Gaussian mixtures further into broad phone classes: silence, vowels, stops, glides,
nasals, fricatives, etc. Though the classification may not be so accurate when adapting to
non-native speech, since non-native speech is more confusable in phone classification.
Figure 2.12 Regression tree for MLLR
(Footnote: S. Young, G. Evermann, "The HTK Book", version 3.4, Cambridge University Engineering Department, p. 149, 2006.)
Figure 2.12 is a simple example of a binary regression tree with four base classes,
denoted as $\{C_4, C_5, C_6, C_7\}$. A solid arrow and circle mean that there is sufficient data
for a transformation matrix to be generated using the data associated with that class
(the threshold is usually defined by the researcher in the application), while a dotted line
and circle mean that there is insufficient data, for example nodes 5, 6 and 7. Therefore,
transformations are only constructed for nodes 2, 3 and 4, namely $W_2$, $W_3$ and $W_4$. The
data in group 5 will follow the transformation $W_2$, the data in groups 6 and 7 will share
the transformation $W_3$, while the data in group 4 will be transformed by $W_4$.
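To make the mean transform concrete, the sketch below applies regression-class-specific MLLR transforms of the form $W_r = [\,b_r\ A_r\,]$ to a set of Gaussian means, exactly as $\hat{\mu} = A\mu + b$ above. Estimating $W_r$ itself (the EM step) is not shown, and the transforms, class assignments and dimensions here are random placeholders rather than values from the thesis experiments.

```python
import numpy as np

def apply_mllr_mean(mean, W):
    """mu_hat = W xi, with W = [b A] (n x (n+1)) and xi = [1, mu] (bias offset w = 1)."""
    xi = np.concatenate(([1.0], mean))
    return W @ xi

def adapt_model(means, regression_class, transforms):
    """Adapt every Gaussian mean with the transform of its regression base class."""
    return np.array([apply_mllr_mean(mu, transforms[regression_class[i]])
                     for i, mu in enumerate(means)])

rng = np.random.default_rng(3)
n = 39                                             # feature dimensionality
means = rng.normal(size=(10, n))                   # 10 Gaussian components
regression_class = rng.integers(0, 3, size=10)     # component -> one of three transforms
transforms = {r: np.hstack([rng.normal(size=(n, 1)),                      # bias vector b
                            np.eye(n) + 0.01 * rng.normal(size=(n, n))])  # A near identity
              for r in range(3)}
print(adapt_model(means, regression_class, transforms).shape)   # (10, 39)
```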
2.4.3 Model adaptation using MAP
Using limited adaptation data to transfer the original acoustic model also can be
accomplished by the maximum a posteriori (MAP) adaptation technique. Sometimes,
MAP approach is referred as Bayesian adaptation.
MLLR is an example of what is called transformation based adaptation, the
parameters in a certain group of component model are transferred together with a single
transform matrix. In contrast to MLLR, MAP re-estimate the model parameters
individually. Sample mean values are calculated for the adaptation data. An updated
mean is then formed by shifting each of the original value toward the sample value.
In the original acoustic model, the parameters are informative priors generated from
previously seen data; they are speaker-independent model parameters. The update
formula for a single-stream system for state j and mixture component m is

$$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\,\bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\,\mu_{jm}$$

where $\tau$ is a weighting of the original model training data relative to the adaptation data,
$\bar{\mu}_{jm}$ is the mean estimated from the adaptation data only, and $N_{jm}$ is the occupation
likelihood of the adaptation data. If there is insufficient adaptation data for a phone to
reliably estimate a sample mean, the occupation likelihood will approach 0 and little
adaptation is performed.
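A one-function sketch of this MAP mean update follows; the value of $\tau$, the occupation counts and the means are placeholder numbers chosen only to show how the update interpolates between the prior mean and the adaptation-data mean.

```python
import numpy as np

def map_update_mean(prior_mean, adapt_mean, occupation, tau=20.0):
    """mu_hat = N/(N+tau) * adaptation-data mean + tau/(N+tau) * prior mean."""
    w = occupation / (occupation + tau)
    return w * adapt_mean + (1.0 - w) * prior_mean

prior = np.zeros(39)                       # speaker-independent mean (placeholder)
sample = np.full(39, 0.8)                  # mean estimated from adaptation data only
print(map_update_mean(prior, sample, occupation=5.0)[:3])    # little adaptation
print(map_update_mean(prior, sample, occupation=500.0)[:3])  # dominated by the data
```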
2.5 Lexical Model
Lexical Model is the model to form the bridge between the acoustic model and language
model, the lexical model defines the mapping between the words and the phone set.
Usually, it is a pronunciation dictionary. For multiple pronunciations of a word, the
probability of each pronunciation can be modeled too. (Table 2.2)
Table 2.2 The probability of the pronunciations

Word        Pronunciation              Probability
ABANDONED   ah b ae n d ah n d sil     0.5
ABANDONED   ah b ae n d ah n d sp      0.5
ABILITY     ah b ih l ah t iy sp       0.35
ABILITY     ah b ih l ah t iy sil      0.35
ABILITY     ah b ih l ah t iy sp       0.15
ABILITY     ah b ih l ah t iy sil      0.15
Pronouncing dictionaries are a valuable resource; when they are produced manually, they
may require a lot of investment. There are a number of commercial and public domain
dictionaries available; those dictionaries have different formats and use different phone
sets. Normally, we use the CMU dictionary, a machine-readable pronunciation
dictionary for North American English with a phone set of 39 phonemes, as a larger
phone set is currently difficult for machine learning to identify.
The following is an example of a dictionary used in HTK (Figure 2.13). The "</s>" entry
marks the sentence end and "<s>" the sentence start, and the empty output symbol "[]"
means that if the model recognizes a sequence of phones as most likely being the sentence
start or sentence end, it writes nothing to the output. If the acoustic model gives the highest
likelihood score to the phone sequence "ah sp" for the current observations, the recognizer
will output the word "A" in the output file.
Figure 2.13 Dictionary format in HTK
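A toy rendering of such a dictionary as a Python structure is shown below, with per-pronunciation probabilities as in Table 2.2. The ABANDONED entries are copied from the table above; the single-pronunciation entry for "A" is an assumed example added for illustration.

```python
import random

# Pronunciation dictionary: word -> list of (phone sequence, probability) pairs,
# mirroring Table 2.2 (the probabilities for a word sum to one).
LEXICON = {
    "ABANDONED": [("ah b ae n d ah n d sil", 0.5),
                  ("ah b ae n d ah n d sp", 0.5)],
    "A": [("ah sp", 1.0)],          # assumed entry, for illustration only
}

def pronunciations(word):
    """Return the possible phone sequences of a word with their probabilities."""
    return LEXICON.get(word.upper(), [])

def sample_pronunciation(word):
    """Pick one pronunciation at random according to the modelled probabilities."""
    phones, probs = zip(*pronunciations(word))
    return random.choices(phones, weights=probs, k=1)[0]

print(pronunciations("abandoned"))
print(sample_pronunciation("abandoned"))
```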
2.6 Language Model
A statistical language model assigns a probability to a sequence of m words; this models
the grammar of a sentence. The most frequently used language model in HTK
is the n-gram language model, which predicts each word in the sequence given its (n-1)
previous words.
The probability of a sentence can be decomposed as a product of conditional
probabilities:

$$P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$$

The n-gram model approximates each conditional probability so that it depends
only on the previous (n-1) words:

$$P(w_1, w_2, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$
Usually, we model bigrams or trigrams in our experiments.
The conditional probabilities are based on maximum likelihood estimates, that is, they
are obtained by counting the events in context in some given training text:

$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}$$

where C(·) is the count of a given word sequence in the training text.
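A minimal bigram estimator implementing this count ratio on a toy corpus is sketched below (no smoothing and no out-of-vocabulary handling, which the next paragraph touches on; the corpus sentences are invented examples).

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                  # counts of history words
        bigrams.update(zip(words[:-1], words[1:]))   # counts of word pairs
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

corpus = ["the cat sat", "the cat ran", "a dog sat"]
lm = train_bigram(corpus)
print(lm[("the", "cat")])     # 1.0: "cat" always follows "the" in this toy corpus
print(lm[("cat", "sat")])     # 0.5
```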
After the n-gram probabilities are stored in a database for the existing training data,
some words may be mapped to an out-of-vocabulary class (Footnote: when a word sequence
in the test data has not been seen in the training data, we define equivalence classes and
assign the new word sequence a probability from its class).
Sometimes, a more complicated language model can be achieved by constructing a
grammar-based word net:
Figure 2.14 The grammar based language model
(Footnote: K. C. Sim, "Statistical Language Model", Speech Processing, p. 34, 2010.)
In the grammar-based language model, the plural and singular forms, prefixes and so on
are defined manually according to standard English grammar, which is expensive and may
not be useful practically, since in non-native speech the grammar is poorly formed unless
the speech is read speech. In addition, the training data set of the LM should match the
test data set. For example, an LM trained on newspaper text would be a good predictor for
recognizing news reports, while the same LM would be a poor predictor for
recognizing personal conversations or speech in a hotel reservation system.
Chapter 3
Literature Review
3.1 Overview of the challenges in ASR for non-native speech
The history of spoken language technology is marked by the "SpeD" ("Speech and
Dialogue") conferences. After 2000, with the support of Academician Mihai Draganescu,
former President of the Romanian Academy, the organization of a conference in the field of
spoken language technology and human-computer dialogue (the first one was organized
in the 1980s) was resumed.
The 2nd edition of this conference (2003) showed the evolution from speech
technology to spoken language technology. Mihai Draganescu mentioned:
“This had to be foreseen however from the very beginning involving the use of artificial
intelligence, both for natural language processing and for acoustic - phonetic processes
of the spoken language. The language technology is seen today with two great
subdivisions: technologies of the written language and technologies of the spoken
language. These subdivisions have to work together in order to obtain a valuable and
efficient human-computer dialogue” [4].
In 2005, no dramatic changes had occurred in the research domain since the 2nd
conference. However, some trends became more and more obvious, and some new fields of
interest appeared promising for the future. Corneliu Burileanu summarized:
“We were able to identify a constant development of what is called “speech interface
technology” which includes automatic speech recognition, synthetic speech, and natural
language processing. We noticed commercial applications in computer command,
consumer, data entry, speech-to-text, telephone, and voice verification. Robust speaker-independent recognition systems for command and navigation in personal computers
were already available; telephone-based transaction and database inquiry systems using
both speech synthesis and recognition were coming into use” [5].
The 2007 SpeD edition is considered a very interesting and up-to-date analysis of
the achievements in the domain; it also presents the future trends identified at the "IEEE/ACL
Workshop on Spoken Language Technology, Aruba, Dec. 11-13, 2006". The following
research areas are strongly encouraged: spoken language understanding, dialog
management, spoken language generation, spoken document retrieval, information
extraction from speech, question answering from speech, spoken document
summarization, machine translation of spoken language, speech data mining and search,
voice-based human computer interfaces, spoken dialog systems, applications and
standards, multimodal processing, systems and standards, machine learning for spoken
language processing, and speech and language processing in the World Wide Web.
Biing-Hwang Juang and S. Furui present a summary of system-level capabilities for
spoken language translation:
• first dialog demonstration systems: 1989-1993, restricted vocabulary, constrained
speaking style, speed (2-10)x real-time, platform: workstations,
• one-way phrasebooks: 1997-present, restricted vocabulary, constrained speaking style,
speed (1-3)x real-time, handheld devices,
• spontaneous two-way systems: 1993-present, unrestricted vocabulary, spontaneous
speaking style, speed (1-5)x real-time, PCs/handheld devices,
• translation of broadcast news: 2003-present, unrestricted vocabulary, ready-prepared
speech, offline, PCs/PC clusters,
• simultaneous translation of lectures: 2005-present, unrestricted vocabulary,
spontaneous speech, real-time, PCs/laptops.
With recent developments in the spoken language technology domain, the target
research trends have become clearer, and the important challenges in those research
directions have appeared.
At the SpeD conference in 2007, Hermann Ney, from the Computer Science Department
of RWTH Aachen University, Germany, gave the "Closing Remarks: How to
Continue?". The main issue he emphasized in this domain is the interaction
between speech and NLP (natural language processing) people in many areas.
Since 2007, one important challenge in this domain has been "speech-to-speech
translation". The main issue is speech recognition improvement. One aspect of the main
issue is how to improve ASR for spontaneous, conversational speech in multiple
languages. Another aspect is that the translated text must be "speakable" for oral
communication, which means it is not enough to translate content adequately. A further
aspect is the cost-effective development of new languages and domains. The last aspect
is the challenge of translating intonation.
In the 4th SpeD edition (2007), an important field of research is multilingual
spoken language processing, cited from "Multilingual Spoken Language Processing":
“With more than 6,900 languages in the world and the current trend of globalization,
one of the most important challenges in spoken language technologies today is the need
to support multiple input and output languages, especially if applications are intended
for international markets, linguistically diverse user communities, and non-native
speakers. In many cases, these applications have to support multiple languages
simultaneously to meet the needs of a multicultural society. Consequently, new
algorithms and tools are required that support the simultaneous recognition of mixed-language input, the summarization of multilingual text and spoken documents, the
generation of output in the appropriate language, or the accurate translation from one
language to another” [6]
3.2 The solutions for non-native speech challenges
As illustrated in Section 3.1, spoken language technology increasingly emphasizes
improving ASR for spontaneous, conversational speech in multiple
languages. With globalization, foreign accent is an especially crucial problem that
ASR systems must address. Recognizing foreign-accented English speech with
performance similar to that on native-accented English speech is the challenge for the
acoustic model; foreign accents introduce disturbances at both the state level and the phone level.
The first challenge for the non-native acoustic model is that we cannot train it directly
from scratch, due to the difficulty of collecting enough non-native speech
data (about 300 to 500 hours). Usually, there are many open data sources for native
speech, for example the Wall Street Journal, Broadcast News and Newswire
Stories speech databases, and speech corpora from some research programs. Unfortunately,
there is a lack of such broadcast or recording resources for non-native speech. When a native
acoustic model is used to recognize non-native speech, the word error rate is usually about 2 to 3
times that of native speech [7]. Before the non-native acoustic model can be improved
to a high level, the lexical model and language model cannot be expected to perform
well.
Therefore, the author wants to focus her research on improving the acoustic model for
non-native speech, especially on recognizing Mandarin-accented non-native speech with
English as the target language.
In fact, the variation among non-native speakers, even with the same motherland
accent, is very large. Those differences are characterized by different levels of fluency
in pronunciation, different levels of familiarity with the target language, and different
individual mistakes in pronouncing unfamiliar words.
The presence of a multitude of accents among non-native speakers is unavoidable
even if we ignore the levels of proficiency, and this dramatically degrades ASR
performance. As mentioned in Section 3.1, the spoken language technology domain only
advocated research to address these challenges about three years ago, but at the beginning
of the 21st century some research tackling these challenges had already emerged. The most
straightforward approach is to train a standalone non-native acoustic model on non-native
speech data [8]; however, as just explained, non-native speech data is rarely publicly
available. Another approach is to apply general adaptation techniques such as
MLLR and MAP with some testing data, by which the baseline acoustic model is
modified toward a foreign accent [9]. Some researchers are working on multilingual
HMMs for non-native speech [10]. Some researchers find methods to combine the native
and limited non-native speech data, such as interpolation [11]. Some researchers apply
both recognizer combination methods and multilingual acoustic models to non-native
digit recognition [12].
In 2001, Tomokiyo wrote a dissertation, "Recognizing non-native speech:
Characterizing and adapting to non-native usage in LVCSR" [13], which carries out
detailed research to characterize low-proficiency non-native English spoken by Japanese
speakers. Properties such as fluency, vocabulary and pace in read and spontaneous speech
are measured for both general and proficiency-controlled data sets.
Figure 3.1 Summary of acoustic modeling results
Then Tomokiyo explores methods of adapting to non-native speech. A summary of
the individual contributions of each adaptation method is shown in Figure 3.1. Using
data of Japanese-accented English and native Japanese, together with the allophonic decision
tree from the previous characterization step, both MLLR and MAP adaptation
are applied with retraining and interpolation at the end, and a 29% relative WER reduction is
achieved over the baseline.
Tomokiyo's research shows us that non-native speech is very diverse, even when the
research is restricted to a specific source language, proficiency level and mode of speech.
From the characterization results, we see tremendous intra- and inter-speaker variation in the
production of spoken language. The study also shows that non-native speakers
sometimes generate common patterns and sometimes generate unique events that
defy classification. Nevertheless, this dissertation showed that by using a small amount
of non-native speech data, the recognition error for non-native speakers can be
effectively reduced.
Also in 2001, Wu and Chang discuss approaches in which a few test-speaker
sentences are used to modify already-trained SI models into a customized model,
presented in "Cohorts based custom models for rapid speaker and dialect adaptation" [14].
(Footnote: Tomokiyo, L. M. (2001). Recognizing non-native speech: Characterizing and adapting to non-native usage in LVCSR. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.)
It is well known that speaker-dependent acoustic models perform much better than
speaker-independent acoustic models, as inter-speaker variability is reduced; about a 50%
reduction can be achieved with an SD model compared with an SI model. In the paper, Wu
and Chang present an approach that uses as few as three sentences from the test speaker
to select the closest speakers (cohorts) from both the original training set and newly
available training speakers to construct customized models.
Firstly, the parameters of the speaker-adapted model for each "on-call" speaker are
estimated by the MLLR technique described in [15]. The authors adopt only two
criteria that can directly reflect the improvement in system performance to select
cohorts. The first one is the accuracy of the enrollment data (3 testing sentences) in a
syllable recognition task. The second one is the likelihood of the adaptation data after
forced alignment against the true transcriptions, which are the true transcript texts of the 3
testing sentences prepared in advance. With the enrollment data, the speakers are sorted
according to their likelihood, and the top N speakers with the highest likelihood are picked as
the cohorts. The final cohort list is tuned according to both the syllable accuracy and
the likelihood. The data from the speakers in the cohort are used to enhance the model
of the test speaker in many ways, such as retraining, MAP or MLLR.
(Figure 3.2)
Figure 3.2 Procedure for constructing the custom model
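The cohort selection itself is essentially a ranking problem. The following is a minimal sketch of the idea, assuming each candidate training speaker has already been scored by the forced-alignment likelihood of the enrollment data under that speaker's adapted model and by syllable accuracy; the function names and the way the two criteria are combined are illustrative, not Wu and Chang's actual implementation.

    def select_cohorts(candidates, top_n=10):
        """Pick the top-N cohort speakers for a test speaker.

        candidates: list of dicts like
            {"speaker": "spk01", "loglik": -1234.5, "syllable_acc": 0.62}
        where loglik is the forced-alignment log-likelihood of the enrollment
        sentences under that speaker's adapted model.
        """
        # Primary criterion: likelihood of the enrollment data.
        ranked = sorted(candidates, key=lambda c: c["loglik"], reverse=True)
        cohorts = ranked[:top_n]
        # Secondary tuning: keep only cohorts whose syllable accuracy is not
        # far below the best one (the 0.1 margin is an illustrative choice).
        best_acc = max(c["syllable_acc"] for c in cohorts)
        return [c for c in cohorts if c["syllable_acc"] >= best_acc - 0.1]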
The results from Wu and Chang show that the cohort-based custom model can sometimes achieve better performance even than the SA model (MLLR adaptation with 170 sentences), and a relative error reduction of 22% was obtained. A notable inspiration from their approach is that the adaptation scheme can be updated online without reconfiguring any parameters.
In 2003, Zhirong Wang et al. present a paper, "Comparison of acoustic model adaptation techniques on non-native speech" [16], with a thorough comparison of four different adaptation methods on non-native English data from German speakers. The four methods are bilingual models, speaker adaptation, acoustic model interpolation and Polyphone Decision Tree Specialization.
Firstly, they train a native English model and a German-accented English model on limited data. The WER of the native model is 16.2% on native test data and 43.5% on non-native test data, and the WER of the non-native model on non-native test data is 49.3%. When the native and non-native data are pooled together, the pooled model gives a WER of 42.7%.
A bilingual acoustic model had been trained earlier [17] with the English part of Verbmobil (ESST) and the German part of Verbmobil (GSST); ESST is the English speech data collected for the Verbmobil project, a long-term research project aimed at automatic speech-to-speech translation between English, German and Japanese, and GSST is the corresponding German speech data. The bilingual model is designed to improve the robustness of the recognizer against the accent of non-native speakers. The common phone set for English and German is derived by a knowledge-based (IPA) approach. The best WER obtained with the bilingual model on non-native data is 48.7%.
MLLR and MAP adaptation techniques are also investigated in the paper (Figure 3.3). With a sufficient amount of adaptation data (more than about 20 minutes), MAP reduces the WER more than MLLR, and for both techniques better adaptation is achieved by increasing the amount of data per speaker.
Figure 3.3 MAP and MLLR adaptation
Acoustic model interpolation produces a single output by weighted averaging of the PDFs of several models. In the pooled training, the amount of non-native speech data is very small compared with the native data. By adjusting the weight given to the non-native model, the interpolation reaches an optimum WER at a certain weight (Figure 3.4).
Figure 3.4 Performance with various interpolation weights
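As a minimal sketch of the idea, interpolation can be viewed as a per-state convex combination of the output densities of a native and a non-native model; the weight lam below corresponds to the interpolation weight swept in Figure 3.4, and the Gaussian-mixture representation of each model is an illustrative assumption rather than the authors' exact setup.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_likelihood(x, weights, means, covs):
        """Likelihood of observation x under one state's Gaussian mixture."""
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                   for w, m, c in zip(weights, means, covs))

    def interpolated_likelihood(x, native_gmm, nonnative_gmm, lam):
        """Weighted average of the native and non-native output PDFs of a state.
        native_gmm / nonnative_gmm are (weights, means, covs) tuples;
        lam is the weight on the non-native model (0 <= lam <= 1)."""
        return (1.0 - lam) * gmm_likelihood(x, *native_gmm) + \
               lam * gmm_likelihood(x, *nonnative_gmm)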
There is a large mismatch in context between the speech of native speakers and that of non-native speakers: in the decoding process, a context decision tree built from native speech is used to model the context of non-native speech. Zhirong Wang et al. adopt the Polyphone Decision Tree Specialization (PDTS) [18] method for adapting a decision tree to a new language. In this approach, the clustered multilingual polyphone decision tree continues the tree-growing process with a limited amount of adaptation data. The best WER obtained with PDTS is 35.5%.
The results of all the adaptation methods are compared in Figure 3.5. Based on this paper, MAP outperforms MLLR when more than about 20 minutes of adaptation data are available, interpolation outperforms MAP, and even better recognition results can be achieved with PDTS.
Figure 3.5 Best results of various systems
In 2003, Ayako Ikeno et al. published a paper, "Issues in recognition of Spanish-accented spontaneous English" [19], which describes the University of Colorado Large Vocabulary Speech Recognition system Sonic [20], compares the results obtained with different acoustic models and language models, and presents a method to characterize the full, non-schwa vowels of Spanish non-native English speakers.
In their research, in order to obtain Spanish-accented spontaneous English, they design conversational scenarios for data collection in which the responses from the speakers are spontaneous.
Table 3.1 Word error rate for the CSLR Sonic Recognizer
In 2002, Sonic was ported from English to the Spanish, Turkish, and Japanese languages. Sonic is a multi-pass recognition system and has been shown to have recognition accuracy competitive with other recognition systems; its performance is shown in Table 3.1.
During each recognition pass in the Sonic system, a voice activity detector (VAD) is dynamically updated using feedback from the currently adapted acoustic model, and the means and variances of the acoustic model are modified in an unsupervised way (Figure 3.6). The first pass uses gender-dependent acoustic models; the second pass uses vocal-tract-length-normalized models.
In this paper, the authors compare different acoustic models and language models for recognizing Spanish-accented speech. Two sets of acoustic models were used, one trained on the Wall Street Journal corpus and one trained on the accented speech. Three language models were selected: WSJ, Switchboard, and a language model trained on the accented data (Table 3.2).
Figure 3.6 Diagram of Hispanic-English multi-pass recognition system
Table 3.2 Word Error Rate (%) by Models
The paper also presents ways to capture the characteristically different vowel lengths of the non-native speakers. They first determine the average duration of reduced vowels, and then normalize these values by the average vowel duration in accented and native speech respectively. The normalized phoneme durations can later be used to modify the lexicon model so that it captures the full, non-schwa vowels of the Hispanic-English speakers.
In recent years, there has been notable research on non-native speech. Some papers propose transforming or adapting acoustic models based on pronunciation variation [21][22]. Other work proposes adjusting a language model built from native speech to the speaking style of non-native speakers [23]. Combining the two approaches, some hybrid modeling approaches have emerged [24].
The paper published by Yoo Rhee Oh (2009) [25] uses pronunciation variants extracted from non-native speech to modify both the pronunciation model and the acoustic model. The results do not show much improvement, but the paper presents pronunciation variant rules that map each target phoneme to a variant class.
Non-native speakers produce target pronunciations that are heavily biased by their mother-tongue accent. Inspired by this observation, many researchers have devised methods to integrate some source-language information into the acoustic model in order to improve the recognition result.
Some research focuses on mapping the source phones to the target phones; here we take Mandarin as the source language and English as the target language. The phone information of a well-trained Mandarin acoustic model is then added to the mapped English phones. By doing so, a Mandarin-accented pronunciation of an English phone can be accepted by the modified acoustic model.
In 2009, a research group, Qingqing Zhang, Jielin Pan and Yonghong Yan, published a paper in ISNN 2009, "Non-native Speech Recognition Based on Bilingual Model Modification at State Level" [26]. In their research, they first train good Mandarin and English acoustic models, both Hidden Markov Models. Using the initial English model, a phrase error rate of 46.9% is obtained. Pooling, MLLR and MAP adaptation techniques are then applied to the English model and compared; MAP is the best, with a 34.3% phrase error rate. After that, a state-level mapping algorithm is used to modify the adapted English model.
In the algorithm, the (adapted) English and Mandarin acoustic models are both used to recognize the same non-native speech data, yielding two parallel state-level time alignments. By setting a threshold, for example treating a 60% time overlap between a particular Mandarin state and an English state as a match, they count the actual number of matches (Figure 3.7). They then compute the probability that a particular Mandarin state matches a particular English state, and merge the information of the best n matching Mandarin states into the adapted English state. The resulting modified model recognizes the non-native speech with a phrase error rate of 31.6%.
Figure 3.7 Example of the match between state Target and Source
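A minimal sketch of the counting step is given below; it assumes each alignment is available as a list of (state, start_frame, end_frame) tuples and that a Mandarin state is matched to an English state when their frame overlap exceeds a threshold fraction of the Mandarin state's duration. The data layout and the normalization are illustrative, not the authors' implementation.

    from collections import defaultdict

    def state_match_counts(mandarin_align, english_align, threshold=0.6):
        """Count how often each Mandarin state overlaps each English state.

        Each alignment is a list of (state, start_frame, end_frame) tuples
        obtained from the two parallel decodings of the same utterance.
        """
        counts = defaultdict(lambda: defaultdict(int))
        for m_state, m_start, m_end in mandarin_align:
            m_dur = m_end - m_start
            for e_state, e_start, e_end in english_align:
                overlap = min(m_end, e_end) - max(m_start, e_start)
                if m_dur > 0 and overlap / m_dur >= threshold:
                    counts[m_state][e_state] += 1
        return counts

    def match_probabilities(counts):
        """Normalize counts into P(english_state | mandarin_state)."""
        probs = {}
        for m_state, row in counts.items():
            total = sum(row.values())
            probs[m_state] = {e: c / total for e, c in row.items()}
        return probs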
Among the approaches presented here, some research has indeed addressed particular problems in non-native ASR, but most of it does not deviate much from the adaptation techniques of MLLR, MAP or their combination. MAP tends to perform better with a larger amount of adaptation data (about 20 minutes). Although the interpolation method shows better performance than MAP, it requires feedback from the final recognition results, which is not practical in implementation.
Some other methods show even better results, such as Polyphone Decision Tree Specialization (PDTS), but PDTS is complex and needs a large amount of characterizing information.
Wu and Chang (2001) discuss the use of very little speech data from the test speakers to largely improve recognition results, which may be very helpful in realistic applications. The group of Qingqing Zhang (2009) proposes an algorithm to map Mandarin phonemes to English phonemes; a further small improvement over the MAP technique can be achieved, but the dynamic programming step takes a relatively long time to process. The Large Vocabulary Speech Recognition system Sonic from the University of Colorado integrates almost all the cutting-edge techniques in this field.
Chapter 4
Methods and Results
Although English is learned as a second language in many countries, there are many differences between non-native English speech and native English speech.
First, non-native speakers tend to pronounce phonemes differently: they may use a reduced or enlarged phone set, speak at a lower or higher rate, and make mistakes more frequently. As a result, a native acoustic model cannot recognize the phones pronounced by non-native speakers properly. This is what the author addresses in her first and second projects, and it is a crucial step in improving non-native speech recognition results.
Second, non-native speakers may also pronounce a word with a different phoneme sequence, inserting or deleting some phonemes; to address this problem the native lexicon model must be modified. The author modifies the native dictionary in her third project, which is a very difficult part of the whole ASR system.
Third, non-native speakers tend to organize their words with poor grammar, which means that the native English language model will perform poorly on non-native spontaneous speech. The author has not addressed this part yet.
4.1 Non-Native English Data Collection
As mentioned previously, there is a lack of public resource for non-native speech. And
for different group of non-native speakers, the speaking style, pronunciation and accent
are varied with the language familiarity level and the hometown language accent.
However, it requires expensive investment to collect enough non-native speech for
training of an acoustic model. Usually, a good native English acoustic model is trained
by a about 300 to 500 hour speech data, which can be easily obtained from public
resource, such as BBC broadcast, Walk Street Journal. However, we cannot find any
well recorded non-native broadcast or media. Therefore, we usually will collect the nonnative speech in the lab for the research.
The author has collected some Mandarin accented non-native speech for her later
research. The 6 participants (three male, three female) were from China, and all had
studied in English spoken University at least one year and had a median ability to
understand, speak and read English. Each speaker read out the same 630 sentences in a
room with little noise with the same microphone, sound card and software.
To collect the data the author use a speech recorder software designed for the HMM
research purpose, with high definition and will generate the required scripts during
recording. The command used is shown in Appendix Command 1.
The inputs and outputs of this command are shown in Figure 4.1.
Figure 4.1 The voice recording command inputs and outputs
After the recording, some of the utterances (i.e., spoken sentences) were just silence or not properly recorded, so those speech files and their associated transcriptions (the sentence texts given to the speakers to read) were removed. Only 500 of the 630 sentences were selected per speaker. The detailed amount of data is shown in Table 4.1.
Table 4.1 Non-native English collection data

    Speaker      Utterances   Time
    Speaker A    500          1.18 hr
    Speaker B    500          0.85 hr
    Speaker C    500          1.04 hr
    Speaker D    500          0.90 hr
    Speaker E    500          0.93 hr
    Speaker G    500          0.98 hr
    Total Time                5.89 hr
Though the collected data amounts to only about one hour per speaker, the actual recording time was about three hours per speaker, which was very time- and energy-consuming for the participants.
Data collection is a crucial step for the success of the subsequent non-native speech research. First, the amount of data must be large enough to achieve significant results. Second, the speakers must be chosen wisely for the purpose of the research; since the author wants to study non-native speech, she chose non-native participants with similar familiarity and fluency in the target language (the language the ASR system attempts to recognize). Third, carefully monitoring the collected data and removing empty or faulty recordings reduces unnecessary work in later research.
4.2 Project 1: Mixture-level mapping from the Mandarin acoustic model to the English acoustic model
4.2.1 Overview of the Method in Project 1
Project 1 focuses only on reducing the acoustic errors of non-native speakers; the task is to train a better acoustic model for Mandarin-accented non-native test data with limited non-native training data (less than 6 hours). Adaptation techniques therefore have to be applied to modify the native English model with such limited adaptation data.
In this project, we rely on the knowledge that most non-native speakers tend to use phonemes from their mother tongue when they speak the foreign language. Thus, we expect an improvement when information from the mother-tongue acoustic model is integrated into the native English acoustic model. Hence, the source-language acoustic model (the source language being the mother tongue of the non-native speaker) is also used in this project. The limited non-native speech data the author collected in Section 4.1 is used in this project and in the second project: a portion of it serves as adaptation and retraining data, and the remaining non-native speech data is used as test data for recognition.
In this project, there are three parallel processes (Figure 4.2). The first process uses the native English acoustic HMM to recognize the non-native speech data; we then apply the adaptation technique to the native English model and evaluate its recognition result on the non-native speech again; after that, we apply adaptive training to the acoustic model and use the final model to recognize the non-native speech.
The second process inserts Mandarin acoustic model information into the English model. As mentioned in Section 2.3, in an acoustic hidden Markov model (HMM) the top level of the model consists of the phonemes of a specific phone set. Each phoneme consists of multiple emitting states, each capturing the characteristics of a portion of the phoneme. Each state consists of a number of Gaussian mixtures (in this experiment, the author investigates HMMs with 1, 2, 4 and 8 mixtures per state), and the shape formed by the combination of these Gaussian mixtures determines the model of that state. In this project, the author inserts all the mixtures of a native Mandarin model into every state of an empty English model. As the author has native Mandarin models with different numbers of mixtures per state, she obtains five differently modified models in this way. All of these models are then retrained with the adaptation data and used to recognize the test data at the phone level. The best modified model is chosen, and MLLR adaptation is applied to it as the modified HMM model. The author then uses the adapted model to recognize the test data again and records the result. Lastly, adaptive training is applied to the adapted modified HMM model, and the final model is used to recognize the test data, with the result recorded.
Figure 4.2 Project 1 process flow (compare the phone recognition results of the three processes and see the improvement from left to right)
The third process is very similar to the second process. The only difference is that the Mandarin HMM mixtures are not inserted into an empty English HMM, but into the model obtained at the final stage of process 1. At that stage, the native English HMM has been retrained and adapted, so the mixtures in each of its states already contain English acoustic information and have been trained towards the non-native speakers' acoustic features. The author assigns some weight to the mixtures that already exist in each English HMM state and distributes the remaining weight over the mixtures from the whole Mandarin model. After that, all the steps are the same as in process 2: retraining and then adaptive training. The MLLR adaptation is in fact carried out in the first iteration of the adaptive training.
4.2.2 Step Details in Project 1
In this project, the author prepared the following data, models and tools for the implementation of all the processes:
- A well-trained native English HMM model (trained on about 300 hours of speech)
- 5 well-trained native Mandarin HMM models (from 1 mixture per state to 8 mixtures per state)
- The non-native speech data collected in the lab (70% used for adaptation and retraining, 30% used for testing)
- The Hidden Markov Model Toolkit (HTK), a portable open-source toolkit (for non-commercial use) for building and manipulating hidden Markov models
In the first process, the collected speech waveform files first have to be converted into MFCC feature files; the command used is given in Appendix Command 2. The inputs and output of this command are shown in Figure 4.3.
Figure 4.3 The inputs and output of the MFCC feature extraction
"HCopy" is an HTK tool that extracts the feature file for a given speech waveform file according to the specification in a configuration file. In the configuration file, the author has to specify, for example, the window size and the frame period. The script for wav-to-MFCC mapping specifies which waveform file is converted into which feature file, with directory and file name details.
Since the author already has a well-trained native English acoustic HMM model on hand, she can recognize the non-native speech in the test data set directly. The recognition command is given in Appendix Command 3. The inputs and outputs of this command are shown in Figure 4.4.
"HVite" is also a tool from the HTK toolkit. The configuration file here differs from that of the previous command, as it specifies the kind of test feature file at the input and whether context information is taken into account in the recognition process. The phone loop describes the allowed transitions between phones; normally it is constructed so that every phoneme can be followed by any other phoneme. The phone dictionary file is very simple: it just maps each phone symbol to its phoneme model. The monophone list is a list of all the phoneme models in the HMM set. "-s" is the grammar scale factor, which needs to be adjusted for word-level recognition. "-p" is the insertion penalty, which discourages insertion errors in the final recognition result. "-t" is the pruning threshold: lattice paths whose score falls below the threshold are pruned, so a tighter threshold makes recognition faster but lowers the output accuracy. Usually a moderate "-t" value is good enough and does not affect the final result much.
Figure 4.4 The inputs and output of the speech recognition command
After recognition, we can measure the phone-level accuracy or error rate against the known phone-level transcription of the test data. The command is given in Appendix Command 4; it prints the phone accuracy of the output directly on the screen.
Applying the MLLR adaptation technique is a little more complex, as it requires the generation of a regression tree. The command to generate the regression tree is given in Appendix Command 5.
The inputs and outputs of this command are shown in Figure 4.5. The configuration file here specifies the directory of the global configuration file and the maximum number of regression tree terminals. The global configuration file specifies how the model components are classified into groups (base classes). The output regression base-class file gives the index of each group and the model components under it, and the regression tree file labels each of its nodes with a base-class index.
Figure 4.5 The inputs and outputs of regression tree generation
The linear transformation adaptation matrices are generated by the two commands given in Appendix Command 6.
The inputs and outputs of the two commands are shown in Figure 4.6. These two commands are cascaded to generate a transform matrix for every regression-tree terminal that has sufficient data. The "-u" option must be set to the flag "a", which makes the tool HERest estimate linear transforms from the adaptation data so that the final transform matrices can be generated. The mask tells the software which part of the file name identifies the speaker, so that speaker dependence is captured during adaptation and different transform matrices are generated for different speakers.
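Conceptually, each generated transform is an affine transformation of the Gaussian means of one regression class. The sketch below illustrates what applying such a mean transform amounts to; it is a simplified picture (mean-only MLLR with a dense transform) rather than HTK's internal implementation.

    import numpy as np

    def apply_mllr_mean_transform(means, W):
        """Apply an MLLR mean transform to the Gaussian means of one regression class.

        means: array of shape (num_gaussians, dim)
        W:     transform of shape (dim, dim + 1), i.e. W = [b | A] acting on the
               extended mean vector [1, mu].
        Returns the adapted means A @ mu + b for every Gaussian.
        """
        extended = np.hstack([np.ones((means.shape[0], 1)), means])  # [1, mu]
        return extended @ W.T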
Figure 4.6 The inputs and outputs of the MLLR adaptation commands
Recognition with the adapted HMM model only requires some changes to the recognition command (Appendix Command 7).
The first HERest run generates the basic transform matrices for MLLR adaptation, and the second HERest run generates the transform matrices that can be used in the recognition process. In adaptive training we use the transforms generated by the second HERest run: the HMM model parameters are re-estimated on the adaptation data so as to reduce the differences between speakers captured by the transforms. In this experiment, the author restricts the system to updating only the mixture weights. The command is shown in Appendix Command 8.
49
“-u” choses the flag “w” will control the retraining to only update the Gaussian
mixtures weight, thus the mean and variance are kept as the original model. “-w” set as 2
will prevent some weight is decreased too much that the component will be get rid of in
the output HMM file, and the weight cannot decrease any more when it reaches a
threshold. In the adaptive training, after the new HMM model is obtained, the HERest2
is running again for the new HMM model, and then repeat the adaptive training again
using the new HMM model and new transform as the input, and so on so forth. The
recognition process for model after adaptive training is the same as the MLLR adapted
model.
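The weight-only update can be pictured as re-normalising the accumulated mixture occupation counts of each state and flooring them so that no component is pruned away. The following sketch is an illustrative analogue of the "-u w" / "-w 2" behaviour, not HTK's exact update formula; the floor value is an assumption.

    import numpy as np

    def reestimate_mixture_weights(occupancy, floor=1e-5):
        """Weight-only Baum-Welch style update for one state.

        occupancy: accumulated occupation counts gamma(s, m) summed over frames,
                   one entry per Gaussian component of the state.
        The means and variances are left untouched.
        """
        weights = occupancy / occupancy.sum()
        weights = np.maximum(weights, floor)   # keep every component alive
        return weights / weights.sum()         # re-normalise after flooring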
Figure 4.7 Illustration of the modified model
In the second process, the author inserts the Gaussian mixtures from a native Mandarin model into every state of an empty native English model. This is the most delicate part of this research, as it attempts to transfer the acoustic information of the Mandarin HMM structure into the English HMM structure. Figure 4.7 shows an illustration: the left side is the native Mandarin HMM model, with mixtures under each of its states, and the right side is the empty English HMM model, which has no mixtures under its states. The author modifies the empty English HMM model by inserting all the well-trained Gaussian mixtures from the native Mandarin HMM model into every empty English state. The Gaussian mixtures copied from the Mandarin states are assigned a constant weight, which ensures that the sum of all the mixture weights in one English state is 1. This can be seen more clearly from the composition of the likelihood output:
p(o_t \mid s) = \sum_{m=1}^{M} c_{s,m} \, p_m(o_t)

where M is the total number of mixtures taken from the Mandarin acoustic model (for example, there are in total 135 mixtures from the 1-mixture-per-state Mandarin model, i.e. 45 Mandarin phonemes with 3 states per phoneme) and all the initialized weights of the mixtures are c_{s,m} = 1/M. The likelihood of the m-th mixture (expert) of state s, p_m(o_t), is given by a Gaussian distribution:

p_m(o_t) = \mathcal{N}(o_t; \mu_m, \Sigma_m) = \frac{1}{(2\pi)^{d/2} |\Sigma_m|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(o_t - \mu_m)^{\top} \Sigma_m^{-1} (o_t - \mu_m)\Big)
A Perl script written by the author performs this modification.
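The author's actual implementation is a Perl script that edits the HTK model definition file; the Python sketch below only illustrates the underlying operation on a simplified in-memory representation of the models. The optional existing_weight argument corresponds to process 3, where some weight mass is kept for the mixtures already present in the English state; all names and the data layout are illustrative assumptions.

    def insert_mandarin_mixtures(english_states, mandarin_mixtures, existing_weight=0.0):
        """Insert all Mandarin Gaussian components into every English state.

        english_states:    dict state_name -> list of (weight, mean, var) tuples
                           (empty lists for the 'empty' English model of process 2).
        mandarin_mixtures: flat list of (mean, var) tuples gathered from every state
                           of the native Mandarin model (e.g. 135 for the
                           1-mixture-per-state model).
        existing_weight:   total weight kept for the mixtures already in the
                           English state (0.0 reproduces process 2).
        """
        m = len(mandarin_mixtures)
        shared = (1.0 - existing_weight) / m          # constant weight per Mandarin mixture
        for state, mixtures in english_states.items():
            # rescale any existing English mixtures so their weights sum to existing_weight
            old_total = sum(w for w, _, _ in mixtures)
            rescaled = [(w / old_total * existing_weight, mu, var)
                        for w, mu, var in mixtures] if old_total > 0 else []
            # append every Mandarin mixture with the shared uniform weight
            english_states[state] = rescaled + [(shared, mu, var)
                                                for mu, var in mandarin_mixtures]
        return english_states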
Different native Mandarin models are used in the modification process: Mandarin HMMs with 1, 2, 4, 8 and even more mixtures per state are investigated in this project. Every modified model is then retrained with the adaptation data (Appendix Command 9).
In that command, the flag "w" is used for "-u", which means that only the Gaussian mixture weights are updated in the output HMM model file. This retraining is iterated about 7 to 8 times, and the final HMM model is used by HVite to recognize the test data, with the result recorded. This retraining process is the so-called Baum-Welch re-estimation. Note that this retraining command does not include "-w 2", so there is nothing preventing a mixture weight from dropping low enough to be pruned off. Thus, during retraining, the number of mixtures per state keeps shrinking in later iterations; the author calls this the self-shrink mechanism, which helps to remove redundant mixtures automatically.
After the retraining, the modified model with the best recognition result is chosen, and the MLLR adaptation and adaptive training steps are repeated as described in process 1.
The third process differs from the second process only in its first step: the author inserts the Gaussian mixtures from a native Mandarin model into every state of a non-empty native English model, namely the same model used in process 1. In addition, some weight has to be assigned to the existing native English mixtures during the modification (as illustrated by the existing_weight argument in the sketch above). After this, retraining, model selection and adaptive training follow one after another.
4.2.3 Project 1 Results
Figure 4.8 shows the results of Project 1 at each step of every process. The native English acoustic model has the poorest recognition result for non-native speech, a phone error rate of 71.8%. MLLR adaptation and adaptive training each contribute some improvement to recognition performance, in process 1 as well as in process 2. Process 2 has a lower phone error rate than process 1 after the same techniques are applied, which suggests that it is better to use Mandarin model mixtures than English model mixtures to recognize Mandarin-accented speech. Process 3 shows that a further improvement can be achieved by combining the mixtures of both models.
Figure 4.8 The results for Project 1
Table 4.2 shows that the modified (initially empty) English model achieves a lower PER when it is built from a more complex Mandarin model. However, in the experiment the 16- and 32-mixture-per-state Mandarin models could not be processed in practice, because the initial weight assigned to each mixture was too small and the mixtures were easily pruned away during the first training round.
Table 4.2 Mandarin-mixture-modified empty English model

    Mandarin Model       Initial English Model   English Model after retraining
    1-mixture Mandarin   135 mixtures            Eng_M1: 62 mixtures, PER 72.56%
    2-mixture Mandarin   282 mixtures            Eng_M2: 103 mixtures, PER 69.62%
    4-mixture Mandarin   564 mixtures            Eng_M4: 117 mixtures, PER 68.52%
    8-mixture Mandarin   1128 mixtures           Eng_M8: 237 mixtures, PER 67.09%
4.3 Project 2: PoE to combine the bilingual NN models
4.3.1 Overview of the Method in Project 2
This project continues to focus on acoustic model adaptation for non-native speech, which is the most basic and important recognition step in a non-native ASR system: without good phoneme output from this part, the final sentence-level output cannot be recognized properly.
Figure 4.9 Project 2 Process flow
Project 2 implements a neural network acoustic model (Sections 2.3.4, 2.3.5) and uses the product-of-experts technique (Section 4.3.2). In this project we again expect an improvement in recognition results when the Mandarin acoustic model and the English acoustic model, both in neural network (NN) form, are combined.
Figure 4.9 shows the general process flow for this project; the final recognition results of the three processes are compared. The first process develops a neural network model that uses the information from a native English acoustic model; then another, two-layer neural network is developed to further improve the recognition result and to parallel the other two processes. After that, the neural network output is post-processed: the probability p of each state is converted into a distribution position x, i.e., if p = g(x) for a Gaussian function g, then x = g^{-1}(p). Then we pass the distribution values of the test data into an initialized HMM. Finally, the HVite output is the recognition result of the two cascaded neural network models.
The second process differs from the first process only in its first step, in which its first neural network model is developed with the information from a native Mandarin acoustic model. All the other steps repeat those of process 1. One difference is worth mentioning: the Mandarin NN model has 138 output nodes, and these are mapped to the 117 nodes of the second two-layer NN model, which is important so that the output matches the states of the English phone set.
The third process also differs from the first process in its first step: it combines the post-edited results of the first NN models of the previous two processes and then uses the combined results as the input of its second neural network model.
4.3.2 Step Details in Project 2
The prepared material for this project is as follows:
- A well-trained native English model (based on the broadcast data sources TDT2 and TDT3, which are recorded broadcast corpora with speech transcriptions and other information)
- A well-trained Mandarin acoustic model
- The QuickNet toolkit, a suite of software that facilitates the use of multi-layer perceptrons (MLPs) in statistical pattern recognition systems
- The non-native, Mandarin-accented speech collected in Section 4.1, about 6 hours, 50% for training and the rest for testing
In the first step of process 1, the author trains a neural network acoustic model using an ilab target file aligned with a well-trained native English acoustic model. To train the NN acoustic model, the author has to prepare the NN input file in pfile format and the target file in ilab format. Neural network acoustic model training has already been explained in Sections 2.3.4 and 2.3.5, and the method applied here is the same as the one described in Section 2.3.5.
Figure 4.10 NN model 1 training
The details of the first NN model training are shown in Figure 4.10. The input files are converted from the non-native training speech feature files; the pfile is only another format and still contains all the feature values. The pfile conversion is done by a tool in QuickNet, which can be compiled and used easily following the examples. Every feature value in the pfile is sent to the function nodes in the middle layer, each of which computes a non-linear function; the result of each middle node is sent to all the nodes in the final layer, which again compute a non-linear function (usually a sigmoid). From the input layer to the second layer, each connection multiplies its input by a weight, and different arcs from layer 1 to layer 2 have different weights; weights also exist on the arcs between layer 2 and layer 3. One input frame has 39 feature values in this experiment, so there are 39 input nodes per frame. However, multiple frames, e.g. 3, 5, 7, 11 or 13, can be presented at the input together, so that information from neighbouring frames is captured as well. In this research, an 11-frame window (window_size = 11) proved to be optimal.

The final layer has a fixed number of output nodes, equal to the number of states of the corresponding acoustic model: in this research, the English acoustic model has 117 states while the Mandarin acoustic model has 135 states. The final-layer result gives the probability that the input frame (the frame in the middle of the window, e.g. the second of 3 input frames or the sixth of 11 input frames) belongs to each of the output states. As we know, one feature file corresponds to one frame period, and this frame can be classified as a particular state of a particular phoneme; the pfile is a transformation of the feature file and captures the state-level characteristics as well. The final-layer results have a different value for each node (which stands for a state), and the state with the highest probability is the state recognized for that frame by the NN network.

Initially, however, the NN is initialized with identical arc weights and non-linear function parameters, so the final-layer results are very different from what we expect. The ilab file labels only the correct state of each frame as 1 and all other states as 0; it is converted from an alignment file by scripts written by the author. The alignment file is obtained by force-aligning the non-native training speech against its transcription with a native English acoustic model. The NN final-layer results are compared with the correct targets from the ilab file, the difference is fed back to the final layer's functions and arc weights so that they are updated to reduce the error, the error is then propagated back to the second layer's weights in the same way, and so on. The system is trained with the training data and ilab files, and a larger training data size leads to better system performance.
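A minimal sketch of such a frame-classification MLP is given below, using the sizes quoted in the text (39 features per frame, an 11-frame context window, 117 English state outputs); the hidden-layer size and the use of NumPy instead of QuickNet are illustrative assumptions, and no training loop is shown.

    import numpy as np

    FEAT_DIM, WINDOW, NUM_STATES, HIDDEN = 39, 11, 117, 1000  # HIDDEN is an assumption

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.01, size=(WINDOW * FEAT_DIM, HIDDEN))   # layer 1 -> 2 weights
    W2 = rng.normal(scale=0.01, size=(HIDDEN, NUM_STATES))          # layer 2 -> 3 weights

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def frame_posteriors(features, t):
        """State posteriors for frame t given an 11-frame window of 39-dim features.
        features: array of shape (num_frames, 39)."""
        half = WINDOW // 2
        idx = np.clip(np.arange(t - half, t + half + 1), 0, len(features) - 1)
        x = features[idx].reshape(-1)                    # 429-dimensional input
        h = 1.0 / (1.0 + np.exp(-(x @ W1)))              # sigmoid hidden layer
        return softmax(h @ W2)                           # posterior over 117 states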
After the first NN model is trained, its output for the training data contains 117 values per frame, which stand for the likelihood of that frame belonging to each of the 117 states. The post-editing for this part first applies the log function to each likelihood value and then converts the edited results into a pfile to serve as the input of the second NN model training.
Figure 4.11 NN model 2 training (PoE)
The second model is a two-layer NN (Figure 4.11). The ilab target file is the same as in the previous model training. Without the log pre-processing, the output of a two-layer NN depends on a weighted sum of its inputs:

y_j = \sigma\!\Big(\sum_i w_{ij}\, x_i\Big)

After the log editing of the inputs, the output depends on a product of the inputs:

y_j = \sigma\!\Big(\sum_i w_{ij} \log x_i\Big) = \sigma\!\Big(\log \prod_i x_i^{\,w_{ij}}\Big)
This is why the second NN model is called a product of experts: the final result is computed from the product of the input values, each raised to the power of its arc weight. For the English-only process only slightly more information can be learned in this step, and the gain is not significant. However, the step has to be implemented for the Mandarin model training and the combined model training, because in those two processes the number of states per frame at the output of the first model differs from the number of states in the English phone set, and the outputs have to be converted and trained towards English speech in this step.
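The sketch below illustrates the product-of-experts combination of the second stage: the state posteriors produced by the first-stage networks are log-transformed and fed through a single weight layer with a softmax output over the 117 English states. It is an illustrative NumPy analogue of the QuickNet setup rather than the author's exact configuration, and the combined 252-dimensional input corresponds to process 3.

    import numpy as np

    def poe_combine(english_post, mandarin_post, W, b):
        """Second-stage product-of-experts layer.

        english_post:  first-stage English posteriors for one frame (117 values)
        mandarin_post: first-stage Mandarin posteriors for one frame (135 values)
        W, b:          trained weights (252 x 117) and bias (117) of the PoE layer
        """
        eps = 1e-10                                   # avoid log(0)
        x = np.log(np.concatenate([english_post, mandarin_post]) + eps)
        z = x @ W + b                                 # weighted sum of log posteriors,
        z = z - z.max()                               # i.e. log of a product of experts
        e = np.exp(z)
        return e / e.sum()                            # posterior over 117 English states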
After the second NN model is trained, the author has to edit the result further so that it can serve as the input of the initialized HMM; only the result for the test data is of concern here. The first and second steps use the training data to update the model parameters, and the test data has to follow the same format as the training data and pass through the two models again. As mentioned previously, the values in the result stand for the likelihood of the frame belonging to each of the states. After the author obtains these probability values, she converts each probability value into its distribution position (Figure 4.12): the value x is obtained by applying the inverse of the Gaussian function to the probability, and only the positive x value is taken to represent a state for a frame.
Figure 4.12 The inverse function of the Gaussian
The initialized HMM for this application has a feature dimension equal to its number of states. Every state contains only one Gaussian mixture, whose means are 0 in all dimensions; one variance value is relatively small (in this experiment, 1 by default) and all the other variance values are very large (in this experiment, 1000 by default). The position of the small variance identifies that particular state, consistent with the state ordering in the ilab file. Thus, when the distribution file is fed in, the state whose x value is closer to the mean 0 is assigned a higher probability score, and the combined state scores contribute to the final phoneme recognition result.
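A minimal sketch of this post-processing is shown below, assuming the "Gaussian function" is the standard normal density g(x) = (2*pi)^{-1/2} exp(-x^2/2), so that x = g^{-1}(p) is the positive root of -2 ln(sqrt(2*pi) p); clipping probabilities above the density's maximum is an assumption made so the square root stays real. The scoring function drops terms common to all states.

    import numpy as np

    def posterior_to_position(p):
        """Convert a state posterior p into its 'distribution position' x with
        g(x) = p for the standard normal density g, taking the positive root."""
        peak = 1.0 / np.sqrt(2.0 * np.pi)          # maximum of the standard normal pdf
        p = np.clip(p, 1e-12, peak)                # keep the argument of the log valid
        return np.sqrt(-2.0 * np.log(p * np.sqrt(2.0 * np.pi)))

    def state_scores(positions, small_var=1.0, big_var=1000.0):
        """Relative log-likelihood score of each state of the initialized HMM for
        one frame. State s has mean 0 everywhere, variance small_var in dimension
        s and big_var elsewhere; a position near 0 in dimension s favours state s."""
        positions = np.asarray(positions, dtype=float)
        return -0.5 * positions**2 * (1.0 / small_var - 1.0 / big_var)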
The second process differs from the first process in its first two steps. In the first step, a native Mandarin model is used to obtain the alignment for the training data, and the alignment is then converted into an ilab file; hence the number of output states is consistent with the Mandarin states (135). In the second step, the product-of-experts NN is trained: its input pfile has 135 feature values per frame, but it is trained towards an output with 117 nodes, and the ilab file used in this training is aligned with the native English acoustic model. This step transfers the information in the native Mandarin acoustic model into the space of the English acoustic model. After that, the author uses an initialized HMM to decode the test-data results of the second model.
The third process combines the results of the first model of process 1 and the first model of process 2. The former produces a result with 117 features per frame and the latter with 135 features per frame, so the combined result has 252 features per frame. The combined results are the input of the second model of process 3, which is still trained towards the English ilab file, with a fixed 117 output nodes. In this way the native Mandarin and native English acoustic model information is integrated into the two-stage NN model, and a better final result is expected. The last step of the third process is the same as in the first and second processes.
4.3.3 Project 2 Results
The results of Project 2 are shown in Table 4.3. The third process trains the best acoustic model, because it uses the information from both the native Mandarin acoustic model and the native English acoustic model.
Table 4.3 Results from the NN PoE

    Process   Model                                          PER
    1         Native English model after PoE                 63.90%
    2         Native Mandarin model after PoE                64.12%
    3         Merged results of the two models after PoE     61.12%
4.4 Project 3: Training a Non-native Speech Lexicon Model
4.4.1 Overview of the Method in Project 3
In this project, the author focuses on training a lexicon model that captures the vocabulary pronunciation of non-native speakers, which deviates considerably from the standard dictionary used in native English speech recognition.
Figure 4.13 Project 3 process flow: (1) divide the non-native speech into train and test data sets; (2) use a non-native acoustic model to recognize and force-align the train data set; (3) feed the standard dictionary, the train-set recognition output and the alignment into a dictionary-updating system to obtain a new dictionary; (4) keeping the same acoustic model, use the new dictionary and the standard dictionary to recognize the test data set.
Figure 4.13 gives an overview of Project 3. First, the author divides the non-native data into two parts: a train set for non-native lexical model training and a test set.
In order to update the standard lexical model in a statistical, data-driven way, the author developed a system composed of several Perl scripts; more details are given in the next section. The system requires the phone-level recognition result of the train data set and its corresponding alignment, which have to be prepared in advance.
With these three inputs, the recognition output, the alignment and the standard dictionary, the dictionary-updating system generates a new dictionary, which has modified, added or deleted pronunciations for some of the words and assigns probability weights to the pronunciation variants of a word.
The last step is to recognize the test data set at the word level with the standard dictionary and with the updated dictionary individually, and to compare the results.
4.4.2 Step Details in Project 3
The prepared material for this project is as follows:
- A well-trained non-native English model (based on data collected from the author's supervisor's students)
- The non-native speech: multi-national, about 17 speakers, each contributing about 100 to 200 sentences with one word per sentence; 14 speakers for the train data set and the rest for the test data set
- The Hidden Markov Model Toolkit
Steps 1 and 4 are very simple, and the author does not describe them further in this section.
The non-native acoustic model used in step 2 is a triphone HMM, i.e. a context-dependent model, which performs better than a monophone acoustic model. Moreover, since the non-native acoustic model only has to recognize one word per sentence, it does not need to be particularly robust or require a large amount of training data. The recognition process simply follows the steps of Project 1 using HVite; the forced alignment can be obtained by adding the "-a" option and an "-I" label file to the HVite command.
Figure 4.14 Project 3 Step 3 Process flow
Step 3 uses the information from the train data set to update the standard dictionary; this is done by a system of Perl scripts. The system is illustrated in Figure 4.14, where the scripts are in the gray boxes and the inputs and outputs are in the white boxes. Each script plays an important role, and some scripts involve sub-scripts.
The script multiphone.perl and its sub-script multiphone2.perl compare the information from the recognition result and the alignment result: they match each aligned word with the corresponding phone-level recognition result based on their IDs and time frames, and output a list that maps each matched word to a recognized phone sequence.
The script dict_update.perl collects all the pronunciations listed in write.list under their matched words, and it also adds the standard pronunciation from the standard dictionary to each word; for example, the word COMMUNIST has the format shown in Figure 4.15.

    COMMUNIST
      k aa m y ah n ah s t 1
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      sil k ah m y uw n ih t iy z sil
      k ah m y uw n ih t iy z
      k aa m y ah n ah s t
      k aa m y ah n ah s t
      k ah m y uw n ih t iy z
      k aa n ah m ih s t

Figure 4.15 An example entry in combine.list
The first pronunciation is from the standard dictionary and is therefore marked with a 1. The following pronunciations are sorted from the write list and are actually samples of the recognition results. The script find_new_pronounce.perl checks for redundancy in the pronunciation samples and makes some modifications. For example, in Figure 4.15, COMMUNIST has the pronunciation sample "sil k ah m y uw n ih t iy z sil"; "sil" and "sp" are silence phonemes and are useless for training the new lexicon model, so they are deleted here.
The script find_general.perl counts the number of occurrences of each pronunciation and aligns every phoneme of a sample pronunciation to the matching phoneme of the standard pronunciation. This script has several sub-scripts; the statistics are counted by the sub-script combine.statistic.perl. To align the positions, we compute a Levenshtein-style alignment between the sample pronunciation and the standard pronunciation, and find the right positions from the maximum score obtained by matching the sample pronunciation against the standard pronunciation. A deletion or insertion scores 0; if two phonemes match exactly, a score of 10 is given; if two phonemes are very similar, a score of 5 is given; and if two phonemes are not similar, a score of 2 is given. The maximum score is computed with dynamic programming, and the alignment of the sample pronunciation is recovered from the score matrix. The similar phonemes are categorized in the script category.perl.
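The alignment itself is a standard dynamic-programming procedure. The sketch below illustrates the scoring scheme described above (10 for an exact match, 5 for similar phonemes, 2 for dissimilar phonemes, 0 for an insertion or deletion); the similar() predicate stands in for the phoneme categories defined in category.perl and is an assumption here, as is the Python rendering of the Perl script.

    def align_score(sample, standard, similar=lambda a, b: False):
        """Score matrix for aligning a sample pronunciation to the standard one.

        sample, standard: lists of phoneme symbols.
        similar: predicate saying whether two different phonemes belong to the
                 same category (as defined by category.perl in the real system).
        Returns the DP score matrix; the best total score is score[-1][-1], and
        the alignment can be recovered by backtracking through the matrix.
        """
        def sub_score(a, b):
            if a == b:
                return 10          # exact phoneme match
            if similar(a, b):
                return 5           # phonemes in the same category
            return 2               # dissimilar phonemes

        n, m = len(sample), len(standard)
        score = [[0] * (m + 1) for _ in range(n + 1)]   # insertions/deletions add 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                score[i][j] = max(score[i - 1][j],                 # delete sample[i-1]
                                  score[i][j - 1],                 # insert standard[j-1]
                                  score[i - 1][j - 1] + sub_score(sample[i - 1],
                                                                  standard[j - 1]))
        return score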
    COMMUNIST
      k aa m y ah n ah s t 1 1
      k aa m y ah n ah s t 9
      k ah m y uw n ih t 3
      k aa   n ah m ih 1

Figure 4.16 An example of the output of find_general.perl
The script update_train_words.perl finds the most popular phoneme at each standard position of a particular word; the phoneme sequence built from the most frequently occurring phoneme at every position is taken as the best pronunciation for that word. Each position may also have a second-best phoneme; updating the single position whose second-best phoneme has a relatively higher rate than the other second-best phonemes yields the second-best pronunciation for that word. The rate of the second-best pronunciation is also written into the new dictionary as its weight relative to the best pronunciation. For example, the final entry written into the dictionary file for the word COMMUNIST has the following format.
    COMMUNIST
      1      k aa m y ah n ah s t
      0.333  k aa m y uw n ah s t

Figure 4.17 An example entry in the updated dictionary
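A condensed sketch of the voting scheme described above is given below; it assumes the sample pronunciations have already been padded to the standard positions by the alignment step, and the way the second-best pronunciation and its weight are chosen is a simplified reading of the description, not the exact Perl implementation.

    from collections import Counter

    def vote_pronunciations(aligned_samples):
        """Derive the best and second-best pronunciation of one word.

        aligned_samples: list of pronunciations, each a list of phonemes already
        aligned to the standard positions (same length for every sample).
        Returns (best_pron, second_pron, second_weight).
        """
        positions = list(zip(*aligned_samples))            # phonemes per position
        counts = [Counter(p) for p in positions]
        best = [c.most_common(1)[0][0] for c in counts]

        # find the position whose runner-up phoneme is relatively most frequent
        second, weight, pos = None, 0.0, None
        for i, c in enumerate(counts):
            ranked = c.most_common(2)
            if len(ranked) > 1:
                rate = ranked[1][1] / sum(c.values())
                if rate > weight:
                    weight, pos, second = rate, i, ranked[1][0]

        second_pron = list(best)
        if pos is not None:
            second_pron[pos] = second
        return best, second_pron, weight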
4.4.3 Project 3 Results
The result of Project 3 is shown in Table 4.4; the percentages are word-level accuracies of the recognition process. With the same acoustic model, the updated dictionary shows a 2-percentage-point improvement for one non-native speaker, whose pronunciation is far below average, but does not show any improvement for the other two non-native speakers.
Table 4.4 Results for Project 3

    Dictionary      Speaker A   Speaker B   Speaker C   All
    Standard dict   80.4%       84.43%      38%         71.50%
    Updated dict    80.4%       84.43%      40%         71.97%
4.5 PROJECTS' ACHIEVEMENTS AND PROBLEMS
4.5.1 THE ACHIEVEMENTS AND PROBLEMS IN THE ACOUSTIC MODEL
Project 1 and Project 2 both show promising results for the combined model. In Project 1, the result of process 3 is 1.27% better than that of process 2 and 3.68% better than that of process 1. In Project 2, the result of process 3 is 2.78% better than that of process 1 and 3.0% better than that of process 2. Both projects obtained the expected improvements and both achieved their goal of showing that the information contained in the Mandarin model and the English model can be combined to give a better recognition result for Mandarin-accented English speech.
In addition, the results obtained in the two projects are quite promising given that the training data is very limited: the first project uses about 4.2 hours of non-native speech data for training, while the second project uses only 3 hours. With enough funding to obtain more research data, for example 20 hours of Mandarin-accented English speech, the results would be much better, because all the training and adaptation processes are data-driven.
However, neither project achieved a good baseline: a phone error rate of 54.55% is the best overall result for Project 1, and a phone error rate of 61.12% is the best overall result for Project 2. This is due to the limited training data; non-native speech research is normally carried out with about 20 hours of non-native training data.
The work in Project 1 (Section 4.2) shows an unexpected result: the modified model (the empty English model with Mandarin mixtures inserted) performs better than the native English model. In the final process results, the modified-model process has a PER of 55.82%, while the native English model process has a PER of 58.23%. We know that non-native speakers' pronunciation is heavily biased by their mother-tongue accent, but their speech is still English rather than Mandarin, so it is surprising that the acoustic model built from Mandarin mixtures performs better than the one built from English mixtures. In Project 2, however, the model trained with Mandarin acoustic information performs worse than the model trained with English acoustic information: the end result for process 2 of Project 2 is 64.12%, while that for process 1 is 63.9%. The different outcomes may be due to the different training data distributions.
4.5.2 THE ACHIEVEMENTS AND PROBLEMS IN THE LEXICON MODEL
The results in Project 3 show that the updated dictionary works for one speaker whose pronunciation is too far from standard to be captured by the standard dictionary.
However, the results also show no improvement in the recognition output of the other two speakers. This is mainly due to two causes. First, the other two speakers' recognition outputs contain few mistakes caused by the lexicon model; their errors come from phonemes that are recognized wrongly at the acoustic model stage. Second, the training data is insufficient: each word has only about 10 to 14 samples, which is too little to capture the pronunciation variation among non-native speakers. In addition, the speakers in this project are multi-national, and it is difficult to train a good lexicon model to capture so many pronunciation variations.
Chapter 5
CONCLUSION AND RECOMMENDATION
In this thesis, the author first introduced the basic knowledge of automatic speech recognition (ASR), including the concepts of feature extraction and the acoustic, lexicon and language models in an ASR system.
After that, she summarized the history of the natural language processing field and highlighted its major challenges, which show that research on improving non-native speech recognition is in great demand. Then some prior research aimed at improving non-native speech recognition systems was reviewed, in some cases in detail.
In Chapter 4, the author presented the research work finished so far. It includes three projects: the first two focus on addressing the issues of the acoustic model for non-native speech, and the third focuses on the issues of the lexicon model for non-native speech. The detailed achievements and problems of these projects are given in Section 4.5.
If any researcher intends to carry out similar research in this field, the author recommends the following tips for improving the final research result.
- If the research targets a specific kind of non-native speech, for example data recorded only from Mandarin-accented speakers, about 10 to 20 hours of data should be collected to be sufficient for training. If the research uses data recorded from multi-national non-native speakers, the data size should preferably be doubled (20 to 40 hours).
- If the researcher trains an NN acoustic model, the alignment used for the ilab file should be as accurate as possible for the training data; to create the alignment, it is better to force-align the training data with a context-dependent HMM.
- If the research aims to improve the lexical model for non-native speakers, it is better to work with a well-trained acoustic model, because a poor acoustic model cannot recognize the phonemes correctly in the first stage, and it is difficult to assess the performance of the lexicon model on such poorly recognized phoneme sequences.
- For the lexicon model, the author only focused on improving the dictionary in a data-driven way and only obtained an improvement on single-word sentences. Given enough data, the researcher can also look into discriminative training to exclude unnecessary pronunciation updates for the recognition of long, multi-word sentences.
Currently, there are still many issues in non-native automatic speech recognition systems; the most problematic part is that the acoustic model cannot be improved well enough. Only when this issue is fixed can the issues of the lexicon model and language model be addressed. Therefore, for non-native speech it is highly desirable that future researchers develop a novel acoustic model, that a more robust way to extract and use audio features emerges, or that methods appear which make use of visual information (facial expressions, gestures, etc.).
APPENDIX
Command 1:
>java –jar
Command 2:
>HCopy -T 1 -C -S < script for wav to mfcc mapping>
Command 3:
>HVite -A -D -V -X rec -T 1 -C -H -t 250 –s 0
–p -10 –w -o M -i -S < mono phone list>
Command 4:
>HResults –I
Command 5:
>HHEd –A –D –V –T 1 –H -M
< configuration file>
Command 6 (HERest, estimation of adaptation transforms):
>HERest -A -D -V -T 1 -C <configuration file> -C <adaptation configuration file> -S <script file> -I <transcription file> -H <HMM model file> -u a -K <output transform directory> -J <regression class directory> -h <speaker name mask> <HMM list>
>HERest -A -D -V -T 1 -C <configuration file> -C <adaptation configuration file> -S <script file> -I <transcription file> -H <HMM model file> -u a -J <input transform directory> -K <output transform directory> -J <regression class directory> -h <speaker name mask> <HMM list>
Command 7 (HVite, recognition using the estimated transforms):
>HVite -H <HMM model> -S <test file list> -J <transform directory> -h <speaker name mask> -k -i <recognition output> -w <word network> -J <regression tree and base class directory> -C <model configuration file> -t 250 -p -10 -s 0 <dictionary> <mono phone list>
Command 8 (HERest, re-estimation of mixture weights using the input transforms):
>HERest -A -D -V -T 1 -u w -w 2 -C <model configuration file> -J <transform directory> -J <regression tree and base class directory> -h <speaker name mask> -S <script file> -I <adaptation transcription> -a -H <input HMM model> -M <output HMM model> <HMM list>
Command 9 (HERest, re-estimation of mixture weights on the adaptation data):
>HERest -C <configuration file> -u w -S <script file> -I <adaptation data transcription> -H <input HMM model> -M <output HMM model> <HMM list>