Tạp chí KHOA HỌC & CÔNG NGHỆ 132(02): 71-76

IMPROVING ACOUSTIC MODELS FOR VIETNAMESE SPEECH RECOGNITION USING TONAL FEATURES AS INPUT OF NEURAL NETWORKS

Nguyen Quoc Bao*, Nguyen Thanh Trung, Nguyen Thu Phuong, Pham Thi Huong
College of Information and Communication Technology - TNU
* Tel: 0919 114252; Email: nqbao@ictu.edu.vn

SUMMARY
In this paper, a neural network method for improving acoustic models of Vietnamese speech recognition is presented. A deep neural network (DNN) for acoustic modeling is able to achieve significant improvements over baseline systems. The experiments are carried out on a dataset containing speech from the Voice of Vietnam (VOV) radio channel. The results show that adding tonal features as input features of the network yields around 18% relative improvement in recognition performance. The DNN using tonal features for Vietnamese recognition decreases the error rate by 49.6% relative, compared to the MFCC baseline.

Keywords: Deep neural network, Vietnamese automatic speech recognition

INTRODUCTION
In an automatic speech recognition (ASR) system, the acoustic model is an important module: it is used to model the acoustic space of the input features. The state-of-the-art acoustic models for speech recognition utilize a statistical pattern recognition framework called HMM/GMM (Hidden Markov Model/Gaussian Mixture Model) [1] with short-time spectral input features. Although the HMM/GMM approach has been effective in capturing speech patterns, it has several inherent limitations; for example, speech feature vectors at different frames are assumed to be statistically independent given the state sequence. Hence, many researchers have been trying to incorporate the power of artificial neural networks in acoustic modeling to improve performance over the traditional HMM/GMM approach.

In Vietnamese speech recognition, a further acoustic modeling problem arises from similar monosyllabic words that differ only in tone (e.g., vang, vàng, váng, vảng, vạng), which are easily confused. Therefore, tonal features that represent tone information are an essential part of a Vietnamese speech recognition system. Previous studies [2][3][4] showed efforts toward Vietnamese speech recognition; however, their systems did not employ the full range of state-of-the-art techniques for acoustic modeling. The purpose of this study is to improve the acoustic model for Vietnamese speech recognition using tonal features as input of neural networks. We also show how to extract the pitch feature using a modified algorithm which can achieve a large improvement.

The rest of this paper is organized as follows. The next section gives a brief description of the acoustic model in automatic speech recognition. This is followed by Section 3, which shows how to extract the pitch features. In Section 4, we briefly describe the deep neural network architecture for acoustic modeling. Sections 5 and 6 present the experimental setup and results. Finally, conclusions and future research are given in the last section.

ACOUSTIC MODEL IN SPEECH RECOGNITION
This section presents a brief introduction of the acoustic model in an ASR system.

Figure 1. A left-to-right HMM model

The acoustic model is used to model the statistics of speech features for each phone or sub-phone. The Hidden Markov Model (HMM) [1] is the standard used in state-of-the-art acoustic models. It is a very powerful statistical method for modeling observed data in a discrete-time series. An HMM is a structure that contains a group of states connected by transitions; each transition is specified by its transition probability. In speech recognition, state transitions are usually connected from left to right or by self-repetition (the left-to-right model), as shown in Figure 1. Each state of an HMM is usually represented by a Gaussian Mixture Model (GMM) to model the distribution of feature vectors for the given state. A GMM is a weighted sum of M component Gaussian densities, as described by Eq. (1):

P(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i)   (1)

where P(x \mid \lambda) is the likelihood of a D-dimensional continuous-valued feature vector x given the model parameters \lambda = \{w_i, \mu_i, \Sigma_i\}; w_i is the mixture weight, which satisfies the constraint \sum_{i=1}^{M} w_i = 1; \mu_i is the mean vector; and \Sigma_i is the covariance matrix of the i-th Gaussian function, which is defined by

g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)   (2)
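To make Eqs. (1)-(2) concrete, the following is a minimal numpy sketch (not from the paper) of evaluating a GMM log-likelihood for one feature vector. It assumes diagonal covariance matrices for brevity; the component count, dimensionality and parameter values are purely illustrative.

```python
import numpy as np

def log_gmm_likelihood(x, weights, means, variances):
    """Log of Eq. (1): log sum_i w_i * N(x; mu_i, Sigma_i), diagonal covariances."""
    # Per-component log Gaussian density, Eq. (2) with Sigma_i = diag(variances[i])
    D = x.shape[0]
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    # Log-sum-exp for numerical stability
    m = np.max(log_components)
    return m + np.log(np.sum(np.exp(log_components - m)))

# Illustrative toy parameters: M = 2 components, D = 3 dimensions
weights = np.array([0.4, 0.6])               # mixture weights, must sum to 1
means = np.array([[0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))                  # diagonals of the covariance matrices
x = np.array([0.5, 0.2, -0.1])
print(log_gmm_likelihood(x, weights, means, variances))
```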
PITCH FEATURE EXTRACTION
In this work, the pitch features are extracted using a new algorithm described in [5], called the Kaldi pitch tracker. It improves pitch tracking, as measured by Gross Pitch Error, over the off-the-shelf methods the authors tested. The algorithm is a highly modified version of the getf0 algorithm [6]. Basically, getf0 is based on the Normalized Cross Correlation Function (NCCF), as defined in (3):

NCCF(\tau) = \frac{\sum_{n=0}^{N-1} x(n)\, x(n+\tau)}{\sqrt{\left(\sum_{n=0}^{N-1} x(n)^2\right)\left(\sum_{n=0}^{N-1} x(n+\tau)^2\right)}}   (3)

where x(n) is the input speech sample, N is the length of the speech analysis window, and \tau is the lag, ranging between 0 and N-1. The Kaldi pitch tracker algorithm is based on finding lag values that maximize the NCCF. The most important change from getf0 is that rather than making hard decisions about voicing on each frame, all frames are treated as voiced, so that the pitch track remains continuous and the Viterbi search can naturally interpolate across unvoiced regions.

The output of the algorithm is a 2-dimensional feature consisting of the pitch in Hz and the NCCF on each frame. This output is then post-processed for use as ASR features, producing 3-dimensional features consisting of a pov-feature, a pitch-feature and a delta-pitch-feature:

- pov-feature is a warped NCCF. The warping was designed to give the feature a reasonably Gaussian distribution. Let the NCCF on a given frame be written c, with -1 < c < 1 the raw NCCF; the output feature is f = 2\left((1.0001 - c)^{0.15} - 1\right).
- pitch-feature is, at each time t, the pitch value minus a weighted average pitch value, computed over a window of width 151 frames centered at t and weighted by the probability-of-voicing value p, where p is obtained by plotting the log of count(voiced)/count(unvoiced) on the Keele database [7] as a function of the NCCF.
- delta-pitch-feature is the delta feature computed on the raw log pitch.
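Before moving on, a rough illustration of Eq. (3) is given below. This is a simplified sketch, not the Kaldi implementation (which adds normalization tweaks, resampling and Viterbi smoothing); the window length, lag range and test signal are illustrative.

```python
import numpy as np

def nccf(x, N, max_lag):
    """Normalized Cross Correlation Function, Eq. (3), for lags 0..max_lag-1.

    x: 1-D array of speech samples, at least N + max_lag long.
    N: length of the analysis window.
    """
    e0 = np.sum(x[:N] ** 2)                      # energy of the reference window
    values = np.zeros(max_lag)
    for tau in range(max_lag):
        shifted = x[tau:tau + N]
        e_tau = np.sum(shifted ** 2)             # energy of the lagged window
        values[tau] = np.dot(x[:N], shifted) / np.sqrt(e0 * e_tau + 1e-10)
    return values

# Toy usage: a 100 Hz sine sampled at 8 kHz -> NCCF should peak near lag 80
fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
signal = np.sin(2 * np.pi * 100 * t)
scores = nccf(signal, N=200, max_lag=160)
print(int(np.argmax(scores[20:]) + 20))          # candidate pitch lag
```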
DEEP NEURAL NETWORK ARCHITECTURE FOR ACOUSTIC MODELING
In this section, the deep neural network architecture for acoustic modeling is briefly described, as shown in Figure 2. The network consists of a variable number of moderately large, fully connected hidden layers, followed by the final classification layer.

Figure 2. Deep network architecture (hidden layers followed by the output layer)

The hidden layers are initialized using unsupervised, layer-wise pre-training. Thanks to their success in the deep learning community, restricted Boltzmann machines have become the default choice for pre-training the individual layers of deep neural networks used in speech recognition. Gehring et al. [8] demonstrated that denoising auto-encoders [9], which are straightforward models that have been successfully used for pre-training neural architectures for computer vision and sentiment classification [10], are applicable to speech data as well. We follow their training scheme and initialize the hidden layers as denoising auto-encoders, too.

Like regular auto-encoders, these models consist of one hidden layer and two identically-sized layers representing the input and output values. The network is usually trained to reconstruct its input at the output layer, with the goal of generating a useful intermediate representation in the hidden layer. In denoising auto-encoders, the network is trained to reconstruct a randomly corrupted version of its input, which can be interpreted as a regularizing mechanism that facilitates the learning of large and overcomplete hidden representations [9].

For denoising auto-encoders working on binary data (i.e., grayscale images or sigmoid activations of a previous hidden layer), Vincent et al. proposed the use of masking noise for corrupting each input vector [9]: in order to extract useful features, the network is forced to reconstruct the original input from a corrupted version generated by adding random noise to the data. This corruption of the input data can be formalized as applying a stochastic process q_D to an input vector x. The approach used in our work as well is to apply masking noise, setting every element of the input vector to zero with a fixed probability. The corrupted input is first mapped (with an encoder) to the hidden representation y using the weight matrix W of the hidden layer, the bias vector b of the hidden units and a non-linear activation function \sigma, as follows:

y = \sigma(W x + b)   (6)

The latent representation y, or code, is then mapped back with a decoder into a reconstruction z using the transposed weight matrix and the visible bias vector c. Because the model uses tied weights, the same weight values are used for both encoding and decoding, again through a similar transformation \sigma:

z = \sigma(W^T y + c)   (7)

The parameters of this model (namely W, W^T, b, c) are optimized such that the average reconstruction error is minimized. The reconstruction error can be measured by the cross-entropy objective, as defined in (8), in order to obtain the gradients necessary for adjusting the network weights:

L_H(x, z) = -\sum_{k} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right]   (8)

When training a network on speech features like MFCCs, the first layer models real-valued rather than binary data, so the mean squared error L_2(x, z) = \sum_k (x_k - z_k)^2 is selected as the training criterion. In this work, we also apply masking noise to the first layer, although other types of noise could be used as well [9].

After a stack of auto-encoders has been pre-trained in this fashion, a deep neural network can be constructed. The bottleneck layer, an additional hidden layer and the classification layer are initialized with random weights and connected to the hidden representation of the top-most auto-encoder. The resulting network is fine-tuned with standard backpropagation.

When neural networks are employed as acoustic models, they are used to compute a posteriori emission probabilities of phone states. If the network is trained to estimate the probabilities p(s_i \mid x_t) of a state given the observed input feature vector x_t using a cross-entropy criterion, the emission probabilities can be obtained with Bayes' rule:

p(x_t \mid s_i) = \frac{p(s_i \mid x_t)\, p(x_t)}{p(s_i)}   (9)

where p(s_i) denotes the prior probability of a phone state, which is estimated from the available training data. During decoding, the most likely sequence of states is computed by the HMM. Since the observation x_t is independent of the state sequence, its probability p(x_t) can be ignored.
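The pre-training step above can be summarized in a short numpy sketch of the masking-noise corruption and the encode/decode/update cycle of Eqs. (6)-(8). It assumes sigmoid activations, tied weights, a single auto-encoder and plain SGD with the cross-entropy criterion (the paper switches to squared error for the real-valued first layer); the hidden size, corruption rate and learning rate mirror the values quoted later in the paper, while the visible size is illustrative. This is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_update(x, W, b, c, corruption=0.2, lr=0.01):
    """One SGD step of a tied-weight denoising auto-encoder (Eqs. 6-8)."""
    # Masking noise: zero out a random 20% of the input elements
    mask = rng.random(x.shape) > corruption
    x_tilde = x * mask
    # Encoder (Eq. 6) and decoder (Eq. 7) with tied weights
    y = sigmoid(W @ x_tilde + b)
    z = sigmoid(W.T @ y + c)
    # Cross-entropy reconstruction error against the *clean* input (Eq. 8)
    loss = -np.sum(x * np.log(z + 1e-10) + (1 - x) * np.log(1 - z + 1e-10))
    # Gradients: sigmoid output + cross-entropy gives the simple (z - x) error signal
    dz = z - x
    dy = (W @ dz) * y * (1 - y)
    W -= lr * (np.outer(y, dz) + np.outer(dy, x_tilde))   # decoder + encoder paths
    b -= lr * dy
    c -= lr * dz
    return loss

# Toy dimensions; the paper uses 1024 hidden units, the visible size here is illustrative
visible, hidden = 40, 1024
W = rng.normal(0, 0.01, (hidden, visible))
b = np.zeros(hidden)
c = np.zeros(visible)
x = rng.random(visible)                      # one feature vector scaled to [0, 1]
for step in range(5):
    print(dae_update(x, W, b, c))            # reconstruction error should decrease
```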
EXPERIMENTS SETUP
Corpora Description and Baseline Systems
Our systems are trained on the Voice of Vietnam (VOV) dataset, which is a collection of story reading, mailbag, news reports, and colloquy from the Voice of Vietnam radio program. The total duration of this corpus is about 19 hours, with 17 hours used as the training set and 2 hours as the testing set. The data is in wav format with a 16 kHz sampling rate and an analog/digital conversion precision of 16 bits. The language model in this experiment is a tri-gram model trained on all of the transcriptions in the training data.

Baseline HMM/GMM systems were built with the Kaldi toolkit developed at Johns Hopkins University [11]. In feature extraction, the 16-kHz speech input is coded with 13-dimensional MFCCs using a 25 ms window and a 10 ms frame-shift. Each frame of the speech data is represented by a 39-dimensional feature vector that consists of 13 MFCCs with their deltas and double-deltas. Nine consecutive feature frames are spliced and projected to 40 dimensions using linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT). All acoustic models used 5,500 context-dependent states and 96,000 Gaussian mixture components. The baseline systems were built following a typical maximum likelihood acoustic training recipe, beginning with a flat-start initialization of context-independent phonetic HMMs, followed by a tri-phone system with 13-dimensional MFCCs plus their deltas and double-deltas, and ending with a tri-phone system with LDA+MLLT.

Network training
In our experiments, we extracted deep bottleneck features from MFCCs and from their combination with pitch features (MFCC+Pitch). The network input for these features was pre-processed using the approach in [12], called splicing of speaker-adapted features. During supervised fine-tuning, the neural network was trained to predict context-dependent HMM states (there were about 4,600 states in our experiments). For pre-training the stack of auto-encoders in the architecture of Section 4, mini-batch gradient descent with a batch size of 128 and a learning rate of 0.01 was used. Input vectors were corrupted by applying masking noise, setting a random 20% of their elements to zero. Each auto-encoder contained 1024 hidden units and received one million updates before its weights were fixed and the next one was trained on top of it.

Again, gradients were computed by averaging across a mini-batch of training examples; for fine-tuning, we used a larger batch size of 256. The learning rate was decided by the "newbob" algorithm: for the first epoch, we used 0.008 as the learning rate, and this was kept fixed as long as the increment in cross-validation frame accuracy in a single epoch was higher than 0.5%. For the subsequent epochs, the learning rate was halved; this was repeated until the increase in cross-validation accuracy per epoch was less than a stopping threshold of 0.1%. After each epoch, the current model was evaluated on a separate validation set, and the model performing best on this set was used in the speech recognition system afterwards.
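The "newbob" schedule just described can be written down in a few lines. This is a hedged sketch of the rule as stated (initial rate 0.008, halving trigger 0.5%, stopping threshold 0.1%), not the actual training code; the accuracy values in the example are invented to show the hold/halve/stop behaviour.

```python
def newbob_schedule(accuracy_per_epoch, initial_lr=0.008,
                    ramp_threshold=0.005, stop_threshold=0.001):
    """Yield the learning rate for each epoch following the 'newbob' rule.

    accuracy_per_epoch: iterable of cross-validation frame accuracies
    (as fractions), one per completed epoch.
    """
    lr = initial_lr
    halving = False
    prev_acc = None
    for acc in accuracy_per_epoch:
        yield lr
        if prev_acc is not None:
            gain = acc - prev_acc
            if halving:
                if gain < stop_threshold:      # improvement below 0.1%: stop training
                    return
                lr /= 2.0                      # keep halving each epoch
            elif gain < ramp_threshold:        # improvement below 0.5%: start halving
                halving = True
                lr /= 2.0
        prev_acc = acc

# Example: accuracies stall, so the rate is held, then halved, then training stops
accs = [0.50, 0.56, 0.60, 0.603, 0.605, 0.6055]
print(list(newbob_schedule(accs)))    # [0.008, 0.008, 0.008, 0.008, 0.004, 0.002]
```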
EXPERIMENTS RESULTS
Table 1 lists the recognition performance of the baseline systems and the HMM/DNN systems with varying features and hidden layer sizes on the Vietnamese language, in terms of word error rate (WER).

Table 1. Recognition performance for the different systems described

Acoustic Model   Features      Layer size   WER
Baseline         MFCC          -            21.25%
Baseline         MFCC+Pitch    -            16.77%
HMM/DNN          MFCC          1000         13.20%
HMM/DNN          MFCC          2000         13.03%
HMM/DNN          MFCC+Pitch    1000         10.96%
HMM/DNN          MFCC+Pitch    2000         10.71%

Regarding the performance of the baseline systems, the MFCC system reaches 21.25% WER, and the combination of MFCC and tonal features gives a considerable gain in WER (about 5% absolute). The hybrid HMM/DNN systems outperform the baseline setup, with relative improvements of up to 37.8% (13.20% WER) over the MFCC baseline. It can be seen that increasing the hidden layer size of the neural network to 2000 units is slightly better than using 1000 hidden units. In particular, adding the tonal features as input of the neural network gives a relative improvement of up to 49.6% (10.71% WER) over the MFCC baseline and 17.8% over the HMM/DNN system without tonal features in its input.
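As a quick sanity check on the percentages quoted above and in the conclusion, the relative improvements can be recomputed directly from the WER values in Table 1 (a small illustrative calculation, not part of the original paper):

```python
def relative_improvement(baseline_wer, new_wer):
    # Relative WER reduction in percent
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

print(relative_improvement(21.25, 13.20))   # HMM/DNN (MFCC) vs MFCC baseline,      ~37.9%
print(relative_improvement(21.25, 10.71))   # HMM/DNN (MFCC+Pitch) vs MFCC baseline, ~49.6%
print(relative_improvement(13.03, 10.71))   # effect of pitch in the DNN input,      ~17.8%
print(relative_improvement(21.25, 16.77))   # pitch in the GMM baseline, ~21.1% (~4.5% absolute)
```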
CONCLUSION
In this work, we have improved the acoustic model for Vietnamese speech recognition using a deep neural network with tonal features as part of its input. It was shown that adding pitch features obtained from the new algorithm increased performance by about 20% relative over the baseline system and by 17.8% over the DNN/HMM systems. We have also evaluated the layer size in the DBNF (deep bottleneck feature) architecture; gains were achieved by increasing the layer size to 2000 hidden units. The systems were tuned on a small-sized VOV speech corpus, which increased the relative improvement in word error rate over the MFCC baseline to 49.6%. In the future, we intend to apply a hybrid DNN on top of deep bottleneck features [13], as well as multilingual network training approaches, to improve the acoustic model for the Vietnamese recognition system.

REFERENCES
1. Rabiner, Lawrence, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE 77.2 (1989): 257-286.
2. Thang Tat Vu, Dung Tien Nguyen, Mai Chi Luong, and John-Paul Hosom, "Vietnamese large vocabulary continuous speech recognition," in INTERSPEECH, 2005.
3. Nguyen Hong Quang, Pascal Nocera, Eric Castelli, and Trinh Van Loan, "A novel approach in continuous speech recognition for Vietnamese," in SLTU, 2008.
4. Ngoc Thang Vu and Tanja Schultz, "Vietnamese large vocabulary continuous speech recognition," in ASRU, Italy, 2009.
5. Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, May 2014, IEEE Signal Processing Society.
6. David Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Elsevier, 1995.
7. F. Plante, Georg F. Meyer, and William A. Ainsworth, "A pitch extraction reference database," in Eurospeech, 1995, pp. 837-840.
8. Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in ICASSP 2013, Vancouver, Canada, 2013, pp. 3377-3381.
9. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol, "Extracting and composing robust features with denoising autoencoders," in ICML 2008, pp. 1096-1103.
10. Xavier Glorot, Antoine Bordes, and Yoshua Bengio, "Domain adaptation for large-scale sentiment classification: A deep learning approach," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 513-520.
11. Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011, IEEE Signal Processing Society.
12. Shakti P. Rath, Daniel Povey, Karel Vesely, and Jan Honza Cernocky, "Improved feature processing for deep neural networks," in Proc. Interspeech, 2013.
13. Quoc Bao Nguyen, Jonas Gehring, Kevin Kilgour, and Alex Waibel, "Optimizing deep bottleneck feature extraction," in Proceedings of RIVF 2013, Hanoi, Vietnam, 2013, pp. 152-156.

TÓM TẮT (Abstract in Vietnamese, translated)
IMPROVING ACOUSTIC MODELS FOR VIETNAMESE SPEECH RECOGNITION USING TONAL FEATURES AS INPUT OF NEURAL NETWORKS
Nguyen Quoc Bao*, Nguyen Thanh Trung, Nguyen Thu Phuong, Pham Thi Huong
College of Information and Communication Technology - TNU
In this paper, we present an artificial neural network method that improves the acoustic model for Vietnamese speech recognition. Artificial neural networks trained with deep learning help reduce the speech recognition error rate. Experiments were carried out on audio data recorded from the Voice of Vietnam radio channel. The experimental results show that adding tonal features as input to the neural network reduces the error rate by up to 18%, and by up to 49.6% compared to the baseline recognition system using only MFCC features.
Keywords: Deep neural networks, Vietnamese speech recognition

Received: 09/10/2014; Reviewed: 24/10/2014; Accepted for publication: 05/3/2015
Scientific reviewer: Dr. Phung Trung Nghia - College of Information and Communication Technology - TNU