Modeling dynamic acoustic features of speech for a Vietnamese speech recognition system with the Kaldi toolkit, and application to analyzing vowel-to-vowel transitions
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Nguyen Hang Phuong

MODELING DYNAMIC ACOUSTIC FEATURES OF SPEECH FOR VIETNAMESE SPEECH RECOGNITION AND APPLICATION FOR ANALYZING VOWEL-TO-VOWEL TRANSITIONS
(MÔ HÌNH ĐẶC TÍNH ÂM HỌC ĐỘNG CHO NHẬN DẠNG TIẾNG NÓI TIẾNG VIỆT VÀ ỨNG DỤNG CHO VIỆC PHÂN TÍCH SỰ CHUYỂN TIẾP NGUYÊN ÂM – NGUYÊN ÂM)

Specialty: Computer Science (2016A)
International Research Institute MICA

MASTER THESIS OF SCIENCE IN COMPUTER SCIENCE

Supervisors: Prof. Dr. Eric Castelli and Dr. Nguyen Viet Son

Hanoi – 2018

DECLARATION OF AUTHORSHIP

I, NGUYEN Hang Phuong, declare that this thesis, titled "Modeling dynamic acoustic features of speech for Vietnamese speech recognition and application for analyzing vowel-to-vowel transitions", and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date:

ACKNOWLEDGEMENTS

It is an honor for me to write these words of thanks to those who have supported, guided and inspired me from the moment I started my work at the International Research Institute MICA until now.

I owe my deepest gratitude to my supervisors, Prof. Eric Castelli and Dr. Nguyen Viet Son. Their expertise, understanding and generous guidance made it possible for me to work on a topic that was new to me. They have supported me in many ways in finding solutions to my problems, and it has been a pleasure to work with them.

Special thanks to Dr. Mac Dang Khoa, Dr. Do Thi Ngoc Diep, Dr. Nguyen Cong Phuong and all the members of the Speech Communication Department for their guidance, which taught me a great deal about how to study and do research properly, and for their valuable advice on my work.

Finally, this thesis would not have been possible without the encouragement of my family and friends. Their words gave me the strength to overcome embarrassment, discouragement and all the other difficulties. Thank you for everything that helped me reach this day.

Hanoi, 23/03/2018
Nguyen Hang Phuong

TABLE OF CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS 10
CHAPTER 1: INTRODUCTION 11
1.1 Overview of Automatic Speech Recognition 11
1.2 The objective of the thesis 12
1.3 Thesis outline 12
CHAPTER 2: ACOUSTIC FEATURES FOR AN ASR SYSTEM 13
2.1 The goals of acoustic feature extraction in an ASR system 13
2.2 Speech signal characterization 14
2.2.1 Speech is non-stationary 14
2.2.2 Static and dynamic characterization of the speech signal 19
2.2.2.1 Static characterization of the speech signal 19
2.2.2.2 Dynamic characterization of the speech signal 19
2.3 Static speech features 21
2.3.1 An overview of the MFCC features 21
2.3.2 Limitations of MFCCs 22
2.4 State of the art of modeling dynamic acoustic speech features 23
2.5 Conclusion of the chapter 23
CHAPTER 3: IMPROVEMENT ON AUTOMATIC MODELING OF DYNAMIC ACOUSTIC SPEECH FEATURES 25
3.1 Improvement on computing the Spectral Sub-band Centroid Frequency (SSCF) 25
3.1.1 SSCF features generated from the original definition 25
3.1.2 Influence of the subband filters on the SSCF features 28
3.1.3 Newly proposed design of the subband filters 40
3.1.4 Analysis of SSCFs with the new subband filter design 42
3.1.4.1 /ai/ and /ae/ transitions 43
3.1.4.2 /ao/ and /au/ transitions 46
3.1.4.3 /ia/ and /ea/ transitions 48
3.1.4.4 /oa/ and /ua/ transitions 51
3.1.4.5 /iu/ transition 53
3.1.4.6 /ui/ transition 54
3.2 New proposed approach to automatic SSCF angle computation 55
3.2.1 Definition and automatic SSCF angle calculation 55
3.2.2 Analysis of SSCF angles in vowel-vowel transitions 57
3.2.2.1 /ai/ and /ae/ transitions 58
3.2.2.2 /ao/ and /au/ transitions 61
3.2.2.3 /ia/ and /ea/ transitions 63
3.2.2.4 /oa/ and /ua/ transitions 66
3.2.2.5 /iu/ transition 69
3.2.2.6 /ui/ transition 71
3.3 Conclusion of the chapter 73
CHAPTER 4: APPLYING SSCF ANGLES IN A SPEECH RECOGNITION TOOLKIT, KALDI 74
4.1 An overview of Kaldi – an open-source toolkit for speech recognition 74
4.2 Balanced-speaker experiments in Kaldi 76
4.2.1 Using MFCC features 76
4.2.2 Using SSCF angles 77
4.3 Unbalanced-speaker experiments in Kaldi using SSCF angles 79
4.4 Conclusion of the chapter 80
CHAPTER 5: CONCLUSION AND FUTURE WORK 82
5.1 Conclusion 82
5.2 Future work 83
PUBLICATIONS 84
REFERENCES 85
APPENDIX 87

LIST OF FIGURES

Figure 2-1: An outline of a typical speech recognition system [3] 13
Figure 2-2: a) Single-tone sine wave of 10 Hz sampled at 1000 Hz; b) magnitude spectrum of the single-tone sine wave [9] 15
Figure 2-3: a) Multi-tone sine wave of 10, 50 and 100 Hz sampled at 1000 Hz; b) magnitude spectrum of the corresponding multi-tone sine wave [9] 16
Figure 2-4: a) Non-stationary multi-tone sine wave of 10, 50 and 100 Hz sampled at 1000 Hz; b) magnitude spectra of the corresponding non-stationary multi-tone sine wave [9] 17
Figure 2-5: a) Speech signal for the Hindi word "sakshaat"; b) corresponding spectra of different segments of the Hindi speech signal [9] 18
Figure 2-6: Flow chart for MFCC computation [21] 22
Figure 3-1: The algorithm for extracting SSCFs 25
Figure 3-2: The shape of the six overlapped triangular subband filters for computing SSCFs 26
Figure 3-3: SSCF parameter extraction from a speech signal, frame by frame [13] 26
Figure 3-4: Comparison between formants and SSCFs in /ai/ when applying six subband filters: a) F1, b) F2, c) M1 and d) M2 27
Figure 3-5: The shape of the five overlapped triangular subband filters for computing SSCFs 28
Figure 3-6: Comparison between formants and SSCFs in /ai/ when applying five overlapped subband filters: a) F1, b) F2, c) M1 and d) M2 29
Figure 3-7: The method for evaluating the effect of the number of subband filters on the SSCF results 30
Figure 3-8: Comparison when using five or six triangular subband filters in the /ai/ transition: a) F1, b) F2, c) M1 and d) M2 32
Figure 3-9: Comparison when using five or six triangular subband filters in the /ae/ transition: a) F1, b) F2, c) M1 and d) M2 34
Figure 3-10: Comparison when using five or six triangular subband filters in the /ao/ transition: a) F1, b) F2, c) M1 and d) M2 36
Figure 3-11: Comparison when using five or six triangular subband filters in the /au/ transition: a) the first female (F1), b) the second female (F2), c) the first male (M1), d) the second male (M2) 38
Figure 3-12: [aV] trajectories for native French speakers at normal rate: a) F1-F2 plane from the publication [25]; b), c) SSCF1-SSCF2 plane from our measurements with the five- and six-filter triangular banks 39
Figure 3-13: The definition of subband filters with equal length on the mel scale: a) five triangular subband filters, b) six triangular subband filters 41
Figure 3-14: The shape of the newly proposed six subband filters 42
Figure 3-15: a) The trajectories in French vowel-to-vowel transitions obtained by simulation [27], [28]; the French vocalic triangle in the SSCF1-SSCF2 plane: b) for two native females, c) for two native males 43
Figure 3-16: Comparison between SSCFs and formants using the proposed subband filters in the /ai/ transition: a) F1, b) F2, c) M1 and d) M2 44
Figure 3-17: Comparison between SSCFs and formants using the proposed subband filters in the /ae/ transition: a) F1, b) F2, c) M1 and d) M2 45
Figure 3-18: Comparison between SSCFs and formants using the proposed subband filters in the /ao/ transition: a) F1, b) F2, c) M1 and d) M2 46
Figure 3-19: Comparison between SSCFs and formants using the proposed subband filters in the /au/ transition: a) F1, b) F2, c) M1 and d) M2 47
Figure 3-20: Comparison between SSCFs and formants using the proposed subband filters in the /ia/ transition: a) F1, b) F2, c) M1 and d) M2 49
Figure 3-21: Comparison between SSCFs and formants using the proposed subband filters in the /ea/ transition: a) F1, b) F2, c) M1 and d) M2 50
Figure 3-22: Comparison between SSCFs and formants using the proposed subband filters in the /oa/ transition: a) F1, b) F2, c) M1 and d) M2 51
Figure 3-23: Comparison between SSCFs and formants using the proposed subband filters in the /ua/ transition: a) F1, b) F2, c) M1 and d) M2 52
Figure 3-24: Comparison between SSCFs and formants using the proposed subband filters in the /iu/ transition: a) F1, b) F2, c) M1 and d) M2 53
Figure 3-25: Comparison between SSCFs and formants using the proposed subband filters in the /ui/ transition: a) F1, b) F2, c) M1 and d) M2 54
Figure 3-26: The SSCF angle12 in the SSCF1/SSCF2 plane [13] 56
Figure 3-27: SSCF angles calculated from the proposed definition in the /ai/ transition: a) F1, b) F2, c) M1 and d) M2 59
Figure 3-28: SSCF angles calculated from the proposed definition in the /ae/ transition: a) F1, b) F2, c) M1 and d) M2 60
Figure 3-29: SSCF angles calculated from the proposed definition in the /ao/ transition: a) F1, b) F2, c) M1 and d) M2 61
Figure 3-30: SSCF angles calculated from the proposed definition in the /au/ transition: a) F1, b) F2, c) M1 and d) M2 62
Figure 3-31: SSCF angles calculated from the proposed definition in the /ia/ transition: a) F1, b) F2, c) M1 and d) M2 64
Figure 3-32: SSCF angles calculated from the proposed definition in the /ea/ transition: a) F1, b) F2, c) M1 and d) M2 65
Figure 3-33: SSCF angles calculated from the proposed definition in the /oa/ transition: a) F1, b) F2, c) M1 and d) M2 67
Figure 3-34: SSCF angles calculated from the proposed definition in the /ua/ transition: a) F1, b) F2, c) M1 and d) M2 68
Figure 3-35: SSCF angles calculated from the proposed definition in the /iu/ transition: a) F1, b) F2, c) M1 and d) M2 70
Figure 3-36: SSCF angles calculated from the proposed definition in the /ui/ transition: a) F1, b) F2, c) M1 and d) M2 72
Figure 4-1: A schematic overview of the Kaldi toolkit [30] 75

LIST OF TABLES

Table 2-1: Four types of signal, as elaborated in [8] 14
Table 3-1: The definition of the SSCF angles 57
Table 4-1: A full description of the speech database 76
Table 4-2: Syllable Error Rate (SER, %) using MFCCs and their derivatives 77
Table 4-3: Syllable Error Rate (SER) using SSCF angles and their derivatives 78
Table 4-4: Syllable Error Rate (%) in Vietnamese ASR using SSCF angles and their derivatives in the unbalanced-speaker experiment 80

LIST OF ABBREVIATIONS

ASR — Automatic Speech Recognition
MFCC — Mel-Frequency Cepstral Coefficients
SSCF — Spectral Subband Centroid Features
F — Formant

REFERENCES

[1] D. O'Shaughnessy, "Invited paper: Automatic speech recognition: History, methods and challenges," Pattern Recognit., vol. 41, no. 10, pp. 2965–2979, Oct. 2008.
[2] D. Yu and L. Deng, Automatic Speech Recognition. London: Springer London, 2015.
[3] R. E. Gruhn, W. Minker, and S. Nakamura, Statistical Pronunciation Modeling for Non-Native Speech Processing. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011.
[4] H. Bourlard, H. Hermansky, and N. Morgan, "Towards increasing speech recognition error rates," Speech Commun., vol. 18, no. 3, pp. 205–231, 1996.
[5] S. Narang and M. D. Gupta, "Speech feature extraction techniques: A review," Int. J. Comput. Sci. Mob. Comput., vol. 4, no. 3, pp. 107–114, 2015.
[6] N. Desai, K. Dhameliya, and V. Desai, "Feature extraction and classification techniques for speech recognition: A review," Int. J. Emerg. Technol. Adv. Eng., vol. 3, no. 12, pp. 367–371, 2013.
[7] U. Shrawankar and V. M. Thakare, "Techniques for feature extraction in speech recognition system: A comparative study," arXiv preprint arXiv:1305.1145, 2013.
[8] R. Kaur and V. Singh, "Time-frequency domain characterization of stationary and non-stationary signals," Int. J. Res. Appl. Sci. Eng. Technol., vol. 2, no. 5, pp. 438–447, 2014.
[9] "Non-Stationary Nature of Speech Signal (Theory): Speech Signal Processing Laboratory: Electronics & Communications: IIT Guwahati Virtual Lab." [Online]. Available: http://iitg.vlab.co.in/?sub=59&brch=164&sim=371&cnt=1104. [Accessed: 13-Mar-2018].
[10] V. G. Skuk and S. R. Schweinberger, "Influences of fundamental frequency, formant frequencies, aperiodicity, and spectrum level on the perception of voice gender," J. Speech Lang. Hear. Res., vol. 57, no. 1, p. 285, Feb. 2014.
[11] H. Traunmüller and A. Eriksson, "The frequency range of the voice fundamental in the speech of male and female adults," unpublished manuscript, 1995.
[12] P. Divenyi, S. Greenberg, and G. Meyer, Eds., Dynamics of Speech Production and Perception. Amsterdam; Washington, DC: IOS Press, 2006.
[13] Thi-Anh-Xuan Tran, "Acoustic gesture modeling: Application to a Vietnamese speech recognition system," doctoral thesis, Université Grenoble Alpes, 2016.
[14] B. Schuller, "Voice and speech analysis in search of states and traits," in Computer Analysis of Human Behavior, Springer, London, 2011, pp. 227–253.
[15] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, 2001.
[16] H. Hermansky, B. Hanson, and H. Wakita, "Perceptually based linear predictive analysis of speech," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '85), 1985, vol. 10, pp. 509–512.
[17] C. K. On, P. M. Pandiyan, S. Yaacob, and A. Saudi, "Mel-frequency cepstral coefficient analysis in speech recognition," in Proc. Int. Conf. on Computing & Informatics (ICOCI '06), 2006, pp. 1–5.
[18] C. J. Long and S. Datta, "Wavelet based feature extraction for phoneme recognition," in Proc. Fourth Int. Conf. on Spoken Language (ICSLP '96), 1996, vol. 1, pp. 264–267.
[19] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," Adv. Neural Inf. Process. Syst., vol. 13, pp. 556–562, 2001.
[20] F. Zheng, G. Zhang, and Z. Song, "Comparison of different implementations of MFCC," J. Comput. Sci. Technol., vol. 16, no. 6, pp. 582–589, 2001.
[21] S. Molau, M. Pitz, R. Schlüter, and H. Ney, "Computing mel-frequency cepstral coefficients on the power spectrum," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '01), 2001, vol. 1, pp. 73–76.
[22] R. Vergin, D. O'Shaughnessy, and A. Farhat, "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Trans. Speech Audio Process., vol. 7, no. 5, pp. 525–532, 1999.
[23] K. Kido, Digital Fourier Analysis: Advanced Techniques. New York, NY: Springer New York, 2015.
[24] K. K. Paliwal, "Spectral subband centroids as features for robust speech recognition," in Proc. 1998 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1998.
[25] R. Carré, "Signal dynamics in the production and perception of vowels," in Approaches to Phonological Complexity, Berlin–New York: Mouton de Gruyter, 2009, pp. 59–81.
[26] R. Carré, "From an acoustic tube to speech production," Speech Commun., vol. 42, no. 2, pp. 227–240, Feb. 2004.
[27] R. Carré, "Dynamic properties of an acoustic tube: Prediction of vowel systems," Speech Commun., vol. 51, no. 1, pp. 26–41, Jan. 2009.
[28] T. Kamiyama and J. Vaissière, "Perception and production of French close and close-mid rounded vowels by Japanese-speaking learners," Acquisition et Interaction en Langue Étrangère, no. Aile... Lia 2, pp. 9–41, Dec. 2009.
[29] L. Ménard, J.-L. Schwartz, L.-J. Boë, S. Kandel, and N. Vallée, "Auditory normalization of French vowels synthesized by an articulatory model simulating growth from birth to adulthood," J. Acoust. Soc. Am., vol. 111, no. 4, pp. 1892–1905, Apr. 2002.
[30] D. Povey et al., "The Kaldi speech recognition toolkit," in Proc. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
[31] K. S. and C. E., "A review on automatic speech recognition architecture and approaches," Int. J. Signal Process. Image Process. Pattern Recognit., vol. 9, no. 4, pp. 393–404, Apr. 2016.
[32] L. R. Rabiner and B.-H. Juang, "Speech recognition: Statistical methods," in Encyclopedia of Language & Linguistics, 2006, pp. 1–18.
[33] L. C. Thompson, "The problem of the word in Vietnamese," WORD, vol. 19, no. 1, pp. 39–52, Jan. 1963.
[34] "Speech Recognition Using Monophone and Triphone Based Continuous Density Hidden Markov Models – Semantic Scholar." [Online]. Available: /paper/SpeechRecognition-Using-Monophone-and-TriphoneSajjan/14ad44a595fb529cc405589c99f933c0982477af. [Accessed: 22-Mar-2018].

APPENDIX

This appendix shows some of the scripts used in Kaldi to build a speech recognition engine. Three scripts matter when running a recognition engine in Kaldi: cmd.sh, path.sh and run.sh. They are placed in the kaldi-trunk/egs/ directory, where users keep everything related to their project; from these scripts, Kaldi calls the related functions and carries out the recognition task. The appendix lists cmd.sh and path.sh, which are shared by the MFCC and SSCF-angle approaches. The essential difference between the two approaches lies in how the speech features are written and in the CMVN configuration; the run.sh for each approach, listed afterwards, makes this difference explicit.

1) cmd.sh

This script configures how jobs are run. Kaldi is designed to work best with software such as Sun GridEngine, or other software working on a similar principle, and if multiple machines are to work together in a cluster they need access to a shared file system such as one based on NFS. By changing cmd.sh, however, the user can easily run everything locally on a single machine:

export train_cmd="run.pl --mem 2G"
export decode_cmd="run.pl --mem 4G"
export mkgraph_cmd="run.pl --mem 8G"
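For comparison, and as a sketch only (the memory values are illustrative, not taken from the thesis), the conventional Kaldi way to target a GridEngine cluster is to swap run.pl for queue.pl in the same three variables; both wrappers accept the same option syntax, so the rest of the recipe is unchanged:

export train_cmd="queue.pl --mem 2G"
export decode_cmd="queue.pl --mem 4G"
export mkgraph_cmd="queue.pl --mem 8G"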
2) path.sh

This script links the user's project to the functions in the Kaldi source library. It performs several tasks: defining the Kaldi root directory, setting paths from the source library to useful tools, locating the project's speech database, and sorting data.

# Define the Kaldi root directory
export KALDI_ROOT=`pwd`/../..

# Set paths to useful tools
export PATH=$PWD/utils/:$KALDI_ROOT/src/bin:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/fstbin/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/lmbin/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/gmmbin/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/featbin/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/lm/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/sgmmbin/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/sgmm2bin/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/fgmmbin/:$PWD:$PATH
export PATH=$PWD/utils/:$KALDI_ROOT/src/latbin/:$PWD:$PATH

# The digits data will be stored in:
export DATA_ROOT="/home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio"  # e.g. something like /media/secondary/digits

if [ -z "$DATA_ROOT" ]; then
    echo "You need to set \"DATA_ROOT\" variable in path.sh to point to the directory to host VoxForge's data"
    exit 1
fi

# Make sure that the MITLM shared libs are found by the dynamic linker/loader
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/tools/mitlm-svn/lib

# Needed for "correct" sorting
export LC_ALL=C

3) run.sh when using SSCF angles

This is the most important script when building a speech recognition engine, and it serves several purposes. The first task is preparing the acoustic data from the input speech database. The second task is feature writing: the SSCF angles are computed by an external Python script and must then be written into Kaldi's feature format as .ark files. After that, run.sh prepares the language data and builds the language model. Finally, and most importantly, it trains with a specific CMVN configuration, decodes, and reports the best result.
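The external feature script itself is not reproduced in this appendix. Purely as an illustration of the quantities involved, the following minimal NumPy sketch computes spectral subband centroid frequencies for one frame and one plausible frame-to-frame angle; the band edges, the rectangular (rather than triangular) weighting, and the angle formula are assumptions made for the sketch, not the thesis's definitions (see Sections 3.1.3 and 3.2.1 for those). The run.sh listing follows after the sketch.

import numpy as np

def sscf(frame, sr, band_edges_hz):
    """Centroid frequency of each subband of one windowed speech frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)   # FFT-bin frequencies in Hz
    cents = []
    for lo, hi in band_edges_hz:
        band = (freqs >= lo) & (freqs < hi)           # rectangular band here; the
        p, f = power[band], freqs[band]               # thesis weights with triangular filters
        cents.append(float(np.sum(f * p) / (np.sum(p) + 1e-12)))
    return np.array(cents)

def sscf_angle(prev_c, next_c, i=0, j=1):
    """One plausible 'SSCF angle': direction of frame-to-frame movement
    in the SSCFi/SSCFj plane (an assumption, not the thesis definition)."""
    return np.degrees(np.arctan2(next_c[j] - prev_c[j], next_c[i] - prev_c[i]))

# Hypothetical usage with illustrative 6-band edges for 16 kHz audio
edges = [(0, 500), (500, 1000), (1000, 2000), (2000, 3500), (3500, 5500), (5500, 8000)]
frame = np.hamming(400) * np.random.randn(400)        # stand-in for one 25 ms frame
print(sscf(frame, 16000, edges))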
#!/bin/bash

# Copyright 2012 Vassil Panayotov
# Apache 2.0

# NOTE: You will want to download the data set first, before executing this script.
# This can be done, for example, by:
#  1. Setting the DATA_ROOT variable to point to a directory with enough free
#     space (at least 20-25GB currently (Feb 2014))
#  2. Running "getdata.sh"

# The second part of this script comes mostly from egs/rm/s5/run.sh
# with some parameters changed

. ./path.sh || exit 1

# If you have a cluster of machines running GridEngine you may want to
# change the train and decode commands in the file below
. ./cmd.sh || exit 1

# The number of parallel jobs to be started for some parts of the recipe.
# Make sure you have enough resources (CPUs and RAM) to accommodate this number of jobs.
njobs=1

# The number of randomly selected speakers to be put in the test set
nspk_test=1

# Test-time language model order
lm_order=3

# Word position dependent phones?
pos_dep_phones=true

# Remove previously created data (from the last run.sh execution)
rm -rf exp mfcc data/train/spk2utt data/train/cmvn.scp data/train/feats.scp data/train/split1 data/test/spk2utt data/test/cmvn.scp data/test/feats.scp data/test/split1 data/local/lang data/lang data/local/tmp data/local/dict/lexiconp.txt data/local/lm.arpa data/local/spk2gender data/local/spk2gender.tmp data/local/test.spk2utt data/local/test.utt2spk data/local/test_wav.scp data/local/train.spk2utt data/local/train.utt2spk data/local/train_wav.scp

# The user of this script can change some of the above parameters. Example:
# /bin/bash run.sh --pos-dep-phones false
utils/parse_options.sh || exit 1
[[ $# -ge 1 ]] && { echo "Unexpected arguments"; exit 1; }

# Initial normalization of the data
local/voxforge_data_prep.sh --nspk_test ${nspk_test} ${DATA_ROOT} || exit 1

# Prepare ARPA LM and vocabulary using SRILM
local/voxforge_prepare_lm.sh --order ${lm_order} || exit 1

# Prepare data/lang and data/local/lang directories
utils/prepare_lang.sh --position-dependent-phones $pos_dep_phones \
  data/local/dict '!SIL' data/local/lang data/lang || exit 1

# Prepare G.fst and data/{train,test} directories
local/voxforge_format_data.sh || exit 1

# Write the SSCF angles to Kaldi feature format
sscfdir=${DATA_ROOT}/sscf
copy-feats --compress=true \
  ark:/home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf/feature_train_angles.ark \
  ark,scp:/home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf/train.ark,/home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf/train.scp
copy-feats --compress=true \
  ark:/home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf/feature_test_angles.ark \
  ark,scp:/home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf/test.ark,/home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf/test.scp

## CMVN statistics for the train features
steps/compute_cmvn_stats.sh /home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/data/train /home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/exp/data_prep /home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf || exit 1;

## CMVN statistics for the test features
steps/compute_cmvn_stats.sh /home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/data/test /home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/exp/data_prep /home/mica/MyKaldi/kaldi-trunk/egs/demo_sscf/digits_audio/audio/sscf || exit 1;

#######################################################################
# Train monophone models with 5 features (SSCF angles, no deltas)
#utils/subset_data_dir.sh data/train 3020 data/train.3k02 || exit 1;
steps/train_mono.sh --delta-opts "--delta-order=0" --cmvn-opts "--norm-means=false" --nj $njobs --cmd "$train_cmd" data/train data/lang exp/mono_new5 || exit 1;

# Monophone decoding
utils/mkgraph.sh --mono data/lang_test exp/mono_new5 exp/mono_new5/graph || exit 1

# Note: steps/decode.sh calls the command line once for each test set and
# afterwards averages the WERs into, in this case, exp/mono_new5/decode/
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/mono_new5/graph data/test exp/mono_new5/decode

# Get alignments from the monophone system
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  data/train data/lang exp/mono_new5 exp/mono_new5_ali || exit 1;

########################################################################
# Train monophone models with 10 features (SSCF angles + deltas)
steps/train_mono.sh --delta-opts "--delta-order=1" --cmvn-opts "--norm-means=false" --nj $njobs --cmd "$train_cmd" data/train data/lang exp/mono_new10 || exit 1;

# Monophone decoding
utils/mkgraph.sh --mono data/lang_test exp/mono_new10 exp/mono_new10/graph || exit 1
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/mono_new10/graph data/test exp/mono_new10/decode

# Get alignments from the monophone system
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  data/train data/lang exp/mono_new10 exp/mono_new10_ali || exit 1;

########################################################################
# Train monophone models with 15 features (SSCF angles + deltas + delta-deltas)
steps/train_mono.sh --delta-opts "--delta-order=2" --cmvn-opts "--norm-means=false" --nj $njobs --cmd "$train_cmd" data/train data/lang exp/mono_new15 || exit 1;

# Monophone decoding
utils/mkgraph.sh --mono data/lang_test exp/mono_new15 exp/mono_new15/graph || exit 1
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/mono_new15/graph data/test exp/mono_new15/decode

# Get alignments from the monophone system
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  data/train data/lang exp/mono_new15 exp/mono_new15_ali || exit 1;

########################################################################
# Train tri1 (first triphone pass) with 5 features
steps/train_deltas.sh --delta-opts "--delta-order=0" --cmvn-opts "--norm-means=false" --cmd "$train_cmd" \
  2000 11000 data/train data/lang exp/mono_new5_ali exp/tri_new5 || exit 1;

# Decode tri1
utils/mkgraph.sh data/lang_test exp/tri_new5 exp/tri_new5/graph || exit 1;
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/tri_new5/graph data/test exp/tri_new5/decode
#draw-tree data/lang/phones.txt exp/tri1/tree | dot -Tps -Gsize=8,10.5 | ps2pdf - tree.pdf

# Align tri1
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  --use-graphs true data/train data/lang exp/tri_new5 exp/tri_new5_ali || exit 1;

#########################################################################
# Train tri1 (first triphone pass) with 10 features
steps/train_deltas.sh --delta-opts "--delta-order=1" --cmvn-opts "--norm-means=false" --cmd "$train_cmd" \
  2000 11000 data/train data/lang exp/mono_new10_ali exp/tri_new10 || exit 1;

# Decode tri1
utils/mkgraph.sh data/lang_test exp/tri_new10 exp/tri_new10/graph || exit 1;
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/tri_new10/graph data/test exp/tri_new10/decode

# Align tri1
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  --use-graphs true data/train data/lang exp/tri_new10 exp/tri_new10_ali || exit 1;

#########################################################################
# Train tri2a (deltas + delta-deltas) with 15 features
steps/train_deltas.sh --delta-opts "--delta-order=2" --cmvn-opts "--norm-means=false" --cmd "$train_cmd" \
  2000 11000 data/train data/lang exp/mono_new15_ali exp/tri_new15 || exit 1;
#steps/train_deltas_delta.sh --cmd "$train_cmd" 2000 11000 \
#  data/train data/lang exp/mono_new15_ali exp/tri_new15 || exit 1;

# Decode tri2a
utils/mkgraph.sh data/lang_test exp/tri_new15 exp/tri_new15/graph
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/tri_new15/graph data/test exp/tri_new15/decode

# Align tri2a
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  --use-graphs true data/train data/lang exp/tri_new15 exp/tri_new15_ali || exit 1;

# Score
for x in exp/*/decode*; do [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh; done
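As a side note (not part of the thesis scripts), a quick way to sanity-check the features written by copy-feats above is to dump a few frames back out as text with Kaldi's table I/O:

copy-feats ark:${DATA_ROOT}/sscf/train.ark ark,t:- | head

Each utterance appears as a matrix with one row per frame; for the 5-feature SSCF-angle setup each row should contain five values, growing to 10 or 15 once the delta options add derivatives.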
4) run.sh when using MFCCs

This is again the most important script when building a speech recognition engine, and it serves the same main purposes. The first task is preparing the acoustic data from the input speech database. The second task is MFCC feature extraction. After that, run.sh prepares the language data and builds the language model. Finally, and most importantly, it trains with the default configuration, decodes, and reports the best result.

#!/bin/bash

# Copyright 2012 Vassil Panayotov
# Apache 2.0

# NOTE: You will want to download the data set first, before executing this script.
# This can be done, for example, by:
#  1. Setting the DATA_ROOT variable to point to a directory with enough free
#     space (at least 20-25GB currently (Feb 2014))
#  2. Running "getdata.sh"

# The second part of this script comes mostly from egs/rm/s5/run.sh
# with some parameters changed

. ./path.sh || exit 1

# If you have a cluster of machines running GridEngine you may want to
# change the train and decode commands in the file below
. ./cmd.sh || exit 1

# The number of parallel jobs to be started for some parts of the recipe.
# Make sure you have enough resources (CPUs and RAM) to accommodate this number of jobs.
njobs=2

# The number of randomly selected speakers to be put in the test set
nspk_test=8

# Test-time language model order
lm_order=3

# Word position dependent phones?
pos_dep_phones=true

# Remove previously created data (from the last run.sh execution)
rm -rf exp mfcc data/train/spk2utt data/train/cmvn.scp data/train/feats.scp data/train/split1 data/test/spk2utt data/test/cmvn.scp data/test/feats.scp data/test/split1 data/local/lang data/lang data/local/tmp data/local/dict/lexiconp.txt data/local/lm.arpa data/local/spk2gender data/local/spk2gender.tmp data/local/test.spk2utt data/local/test.utt2spk data/local/test_wav.scp data/local/train.spk2utt data/local/train.utt2spk data/local/train_wav.scp

# The user of this script can change some of the above parameters. Example:
# /bin/bash run.sh --pos-dep-phones false
utils/parse_options.sh || exit 1
[[ $# -ge 1 ]] && { echo "Unexpected arguments"; exit 1; }

# Initial normalization of the data
local/voxforge_data_prep.sh --nspk_test ${nspk_test} ${DATA_ROOT} || exit 1

# Prepare ARPA LM and vocabulary using SRILM
local/voxforge_prepare_lm.sh --order ${lm_order} || exit 1

# Prepare data/lang and data/local/lang directories
utils/prepare_lang.sh --position-dependent-phones $pos_dep_phones \
  data/local/dict '!SIL' data/local/lang data/lang || exit 1

# Prepare G.fst and data/{train,test} directories
local/voxforge_format_data.sh || exit 1

# Now make MFCC features.
# mfccdir should be some place with a largish disk where you
# want to store MFCC features.
mfccdir=${DATA_ROOT}/mfcc
for x in train test; do
  steps/make_mfcc.sh --cmd "$train_cmd" --nj $njobs \
    data/$x exp/make_mfcc/$x $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x $mfccdir || exit 1;
done

#######################################################################
# Train monophone models on a subset of the data with 13 features
utils/subset_data_dir.sh data/train 3020 data/train.3k02 || exit 1;
steps/train_mono.sh --delta-opts "--delta-order=0" --nj $njobs --cmd "$train_cmd" data/train.3k02 data/lang exp/mono13 || exit 1;

# Monophone decoding
utils/mkgraph.sh --mono data/lang_test exp/mono13 exp/mono13/graph || exit 1
# Note: steps/decode.sh calls the command line once for each test set and
# afterwards averages the WERs into, in this case, exp/mono13/decode/
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/mono13/graph data/test exp/mono13/decode

# Get alignments from the monophone system
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  data/train data/lang exp/mono13 exp/mono13_ali || exit 1;

########################################################################
# Train monophone models on a subset of the data with 26 features
utils/subset_data_dir.sh data/train 3020 data/train.3k02 || exit 1;
steps/train_mono.sh --delta-opts "--delta-order=1" --nj $njobs --cmd "$train_cmd" data/train.3k02 data/lang exp/mono26 || exit 1;

# Monophone decoding
utils/mkgraph.sh --mono data/lang_test exp/mono26 exp/mono26/graph || exit 1
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/mono26/graph data/test exp/mono26/decode

# Get alignments from the monophone system
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  data/train data/lang exp/mono26 exp/mono26_ali || exit 1;

########################################################################
# Train monophone models on a subset of the data with 39 features
utils/subset_data_dir.sh data/train 3020 data/train.3k02 || exit 1;
steps/train_mono.sh --delta-opts "--delta-order=2" --nj $njobs --cmd "$train_cmd" data/train.3k02 data/lang exp/mono39 || exit 1;

# Monophone decoding
utils/mkgraph.sh --mono data/lang_test exp/mono39 exp/mono39/graph || exit 1
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/mono39/graph data/test exp/mono39/decode

# Get alignments from the monophone system
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  data/train data/lang exp/mono39 exp/mono39_ali || exit 1;

########################################################################
# Train tri1 (first triphone pass) with 13 features
steps/train_deltas.sh --delta-opts "--delta-order=0" --cmd "$train_cmd" \
  2000 11000 data/train data/lang exp/mono13_ali exp/tri13 || exit 1;

# Decode tri1
utils/mkgraph.sh data/lang_test exp/tri13 exp/tri13/graph || exit 1;
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/tri13/graph data/test exp/tri13/decode
#draw-tree data/lang/phones.txt exp/tri1/tree | dot -Tps -Gsize=8,10.5 | ps2pdf - tree.pdf

# Align tri1
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  --use-graphs true data/train data/lang exp/tri13 exp/tri13_ali || exit 1;

#########################################################################
# Train tri1 (first triphone pass) with 26 features
steps/train_deltas.sh --delta-opts "--delta-order=1" --cmd "$train_cmd" \
  2000 11000 data/train data/lang exp/mono26_ali exp/tri26 || exit 1;

# Decode tri1
utils/mkgraph.sh data/lang_test exp/tri26 exp/tri26/graph || exit 1;
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/tri26/graph data/test exp/tri26/decode

# Align tri1
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  --use-graphs true data/train data/lang exp/tri26 exp/tri26_ali || exit 1;

#########################################################################
# Train tri2a (deltas + delta-deltas) with 39 features
steps/train_deltas.sh --delta-opts "--delta-order=2" --cmd "$train_cmd" \
  2000 11000 data/train data/lang exp/mono39_ali exp/tri39 || exit 1;
#steps/train_deltas_delta.sh --cmd "$train_cmd" 2000 11000 \
#  data/train data/lang exp/mono39_ali exp/tri39 || exit 1;

# Decode tri2a
utils/mkgraph.sh data/lang_test exp/tri39 exp/tri39/graph
steps/decode.sh --config conf/decode.config --nj $njobs --cmd "$decode_cmd" \
  exp/tri39/graph data/test exp/tri39/decode

# Align tri2a
steps/align_si.sh --nj $njobs --cmd "$train_cmd" \
  --use-graphs true data/train data/lang exp/tri39 exp/tri39_ali || exit 1;
#steps/align_si.sh --nj $njobs --cmd utils/run.pl \
#  --use-graphs true data/train data/lang exp/tri39 exp/tri39_ali || exit 1;

# Score
for x in exp/*/decode*; do [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh; done
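One configuration file that run.sh depends on but that this appendix does not list is conf/mfcc.conf, which steps/make_mfcc.sh passes to compute-mfcc-feats. As a sketch only (these values are the common Kaldi conventions for 16 kHz data, assumed here rather than taken from the thesis), it typically contains one option per line:

--use-energy=false   # use C0 instead of raw log-energy
--sample-frequency=16000

Other compute-mfcc-feats options (number of cepstra, frame length, and so on) can be pinned in the same file, which keeps the feature configuration identical across the train and test sets.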