Vietnamese speech synthesis for some assistant services on mobile devices


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Nguyen Tien Thanh

VIETNAMESE SPEECH SYNTHESIS FOR SOME ASSISTANT SERVICES ON MOBILE DEVICES

MASTER OF SCIENCE THESIS
COMPUTER SCIENCE 2014B
Department: International Research Institute MICA
SUPERVISOR: Dr Mac Dang Khoa

Hanoi – 2016

Nguyễn Tiến Thành - Vietnamese speech synthesis for assistant services on mobile devices. Master of science thesis 2016.

COMMITMENT

I commit that I am the person responsible for conducting this study. All reference figures are cited with clear sources. The presented results are truthful and have not been published in any other person's work.

Nguyễn Tiến Thành

ACKNOWLEDGEMENT

During my master's studies, many people gave me generous help and inspiration. I wish to thank all my professors and colleagues at MICA International Research Institute for their generous support. The advice and knowledge they imparted to me are gratefully appreciated and inspired me to finish this thesis.

Special thanks go to my supervisor, Dr Mạc Đăng Khoa, and to my colleagues in the Speech Communication Department, MICA Institute, for their advice and encouragement, and especially to Assoc. Prof. Trần Đỗ Đạt for the thorough review and invaluable suggestions.

I would like to thank Mr Nguyễn Mạnh Hà and Ms Nguyễn Hằng Phương for their guidance in recording the corpus. I would also like to thank the many MICA members who spent much of their time testing for my research.

I am grateful to Prof. Eric Castelli, Dr Nguyễn Việt Sơn and MICA's directorate for providing me with the best working conditions at MICA International Research Institute.

Finally, I owe a great deal to my parents and my younger brother for their encouragement and support. They have given me strength and motivation in my work and in my life.

Nguyễn Tiến Thành

List of figures

Figure 1-1 Representation of sound (Huang et al. 2001)
Figure 1-2 A schematic diagram of the human speech production apparatus (Huang et al. 2001)
Figure 1-3 Glottal airflow and the resulting sound pressure at the mouth (Rabiner and Juang 1993)
Figure 1-4 Waveform plot of the beginning of the utterance "It's time" (Huang et al. 2001)
Figure 1-5 Signal of sound "my speech" and its spectrogram
Figure 1-6 Speech recognition and speech synthesis (Chandra and Akila 2012)
Figure 1-7 Schematic of text-to-speech synthesis
Figure 1-8 A schematic of the construction of an articulatory speech synthesizer and how such a synthesizer may be considered to contain a model of information encoding in the speech signal (Palo 2006)
Figure 1-9 Block diagram of a synthesis-by-rule system. Pitch and formants are listed as the only parameters of the synthesizer for convenience. In practice, such a system has about 40 parameters (Huang et al. 2001)
Figure 1-10 Core architecture of HMM-based speech synthesis system (Yoshimura 2002)
Figure 1-11 General HMM-based synthesis scheme (Zen et al. 2009)
Figure 1-12 A diagram of the Hunt and Black algorithm, showing one particular sequence of units, how the target cost measures a distance between a unit and the specification, and how the join cost measures a distance between two adjacent units (Taylor 2009)
Figure 2-1 Schematic diagram of Hanoi Vietnamese tones (Michaud 2004)
Figure 2-2 Base system of Vu Hui Quan, consisting of two parts: training part and synthesis part (Quan and Nam 2009)
Figure 2-3 Vietnamese speech recognition system (Vu et al. 2006)
Figure 2-4 Non-uniform unit selection model (Van Do et al. 2011)
Figure 2-5 Parse tree to search (Van Do et al. 2011)
Figure 3-1 Target cost of target units and candidate units (Tran 2007)
Figure 3-2 Sentence split into phrases and syllables
Figure 3-3 Average length of syllables in different positions (Tran 2007)
Figure 3-4 Average length of syllables (Tran 2007)
Figure 3-5 Signal of the syllable "giỏi" in two different positions
Figure 3-6 Sub-cost based on the difference in position of phrase
Figure 3-7 Sub-cost based on the difference in context of preceding syllable and following syllable
Figure 3-8 Syllable "Quanh" is composed of four phonemes
Figure 3-9 Sub-cost based on the difference in context of preceding phoneme and following phoneme
Figure 3-10 Acoustic units network
Figure 3-11 The algorithm for separating a sentence into phrases that are as long as possible
Figure 3-12 Finding the longest phrase in the database
Figure 3-13 Search space before applying the acoustic units network
Figure 3-14 Search space after applying the acoustic units network
Figure 3-15 Finding candidates of the word "chúng tôi"
Figure 4-1 Interface of Adobe Audition 3.0
Figure 4-2 Interface of Praat
Figure 4-3 MOS test result by domain
Figure 4-4 Perception test
Figure 4-5 Result of the perception test
Figure 4-6 Speed of synthesis process of two systems

List of tables

Table 1-1 Types of using some popular units
Table 2-1 The concluded structure of Vietnamese syllables (Tran 2003)
Table 2-2 Symbol of Vietnamese tones
Table 2-3 Advantages and disadvantages of the two synthesis systems of Quan and Thao
Table 3-1 Position difference and cost value (lower is better). Target unit is at the beginning or end of a sentence
Table 3-2 Position difference and cost value (lower is better). Target unit is at both the beginning and end, or in the middle, of a sentence
Table 3-3 Phoneme types in Vietnamese (Tran 2007)
Table 3-4 Direction and complexity of Vietnamese tones
Table 4-1 Number of sentences and distinct syllables in each domain
Table 4-2 Tags and meaning of XML file

Contents

COMMITMENT
ACKNOWLEDGEMENT
List of figures
List of tables
Introduction
Chapter 1 Overview of speech processing and text-to-speech
1.1 Speech and speech processing
1.1.1 Sound
1.1.2 Human vocal mechanism
1.1.3 Speech representation in the time and frequency domains
1.1.4 Speech processing
1.2 Text-To-Speech
1.2.1 Introduction
1.2.2 Speech synthesis techniques
1.2.3 Articulatory synthesis
1.2.4 Formant synthesis
1.2.5 Concatenative synthesis
1.2.6 Statistical Parametric synthesis
1.3 From concatenative synthesis to unit selection synthesis
1.3.1 Extending concatenative synthesis
1.3.2 The algorithm of Hunt and Black
1.3.3 Speech synthesis based on non-uniform units selection
1.4 Conclusion
Chapter 2 Text-to-speech for Vietnamese
2.1 Overview Vietnamese language and phonology
2.1.1 Characteristics
2.1.2 Vietnamese syllable structure
2.2 Overview text-to-speech in Vietnamese
2.3 Discussion and proposal
2.4 Conclusion
Chapter 3 Improvement of Non-uniform unit selection technique for Vietnamese Text-to-speech
3.1 Quality improvement: using target costs for unit selection
3.1.1 Target costs in Vietnamese synthesis
3.1.2 Separating sentence into phrases
3.1.3 Target cost computation
3.2 Performance improvement: using acoustic units network
3.2.1 Acoustic units network
3.2.2 Separating sentence into the longest phrases
3.2.3 Searching candidates
3.3 Conclusion
Chapter 4 Implementations and evaluation
4.1 System overview
4.2 Building database
4.2.1 Text database building
4.2.2 Speech corpus recording
4.2.3 Database processing
4.3 Evaluation
4.3.1 Quality of synthesized speech
4.3.2 Cost target improvement
4.3.3 Performance
4.4 Conclusion
Chapter 5 Conclusions and perspectives
References

Figure 4-2 Interface of Praat

Extracting phonetic information of syllables

Extracting the phonetic information of each syllable is the next stage, which is also important. This information is used in the searching, selecting and concatenating processes of the speech synthesis system. The extracted data consists of many types of information, such as duration (in number of samples), carrying file, preceding syllable, following syllable, position in sentence, initial phoneme, middle phoneme, nucleus phoneme, final phoneme, tone, start point and end point.

Building the meta-data file

The meta-data file is written to save the information extracted in the previous step. The data is stored in the standard format of an XML file. This file describes all the information about the syllables, such as how they are written, the time at which they appear in the recorded file, their preceding syllable, etc. The tags and their meanings are shown in Table 4-2.

Table 4-2 Tags and meaning of XML file

Tag            Meaning
Name           Name of syllable
Start_index    Start index of syllable
End_index      End index of syllable
Initial        Initial phone of syllable
Middle         Middle phone of syllable
Nucleus        Nucleus phone of syllable
Final          Final phone of syllable
Type           Type of syllable
LeftSyl        The syllable on the left side
RightSyl       The syllable on the right side
Tone           The tone of syllable
finalPhnm      Final phone
initialPhnm    Initial phone

4.3 Evaluation

The purpose of the evaluation is to demonstrate the improvements in speech quality and performance of our synthesis system. Two experiments evaluate the quality of the synthesized speech. A third test shows the improved performance of our system.

4.3.1 Quality of synthesized speech

To evaluate the quality of the output speech of our system, a perception test was carried out. Twenty Vietnamese listeners (10 males and 10 females), who speak the same dialect as the speaker, participated in this experiment. The test material consists of 28 sentences in three domains, presented in three types: (1) synthesized by our system; (2) synthesized by vnSpeak, the most widely used TTS engine for Vietnamese; and (3) recordings of natural speech. The perception tests were carried out in a quiet room, using a high-quality headset at a comfortable hearing level. Using a testing interface on a smartphone, the listeners listened sentence by sentence, in random order. After listening to each sentence, they gave their opinion on a scale from "hardly understandable" to "natural-like". All the marks given for each domain were then averaged. The result of the test for each domain is shown in Figure 4-3.

Figure 4-3 MOS test result by domain (bar chart: average marks of Our TTS, VnSpeak and Natural Speech for the Daily Info, Screen Reader and News domains)

As the chart clearly shows, the output speech of our TTS engine has higher quality than that of vnSpeak TTS in every domain. Another notable point is that Screen Reader is the domain with the highest mark of the three, 4.38, while News has the
lowest, with 3.38. This can be explained by the fact that Screen Reader is quite a small domain, without many words to cover, whereas News is the opposite, with many complicated words, some of which do not even exist in Vietnamese.

4.3.2 Cost target improvement

The objective of this experiment is to compare the naturalness of the output of two synthesis systems: the one developed in (Van Do et al. 2011) and our system. The input text used for this perception test consists of 40 sentences taken randomly from newspapers, the Internet, books, etc. Once the input text had been prepared, it was synthesized by both systems using a common database. The first system is the one developed by (Van Do et al. 2011); the second system is the one developed by us.

Figure 4-4 Perception test

Twenty volunteers participated in this perception test. Each of them listened to the 40 sentences synthesized by the two systems, that is, 40 pairs of synthesized sentences played pair by pair. After listening to each pair, the volunteers chose whether the new system produced better output speech, the old one did, or both had the same quality. In summary, the percentages of decisions for the three options are:

Compare the naturalness of both systems
Our system is better: 44%
Our system is worse: 25%
Equal: 31%

Figure 4-5 Result of the perception test

The results obtained show a relative improvement in the naturalness of the synthesized speech of the second system. Figure 4-5 shows the results of the perception test. The blue zone represents the percentage of participants who claimed that the first system produced the better voice, the red zone represents the proportion of people who said that both systems generated voices of equal quality, and the green zone represents the percentage of people who thought that the voice of the second system was better.

According to the result, the second system was judged better more often than the first one (44% of the decisions against 25%). This indicates a marked improvement in the quality of our system. Apparently, the method of calculating target cost and join cost helped the system choose better voice units, making the synthesized voice sound more natural. The sentences in this group include voice units with many recorded candidates, so the second system can choose suitable ones and generate a more natural synthesized voice.

The red zone is relatively large, at 31%. Most sentences in this group are synthesized from voice units with fewer candidates. Both systems then choose voice units from the same source, and as a result the voice quality is the same.

However, in approximately 25% of the decisions, listeners preferred the voice quality of the old system to that of the new one. There might be several reasons. One of the most important is that the proposed cost calculation did not suit the sentences being synthesized: our coefficients cannot suit all contexts. We have to continue experimenting in order to find coefficients that suit as many sentences as possible. Another reason could be that the contexts of the voice units in the sentences are too different. When the selection criteria and processes are the same, different voices of different candidates lead to different synthesized voice quality.

4.3.3 Performance

To bring a speech synthesis system to end users, execution time (the total processing time: dividing sentences, searching and choosing units, ...) is very important. The execution time must be optimized as much as possible for the users' convenience. The aim of this test is to show the improvement in the speed of our newly developed system compared with the system developed by (Van Do et al. 2011).

We prepared approximately 100 sentences containing up to 20 syllables. These sentences were randomly chosen from popular sources of text such as bulletins, news, newspapers, books and stories. Then we measured the total execution time of the system on smart devices (Figure 4-6 shows the results obtained on a Samsung Galaxy S4 mini smartphone). The results are grouped by phrase length: 1-5 syllables, 6-10 syllables, 11-15 syllables and more than 15 syllables.

Figure 4-6 Speed of synthesis process of two systems (chart "Execute time of two synthesis systems": time in ms against length of phrase, for our system and the previous system (Do 2011))

In general, the second system shows a significant improvement in execution time compared with the first system. For short phrases (1-5 syllables), the number of units processed in the searching and selecting processes is small, so the differences are not really noticeable. For longer sentences, the number of voice units being searched is larger, and we noticed significant differences in execution time between the two systems. For the first system, the time spent on synthesizing long sentences is many times that of short sentences. For the second system, the time spent on synthesizing long sentences is only a little more than that of short sentences. Comparing the two systems, we found that the longer the sentences were, the larger the difference grew. Within a relatively ideal amount of time (less than one second), the second system responds nearly immediately to user requests.

4.4 Conclusion

In Chapter 4, we described the process of building a new Vietnamese speech corpus database. Three experimental tests were then conducted. The first and the second indicate the improvement in synthetic voice quality of the new system, while the third shows that the speed of the new system is also improved.

Chapter 5 Conclusions and perspectives

The thesis subject is "Vietnamese speech synthesis for some assistant services on mobile devices". However, as we all know, building a system that can synthesize speech similar to the human voice is an extremely large and complex task that requires a lot of research on language, acoustics, speech processing and machine learning. Therefore, within the scope of this master thesis, we have studied and made some proposals to improve the quality and speed of an existing system.

Firstly, a new Vietnamese speech corpus was built, which not only serves the current research but can also be applied to speech synthesis systems on smartphones. This work takes a lot of time because of the high accuracy required, and a few tasks need to be done manually.

Moreover, a new proposal for calculating the cost in a synthesis system based on the unit selection technique was given. A system using this new cost can choose more appropriate candidates, which improves the quality of the synthesized speech and makes it more natural.

Furthermore, an acoustic unit network to manage large databases is proposed. This network consists of several kinds of nodes, including sentences, phrases and syllables, linked together. The network of phonetic units splits the data into smaller pieces to help the search. This reduces the searching time when working with a very large database.

The summary of the work done in this master thesis, the limitations and future
approaches are described as follows:

The achievements:
  • Understanding the theoretical basis of speech synthesis.
  • Building a new Vietnamese speech corpus which covers almost all syllables in Vietnamese.
  • Proposing new sub-costs to calculate the cost used in a synthesis system based on the unit selection technique.
  • Proposing a unit network which includes syllables, phrases and sentences linked together.
  • Applying the new cost calculation and the unit network to a Vietnamese speech synthesis system on smartphones.
  • A paper submitted to the SLTU workshop in 2016.

The limitations:
  • Synthesized speech still has discontinuities at concatenation points.
  • Some special loaned syllables are not covered in the database.
  • Some syllables have too few candidates for the best candidate to be selected.

Future approaches:
  • Using a hybrid TTS combining unit selection and HMM-based TTS to smooth the signal at discontinuous points.
  • Expanding the database. With greater coverage, the system can synthesize more syllables.

References

Chandra DE, Akila A (2012) An Overview of Speech Recognition and Speech Synthesis Algorithms. Int J Comput Technol Appl 1426–1430

Chen SF, others (2003) Conditional and joint models for grapheme-to-phoneme conversion. In: INTERSPEECH, Geneva, Switzerland

Christos Vosnidis VD (2001) Use of clustering information for coarticulation compensation in speech synthesis by word concatenation. In: EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology. pp 833–836

Chu M, Peng H, Zhao Y, et al (2003) Microsoft Mulan - a bilingual TTS system. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 Proceedings (ICASSP '03). pp I–264–I–267 vol.1

Clark RA, Richmond K, King S (2004) Festival 2–build your own general purpose unit selection speech synthesiser. 5th ESCA Workshop Speech Synth p 147–151

Coorman G, Fackrell J, Rutten P, van Coile B (2000) Segment Selection in the L&H Realspeak Laboratory TTS System. In: INTERSPEECH, Beijing, China

Doan TT (1977) Ngữ âm tiếng Việt (Vietnamese phonetics). Vietnam National University Publishing House 373p 136/77

Donovan RE, Eide E (1998) The IBM trainable speech synthesis system. In: ICSLP - International Conference on Spoken Language Processing, Sydney, Australia

Donovan RE, Franz M, Sorensen JS, Roukos S (1999) Phrase splicing and variable substitution using the IBM trainable speech synthesis system. In: 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999 Proceedings. pp 373–376 vol.1

Donovan RE, Woodland PC (1995) Automatic speech synthesiser parameter estimation using HMMs. In: International Conference on Acoustics, Speech, and Signal Processing, 1995 ICASSP-95. pp 640–643 vol.1

Huang X, Acero A, Hon H-W (2001) Spoken Language Processing: A Guide to Theory, Algorithm and System Development, edition. Prentice Hall, Upper Saddle River, NJ

Hunt AJ, Black AW (1996) Unit selection in a concatenative speech synthesis system using a large speech database. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996 ICASSP-96 Conference Proceedings. pp 373–376 vol

Iwahashi N, Kaiki N, Sagisaka Y (1993) Speech segment selection for concatenative synthesis based on spectral distortion minimization. IEICE Trans Fundam Electron Commun Comput Sci 76:1942–1948

Jindrich Matousek ZH (2005) Hybrid syllable/triphone speech synthesis. Ninth Eur Conf Speech Commun Technol 2529–2532

Lê Hồng M (2003) Some results of research and development of text-to-speech software for Vietnamese by formant synthesis. Proc First Natl Sci Conf Res Dev Appl Inf Technol Commun ICTrda'03

Meng HM, Keung C-K, Siu K-C, et al (2002) CU VOCAL: corpus-based syllable concatenation for Chinese speech synthesis across domains and dialects. In: INTERSPEECH, Denver, Colorado, USA

Michaud A (2004) Final consonants and glottalization: new perspectives from Hanoi Vietnamese. Phonetica 61:119–146. doi: 10.1159/000082560

Möhler G, Conkie A (1998) Parametric Modeling of Intonation Using Vector Quantization. In: Proc 3rd ESCA Workshop on speech synthesis (Jenolan caves, Australia). pp 311–316

Nakajima S, Hamada H (1988) Automatic generation of synthesis units based on context oriented clustering. In: 1988 International Conference on Acoustics, Speech, and Signal Processing, 1988 ICASSP-88. pp 659–662 vol.1

Nguyen TC (1996) Ngữ pháp tiếng Việt (Vietnamese grammar). National University Publishing House in Hanoi 397 p 02.01 DH 96

Nguyen T-TT (2015) HMM-based Vietnamese Text-to-Speech: Prosodic Phrasing Modeling, Corpus Design System Design, and Evaluation. PhD thesis, Hanoi University of Science and Technology

Olive JP, Greenwood A, Coleman J (1993) Acoustics of American English speech: a dynamic approach. Springer Science & Business Media

Palo P (2006) A Review of Articulatory Speech Synthesis. Master's Thesis, Helsinki University Of Technology

Peter Rutten MPA (2002) A statistically motivated database pruning technique for unit selection synthesis. Seventh Int Conf Spok Lang Process

Phan T-S, Duong T-C, Dinh A-T, et al (2013) Improvement of naturalness for an HMM-based Vietnamese speech synthesis using the prosodic information. In: 2013 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF). pp 276–281

Portele T, Stöber K-H, Meyer H, Hess W (1996) Generation of multiple synthesis inventories by a bootstrapping procedure. In: Spoken Language, 1996 ICSLP 96 Proceedings, Fourth International Conference on. IEEE, pp 2391–2394

Quân VH, Nam CX (2009) Vietnamese speech synthesis by concatenating phrases. Proj Res Dev Inf Technol Commun V-1:

Rabiner L, Juang B-H (1993) Fundamentals of Speech Recognition. Prentice-Hall, Inc., Upper Saddle River, NJ, USA

Sagisaka Y (1988) Speech synthesis by rule using an optimal selection of nonuniform synthesis units. In: International Conference on Acoustics, Speech, and Signal Processing, 1988 ICASSP-88. pp 679–682 vol.1

Steve Pearson NK (1998) A synthesis method based on concatenation of demisyllables and a residual excited vocal tract model. In: ICSLP, Sydney, Australia

Stöber K, Portele T, Wagner P, Hess W (1999) Synthesis by word concatenation. In: INTERSPEECH, Budapest, Hungary

Tanya Lambert APB (2004) A database design for a TTS synthesis system using lexical diphones. In: INTERSPEECH, Jeju Island, Korea

Taylor P (2009) Text-to-Speech Synthesis, edition. Cambridge University Press, Cambridge, UK; New York

Taylor P, Black AW (1998) Assigning phrase breaks from part-of-speech sequences. Comput Speech Lang 12:99–117. doi: 10.1006/csla.1998.0041

Toshio Hirai ST (2004) USING ms SEGMENTS IN CONCATENATIVE SPEECH SYNTHESIS. In: Fifth ISCA Workshop on Speech Synthesis, Pittsburgh, U.S.A

Tran D-D (2003) Building a large Vietnamese Speech Database. Master Thesis, Hanoi University of Technology

Tran D-D (2007) Speech synthesis from text in Vietnamese language. PhD thesis, Grenoble, INPG

Trịnh AT (2000) Some methods of improving quality of Vietnamese synthesis systems V-TALK. J Post Telecommun No Hanoi

T Saito YH (1996) High-quality speech synthesis using context-dependent syllabic units. 1:381–384 vol. doi: 10.1109/ICASSP.1996.541112

Van Do T, Tran D-D, Nguyen T-TT (2011) Non-uniform Unit Selection in Vietnamese Speech Synthesis. In: Proceedings of the Second Symposium on Information and Communication Technology. ACM, New York, NY, USA, pp 165–171

Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269. doi: 10.1109/TIT.1967.1054010

Vu Q, Demuynck K, Van Compernolle D (2006) Vietnamese automatic speech recognition: the FLaVoR approach. In: Chinese Spoken Language Processing. Springer, pp 464–474

Vu TT, Luong MC, Nakamura S (2009) An HMM-based Vietnamese speech synthesis system. In: 2009 Oriental COCOSDA International Conference on Speech Database and Assessments. pp 116–121

Xu J, Choy T, Dong M, et al (2003) On unit analysis for Cantonese corpus-based TTS. In: 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003

Yoshimura T (2002) Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. PhD thesis, Nagoya Institute of Technology

Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51:1039–1064

Zhenli Yu KW (2004) Data pruning approach to unit selection for inventory generation of concatenative embeddable Chinese TTS systems. In: INTERSPEECH, Jeju Island, Korea

APPENDIX

Submitted international conference & workshop papers:
- Nguyen Manh Ha, Nguyen Tien Thanh, Tran Do Dat & Mac Dang Khoa, "Vietnamese Text-to-speech on mobile device: Domain oriented approach for non-uniform unit selection". Spoken Language Technologies for Under-Resourced Languages (SLTU), Yogyakarta, Indonesia, 2016 (Submitted)
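Supplementary sketch (unit selection search). The evaluation in Section 4.3.2 attributes the gain in naturalness to the way target cost and join cost guide the choice of units, in the spirit of Hunt and Black (1996) and the Viterbi search (Viterbi 1967). The Python sketch below illustrates that kind of search on a toy lattice; it is not the thesis implementation. The `Unit` fields, the two cost functions and their weights (1.0 for a position mismatch, 1.5 for a non-contiguous join) are assumptions made up for the example, and the sub-costs and coefficients of Chapter 3 are not reproduced here.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Unit(NamedTuple):
    """Hypothetical candidate unit: one recorded syllable in the corpus."""
    syllable: str
    position: str   # "begin", "middle" or "end" of its phrase
    file_id: int    # recording the unit was cut from
    index: int      # syllable index inside that recording


Target = Tuple[str, str]  # (syllable, wanted position)


def target_cost(target: Target, unit: Unit) -> float:
    # Assumed toy sub-cost: penalise a mismatch in phrase position only.
    return 0.0 if unit.position == target[1] else 1.0


def join_cost(left: Unit, right: Unit) -> float:
    # Assumed toy cost: units adjacent in the same recording join for free.
    contiguous = left.file_id == right.file_id and right.index == left.index + 1
    return 0.0 if contiguous else 1.5


def select_units(
    targets: Sequence[Target],
    candidates: Sequence[Sequence[Unit]],
    tcost: Callable[[Target, Unit], float] = target_cost,
    jcost: Callable[[Unit, Unit], float] = join_cost,
) -> Tuple[float, List[Unit]]:
    """Viterbi search for the candidate sequence with the lowest total cost."""
    # best[i][j] = (cost of the cheapest path ending at candidates[i][j], back-pointer)
    best = [[(tcost(targets[0], c), -1) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + jcost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + tcost(targets[i], c), prev))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path: List[Unit] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return total, path


# Toy lattice for the word "chúng tôi" (all values are illustrative).
TARGETS = [("chúng", "begin"), ("tôi", "middle")]
CANDIDATES = [
    [Unit("chúng", "middle", 1, 3), Unit("chúng", "begin", 2, 0)],
    [Unit("tôi", "middle", 1, 4), Unit("tôi", "end", 3, 7)],
]
TOTAL, PATH = select_units(TARGETS, CANDIDATES)
```

With these toy weights the search keeps the two syllables that are adjacent in recording 1, even though the first one has the wrong phrase position, because avoiding a join is worth more than the position penalty. Changing the weights changes the choice, which is the sensitivity to coefficients discussed in Section 4.3.2.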
... focus on learning and researching on speech synthesis ... 1.2 Text-To-Speech ... talking-head synthesis ... 1.2.4 Formant synthesis. Formant synthesis was ... state-duration distribution to model the temporal structure of speech. Choices for state-duration distributions are the Gaussian ...
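Supplementary sketch (reading the meta-data file). Section 4.2.3 stores the extracted syllable information in an XML file whose tags are listed in Table 4-2. The Python sketch below shows one way such a file might be read. The thesis gives the tag names but not how they are nested, so the `<corpus>` root, the `<syllable>` wrapper, the sample values and the tone labels are all assumptions. Treating `End_index - Start_index` as the duration in samples is also an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical file content: only the tag names come from Table 4-2.
SAMPLE_XML = """
<corpus>
  <syllable>
    <Name>chúng</Name>
    <Start_index>1200</Start_index>
    <End_index>5400</End_index>
    <LeftSyl></LeftSyl>
    <RightSyl>tôi</RightSyl>
    <Tone>sắc</Tone>
  </syllable>
  <syllable>
    <Name>tôi</Name>
    <Start_index>5400</Start_index>
    <End_index>9000</End_index>
    <LeftSyl>chúng</LeftSyl>
    <RightSyl></RightSyl>
    <Tone>ngang</Tone>
  </syllable>
</corpus>
"""


@dataclass
class Syllable:
    name: str
    start: int   # Start_index, assumed to be a sample index
    end: int     # End_index, assumed to be a sample index
    left: str    # LeftSyl, empty when there is no preceding syllable
    right: str   # RightSyl, empty when there is no following syllable
    tone: str

    @property
    def duration(self) -> int:
        # Duration in samples under the assumption stated above.
        return self.end - self.start


def load_syllables(xml_text: str) -> List[Syllable]:
    """Parse every <syllable> element into a Syllable record."""
    root = ET.fromstring(xml_text)
    result = []
    for node in root.iter("syllable"):
        def text(tag: str) -> str:
            # findtext returns None for a missing tag; normalise to "".
            return (node.findtext(tag) or "").strip()

        result.append(
            Syllable(
                name=text("Name"),
                start=int(text("Start_index")),
                end=int(text("End_index")),
                left=text("LeftSyl"),
                right=text("RightSyl"),
                tone=text("Tone"),
            )
        )
    return result


SYLLABLES = load_syllables(SAMPLE_XML)
```

Records of this kind carry the context (left and right syllable, tone, boundaries) that the searching and selecting steps need. The remaining tags of Table 4-2 could be added as further fields in the same way.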

Posted: 16/07/2017, 17:26


Table of contents

  • COMMITMENT

  • ACKNOWLEDGEMENT

  • List of figures

  • List of tables

  • Introduction

  • Chapter 1. Overview of speech processing and text-to-speech

    • 1.1. Speech and speech processing

      • 1.1.1. Sound

      • 1.1.2. Human vocal mechanism

      • 1.1.3. Speech representation in the time and frequency domains

      • 1.1.4. Speech processing

    • 1.2. Text-To-Speech

      • 1.2.1. Introduction

      • 1.2.2. Speech synthesis techniques

      • 1.2.3. Articulatory synthesis

      • 1.2.4. Formant synthesis

      • 1.2.5. Concatenative synthesis

      • 1.2.6. Statistical Parametric synthesis

    • 1.3. From concatenative synthesis to unit selection synthesis

      • 1.3.1. Extending concatenative synthesis

      • 1.3.2. The algorithm of Hunt and Black

      • 1.3.3. Speech synthesis based on non-uniform units selection

        • 1.3.3.1. Basic unit types

        • 1.3.3.2. Non-uniform units selection
