
Efficient Neural Speech Synthesis (Vietnamese title: Cải tiến tổng hợp tiếng nói sử dụng học sâu, "Improving speech synthesis using deep learning")


HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER'S THESIS IN DATA SCIENCE AND ARTIFICIAL INTELLIGENCE

Efficient Neural Speech Synthesis

LAM XUAN THU
thu.lx202712m@sis.hust.edu.vn

Supervisor: Dr. Dinh Viet Sang
Department: Computer Science
Institute: School of Information and Communication Technology

HANOI, 12/2021

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CERTIFICATION OF MASTER'S THESIS REVISION

Full name of the thesis author: Lam Xuan Thu
Thesis title: Improving speech synthesis using deep learning (Cải tiến tổng hợp tiếng nói sử dụng học sâu)
Major: Data Science and Artificial Intelligence
Student ID: 20202712M

The author, the scientific supervisor, and the thesis examination committee certify that the author has revised and supplemented the thesis according to the minutes of the committee meeting on 24/12/2021, with the following contents:
- Added the problem statement and objectives to Chapter 1
- Separated the related work from the Background chapter into its own chapter
- Added further analysis of related studies (Chapter 3)
- Analyzed the rationale for each component of the proposed model (Chapter 4)
- Added detail to the experiments (Chapter 5)
- Re-presented some of the referenced theoretical content

Day ... month ... year 2022
Supervisor        Chairman of the committee        Thesis author

Declaration of Authorship and Topic Sentences

Personal Information
• Full name: LAM XUAN THU
• Email: thu.lx202712m@sis.hust.edu.vn
• Class: Data Science
• Tel: 098 994 7001
• Program: Full-time program
• This thesis is performed at: Department of Computer Science, School of Information and Communication Technology
• This thesis is performed from 22/03/2021 to 24/10/2021

Goals of the Thesis
• Proposing a novel neural network for speech synthesis
• Conducting experiments and evaluating the proposed model

Main Tasks of the Thesis
• Introduce the speech synthesis problem and review traditional approaches to it
• Present the machine learning and deep learning background for the speech synthesis problem
• Propose a novel neural network-based system for speech synthesis
• Implement experiments and evaluation
• Conclude and outline future developments

Declaration of Authorship
I, Lam Xuan Thu, hereby warrant that the work and presentation in this thesis were performed by myself under the supervision of Dr. Dinh Viet Sang. All results presented in this thesis are truthful and are not copied from any other works.

Hanoi, 24th Nov 2021
Author
Lam Xuan Thu

Attestation of the Supervisor on the Fulfillment of the Requirements for the Thesis:

Hanoi, 24th Nov 2021
Supervisor
Dr. Dinh Viet Sang

Acknowledgements

I am extremely grateful to my supervisor, Dr. Dinh Viet Sang, who gave me the golden opportunity to work on this wonderful project on the topic of speech synthesis, which also led me to do a great deal of research and to learn many new things. It was a great privilege and honor to work and study under his guidance.

I would also like to express my gratitude to my parents for their love, care, and sacrifices in educating and preparing me for my future. Finally, I would like to thank my friends for their immense support and help during this project. Without their help, completing this project would have been very difficult.

Abstract

Speech synthesis technology is an essential component of current human-computer interaction systems. It assists users in receiving the output of intelligent machines more naturally and intuitively, and has thus received increasing interest in recent years. The primary use of speech synthesis, or text-to-speech technology, is to translate a text into spoken speech automatically. The current research focus is deep learning-based end-to-end speech synthesis technology, which has a more powerful modeling ability. This report proposes a novel deep learning-based speech synthesis model called FastTacotron, which resolves some issues of previous models. Experiments on the LJSpeech dataset show that FastTacotron can converge in just a couple of hours of training on a single GPU. Moreover, the proposed model accelerates the inference process while achieving high speech quality. Furthermore, our model allows controlling the prosody of the synthesized speech, and can therefore create expressive speech.

Keywords: Deep learning, text-to-speech
Table of Contents

Declaration of Authorship and Topic Sentences
Acknowledgements
Abstract
List of Figures
List of Tables
1 Introduction
2 Background
  2.1 Machine learning
  2.2 Deep learning
  2.3 1D Convolutional neural networks
  2.4 Recurrent neural networks
  2.5 Attention
  2.6 Transformer
3 Related work
  3.1 Autoregressive models
    3.1.1 Tacotron
    3.1.2 Tacotron 2
    3.1.3 ForwardTacotron
  3.2 Non-autoregressive models
    3.2.1 FastSpeech
    3.2.2 FastPitch
    3.2.3 FastSpeech 2
4 Proposed method
  4.1 Pre-net and Post-net
  4.2 LSTM
  4.3 Variation predictor
    4.3.1 Duration predictor
    4.3.2 Pitch predictor
    4.3.3 Energy predictor
    4.3.4 Length regulator
  4.4 Vocoder
5 Experiments and Evaluations
  5.1 Training Setup
    5.1.1 Dataset
    5.1.2 Grapheme-to-phoneme converter
    5.1.3 Model configuration
  5.2 Evaluations
    5.2.1 Teacher model
    5.2.2 Metrics
    5.2.3 Results
6 Conclusion
References

List of Figures

1.1 Speech synthesis or Text-to-speech: the artificial production of human speech
1.2 Speech synthesis is used in a wide range of applications, such as assistive technology and multimedia
2.1 Machine learning: a new programming paradigm
2.2 A neural network is parameterized by its weights
2.3 A loss function measures the quality of the network's output
2.4 The loss score is used as a feedback signal to adjust the weights
2.5 1D Convolution
2.6 Recurrent neural network
2.7 The encoder-decoder model with additive attention mechanism [1]
2.8 An alignment graph
2.9 Transformer model architecture [2]
2.10 Scaled Dot-Product Attention (left) and Multi-Head Attention (right) [2]
3.1 Text-to-speech process
3.2 Tacotron model architecture [3]
3.3 Tacotron 2 model architecture [4]
3.4 ForwardTacotron model architecture [5]
3.5 FastSpeech model architecture [6]
3.6 FastPitch model architecture [7]
3.7 FastSpeech 2 model architecture [8]
4.1 Model architecture of FastTacotron
4.2 The CBHG module (1-D convolution bank + highway network + bidirectional GRU) [3]. This CBHG is a powerful module for extracting representations from sequences
4.3 The LSTM cell
4.4 Duration/Pitch/Energy Predictor. The duration, pitch, and energy predictors all have a similar model structure (but different model parameters)
4.5 Length Regulator [6]. This module is used to expand the length of the phoneme sequence to match the length of the mel-spectrogram sequence, as well as to control the voice speed and part of the prosody
4.6 MelGAN model architecture [9]
5.1 Mel loss and Pitch loss
5.2 Duration loss and Energy loss
5.3 TransformerTTS model [10]

5.1.2 Grapheme-to-phoneme converter

The grapheme-to-phoneme converter converts written words into phonetic transcriptions. A grapheme sequence (or graphemes) is the spelling of a word, while a phoneme sequence (or phonemes) is its phonetic form. The grapheme-to-phoneme converter plays a highly essential role in text-to-speech synthesis.

Figure 5.1: Mel loss and Pitch loss.

For example, there are certain regularities in English pronunciation; English has two types of syllables, open and closed. In open syllables, the letter "a" is commonly pronounced as /eɪ/, but in closed syllables it is pronounced as /æ/ or /ɑː/. In the training process, we may rely on the neural network to learn such regularities. However, learning all the regularities is difficult when the training data is insufficient, as it often is, and some exceptions have too few occurrences for the neural network to learn.

Figure 5.2: Duration loss and Energy loss.

As a result, I use the grapheme-to-phoneme converter module g2pE [36] to solve the problem of mispronunciation. This module is designed to convert English graphemes to phonemes.
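As a usage illustration only, the sketch below assumes the open-source g2p_en Python package, which provides the g2pE converter; it is not code from the thesis, and the exact output formatting may differ between package versions.

```python
# Minimal grapheme-to-phoneme conversion sketch using the g2p_en package.
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("I love cats")   # graphemes in, ARPAbet phonemes out
print(phonemes)
# e.g. ['AY1', ' ', 'L', 'AH1', 'V', ' ', 'K', 'AE1', 'T', 'S']
```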
g2pE implements a multilingual multi-speaker TTS model by modifying Tacotron [3]. It uses a simplified version of Tacotron for the encoder and decoder, but keeps the original Tacotron-style post-processing net and the Griffin-Lim algorithm [25] to convert a linear-scale spectrogram into a waveform. A sequence of phonemes is converted to phoneme embeddings and then fed to the encoder as input. The phoneme embedding dictionaries of the individual languages are concatenated to form the entire phoneme embedding dictionary, so duplicated phonemes may exist in the dictionary if the languages share the same phonemes. Note that the phoneme embeddings are normalized to have the same norm. To model multiple speakers' voices in a single TTS model, g2pE adopts a Deep Voice [22]-style speaker embedding network: a one-hot speaker identity vector is converted into a 32-dimensional speaker embedding vector by the speaker embedding network. Unless stated otherwise, I use the same hyperparameter settings as [4].

Figure 5.3: TransformerTTS model [10].

The model is trained to minimize L1 losses between the ground-truth and predicted spectrograms, for both the linear-scale and mel-scale spectrograms. When a multi-speaker TTS model is trained, the amount of speech data differs across speakers, and this data imbalance may bias the TTS model. To cope with the imbalance, the loss of each sample from a speaker is divided by the total number of training samples that belong to that speaker. It was found empirically that this adjustment of the loss function yields better synthesis quality.

5.1.3 Model configuration

My model configuration is as follows. The dimension of the phoneme embedding and the size of the hidden representations in the pre-net and post-net are all set to 256. The hidden size of the GRU is 64 in the duration predictor and 128 in the pitch and energy predictors. The remaining parameters of the duration, pitch, and energy predictors (Figure 4.4) are set identically: the convolution kernel size is 3, the number of convolution output channels is 256, and the dropout rate is 0.5, followed by a linear output layer. Besides, the pitch and energy predictors use a 1D convolution with 256 output channels to project pitch and energy to match the dimension of the hidden representation in the pre-net. The output of the length regulator is fed to a bi-directional LSTM with hidden size 512, and a linear layer then converts it into an 80-dimensional hidden representation. I trained the model on one GeForce RTX 2080 Ti GPU using the Adam optimizer [37] (with β1 = 0.9, β2 = 0.98, and ε = 1e-9), with batch size 32 and learning rate 1e-4. The model is optimized with the mean absolute error (MAE); the final loss is a combination of the mel-spectrogram loss, duration loss, pitch loss, and energy loss. Training converges after a few hours, at around 24,000 steps. The training and validation losses are shown in Figure 5.1 and Figure 5.2.
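To make the configuration concrete, the sketch below collects the hyperparameters above and the combined MAE loss in PyTorch. It is a minimal sketch under the assumptions stated in the comments, with illustrative names, not the author's training code.

```python
# Sketch of the FastTacotron hyperparameters and combined MAE loss (illustrative only).
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class FastTacotronConfig:
    embed_dim: int = 256             # phoneme embedding / pre-net / post-net hidden size
    duration_gru_hidden: int = 64    # GRU hidden size in the duration predictor
    pitch_energy_gru_hidden: int = 128
    predictor_kernel: int = 3        # conv kernel size in the variance predictors
    predictor_channels: int = 256
    predictor_dropout: float = 0.5
    decoder_lstm_hidden: int = 512   # bi-directional LSTM after the length regulator
    n_mels: int = 80
    batch_size: int = 32
    learning_rate: float = 1e-4
    adam_betas: tuple = (0.9, 0.98)
    adam_eps: float = 1e-9


def total_loss(pred: dict, target: dict) -> torch.Tensor:
    """Combined MAE (L1) loss over mel-spectrogram, duration, pitch, and energy."""
    return sum(F.l1_loss(pred[k], target[k])
               for k in ("mel", "duration", "pitch", "energy"))


cfg = FastTacotronConfig()
# The optimizer would then be created as, e.g. (assuming `model` is the acoustic model):
# torch.optim.Adam(model.parameters(), lr=cfg.learning_rate,
#                  betas=cfg.adam_betas, eps=cfg.adam_eps)
```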
5.2 Evaluations

In this section, I evaluate the performance of FastTacotron in terms of audio quality, training speedup, and inference speedup by comparing it with other baseline models, including GT (the ground-truth recordings), Tacotron 2 [4], FastSpeech [6], and ForwardTacotron [5]. I obtained the baseline models' checkpoints from their open-source releases and performed inference on the same machine as the proposed model for comparison.

Table 5.2: Audio quality and inference latency comparison. MelGAN [9] was used as the vocoder. MOS is a numerical measure of the human-judged overall quality of synthesized speech, rated on a scale of 1 (bad) to 5 (excellent). RTF denotes the time (in seconds) required for the system to synthesize a one-second waveform.

  Method                               MOS    Inference latency (RTF)
  Ground Truth                         4.45   -
  Tacotron 2 [4] (+ MelGAN [9])        3.96   0.0793 (11.7x)
  FastSpeech [6] (+ MelGAN [9])        3.62   0.0024 (0.3x)
  ForwardTacotron [5] (+ MelGAN [9])   3.72   0.0060 (0.9x)
  FastTacotron (+ MelGAN [9])          3.81   0.0068 (1.0x)

Table 5.3: Training time comparison.

  Method                                        Training time   Speedup
  Tacotron 2 [4]                                38 hours        7.6x
  FastSpeech [6] (+ TransformerTTS [10])        53 hours        10.6x
  ForwardTacotron [5] (+ TransformerTTS [10])   56 hours        11.2x
  FastTacotron (+ MFA [33])                     5 hours         1.0x

5.2.1 Teacher model

Figure 5.4: Standard mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right).

For the baseline models that use the two-stage teacher-student training pipeline (FastSpeech and ForwardTacotron), the TransformerTTS model [10] was used as the teacher model. TransformerTTS combines the benefits of Tacotron 2 and the Transformer to build an end-to-end TTS model in which a multi-head attention mechanism replaces the RNN structures in the encoder and decoder, as well as the vanilla attention network (Figure 5.3). To boost parallelization and alleviate the long-distance dependency problem, the self-attention mechanism removes the sequential dependency on the previous hidden state. Multi-head attention, as opposed to vanilla attention between the encoder and decoder, can construct the context vector from many aspects using different attention heads.

Using the Transformer in neural TTS provides two benefits over RNN-based models. First, by removing recurrent connections, it permits parallel training, since all frames of an input sequence can be fed to the decoder simultaneously. Second, self-attention directly injects the global context of the whole sequence into each input frame, which establishes long-range dependencies. This is particularly useful in a neural TTS model: the prosody of a synthesized wave depends not just on a few nearby words but also on sentence-level semantics.
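To illustrate why replacing recurrence with multi-head self-attention enables parallel training, here is a small PyTorch sketch; it is illustrative only and not code from the thesis or from TransformerTTS.

```python
# Self-attention processes all frames of a sequence in parallel, unlike an RNN
# that must step through them one at a time.
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# A batch of 2 frame sequences, each 100 steps long.
x = torch.randn(2, 100, embed_dim)

# Every position attends to every other position in one shot, so long-range
# dependencies are direct and the whole sequence is handled in parallel.
attn_out, attn_weights = mha(x, x, x)   # self-attention: query = key = value = x
print(attn_out.shape)       # torch.Size([2, 100, 256])
print(attn_weights.shape)   # torch.Size([2, 100, 100]), averaged over heads
```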
and sentence “I love cats” in the right 5.2.2 Metrics To evaluate the perceptual quality, I performed mean opinion score (MOS) [38] evaluation I randomly chose 50 text samples that did not appear on the training dataset and generated speech audios Synthesized audios are sent to a human rating service which I created using framework Django (see figure 5.7) Eight people with good English skills with an IELTS score of at least 7.5 are invited to judge the speech quality Every sample is rated by raters on a scale from to (1: bad, 2: poor, 3: fair, 4: good, 5: excellent), where is the lowest perceived quality, and is the highest perceived quality, from which a subjective mean opinion score (MOS) is calculated All about systems (except GT) use MelGAN [9] as the vocoder for a fair comparison The training time of FastSpeech and ForwadTacotron includes teacher and student training, while the training time of FastTacotron includes the training time of the MFA model and FastTacotron itself Training time here does not include the vocoder training The proposed model was trained on one GeForce RTX 2080 Ti 46 � � �� �� �� �� �� �� ������� ������� �� �� �� �� �� �� �� �� � � ��� � ���� � ��� � ��� ��� � �� ��� ��� �� � ��� � � ��� ����� ����� Figure 5.6: Two time faster speed: mel-spectrogram and corresponding phonemes alignment of sentence “It’s another hot day” in the left and sentence “I love cats” in the right GPU, while other baseline models were trained on a NVIDIA V100 GPU (which is faster than RTX 2080 GPU) The inference time includes acoustic model and vocoder inference All tests are conducted using a GeForce RTX 2080 Ti GPU and batch size of RTF denotes the real-time factor, that is, the time (in seconds) required for the system to synthesize a one-second waveform The speedup in waveform synthesis for FastSpeech is larger than that reported in [6] since I use MelGAN as the vocoder, which is much faster than WaveGlow Figure 5.7: MOS rating service interface 47 5.2.3 Results � � �� �� �� �� �� �� ������� ������� �� �� �� �� �� �� �� �� � � ��� � � � ��� ��� � ��� � ��� �� �� ��� ��� � ��� � � ��� ����� ����� Figure 5.8: Amplified pitch: mel-spectrogram and corresponding phonemes alignment of sentence “It’s another hot day” in the left and sentence “I love cats” in the right The final results are shown in Table 5.2 and Table 5.3 As we can see, Tacotron is a model with the highest audio quality but the lowest inference speed In contrast, FastSpeech is a model with the fastest inference but the worst audio quality ForwardTacotron and FastTacotron simultaneously outperform FastSpeech in terms of audio quality and also outperform Tacotron in terms of inference speed Moreover, because ForwardTacotron uses two stages teacher-student training pipeline, it is time-consuming FastTacotron has a simple training pipeline and uses some prosodic information as conditional inputs to help convergence faster Our model speeds up the training time by 7.6x, 10.6x, and 11.2x compared with Tacotron 2, FastSpeech, and ForwardTacotron, respectively I also did some experiments about controlling the prosody of synthesized speech In the inference process, I scale up or down the phoneme durations to generate slower or faster audio Pitch and energy also were manipulated to get different expressive speech Audio samples can be found in https://bit.ly/3xguaCW Some examples are shown in the following figures Two sample input sentences are ”It’s another hot day” and ”I love cats”, with corresponding phonemes are ”IH1 T S AH0 N AH1 
5.2.3 Results

Figure 5.8: Amplified pitch: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right).

The final results are shown in Table 5.2 and Table 5.3. As we can see, Tacotron 2 is the model with the highest audio quality but the lowest inference speed. In contrast, FastSpeech has the fastest inference but the worst audio quality. ForwardTacotron and FastTacotron outperform FastSpeech in terms of audio quality and at the same time outperform Tacotron 2 in terms of inference speed. Moreover, because ForwardTacotron uses a two-stage teacher-student training pipeline, its training is time-consuming. FastTacotron has a simple training pipeline and uses some prosodic information as conditional inputs to help it converge faster. Our model speeds up training by 7.6x, 10.6x, and 11.2x compared with Tacotron 2, FastSpeech, and ForwardTacotron, respectively.

I also conducted experiments on controlling the prosody of the synthesized speech. In the inference process, I scale the phoneme durations up or down to generate slower or faster audio. Pitch and energy were also manipulated to obtain different expressive speech. Audio samples can be found at https://bit.ly/3xguaCW. Some examples are shown in the following figures. The two sample input sentences are "It's another hot day" and "I love cats", with corresponding phonemes "IH1 T S AH0 N AH1 DH ER0 HH AA1 T D EY1" and "AY1 L AH1 V K AE1 T S". Figure 5.4 shows the standard output mel-spectrogram and the corresponding phoneme alignment; the alignment is based on the predicted duration. By multiplying the predicted durations by two, I obtain synthesized speech that is two times slower (see Figure 5.5). Conversely, dividing the phoneme durations by two gives audio that is two times faster (see Figure 5.6).

Figure 5.9: Amplified energy: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right).

Apart from adjusting the audio speed, our model allows controlling other aspects of speech style, including pitch and energy. This helps produce more varied speech from a single input sentence (Figures 5.8 and 5.9).
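The duration-based speed control described above can be sketched as a FastSpeech-style length regulator. This is an illustrative simplification under that assumption, not the thesis implementation.

```python
# Sketch of duration scaling in a length regulator for speed control.
import torch

def length_regulate(phoneme_hidden, durations, speed=1.0):
    """Repeat each phoneme's hidden vector by its (scaled) predicted duration.

    phoneme_hidden: (num_phonemes, hidden_dim)
    durations:      (num_phonemes,) predicted frame counts
    speed > 1.0 shortens durations (faster speech), speed < 1.0 lengthens them.
    """
    scaled = torch.clamp((durations.float() / speed).round().long(), min=1)
    return torch.repeat_interleave(phoneme_hidden, scaled, dim=0)

hidden = torch.randn(5, 256)                   # 5 phonemes
durations = torch.tensor([3, 7, 4, 6, 5])      # predicted frames per phoneme

normal = length_regulate(hidden, durations)             # 25 frames
slower = length_regulate(hidden, durations, speed=0.5)  # 50 frames (2x slower)
faster = length_regulate(hidden, durations, speed=2.0)  # ~13 frames (2x faster)
print(normal.shape[0], slower.shape[0], faster.shape[0])
```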
Chapter 6

Conclusion

Speech synthesis is the artificial computer simulation of human speech. It is mainly used to convert text into audio, for example in voice-enabled services and mobile applications. Apart from that, it is also utilized in assistive technology to help vision-impaired people read text. Due to the high complexity and low efficiency of traditional speech synthesis technology, current research focuses on deep learning-based end-to-end speech synthesis, which has a more robust modeling ability and a simpler pipeline.

This report proposed a new deep neural network-based text-to-speech model called FastTacotron. The proposed model is constructed based on the Tacotron architecture, but instead of using attention, it uses a length regulator inspired by FastSpeech. In addition, I use ground-truth phoneme durations extracted with MFA to improve the alignment accuracy and simplify the training pipeline. Inspired by FastSpeech 2, I also introduce some prosodic information as conditional inputs to improve the convergence and the quality of the synthesized speech. The experimental results show that the proposed model can nearly match the state-of-the-art models in terms of speech quality, while being fast in both training and inference.

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
[3] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
[4] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779-4783. IEEE, 2018.
[5] C. Schaefer. ForwardTacotron. https://developer.nvidia.com/blog/creating-robust-neural-speech-synthesis-with-forwardtacotron/, 2020.
[6] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019.
[7] Adrian Łańcucki. FastPitch: Parallel text-to-speech with pitch prediction. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6588-6592. IEEE, 2021.
[8] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
[9] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910.06711, 2019.
[10] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706-6713, 2019.
[11] Keith Ito and Linda Johnson. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
[12] Andrew J. Hunt and Alan W. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, pages 373-376. IEEE, 1996.
[13] Alan W. Black and Paul A. Taylor. Automatically clustering similar units for unit selection in speech synthesis. 1997.
[14] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings (Cat. No. 00CH37100), volume 3, pages 1315-1318. IEEE, 2000.
[15] Heiga Zen, Keiichi Tokuda, and Alan W. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039-1064, 2009.
[16] Heiga Zen, Andrew Senior, and Mike Schuster. Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7962-7966. IEEE, 2013.
[17] Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234-1252, 2013.
[18] Chenfeng Miao, Shuang Liang, Minchuan Chen, Jun Ma, Shaojun Wang, and Jing Xiao. Flow-TTS: A non-autoregressive network for text to speech based on flow. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7209-7213. IEEE, 2020.
[19] Jan Vainer and Ondřej Dušek. SpeedySpeech: Efficient neural speech synthesis. arXiv preprint arXiv:2008.03802, 2020.
[20] Hao Li, Yongguo Kang, and Zhenyu Wang. EMPHASIS: An emotional phoneme-based acoustic model for speech synthesis system. arXiv preprint arXiv:1806.09276, 2018.
[21] Zhizheng Wu, Oliver Watts, and Simon King. Merlin: An open source neural network speech synthesis system. In SSW, pages 202-207, 2016.
[22] Sercan Ö. Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep Voice: Real-time neural text-to-speech. In International Conference on Machine Learning, pages 195-204. PMLR, 2017.
[23] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, and Ming Zhou. Close to human quality TTS with transformer. arXiv preprint arXiv:1809.08895, 2018.
[24] Wei Ping, Kainan Peng, and Jitong Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
[25] Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243, 1984.
[26] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[27] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960-4964. IEEE, 2016.
[28] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
[29] Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723-3730, 2019.
[30] Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5377-5384, 2019.
[31] Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3918-3926. PMLR, 2018.
[32] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617-3621. IEEE, 2019.
[33] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech, volume 2017, pages 498-502, 2017.
[34] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[35] Paul Boersma et al. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences, volume 17, pages 97-110. Citeseer, 1993.
[36] Maximilian Bisani and Hermann Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434-451, 2008.
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[38] Min Chu and Hu Peng. Objective measure for estimating mean opinion score of synthesized speech, April 2006. US Patent 7,024,362.
