HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS

Expressive Speech Synthesis

NGUYEN THI NGOC ANH
Anh.NTN211269M@sis.hust.edu.vn
School of Information and Communication Technology

Supervisor: Dr. Nguyen Thanh Hung (Supervisor's signature)
School: Information and Communication Technology
18th May 2023

Graduation Thesis Assignment

Name: Nguyen Thi Ngoc Anh
Phone: +84342612379
Email: Anh.NTN211269M@sis.hust.edu.vn; ngocanh2162@gmail.com
Class: CH2021A
Affiliation: Hanoi University of Science and Technology

I, Nguyen Thi Ngoc Anh, hereby warrant that the work and presentation in this thesis were performed by myself under the supervision of Dr. Nguyen Thanh Hung. All the results presented in this thesis are truthful and are not copied from any other works. All references in this thesis, including images, tables, figures, and quotes, are clearly and fully documented in the bibliography. I take full responsibility for any copied content that violates school regulations.

Student (Signature and Name)

Acknowledgement

I would like to take this opportunity to thank everyone who has supported me throughout my academic career.

To begin, I would like to thank Dr. Nguyen Thanh Hung for his unwavering support and encouragement throughout my master's studies. His support and guidance have been instrumental in helping me achieve my academic goals.

In addition, I would like to thank Dr. Nguyen Thi Thu Trang and her colleagues in Lab 914 for their assistance in completing the experiments. Their willingness to share their knowledge and skills has been very helpful, and I have learned a lot from them. Her knowledge, advice, and support have been very important to my academic career, and I will always be grateful to her.

I also want to thank Dr. Do Van Hai and my other coworkers at Viettel Cyberspace Center for their constant help and support during my master's studies. Their willingness to lend a hand when I needed it has meant a great deal to me; I would not have been able to achieve academic success without their assistance.

Aside from my academic mentors, I am thankful to my family and friends for their constant support and encouragement. Their never-ending love and support have given me strength and pushed me to do well in school.

Finally, I would like to thank myself for persevering and not giving up. The journey was difficult, but I am proud of myself for overcoming the challenges and reaching my academic goals.

Abstract

Text-to-speech (TTS) technology is a type of assistive technology that converts written text into spoken words. The overall goal of the speech synthesis research community is to create natural-sounding synthetic speech. Many speech synthesis engines are currently available on the market, each with its own strengths and weaknesses: some focus on generating natural-sounding speech, while others focus on generating expressive speech. To increase naturalness, researchers have recently identified synthesizing emotional speech as a major research focus for the speech community. Expressive speech synthesis is the ability to convey emotions and attitudes through synthesized speech; this is achieved by adding prosodic features such as intonation, stress, and rhythm to the speech waveform.

To my knowledge, research on Vietnamese expressive speech is scarce, and no datasets from the existing articles have been released, so significant work remains in this field. A large, high-quality dataset is needed to investigate Vietnamese expressive speech. This thesis (1) publishes two Vietnamese emotional speech datasets, (2) proposes a method for automatically building such data, and (3) develops a model for synthesizing emotional speech.
The proposed method for automatically building data helps reduce costs and time by extracting and labeling data from available data sources. At the same time, the applicability of the resulting data is illustrated with the proposed emotional speech synthesis model.

Keywords: Speech Synthesis, Text To Speech, Expressive Speech Synthesis, Corpus Building

Student (Signature and Name)

TABLE OF CONTENTS

INTRODUCTION
CHAPTER 1. THEORETICAL BACKGROUND
  1.1 Speech Features
    1.1.1 Non-emotional Features
    1.1.2 Emotional Features
  1.2 Speech Synthesis
    1.2.1 Overview
    1.2.2 Traditional Speech Synthesis Techniques
    1.2.3 Modern Speech Synthesis Techniques
  1.3 Expressive Speech Synthesis
    1.3.1 Introduction
    1.3.2 ESS Techniques
CHAPTER 2. BUILDING VIETNAMESE EMOTIONAL SPEECH DATASET
  2.1 Surveys
    2.1.1 Existing Emotion Datasets
    2.1.2 Data Processing Techniques
  2.2 Pipeline For Building Emotional Speech Dataset
    2.2.1 Data Selection
    2.2.2 Target Speech Segmentation
    2.2.3 Text Scripting
    2.2.4 Emotional Labeling
    2.2.5 Post-Processing
    2.2.6 Data Augmentation
  2.3 Label Processing
    2.3.1 Manual Annotation
    2.3.2 Automatic Annotation
  2.4 Dataset Analysis
    2.4.1 Analysis of Pipeline Errors
    2.4.2 Text Analysis
    2.4.3 Emotion Analysis
  2.5 Released Datasets
CHAPTER 3. EMOTIONAL SPEECH SYNTHESIS SYSTEM
  3.1 Acoustic Model
    3.1.1 Baseline Acoustic Model
    3.1.2 Proposed Acoustic Model
  3.2 Vocoder
    3.2.1 HifiGAN Vocoder
    3.2.2 Denoiser Module
CHAPTER 4. EXPERIMENTS
  4.1 Evaluation Strategy
    4.1.1 Evaluation Metrics
    4.1.2 Evaluation Design
    4.1.3 Scheme Design
  4.2 Experimental Setup
    4.2.1 Model Configuration
    4.2.2 Training Settings
  4.3 Result and Discussion
    4.3.1 Dataset Evaluation
    4.3.2 Model Evaluation
    4.3.3 Discussion
CONCLUSION

LIST OF FIGURES

1.1 An example of waveform, spectrogram, and mel-spectrogram
1.2 Russell's (1980) circumplex model [8]
1.3 An example of modern TTS architecture
1.4 Typical acoustic models
1.5 Some expressive speech synthesis techniques
2.1 Pipeline for building an emotional speech dataset
2.2 Audio post-processing
2.3 SER model [56]
2.4 F0 means in the TTH and LMH datasets
2.5 t-SNE visualizations of emotion embeddings in the TTH dataset
2.6 t-SNE visualizations of emotion embeddings in the LMH dataset
3.1 Pipeline for training the baseline acoustic model
3.2 Baseline acoustic model architecture
3.3 Proposed acoustic model architecture
3.4 Detail of the Emotion Encoder module
3.5 Proposed vocoder
3.6 HifiGAN model architecture [66]

LIST OF TABLES

2.1 Some emotional datasets
2.2 Pipeline errors in the LMH dataset
2.3 LMH dataset before and after normalization
2.4 Comparison of manual and proposed pipeline processing times
2.5 Syllable coverage in two datasets
2.6 TTH dataset
2.7 LMH dataset
4.1 Scheme setup
4.2 Acoustic model configuration
4.3 MOS score of data evaluation
4.4 EIR score of data evaluation
4.5 SUS score of data evaluation
4.6 MOS score in model evaluation
4.7 EIR score in model evaluation
4.8 SUS score in model evaluation

ACRONYMS

Notation   Description
AI         Artificial Intelligence
ASR        Automatic Speech Recognition
CNN        Convolutional Neural Network
DNN        Deep Neural Network
E2E        End To End
EIR        Emotion Identification Rate
ESS        Expressive Speech Synthesis
GAN        Generative Adversarial Network
GST        Global Style Tokens
HMM        Hidden Markov Model
LMH        Luong Manh Hai
LSTM       Long Short Term Memory
MOS        Mean Opinion Score
NLP        Natural Language Processing
RNN        Recurrent Neural Network
S2S        Sequence-to-Sequence
SER        Speech Emotion Recognition
SOTA       State-Of-The-Art
SPSS       Statistical Parametric Speech Synthesis
SUS        Semantically Unpredictable Sentences
TTH        Tang Thanh Ha
TTS        Text To Speech
VAE        Variational Autoencoder
WER        Word Error Rate

• LMH: the baseline model trained with the LMH dataset;
• LMH-aug-over: the baseline model trained with the LMH data augmented by oversampling.

For model evaluation, the following models, using the dataset augmented by oversampling for each emotion, were trained:

• TTH-baseline: the baseline acoustic model trained separately for each emotion. The model was trained with the neutral sub-data before being trained on each emotion sub-dataset individually; no modifications were made to the model to extract or include emotional information;
• TTH-tacotron2: the Tacotron 2 acoustic model, used as a SOTA TTS reference against which to compare my baseline and proposed models. The training strategy is the same as for the baseline model;
• TTH-proposed: the proposed acoustic model.

HifiGAN vocoders were employed for each target speaker to transform the acoustic features into time-domain waveform samples. Note that data augmentation was not used for training the vocoder because (1) it has been reported that a large amount of training data is not crucial for the vocoder [71], and (2) some preliminary experiments also showed only subtle improvements when the amount of the source speaker's data was sufficiently large [72].

4.1.3 Scheme Design

In my experiments, a scheme is defined as the evaluation of a synthesis system on a given dataset. The schemes to be evaluated covered the 7 systems described in Section 4.1.2 across the four emotions, plus ground-truth speech, in three tests (the MOS, EIR, and SUS tests). Each scheme generated a set of speech samples for the 32-67 sentences that were not included in the training sentences. They were medium-length sentences of up to 10 words, and the content of the test sentences was emotionally neutral so that listeners could focus on acoustic cues only.

The SUS test set was built using a semi-automatic method. Vbee JSC provides a text dataset of millions of dialogues and stories, which are split into sentences and assigned POS labels using VnCoreNLP [73]. The 50 most frequent POS patterns are selected and used to randomly generate 300 new sentences, from which the best 20 sentences are chosen based on meeting the SUS requirements and having correct Vietnamese grammatical structure. One example sentence from the intelligibility test data is: "Chị/N uống/V bao_nhiêu/P điếu/N rồi/C mới/R bay/V lên/R thế/P anh/N ơi/I" ("how many cigarettes you drink before you fly into the air").

The Latin square design [74] allowed evaluation of all schemes and all synthesized sentences while controlling for ordering effects, by ensuring that each group of listeners heard the stimuli in a different order. Thirty listeners with different socio-cultural profiles participated in the evaluation, which was conducted individually in a single session per listener. All listeners were Vietnamese, aged 20 to 40, and half had speech-related research experience. The evaluation was conducted in a quiet environment using headphones, and each session was limited to about 10-15 minutes to avoid listener fatigue. Table 4.1 describes the setup of the tests.

Table 4.1: Scheme setup

                      MOS              EIR              SUS
  TTH
    # Text-file       25               30               12
    # Utterance       125              150              48
    # Session         5
    # Subject         25 (5-5-5-5-5)   25 (5-5-5-5-5)   25 (5-5-5-5-5)
  LMH
    # Text-file       12               12
    # Utterance       36               36               16
    # Session         3                3
    # Subject         15 (5-5-5)       15 (5-5-5)       15 (5-5-5)
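As an illustration of the Latin-square ordering control described above, the sketch below generates a cyclic Latin square of presentation orders, so that every stimulus appears in every serial position across listener groups. The scheme labels are placeholders, not the exact identifiers used in the evaluation.

```python
# Minimal sketch: cyclic Latin square for counterbalancing a listening test.
# Each listener group hears the same stimuli starting from a different offset,
# so ordering effects are balanced across groups.

def latin_square_orders(stimuli):
    """Return one presentation order per listener group (cyclic Latin square)."""
    n = len(stimuli)
    return [[stimuli[(row + col) % n] for col in range(n)] for row in range(n)]

if __name__ == "__main__":
    # Placeholder scheme labels; the real schemes are those defined in Section 4.1.2.
    schemes = ["ground-truth", "baseline", "baseline-aug-over", "tacotron2", "proposed"]
    for group, order in enumerate(latin_square_orders(schemes), start=1):
        print(f"Listener group {group}: {order}")
```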
4.2 Experimental Setup

4.2.1 Model Configuration

The audio format is 16-bit PCM with a 22,050 Hz sampling rate. The FFT, window, and hop sizes were set to 1024, 1024, and 256, respectively, and the waveforms were converted into mel-spectrograms following [75]. The parameters of the baseline acoustic model are mostly taken from FastSpeech; the Transformer module is replaced by a Conformer module with the same number of layers (blocks) in the Encoder and the Mel-spectrogram Decoder, and the number of attention heads is set to 2. The size of the phoneme vocabulary is 176, including punctuation. The kernel size of the 1D convolutions in the Variance Predictor is set to 3, with input/output sizes of 256/256 for both layers and a dropout rate of 0.1. To improve the overall reconstructed mel-spectrogram, the Post-Net layer described in the Tacotron 2 paper is leveraged: the predicted mel-spectrogram is passed through a 5-layer convolutional Post-Net in which each layer is composed of 256 filters of shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer. The hyperparameters and configurations of the acoustic models used in this evaluation are presented in Table 4.2.

Table 4.2: Acoustic model configuration

  Hyperparameter                   Baseline Model   Proposed Model
  Phoneme Embedding Dimension      256              256
  Encoder Layers                   4                4
  Encoder Hidden                   1536             1536
  Encoder CNN Kernel               7                7
  Encoder Attention Heads          2                2
  Encoder Dropout                  0.2              0.3
  Decoder Layers                   4                4
  Decoder Hidden                   1536             1536
  Decoder CNN Kernel               7                7
  Decoder Attention Heads          2                2
  Decoder Dropout                  0.2              0.2
  Variance Predictor Layer         2                2
  Variance Predictor Kernel Size   3                3
  Variance Predictor Filter Size   256              256
  Variance Predictor Dropout       0.1              0.2
  Postnet Layers                   5                5
  Postnet Channels                 256              256
  Postnet Kernel                   5                5
  Postnet Dropout                  0.5              0.6
  Reference Encoder Kernel         \
  Reference Encoder RNN            \                128
  Reference Encoder Dense          \                128
  Style Tokens                     \
  Style Token Attention Head       \
  Batch Size                       256              256

In the proposed acoustic model, the parameter settings of the Reference Encoder and the Style Token Layer are the same as in the original paper's architecture. For the Style Token Layer to learn styles corresponding to the emotion labels defined in the dataset, the number of tokens was set to the number of emotions, which aids the convergence of the unsupervised training process. The parameters of the HifiGAN model mostly follow the configuration of the original paper chosen for the best speech synthesis quality, and 80-band mel-spectrograms were used as input conditions. The Kaldi-based ASR model (https://github.com/kaldi-asr/kaldi) was used for forced alignment.
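For reference, the following is a minimal sketch of the mel-spectrogram front-end implied by the parameters above (22,050 Hz audio, FFT and window size 1024, hop size 256, 80 mel bands), written with librosa. It is an illustrative approximation; the exact extraction and normalization follow [75] and may differ in detail.

```python
import librosa
import numpy as np

def wav_to_mel(path,
               sr=22050,         # sampling rate used in the experiments
               n_fft=1024,       # FFT size
               win_length=1024,  # analysis window size
               hop_length=256,   # hop size
               n_mels=80):       # 80-band mel-spectrogram used as vocoder input
    """Load a waveform and convert it to a log-compressed mel-spectrogram."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, win_length=win_length,
        hop_length=hop_length, n_mels=n_mels)
    # Log compression with a small floor to avoid log(0).
    return np.log(np.clip(mel, 1e-5, None))

# Usage: mel = wav_to_mel("utterance.wav")  ->  array of shape (80, n_frames)
```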
4.2.2 Training Settings

All models are trained on Tesla V100 GPUs running on NVIDIA DGX machines with automatic mixed precision. The acoustic models are trained for up to 4,000 epochs with a batch size of 256 sentences. The Adam optimizer [76] is used with a learning rate of 10⁻³, β₁ = 0.9, β₂ = 0.98, and ϵ = 10⁻⁹; the learning rate is increased during a warm-up period of 3,000 steps and then decayed according to the Conformer schedule. The gradient accumulation trick is used to increase the effective batch size given the limited VRAM of the GPU.

Training a new vocoder from scratch for each speaker is impossible due to the limited speech data available in each dataset. To help the model converge faster, a HifiGAN model pretrained on the LJ Speech dataset is used as the starting point. The HifiGAN vocoder is then trained for up to 1,000,000 iterations per speaker using weight normalization, with a batch size of 16 samples and a fixed learning rate on the order of 10⁻⁴; the networks are trained with the AdamW optimizer [77] with β₁ = 0.8, β₂ = 0.99, and weight decay λ = 0.01.

4.3 Result and Discussion

In this section, the experiments that were carried out are detailed. The first evaluates the usefulness of the datasets built with the proposed pipeline; the second compares the synthesized speech quality of the proposed model with the baseline model and a SOTA model.

4.3.1 Dataset Evaluation

The performance of the two constructed datasets was investigated by evaluating ground-truth speech against speech synthesized from the training data and from the augmented training data. The results of the MOS, EIR, and SUS evaluations are shown in Table 4.3, Table 4.4, and Table 4.5, and the findings can be summarized as follows.

Table 4.3: MOS score of data evaluation

  MOS                Neutral          Happy            Sad              Angry
  TTH-ground truth   4.463 ± 0.736    4.458 ± 0.632    4.250 ± 0.833    4.528 ± 0.656
  TTH                4.148 ± 0.568    3.375 ± 0.729    3.472 ± 0.864    3.667 ± 0.833
  TTH-aug-over       4.167 ± 0.741    3.458 ± 0.785    4.000 ± 0.667    3.944 ± 0.846
  LMH-ground truth   4.600 ± 0.317    4.533 ± 0.552    -                4.600 ± 0.686
  LMH                4.267 ± 0.478    3.733 ± 0.638    -                3.867 ± 0.981
  LMH-aug-over       4.200 ± 0.510    4.067 ± 0.781    -                4.267 ± 0.781

Table 4.4: EIR score of data evaluation

  EIR (F1-score)     Neutral    Happy    Sad      Angry
  TTH-ground truth   0.772      0.680    0.915    0.921
  TTH                0.736      0.667    0.824    0.896
  TTH-aug-over       0.741      0.638    0.889    0.909
  LMH-ground truth   0.842      0.929    -        0.800
  LMH                0.807      0.786    -        0.444
  LMH-aug-over       0.825      0.720    -        0.875

With ground-truth data, angry and sad are the easiest emotions to recognize in the TTH dataset, with F1 scores of 0.921 and 0.915, respectively. Happy is the most easily confused emotion, particularly with neutral; one possible reason is the similarity of the raised voice (amplitude, F0, volume, etc.) across the whole sentence, which is unsurprising given that the data analysis also confirms it.
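The EIR values above are reported as per-emotion F1 scores; a minimal sketch of how such scores can be computed from listeners' identification responses with scikit-learn is given below. The toy responses are illustrative only, not the actual evaluation data.

```python
# Minimal sketch: per-emotion F1 scores from listener identification responses,
# in the spirit of the EIR (F1-score) values reported in Table 4.4.
from sklearn.metrics import f1_score

EMOTIONS = ["neutral", "happy", "sad", "angry"]

def eir_f1(intended, identified):
    """Return {emotion: F1} given intended labels and the labels listeners chose."""
    scores = f1_score(intended, identified, labels=EMOTIONS, average=None)
    return dict(zip(EMOTIONS, scores))

# Toy example only:
intended   = ["neutral", "happy", "sad", "angry", "happy", "angry"]
identified = ["neutral", "neutral", "sad", "angry", "happy", "angry"]
print(eir_f1(intended, identified))
```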
With LMH data, all three emotions are clearly distinct, with F1 scores of 0.842, 0.929, and 0.800 for neutral, happy, and angry, respectively. Based on both these scores and the t-SNE graph in Figure 2.6, the automatic system constructs the data fairly well; in addition, LMH obtained somewhat higher MOS values than TTH.

For the speech synthesized from the constructed data, the oversampling-augmented TTH data shows about a 5% increase in EIR and a 10% increase in MOS, with the largest gain for the emotion with the least data, sad (approximately 11.5%). Unlike TTH, augmenting the LMH data does not enhance its results as clearly: MOS increases for the augmented happy and angry emotions (by about 10%), and EIR increases for angry from 0.444 to 0.875 and for neutral from 0.807 to 0.825, but for the positive emotion (happy) EIR falls from 0.786 to 0.720.

Table 4.5: SUS score of data evaluation

  SUS (% WER)
  TTH             43.61
  TTH-aug-over    43.28
  LMH             43.33
  LMH-aug-over    43.70

The WER in the SUS test was analyzed for both the unprocessed data and the augmented data of the TTH and LMH speakers. The results in Table 4.5 indicate no significant difference in WER between the two types of data, which suggests that the data augmentation technique did not have a noticeable impact on the intelligibility of emotional speech synthesis.

4.3.2 Model Evaluation

The most evident result in Table 4.6, Table 4.7, and Table 4.8 is that both the baseline model and the proposed model outperform the SOTA neutral TTS in naturalness, intelligibility, and emotion identification: the average MOS of the Tacotron 2 model (SOTA) stayed in the 2-point range, whereas the baseline and proposed models both reached the 4-point range, and the average identification rate for angry reached about 0.9.

Table 4.6: MOS score in model evaluation

  MOS                 Neutral          Happy            Sad              Angry
  TTH-ground truth    4.463 ± 0.736    4.458 ± 0.632    4.250 ± 0.833    4.528 ± 0.656
  TTH-aug-tacotron2   2.926 ± 0.768    2.417 ± 0.785    2.278 ± 0.799    3.028 ± 0.704
  TTH-aug-baseline    4.167 ± 0.741    3.458 ± 0.785    4.000 ± 0.667    3.944 ± 0.846
  TTH-aug-proposed    3.926 ± 0.686    4.000 ± 0.333    4.194 ± 0.806    4.556 ± 0.617

The MOS of the proposed model is typically about 10% higher than that of the baseline model for each emotion, and for angry synthesized speech it reaches 4.556, higher than the ground truth (4.528). It is easy to understand why the neutral MOS of the proposed model is lower than that of the baseline model, since training the model with neutral-only data is clearly superior to training it with data containing other emotions.

Table 4.7: EIR score in model evaluation

  EIR (F1-score)      Neutral    Happy    Sad      Angry
  TTH-ground truth    0.772      0.680    0.915    0.921
  TTH-aug-tacotron2   0.561      0.171    0.512    0.756
  TTH-aug-baseline    0.741      0.638    0.889    0.909
  TTH-aug-proposed    0.641      0.653    0.744    0.894

The proposed model is more natural and understandable than the baseline model, but it is about 11% less sensitive to emotional nuance, except for happy, where its F1 score is 0.015 higher. This can be explained by the fine-tuning, which lets the baseline model implicitly acquire the hidden characteristics of each emotion in the final phases after first being trained on neutral speech unaffected by other emotions.

Table 4.8: SUS score in model evaluation

  SUS (% WER)
  TTH-aug-baseline    43.28
  TTH-aug-tacotron2   65.49
  TTH-aug-proposed    39.08

With the results of the SUS evaluation shown in Table 4.8, the proposed model proves that its synthesized speech has higher intelligibility than the baseline model, with a 4.20% lower WER, and it outperforms the Tacotron 2 model by 26.41%.

4.3.3 Discussion

In general, manual relabeling improves results but does not substantially outperform the proposed pipeline, while the pipeline reduces data processing costs. This demonstrates that if a quick and cost-effective dataset is required, implementing the proposed pipeline is quite reasonable and efficient; consequently, it facilitates the creation of synthetic speech and the processing of emotional speech, and building high-quality datasets is no longer as difficult.

Using SER still requires an understanding of the data, particularly as the deep learning trend moves toward data-centric approaches. For example, predicted gender labels were typically detrimental to emotion recognition. This can be explained by the fact that, in the multi-speaker corpus, true gender classifications can be misleading: there are numerous instances of female speech that strongly resemble male features, and vice versa. In addition, data augmentation through oversampling generally improves performance, but it is not always effective and must be implemented correctly.

Based on the findings presented above, it is evident that the proposed model has a significant impact on the naturalness and intelligibility of emotional synthetic speech. Specifically, when compared to both the baseline model and the neutral-emotion SOTA model (Tacotron 2), the proposed model demonstrates a marked improvement in the quality of synthetic speech. However, while the proposed model has shown promising results in generating more natural and intelligible speech, its ability to extract emotions still needs improvement; specifically, its emotional extraction capabilities must be improved to more accurately distinguish between different speech emotions. In fact, the accuracy of the best models is not yet high enough to implement a system that can provide concrete help in daily life. Nevertheless, the proposed solutions can inspire and indicate possible directions for further investigations.
CONCLUSION

In this work, I have presented several novel contributions to the field of Vietnamese emotional speech synthesis.

The first contribution is the creation and publication of two Vietnamese emotional speech datasets based on open-domain data. As far as I know, this is the first study to publish a Vietnamese emotional speech dataset. These datasets have allowed me to train models for synthesizing emotional artificial speech in Vietnamese, which is the second contribution: the proposed model has demonstrated the ability to generate expressive speech that conveys a range of emotions with high fidelity. My third and final contribution is a proposed method for automatically building the two emotional speech corpora from available services, tools, models, and data. This approach addresses the need for a more efficient and cost-effective way of creating further emotional speech corpora that can be used for training expressive speech synthesis models.

To evaluate the effectiveness of the proposed model, the synthetic emotional speech was evaluated by 40 different assessors. The results of the evaluation are promising, demonstrating the potential of the model to generate high-quality emotional speech in Vietnamese.

Although the thesis achieved many promising results, it still leaves several challenges for the emotional speech synthesis task:

• Combining different information domains (the text and visual data available in movies) to normalize data more accurately.
• The small size of the emotional speech corpora, which is often a restriction for training complex deep learning systems. I therefore hope that the proposed system can help create more datasets, increasing the number of emotional speech datasets in Vietnamese specifically, and of Vietnamese speech datasets in general.

Some other papers in the speech synthesis domain related to my thesis, completed during my Master's program:

• Nguyen Thi Ngoc Anh, Le Minh Nguyen, Nguyen Thi Thu Trang, "VLSP 2022 - TTS Challenge: Vietnamese Emotional Speech Synthesis," Journal of Computer Science and Cybernetics, manuscript submitted for publication.
• Nguyen Thi Ngoc Anh, Nguyen Tien Thanh, Le Dang Linh, "TTS - VLSP 2021: The Thunder Text-To-Speech System," VNU Journal of Science: Computer Science and Communication Engineering [Online], 38.1 (2022).
• N. T. N. Anh, N. T. Thanh, and L. N. Minh, "Development of a High Quality Text to Speech System for LAO," 2022 25th Conference of the Oriental COCOSDA International Committee for the Coordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Hanoi, Vietnam, 2022, pp. 1-5.

REFERENCES

[1] T. L. Xuan, T. D. T. Le, L. T. Van, and Q. N. Hong, Speech emotions and statistical analysis for Vietnamese emotion corpus, 2016.
[2] D.-K. Mac, E. Castelli, and V. Aubergé, Modeling the prosody of Vietnamese attitudes for expressive speech synthesis, 2012.
[3] D.-K. Mac, V. Aubergé, A. Rilliard, and E. Castelli, Audio-visual prosody of social attitudes in Vietnamese: Building and evaluating a tones balanced corpus, International Speech Communication Association (ISCA), 2009.
[4] N. Brooke and Q. Summerfield, Analysis, synthesis, and perception of visible articulatory movements, 1983.
[5] K. Doshi, "Audio deep learning made simple (part 2): Why mel spectrograms perform better," 2021. [Online]. Available: https://towardsdatascience.com/audio-deep-learning-made-simple-part-2-whymel-spectrograms-perform-better-aad889a93505 (visited on 04/12/2023).
[6] R. Banse and K. Scherer, Acoustic profiles in vocal emotion expression, 1996.
[7] M. Wagner and D. Watson, Experimental and theoretical advances in prosody: A review, 2010.
[8] L. Barrett and J. Russell, Independence and bipolarity in the structure of current affect, 1998.
[9] J. Harrington and S. Cassidy, "Digital formant synthesis," in Techniques in Speech Acoustics, Springer Netherlands, 1999, pp. 195-210.
[10] Voice Communication Between Humans and Machines, The National Academies Press, 1994, p. 118.
[11] J. P. Olive, Rule synthesis of speech from dyadic units, 1977.
[12] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, 1999.
[13] H. Zen, A. W. Senior, and M. Schuster, Statistical parametric speech synthesis using deep neural networks, 2013.
[14] X. Tan, T. Qin, F. Soong, and T.-Y. Liu, A survey on neural speech synthesis, 2021.
[15] H. Kawahara, STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds, 2006.
[16] M. Morise, F. Yokomori, and K. Ozawa, WORLD: A vocoder-based high-quality speech synthesis system for real-time applications, 2016.
[17] Y. Wang, R. Skerry-Ryan, D. Stanton, et al., Tacotron: Towards end-to-end speech synthesis, 2017.
[18] J. Shen, R. Pang, R. J. Weiss, et al., Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, 2018.
[19] S. O. Arik, M. Chrzanowski, A. Coates, et al., Deep Voice: Real-time neural text-to-speech, 2017.
[20] W. Ping, K. Peng, A. Gibiansky, et al., Deep Voice 3: Scaling text-to-speech with convolutional sequence learning, 2018.
[21] S. Arik, G. Diamos, A. Gibiansky, et al., Deep Voice 2: Multi-speaker neural text-to-speech, 2017.
[22] Y. Ren, Y. Ruan, X. Tan, et al., FastSpeech: Fast, robust and controllable text to speech, 2019.
[23] Y. Ren, C. Hu, X. Tan, et al., FastSpeech 2: Fast and high-quality end-to-end text to speech, 2022.
[24] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, Neural speech synthesis with Transformer network, 2019.
[25] J. M. R. Sotelo, S. Mehri, K. Kumar, et al., Char2Wav: End-to-end speech synthesis, 2017.
[26] A. van den Oord, S. Dieleman, H. Zen, et al., WaveNet: A generative model for raw audio, 2016.
[27] N. Kalchbrenner, E. Elsen, K. Simonyan, et al., Efficient neural audio synthesis, 2018.
[28] R. Prenger, R. Valle, and B. Catanzaro, WaveGlow: A flow-based generative network for speech synthesis, 2018.
[29] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, FloWaveNet: A generative flow for raw audio, 2019.
[30] K. Kumar, R. Kumar, T. de Boissiere, et al., MelGAN: Generative adversarial networks for conditional waveform synthesis, 2019.
[31] R. Yamamoto, E. Song, and J.-M. Kim, Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram, 2020.
[32] J. Ho, A. Jain, and P. Abbeel, Denoising diffusion probabilistic models, 2020.
[33] J. Kim, J. Kong, and J. Son, Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech, 2021.
[34] Q. H. N., Thanh X. Le, and An T. Le, Emotional Vietnamese speech synthesis using style-transfer learning, 2023.
[35] T. D. Ngo, M. Akagi, and T. D. Bui, Toward a rule-based synthesis of Vietnamese emotional speech, 2015.
[36] P. Ekman, W. V. Friesen, M. O'Sullivan, et al., Universals and cultural differences in facial expressions of emotion, 1987.
[37] M. Schröder, "Expressive speech synthesis: Past, present, and possible futures," in Affective Information Processing, Springer London, 2009, pp. 111-126.
[38] N. Tits, K. E. Haddad, and T. Dutoit, The theory behind controllable expressive speech synthesis: A cross-disciplinary approach, 2019.
[39] R. Skerry-Ryan, E. Battenberg, Y. Xiao, et al., Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron, 2018.
[40] Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling, Learning latent representations for style control and transfer in end-to-end speech synthesis, 2019.
[41] D. Stanton, Y. Wang, and R. Skerry-Ryan, Predicting expressive speaking style from text in end-to-end speech synthesis, 2018.
[42] L. N. Tung, "A study on improving speaker diarization system," M.S. thesis, Hanoi University of Science and Technology, 2021.
[43] B. Desplanques, J. Thienpondt, and K. Demuynck, ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification, 2020.
[44] N. C. Hong, L. Chung-Ting, L. Yung-Hui, and W. Jia-Ching, A survey of Vietnamese automatic speech recognition, 2021.
[45] N. T. M. Thanh, P. X. Dung, N. N. Hay, L. N. Bich, and D. X. Quy, Evaluation of Vietnamese speech recognition platforms (VAIS, Viettel, Zalo, FPT and Google) in news, 2021.
[46] V. H. Nguyen, An end-to-end model for Vietnamese speech recognition, 2019.
[47] T. B. Thang, D. D. Son, L. D. Linh, D. X. Vuong, and D. Q. Tien, ASR VLSP 2021: Automatic speech recognition with blank label re-weighting, 2022.
[48] M. Schröder and R. Cowie, "Developing a consistent view on emotion-oriented computing," in Machine Learning for Multimodal Interaction, Springer Berlin Heidelberg, 2006, pp. 194-205.
[49] Z. Kong, W. Ping, A. Dantrey, and B. Catanzaro, Speech denoising in the waveform domain with self-attention, 2022.
[50] M. Yurtbaşı, Detecting and correcting speech rhythm errors, 2015.
[51] Z. Lian, J. Tao, B. Liu, J. Huang, Z. Yang, and R. Li, Context-dependent domain adversarial neural network for multimodal emotion recognition, 2020.
[52] I. Gat, H. Aronowitz, W. Zhu, E. Morais, and R. Hoory, Speaker normalization for self-supervised speech emotion recognition, 2022.
[53] Y. Wang, A. Boumadane, and A. Heba, A fine-tuned wav2vec 2.0/HuBERT benchmark for speech emotion recognition, speaker verification and spoken language understanding, 2022.
[54] M. A. Jalal, R. Milner, and T. Hain, Empirical interpretation of speech emotion perception with attention based model for speech emotion recognition, 2020.
[55] T. D. T. Le, L. T. Van, and Q. N. Hong, Deep convolutional neural networks for emotion recognition of Vietnamese, 2020.
[56] B. T. Ta, T. L. Nguyen, D. S. Dang, N. M. Le, and V. H. Do, Improving speech emotion recognition via fine-tuning ASR with speaker information, 2022.
[57] L. Hieu-Thi and V. Hai-Quan, A non-expert Kaldi recipe for Vietnamese speech recognition system, 2016.
[58] A. Velichko, M. Markitantov, H. Kaya, and A. Karpov, Complex paralinguistic analysis of speech: Predicting gender, emotions and deception in a hierarchical framework, 2022.
[59] L. van der Maaten and G. Hinton, Visualizing data using t-SNE, 2008.
[60] S. Schneider, A. Baevski, R. Collobert, and M. Auli, wav2vec: Unsupervised pre-training for speech recognition, 2019.
[61] N. T. N. Anh, L. M. Nguyen, and N. T. T. Trang, "VLSP 2022 - TTS challenge: Vietnamese emotional speech synthesis," submitted for publication to Journal of Computer Science and Cybernetics, 2023.
[62] J. Kong, J. Kim, and J. Bae, HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis, 2020.
[63] N. T. N. Anh, N. T. Thanh, and L. Linh, TTS - VLSP 2021: The Thunder text-to-speech system, 2022.
[64] A. Gulati, J. Qin, C.-C. Chiu, et al., Conformer: Convolution-augmented Transformer for speech recognition, 2020.
[65] Y. Wang, D. Stanton, Y. Zhang, et al., Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis, 2018.
[66] [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/hifigan_pyt
[67] ITU-T Recommendation, Methods for subjective determination of transmission quality.
[68] J. Holub, H. Avetisyan, and S. Isabelle, Subjective speech quality measurement repeatability: Comparison of laboratory test results, 2017.
[69] F. Tsao, Is AQ or F score the last word in determining individual effort, 1943.
[70] C. Benoît, M. Grice, and V. Hazan, The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences, 1996.
[71] K. Matsubara, T. Okamoto, R. Takashima, et al., Investigation of training data size for real-time neural vocoders on CPUs, 2021.
[72] R. Terashima, R. Yamamoto, E. Song, et al., Cross-speaker emotion transfer for low-resource text-to-speech using non-parallel voice conversion with pitch-shift data augmentation, 2022.
[73] T. Vu, D. Q. Nguyen, D. Q. Nguyen, M. Dras, and M. Johnson, VnCoreNLP: A Vietnamese natural language processing toolkit, 2018.
[74] W. G. Cochran and M. C. Gertrude, Experimental designs, 1957.
[75] J. Shen, R. Pang, R. J. Weiss, et al., Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, 2018.
[76] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2017.
[77] I. Loshchilov and F. Hutter, Decoupled weight decay regularization, 2019.