Future Directions

Based on the results of this thesis, the author proposes the following directions for future research:

• Build a dataset of about 500 utterances for each emotion, sad and happy (corresponding to 5% of the training set size). These recordings are expected to be collected from professional artists who can express emotion well.

• Investigate further how emotions can be combined within a sentence, since in practice a single utterance very often carries several different emotions.

• Explore a new approach: Reinforcement Learning. Worldwide, the first paper on this approach was published in May of this year.


Published datasets

Flowtron architecture diagram