Vietnamese speech synthesis with different voice qualities and emotional expression

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Lê Xuân Thành

VIETNAMESE SPEECH SYNTHESIS WITH DIFFERENT VOICE QUALITIES AND EMOTIONAL EXPRESSION

Major: Computer Science
Code: 9480101

DOCTORAL THESIS IN COMPUTER SCIENCE

SCIENTIFIC SUPERVISORS:
Assoc. Prof. Dr. Đặng Văn Chuyết
Assoc. Prof. Dr. Trịnh Văn Loan

Hanoi - 2018

DECLARATION

I hereby declare that all of the content of the thesis "Vietnamese speech synthesis with different voice qualities and emotional expression" is my own research work. The data and results in the thesis are truthful and have not been published in any other work. All source materials consulted are cited and listed in the references as required.

Hanoi, 2018

SCIENTIFIC SUPERVISORS: Assoc. Prof. Dr. Đặng Văn Chuyết, Assoc. Prof. Dr. Trịnh Văn Loan
THESIS AUTHOR: Lê Xuân Thành

ACKNOWLEDGMENTS

I would like to express my gratitude to Hanoi University of Science and Technology, the Graduate School, the School of Information and Communication Technology, the Department of Computer Engineering and the Department of Computer Science for creating favorable conditions for me during my work, study and research at the University.

I would like to give special thanks to my direct supervisors, Assoc. Prof. Dr. Trịnh Văn Loan and Assoc. Prof. Dr. Đặng Văn Chuyết. Both have always helped me wholeheartedly and given me valuable advice and scientific guidance so that I could carry out and complete my research.

I sincerely thank the teachers and colleagues of the Department of Computer Engineering, School of Information and Communication Technology, Hanoi University of Science and Technology, where I work, study and carry out this research, for their enthusiastic help and encouragement throughout the research process.

I am grateful to the teachers, scientists, colleagues and friends who encouraged and helped me during the research.

Finally, I wish to express my deep gratitude to my family, who raised me and have been the source of motivation that helped me overcome obstacles and difficulties to complete this thesis.

Lê Xuân Thành

TABLE OF CONTENTS

Declaration
Acknowledgments
Table of contents
List of abbreviations
List of tables
List of figures
Introduction
Chapter 1. Overview of research on speech synthesis and emotional speech synthesis
  1.1 State of research on speech synthesis worldwide
    1.1.1 Concatenative synthesis
    1.1.2 Model-based synthesis
  1.2 Research on speech synthesis in Vietnam
    1.2.1 Concatenative synthesis
    1.2.2 Model-based synthesis
  1.3 Research on emotional speech synthesis worldwide
    1.3.1 Overview
    1.3.2 Parameters affecting emotion in speech
  1.4 Research on emotional speech in Vietnamese
  1.5 Chapter conclusion
Chapter 2. Building the Vietnamese corpora
  2.1 Building a corpus for good-quality Vietnamese speech synthesis
    2.1.1 Phonetic characteristics of Vietnamese
    2.1.2 Phoneme system and syllable structure of Vietnamese
    2.1.3 Tone system
    2.1.4 Initial sound system
    2.1.5 Medial sound system
    2.1.6 Nucleus (main sound) system
    2.1.7 Final sound system
    2.1.8 Building the good-quality spoken Vietnamese corpus
    2.1.9 Building the list of syllables in the corpus
    2.1.10 Recording script
    2.1.11 Recording
  2.2 Building the emotional Vietnamese corpus
    2.2.1 Purpose of building the emotional Vietnamese corpus
    2.2.2 Parameters affecting emotion in speech
    2.2.3 Method for building the emotional Vietnamese corpus
    2.2.4 Analysis and evaluation of some parameters affecting emotional expression in spoken Vietnamese
    2.2.5 Evaluating the class separation of the emotional Vietnamese corpus
  2.3 Chapter conclusion
Chapter 3. Synthesizing Vietnamese with emotional expression
  3.1 Good-quality Vietnamese synthesis
    3.1.1 Building the corpus for good-quality Vietnamese synthesis
    3.1.2 Good-quality Vietnamese synthesis by concatenation
    3.1.3 Test synthesis of some sentences with the good-quality Vietnamese synthesizer
  3.2 Synthesizing Vietnamese with emotional expression
    3.2.1 The Fujisaki model
    3.2.2 Synthesizing Vietnamese with emotional expression using the Fujisaki model
    3.2.3 Test synthesis of Vietnamese with emotional expression
    3.2.4 Subjective evaluation of the quality of synthesized Vietnamese sentences with emotional expression
    3.2.5 Objective evaluation of the quality of synthesized Vietnamese sentences with emotional expression
  3.3 Chapter conclusion
Conclusions and recommendations
Published works of the thesis
References
Appendix A: List of sounds to record

LIST OF ABBREVIATIONS

Abbreviation: full form (explanation)

Accent: stress
ANOVA: Analysis of Variance
BKEmon: Bach Khoa Emotion (the Vietnamese emotional speech corpus built by the PhD candidate)
DRM: Distinctive Region Model
Duration: phonation duration (the length of the sound signal)
EEG: ElectroEncephaloGram (brain electrical signal)
F0: Fundamental frequency
GMM: Gaussian Mixture Model
HLDA: Heteroscedastic Linear Discriminant Analysis
HMM: Hidden Markov Model
HTK: Hidden Markov Model Toolkit
HTS: HMM-based Speech Synthesis System
LDA: Linear Discriminant Analysis
LDC: Linguistic Data Consortium
LLR: Log Likelihood Ratio
LPC: Linear Prediction Coding
MBROLA: Multi-Band Resynthesis OverLap Add (a concatenative speech synthesizer)
MFCC: Mel Frequency Cepstral Coefficients
MICA: International Research Institute Multimedia, Information, Communication and Applications
MOS: Mean Opinion Score
NIST: National Institute of Standards and Technology (United States)
NLP: Natural Language Processing
Pitch: pitch height
Pitch contour: the contour of pitch over time
PCA: Principal Component Analysis
Phrase: phrase
PSOLA: Pitch Synchronous Overlap and Add
SMO: Sequential Minimal Optimization
Segmental: segmental
Suprasegmental: suprasegmental
SVM: Support Vector Machines
Tone: lexical tone
TTS: Text-to-Speech
Tukey's test: referred to as the T test in this thesis
WER: Word Error Rate

LIST OF TABLES

Table 2.1 Consonant system and pronunciation
Table 2.2 Vietnamese vowel system
Table 2.3 Vietnamese syllable structure
Table 2.4 Classification of Vietnamese tones
Table 2.5 Vietnamese initial sound system
Table 2.6 Description of the Vietnamese initial consonant system
Table 2.7 Vietnamese nucleus (main sound) system
Table 2.8 Vowel system with 13 monophthongs and the diphthongs
Table 2.9 Vietnamese final sound system
Table 2.10 Vietnamese final sound system by manner of articulation
Table 2.11 Organization of the initial sound units and final sound units
Table 2.12 F and P-values of the ANOVA for male and female voices with mean F0 and mean energy
Table 2.13 Results of the T test on F0 for the speakers T.T.H and Đ.K
Table 2.14 Results of the T test on mean energy for the voices Đ.K (male) and T.T.H (female)
Table 2.15 F and P-values of the ANOVA for male and female voices with mean F0 and mean energy
Table 2.16 Results of the T test on mean F0 and mean energy for the male voices
Table 2.17 Results of the T test on mean F0 and mean energy for the female voices
Table 3.1 Test sentences synthesized in a narrative voice (neutral emotion) for good-quality Vietnamese synthesis

REFERENCES

VIETNAMESE REFERENCES

[1] Bạch Hưng Nguyên; Nguyễn Tiến Dũng (2003) "Mô hình Fujisaki áp dụng phân tích điệu tiếng Việt," in National Informatics Conference, Thai Nguyen.
[2] Châu, Hoàng Thị (2009) Phương ngữ học tiếng Việt, NXB Đại học Quốc gia Hà Nội.
[3] Chừ, Mai Ngọc; Nghiệu, Vũ Đức; Phiến, Hoàng Trọng (2008) Cơ sở ngôn ngữ học tiếng Việt, NXB Giáo dục.
[4] Lên, Bùi Tiến (2001) Luận văn Thạc sỹ: Xây dựng hệ tổng hợp tiếng Việt dựa luật, Hochiminh: Đại học Khoa học tự nhiên, Đại học Quốc gia Thành phố Hồ Chí Minh.
[5] Lưỡng, Đinh Đồng (2009) Luận văn Thạc sỹ: Tổng hợp tiếng Việt chất lượng tốt, Hà Nội: Trường Đại học Bách khoa Hà Nội.
[6] Minh, Lê Hồng (2003) "Một số kết nghiên cứu phát triển hệ phần mềm chuyển văn thành tiếng nói cho tiếng Việt tổng hợp formant," Kỷ yếu Hội thảo Khoa học Quốc gia lần
thứ - Nghiên cứu Phát triển Ứng dụng Công nghệ Thông tin Truyền thông (ICT.rda), Hà Nội.
[7] Nhan, Nguyễn Tôn; Hẳn, Phú Văn (2013) Từ điển tiếng Việt, Hanoi: NXB Từ điển Bách Khoa.
[8] Quân, Vũ Hải; Nam, Cao Xuân (2009) "Tổng hợp tiếng nói tiếng Việt theo phương pháp ghép nối cụm từ," Các công trình nghiên cứu, phát triển ứng dụng CNTT-TT, Tạp chí CNTT TT, tập V-1, số 1, pp. 70-76.
[9] Thuật, Đoàn Thiện (2003) Ngữ âm tiếng Việt - tái lần 2, NXB Quốc gia Việt Nam.
[10] Tuấn, Trịnh Anh (2000) "Một số phương pháp nâng cao chất lượng hệ thống tổng hợp," Tạp chí Bưu chính Viễn thông, tập 3, pp. 19-23.

ENGLISH REFERENCES

[11] Ayadi, Moataz El; Kamel, Mohamed S.; Karray, Fakhri (2011) "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572-587.
[12] Barra, R.; Montero, J.M.; Macias-Guarasa, J.; Gutierrez-Arriola, J.; Ferreiros, J.; Pardo, J.M. (2007) "On the limitations of voice conversion techniques in emotion identification," in 8th Annual Conference of the International Speech Communication Association, Antwerp.
[13] Black, Alan W. (2003) "Unit selection and emotional speech," in 8th European Conference on Speech Communication and Technology, Geneva.
[14] Black, Alan W.; Campbell, Nick (1995) "Optimising selection of units from speech databases for concatenative synthesis," in Fourth European Conference on Speech Communication and Technology, Madrid.
[15] Boersma, Paul; Weenink, David (2017) 27 10 2017 [Online] Available: www.praat.org
[16] Bozkurt, Baris; Dutoit, Thierry; Prudon, Romain; D'Alessandro, Christophe; Vincent (2002) "Improving quality of mbrola synthesis for non-uniform units synthesis," in IEEE Workshop on Speech Synthesis, Santa Monica.
[17] Burk, Phil; Polansky, Larry; Repetto, Douglas; Roberts, Mary; Rockmore, Dan (2011) Music and Computers, Spring.
[18] Bulut, Murtaza; Narayanan, Shrikanth S.; Syrdal, Ann K. (2002) "Expressive speech synthesis using a concatenative synthesizer," in 7th International Conference on Spoken Language Processing, Denver.
[19] Burkhardt, Felix; Sendlmeier, Walter F. (2000) "Verification of acoustical correlates of emotional speech using formant-synthesis," in ITRW on Speech and Emotion, Newcastle.
[20] Cahn, Janet (1990) "The generation of affect in synthesized speech," Journal of American Voice Input/Output Society, vol. 8, pp. 1-19.
[21] Cahn, Janet Elizabeth (1989) Master Thesis: Generating expression in synthesized speech, Massachusetts: Massachusetts Institute of Technology.
[22] Calliope (1989) La parole et son traitement automatique, Masson Paris.
[23] Charonnat, Laure; Vidal, Gaëlle; Boeffard, Olivier (2008) "Automatic phone segmentation of expressive speech," in Language Resources and Evaluation Conference, Marrakech.
[24] Charpentier, F.; Stella, M. (1986) "Diphone synthesis using an overlap-add technique for speech waveforms concatenation," in IEEE International Conference of Acoustics, Speech, and Signal Processing, ICASSP'86, Tokyo.
[25] Chomphan, Suphattharachai; Kobayashi, Takao (2008) "Tone correctness improvement in speaker dependent HMM-based Thai speech synthesis," Speech Communication, vol. 50, no. 5, pp. 392-404.
[26] Clark, Robert A. J.; Richmond, Korin; King, Simon (2007) "Multisyn: Open-domain unit selection for the Festival speech synthesis system," Speech Communication, vol. 49, no. 4, pp. 317-330.
[27] Cowie, Roddy; Schröder, Marc (2004) "Piecing Together the Emotion Jigsaw," in Machine Learning for Multimodal Interaction (1st International Workshop on Machine Learning for Multimodal Interaction), Martigny, Springer, pp. 305-317.
[28] Dang-Khoa Mac (2012) PhD thesis: Génération de parole expressive dans le cas des langues à tons, MICA, INPG.
[29] Dang-Khoa Mac, Do-Dat Tran (2015) "Modeling Vietnamese Speech Prosody: A Step-by-Step Approach Towards an Expressive Speech Synthesis System," in Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2015 Workshops: BigPMA, VLSP, QIMIE, DAEBH), Ho Chi Minh, Springer, vol. 9441, pp. 273-287.
[30] Dat, Tran Do; Castelli, E.; Loan, Trinh Van; Bac, Le Viet (2004) "Building a large Vietnamese speech database," Journal of science and technology, vol. 46+47, pp. 13-17.
[31] Dat, Tran Do; Castelli, Eric (2010) "Generation of F0 contours for Vietnamese speech synthesis," in Third International Conference on Communications and Electronics 2010, Nhatrang.
[32] Dat, Tran Do; Castelli, Eric; Jean-Francois, Serignat; Loan, Trinh Van; Hung, Le Xuan (2006) "Linear F0 Contour Model for Vietnamese Tones and Vietnamese Syllable Synthesis with TD-PSOLA," in 2nd International Symposium on Tonal Aspects of Languages, La Rochelle.
[33] Deepa P. Gopinath, Sheeba P. S., Achuthsankar S. Nair (2007) "Emotional Analysis for Malayalam Text to Speech Synthesis Systems," in Setit 2007 4th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications, Tunisia.
[34] Dellaert, Frank; Polzin, Thomas; Waibel, Alex (1996) "Recognising emotions in speech," in The Fourth International Conference on Spoken Language Processing, Philadelphia.
[35] Devore, Jay L. (2010) Probability and Statistics for Engineering and the Sciences, 8th Edition, Brooks/Cole Edition.
[36] Dinh, AT; Phan, TS; Vu, TT; Luong, CM (2013) "Vietnamese HMM-based speech synthesis with prosody information," in 8th ISCA Workshop on Speech, Barcelona.
[37] Donovan, R. E.; Woodland, P. C. (1999) "A hidden Markov-model-based trainable speech synthesizer," Computer Speech & Language, vol. 13, no. 3, pp. 223-241.
[38] Dutoit, T.; Pagel, V.; Pierret, N.; Bataille, F.; Vrecken, O. van der (1996) "The MBROLA project: towards a set of high quality speech synthesizers free of use for non commercial purposes," in Fourth International Conference on Spoken Language, ICSLP 96 Proceedings, Philadelphia.
[39] Edgington, Mike (1997) "Investigating the limitations of concatenative synthesis," in Fifth European Conference on Speech Communication and Technology, EUROSPEECH 1997, Rhodes.
[40] Eide, E. (2002) "Preservation, identification, and use of emotion in a text-to-speech system," in IEEE Workshop on Speech Synthesis, Santa Monica.
[41] Eide, E.; Aaron, A.; Bakis, R.; Hamza, W.; Picheny, Michael; Pitrelli, J. (2004) "A corpus-based approach to expressive speech synthesis," in Fifth ISCA ITRW on Speech Synthesis, Pittsburgh.
[42] Frick, Robert W. (1985) "Communicating Emotion: The Role of Prosodic Features," Psychological Bulletin, vol. 97, no. 3, pp. 412-429.
[43] Fujisaki, H.; Nagashima, S. (1969) "A model for the synthesis of pitch contours of connected speech," Annual Report of the Engineering Research Institute, Faculty of Engineering, University of Tokyo, vol. 28, pp. 53-60.
[44] Fujisaki, Hiroya; Wang, Changfu; Ohno, Sumio; Gu, Wentao (2005) "Analysis and synthesis of fundamental frequency contours of Standard Chinese using the command-response model," Speech Communication, vol. 47, no. 1-2, pp. 59-70.
[45] Fujisaki, H.; Hirose, K. (1984) "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," Journal of the Acoustical Society of Japan, vol. 5, no. 4, pp. 233-242.
[46] Gobl, C.; Bennett, E.; Chasaide, Ailbhe Ní (2002) "Expressive synthesis: how crucial is voice quality," in 2002 IEEE Workshop on Speech Synthesis, Santa Monica.
[47] H, Linyu; Yan, Jian; Zuo, Libo; Kui, Liping (2011) "A trainable Vietnamese speech synthesis system based on HMM," in International Conference on Electric Information and Control Engineering, Wuhan.
[48] Hamza, Wael; Bakis, Raimo; Eide, Ellen M.; Picheny, Michael A.; Pitrelli, John F. (2004) "The IBM expressive speech synthesis system," in 8th International Conference on Spoken Language Processing, Jeju Island.
[49] Hansjörg Mixdorff, Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong (2003) "Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese," in 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva.
[50] Heuft, B.; Portele, T.; Rauth, M. (1996) "Emotions in time domain synthesis," in Fourth International Conference on Spoken Language, Philadelphia.
[51] Hofer, Gregor O.; Richmond, Korin; Clark, Robert A. J. (2005) "Informed blending of databases for emotional speech synthesis," in 9th European Conference on Speech Communication and Technology, Lisbon.
[52] Holmes, John; Holmes, Wendy (2001) Speech synthesis and recognition, London & New York: Taylor & Francis.
[53] Hunt, Andrew J.; Black, Alan W. (1996) "Unit selection in a concatenative speech synthesis system using a large speech database," in IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta.
[54] Ian H. Witten; Eibe Frank (2005) Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann Publishers.
[55] Smith III, Julius O. (2011) Physical Audio Signal Processing: for Virtual Musical Instruments and Digital Audio Effects, W3K Publishing.
[56] John Platt (1998) "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines," Microsoft TechReport.
[57] Gray Jr., A. H.; Markel, J. D. (1976) "Distance Measures for Speech Processing," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 5, pp. 380-391.
[58] Karaiskos, Vasilis; King, Simon; Clark, Robert A. J.; Mayo, Catherine (2008) "The Blizzard Challenge 2008," in Blizzard Challenge Workshop 2008, Brisbane.
[59] Kellerman, Henry (1989) Emotion, Theory, Research, and Experience - Vol. 4, New York: Academic Press.
[60] Kempelen, Wolfgang von (1791) Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, J.V. Degen.
[61] Kimmo Pärssinen (2007) Doctor thesis: Multilingual Text-to-Speech System for Mobile Devices: Development and Applications, Tampere University of Technology.
[62] Kominek, John; Black, Alan W. (2003) CMU ARCTIC databases for speech synthesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
[63] Lamel, L.F.; Gauvain, J.L.; Prouts, B.; Bouhier, C.; Boesch, R. (1993) "Generation and Synthesis of Broadcast Messages," in Proc. ESCA-NATO Workshop on Applications of Speech Technology, Lautrach.
[64] Le Thi Xuyen (1989) Thèse de Doctorat: Etude contrastive de l'intonation expressive en français et en vietnamien, Université Paris.
[65] Le, H.M. (2013) VnSpeech, Hanoi.
[66] Le, T.H. (2013) VietVoice, Virtual Voice Inc.
[67] Lee, Chul Min; Yildirim, Serdar; Bulut, Murtaza; Kazemzadeh, Abe; Busso, Carlos; Deng, Zhigang; Lee, Sungbok; Narayanan, Shrikanth (2004) "Emotion recognition based on phoneme classes," in 8th International Conference on Spoken Language Processing, Jeju Island.
[68] Lee, Ki-Seung (2014) "A unit selection approach for voice transformation," Speech Communication, vol. 60, pp. 30-43.
[69] Lim, Yee Chea; Tan, Tian Swee; Salleh, Sheikh Hussain Shaikh; Kwong, Dandy (2012) "Application of Genetic Algorithm in unit selection for Malay speech synthesis system," Expert Systems with Applications, vol. 39, no. 5, pp. 5376-5383.
[70] Mac, Dang-Khoa; Castelli, Eric; Aubergé, Véronique (2012) "Modeling the Prosody of Vietnamese Attitudes for Expressive Speech Synthesis," in Workshop of Spoken Languages Technologies for Under-resourced Languages, Cape Town.
[71] Manzara, Leonard (2005) "The Tube Resonance Model Speech Synthesizer," The Journal of the Acoustical Society of America, vol. 117, no. 4, pp. 1-11.
[72] McGilloway, Sinéad; Cowie, Roddy; Douglas-Cowie, Ellen; Gielen, Stan; Westerdijk, Machiel; Stroeve, Sybert (2000) "Approaching automatic recognition of emotion from voice: A rough benchmark," in ISCA Workshop on Speech and Emotion, Newcastle.
[73] Michaud, A. (2004) "Final consonants and glottalization: new perspectives from Hanoi Vietnamese," Phonetica, vol. 61, no. 2-3, pp. 119-146.
[74] Minh, Nguyen Huu (2009) Master thesis: Xac dinh khoang ngung giua cac am tiet, cuong va truong cua am tiet cho bo phat am tieng Viet, University of Science, Vietnam National University, Ho
Chi Minh City.
[75] Miwa, H.; Umetsu, T.; Takanishi, A.; Takanobu, H. (2000) "Robot personalization based on the mental dynamics," in RSJ International Conference on Intelligent Robots and Systems, Takamatsu.
[76] Montero, J.M.; Gutiérrez-Arriola, J.; Colás, J.; Enríquez, E.; Pardo, J.M. (1999) "Analysis and modelling of emotional speech in Spanish," in 14th International Congress of Phonetic Sciences, San Francisco.
[77] Mozziconacci, S. J. L.; Hermes, Dik J. (1999) "Role of intonation patterns in conveying emotion in speech," in 14th International Congress of Phonetic Sciences, San Francisco.
[78] Murray, Iain R.; Arnott, John L.; Rohwer, Elizabeth A. (1996) "Emotional stress in synthetic speech: Progress and future directions," Speech Communication, vol. 20, pp. 85-91.
[79] Ngo, Thi Duyen; Bui, The Duy (2012) "A study on prosody of Vietnamese emotional speech," in Fourth International Conference on Knowledge and Systems Engineering, Danang.
[80] Nguyen Quoc Cuong (2002) PhD Thesis: Reconnaissance de la parole en langue Vietnamienne, INP-Grenoble, France.
[81] Nguyen, Dung; Luong, Chi Mai; Vu, Bang Kim; Huy, Ngo Hoang (2004) "Fujisaki Model based F0 contours in Vietnamese TTS," in 8th International Conference on Spoken Language Processing, Jeju Island.
[82] Nguyen, Viet Son; Castelli, Eric; Carré, René (2009) "Vietnamese Final Stop Consonants /p, t, k/ Described in Terms of Formant Transition Slopes," in International Conference on Asian Language Processing, Singapore.
[83] Nose, Takashi; Yamagishi, Junichi; Masuko, Takashi; Kobayashi, Takao (2007) "A style control technique for HMM-based expressive speech synthesis," IEICE Transactions on Information and Systems, vol. E90-D, no. 9, pp. 1406-1413.
[84] Öhman, S. E. G. (1967) "Word and sentence intonation: A quantitative model," STL-QPSR, vol. 8, no. 2-3, pp. 20-54.
[85] O'Shaughnessy, D. (1988) "Linear predictive coding," IEEE Potentials, vol. 7, no. 1, pp. 29-32.
[86] Pätzold, M. (1991) Master thesis: Nachbildung von Intonationskonturen mit dem Modell von Fujisaki. Implementierung des Algorithmus und erste Experimente mit ein- und zweiphrasigen Aussagesätzen, University of Bonn.
[87] Pham, T.N. (2013) Vietnamese Voice.
[88] Phan, ST; Vu, TT; Duong, CT; Luong, MC (2013) "A study in Vietnamese statistical parametric speech synthesis based on HMM," International Journal of Advances in Computer Science and Technology, vol. 2, no. 1, pp. 1-6.
[89] Phan, ST; Vu, TT; Luong, MC (2013) "Extracting MFCC, F0 feature in Vietnamese HMM-based speech synthesis," International Journal of Electronics and Computer Science Engineering, vol. 2, no. 1, pp. 46-52.
[90] Phan, TS; Duong, TC; Dinh, AT; Vu, TT; Luong, CM (2013) "Improvement of naturalness for an HMM-based Vietnamese speech synthesis using the prosodic information," in Computing and Communication Technologies, Research, Innovation, and Vision, Hanoi.
[91] Pitrelli, J. F.; Bakis, R.; Eide, E. M.; Fernandez, R.; Hamza, W.; Picheny, M. A. (2006) "The IBM expressive text-to-speech synthesis system for American English," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1099-1108.
[92] Pittermann, Johannes; Pittermann, Angela; Minker, Wolfgang (2010) Handling Emotions in Human-Computer Dialogues, Springer.
[93] Pucher, Michael; Schabus, Dietmar; Yamagishi, Junichi; Neubarth, Friedrich; Strom, Volker (2010) "Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis," Speech Communication, vol. 52, no. 2, pp. 164-179.
[94] Quinlan, J. R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
[95] R. C. Streijl, S. Winkler, and D. S. Hands (2006) "Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives," Multimedia Systems, vol. 22, no. 2, pp. 213-227.
[96] Rank, Erhard; Pirker, Hannes (1998) "Generating emotional speech with a concatenative synthesizer," in 5th International Conference on Spoken Language Processing, Sydney.
[97] Scherer, Klaus R.; Ladd, D. Robert; Silverman, Kim E. A. (1984) "Vocal cues to speaker affect: Testing two models," The Journal of the Acoustical Society of America, vol. 76, pp. 1346-1356.
[98] Scherer, Klaus R. (2003) "Vocal communication of emotion: A review of research paradigms," Speech Communication, vol. 40, pp. 227-256.
[99] Schröder, Marc (2001) "Emotional speech synthesis: a review," in 7th European Conference on Speech Communication and Technology, Aalborg.
[100] Schubiger, Maria (1958) "English intonation: its form and function," Language, vol. 36, no. 4, pp. 544-548.
[101] Shuichi Narusawa, Nobuaki Minematsu, Keikichi Hirose and Hiroya Fujisaki (2000) "A method for automatic extraction of parameters of the fundamental frequency contour of speech," in 6th International Conference of Spoken Language Processing - Interspeech 2000, Beijing.
[102] Strom, Volker; King, Simon (2008) "Investigating Festival's target cost function using perceptual experiments," in 9th Annual Conference of the International Speech Communication Association, Brisbane.
[103] Subhashree, R.; Rathna, G. N. (2016) "Speech Emotion Recognition: Performance Analysis based on Fused Algorithms and GMM Modelling," Indian Journal of Science and Technology, vol. 9, no. 11, pp. 1-8.
[104] Syrdal, Ann K.; Wightman, Colin W.; Conkie, Alistair; Stylianou, Yannis; Beutnagel, Mark; Schroeter, Juergen; Strom, Volker; Lee, Ki-Seung; Makashay, Matthew J. (2000) "Corpus-based techniques in the AT&T NEXTGEN synthesis system," in 6th International Conference on Spoken Language Processing, Beijing.
[105] Tachibana, Makoto; Yamagishi, Junichi; Masuko, Takashi; Kobayashi, Takao (2005) "Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing," IEICE Transactions on Information and Systems, vol. E88-D, no. 11, pp. 2484-2491.
[106] Tachibana, Makoto; Yamagishi, Junichi; Masuko, Takashi; Kobayashi, Takao (2006) "A style adaptation technique for speech synthesis using HSMM and suprasegmental features," IEICE Transactions on Information and Systems, vol. E89-D, no. 3, pp. 1092-1099.
[107] Thao, Do Van; Dat, Tran Do; Trang, Nguyen Thi Thu (2011) "Non-uniform unit selection in Vietnamese Speech Synthesis," in 2nd SoICT 2011, Hanoi.
[108] Trang, Nguyen Thi Thu (2015) PhD Thesis: HMM-based Vietnamese Text-To-Speech: Prosodic phrasing modeling, corpus design, system design and evaluation, Paris: Université Paris Sud.
[109] Trang, Nguyen Thi Thu; D'Alessandro, Christophe; Rilliard, Albert; Dat, Tran Do (2013) "HMM-based TTS for Hanoi Vietnamese: issues in design and evaluation," in INTERSPEECH 2013, Lyon.
[110] Trinh, Quoc Son (2015) "HMM-based Vietnamese speech synthesis," in 14th International Conference Computer and Information Science, Las Vegas.
[111] Viet, Hoang Anh; Manh, Ngo Van; Bang, Ban Ha; Thang, Huynh Quyet (2012) "A real-time model based Support Vector Machine for emotion recognition through EEG," in International Conference on Control, Automation and Information Sciences, Ho Chi Minh.
[112] Vroomen, Jean; Collier, René; Mozziconacci, Sylvie (1993) "Duration and intonation in emotional speech," in 3rd European Conference on Speech Communication and Technology, Berlin.
[113] Vu, Thang Tat; Luong, Mai Chi; Satoshi, Nakamura (2009) "An HMM-based Vietnamese Speech Synthesis System," in Oriental COCOSDA International Conference on Speech Database and Assessments, Urumqi.
[114] Vutuan, La; Cheng-Wei, Huang; Cheng, Ha; Li, Zhao (2013) "Emotional Feature Analysis and Recognition from Vietnamese Speech," Journal of Signal Processing, China, vol. 29, no. 10, pp. 1423-1432.
[115] Wang, William Yang; Georgila, Kallirroi (2011) "Automatic Detection of Unnatural Word-Level Segments in Unit-Selection Speech Synthesis," in IEEE Workshop on Automatic Speech Recognition & Understanding, Waikoloa.
[116] Williams, Carl E.; Stevens, Kenneth N. (1972) "Emotions and speech: Some acoustical correlates," The Journal of the Acoustical Society of America, vol. 52, no. 4, pp. 1238-1250.
[117] Xia, Xian-Jun; Ling, Zhen-Hua; Jiang, Yuan; Dai, Li-Rong (2014) "HMM-based unit selection speech synthesis using log likelihood ratios derived from perceptual data," Speech Communication, vol. 63-64, pp. 27-37.
[118] Yamagishi, Junichi; Nose, Takashi; Zen, Heiga; Ling, Zhen-Hua; Toda, Tomoki; Tokuda, Keiichi; King, Simon; Renals, Steve (2009) "A robust speaker-adaptive HMM-based text-to-speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1208-1230.
[119] Yamagishi, Junichi; Zen, Heiga; Wu, Yi-Jian; Toda, Tomoki; Tokuda, Keiichi (2008) "The HTS-2008 system: yet another evaluation of the speaker-adaptive HMM-based speech synthesis system in the 2008 Blizzard Challenge," in Blizzard Challenge 2008, Brisbane.
[120] Zen, Heiga; Toda, Tomoki; Tokuda, Keiichi (2007) "Details of Nitech HMM based speech synthesis system for the Blizzard Challenge," IEICE Transactions on Information and Systems, vol. E90-D, no. 5, pp. 325-333.
[121] Zen, Heiga; Tokuda, Keiichi; Masuko, Takashi; Kobayashi, Takao; Kitamura, Tadashi (2007) "A hidden semi-Markov model-based speech synthesis system," IEICE Transactions on Information and Systems, vol. E90-D, no. 5, pp. 825-834.
[122] Zhang, Julia (2004) Language Generation and Speech Synthesis in Dialogues for Language Learning, Massachusetts: Massachusetts Institute of Technology.
[123] Zhipeng, Jiang; Chengwei, Huang (2015) "High-Order Markov Random Fields and Their Applications in Cross-Language Speech Recognition," Cybernetics and Information Technologies, vol. 15, no. 4, pp. 50-57.

APPENDIX A: LIST OF SOUNDS TO RECORD

================ Final sound units ================
ay ao au am an ang anh ài ày àu àm àn àng ành áy áo áu áp át ác ách ám án ánh ải ảy ảo ảu ảm ản ảng ảnh ại ạy ạo ạu ạp ạt ạc ạch ạm ạn ạng ạnh ãi ãy ão ãu ãm ãn ãng ãnh ăm ăn ăng ằm ằn ằng ắp ắc ắm ắn ắng ẳm ẳn ẳng ặp ặt ặc ặm ặn ặng ẵm ẵn ẵng ây âu âm ân âng ầy ầu ầm ần ầng ấu ấp ất ấc ấm ấn ấng ẩy ẩu ẩm ẩn ẩng ậy ậu ập ật ậc ậm ận ậng ẫy ẫu ẫm ẫn
ẫng eo em en eng èo èm èn èng éo ép ét éc ém én éng ẻo ẻm ẻn ẻng ẻo ẹp ẹt ẹc ẹm ẹn ẹng ẽo ẽm ẽn ẽng êm ên ênh ều ềm ền êng ềnh ếu ếp ết ếch ếm ến dềng ếnh ểu ểm ển ểng ểnh ệu ệp ệt ệch ệm ện ệng ệnh ễu ễm ễn ễng ễnh iu im in inh ìu ìm ìn ình íu íp íc ích ím ín ính ỉu ỉm ỉn ỉnh ịu ịp ịt ịc ịch ịm ịn ịnh ĩu ĩm ĩn ĩnh oi om on ong ịi ịm ịn ịng ói óp ót óc óm ón óng 107 ỏi ỏm ỏn ỏng ọi ọp ọt ọc ọm ọn ọng õi õm õn õng ôi ôm ôn ông ồi ồm ồn ồng ối ốp ốt ốc ốm ốn ống ổi ổm ổn ội ộp ột ộc ộm ộn ộng ỗi ỗm ỗn ỗng ơm ơn ời ờm ờn ới ớp ớt ớm ớn ởi ởm ơn ợi ợp ớt ớm ớn ỡi ỡm ỡn ui uy um un ung ùi ùy ùm ùn ùng úy úp út úc úm ún úng ủi ủy ủm ủn ủng ụi ụy ụp ụt ục ụm ụn ụng ũi ũy ũm ũn ũng ưu ưn ưm ưng ừu ừm ừn ừng ứu ứt ức ứm ứng ửi ửu ửm ửng ựu ựt ực ựm ựng ữu ữm ững uynh uỳnh ýt uýnh uỷnh ỵt uỵnh uỹnh ia ìa ía ỉa ịa ĩa iêu iêm iên iêng iều iềm iền iềng iếu iếp iết iếc iếm iến iếng iểu iểm iển iểng iệu iệp iệt iệc iệm iện iệng iễu iễm iễn iễng ua ùa úa ụa ũa uôi uôm uôn uông uồi uồm uồn uồng uối uốt uốc uốm uốn uống uổi uổm uổn uổng uội uột uộc uộm uộn uộng uỗi uỗm uỗn uỗng ưa ừa ứa ửa ựa ữa ươi ươu ươm ươn ương ười ườm ườn ường ưới ướu ướt ước ướm ướn ướng ưởi ưởng ượi ượu ượt ược ượm ượn ượng ưỡi ưỡm ưỡn ưỡng oa oai oay oan oang oanh ịa ồi ồy ồn ồng ồnh óa oáy oát oác oách oán oáng oánh ỏa oải oảy oản oảng oảnh ọa oại oạy oạt oạc oạch oạn oạng oạnh õa oãi oãy oãn oãng oãnh oăn oăng oằn oằng oắt oắc oắn oắng oẳn oẳng 108 oặt oặc oặn oặng oẵn oẵng oe ịe óe oét ỏe ọe oẹt õe oen oèn oén oẻn oẹn oẽn oong ịong óong ỏong ọong õong c oọc y uân uâng uầy uần uầng uấy uất uấn uấng uẩy uẩn uẩng uậy uật luận uậng uẫy uẫn uẫng uê uề uế uếch uệch uênh uềnh uếnh uểnh uệnh uễnh uể uệ uễ ui ùi ủi ụi ũi uy uynh ùy uỳnh úy uýt uých uýnh ủy uỷu uỷnh ụy uỵu uỵt uỵch uỵnh ũy uỹnh uyên uyền uyết uyến uyển uyệt uyện uyễn uơ uở uya uýa yêu yên yếu yếm yến yểu yểm 109 ... 
Vietnamese. The second part builds the corpus for good-quality Vietnamese synthesis, in preparation for synthesizing Vietnamese with emotional expression. Chapter 3: Synthesizing Vietnamese with emotional expression ... in-depth research on Vietnamese speech synthesis, aiming at a good-quality Vietnamese synthesis system with different voice qualities and emotional expression. This is a new and topical problem for Vietnamese, with potential for application ... [114], but there has been no systematic study of emotional Vietnamese speech synthesis. For the reasons above, the PhD candidate chose the research topic "Vietnamese speech synthesis with different voice qualities and emotional expression" in order to study ...

Pages: 110