Research and Development of a Vietnamese Speech Synthesis System Using Deep Learning

DOCUMENT INFORMATION

Basic information

Pages	65
File size	2.68 MB

Content


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY (TRƯỜNG ĐẠI HỌC BÁCH KHOA HÀ NỘI)

Nguyễn Văn Thịnh

RESEARCH AND DEVELOPMENT OF A VIETNAMESE SPEECH SYNTHESIS SYSTEM USING DEEP LEARNING

Major: Information Systems
MASTER OF SCIENCE THESIS IN INFORMATION SYSTEMS
CLC2017B
SCIENTIFIC ADVISOR: Dr. Mạc Đăng Khoa
Hanoi, 2018

ACKNOWLEDGMENTS

First, I would like to sincerely thank the MICA International Research Institute, which made it possible for me to carry out this thesis. Next, I thank the Viettel Cyberspace Center, where I work, for helping me complete the system presented in this master's thesis.

I sincerely thank Dr. Mạc Đăng Khoa, my teacher and advisor throughout this period, whose guidance allowed me to complete the thesis. I also sincerely thank Mr. Nguyễn Tiến Thành, Ms. Nguyễn Hằng Phương, and everyone at the MICA International Research Institute for their help while I worked on the thesis at the institute.

I would like to express my respectful thanks to Mr. Nguyễn Quốc Bảo and all my colleagues in the voice group at the Viettel Cyberspace Center, to the center's board of directors, and to everyone at the center for their help and support in completing this master's thesis.

Finally, I thank Ms. Đỗ Thị Ngọc Diệp, who has guided me since I was an undergraduate and has supported and helped me up to the completion of this thesis.

Hanoi, 27 March 2018
Nguyễn Văn Thịnh

TABLE OF CONTENTS

ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND TERMS
INTRODUCTION
DECLARATION
CHAPTER 1: OVERVIEW OF SPEECH SYNTHESIS
1.1 Introduction to speech synthesis
1.1.1 Overview of speech synthesis
1.1.2 Natural language processing in speech synthesis
1.1.3 Speech signal synthesis
1.2 Speech synthesis methods
1.2.1 Articulatory synthesis
1.2.2 Formant synthesis
1.2.3 Concatenative synthesis
1.2.4 Statistical parametric synthesis
1.2.5 Hybrid speech synthesis
1.2.6 Speech synthesis based on deep learning (DNN)
1.3 State of development and open problems in Vietnamese speech synthesis
CHAPTER 2: DEEP LEARNING METHODS APPLIED TO SPEECH SYNTHESIS
2.1 Deep learning with artificial neural networks
2.1.1 Basic neural networks
2.1.2 Deep neural networks
2.2 Speech synthesis based on deep learning
2.3 Linguistic feature extraction
2.4 Acoustic model based on deep neural networks
2.5 Vocoder
CHAPTER 3: BUILDING A VIETNAMESE SPEECH SYNTHESIS SYSTEM WITH DEEP LEARNING
3.1 Introduction to the Viettel TTS system
3.2 Overall architecture of the Viettel TTS system
3.3 Building the modules of the speech synthesis system
3.3.1 Input text normalization module
3.3.2 Linguistic feature extraction module
3.3.3 Acoustic feature parameter generation module
3.3.4 Module for synthesizing speech from acoustic features
3.4 Building the training database
3.4.1 Collecting data for the speech synthesis system
3.4.2 Training the system
3.5 Processing the training data to improve output quality
CHAPTER 4: EXPERIMENTAL SETUP AND EVALUATION OF RESULTS
4.1 Experimental installation of the system
4.2 Evaluation of the experimental results
4.2.1 Evaluation of DNN synthesis quality compared with HMM
4.2.2 Evaluation of the results of improving the training database
4.2.3 Comparative evaluation of the system's quality against existing Vietnamese speech synthesis systems
4.2.4 Evaluation of system performance
CONCLUSION
A Summary
B Directions for developing and improving the system
REFERENCES
APPENDICES
Appendix A: Structure of the label representing phoneme context
Appendix B: Scientific publications of the thesis

LIST OF FIGURES

Figure 1: General diagram of a speech synthesis system [9]
Figure 2: Basic structure of cascade formant synthesis [13]
Figure 3: Basic structure of parallel formant synthesis [13]
Figure 4: Hidden Markov model applied to speech synthesis
Figure 5: Training and synthesis process of a speech synthesis system based on hidden Markov models
Figure 6: DNN-based speech synthesis [18]
Figure 7: A perceptron with three inputs [24]
Figure 8: A neural network consisting of many perceptrons [24]
Figure 9: The sigmoid function [24]
Figure 10: The ReLU activation function
Figure 11: Neural network with a hidden layer [24]
Figure 12: Neural network with two hidden layers [24]
Figure 13: Basic architecture of the speech synthesis system
Figure 14: Representation of the linguistic features of text [28]
Figure 15: Linguistic feature information associated with each phoneme [28]
Figure 16: Duration of each state of each phoneme
Figure 17: Feed-forward neural network
Figure 18: Converting feature vectors into binary vectors
Figure 19: Deep neural network applied to speech synthesis [4]
Figure 20: Overview of the WORLD vocoder system [30]
Figure 21: Speech synthesis with the WORLD vocoder
Figure 22: The Viettel TTS speech synthesis system
Figure 23: Architecture of the speech synthesis system
Figure 24: Input text normalization process
Figure 25: Operation of linguistic feature extraction
Figure 26: Operating structure of Genlab
Figure 27: Structure of the feature parameter generation module
Figure 28: Training and synthesis process of a speech synthesis system based on a deep neural network model
Figure 29: Synthesizing speech from acoustic features with the WORLD vocoder
Figure 30: Audio signal before (top) and after (bottom) balancing
Figure 31: Audio signal before (top) and after (bottom) noise filtering
Figure 32: Data distribution after labeling
Figure 33: Screenshot of a test run of the speech synthesis system
Figure 34: Screenshot of a test run of the speech synthesis system
Figure 35: Naturalness evaluation
Figure 36: Intelligibility evaluation
Figure 37: MOS evaluation
Figure 38: Evaluation of system response time
Figure 39: Evaluation of memory usage

LIST OF TABLES

Table 1: Comparative evaluation of HMM and DNN
Table 2: Training data for the speech synthesis system
Table 3: Results comparing DNN and HMM synthesis
Table 4: Comparison of synthesis quality between the system with processed training data (DNN2) and the system with unprocessed training data (DNN1)
Table 5: Information about the listeners who evaluated the speech synthesis system

LIST OF ABBREVIATIONS AND TERMS

Abbreviation	Full term	Meaning
HMM	Hidden Markov Model	Hidden Markov model
DNN	Deep Neural Network	Deep neural network
PSOLA	Pitch Synchronous Overlap and Add	Pitch-synchronous overlap-add technique
TTS	Text To Speech	Synthesis of speech from text
MLSA	Mel Log Spectral Approximation	Approximation of the mel log spectrum
GMM	Gaussian Mixture Model	Gaussian mixture model
VLSP	Vietnamese Language and Speech Processing	Vietnamese language and speech processing
MOS	Mean Opinion Score	Mean opinion score
F0	Fundamental frequency	Fundamental frequency

INTRODUCTION

Speech synthesis is now being researched and developed in many places around the world. Many technologies and methods have been tested and deployed successfully, and some systems have reached a level where their output is hard to distinguish from a human voice. In Vietnam there have also been many research works and products in speech synthesis, notably the studies of the Institute of Information Technology under the Vietnam Academy of Science and Technology ([1], [2]). These studies build their speech synthesis systems on the architecture of the HTS system [3], and the model applied is the hidden Markov model.

Research works and practical speech synthesis systems in Vietnam are currently developed mainly with two methods: concatenative speech synthesis and statistical speech synthesis based on hidden Markov models (HMM). Both methods have been researched and developed for many years, worldwide and in Vietnam, and many products and systems have succeeded with them. However, both still have significant limitations: with HMM the synthesized speech does not sound truly natural, while concatenative synthesis requires a large stored database and gives good quality only in a narrow domain.

Meanwhile, a new speech synthesis technology has begun to develop worldwide, namely speech synthesis based on deep learning. It has shown positive results: the synthesis quality is high and close to natural speech [4].

For these two reasons, this thesis was proposed to experiment with applying deep learning to Vietnamese speech synthesis, with the aim of creating a high-quality speech synthesis system. The thesis focuses on applying speech synthesis technology based on deep neural networks to Vietnamese, so as to obtain a system whose synthesized voice quality is better than that of Vietnamese synthesis systems using other, older technologies. To this end, the author set the following tasks:
- Study the deep-learning-based speech synthesis method and how to apply it
- Build a speech synthesis system based on this technology
- Apply some data preprocessing solutions to improve the quality of the synthesized voice

The thesis was developed while working at the Viettel Cyberspace Center and at the Speech Communication department of the MICA International Research Institute. Thanks to a serious working environment, the guidance of Dr. Mạc Đăng Khoa, and the help of colleagues and of the staff and teachers at the MICA International Research Institute, I gained experience and completed the thesis. The structure of the thesis is as follows:
• CHAPTER 1: OVERVIEW OF SPEECH SYNTHESIS: This chapter gives a general introduction to speech synthesis, the state of research and development of speech synthesis systems, and the speech synthesis methods in common use today
• CHAPTER 2: DEEP LEARNING METHODS APPLIED TO SPEECH SYNTHESIS: This chapter covers deep learning and how it is applied to speech synthesis
• CHAPTER 3: BUILDING A VIETNAMESE SPEECH SYNTHESIS SYSTEM WITH DEEP LEARNING: This chapter covers the architecture of the deep-learning-based Vietnamese speech synthesis system, how each module is built on that architecture, and how data for the system is collected, processed, and filtered
• CHAPTER 4: EXPERIMENTAL SETUP AND EVALUATION OF RESULTS: This chapter covers how the system was installed and tested, and the evaluation of the resulting speech synthesis system
• CONCLUSION: This part concludes the thesis and gives directions for further research and improvement

Memory usage evaluation: the amount of memory the system occupies is measured at the moment the acoustic feature parameters are generated by the DNN model, which is also the moment of highest memory usage. The results are shown in Figure 39. They show that memory usage is not large, only about 1% of the total physical memory of the environment.

CONCLUSION

A Summary

After the whole process of completing this thesis, we have achieved the following results:
- Studied and mastered speech synthesis technology, and successfully built the first Vietnamese speech synthesis system using deep learning
- Analyzed some issues in building the training database for deep-learning-based speech synthesis, and verified the resulting improvement through evaluations

The speech synthesis system developed in this thesis has been deployed at the Viettel Military Industry and Telecoms Group as a module of Viettel's artificial intelligence (AI) platform, and has been integrated into systems such as the Viettel virtual assistant and the automatic customer care system.

In addition, the system was submitted to the speech synthesis competition at the VLSP 2018 workshop (http://vlsp.org.vn/) and won first prize, ahead of the MICA and VAIS teams (the evaluation of all three systems is given in Chapter 4). The report on the system for the VLSP workshop is given in Appendix B.

Furthermore, during the thesis work the author had a paper published and presented at the Regional Conference on Optical character recognition and Natural language processing technologies for ASEAN languages (ONA 2017, http://ona2017.org/). For details of the report for the VLSP 2018 speech synthesis competition and the paper at ONA 2017, see Appendix B.

B Directions for developing and improving the system

The speech synthesis system developed in this thesis achieves relatively good output quality compared with current systems, but some issues still need improvement:
- Response time is still slow
- Voice quality is not yet good when synthesizing the Southern dialect of Vietnamese

The next steps are therefore to keep addressing these weaknesses and to upgrade other capabilities of the system, specifically:
- Improve response time by parallelizing and by removing unnecessary stages
- Add new solutions for the input text normalization problem
- Add dedicated dictionaries for other dialects, such as the Southern and Central dialects, to improve synthesis quality for those dialects

REFERENCES

[1] A.-T. Dinh, T.-S. Phan, T.-T. Vu, and C.-M. Luong, “Vietnamese HMM-based Speech Synthesis with prosody information,” ISCA Speech Synth. Workshop, p. 4, 2013.
[2] T.-S. Phan, T.-C. Duong, A.-T. Dinh, T.-T. Vu, and C.-M. Luong, “Improvement of naturalness for an HMM-based Vietnamese speech synthesis using the prosodic information,” 2013, pp. 276–281.
[3] H. Zen et al., “The HMM-based Speech Synthesis System (HTS) Version 2.0,” p. 6, 2007.
[4] Z. Wu, O. Watts, and S. King, “Merlin: An Open Source Neural Network Speech Synthesis System,” 2016, pp. 202–207.
[5] J. J. Ohala, “Christian Gottlieb Kratzenstein: pioneer in speech synthesis,” Proc. 17th ICPhS, 2011.
[6] D. Suendermann, H. Höge, and A. Black, “Challenges in Speech Synthesis,” in Speech Technology, Huggins and F. Chen, Eds. Boston, MA: Springer US, 2010, pp. 19–32.
[7] P. T. Sơn and P. T. Nghĩa, “Một số vấn đề về tổng hợp tiếng nói tiếng Việt,” p. 5, 2014.
[8] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech Synthesis Based on Hidden Markov Models,” Proc. IEEE, vol. 101, no. 5, pp. 1234–1252, May 2013.
[9] T. T. T. Nguyen, “HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design System Design, and Evaluation,” PhD Thesis, Paris 11, 2015.
[10] Q. Nguyễn Hồng, “Phân tích văn bản cho tổng hợp tiếng nói tiếng Việt,” Đại Học Bách Khoa Hà Nội, 2006.
[11] P. Taylor, Text-to-speech synthesis. Cambridge University Press, 2009.
[12] J. Dang and K. Honda, “Construction and control of a physiological articulatory model,” J. Acoust. Soc. Am., vol. 115, no. 2, pp. 853–870, 2004.
[13] S. Lukose and S. S. Upadhya, “Text to speech
synthesizer-formant synthesis,” 2017, pp. 1–4.
[14] F. Charpentier and M. Stella, “Diphone synthesis using an overlap-add technique for speech waveforms concatenation,” 1986, vol. 11, pp. 2015–2018.
[15] S.-J. Kim, “HMM-based Korean speech synthesizer with two-band mixed excitation model for embedded applications,” PhD dissertation, School of Engineering, Information and Communication University, Korea, 2007.
[16] T. Masuko, “HMM-Based Speech Synthesis and Its Applications,” p. 185, 2002.
[17] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” 1992, pp. 137–140, vol. 1.
[18] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” 2013, pp. 7962–7966.
[19] H. Zen, “Statistical Parametric Speech Synthesis,” Autom. Speech Recognit., p. 93.
[20] D. D. Tran, “Synthèse de la parole à partir du texte en langue vietnamienne,” PhD Thesis, Grenoble INPG, 2007.
[21] T. Van Do, D.-D. Tran, and T.-T. T. Nguyen, “Non-uniform unit selection in Vietnamese speech synthesis,” in Proceedings of the Second Symposium on Information and Communication Technology, 2011, pp. 165–171.
[22] S. Ronanki, M. S. Ribeiro, F. Espic, and O. Watts, “The CSTR entry to the Blizzard Challenge 2017.”
[23] T. Q. Cường, “Nghiên Cứu Áp Dụng Kỹ Thuật Học Sâu (Deep Learning) Cho Bài Toán Nhận Dạng Ký Tự Latinh,” Trường Đại học Hàng hải Việt Nam, Hải Phòng, 2016.
[24] M. A. Nielsen, Neural networks and deep learning. Determination Press, 2015.
[25] Z.-H. Ling et al., “Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends,” IEEE Signal Process. Mag., vol. 32, no. 3, pp. 35–52, May 2015.
[26] N. T. T. Trang, T. D. Dat, A. Rilliard, C. D’Alessandro, and P. T. N. Yen, “Intonation Issues In HMM-Based Speech Synthesis For Vietnamese,” St. Petersburg, p. 7, 2014.
[27] D. Jurafsky and J. H. Martin, Speech and language processing, vol. Pearson, 2014.
[28] C King, “• Prof of Speech
Processing • Director of CSTR • Co-author of Festival • CSTR website: www.cstr.ed.ac.uk • Teaching website: speech.zone,” p. 424.
[29] H. Kawahara, “Straight, exploitation of the other aspect of Vocoder: Perceptually isomorphic decomposition of speech sounds,” Acoust. Sci. Technol., vol. 27, no. 6, pp. 349–353, 2006.
[30] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Trans. Inf. Syst., vol. E99.D, no. 7, pp. 1877–1884, 2016.
[31] F. Espic, C. V. Botinhao, and S. King, “Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis,” 2017, pp. 1383–1387.
[32] M. Morise, H. Kawahara, and H. Katayose, “Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech,” in Audio Engineering Society Conference: 35th International Conference: Audio for Games, 2009.
[33] M. Morise, “CheapTrick, a spectral envelope estimator for high-quality speech synthesis,” Speech Commun., vol. 67, pp. 1–7, Mar. 2015.
[34] M. Morise, “PLATINUM: A method to extract excitation signals for voice synthesis system,” Acoust. Sci. Technol., vol. 33, no. 2, pp. 123–125, 2012.
[35] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
[36] Q. T. Do, Vita: A Toolkit for Vietnamese segmentation, chunking, part of speech tagging and morphological analyzer. 2015.

APPENDICES

Appendix A: Structure of the label representing phoneme context

Structure of each label (one label per line of the label file):

p1^p2-p3+p4=p5@p6_p7/A:a1_a2/B:b1-b2@b3-b4&b5-b6/C:c1+c2/D:d1d2/E:e1+e2/F:f1-f2/G:g1-g2/H:h1=h2@h3=h4/I:i1_i2/J:j1+j2-j3

The fields of the label are explained below:

Field	Description
P1	The phoneme before the previous phoneme
P2	The previous phoneme
P3	The current phoneme
P4	The next phoneme
P5	The phoneme after the next phoneme
P6	Position of the current phoneme in the current word (counted from the front)
P7	Position of the current phoneme in the current word (counted from the back)
A1	Tone of the previous syllable
A2	Number of phonemes in the previous syllable
B1	Tone of the current syllable
B2	Number of phonemes in the current syllable
B3	Position of the syllable in the current word (counted from the front)
B4	Position of the syllable in the current word (counted from the back)
B5	Position of the syllable in the current phrase (counted from the front)
B6	Position of the syllable in the current phrase (counted from the back)
C1	Tone of the next word
C2	Number of phonemes in the next syllable
D1	Part-of-speech tag of the previous word
D2	Number of phonemes in the previous word
E1	Part-of-speech tag of the current word
E2	Number of phonemes in the current word
F1	Part-of-speech tag of the next word
F2	Number of phonemes in the next word
G1	Number of phonemes in the previous phrase
G2	Number of words in the previous phrase
H1	Number of phonemes in the current phrase
H2	Number of words in the current phrase
H3	Position of the current phrase in the sentence (counted from the front)
H4	Position of the current phrase in the sentence (counted from the back)
I1	Number of phonemes in the next phrase
I2	Number of words in the next phrase
J1	Number of phonemes in the sentence
J2	Number of words in the sentence
J3	Number of phrases in the sentence

Appendix B: Scientific publications of the thesis

Van-Thinh NGUYEN, Thi-Ngoc-Diep DO, Dang-Khoa MAC, Eric CASTELLI (2017). Optimizing data transmission on mobile platform for speech translation system. First Regional Conference on OCR and NLP for ASEAN Languages, Phnom Penh, Cambodia.

Van Thinh NGUYEN, Khac Tan PHAM, Huy Kinh PHAN and Quoc Bao NGUYEN (2018). Development of a Vietnamese Speech Synthesis System for VLSP 2018. The Fifth International Workshop on Vietnamese Language and Speech Processing (VLSP 2018), Hanoi, March 2018.

Optimizing data transmission on mobile platform for speech translation system

Van-Thinh NGUYEN, Thi-Ngoc-Diep DO, Dang-Khoa MAC, Eric CASTELLI
International Research Institute MICA, HUST-CNRS/UMI 2954-Grenoble INP, Hanoi, Vietnam
thinhnv1811@gmail.com,
{ngoc-diep.do, dang-khoa.mac, eric.castelli}@mica.edu.vn

Abstract

This paper describes the work of building a speech translation system on a mobile platform using a client-server architecture. To reduce the amount of data transmitted between the mobile device and the server, a specific module is developed to extract only the necessary features of the recorded speech and transmit them over the network. This module is applied to implement an English-Vietnamese translation system. A performance test shows that this solution can reduce the transmitted data by more than 50% while retaining the quality of the system.

Keywords: Speech translation, speech recognition, client-server architecture, acoustic feature extraction, data transmission

1 Introduction

So far, language difference has been a major barrier to communication between people in different countries. Overcoming it is the aim of automatic speech translation systems, which convert a speech signal from one language into another [1]. Nowadays there are many automatic speech translation products supporting many languages, such as Google translation (www.translate.google.com) and Bing translator (https://www.bing.com/translator). Almost all of the systems mentioned above are based on the architecture shown in Figure 1, which has three main modules [2]: automatic speech recognition (ASR), machine translation (MT) and text to speech (TTS). In this system, the ASR module takes a speech signal as input and returns the recognized text in the source language. The recognized text is then translated into another language by the MT module, and the translated text is synthesized into speech in the target language by the TTS module [3].

[Figure 1: Basic architecture of speech translation: Speech → ASR → Text → MT → Text → TTS → Speech]

The three modules above have long been studied using many different techniques. Currently, most of them follow the statistical approach, using machine learning techniques to build the models. These approaches normally require a large amount of training data and a high computation cost. Therefore, most speech translation applications on mobile devices now use a client-server architecture (Figure 2). The main modules (ASR, MT and TTS) are deployed on the server, which provides high computational performance. The client only plays the role of the user interaction interface.

[Figure 2: The common architecture of speech translation system: input speech is sent to the SERVER (ASR, MT, TTS), which returns output speech]

However, in this architecture speech data is transmitted directly between client and server, so a large amount of data is transferred via the network. This adds to the cost of the internet connection, especially when using 3G or 4G connections. In this paper we describe our work of building a speech translation system on a mobile platform using a client-server architecture. To reduce the amount of data transmitted between the mobile device and the server, a specific module is developed to extract only the necessary features of the recorded speech and transmit them over the network. The deployment of feature extraction is presented in Section 2. In Section 3, the deployment of the Vietnamese-English speech translation system is described. The improvement of the system performance with the proposed method is evaluated in Section 4. The paper ends with some discussion and conclusions.

2 Proposal Method

2.1 System architecture proposal

With the aim of reducing the transmitted data, a new architecture of the speech translation system is proposed, as in Figure 3.

[Figure 3: Proposal architecture of Speech translation system]

In this architecture, speech data is recorded by the client. Appropriate speech features are extracted by the Feature Extractor module and sent to the server. The Recognizer module receives the speech features from the client and decodes them into the corresponding text in the source language. This text is translated into target-language text by the SMT module. The translated text is transmitted back to the client and synthesized by the TTS module installed on the client.

2.2 Speech Features Extraction

2.2.1 Speech features for ASR

Speech features, which provide a compact representation of a given speech signal, form a sequence of feature vectors. These feature vectors contain the relevant information for distinguishing between speech sounds [4]. In automatic speech recognition (Figure 4), feature vectors extracted from the speech waveform by the feature extraction module are used as the input for both the training and recognition phases. In the training phase, feature vectors of the training speech and the corresponding transcriptions are used to train the acoustic model. In the recognition phase, given the pronunciation dictionary, the language model and the speech model (acoustic model), the feature vectors of the input speech are decoded into text of the target language.

[Figure 4: Statistical approach to ASR based on hidden Markov model [4]]

The most popular features in both speech recognition and speech synthesis are the Mel Frequency Cepstral Coefficients (MFCCs), as they are less complex to implement and robust under various conditions [5].

2.2.2 Development of feature extraction module on mobile device

To deploy the feature extractor module in this system, input speech data is converted into MFCC vectors of 39 dimensions. Figure 5 presents the necessary steps to generate the feature vectors.

[Figure 5: Steps for features extraction. Blocks: Speech Input, Preemphasis, Window, DFT, Mel Filter Bank, Log, IDFT, Energy, DELTA; outputs: three groups of 12 MFCC plus Energy]

The proposed feature extraction module is deployed with more steps, to improve the discrimination of speech sound units.

[Figure 6: Operation of Feature extraction module. Blocks: Audio Recorder, Feature Extractor, Data Blocker, Speech Classifier, Speech Marker, Preemphasizer, Windower, FFT, MelFilterBank, DCT, LiveCMN, Feature Extraction]

The block diagram of the feature extraction module deployed on the mobile device is shown in Figure 6. Speech data recorded by the Audio Recorder is pushed into a queue, which is the input of the Feature Extractor module. The Feature Extractor module gets speech data from that queue and processes it through the following blocks:
- Data Blocker splits the speech data into packages of equal size
- Speech Classifier classifies each package as speech or non-speech, based on energy
- Speech Marker takes the output of Speech Classifier and marks the begin point and end point of speech, which the next steps use to determine when the speaker starts and stops speaking
- The next six blocks implement the six steps of the MFCC extraction method, with an additional block, LiveCMN. This block uses the cepstral mean normalization method to improve the signal-to-noise ratio and reduce the error rate for clean speech [6]
- The last module packs the features and transfers them to the next component in the system

2.3 The system architecture

The detailed architecture of the whole system is presented in Figure 7. It contains two main parts: client and server.

The client side has three main components: an Audio Recorder to record audio, a Feature Extractor to extract features, and a text-to-speech engine on the mobile device, which generates the speech signal from input text.

The server side contains two main modules, Speech Recognizer and Machine Translation. Automatic speech recognition is deployed on the server in the Recognizer block, which is based on hidden Markov models and recognizes text from speech features. The speech recognizer module is deployed using the Sphinx4 toolkit [7, p. 4]. The statistical machine translation (SMT) module uses the phrase-based machine translation approach, which translates text from one language into another. The machine translation module is deployed on the Moses framework [8]. These modules are managed and controlled by the Module Manager.

[Figure 7: Architecture of Speech Translation System. CLIENT SIDE: INPUT Voice, Audio Recorder, Feature Extractor, Connection Manager, Processor, Speech Listener, Speech Synthesizer, OUTPUT Voice. SERVER: Controller, Module Manager, Recognizer, Translator (MT, via XMLRPC). Data exchanged: MFCC, Text]

The system works as follows. After being recorded by the Audio Recorder, the features of the speech data in the source language are extracted by the Feature Extractor. The Connection Manager transmits these features to the server via a TCP/IP socket. The speech features from the client are received by the Controller and Module Manager on the server and then passed to the Recognizer. The output is the recognized text, which is transferred back to the Module Manager. The text is then transferred to the Translator module via the XMLRPC protocol [9]. The MT module translates its input text from the source language into the target language. The translated text is returned to the client and, in the final step, synthesized into speech in the target language. So, starting from input speech in the source language, the data passes through many steps between client and server to generate speech in the target language.

3 Deployment of Vietnamese-English speech translation system

This section presents the deployment of the Vietnamese-English speech translation system following the proposed architecture above.

3.1 Building models for ASR and MT modules

One of the most important tasks in developing the speech translation system is building the models for the ASR and MT modules. For automatic speech recognition, both the English and the Vietnamese models are trained with the Sphinx toolkit. For the English speech recognition model, the TEDLIUM corpus [10] is used, which has more than 100 hours of recordings (about 33000 words in the dictionary). For the Vietnamese speech recognition model, we trained on the VNSpeechCorpus [11], with more than hours of recording (about 3000 sentences and paragraphs). Table 1 shows the word error rate (WER) of the Vietnamese and English models, calculated on a test corpus whose total duration equals 10% of the training duration.

Table 1: ASR results in percentage of WER
Model	WER
English ASR model	28%
Vietnamese ASR model	11.2%

For machine translation, we built models for both translation directions between English and Vietnamese. These models are built with the Moses toolkit [8] using a parallel corpus of 3.8 million sentence pairs collected from the OPUS corpus [12]. The evaluation of these translation models (in BLEU score) is shown in Table 2.

Table 2: The result of Machine translation models
Direction	BLEU score
English → Vietnamese	46.7
Vietnamese → English	46.61

For the text-to-speech engine on the mobile device, we use two available TTS engines: Google TTS for English and VNTTS [13] for Vietnamese.

4 Evaluation

The objective of the evaluation experiment is to show how much the proposed method improves the performance of the whole system. For testing, two systems are set up:
- The original system, which uses the conventional architecture (as in Figure 2)
- The proposed system, as presented in the sections above

The two systems are installed under the same hardware conditions. The server part is installed on a server (CPU core i5, GB RAM). The client part is deployed on an Android device (1 GB RAM, 1.3 GHz CPU). The test data used to evaluate these systems consists of 30 sentences of different lengths:
- Short sentence: from to syllables
- Medium sentence: to 15 syllables
- Long sentence: more than 15 syllables

These systems and test data are used for both the data transmission evaluation and the system response time evaluation, which are described below.

4.1 Data transmission improvement

The amount of transmitted data is calculated by a specific module, which sums the data transferred between the client side and the server side, from receiving the input speech to getting the output [...] transferring MFCC features via the internet, not only is a large amount of transmitted data reduced, but the cost of internet data transmission is also reduced.

[Figure 8: Data transmission result]

4.2 Whole system respond time

This evaluation measures the system response time, from the moment the input speech ends to the moment the client plays the speech output. This test uses the same hardware and test data as the previous test (data transmission).
signal The result of data transmission is shown in Figure The number of bytes transmitted via network of proposal system are reduced more than 50% in comparison to the original system does with all type of input sentence lengths So, by Figure The average response time The result of system responding time testing is displayed in Figure According to this result, the respond time of proposal system is a little shorter that of original system in short sentences and medium sentences, but with long sentences it takes more time than original system does The responding time with the long sentence is long can be due to the feature extraction module runs long in client side (lower performance) While in the original system, feature extraction module is deployed on server with high performance For this reason, applying the method of transmitting features vector via network may not improve the whole system respond time, especially with the long sentence Conclusion This paper described a development of speech translation system which resolves the issues related to performance and speed The proposal system has decreased more than 50% amount of transmission data, and has a little improvement of system responding time (in case of short and medium input sentences) The future work will aim to complete the system by improving the feature extraction module to reduce the processing time Acknowledgement This work was supported by the Vietnamese national science and technology project: “Research and development automatic translation system from Vietnamese text to Muong speech, apply to unwritten minority languages in Vietnam” (Project code: ĐTĐLCN.20/17) References [1] M Dureja and S Gautam, “Speech-to-Speech Translation: A Review,” Int J Comput Appl., vol 129, no 13, pp 28–30, 2015 [2] M Goyani and N Dave, “Performance Analysis of LPC, PLP and MFCC Parameters in Speech Recognition,” in Proceedings of National Conference on Advance Computing, 2009 [3] Y Zhang, “Survey of current speech 
translation research,” Found Web Httpprojectile Cs Cmu Eduresearchpublictal KsspeechTranslationsst-Surv.-Joy Pdf, 2003 [4] Gajic and Bojana, “Feature Extraction for Automatic Speech Recognition in Noisy Acoustic Environments.” [5] P P Singh and P Rani, “An approach to extract feature using mfcc,” IOSR J Eng., vol 4, no 8, pp 21–25, 2014 [6] A Acero and X Huang, “Augmented cepstral normalization for robust speech recognition,” in Proc of IEEE Automatic Speech Recognition Workshop, 1995, pp 146–147 [7] W Walker et al., Sphinx-4: A flexible open source framework for speech recognition Sun Microsystems, Inc Mountain View, CA, USA, 2004 [8] P Koehn, “Machine Translation System User Manual and Code Guide,” 2011 [9] S St Laurent, E Dumbill, and J Johnston, Programming web services with XML-RPC Sebastopol, Calif.: O’Reilly, 2001 [10] “A Rousseau, P Deléglise, and Y Estève, ‘Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks’, in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), May 2014.” [11] V B Le, D D Tran, E Castelli, L Besacier, and J.-F Serignat, “Spoken and Written Language Resources for Vietnamese.,” in LREC, 2004, vol 4, pp 599–602 [12] J Tiedemann, “Parallel Data, Tools and Interfaces in OPUS,” in Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 2325 [13] T.-T NGUYEN, D.-K MAC, D.-D TRAN, M.-H NGUYEN, E CASTELLI, and V.S NGUYEN, “Indexing syllable dictionary for non-uniform unit selection speech synthesis: Application on Text-to-speech system on Android devices,” Reg Conf Opt Character Recognit Nat Lang Process Technol ASEAN Lang ONA 2017, vol 1st Development of a Vietnamese Speech Synthesis System for VLSP 2018 Van Thinh Nguyen, Khac Tan Pham, Huy Kinh Phan and Quoc Bao Nguyen Viettel CyberSpace Center {thinhnv20, kinhph, tanpk, baonq2}@viettel.com.vn Abstract—This paper describes our deep neural network-based 
speech synthesis system for high-quality Vietnamese speech synthesis. The system takes text as input, extracts linguistic features, and employs a neural network to predict acoustic features, which are then passed to a vocoder to produce the speech waveform.

Index Terms: Speech Synthesis, Deep Neural Network, Vocoder, Text Normalization

I. INTRODUCTION

The fifth International Workshop on Vietnamese Language and Speech Processing (VLSP 2018) organizes, for the first time, shared tasks on Named Entity Recognition, Sentiment Analysis, Speech Recognition and Speech Synthesis for Vietnamese language processing. The goal of this workshop series is to attempt a synthesis of research in Vietnamese language and speech processing and to bring together researchers and professionals working in this domain. In this paper we describe the speech synthesis system with which we participated in the TTS track of the VLSP 2018 evaluation campaign.

II. SYSTEM ARCHITECTURE AND IMPLEMENTATION

A. System Architecture

To improve the naturalness and intelligibility of the speech synthesis system, we propose applying newer technologies for acoustic modeling and waveform generation, namely a deep neural network combined with a new type of vocoder. To achieve this goal, we propose a new architecture for the speech synthesis system, shown in the system figure. The proposed system takes text as input and normalizes it into standard, readable text. Linguistic feature extraction is then applied to extract the text's linguistic features as input for the acoustic model. The acoustic model takes the linguistic features and generates the predicted vocoder parameters. The speech waveform is then generated by the vocoder.

Fig.: The proposed speech synthesis system [1]

B. Front End

1) Text Normalization: Text normalization plays an important role in a text-to-speech (TTS) system. It is the process of deciding how to read non-standard words (NSWs) that cannot be spoken by applying letter-to-sound rules, such as CSGT (cảnh sát giao thông) and keangnam (cang nam). This process largely determines the quality of a TTS system. The module is implemented using regular expressions and an abbreviation dictionary. Regular expressions are a direct and powerful technique for classifying NSWs. We build expressions that describe dates, times, scores, currencies and measurements. The abbreviation dictionary contains foreign proper names and acronyms [2].

2) Linguistic Feature Extraction: The linguistic features, used as input features for the system, are extracted by generating a label file from the linguistic properties of the text (part-of-speech tags, word segmentation and text chunking) and mapping the corresponding information to binary codes defined in a question file. Each piece of information is encoded as a one-hot vector; these vectors are then concatenated to form a single binary vector representing the text.

C. Acoustic Modeling

The acoustic model is based on a deep neural network, specifically a feedforward neural network with multiple layers, the simplest type of network. The architecture of the network is shown in the feedforward network figure. In this network, the input linguistic features are used to predict the output parameters via several layers of hidden units [1]. Each node of the network is called a perceptron, and each perceptron applies a nonlinear function, as follows:

h_t = \mathcal{H}(W^{xh} x_t + b^{h})
y_t = W^{hy} h_t + b^{y}

where \mathcal{H}(\cdot) is the nonlinear activation function of a perceptron (in this system, tanh for each unit) [3], W^{xh} and W^{hy} are the weight matrices, and b^{h} and b^{y} are the bias vectors.

Fig.: The feedforward neural network for acoustic modeling

D. Vocoder

Speech synthesis systems currently use many types of vocoder, most of which are based on the source-filter model [4]. In our system we use WORLD, a vocoder-based speech synthesis system developed to improve the sound quality of real-time speech applications [5]. The WORLD vocoder consists of three analysis algorithms for obtaining three speech parameters: the F0 contour, estimated with DIO [6]; the spectral envelope, estimated with CheapTrick [7]; and the excitation signal, estimated with PLATINUM and used as an aperiodic parameter [8]. It also includes a synthesis algorithm that takes these three parameters as input. With the WORLD vocoder, the speech parameters predicted by the acoustic model for an input text sentence are used to produce the speech waveform.

III. EXPERIMENTAL SETUP AND RESULTS

A. Data Preparation

In this section we describe our effort to collect more than 6.5 hours of high-quality audio for the speech corpus used to train the acoustic model of the speech synthesis system. We first collected 6.5 hours of recordings. Almost all of the data comes from the internet, such as online radio, because we do not have the resources to record audio ourselves. Audio crawled from the internet is noisy, so the next step was to apply a noise filter to reduce the noise. Each recording is very long, and its amplitude varies widely over time. For that reason, we cut the recordings into small audio files, each corresponding to one text sentence, and balanced all these files. Finally, we obtained a corpus of more than 3500 audio files, corresponding to 6.5 hours of high-quality audio.

B. Experimental Setup

To demonstrate how we achieve high-quality speech synthesis, we report the experimental setup for this architecture. We used the speech corpus collected in the previous section: 3150 utterances were used for training, 175 as a development set, and 175 as the evaluation set.

The input features for the neural network, extracted by the front end, consist of 743 features. 734 of these are derived from the linguistic context, including phoneme identity, part of speech, and positional information within a syllable, word, phrase, etc. The remaining 9 features are within-phoneme positional information. The speech acoustic features are extracted by the WORLD vocoder for both training and decoding. Each speech feature vector contains 60-dimensional Mel Cepstral Coefficients (MFCCs), band aperiodicities (BAPs) and the fundamental frequency on a log scale (logF0), at ...-millisecond frame intervals. The deep neural network is configured with ... feedforward hidden layers, and each layer has 1024 hyperbolic tangent units.

C. Results

The objective results of the system are presented in Table I. MCD (mel-cepstral distortion [9]), BAP (distortion of band aperiodicities) and V/UV (voiced/unvoiced error) are quite low, which indicates that the acoustic model is well trained. F0 RMSE is calculated on a linear scale.

TABLE I: THE OBJECTIVE RESULT OF SPEECH SYNTHESIS SYSTEM
System        MCD (dB)    BAP (dB)    F0 RMSE (Hz)    V/UV
DNN system    ...         ...         22.9            6.15

The subjective results are presented in Table II, which compares the deep neural network speech synthesis system with the older system based on hidden Markov models. Both systems were evaluated by native Vietnamese listeners, who rated the naturalness and intelligibility of each system on a five-point scale. The results show that our DNN-based speech synthesis system scores higher than the older HMM-based system.

TABLE II: THE COMPARISON OF SUBJECTIVE RESULTS
System        Average score
DNN system    4.21
HMM system    3.8

IV. CONCLUSION

In this paper we presented our speech synthesis system based on a deep neural network and its improvement over the older system based on hidden Markov models, which have dominated acoustic modeling for the past decade. We hope this system can serve Vietnamese speech synthesis by producing high-quality voice from text. In future work, we want to improve the performance of the system (it still has a long delay when generating audio from text) by applying parallel computing, and to improve its quality by improving the data or changing the neural network architecture.

V. ACKNOWLEDGMENT

This work was supported by Viettel Cyberspace Center, Viettel Group.

REFERENCES

[1] Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network speech synthesis system,” Proc. SSW, Sunnyvale, USA, 2016.
[2] D. A. Tuan, P. T. Lam, and P. D. Hung, “A study of text normalization in vietnamese for text-to-speech system.”
[3] J. Jantzen, “Introduction to perceptron networks,” Technical University of Denmark, Lyngby, Denmark, Technical Report, 1998.
[4] F. Espic, C. Valentini-Botinhao, and S. King, “Direct modelling of magnitude and phase spectra for statistical parametric speech synthesis,” Proc. Interspeech, Stockholm, Sweden, 2017.
[5] M. Morise, F. Yokomori, and K. Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[6] M. Morise, H. Kawahara, and H. Katayose, “Fast and reliable f0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech,” in Audio Engineering Society Conference: 35th International Conference: Audio for Games. Audio Engineering Society, 2009.
[7] M. Morise, “Cheaptrick, a spectral envelope estimator for high-quality speech synthesis,” Speech Communication, vol. 67, pp. 1–7, 2015.
[8] M. Morise, “Platinum: A method to extract excitation signals for voice synthesis system,” Acoustical Science and Technology, vol. 33, no. 2, pp. 123–125, 2012.
[9] J. Kominek, T. Schultz, and A. W. Black, “Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion,” in Spoken Languages Technologies for Under-Resourced Languages, 2008.
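
The feedforward acoustic model described in the VLSP 2018 paper above (tanh hidden units followed by a linear output layer, 743 input features, 1024 units per hidden layer) can be illustrated with a short NumPy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the number of hidden layers and the output dimension are not legible in the text, so N_LAYERS and N_OUT below are placeholder values, and the random initialisation scheme and the function names init_params and forward are hypothetical.

```python
import numpy as np

N_IN = 743       # linguistic input features per frame (from the paper)
N_HIDDEN = 1024  # tanh units per hidden layer (from the paper)
N_LAYERS = 2     # ASSUMPTION: placeholder, the paper's layer count is not legible
N_OUT = 64       # ASSUMPTION: placeholder acoustic feature dimension


def init_params(n_in, n_hidden, n_layers, n_out, seed=0):
    """Create (W, b) pairs for n_layers hidden layers plus one output layer."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # Scaled Gaussian initialisation (an assumption, chosen for simplicity).
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
        b = np.zeros(fan_out)
        params.append((W, b))
    return params


def forward(x, params):
    """Map frames x of shape (T, n_in) to acoustic frames of shape (T, n_out)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W.T + b)   # h_t = H(W^{xh} x_t + b^h), with H = tanh
    W_out, b_out = params[-1]
    return h @ W_out.T + b_out     # y_t = W^{hy} h_t + b^y (linear output)


params = init_params(N_IN, N_HIDDEN, N_LAYERS, N_OUT)
frames = np.zeros((5, N_IN))       # five dummy frames of linguistic features
acoustic = forward(frames, params)
print(acoustic.shape)              # (5, 64)
```

In a real system the weights would be trained to minimise the error between predicted and extracted vocoder parameters, and the predicted frames would then be passed to the vocoder for waveform generation.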

Date posted: 12/02/2021, 15:24

References

[1] A.-T. Dinh, T.-S. Phan, T.-T. Vu, and C.-M. Luong, “Vietnamese HMM-based Speech Synthesis with prosody information,” Th ISCA Speech Synth. Workshop, p. 4, 2013

[2] T.-S. Phan, T.-C. Duong, A.-T. Dinh, T.-T. Vu, and C.-M. Luong, “Improvement of naturalness for an HMM-based Vietnamese speech synthesis using the prosodic information,” 2013, pp. 276–281

[3] H. Zen et al., “The HMM-based Speech Synthesis System (HTS) Version 2.0,” p. 6, 2007

[4] Z. Wu, O. Watts, and S. King, “Merlin: An Open Source Neural Network Speech Synthesis System,” 2016, pp. 202–207

[5] J. J. Ohala, “Christian Gottlieb Kratzenstein: pioneer in speech synthesis,” Proc 17th ICPhS, 2011

[6] D. Suendermann, H. Höge, and A. Black, “Challenges in Speech Synthesis,” in Speech Technology, Huggins and F. Chen, Eds. Boston, MA: Springer US, 2010, pp. 19–32

[7] P. T. Sơn and P. T. Nghĩa, “Một số vấn đề về tổng hợp tiếng nói tiếng Việt,” p. 5, 2014

[8] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech Synthesis Based on Hidden Markov Models,” Proc. IEEE, vol. 101, no. 5, pp. 1234–1252, May 2013

[9] T. T. T. Nguyen, “HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design System Design, and Evaluation,” PhD Thesis, Paris 11, 2015

[10] Q. Nguyễn Hồng, “Phân tích văn bản cho tổng hợp tiếng nói tiếng Việt,” Đại Học Bách Khoa Hà Nội, 2006

[11] P. Taylor, Text-to-speech synthesis. Cambridge university press, 2009

[12] J. Dang and K. Honda, “Construction and control of a physiological articulatory model,” J. Acoust. Soc. Am., vol. 115, no. 2, pp. 853–870, 2004

[13] S. Lukose and S. S. Upadhya, “Text to speech synthesizer-formant synthesis,” 2017, pp. 1–4

[14] F. Charpentier and M. Stella, “Diphone synthesis using an overlap-add technique for speech waveforms concatenation,” 1986, vol. 11, pp. 2015–2018

[15] S.-J. Kim, “HMM-based Korean speech synthesizer with two-band mixed excitation model for embedded applications,” PhD Thesis, Ph. D. dissertation, School of Engineering, Information and Communication University, Korea, 2007

[16] T. Masuko, “HMM-Based Speech Synthesis and Its Applications,” p. 185, 2002

[17] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” 1992, pp. 137–140 vol.1

[18] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” 2013, pp. 7962–7966

[19] H. Zen, “Statistical Parametric Speech Synthesis,” Autom. Speech Recognit., p. 93

[20] D. D. Tran, “Synthèse de la parole à partir du texte en langue vietnamienne,” PhD Thesis, Grenoble INPG, 2007
