Research and design of an application for converting speech to sign language

MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

MASTER'S THESIS
BÙI ĐỨC VŨ
RESEARCH AND DESIGN OF AN APPLICATION FOR CONVERTING SPEECH TO SIGN LANGUAGE
MAJOR: ELECTRONICS ENGINEERING
Scientific supervisor: Assoc. Prof. Dr. TRƯƠNG NGỌC SƠN
Ho Chi Minh City, 11/2022 (cover page); Ho Chi Minh City, May 2021 (title page)

SCIENTIFIC CURRICULUM VITAE
(For doctoral candidates and master's students)

I. PERSONAL DETAILS
Full name: Bùi Đức Vũ
Gender: Male
Date of birth: 10/02/1992
Place of birth: Bạc Liêu
Home town: Nam Định
Ethnicity: Kinh
Home or contact address: 1/1 Điện Biên Phủ, Khu Phố 5, Thị Trấn Trảng Bom, Huyện Trảng Bom, Tỉnh Đồng Nai
Contact phone: 0909147663
E-mail: bdv24h@gmail.com

II. EDUCATION
High school:
Mode of study: full-time. Period: 09/2007 to 06/2010.
Place of study (school, city): Trường THPT Long Phước, Xã Long Phước, Huyện Long Thành, Tỉnh Đồng Nai
University:
Mode of study: full-time. Period: 10/2010 to 09/2015.
Place of study (school, city): Ho Chi Minh City University of Technology and Education, Thủ Đức City
Major: Electrical and Electronics Engineering
Graduation project title: Design of an application for recognizing a person's identity
Date and place of project defense: 12/2014, Ho Chi Minh City University of Technology and Education
Supervisor: Assoc. Prof. Dr. Lê Mỹ Hà
Master's degree:
Mode of study: full-time. Period: from 04/2019 to (left blank).
Place of study (school, city): Ho Chi Minh City University of Technology and Education, Thủ Đức City

[...]

No.  Sentence                                                                  (four noise-range columns)
4    Năm 2021 (Year 2021)                                                      √  √  √  √
5    Xin chào (Hello)                                                          √  √  √  √
6    Tôi có áo màu xanh (I have a blue shirt)                                  √  √  √  √
7    Nhà bạn có người (People in your household)                               √  √  √  √
8    Bạn ăn cơm chưa (Have you eaten yet)                                      √  √  √  √
9    Ở biển có cá tôm cua (In the sea there are fish, shrimp and crabs)        √  √  √  √
10   Mình làm bạn không (Shall we be friends)                                  √  √  √  √
Rate of sentences heard correctly                                              100%  100%  100%  100%

Evaluation of Table 5.1: In the noise ranges where the generated noise fluctuates little in level and the speech is louder than the noise, the system recognizes speech quickly and accurately: at most [...] seconds after the speaker stops, the system returns the text corresponding to the recorded sentence. Only in the (86 – 88 dB) noise range does the time to display the text on screen after the speaker stops reach up to 15 seconds for some sentences. This is because the noise intensity is then nearly equal to the speech intensity, so Google's Speech to Text API needs more time to process and to collect additional samples.

Table 5.2: Survey of the system with 10 sentences under increasing noise intensity with large fluctuations in noise level

No.  Sentence                                                                  58 – 72 dB  66 – 86 dB  92 – 108 dB
1    Ba mươi ngàn đồng (Thirty thousand dong)                                  √           √           X
2    Tên ký hiệu bạn (Your sign name)                                          √           √           X
3    Hôm thứ bảy (Saturday)                                                    √           √           X
4    Năm 2021 (Year 2021)                                                      √           √           X
5    Xin chào (Hello)                                                          √           √           X
6    Tôi có áo màu xanh (I have a blue shirt)                                  √           √           X
7    Nhà bạn có người (People in your household)                               √           √           X
8    Bạn ăn cơm chưa (Have you eaten yet)                                      √           √           X
9    Ở biển có cá tôm cua (In the sea there are fish, shrimp and crabs)        √           X           X
10   Mình làm bạn không (Shall we be friends)                                  √           √           X
Rate of sentences heard correctly                                              100%        90%         0%

Evaluation of Table 5.2: When the noise level varies widely from one trial to the next (the difference between the largest and smallest values exceeds 10 dB, as the ranges in the table show), the system's noise calibration is less effective, because at the moment the ambient noise is sampled its intensity is random and too unstable to determine where the threshold lies.
In the (58 – 72 dB) range, the system recognizes 100% of the 10 test sentences, but the drawback is that the delay between the end of speech and the text appearing on screen is long (up to 10 seconds).
In the (66 – 86 dB) range, this delay reaches up to 30 seconds and one sentence is recognized with words missing, giving a recognition rate of 90%.
In the (92 – 108 dB) range, no sentence can be recognized: the noise is then louder than the recorded speech, so the system cannot tell which part is the speech to be recorded and which is noise. Moreover, automatic noise calibration in the (92 – 108 dB) range yields a noise threshold higher than the intensity of the recorded speech.

5.2 Future development

The direction for developing this thesis is to build a system that is more compact and cheaper than the current version (by testing on lower-end Raspberry versions to find the one that best balances processing speed and cost), and to extend the sign language video database so that it becomes richer.

CHAPTER 6: CONCLUSION

The system focuses on common sentences, about 200 in total. Although the database is not large, it meets everyday communication needs; processing is fast (under 0.1 seconds, compared with 0.5 seconds for a deep learning network, see Table 3.4) and, unlike systems based on deep learning networks, no training is required.

In this study, the author has presented a simple application that converts speech to sign language for deaf people in Vietnam at a cost of about 4,000,000 VND. The application can help supermarket staff, grocery store clerks and office staff to assist deaf people when they shop and use identity documents, and it makes it easier for deaf people to receive information from hearing people.

Google's Speech to Text API converts speech into the corresponding text with high accuracy and works well in environments with background noise; its drawback is that it requires a network connection during use. The sign language gestures displayed for a spoken input are selected from predefined sentences using the word-based Levenshtein Distance technique, which reduces the vocabulary the application has to handle and speeds up processing.

The system recognizes speech well when the ambient noise is at the 80 dB level and fluctuates little in level (Table 5.1), as in mini supermarkets and offices. In environments with many kinds of noise at different intensities (Table 5.2), such as streets with heavy traffic, construction sites, or metal welding, cutting and machining workshops, the system's performance decreases. The system runs stably and continuously.

SCIENTIFIC PAPER

... speech converted to sign language - Design a low-cost, compact and stably running application for converting speech to sign language on a Raspberry Pi board with a sign language display of ... inches ...
... speech-to-text conversion technology - Study the principles and operation of sign language - Study the application of machine learning and comparison and search algorithms, and design the speech recognition system ...
... cost, existing conversion applications and devices to support communication with deaf people. Related studies on speech recognition, speech-to-text conversion, text-to-speech conversion ...
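The conclusion states that sign videos are selected by comparing the recognized text against predefined sentences with a word-based Levenshtein Distance. The thesis excerpt does not show that code, so the following is only a minimal sketch of the general technique under stated assumptions: the function names `word_levenshtein` and `closest_sentence`, the lower-casing and whitespace tokenization, and the three-sentence list `PREDEFINED` are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple


def word_levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two word sequences.

    Inserting, deleting or substituting one whole word costs 1.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(
                prev[j] + 1,         # delete word_a
                curr[j - 1] + 1,     # insert word_b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def closest_sentence(transcript: str, sentences: List[str]) -> Tuple[str, int]:
    """Return the predefined sentence nearest to the transcript, and its distance."""
    words = transcript.lower().split()
    scored = [(word_levenshtein(words, s.lower().split()), s) for s in sentences]
    distance, best = min(scored, key=lambda pair: pair[0])
    return best, distance


# Hypothetical mini database; the thesis reports about 200 sentences.
PREDEFINED = ["xin chào", "bạn ăn cơm chưa", "tôi có áo màu xanh"]

match, dist = closest_sentence("bạn đã ăn cơm chưa", PREDEFINED)
print(match, dist)
```

Comparing whole words rather than characters keeps the distance table small, since its size depends on the number of words in each sentence. A real system would probably also reject matches whose distance is too large; any such cutoff would be a further assumption, as the excerpt does not give one.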
