
Research and develop solutions to traffic data collection based on voice techniques


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

NGUYỄN THỊ TY

RESEARCH AND DEVELOP SOLUTIONS TO TRAFFIC DATA COLLECTION BASED ON VOICE TECHNIQUES

Major: Computer Science
Major code: 8480101

MASTER THESIS

HO CHI MINH CITY, July 2023

THIS THESIS IS COMPLETED AT HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Assoc. Prof. Trần Minh Quang
Examiner 1: Assoc. Prof. Nguyễn Văn Vũ
Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng

This master's thesis was defended at Ho Chi Minh City University of Technology, VNU-HCM, on July 11, 2023.

The board of the Master's Thesis Defense Council includes (please write down the full name and academic rank of each member of the Master's Thesis Defense Council):
Chairman: Assoc. Prof. Lê Hồng Trang
Secretary: Dr. Phan Trọng Nhân
Examiner 1: Assoc. Prof. Nguyễn Văn Vũ
Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng
Commissioner: Assoc. Prof. Trần Minh Quang

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis is corrected (if any).

CHAIRMAN OF THESIS COMMITTEE
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom - Happiness

THE TASK SHEET OF MASTER'S THESIS

Full name: NGUYỄN THỊ TY
Student code: 2171072
Date of birth: 22/11/1996
Place of birth: Binh Dinh Province
Major: Computer Science
Major code: 8480101

I. THESIS TITLE: Research and develop solutions to traffic data collection based on voice techniques (Nghiên cứu phát triển giải pháp thu thập dữ liệu giao thông dựa trên kỹ thuật giọng nói)

II. TASKS AND CONTENTS:

• Task 1: Traffic Data Collection and Processing. The first task involves collecting comprehensive traffic data. Extensive research will be conducted to identify reliable data sources, followed by the implementation of appropriate data collection techniques. Subsequently, experiments will be carried out to determine the most effective data processing methods. The aim is to enhance data quality and optimize processing efficiency for further analysis.

• Task 2: Research and Experimentation for Automatic Speech Recognition Model Development. In this phase, the focus will be on researching and experimenting with various architectures to develop high-performance automatic speech recognition models. Different techniques will be explored to achieve accurate speech-to-text conversion. The goal is to identify the best-performing model that meets the project's requirements.

• Task 3: Automatic Speech Recognition Model Evaluation and Future Work. Once the automatic speech recognition models are developed, a comprehensive evaluation process will be undertaken. The achieved results will be analyzed using appropriate metrics and techniques to assess their performance. Strengths and weaknesses of each model will be identified. Based on this analysis, recommendations for future work will be provided, outlining potential enhancements or modifications to the automatic speech recognition models.

III. THESIS START DAY: 06/02/2023
IV. THESIS COMPLETION DAY: 09/06/2023
V. SUPERVISOR: Assoc. Prof. TRẦN MINH QUANG

Ho Chi Minh City, June 9, 2023
SUPERVISOR (Full name and signature)
CHAIR OF PROGRAM COMMITTEE (Full name and signature)
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING (Full name and signature)

ACKNOWLEDGMENTS

I would like to extend my sincere gratitude to the individuals who have provided invaluable support and assistance throughout my research journey.

I would like to express my formal appreciation to Assoc. Prof. Trần Minh Quang for his exceptional guidance, expertise, and unwavering support. His mentorship has been instrumental in helping me navigate the necessary steps to complete this thesis. Whenever I encountered difficulties or felt lost, Assoc. Prof. Quang provided invaluable advice that steered me back in the correct direction. His suggestion to process the data to enhance its quality was a significant contribution to my research. Furthermore, his assistance in establishing contact with esteemed researchers working on topics similar to mine, and in facilitating connections with individuals who could provide server support for training large models such as automatic speech recognition models, has been immensely valuable.

I would like to express my profound gratitude to the esteemed researchers, Mr. Nguyễn Gia Huy and Mr. Nguyễn Tiến Thành, for their generous contributions in sharing their profound insights and knowledge. Their willingness to address my inquiries regarding the Urban Traffic Estimation System, the collected data, and existing issues has significantly enriched my comprehension of the subject matter.

Furthermore, I am sincerely thankful to my sisters, Ms. Nguyễn Thị Nghĩa and Ms. Nguyễn Thị Hiển, as well as Lương Duy Hưng and Vũ Anh Nhi, for their invaluable support in meticulously creating precise transcripts for the audio files. I would also like to express my deep appreciation to Mr. Tăng Quốc Thái for his diligent efforts in meticulously collecting and securely storing the traffic reports from VOH 95.6 MHz.

Additionally, I am profoundly grateful to Mr. Mai Tấn Hà, who graciously provided me with access to a server for the training of automatic speech recognition models. His generosity and support have been instrumental in enabling the successful execution of the model training process. I would also like to extend my formal gratitude to Dr. Lê Thành Sách and Mr. Nguyễn Hoàng Minh from the Data Science Laboratory at Ho Chi Minh City University of Technology (HCMUT) for their kind approval in granting me the opportunity to utilize an independent server for automatic speech recognition model training. Their trust and support from the Data Science Laboratory have been pivotal in facilitating the smooth progress of my research.

In addition, I am sincerely thankful for the invaluable support rendered by my friends, Mr. Nguyễn Tấn Sang and Mr. Huỳnh Ngọc Thiện, in working with the server that has limited permissions. Their expertise and assistance have been indispensable in effectively navigating the constraints imposed by the server limitations.

Lastly, I would like to express my heartfelt gratitude to my boss, co-workers, friends, and family for their unwavering emotional support and understanding during the challenging times that I encountered throughout this research endeavor. Their encouragement and belief in my abilities have been instrumental in my success.

Once again, I am deeply grateful to all of the individuals mentioned above for their significant contributions and support, without which this thesis would not have been possible.

ABSTRACT

This thesis addresses two fundamental challenges within the domain of the current intelligent traffic system, specifically the Urban
Traffic Estimation (UTraffic) System. The first challenge pertains to the insufficiency of data that meets the requisite standards for training the automatic speech recognition (ASR) model that will be deployed in the UTraffic system. The current dataset predominantly consists of synthesized data, resulting in a bias towards recognizing synthesized traffic speech reports while struggling to accurately transcribe real-life traffic speech reports imported by UTraffic users. The second challenge involves the accuracy of the ASR model deployed in the current UTraffic system, particularly in transcribing real-life traffic speech reports into text.

To address these challenges, this research proposes several approaches. Firstly, an alternative traffic data source is identified to reduce the reliance on synthesized data and mitigate the bias. Secondly, a pipeline incorporating audio processing techniques such as sampling rate conversion and speech enhancement is designed to effectively process the dataset, with the ultimate objective of improving ASR model performance. Thirdly, advanced and suitable ASR architectures are experimented with using the processed dataset to identify the most optimal model for deployment within the UTraffic system.

Significant achievements have been obtained through this research. Firstly, a new dataset of superior quality compared to the previous one has been developed. Continuous data collection from the alternative traffic data source can further enhance this dataset, making it a valuable resource for future research endeavors aiming to improve the ASR model deployed in the UTraffic system. Additionally, notable progress has been made in improving the accuracy of the ASR model compared to the results achieved by the current architecture of the UTraffic system's ASR model.

TÓM TẮT LUẬN VĂN

Luận văn này giải quyết hai thách thức cơ bản trong lĩnh vực hệ thống giao thông thông minh hiện tại, cụ thể là Hệ Thống Dự Báo Tình Trạng Giao Thông Đô Thị (UTraffic). Thách thức đầu tiên liên quan đến sự thiếu hụt dữ liệu đáp ứng các tiêu chuẩn cần thiết cho việc huấn luyện mô hình nhận dạng giọng nói tự động (ASR) được triển khai trong hệ thống UTraffic. Bộ dữ liệu hiện tại chủ yếu bao gồm dữ liệu tổng hợp, dẫn đến sự thiên vị cho việc nhận dạng các báo cáo giao thông được tạo từ giọng nói tổng hợp, trong khi gặp khó khăn trong việc chuyển các báo cáo giao thông dạng giọng nói do người dùng UTraffic cung cấp sang văn bản một cách chính xác. Thách thức thứ hai liên quan đến độ chính xác của mô hình ASR được triển khai trong hệ thống UTraffic hiện tại.

Để giải quyết các thách thức này, nghiên cứu đề xuất một số phương pháp. Thứ nhất, xác định một nguồn dữ liệu giao thông thay thế để giảm thiểu sự phụ thuộc vào dữ liệu tổng hợp. Thứ hai, thiết kế một luồng xử lý thích hợp, kết hợp các kỹ thuật xử lý âm thanh như chuyển đổi tỉ lệ lấy mẫu và tăng cường giọng nói để xử lý hiệu quả dữ liệu hiện có, với mục tiêu cuối cùng là cải thiện hiệu suất của mô hình ASR. Thứ ba, thử nghiệm trên dữ liệu đã xử lý các kiến trúc ASR tiên tiến để xác định mô hình tối ưu nhất cho việc triển khai trong hệ thống UTraffic.

Nghiên cứu đã đạt được những thành tựu đáng kể. Thứ nhất, hình thành một bộ dữ liệu có chất lượng vượt trội so với bộ dữ liệu ban đầu. Việc tiếp tục thu thập dữ liệu từ nguồn thay thế có thể nâng cao hơn nữa chất lượng bộ dữ liệu hiện có, biến nó thành nguồn tài nguyên quý giá cho các nỗ lực nghiên cứu cải thiện hiệu suất mô hình ASR được triển khai trong hệ thống UTraffic trong tương lai. Ngoài ra, so với kết quả đạt được bởi mô hình ASR hiện tại của hệ thống UTraffic, nghiên cứu đã đạt được những tiến bộ đáng kể, đặc biệt trong việc cải thiện độ chính xác nhận dạng giọng nói.

DECLARATION

I, Nguyễn Thị Ty, solemnly declare that this thesis titled "Research and develop solutions to traffic data collection based on voice techniques" is the result of my own work, conducted under the supervision of Assoc. Prof.
Trần Minh Quang. I affirm that all the information presented in this thesis is based on my own knowledge, research, and understanding, acquired through extensive study and investigation. I further declare that any external assistance, whether in the form of data, ideas, or references, has been duly acknowledged and properly cited in accordance with the established academic conventions. I have provided appropriate references and citations for all the sources and materials used in this thesis, giving credit to the original authors and their contributions.

I acknowledge that this thesis is intended to fulfill the demands of society and to contribute to the existing body of knowledge in the field. It represents the culmination of my efforts, dedication, and commitment to advancing knowledge and understanding in this area. I hereby affirm that this thesis is an authentic and original piece of work, and I take full responsibility for its content. I understand the consequences of any act of plagiarism or academic dishonesty, and I assure that this thesis has been prepared with utmost integrity and honesty.

Nguyễn Thị Ty
June 9, 2023

3.1 Variations in Not_SE ASR Model Training and Inference

The Not_SE directory hosts a set of files that serve distinct purposes in the realm of ASR model development.

• brachformer_trainNotSE_SamplingRate_notVOH_inferNotSE: This file demonstrates the training and inference processes of an ASR model using the "brachformer" architecture. The training involves a specific sampling rate, and the dataset does not incorporate audio from the VOH 95.6 MHz channel. Additionally, an inference process related to not-speech-enhancement-processed audio data is conducted.

• transformer_trainNotSE_notSamplingRate_infernotSE: This file showcases the training and inference procedures of an ASR model employing the "transformer" architecture. Unlike the previous file, sampling rate adjustments are not a focus during training. The inference process relates to not-speech-enhancement-processed audio data.

• transformer_trainNotSE_SamplingRate_infernotSE: In this file, an ASR model is trained and its inference is performed using the "transformer" architecture. A specific sampling rate is employed during training, and the inference activities are geared towards not-speech-enhancement-processed audio data.

Collectively, these files represent different training and inference scenarios within ASR model development for the 'Not_SE' category. They explore variations in architecture, sampling rates, and other parameters, contributing to the creation and evaluation of ASR models tailored for not-speech-enhancement-processed audio data within the ESPnet toolkit framework; a minimal resampling sketch is given below.
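The "SamplingRate" variants above presuppose that all audio has first been converted to a common sampling rate. As a minimal sketch of that conversion step (the exact target rate and libraries used in the thesis are not shown in this excerpt; librosa, soundfile, the 16 kHz target, and the file names are assumptions for illustration only):

    import librosa
    import soundfile as sf

    TARGET_SR = 16000  # assumed target sampling rate, for illustration

    def resample_file(in_path: str, out_path: str, target_sr: int = TARGET_SR) -> None:
        # librosa.load decodes the file and resamples it to target_sr in one step
        audio, sr = librosa.load(in_path, sr=target_sr, mono=True)
        sf.write(out_path, audio, target_sr)

    resample_file("voh_report_original.wav", "voh_report_16k.wav")

A step of this kind would typically be applied to every file in the dataset before training any of the "SamplingRate" variants listed in this section and the next.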
3.2 Variations in SE ASR Model Training and Inference

The SE directory comprises a collection of files, each serving a distinct purpose in the domain of ASR model development.

• branchformer_trainSE_SamplingRate_70epoch_CTCweight01: This file demonstrates the training of an ASR model using the "branchformer" architecture, involving a specific sampling rate and spanning 70 epochs. It incorporates a modification in the connectionist temporal classification (CTC) loss function, with a CTC weight of 0.1.

• branchformer_trainSE_SamplingRate_70epoch_CTCweight03: This file showcases the training of an ASR model with the "branchformer" architecture. It utilizes a specific sampling rate and extends training over 70 epochs. An alternative configuration for the CTC loss weight, set to 0.3, is employed.

• branchformer_trainSE_SamplingRate_notVOH_inferSE: In this file, an ASR model is trained using the "branchformer" architecture with a particular sampling rate. The training dataset does not incorporate audio from the VOH 95.6 MHz channel. Additionally, an inference process related to speech-enhancement-processed audio data is performed.

• conformer_trainSE_SamplingRate_70epoch_CTCweight0103: This file pertains to the training of an ASR model using the "conformer" architecture, utilizing a specific sampling rate and spanning 70 epochs. It showcases variations in the CTC loss weight, where the CTC weight is set to 0.3.

• transformer_trainSE_notSamplingRate_50epoch_inferSE: This file is associated with the training of an ASR model employing the "transformer" architecture. The training spans 50 epochs. The "inferSE" label indicates an inference process related to speech-enhancement-processed audio data.

• transformer_trainSE_SamplingRate_50epoch_inferSE: This file involves the training of an ASR model with the "transformer" architecture. It specifically employs a particular sampling rate and extends training over 50 epochs. The inference activities are related to speech-enhancement-processed audio data.

Collectively, these files correspond to different training and inference scenarios within ASR model development. They explore variations in architecture, sampling rates, and other parameters, contributing to the creation and evaluation of ASR models tailored for speech-enhancement-processed audio data within the ESPnet toolkit framework. The role of the CTC weight appearing in several of the file names above is sketched after this list.
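The CTC weights 0.1 and 0.3 in the file names above control the balance between the CTC loss and the attention (decoder) loss in hybrid CTC/attention training [19], [21]. A minimal illustration of that weighting (generic PyTorch-style code, not ESPnet's actual implementation; the dummy loss values are for illustration only):

    import torch

    def hybrid_ctc_attention_loss(loss_ctc: torch.Tensor,
                                  loss_att: torch.Tensor,
                                  ctc_weight: float = 0.3) -> torch.Tensor:
        # Weighted sum of the two training objectives; ctc_weight is 0.1 or 0.3
        # in the experiments described in this section.
        return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att

    # Illustrative usage with dummy scalar losses
    loss = hybrid_ctc_attention_loss(torch.tensor(2.1), torch.tensor(1.4), ctc_weight=0.3)

A larger CTC weight pushes training towards the monotonic alignments favored by CTC, while a smaller value relies more on the attention decoder; the 0.1 and 0.3 runs compare these two settings.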
ASR Model Deployment

4.1 Run bktraffic-analyxer

Before we run the bktraffic-analyxer, the environment where we will deploy our ASR models, we need to fulfill the following requirements:

• Make sure that Anaconda is installed on your machine, and that you already have the bktraffic-analyxer source code from [2].
• Ensure that your virtual environment has been created and equipped with the essential libraries for operation.

If any of these requirements are not satisfied, we cannot run the bktraffic-analyxer.

• Step 1: Firstly, open the Terminal and navigate to the folder containing the source code of the bktraffic-analyxer project. By default, the folder name is bktrafficanalyxer-master if extracted from a zip file.

• Step 2: Now navigate to the los_dataset folder using the command cd los_dataset. Inside this folder, there should be sub-folders named process_base_status and process_segment_status, along with the Python files export_from_db.py, process_nodes.py, and process_segs_streets.py.

• Step 3: Next, activate your virtual environment using the command conda activate followed by your environment name.

• Step 4: Run the script export_from_db.py using the command python export_from_db.py. After execution, a subfolder named los_dataset will be created within the current folder, containing another subfolder named data_origin, which includes the files nodes.csv, segment_reports.csv, segments.csv, and streets.csv.

• Step 5: Execute the script process_nodes.py using the command python process_nodes.py. Upon completion, a subfolder named dataset will be generated within the los_dataset folder, containing a file named nodes.csv.

• Step 6: Run the script process_segs_streets.py using the command python process_segs_streets.py. This will create a file named temp_segments.csv within the dataset folder.

• Step 7: Navigate to the process_base_status folder. Here, you can run process1.py and process2.py. However, before proceeding, you need to make modifications to these two files:
  – For process1.py, adjust the code at line 18 from df.to_csv('/los_dataset/dataset/temp_base_status.csv') to df.to_csv(' /los_dataset/dataset/temp_base_status.csv').
  – For process2.py, change the line baseStatus_df = pd.read_csv('/los_dataset/dataset/temp_base_status.csv') to baseStatus_df = pd.read_csv(' /los_dataset/dataset/temp_base_status.csv'). Then, change the line period_status_path = '/los_dataset/dataset/period_status' to period_status_path = ' /los_dataset/dataset/period_status'.
  Once these modifications are complete, you can run process1.py and process2.py using the respective commands python process1.py and python process2.py. After execution, you will find a folder named period_status and a file named temp_base_status.csv.

• Step 8: Now you can continuously run the processes inside process_segment_status. To do so, navigate back to the los_dataset folder with cd .., and then run cd process_segment_status. This will take you to the process_segment_status folder.

• Step 9: You will need to modify the files located inside the process_segment_status directory to suit your specific requirements.

Now you can obtain your segments.csv file and place it into resources/velocity_estimator/data (you must create the velocity_estimator directory inside resources, then create a data directory inside velocity_estimator, and finally place segments.csv inside the data directory). Please note that there are some issues with the code in process_segments_status/process2.py that prevent it from running. Additionally, process4.py relies on process2.py, which means that neither of them can be executed. Furthermore, ensure that you check the src/utils/path.py file. You will notice that you are missing the following files: segment_id_encoder.pkl, street_level_encoder.pkl, street_type_encoder.pkl, and model.pt. These files need to be provided as they contain pre-trained parameters.

Run the web app bktraffic-analyxer

The pkl files, such as segment_id_encoder.pkl, street_level_encoder.pkl, street_type_encoder.pkl, and model.pt, belong to another module called the LOS estimator. This module operates independently from the AST module, but both are hosted by the bktraffic-analyxer server. These files, along with files related to speech-to-text models, are stored in Amazon S3. However, the previous research students no longer have access to them. They suggested that we can run bktraffic-analyxer without the LOS estimator. Therefore, we re-run the app.py file with the commands related to the LOS estimator commented out. Finally, the web application runs.

4.2 Deploy ASR model into bktraffic-analyxer

Now, we write a script named main.py to utilize our trained ASR model. We accomplish this by employing the Speech2Text module from espnet2.bin.asr_inference, and we also declare the configuration and model files of our trained ASR models. Running this main.py within the bktraffic-analyxer environment will yield the results of deploying the ASR models.
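A minimal sketch of what such a main.py could look like is given below (the model paths, decoding options, and example audio file are placeholders, not the exact values used in the thesis):

    import soundfile as sf
    from espnet2.bin.asr_inference import Speech2Text

    # Placeholder paths to the trained model's configuration and weights
    speech2text = Speech2Text(
        asr_train_config="exp/asr_train/config.yaml",
        asr_model_file="exp/asr_train/valid.acc.ave.pth",
        device="cpu",
    )

    # Load a traffic speech report; it should match the sampling rate used in training
    speech, sample_rate = sf.read("traffic_report.wav")

    # Speech2Text returns an n-best list; each entry is (text, tokens, token_ids, hypothesis)
    nbests = speech2text(speech)
    text, tokens, token_ids, hypothesis = nbests[0]
    print(text)

Running a script of this form inside the bktraffic-analyxer environment prints the transcription produced by the deployed model.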
References

[1] "UTraffic (BKTraffic)." [Online]. Available: https://bktraffic.com/home/ (visited on 05/28/2023).
[2] H. Nguyen G., T. Nguyen, H. L. Trung, and Q. T. Minh, "Speech-based traffic reporting: An automated data collecting approach for intelligent transportation systems," in Artificial Intelligence in Data and Big Data Processing: Proceedings of ICABDE 2021, Springer, 2022, pp. 671–683.
[3] "AILab, University of Science HCMC." [Online]. Available: http://ailab.hcmus.edu.vn/vivos (visited on 06/05/2023).
[4] "Vbee." [Online]. Available: https://vbee.vn/ (visited on 05/28/2023).
[5] D. C. Gibbon and Z. Liu, Audio Processing. Springer, 2008.
[6] A. Gulati, J. Qin, C.-C. Chiu, et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.
[7] Y. Peng, S. Dalmia, I. Lane, and S. Watanabe, "Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding," in International Conference on Machine Learning, PMLR, 2022, pp. 17627–17643.
[8] Y. Lu, Z. Li, D. He, et al., "Understanding and improving transformer from a multi-particle dynamic system point of view," arXiv preprint arXiv:1906.02762, 2019.
[9] H. Liu, Z. Dai, D. So, and Q. V. Le, "Pay attention to MLPs," Advances in Neural Information Processing Systems, vol. 34, pp. 9204–9215, 2021.
[10] J. Sakuma, T. Komatsu, and R. Scheibler, "MLP-based architecture with variable length input for automatic speech recognition," 2021.
[11] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[13] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, PMLR, 2014, pp. 1764–1772.
[14] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE, 2015, pp. 167–174.
[15] D. Amodei, S. Ananthanarayanan, R. Anubhai, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, PMLR, 2016, pp. 173–182.
[16] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: First results," arXiv preprint arXiv:1412.1602, 2014.
[17] L. Lu, X. Zhang, and S. Renals, "On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 5060–5064.
[18] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 4960–4964.
[19] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[20] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "A tree-based statistical language model for natural language speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989.
[21] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 4835–4839.
[22] B. Xue, J. Yu, J. Xu, et al., "Bayesian transformer language models for speech recognition," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 7378–7382.
[23] C.-H. Lee, F. K. Soong, and B.-H. Juang, "A segment model based approach to speech recognition," in ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing, IEEE Computer Society, 1988, pp. 501–502.
[24] J. Drexler and J. Glass, "Subword regularization and beam search decoding for end-to-end automatic speech recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 6266–6270.
[25] M. Kocour, K. Veselý, I. Szőke, et al., "Automatic processing pipeline for collecting and annotating air-traffic voice communication data," Engineering Proceedings, vol. 13, no. 1, p. 8, 2021.
[26] L. Yi, R. Min, C. Kunjie, et al., "Identifying and managing risks of AI-driven operations: A case study of automatic speech recognition for improving air traffic safety," Chinese Journal of Aeronautics, vol. 36, no. 4, pp. 366–386, 2023.
[27] D. Guo, Z. Zhang, P. Fan, J. Zhang, and B. Yang, "A context-aware language model to improve the speech recognition in air traffic control," Aerospace, vol. 8, no. 11, p. 348, 2021.
[28] A. Blatt, M. Kocour, K. Veselý, I. Szőke, and D. Klakow, "Call-sign recognition and understanding for noisy air-traffic transcripts using surveillance information," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8357–8361. DOI: 10.1109/ICASSP43922.2022.9746301.
[29] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, 2014. DOI: 10.1109/TASLP.2014.2304637.
[30] V. Stouten, P. Wambacq, et al., "Model-based feature enhancement with uncertainty decoding for noise robust ASR," Speech Communication, vol. 48, no. 11, pp. 1502–1514, 2006.
[31] A. Narayanan and D. Wang, "Joint noise adaptive training for robust automatic speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2014, pp. 2504–2508.
[32] F. Weninger, H. Erdogan, S. Watanabe, et al., "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Latent Variable Analysis and Signal Separation: 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25-28, 2015, Proceedings 12, Springer, 2015, pp. 91–99.
[33] A. Kato, "Hidden Markov model-based speech enhancement," Ph.D. dissertation, University of East Anglia, 2017.
[34] B. J. Borgström, M. S. Brandstein, and R. B. Dunn, "Improving statistical model-based speech enhancement with deep neural networks," in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), IEEE, 2018, pp. 471–475.
[35] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[36] X. Chang, T. Maekaku, Y. Fujita, and S. Watanabe, "End-to-end integration of speech recognition, speech enhancement, and self-supervised learning representation," arXiv preprint arXiv:2204.00540, 2022.
[37] Y.-J. Lu, X. Chang, C. Li, et al., "ESPnet-SE++: Speech enhancement for robust speech recognition, translation, and understanding," arXiv preprint arXiv:2207.09514, 2022.
[38] T. Hori, J. Cho, and S. Watanabe, "End-to-end speech recognition with word-based RNN language models," in 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2018, pp. 389–396.
[39] Y. Lin, B. Yang, L. Li, et al., "ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems," Applied Soft Computing, vol. 112, p. 107847, 2021.
[40] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 6645–6649.
[41] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," Advances in Neural Information Processing Systems, vol. 28, 2015.
[42] T. Hori, S. Watanabe, and J. R. Hershey, "Joint CTC/attention decoding for end-to-end speech recognition," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 518–529.
[43] K. Zhou, Q. Yang, X. Sun, S. Liu, and J. Lu, "Improved CTC-attention based end-to-end speech recognition on air traffic control," in Intelligence Science and Big Data Engineering. Big Data and Machine Learning: 9th International Conference, IScIDE 2019, Nanjing, China, October 17–20, 2019, Proceedings, Part II 9, Springer, 2019, pp. 187–196.
[44] Y. Lin, Q. Li, B. Yang, Z. Yan, H. Tan, and Z. Chen, "Improving speech recognition models with small samples for air traffic control systems," Neurocomputing, vol. 445, pp. 287–297, 2021.
[45] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[46] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, et al., "MLP-Mixer: An all-MLP architecture for vision," Advances in Neural Information Processing Systems, vol. 34, pp. 24261–24272, 2021.
[47] Z. Wu, Z. Liu, J. Lin, Y. Lin, and S. Han, "Lite transformer with long-short range attention," arXiv preprint arXiv:2004.11886, 2020.
[48] Q. Zhang, H. Lu, H. Sak, et al., "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7829–7833.
[49] O. Press, N. A. Smith, and O. Levy, "Improving transformer models by reordering their sublayers," arXiv preprint arXiv:1911.03864, 2019.
[50] C. Gulcehre, O. Firat, K. Xu, et al., "On using monolingual corpora in neural machine translation," arXiv preprint arXiv:1503.03535, 2015.
[51] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, "An analysis of incorporating an external language model into a sequence-to-sequence model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 1–5828.
[52] ESPnet Developers, ESPnet GitHub repository, https://github.com/espnet/espnet/tree/v.202301, accessed: June 5, 2023.
[53] J. O'Sullivan, G. Bogaarts, M. Kosek, et al., "Automatic speech recognition for ASD using the open-source Whisper model from OpenAI."
[54] S. Watanabe, T. Hori, S. Karita, et al., "ESPnet: End-to-end speech processing toolkit," arXiv preprint arXiv:1804.00015, 2018.
[55] Y. Wang, H. Wang, et al., "Multilingual convolutional, long short-term memory, deep neural networks for low resource speech recognition," Procedia Computer Science, vol. 107, pp. 842–847, 2017.
[56] S. Suyanto, A. Arifianto, A. Sirwan, and A. P. Rizaendra, "End-to-end speech recognition models for a low-resourced Indonesian language," in 2020 8th International Conference on Information and Communication Technology (ICoICT), IEEE, 2020, pp. 1–6.
[57] ESPnet: End-to-end speech processing toolkit, https://espnet.github.io/espnet/index.html, accessed: June 5, 2023.
[58] "Voice of Ho Chi Minh City 95.6MHz channel." [Online]. Available: https://voh.com.vn/radio-kenh-fm-956-fm956mhz.html (visited on 05/28/2023).
[59] R. Tang, "Manual transcription," in Varieties of Qualitative Research Methods: Selected Contextual Perspectives, Springer, 2023, pp. 295–302.
[60] "ESPnet: End-to-end speech processing toolkit." [Online]. Available: https://pypi.org/project/espnet/0.6.2/ (visited on 06/30/2023).
[61] Y. Luo and N. Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 696–700.
[62] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 241–245.
[63] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 196–200.
[64] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," in Interspeech, 2016, pp. 1981–1985.
[65] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 46–50.
[66] Y. Hu, Y. Liu, S. Lv, et al., "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," arXiv preprint arXiv:2008.00264, 2020.
[67] C. Li, J. Shi, W. Zhang, et al., "ESPnet-SE: End-to-end speech enhancement and separation toolkit designed for ASR integration," in 2021 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2021, pp. 785–792.
[68] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Computer Speech & Language, vol. 46, pp. 535–557, 2017.
[69] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[70] P. Guo, F. Boyer, X. Chang, et al., "Recent developments on ESPnet toolkit boosted by Conformer," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 5874–5878.
[71] C. Gao, G. Cheng, R. Yang, H. Zhu, P. Zhang, and Y. Yan, "Pre-training transformer decoder for end-to-end ASR model with unpaired text data," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 6543–6547.
[72] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," arXiv preprint arXiv:1706.02737, 2017.
[73] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.
[74] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 179–190, 1983.
[75] R. Vipperla, S. Renals, and J. Frankel, "Ageing voices: The effect of changes in voice parameters on ASR performance," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, pp. 1–10, 2010.
[76] C. S. Leow, T. Hayakawa, H. Nishizaki, and N. Kitaoka, "Development of a low-latency and real-time automatic speech recognition system," in 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE), IEEE, 2020, pp. 925–928.
[77] "ESPnet." [Online]. Available: https://espnet.github.io/espnet/tutorial.html#use-of-gpu (visited on 07/06/2023).

VITA

Full name: NGUYỄN THỊ TY
Date of birth: 22/11/1996
Place of birth: Binh Dinh Province
Address: 32/37/12 Dong Hung Thuan 32, Tan Hung Thuan Ward, District 12

• 9/2014 - 6/2018: Student, majoring in Mathematics Education
• 10/2021 - now: Student, majoring in Data Science - Computer Science
• 6/2018 - now: Mathematics teacher teaching in English
• 8/2022 - now: AI Engineer
