VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

NGUYỄN THỊ TY

RESEARCH AND DEVELOP SOLUTIONS TO TRAFFIC DATA COLLECTION BASED ON VOICE TECHNIQUES

Major: Computer Science
Major code: 8480101

MASTER THESIS
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisor: Assoc. Prof. Trần Minh Quang
Examiner 1: Assoc. Prof. Nguyễn Văn Vũ
Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng
This master’s thesis was defended at Ho Chi Minh City University of Technology, VNU-HCM, on July 11, 2023.
The Master’s Thesis Defense Council consists of:
1. Chairman: Assoc. Prof. Lê Hồng Trang
2. Secretary: Dr. Phan Trọng Nhân
3. Examiner 1: Assoc. Prof. Nguyễn Văn Vũ
4. Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng
5. Commissioner: Assoc. Prof. Trần Minh Quang
Approval of the Chairman of the Master’s Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis is corrected (if any).

CHAIRMAN OF THESIS COMMITTEE          DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
VIETNAM NATIONAL UNIVERSITY – HO CHI MINH CITY          SOCIALIST REPUBLIC OF VIETNAM
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY               Independence – Freedom – Happiness
THE TASK SHEET OF MASTER’S THESIS
Full name: NGUYỄN THỊ TY Student code: 2171072
Date of birth: 22/11/1996    Place of birth: Binh Dinh Province
Major: Computer Science    Major code: 8480101
I. THESIS TITLE:
Research and develop solutions to traffic data collection based on voice techniques (Nghiên cứu và phát triển các giải pháp thu thập dữ liệu giao thông dựa trên các kỹ thuật giọng nói).
II. TASKS AND CONTENTS:
• Task 1: Traffic Data Collection and Processing.
The first task involves collecting comprehensive traffic data. Extensive research will be conducted to identify reliable data sources, followed by the implementation of appropriate data collection techniques. Subsequently, experiments will be carried out to determine the most effective data processing methods. The aim is to enhance data quality and optimize processing efficiency for further analysis.
• Task 2: Research and Experimentation for Automatic Speech Recognition Model Development.
• Task 3: Automatic Speech Recognition Model Evaluation and Future Work.

Once the automatic speech recognition models are developed, a comprehensive evaluation process will be undertaken. The achieved results will be analyzed using appropriate metrics and techniques to assess their performance. Strengths and weaknesses of each model will be identified. Based on this analysis, recommendations for future work will be provided, outlining potential enhancements or modifications to the automatic speech recognition models.
III. THESIS START DATE: 06/02/2023.
IV. THESIS COMPLETION DATE: 09/06/2023.
V. SUPERVISOR: Assoc. Prof. TRẦN MINH QUANG.
Ho Chi Minh City, June 9, 2023
SUPERVISOR          CHAIR OF PROGRAM COMMITTEE
(Full name and signature) (Full name and signature)
DEAN OF FACULTY OF
COMPUTER SCIENCE AND ENGINEERING
ACKNOWLEDGMENTS
I would like to extend my sincere gratitude to the individuals who have provided invaluable support and assistance throughout my research journey. I would like to express my formal appreciation to Assoc. Prof. Trần Minh Quang for his exceptional guidance, expertise, and unwavering support. His mentorship has been instrumental in helping me navigate the necessary steps to complete this thesis. Whenever I encountered difficulties or felt lost, Assoc. Prof. Quang provided invaluable advice that steered me back in the correct direction. His suggestion to process the data to enhance its quality was a significant contribution to my research. Furthermore, his assistance in establishing contact with esteemed researchers working on topics similar to mine and facilitating connections with individuals who could provide server support for training large models, such as automatic speech recognition models, has been immensely valuable.
I would like to express my profound gratitude to the esteemed researchers, Mr. Nguyễn Gia Huy and Mr. Nguyễn Tiến Thành, for their generous contributions in sharing their profound insights and knowledge. Their willingness to address my inquiries regarding the Urban Traffic Estimation System, collected data, and existing issues has significantly enriched my comprehension of the subject matter. Furthermore, I am sincerely thankful to my sisters, Ms. Nguyễn Thị Nghĩa and Ms. Nguyễn Thị Hiển, as well as Lương Duy Hưng and Vũ Anh Nhi, for their invaluable support in meticulously creating precise transcripts for the audio files.
I am also grateful to Ho Chi Minh City University of Technology (HCMUT) for their kind approval in granting me the opportunity to utilize an independent server for automatic speech recognition model training. Their trust and support from the Data Science Laboratory have been pivotal in facilitating the smooth progress of my research. In addition, I am sincerely thankful for the invaluable support rendered by my friends, Mr. Nguyễn Tấn Sang and Mr. Huỳnh Ngọc Thiện, in working with the server that has limited permissions. Their expertise and assistance have been indispensable in effectively navigating the constraints imposed by the server limitations. Lastly, I would like to express my heartfelt gratitude to my boss, co-workers, friends, and family for their unwavering emotional support and understanding during the challenging times that I encountered throughout this research endeavor. Their encouragement and belief in my abilities have been instrumental in my success.
ABSTRACT
This thesis addresses two fundamental challenges within the domain of the current intelligent traffic system, specifically the Urban Traffic Estimation (UTraffic) System. The first challenge pertains to the insufficiency of data that meets the requisite standards for training the automatic speech recognition (ASR) model that will be deployed in the UTraffic system. The current dataset predominantly consists of synthesized data, resulting in a bias towards recognizing synthesized traffic speech reports while struggling to accurately transcribe real-life traffic speech reports imported by UTraffic users. The second challenge involves the accuracy of the ASR model deployed in the current UTraffic system, particularly in transcribing real-life traffic speech reports into text.
To address these challenges, this research proposes several approaches. Firstly, an alternative traffic data source is identified to reduce the reliance on synthesized data and mitigate the bias. Secondly, a pipeline incorporating audio processing techniques such as sampling rate conversion and speech enhancement is designed to effectively process the dataset, with the ultimate objective of improving ASR model performance. Thirdly, advanced and suitable ASR architectures are experimented with using the processed dataset to identify the most optimal model for deployment within the UTraffic system.
TÓM TẮT LUẬN VĂN
Luận văn này giải quyết hai thách thức cơ bản trong lĩnh vực hệ thống giao thông thông minh hiện tại, cụ thể là Hệ Thống Dự Báo Tình Trạng Giao Thông Đô Thị (UTraffic). Thách thức đầu tiên liên quan đến sự thiếu hụt dữ liệu đáp ứng tiêu chuẩn cần thiết cho việc huấn luyện mô hình nhận dạng giọng nói tự động (ASR), sẽ được triển khai trong hệ thống UTraffic. Bộ dữ liệu hiện tại chủ yếu bao gồm dữ liệu tổng hợp, dẫn đến sự thiên vị cho việc nhận dạng các báo cáo giao thông tạo từ giọng nói tổng hợp, trong khi gặp khó khăn trong việc chuyển các báo cáo giao thông ở dạng giọng nói được cung cấp bởi người dùng UTraffic sang văn bản chính xác. Thách thức thứ hai liên quan đến độ chính xác của mô hình ASR triển khai trong hệ thống UTraffic hiện tại.
Để giải quyết những thách thức này, nghiên cứu này đề xuất một số phương pháp. Thứ nhất, xác định nguồn dữ liệu giao thông thay thế để giảm thiểu sự phụ thuộc vào dữ liệu tổng hợp. Thứ hai, thiết kế luồng xử lý thích hợp, trong đó kết hợp các kỹ thuật xử lý âm thanh như chuyển đổi tỉ lệ lấy mẫu và tăng cường giọng nói để xử lý hiệu quả bộ dữ liệu đang có, với mục tiêu cuối cùng là cải thiện hiệu suất mô hình ASR. Thứ ba, thử nghiệm bộ dữ liệu đã được xử lý trên các kiến trúc ASR tiên tiến để xác định được mô hình tối ưu nhất cho việc triển khai trong hệ thống UTraffic.
DECLARATION
I, Nguyễn Thị Ty, solemnly declare that this thesis titled "Research and develop solutions to traffic data collection based on voice techniques" is the result of my own work, conducted under the supervision of Assoc. Prof. Trần Minh Quang. I affirm that all the information presented in this thesis is based on my own knowledge, research, and understanding, acquired through extensive study and investigation.
I further declare that any external assistance, whether in the form of data, ideas, or references, has been duly acknowledged and properly cited in accordance with the established academic conventions. I have provided appropriate references and citations for all the sources and materials used in this thesis, giving credit to the original authors and their contributions.
I acknowledge that this thesis is intended to fulfill the demands of society and to contribute to the existing body of knowledge in the field. It represents the culmination of my efforts, dedication, and commitment to advancing knowledge and understanding in this area.
I hereby affirm that this thesis is an authentic and original piece of work, and I take full responsibility for its content. I understand the consequences of any act of plagiarism or academic dishonesty, and I assure that this thesis has been prepared with utmost integrity and honesty.
Contents

List of Figures x
List of Tables xi

1 INTRODUCTION 1
1.1 General Introduction 1
1.2 Problem Description 3
1.3 Objective 5
1.4 Scope Of Work 6
1.5 Contribution 8
1.6 Thesis Structure 9

2 BACKGROUND 11
2.1 Traffic Data Processing for ASR Model 11
2.2 Components in ASR Model 12
2.2.1 Acoustic Modeling 13
2.2.2 Language Modeling 18
2.2.3 Decoding Algorithm 18
2.3 Challenges in Traffic Speech Recognition 20
3 RELATED WORK 22

4 APPROACH 26
4.1 Choosing ESPnet for ASR Model Development 26
4.2 Data Collection and Data Processing 30
4.2.1 Data Collection 31
4.2.2 Data Processing 32
4.3 Training and Decoding for End-to-End ASR 36
4.3.1 Attention-based Encoder Decoder 37
4.3.2 Hybrid CTC/Attention End-to-End ASR 38
4.3.3 Evaluation Metric 40
5 EXPERIMENT AND EVALUATION 43
5.1 Dataset 43
5.2 Experimental Setup 44
5.3 Experimental Result and Analysis 46
5.3.1 Data Processing Method Experiment 46
5.3.2 RNNLM Training Experiment 48
5.3.3 Architecture Comparison Experiment 49
5.3.4 Language Model Weight Variation Experiment 56
5.3.5 CTC Weight Variation Experiment 57
5.3.6 VOH Data Impact Assessment Experiment 59
5.4 Discussion 60
5.5 ASR Deployment 61
5.5.1 bktraffic-analyxer and Training Server Environments 62
5.5.2 ASR Deployment Result 62
5.5.3 ASR Deployment Result Analysis 71
List of Figures
2.1 System Architecture of Automatic Speech Recognition [5] 13
2.2 The Architecture of the Conformer Encoder Model [6] 14
2.3 The Architecture of the Branchformer Encoder Block [7] 16
4.1 Taxonomy of Methods for Constructing our End-to-End ASR System 26
4.2 The Experimental Flow of the Standard ESPnet Recipe [54] 29
4.3 Distribution of Audio Hours among Three Data Sources 32
5.1 Training Time in the First Scenario 47
5.2 Training Time in the Second Scenario 47
5.3 Training Time in the Third Scenario 47
5.4 Training Time in the Fourth Scenario 47
5.5 GPU Maximum Cached Memory of Transformer-based Architecture 52
5.6 GPU Maximum Cached Memory of Architecture with Conformer-based Encoder 53
5.7 GPU Maximum Cached Memory of Architecture with Branchformer-based Encoder 53
5.8 WER for Transformer-based Encoder-Decoder Architecture 55
5.9 WER for Conformer-based Encoder Transformer-based Decoder Architecture 56
5.10 WER for Branchformer-based Encoder Transformer-based Decoder Architecture 56
List of Tables
4.1 Comparison of Audio Transcription Approaches 33
4.2 WER for System Combination of Speech Enhancement with Speech Recognition on CHiME-4 Corpus [37] 34
4.3 Comparison of Word Error Rate in Different Scenarios 35
5.1 Language Model Perplexity 48
5.2 Comparison of ASR Models based on WER 49
5.3 ASR Model Latency and RTF 50
5.4 Models & Trainable Parameters and Training Time 51
5.5 Comparison of Training and Validation Metrics for Transformer, Conformer, and Branchformer Models 54
5.6 Impact of LM weight on WER and Sentence Error Rate (S.Err) 57
5.7 Impact of CTC Weight Variation during the Decoding Stage on WER and S.Err 58
5.8 Impact of CTC Weight Variation in the Training Stage on WER and S.Err 59
5.9 ASR Model WER, Latency & RTF with Different Models and TestSets 59
5.10 Audio Files Conversion Time and Transcription Results of ASR Model in bktraffic-analyxer 63
INTRODUCTION
1.1 General Introduction
The Urban Traffic Estimation System (UTraffic) [1], developed by Ho Chi Minh City University of Technology, is an intelligent traffic system designed to estimate and predict urban traffic conditions. It utilizes real-time traffic data to generate accurate traffic estimations. By employing advanced technologies such as data analytics, machine learning, and deep learning, UTraffic provides valuable insights and support for traffic management and planning, aiming to improve traffic efficiency, reduce congestion, and enhance overall transportation systems in urban areas.
One crucial feature of UTraffic is the functionality to convert traffic speech reports into text through web and mobile applications. This feature significantly enhances the system’s usability and effectiveness. The details of this functionality are as follows:
• User Interaction: The web and mobile applications provide an interface for users to submit their traffic speech reports. Users can access the application from their devices, record their voice, and provide relevant details about the traffic situation they are reporting.
• Voice-to-Text Conversion: The recorded speech reports undergo processing using automatic speech recognition (ASR) technology. ASR algorithms analyze the audio input and convert it into textual format.
• Text Processing: The converted text is then further processed to extract relevant information, such as the location, type of incident, severity, and any additional details provided by the user. Natural language processing (NLP) techniques may be applied to understand and extract meaning from the text effectively.
• Data Integration: The extracted information in the form of segments from the speech reports is integrated into the UTraffic system’s database, enabling further analysis and utilization for purposes such as real-time traffic monitoring, incident detection, and traffic prediction.
• Route Optimization: By analyzing the speech reports, the UTraffic system can identify the most efficient routes for specific destinations based on real-time traffic conditions. It can provide recommendations to drivers to suggest alternate routes that minimize travel time and avoid congested areas.
By converting traffic speech reports into text, the UTraffic system can efficiently process and analyze user-generated data, enabling better traffic management and decision-making. This feature improves accessibility and convenience for users to contribute to the system, as they can report traffic incidents through voice inputs that are transformed into actionable information within the UTraffic ecosystem. In the context of this thesis, our primary focus lies on the aspect of Voice-to-Text conversion of this feature.
The paper by Nguyen et al. [2] presents the process of developing an ASR model and its deployment in the UTraffic system. The ASR model utilizes a Conformer-based encoder and Transformer-based decoder architecture, which has been trained on the VIVOS dataset [3]. The evaluation of this model is conducted on a set of 80 traffic speech reports.
The authors also introduced a feature within the UTraffic web and mobile application that allows users to voluntarily contribute their own traffic speech reports. By prompting users to speak and record their speech, the training dataset can be enriched. Additionally, the authors utilized a speech synthesis tool called Vbee to generate synthetic traffic speech reports, resulting in an additional 122,569 seconds of audio data. These efforts have cumulatively produced approximately 35 hours of audio data, which will be utilized for future training of the ASR model.
In terms of the chosen architecture for the ASR models deployed in the UTraffic system, the Conformer-based encoder and Transformer-based decoder architecture was found to outperform alternative configurations such as the Recurrent Neural Network (RNN)-based encoder-decoder and the Transformer-based encoder-decoder, as assessed by the word error rate (WER) metric. The Conformer-based encoder and Transformer-based decoder architecture exhibited superior performance, showcasing its suitability for achieving accurate speech recognition within the UTraffic system during the evaluation period.
1.2 Problem Description
There are several challenges associated with the current implementation of the ASR model in the UTraffic system. Firstly, the ASR model deployed in UTraffic encounters a domain adaptation problem. Despite the VIVOS training dataset containing a diverse range of speech recordings from various speakers, encompassing different domains and topics, the trained model is applied to recognize traffic speech reports from the web or mobile application of the UTraffic system. This introduces potential degradation in accuracy and recognition performance due to differences in acoustic and linguistic characteristics, including speaker variability, noise conditions, language style, and vocabulary, between the source domain and the target domain (traffic domain).
Secondly, only a limited amount of traffic data has been prepared for future ASR model training, as mentioned by the author [2]. While UTraffic incorporates a feature that allows the system’s web or mobile applications to collect traffic speech reports by reading predefined transcripts, this functionality has not garnered significant user engagement. As a result, the available data for training ASR models remains limited due to the low popularity of this application. To address this limitation, the authors [2] have employed an alternative approach of utilizing the Vbee speech synthesis tool [4] to generate additional artificial traffic data. However, it is important to consider that these synthesized traffic data may possess inherent weaknesses, which need to be carefully evaluated and accounted for in the training process:
• Synthesized data typically lacks the diversity and variability found in real-life speech patterns. As a result, a model trained predominantly on synthesized data may struggle to generalize well to real-world scenarios. It may not effectively handle the natural variations in speech styles, accents, intonations, and background noise present in genuine traffic speech reports.

• Over-reliance on synthesized data can lead to a bias in the trained model. The model becomes accustomed to the specific characteristics of synthesized speech, making it less effective in recognizing and transcribing real speech patterns accurately. This bias can significantly impact the model’s performance on authentic data and reduce its overall accuracy.

• Synthesized speech often lacks the naturalness and nuances present in human speech. The model trained primarily on synthesized data may struggle to accurately transcribe or understand real speech patterns due to the differences in prosody, pronunciation, and other subtle speech variations that exist in genuine traffic speech reports.
Thirdly, although the Conformer-based encoder employs techniques such as time-depth separable convolutions and multi-head self-attention, it still relies on sequential computations within each layer. Consequently, this architecture exhibits limitations in terms of parallelism. Moreover, the Conformer-based encoder requires an extended training time due to its sequential nature. Additionally, although the Conformer-based encoder can handle relatively long sequences, it faces challenges when processing excessively lengthy input sequences. As the sequence length increases, the sequential operations of the Conformer may introduce computational bottlenecks, thereby impacting efficiency and escalating memory requirements. Considering the characteristics of the dataset and the computational constraints of the current ASR model employed in the UTraffic system, the Conformer-based encoder is considered suitable. However, if the dataset or computational constraints were to change, it would be advisable to explore alternative architectures beyond the Conformer-based encoder to identify the most appropriate choice for the ASR model. This would involve evaluating other architectures that can effectively handle parallel processing, reduce training time, and efficiently process long sequences, thus optimizing the performance and scalability of the ASR model in the given context.
1.3 Objective
The objectives of this thesis are to address the aforementioned problems encountered in the ASR model used in the UTraffic system. To address the first problem, an alternative approach is proposed to acquire additional traffic data for training the ASR model. A suitable data source with the following characteristics is sought:
• The data source should provide speech samples specifically related to the local traffic conditions of Ho Chi Minh City or the surrounding region, as our ASR model is intended for traffic-related tasks in this specific area. This ensures the relevance and applicability of the training data to the target region.
• It is important that the data source exhibits diverse speech styles and vocabulary. Training our ASR model with such data exposes it to a wide range of speech patterns, facilitating improved generalization and robustness. This allows the model to adapt to various speakers and variations in speech delivery commonly encountered in real-life traffic reports.
• Accessibility is a key consideration, thus a data source that is readily available and easily accessible is preferred. Ensuring access to a reliable and convenient data source is essential for the effective building and training of our ASR model.
Subsequently, a pipeline is proposed to process the collected data, aiming to enhance the performance of our ASR model.
To address the second problem, an in-depth exploration is conducted to identify advanced and suitable architectures for the ASR model in the domain of traffic automatic speech recognition. Various advanced architectures are experimented with to determine the most optimal choice for the ASR model to be deployed within the UTraffic system. Particular attention is given to architectures capable of addressing the weaknesses associated with the Conformer-based encoder, such as limited parallelism, extended training time, high computational resource requirements, and difficulties encountered when handling long sequences.
By pursuing these objectives, this thesis aims to overcome the identified challenges and improve the effectiveness and efficiency of the ASR model utilized in the UTraffic system.
1.4 Scope Of Work
Firstly, this thesis will identify data sources related to the traffic conditions of Ho Chi Minh City or the surrounding region. The focus is on data sources that offer real-life traffic updates and reports, encompassing diverse speech styles and vocabulary, while also being easily accessible.
Secondly, a data processing pipeline will be developed to effectively process the collected data. This pipeline aims to enhance the quality and suitability of the data for training the ASR model. The steps involved may include sampling rate conversion, speech enhancement techniques, and aligning the audio data with corresponding text transcripts.
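To make the sampling-rate-conversion step concrete, the sketch below resamples a waveform with naive linear interpolation. The function name and approach are illustrative only, not code from the thesis pipeline; a production system would use a proper polyphase or sinc resampler such as those provided by librosa or torchaudio.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a waveform by linear interpolation (illustration only).

    Maps each output index back to a fractional position in the source
    signal and interpolates between the two neighboring samples.
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index in source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

For example, converting 44.1 kHz broadcast recordings and 8 kHz phone recordings to a common 16 kHz rate with such a routine would give the ASR front end a uniform input format.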
Thirdly, the thesis will explore advanced ASR model architectures suitable for traffic automatic speech recognition. Different architectures will be thoroughly studied and compared, with a specific focus on addressing the limitations associated with the Conformer-based encoder. The objective is to identify the most optimal architecture that improves parallelism, reduces training time, minimizes computational resource requirements, and effectively handles long sequences.
Fourthly, extensive experimental evaluation will be conducted to assess the performance of the proposed approaches. This will involve training and testing the ASR models using the collected data and the selected architectures. Evaluation metrics such as transcription accuracy, computational efficiency, latency, and real-time factor will be employed. Comparative analysis will be conducted to determine the improvements achieved in comparison to the existing ASR model deployed in the UTraffic system.
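Two of the metrics named above have simple, standard definitions that can be sketched directly (the helper names below are our own, not code from the thesis): WER is word-level Levenshtein distance divided by the reference length, and the real-time factor (RTF) is processing time divided by audio duration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via the standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1 means the model decodes faster than real time."""
    return processing_seconds / audio_seconds
```

For instance, a hypothesis that substitutes one word and drops another against a four-word reference yields a WER of 0.5, and decoding 8 seconds of audio in 2 seconds yields an RTF of 0.25.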
1.5 Contribution
This thesis makes five significant contributions. Firstly, it contributes an enhanced dataset for training and evaluating the ASR model. Through the collection of additional traffic speech reports and the inclusion of transcripts corresponding to audio samples, the dataset reduces bias towards synthesized data and improves the ASR model’s ability to recognize real-life traffic speech reports. This dataset serves as a valuable resource for future research in the field of traffic automatic speech recognition.
Secondly, it improves the accuracy of the ASR model. By exploring alternative architectures that are better suited to the dataset, the thesis enhances the transcription accuracy of the ASR model for traffic speech reports. Through the selection and implementation of a more suitable architecture, the ASR model achieves higher accuracy in recognizing and transcribing traffic-related speech.
Thirdly, it develops a methodological pipeline for processing the collected data. This pipeline includes important steps such as sampling rate conversion and speech enhancement, ensuring the quality and suitability of the data for training the ASR model. The developed pipeline serves as a valuable methodology for future researchers working in similar domains, providing guidelines for effective data processing in traffic automatic speech recognition.
Fourthly, it presents a comprehensive comparative analysis and provides insights into the proposed approaches. Through the evaluation of performance metrics such as transcription accuracy, computational efficiency, latency, and real-time factor, the thesis offers valuable insights into the improvements achieved by the proposed approaches. This comparative analysis guides future researchers in selecting and optimizing ASR model architectures for traffic automatic speech recognition.
Fifthly, it provides recommendations for future work, encouraging the exploration of innovative approaches, the addressing of remaining challenges, and the advancement of the field of traffic automatic speech recognition.
In summary, these contributions significantly advance the understanding and capabilities of the ASR model in traffic-related tasks, providing valuable insights and directions for researchers and practitioners in the field.
1.6 Thesis Structure
The structure of this thesis consists of six chapters, which are as follows:
• Chapter 1 - INTRODUCTION: This chapter provides an overview of the research topic, presents the objectives and significance of the study, and outlines the structure of the thesis.
• Chapter 2 - BACKGROUND: In this chapter, the relevant background information and theoretical foundations related to traffic automatic speech recognition are discussed. It includes an overview of ASR models, speech recognition techniques, and the challenges specific to traffic-related speech recognition.
• Chapter 3 - RELATED WORK: This chapter reviews the existing literature and research studies related to ASR models in the context of traffic automatic speech recognition. It discusses the approaches, methodologies, and findings of previous works, highlighting the gaps and limitations that the current thesis aims to address.
• Chapter 4 - APPROACH: This chapter presents the proposed approaches and methodologies, covering data collection and processing as well as training and decoding for the end-to-end ASR model.
• Chapter 5 - EXPERIMENT AND EVALUATION: This chapter presents the experimental setup, data analysis procedures, and evaluation metrics used to assess the performance of the proposed approaches. It includes details on the training and testing processes, performance evaluation criteria, and comparative analysis of the results obtained. The chapter provides insights into the effectiveness and efficiency of the proposed approaches.
• Chapter 6 - CONCLUSION: The final chapter summarizes the key findings, conclusions, and contributions of the thesis. It discusses the implications of the research, highlights the limitations, and suggests avenues for future research. This chapter concludes the thesis by emphasizing the significance of the work conducted and its impact on the field of traffic automatic speech recognition.
BACKGROUND
This chapter provides the necessary background information and theoretical foundations related to traffic automatic speech recognition. It offers an overview of ASR models, speech recognition techniques, and the unique challenges encountered in the domain of traffic-related speech recognition.
2.1 Traffic Data Processing for ASR Model
Traffic data processing techniques encompass a series of methodical procedures employed to preprocess and refine traffic-related data before its utilization in training ASR models. These techniques are pivotal for optimizing the data and ensuring its appropriateness for effective ASR model training. The initial step involves data collection, wherein speech data specifically associated with traffic scenarios is gathered. This entails recording audio samples from diverse sources, including traffic radio broadcasts, roadside microphones, or in-car voice recordings. Following data collection is the transcription and annotation phase. Traffic-related speech data necessitates transcription and annotation to establish a labeled dataset for training ASR models. Transcription involves the conversion of audio recordings into textual representations, while annotation entails the labeling of specific segments or events within the audio, such as identifying traffic-related terms, road names, or spoken commands.
Subsequently, data cleaning is performed to rectify any noise, distortion, or artifacts that could impede the quality and accuracy of ASR model training. This process involves eliminating background noise, filtering non-speech sounds, and addressing potential recording or transmission issues present in the raw data. To ensure consistency and uniformity within the dataset, data normalization techniques are employed. This involves standardizing the format of transcriptions, aligning timestamps with corresponding speech segments, and applying consistent labeling conventions.
Sampling rate conversion is then executed after transcription, cleaning, and normalization. This entails adjusting the audio signals to a desired sampling rate suitable for subsequent processing and ASR model training. The objective is to achieve a consistent and compatible sampling rate across the audio data obtained from various sources or recordings. Following sampling rate conversion, the data undergoes preprocessing and feature extraction, which includes additional speech enhancement techniques. These techniques aim to further enhance the quality of the speech signals by reducing noise, improving speech intelligibility, and extracting pertinent acoustic features, such as Mel-frequency cepstral coefficients or spectrograms. These features capture the linguistic and acoustic characteristics of the speech. These traffic data processing techniques are integral for the meticulous preparation of data before training ASR models. Each step serves a crucial role in enhancing the quality, relevance, and generalization capabilities of the models, ultimately leading to more accurate and robust speech recognition in traffic-related scenarios.
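As a small illustration of the feature-extraction stage, the sketch below shows two standard steps that typically precede MFCC or spectrogram computation: pre-emphasis and framing. The function names and parameter values (e.g. alpha=0.97, a 25 ms window with a 10 ms hop at 16 kHz) are common defaults used for illustration, not the thesis's actual configuration.

```python
def pre_emphasize(signal, alpha=0.97):
    """Apply y[t] = x[t] - alpha * x[t-1] to boost high frequencies,
    a common first step before spectral feature extraction."""
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the waveform into overlapping frames; e.g. a 25 ms window
    with a 10 ms hop at 16 kHz corresponds to frame_len=400, hop=160."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames
```

Each frame would then be windowed and passed through an FFT and mel filterbank to produce the spectrogram or MFCC features mentioned above.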
2.2 Components in ASR Model
Figure 2.1: System Architecture of Automatic Speech Recognition [5].
2.2.1 Acoustic Modeling
Figure 2.2: The Architecture of the Conformer Encoder Model [6].
Conformer Encoder
This arrangement, referred to as a "sandwich structure," draws inspiration from Macaron-Net [8]. The Macaron-Net approach suggests replacing the original feed-forward layer within the Transformer block with two half-step feed-forward layers. One is positioned before the attention layer, while the other is positioned after it.
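The half-step residual pattern can be sketched numerically. This is a toy illustration only: the `tanh` stand-in for the feed-forward module, the pluggable attention function, and the omission of layer normalization are all simplifying assumptions, not the actual Conformer code:

```python
import numpy as np

def ffn(x):
    """Toy position-wise feed-forward module (stand-in for Linear-Swish-Linear)."""
    return np.tanh(x)

def macaron_block(x, attention):
    """Half-step FFN -> attention -> half-step FFN, each wrapped in a
    residual connection (layer norm omitted for brevity)."""
    x = x + 0.5 * ffn(x)      # first half-step feed-forward
    x = x + attention(x)      # attention module in the middle
    x = x + 0.5 * ffn(x)      # second half-step feed-forward
    return x

out = macaron_block(np.ones((4, 8)), attention=lambda x: np.zeros_like(x))
print(out.shape)  # (4, 8)
```

The 0.5 factors are the "half-step" weights that distinguish the Macaron arrangement from a single full feed-forward layer.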
Branchformer Encoder
Figure 2.3: The Architecture of the Branchformer Encoder Block [7].
While the two branches share the same input, they focus on capturing relationships of different ranges, thereby complementing each other. Within each branch, the input is first normalized using layer normalization [11]. Then, global or local dependencies are extracted using attention or cgMLP, respectively, followed by dropout regularization [12]. The outputs of the two branches can be merged using concatenation or weighted average, with the original input added as a residual connection [7].
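The weighted-average merge variant can be sketched as follows. The helper name, the fixed weight `w`, and the constant stand-in branch outputs are assumptions for illustration, not the ESPnet Branchformer implementation:

```python
import numpy as np

def merge_branches(x, global_out, local_out, w=0.5):
    """Merge the attention (global) and cgMLP (local) branch outputs by
    weighted average, then add the block input as a residual connection."""
    return x + (w * global_out + (1.0 - w) * local_out)

x = np.ones((2, 4))          # block input
g = np.full((2, 4), 0.2)     # stand-in for the attention-branch output
l = np.full((2, 4), 0.4)     # stand-in for the cgMLP-branch output
y = merge_branches(x, g, l)
print(y.shape)  # (2, 4)
```

The concatenation variant would instead stack the two branch outputs along the feature axis and project back to the model dimension.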
Hybrid CTC/Attention end-to-end ASR
Two prominent implementations of end-to-end ASR systems are connectionist temporal classification (CTC) [13], [14], [15] and attention-based encoder-decoder models [16], [17], [18]. These approaches differ in their methodologies for aligning acoustic frames with recognized symbols.
Attention-based encoder-decoder models attend to relevant portions of the input sequence while generating the output symbols. The attention mechanism enhances the ability of the ASR system to model long-range dependencies and effectively handle variations in the input speech. On the other hand, CTC is a framework that leverages Markov assumptions to efficiently solve sequential problems using dynamic programming. CTC-based ASR models do not explicitly require alignments between acoustic frames and output symbols during training. Instead, they estimate the alignment probabilities based on the input-output sequences, which are used to compute the overall likelihood of the correct output sequence. The CTC approach is particularly useful for handling variable-length input and output sequences. Both attention-based encoder-decoder models and CTC have demonstrated significant advancements in end-to-end ASR. Attention-based methods excel at capturing intricate dependencies between acoustic frames and output symbols, while CTC provides a computationally efficient framework for training ASR models without explicit alignments. The choice between these approaches depends on the specific requirements and constraints of the ASR task at hand.
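The many-to-one mapping at the heart of CTC (merge repeated labels, then drop blanks) can be illustrated with a short sketch; `ctc_collapse` is a hypothetical helper name:

```python
def ctc_collapse(path, blank=0):
    """Map a frame-level CTC path to the output sequence:
    merge repeated labels first, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# the two distinct 1s survive because a blank separates them
print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Summing the probabilities of all frame-level paths that collapse to the same label sequence gives the CTC likelihood used during training.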
Hybrid CTC/attention systems combine the two objectives to leverage the benefits of each approach and enhance the overall ASR system's capabilities.
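The multi-objective combination commonly used in such hybrids is a weighted sum, L = λ·L_CTC + (1 − λ)·L_att. A minimal sketch (the weight 0.3 is an illustrative choice, not a value fixed by this thesis):

```python
def hybrid_loss(ctc_loss, att_loss, lam=0.3):
    """Multi-objective training loss for hybrid CTC/attention:
    L = lam * L_CTC + (1 - lam) * L_att."""
    return lam * ctc_loss + (1.0 - lam) * att_loss

print(hybrid_loss(2.0, 1.0))
```

The same interpolation idea is typically reused at decoding time, where CTC and attention scores are combined when ranking hypotheses.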
2.2.2 Language Modeling
The second component, language modeling, focuses on modeling the structure and probability distribution of spoken language. It aims to capture syntactic and semantic patterns, word sequences, and contextual information present in speech. Language models play a crucial role in handling the ambiguity and variability encountered in speech recognition. They can be based on statistical n-gram models [20] or more advanced approaches, such as RNNs [21] or Transformers [22].
RNNs are a type of neural network architecture that can effectively capture sequential dependencies in data. In the context of language modeling, RNNs are designed to model the dependencies between words or phonetic units in a sequence of speech. They process the input sequence step by step, incorporating information from previous steps to generate predictions for the current step. Transformers, on the other hand, are a relatively newer architecture that has gained significant attention in the field of natural language processing (NLP), including language modeling. Transformers utilize self-attention mechanisms to capture global dependencies and relationships within the input sequence. This attention mechanism allows the model to attend to relevant parts of the input sequence when making predictions, enabling it to capture long-range contextual information effectively. By assigning probabilities to word sequences, language models assist in the decoding process by favoring more likely and coherent transcriptions.
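A maximum-likelihood bigram model, the simplest member of the n-gram family mentioned above, can be sketched as follows (the helper name and the toy corpus are illustrative):

```python
from collections import Counter

def bigram_probs(corpus):
    """Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

lm = bigram_probs(["heavy traffic ahead", "heavy rain ahead"])
print(lm[("heavy", "traffic")])  # 0.5: "heavy" is followed by "traffic" once out of twice
```

Practical n-gram models add smoothing to assign non-zero probability to unseen word pairs; neural models replace the counts with learned representations.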
2.2.3 Decoding Algorithm
The decoding algorithm combines the outputs of the acoustic model and the language model to generate the final output. Various algorithms are utilized for this task, including Viterbi decoding [23] and beam search [24].
The Viterbi decoding algorithm is specifically employed in HMM-based speech recognition. It aims to determine the most likely sequence of hidden states given the observed features. In the HMM framework, hidden states represent phonemes or subword units, while observations denote the extracted acoustic features. The algorithm recursively computes state sequences, taking into account the transition probabilities between states and the emission probabilities of observations. It employs a trellis structure to efficiently explore the possible state sequences. At each time step, the algorithm evaluates all potential transitions from the previous step, calculating a score for each transition based on the previous state's score, the transition probability, and the emission probability of the current observation. The transition with the highest score is chosen as the most probable transition to reach the current state.
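The recursion just described can be sketched in log-domain NumPy; the helper name and the toy two-state HMM below are illustrative:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Most likely hidden-state sequence for an HMM, all inputs in the log
    domain. log_A: (S, S) transitions, log_B: (S, V) emissions, log_pi: (S,)."""
    S, T = log_A.shape[0], len(obs)
    delta = np.empty((T, S))                 # best score ending in each state
    psi = np.zeros((T, S), dtype=int)        # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (prev, cur) transition scores
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    states = [int(np.argmax(delta[-1]))]         # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t, states[-1]]))
    return states[::-1]

# toy 2-state HMM where state i prefers emitting symbol i
A = np.log([[0.9, 0.1], [0.1, 0.9]])
B = np.log([[0.9, 0.1], [0.1, 0.9]])
pi = np.log([0.5, 0.5])
print(viterbi(A, B, pi, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

Working in the log domain turns products of probabilities into sums and avoids numerical underflow over long utterances.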
Beam search, on the other hand, is a widely used decoding algorithm in ASR systems. It explores multiple hypotheses or paths in parallel during the decoding process, keeping track of the most promising hypotheses based on a scoring criterion. The algorithm maintains a beam width, which determines the number of hypotheses considered at each decoding step. By considering multiple alternatives, the search space is expanded, and the most likely hypothesis is selected based on the scores from the acoustic and language models.
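A stripped-down beam search over a grid of per-step log probabilities might look like the following (independence across steps is assumed purely for illustration; a real decoder scores each extension with the acoustic and language models):

```python
import numpy as np

def beam_search(step_log_probs, beam_width=2):
    """Keep the beam_width best partial hypotheses at every step of a
    (T, V) grid of per-step log probabilities."""
    beams = [((), 0.0)]                       # (sequence, cumulative score)
    for step in step_log_probs:
        candidates = [(seq + (v,), score + step[v])
                      for seq, score in beams
                      for v in range(len(step))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]       # prune to the beam width
    return beams[0]

lp = np.log(np.array([[0.6, 0.4], [0.3, 0.7]]))
seq, score = beam_search(lp)
print(seq)  # (0, 1)
```

With a beam width of 1 this degenerates to greedy decoding; widening the beam trades computation for a lower risk of pruning the best hypothesis.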
2.3 Challenges in Traffic Speech Recognition
Traffic speech recognition poses unique challenges due to the type of ASR model used in the context of traffic speech recognition (TSR) and the characteristics of traffic-related speech data. Firstly, model complexity is a challenge. Traffic-related ASR models need to handle large vocabularies and complex linguistic structures specific to traffic domains, such as traffic signs, road names, and vehicle models. Designing and training models that can effectively handle this complexity while maintaining high accuracy can be challenging. Secondly, real-time performance is a concern. In traffic applications such as in-vehicle speech recognition or real-time traffic control, there is often a requirement for real-time processing and response. ASR models need to operate with low latency to provide timely transcriptions and enable quick decision-making. Balancing model complexity and computational efficiency to achieve real-time performance can be a challenge. Thirdly, scalability and deployment pose challenges. Deploying ASR models for traffic-related applications often involves large-scale systems that can handle high volumes of speech data in real time. Designing models that are scalable, can handle increased traffic loads, and can be efficiently deployed on various platforms and devices presents a challenge. Foremost, the paramount challenge in traffic-related ASR models is accuracy. Numerous factors intricately influence this challenge.
• Noisy environments in traffic scenarios introduce high levels of background noise, including road noise, engine sounds, honking, and other ambient sounds. These noisy environments degrade the speech signal quality, making it difficult for ASR models to accurately recognize and transcribe the spoken content.
Robust noise-handling techniques are therefore required.
• Acoustic variations in traffic environments also affect accuracy. Different in-car microphone configurations, varying distances between the speaker and the microphone, and different vehicle types introduce acoustic variations that impact ASR model performance. Adapting to these variations and maintaining consistent recognition performance pose a challenge.

• Limited training data specific to traffic-related speech is another challenge. Developing accurate traffic-related ASR models is hindered by the scarcity of labeled training data. Collecting and annotating large-scale, diverse, and representative datasets for traffic scenarios is time-consuming and costly. The limited training data can affect the generalization and accuracy of ASR models in traffic-related applications.

• Out-of-vocabulary (OOV) words pose a challenge in traffic-related ASR. ASR models may encounter words or terms that are not part of their vocabulary, such as new road names, traffic signs, or vehicle models. Handling OOV words and adapting the model to recognize these specialized terms is challenging, particularly when training data for these specific terms is limited.
Chapter 3
RELATED WORK
Recently, there have been studies focusing on developing ASR systems specifically tailored for transcribing traffic-related communication. These studies aim to address the challenges of accurately transcribing various types of traffic-related audio data, including communication between traffic drivers [2] and communication between pilots and air-traffic controllers for air traffic control [25], [26]. To ensure precise transcriptions of traffic-related speech, these studies evaluate the performance of different ASR models by assessing acoustic and language modeling techniques [27] and noise reduction methods [28].
In recent years, the focus has also been on addressing the challenges of noise and environmental factors in transcribing traffic-related audio data [28]. While there are relatively few studies specifically on noise reduction for automatic speech recognition systems in road traffic, significant progress has been made in reducing background noise effects for ASR systems in other domains. These studies explore various robust ASR techniques [29], including feature enhancement algorithms [30], noise-adaptive modeling approaches [31], and speech enhancement methods [32]. Statistical model-based approaches, such as Gaussian Mixture Models (GMMs) and HMMs [33], have been utilized to represent the statistical characteristics of clean speech and noise, enabling the estimation of clean speech from noisy signals. Deep learning-based approaches, particularly CNNs and RNNs, have also shown success in speech enhancement tasks by learning complex mappings between noisy and clean speech using large-scale datasets [34]. Convolutional TasNet (Conv-TasNet) [35], a notable deep learning model used for speech enhancement (SE), excels in isolating speech from noise and interference, making it a compelling choice for improving the quality of audio data in real-time ASR models [36], [37].
Conventional automatic speech recognition (ASR) systems, based on hidden Markov models (HMMs) or deep neural networks (DNNs), are known for their high complexity. These systems are composed of multiple modules, including acoustic models, lexicons, and language models. However, recent advancements have introduced end-to-end ASR architectures with the aim of simplifying the traditional module-based approaches. These end-to-end ASR methods leverage paired acoustic and language data and do not rely on explicit linguistic knowledge. They train a single model using a unified algorithm, enabling the development of ASR systems without the need for expert knowledge [38]. In the domain of traffic-related speech recognition, there exist specialized end-to-end ASR approaches. For instance, studies have explored the feasibility and performance of an enhanced end-to-end architecture based on DeepSpeech2 [15] for transcribing traffic communication [39]. These investigations have conducted comparative analyses to evaluate transcription accuracy and computational efficiency in comparison to traditional hybrid models. Different types of end-to-end architectures for ASR have been studied, including connectionist temporal classification (CTC) [13], recurrent neural network (RNN) transducer [40], attention-based encoder-decoder [41], and their hybrid variants [21], [42]. Research [43] has also focused on hybrid CTC/attention end-to-end systems specifically for traffic ASR, aiming to leverage the alignment-based training approach of CTC and the attention-based decoding mechanism to improve accuracy and robustness in transcribing traffic-related speech. CTC and attention mechanisms represent two distinct approaches to modeling the acoustic component of an ASR system.
Studies have compared these encoder architectures, assessing their effectiveness in accurately transcribing traffic-related speech. However, these studies face the challenge of domain adaptation for ASR models, particularly in the context of road traffic. To address the limitations discussed above with the Conformer architecture, a novel architecture called Branchformer [7] was introduced in 2022. The primary objective of the Branchformer architecture is to redesign the existing structure to improve training stability, provide flexibility for accommodating various attention mechanisms, facilitate interpretability for insightful design analysis, and allow for varying levels of inference complexity within a single model. Branchformer serves as a more flexible, interpretable, and customizable encoder alternative to the Conformer encoder. It incorporates parallel branches to model different dependencies at various ranges, enabling effective end-to-end speech processing.
In recent advancements, the utilization of external language models has exhibited notable enhancements in accuracy for neural machine translation [50] and end-to-end ASR [21], [51]. This approach, referred to as shallow fusion, involves the integration of the decoder network with an external language model operating in the log probability domain during the decoding process. The effectiveness of recurrent neural network language models (RNN-LMs) has been particularly demonstrated in Japanese and Mandarin Chinese tasks, achieving comparable or superior accuracy when compared to systems based on deep neural networks and hidden Markov models (DNN/HMM) [21]. To facilitate the joint prediction of subsequent characters, the RNN-LM has been designed as a character-based language model, effectively combined with the decoder network.
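The log-domain combination used in shallow fusion can be sketched as follows; the weight β = 0.3 and the toy hypothesis scores are illustrative assumptions:

```python
def shallow_fusion(asr_logp, lm_logp, beta=0.3):
    """Shallow fusion: combine decoder and external LM scores in the
    log probability domain; beta is the LM weight."""
    return asr_logp + beta * lm_logp

# rescoring two hypothetical hypotheses: (ASR log-prob, LM log-prob)
hyps = {"ket xe": (-1.0, -0.5), "cat xe": (-0.9, -3.0)}
best = max(hyps, key=lambda h: shallow_fusion(*hyps[h]))
print(best)  # ket xe
```

The LM term steers decoding away from acoustically plausible but linguistically unlikely strings, which is the mechanism exploited by the character-based RNN-LM described above.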
Chapter 4
APPROACH
End-to-end ASR system construction

• Data preparation
  – Data collection: user traffic speech reports from UTraffic’s web and mobile app; Vbee-synthesized traffic speech reports; VOH 95.6 MHz channel reports
  – Data processing: manual transcription; conversion of audio files into single-channel format; sampling rate setting; Conv-TasNet speech enhancement

• Training and decoding: attention-based encoder-decoder; hybrid CTC/attention end-to-end; multiobjective training; joint decoding; use of language model

• Evaluation metric: WER; real-time factor; latency
Figure 4.1: Taxonomy of Methods for Constructing our End-to-End ASR System.
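WER, the primary evaluation metric in the taxonomy above, is the word-level Levenshtein distance (substitutions, insertions, deletions) normalized by the reference length; a minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(wer("ket xe tren duong", "ket xe duong"))  # 0.25: one deletion in four words
```

Because insertions are counted, WER can exceed 1.0 for very noisy hypotheses.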
4.1 Choosing ESPnet for ASR Model Development
To address the aforementioned challenges in deploying the ASR model within the UTraffic system, we propose several methods. However, prior to delving into the intricacies of our methodologies, it is crucial to emphasize our deliberate choice to