Master's thesis in Computer Science: Research and develop solutions to traffic data collection based on voice techniques



VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Assoc. Prof. Trần Minh Quang

Examiner 1: Assoc. Prof. Nguyễn Văn Vũ

Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng

This master's thesis is defended at HCM City University of Technology, HCM City on July 11, 2023.

The board of the Master's Thesis Defense Council includes: (Please write down the full name and academic rank of each member of the Master's Thesis Defense Council)

1. Chairman: Assoc. Prof. Lê Hồng Trang

2. Secretary: Dr. Phan Trọng Nhân

3. Examiner 1: Assoc. Prof. Nguyễn Văn Vũ

4. Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng

5. Commissioner: Assoc. Prof. Trần Minh Quang

Approval of the Chairman of Master's Thesis Committee and Dean of Faculty of Computer Science and Engineering after the thesis is corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: NGUYỄN THỊ TY
Student code: 2171072
Date of birth: 22/11/1996
Place of birth: Binh Dinh Province
Major: Computer Science
Major code: 8480101

I. THESIS TITLE:

Research and develop solutions to traffic data collection based on voice techniques (Nghiên cứu và phát triển các giải pháp thu thập dữ liệu giao thông dựa trên các kỹ thuật giọng nói).

II. TASKS AND CONTENTS:

• Task 1: Traffic Data Collection and Processing.

The first task involves collecting comprehensive traffic data. Extensive research will be conducted to identify reliable data sources, followed by the implementation of appropriate data collection techniques. Subsequently, experiments will be carried out to determine the most effective data processing methods. The aim is to enhance data quality and optimize processing efficiency for further analysis.

• Task 2: Research and Experimentation for Automatic Speech Recognition Model Development.

In this phase, the focus will be on researching and experimenting with various architectures to develop high-performance automatic speech recognition models. Different techniques will be explored to achieve accurate speech-to-text conversion. The goal is to identify the best-performing model that meets the project's requirements.

• Task 3: Automatic Speech Recognition Model Evaluation and Future Work.

Once the automatic speech recognition models are developed, a comprehensive evaluation process will be undertaken. The achieved results will be analyzed using appropriate metrics and techniques to assess their performance. Strengths and weaknesses of each model will be identified. Based on this analysis, recommendations for future work will be provided, outlining potential enhancements or modifications to the automatic speech recognition models.

III. THESIS START DAY: 06/02/2023.

IV. THESIS COMPLETION DAY: 09/06/2023.

V. SUPERVISOR: Assoc. Prof. TRẦN MINH QUANG.

Ho Chi Minh City, June 9, 2023

SUPERVISOR                         CHAIR OF PROGRAM COMMITTEE

(Full name and signature) (Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

(Full name and signature)


I would like to extend my sincere gratitude to the individuals who have provided invaluable support and assistance throughout my research journey. I would like to express my formal appreciation to Assoc. Prof. Trần Minh Quang for his exceptional guidance, expertise, and unwavering support. His mentorship has been instrumental in helping me navigate the necessary steps to complete this thesis. Whenever I encountered difficulties or felt lost, Assoc. Prof. Quang provided invaluable advice that steered me back in the correct direction. His suggestion to process the data to enhance its quality was a significant contribution to my research. Furthermore, his assistance in establishing contact with esteemed researchers working on topics similar to mine and facilitating connections with individuals who could provide server support for training large models, such as automatic speech recognition models, has been immensely valuable.

I would like to express my profound gratitude to the esteemed researchers, Mr. Nguyễn Gia Huy and Mr. Nguyễn Tiến Thành, for their generous contributions in sharing their profound insights and knowledge. Their willingness to address my inquiries regarding the Urban Traffic Estimation System, collected data, and existing issues has significantly enriched my comprehension of the subject matter. Furthermore, I am sincerely thankful to my sisters, Ms. Nguyễn Thị Nghĩa and Ms. Nguyễn Thị Hiển, as well as Lương Duy Hưng and Vũ Anh Nhi, for their invaluable support in meticulously creating precise transcripts for the audio files.

Furthermore, I would like to express my deep appreciation to Mr. Tăng Quốc Thái for his diligent efforts in meticulously collecting and securely storing the traffic reports from VOH 95.6MHz. Additionally, I am profoundly grateful to Mr. Mai Tấn Hà, who graciously provided me with access to a server for the training of automatic speech recognition models. His generosity and support have been instrumental in enabling the successful execution of the model training process. I would also like to extend my formal gratitude to Dr. Lê Thành Sách and Mr. Nguyễn Hoàng Minh from the Data Science Laboratory at Ho Chi Minh City University of Technology (HCMUT) for their kind approval in granting me the opportunity to utilize an independent server for automatic speech recognition model training. Their trust and support from the Data Science Laboratory have been pivotal in facilitating the smooth progress of my research. In addition, I am sincerely thankful for the invaluable support rendered by my friends, Mr. Nguyễn Tấn Sang and Mr. Huỳnh Ngọc Thiện, in working with the server that has limited permissions. Their expertise and assistance have been indispensable in effectively navigating the constraints imposed by the server limitations.

Lastly, I would like to express my heartfelt gratitude to my boss, co-workers, friends, and family for their unwavering emotional support and understanding during the challenging times that I encountered throughout this research endeavor. Their encouragement and belief in my abilities have been instrumental in my success.

Once again, I am deeply grateful to all of the individuals mentioned above for their significant contributions and support, without which this thesis would not have been possible.


ABSTRACT

This thesis addresses two fundamental challenges within the domain of the current intelligent traffic system, specifically the Urban Traffic Estimation (UTraffic) System. The first challenge pertains to the insufficiency of data that meets the requisite standards for training the automatic speech recognition (ASR) model that will be deployed in the UTraffic system. The current dataset predominantly consists of synthesized data, resulting in a bias towards recognizing synthesized traffic speech reports while struggling to accurately transcribe real-life traffic speech reports imported by UTraffic users. The second challenge involves the accuracy of the ASR model deployed in the current UTraffic system, particularly in transcribing real-life traffic speech reports into text.

To address these challenges, this research proposes several approaches. Firstly, an alternative traffic data source is identified to reduce the reliance on synthesized data and mitigate the bias. Secondly, a pipeline incorporating audio processing techniques such as sampling rate conversion and speech enhancement is designed to effectively process the dataset, with the ultimate objective of improving ASR model performance. Thirdly, advanced and suitable ASR architectures are experimented with using the processed dataset to identify the most optimal model for deployment within the UTraffic system.

Significant achievements have been obtained through this research. Firstly, a new dataset of superior quality compared to the previous one has been developed. Continuous data collection from the alternative traffic data source can further enhance this dataset, making it a valuable resource for future research endeavors aiming to improve the ASR model deployed in the UTraffic system. Additionally, notable progress has been made in improving the accuracy of the ASR model compared to the results achieved by the current architecture of the UTraffic system's ASR model.


TÓM TẮT LUẬN VĂN

Luận văn này giải quyết hai thách thức cơ bản trong lĩnh vực hệ thống giao thông thông minh hiện tại, cụ thể là Hệ Thống Dự Báo Tình Trạng Giao Thông Đô Thị (UTraffic). Thách thức đầu tiên liên quan đến sự thiếu hụt dữ liệu đáp ứng tiêu chuẩn cần thiết cho việc huấn luyện mô hình nhận dạng giọng nói tự động (ASR), sẽ được triển khai trong hệ thống UTraffic. Bộ dữ liệu hiện tại chủ yếu bao gồm dữ liệu tổng hợp, dẫn đến sự thiên vị cho việc nhận dạng các báo cáo giao thông tạo từ giọng nói tổng hợp, trong khi gặp khó khăn trong việc chuyển các báo cáo giao thông ở dạng giọng nói được cung cấp bởi người dùng UTraffic sang văn bản chính xác. Thách thức thứ hai liên quan đến độ chính xác của mô hình ASR triển khai trong hệ thống UTraffic hiện tại.

Để giải quyết những thách thức này, nghiên cứu này đề xuất một số phương pháp. Thứ nhất, xác định nguồn dữ liệu giao thông thay thế để giảm thiểu sự phụ thuộc vào dữ liệu tổng hợp. Thứ hai, thiết kế luồng xử lý thích hợp, trong đó kết hợp các kỹ thuật xử lý âm thanh như chuyển đổi tỉ lệ lấy mẫu và tăng cường giọng nói để xử lý hiệu quả bộ dữ liệu đang có, với mục tiêu cuối cùng là cải thiện hiệu suất mô hình ASR. Thứ ba, thử nghiệm bộ dữ liệu đã được xử lý trên các kiến trúc ASR tiên tiến để xác định được mô hình tối ưu nhất cho việc triển khai trong hệ thống UTraffic.

Nghiên cứu này đã đạt được thành tựu đáng kể. Thứ nhất, chúng ta hình thành được một bộ dữ liệu mới có chất lượng vượt trội hơn so với bộ dữ liệu ban đầu. Việc tiếp tục thu thập dữ liệu từ nguồn thay thế có thể nâng cao hơn nữa chất lượng của bộ dữ liệu hiện có, biến nó thành nguồn tài nguyên quý giá cho những nỗ lực nghiên cứu cải thiện hiệu suất mô hình ASR triển khai trong hệ thống UTraffic trong tương lai. Ngoài ra, so với các kết quả đạt được bởi mô hình ASR hiện tại trong hệ thống UTraffic, chúng ta đã đạt được những tiến bộ đáng kể, đặc biệt trong việc cải thiện độ chính xác nhận dạng giọng nói.


I, Nguyễn Thị Ty, solemnly declare that this thesis titled "Research and develop solutions to traffic data collection based on voice techniques" is the result of my own work, conducted under the supervision of Assoc. Prof. Trần Minh Quang. I affirm that all the information presented in this thesis is based on my own knowledge, research, and understanding, acquired through extensive study and investigation.

I further declare that any external assistance, whether in the form of data, ideas, or references, has been duly acknowledged and properly cited in accordance with the established academic conventions. I have provided appropriate references and citations for all the sources and materials used in this thesis, giving credit to the original authors and their contributions.

I acknowledge that this thesis is intended to fulfill the demands of society and to contribute to the existing body of knowledge in the field. It represents the culmination of my efforts, dedication, and commitment to advancing knowledge and understanding in this area.

I hereby affirm that this thesis is an authentic and original piece of work, and I take full responsibility for its content. I understand the consequences of any act of plagiarism or academic dishonesty, and I assure that this thesis has been prepared with utmost integrity and honesty.

Nguyễn Thị Ty
June 9, 2023


List of Figures

List of Tables

1 INTRODUCTION
1.1 General Introduction

2.2 Components in ASR Model

4.2 Data Collection and Data Processing
4.2.1 Data Collection
4.2.2 Data Processing
4.3 Training and Decoding for End-to-End ASR
4.3.1 Attention-based Encoder Decoder
4.3.2 Hybrid CTC/Attention End-to-End ASR
4.3.3 Evaluation Metric

5 EXPERIMENT AND EVALUATION
5.1 Dataset
5.2 Experimental Setup
5.3 Experimental Result and Analysis
5.3.1 Data Processing Method Experiment
5.3.2 RNNLM Training Experiment
5.3.3 Architecture Comparison Experiment
5.3.4 Language Model Weight Variation Experiment
5.3.5 CTC Weight Variation Experiment
5.3.6 VOH Data Impact Assessment Experiment
5.4 Discussion
5.5 ASR Deployment
5.5.1 bktraffic-analyxer and Training Server Environments
5.5.2 ASR Deployment Result
5.5.3 ASR Deployment Result Analysis

6 CONCLUSION
References


List of Figures

2.1 System Architecture of Automatic Speech Recognition [5]
2.2 The Architecture of the Conformer Encoder Model [6]
2.3 The Architecture of the Branchformer Encoder Block [7]
4.1 Taxonomy of Methods for Constructing our End-to-End ASR System
4.2 The Experimental Flow of the Standard ESPnet Recipe [54]
4.3 Distribution of Audio Hours among Three Data Sources
5.1 Training Time in the First Scenario
5.2 Training Time in the Second Scenario
5.3 Training Time in the Third Scenario
5.4 Training Time in the Fourth Scenario
5.5 GPU Maximum Cached Memory of Transformer-based Architecture
5.6 GPU Maximum Cached Memory of Architecture with Conformer-based Encoder
5.7 GPU Maximum Cached Memory of Architecture with Branchformer-based Encoder
5.8 WER for Transformer-based Encoder-Decoder Architecture
5.9 WER for Conformer-based Encoder Transformer-based Decoder Architecture
5.10 WER for Branchformer-based Encoder Transformer-based Decoder Architecture


List of Tables

4.1 Comparison of Audio Transcription Approaches
4.2 WER for System Combination of Speech Enhancement with Speech Recognition on CHiME-4 Corpus [37]
4.3 Comparison of Word Error Rate in Different Scenarios
5.1 Language Model Perplexity
5.2 Comparison of ASR Models based on WER
5.3 ASR Model Latency and RTF
5.4 Models & Trainable Parameters and Training Time
5.5 Comparison of Training and Validation Metrics for Transformer, Conformer, and Branchformer Models
5.6 Impact of LM Weight on WER and Sentence Error Rate (S.Err)
5.7 Impact of CTC Weight Variation during the Decoding Stage on WER and S.Err
5.8 Impact of CTC Weight Variation in the Training Stage on WER and S.Err
5.9 ASR Model WER, Latency & RTF with Different Models and Test Sets
5.10 Audio Files Conversion Time and Transcription Results of ASR Model in bktraffic-analyxer


1 INTRODUCTION

The Urban Traffic Estimation System (UTraffic) [1], developed by Ho Chi Minh University of Technology, is an intelligent traffic system designed to estimate and predict urban traffic conditions. It utilizes real-time traffic data to generate accurate traffic estimations. By employing advanced technologies such as data analytics, machine learning, and deep learning, UTraffic provides valuable insights and support for traffic management and planning, aiming to improve traffic efficiency, reduce congestion, and enhance overall transportation systems in urban areas.

One crucial feature of UTraffic is the functionality to convert traffic speech reports into text through web and mobile applications. This feature significantly enhances the system's usability and effectiveness. The details of this functionality are as follows:

• User Interaction: The web and mobile applications provide an interface for users to submit their traffic speech reports. Users can access the application from their devices, record their voice, and provide relevant details about the traffic situation they are reporting.

• Voice-to-Text Conversion: The recorded speech reports undergo processing using …


• Data Integration: The extracted information in the form of segments from the speech reports is integrated into the UTraffic system's database, enabling further analysis and utilization for purposes such as real-time traffic monitoring, incident detection, and traffic prediction.

• Route Optimization: By analyzing the speech reports, the UTraffic system can identify the most efficient routes for specific destinations based on real-time traffic conditions. It can provide recommendations to drivers to suggest alternate routes that minimize travel time and avoid congested areas.

By converting traffic speech reports into text, the UTraffic system can efficiently process and analyze user-generated data, enabling better traffic management and decision-making. This feature improves accessibility and convenience for users to contribute to the system, as they can report traffic incidents through voice inputs that are transformed into actionable information within the UTraffic ecosystem. In the context of this thesis, our primary focus lies on the aspect of Voice-to-Text conversion of this feature.

The paper by Nguyen et al. [2] presents the process of developing an ASR model and its deployment in the UTraffic system. The ASR model utilizes a Conformer-based encoder and Transformer-based decoder architecture, which has been trained on the VIVOS dataset [3]. The evaluation of this model is conducted on a set of 80 traffic speech reports.

Although the authors did not perform fine-tuning using additional data, they have taken measures to address the limited training data challenge. They have incorporated a feature within the UTraffic web and mobile application that allows users to voluntarily contribute their own traffic speech reports. By prompting users to speak and record their speech, the training dataset can be enriched. Additionally, the authors utilized a speech synthesis tool called Vbee to generate synthetic traffic speech reports, resulting in an additional 122,569 seconds of audio data. These efforts have cumulatively produced approximately 35 hours of audio data, which will be utilized for future training of the ASR model.

In terms of the chosen architecture for the ASR models deployed in the UTraffic system, the Conformer-based encoder and Transformer-based decoder architecture was found to outperform alternative configurations such as Recurrent Neural Network (RNN)-based encoder-decoder and Transformer-based encoder-decoder, as assessed by the word error rate (WER) metric. The Conformer-based encoder and Transformer-based decoder architecture exhibited superior performance, showcasing its suitability for achieving accurate speech recognition within the UTraffic system during the evaluation period.
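Since WER is the yardstick used to compare these architectures (and throughout the evaluation chapter), a minimal sketch of how it is computed may be useful. The sample transcripts below are hypothetical and not drawn from the thesis's test set.

```python
# A minimal sketch of the word error rate (WER) metric: the word-level edit
# distance (substitutions + deletions + insertions) between hypothesis and
# reference, divided by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical reference/hypothesis pair: one substitution over seven words.
print(wer("ket xe tren duong vo van kiet", "ket xe tren duong vo van kia"))
```

Production toolkits compute the same quantity (often alongside the sentence error rate, S.Err, reported in Chapter 5), so this sketch is only meant to make the metric concrete.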

There are several challenges associated with the current implementation of the ASR model in the UTraffic system. Firstly, the ASR model deployed in UTraffic encounters a domain adaptation problem. Despite the VIVOS training dataset containing a diverse range of speech recordings from various speakers, encompassing different domains and topics, the trained model is applied to recognize traffic speech reports from the web or mobile application of the UTraffic system. This introduces potential degradation in accuracy and recognition performance due to differences in acoustic and linguistic characteristics, including speaker variability, noise conditions, language style, and vocabulary, between the source domain and the target domain (traffic domain).

The second challenge relates to the quality of the training data that has been prepared for future ASR model training, as mentioned by the author [2]. While UTraffic incorporates a feature that allows the system's web or mobile applications to collect traffic speech reports by reading predefined transcripts, this functionality has not garnered significant user engagement. As a result, the available data for training ASR models remains limited due to the low popularity of this application. To address this limitation, the authors [2] have employed an alternative approach of utilizing the Vbee speech synthesis tool [4] to generate additional artificial traffic data. However, it is important to consider that these synthesized traffic data may possess inherent weaknesses, which need to be carefully evaluated and accounted for in the training process:

• Synthesized data typically lacks the diversity and variability found in real-life speech patterns. As a result, the model trained predominantly on synthesized data may struggle to generalize well to real-world scenarios. It may not effectively handle the natural variations in speech styles, accents, intonations, and background noise present in genuine traffic speech reports.

• Over-reliance on synthesized data can lead to a bias in the trained model. The model becomes accustomed to the specific characteristics of synthesized speech, making it less effective in recognizing and transcribing real speech patterns accurately. This bias can significantly impact the model's performance on authentic data and reduce its overall accuracy.

• Synthesized speech often lacks the naturalness and nuances present in human speech. The model trained primarily on synthesized data may struggle to accurately transcribe or understand real speech patterns due to the differences in prosody, pronunciation, and other subtle speech variations that exist in genuine traffic speech reports.

The final concern regarding the ASR model utilized in the UTraffic system revolves around its architecture, specifically the Conformer-based encoder and Transformer-based decoder. While the Conformer architecture incorporates parallel processing techniques such as time-depth separable convolutions and multi-head self-attention, it still relies on sequential computations within each layer. Consequently, this architecture exhibits limitations in terms of parallelism. Moreover, the Conformer-based encoder requires an extended training time due to its sequential nature. Additionally, although the Conformer-based encoder can handle relatively long sequences, it faces challenges when processing excessively lengthy input sequences. As the sequence length increases, the sequential operations of the Conformer may introduce computational bottlenecks, thereby impacting efficiency and escalating memory requirements.

Considering the characteristics of the dataset and the computational constraints of the current ASR model employed in the UTraffic system, the Conformer-based encoder is considered suitable. However, if the dataset or computational constraints were to change, it would be advisable to explore alternative architectures beyond the Conformer-based encoder to identify the most appropriate choice for the ASR model. This would involve evaluating other architectures that can effectively handle parallel processing, reduce training time, and efficiently process long sequences, thus optimizing the performance and scalability of the ASR model in the given context.
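To make the memory concern concrete, the back-of-envelope sketch below estimates the size of the self-attention score matrices, which grow quadratically with input length. The head count, precision, and frame counts are illustrative assumptions, not the configuration of the deployed UTraffic model.

```python
# Illustrative estimate of why self-attention cost grows quadratically with
# sequence length: each attention head materializes an n x n score matrix.
# The defaults below (8 heads, 4-byte floats) are assumptions for illustration.

def attention_matrix_bytes(seq_len: int, num_heads: int = 8, bytes_per_val: int = 4) -> int:
    """Memory for one layer's attention score matrices: heads * n * n values."""
    return num_heads * seq_len * seq_len * bytes_per_val

for n in (500, 1000, 2000):  # rough encoder frame counts for increasingly long audio
    mb = attention_matrix_bytes(n) / 1e6
    print(f"{n:5d} frames -> {mb:8.1f} MB per layer")
```

Doubling the sequence length quadruples this term, which is why memory and throughput degrade on excessively long inputs even when per-layer computation is parallelized.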

The objectives of this thesis are to address the aforementioned problems encountered in the ASR model used in the UTraffic system. To address the first problem, an alternative approach is proposed to acquire additional traffic data for training the ASR model. A suitable data source with the following characteristics is sought:

• The data source should provide speech samples specifically related to the local traffic conditions of Ho Chi Minh City or the surrounding region, as our ASR model is intended for traffic-related tasks in this specific area. This ensures the relevance and applicability of the training data to the target region.

• The chosen data source should offer real-life traffic updates and reports, serving as a valuable resource for obtaining authentic traffic speech reports.


• It is important that the data source exhibits diverse speech styles and vocabulary. Training our ASR model with such data exposes it to a wide range of speech patterns, facilitating improved generalization and robustness. This allows the model to adapt to various speakers and variations in speech delivery commonly encountered in real-life traffic reports.

• Accessibility is a key consideration; thus, a data source that is readily available and easily accessible is preferred. Ensuring access to a reliable and convenient data source is essential for the effective building and training of our ASR model.

Subsequently, a pipeline is proposed to process the collected data, aiming to enhance the performance of our ASR model.

To address the second problem, an in-depth exploration is conducted to identify advanced and suitable architectures for the ASR model in the domain of traffic automatic speech recognition. Various advanced architectures are experimented with to determine the optimal choice for the ASR model to be deployed within the UTraffic system. Particular attention is given to architectures capable of addressing the weaknesses associated with the Conformer-based encoder, such as limited parallelism, extended training time, high computational resource requirements, and difficulties encountered when handling long sequences.

By pursuing these objectives, this thesis aims to overcome the identified challenges and improve the effectiveness and efficiency of the ASR model utilized in the UTraffic system.

The scope of work for this thesis encompasses several key areas. Firstly, it involves data collection, where an additional data source will be identified and selected for training the ASR model. Extensive research and evaluation will be conducted to identify sources that provide speech samples specifically related to the local traffic conditions of Ho Chi Minh City or the surrounding region. The focus is on data sources that offer real-life traffic updates and reports, encompassing diverse speech styles and vocabulary, while also being easily accessible.

Secondly, a data processing pipeline will be developed to effectively process the collected data. This pipeline aims to enhance the quality and suitability of the data for training the ASR model. The steps involved may include sampling rate conversion, speech enhancement techniques, and aligning the audio data with corresponding text transcripts.
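As a minimal illustration of the sampling rate conversion step, the sketch below resamples a waveform by linear interpolation. The source and target rates, and the interpolation method itself, are assumptions for illustration; a production pipeline would more likely use a polyphase or sinc-based resampler.

```python
import numpy as np

def resample_linear(waveform: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a 1-D waveform via linear interpolation (illustrative only)."""
    duration = len(waveform) / orig_sr
    n_out = int(round(duration * target_sr))
    t_out = np.arange(n_out) / target_sr   # output sample times
    t_in = np.arange(len(waveform)) / orig_sr  # input sample times
    return np.interp(t_out, t_in, waveform)

# Assumed example: 1 second of a 440 Hz tone recorded at 44.1 kHz, converted
# to 16 kHz, a rate commonly used for ASR training data.
sr_in, sr_out = 44100, 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)
converted = resample_linear(tone, sr_in, sr_out)
print(len(converted))  # 16000 samples for 1 second of audio
```

The point of the step is uniformity: mixing 44.1 kHz broadcast recordings with 16 kHz app recordings in one training set would otherwise feed the model inconsistent feature frames.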

Thirdly, the thesis will explore advanced ASR model architectures suitable for traffic automatic speech recognition. Different architectures will be thoroughly studied and compared, with a specific focus on addressing the limitations associated with the Conformer-based encoder. The objective is to identify the optimal architecture that improves parallelism, reduces training time, minimizes computational resource requirements, and effectively handles long sequences.

Fourthly, extensive experimental evaluation will be conducted to assess the performance of the proposed approaches. This will involve training and testing the ASR models using the collected data and the selected architectures. Evaluation metrics such as transcription accuracy, computational efficiency, latency, and real-time factor will be employed. Comparative analysis will be conducted to determine the improvements achieved in comparison to the existing ASR model deployed in the UTraffic system.

Finally, thorough documentation and reporting are included within the scope of work. This encompasses comprehensive documentation of the research methodology, experimental setup, data collection process, pipeline development, model architecture exploration, and experimental results. A comprehensive report will be prepared to summarize the findings, draw conclusions, and provide recommendations based on the research conducted.


This thesis makes five significant contributions. Firstly, it contributes an enhanced dataset for training and evaluating the ASR model. Through the collection of additional traffic speech reports and the inclusion of transcripts corresponding to audio samples, the dataset reduces bias towards synthesized data and improves the ASR model's ability to recognize real-life traffic speech reports. This dataset serves as a valuable resource for future research in the field of traffic automatic speech recognition.

Secondly, it improves the accuracy of the ASR model. By exploring alternative architectures that are better suited to the dataset, the thesis enhances the transcription accuracy of the ASR model for traffic speech reports. Through the selection and implementation of a more suitable architecture, the ASR model achieves higher accuracy in recognizing and transcribing traffic-related speech.

Thirdly, it develops a methodological pipeline for processing the collected data. This pipeline includes important steps such as sampling rate conversion and speech enhancement, ensuring the quality and suitability of the data for training the ASR model. The developed pipeline serves as a valuable methodology for future researchers working in similar domains, providing guidelines for effective data processing in traffic automatic speech recognition.

Fourthly, it presents a comprehensive comparative analysis and provides insights into the proposed approaches. Through the evaluation of performance metrics such as transcription accuracy, computational efficiency, latency, and real-time factor, the thesis offers valuable insights into the improvements achieved by the proposed approaches. This comparative analysis guides future researchers in selecting and optimizing ASR model architectures for traffic automatic speech recognition.

Lastly, the thesis suggests future research directions to further enhance the performance of the UTraffic ASR model. Based on the analysis of results, the thesis provides recommendations for future research, encouraging the exploration of innovative approaches, the addressing of remaining challenges, and the advancement of the field of traffic automatic speech recognition.

In summary, these contributions significantly advance the understanding and capabilities of the ASR model in traffic-related tasks, providing valuable insights and directions for researchers and practitioners in the field.

The structure of this thesis consists of six chapters, which are as follows:

• Chapter 1 - INTRODUCTION: This chapter provides an overview of the research topic, presents the objectives and significance of the study, and outlines the structure of the thesis.

• Chapter 2 - BACKGROUND: In this chapter, the relevant background information and theoretical foundations related to traffic automatic speech recognition are discussed. It includes an overview of ASR models, speech recognition techniques, and the challenges specific to traffic-related speech recognition.

• Chapter 3 - RELATED WORK: This chapter reviews the existing literature and research studies related to ASR models in the context of traffic automatic speech recognition. It discusses the approaches, methodologies, and findings of previous works, highlighting the gaps and limitations that the current thesis aims to address.

• Chapter 4 - APPROACH: This chapter presents the proposed approaches and methodologies for improving the ASR model in the UTraffic system. It describes the alternative data collection methods, the development of the data processing pipeline, and the exploration of advanced ASR model architectures. The chapter provides a detailed explanation of the rationale behind each approach and the techniques employed.


• Chapter 5 - EXPERIMENT AND EVALUATION: This chapter presents the experimental setup, data analysis procedures, and evaluation metrics used to assess the performance of the proposed approaches. It includes details on the training and testing processes, performance evaluation criteria, and comparative analysis of the results obtained. The chapter provides insights into the effectiveness and efficiency of the proposed approaches.

• Chapter 6 - CONCLUSION: The final chapter summarizes the key findings, conclusions, and contributions of the thesis. It discusses the implications of the research, highlights the limitations, and suggests avenues for future research. This chapter concludes the thesis by emphasizing the significance of the work conducted and its impact on the field of traffic automatic speech recognition.

Each chapter is structured to provide a comprehensive and logical progression ofthe research, contributing to the overall understanding and advancement of the ASRmodel in the context of the UTraffic system.

Chapter 2

BACKGROUND

This chapter provides the necessary background information and theoretical foundations related to traffic automatic speech recognition. It offers an overview of ASR models, speech recognition techniques, and the unique challenges encountered in the domain of traffic-related speech recognition.

2.1 Traffic Data Processing Techniques

Traffic data processing techniques encompass a series of methodical procedures employed to preprocess and refine traffic-related data before its utilization in training ASR models. These techniques are pivotal for optimizing the data and ensuring its appropriateness for effective ASR model training. The initial step involves data collection, wherein speech data specifically associated with traffic scenarios is gathered. This entails recording audio samples from diverse sources, including traffic radio broadcasts, roadside microphones, or in-car voice recordings. Following data collection is the transcription and annotation phase. Traffic-related speech data necessitates transcription and annotation to establish a labeled dataset for training ASR models. Transcription involves the conversion of audio recordings into textual representations, while annotation entails the labeling of specific segments or events within the audio, such as identifying traffic-related terms, road names, or spoken commands.


Subsequently, data cleaning is performed to rectify any noise, distortion, or artifacts that could impede the quality and accuracy of ASR model training. This process involves eliminating background noise, filtering non-speech sounds, and addressing potential recording or transmission issues present in the raw data. To ensure consistency and uniformity within the dataset, data normalization techniques are employed. This involves standardizing the format of transcriptions, aligning timestamps with corresponding speech segments, and applying consistent labeling conventions.

Sampling rate conversion is then executed after transcription, cleaning, and normalization. This entails adjusting the audio signals to a desired sampling rate suitable for subsequent processing and ASR model training. The objective is to achieve a consistent and compatible sampling rate across the audio data obtained from various sources or recordings. Following sampling rate conversion, the data undergoes preprocessing and feature extraction, which includes additional speech enhancement techniques. These techniques aim to further enhance the quality of the speech signals by reducing noise, improving speech intelligibility, and extracting pertinent acoustic features, such as Mel-frequency cepstral coefficients or spectrograms. These features capture the linguistic and acoustic characteristics of the speech. These traffic data processing techniques are integral for the meticulous preparation of data before training ASR models. Each step serves a crucial role in enhancing the quality, relevance, and generalization capabilities of the models, ultimately leading to more accurate and robust speech recognition in traffic-related scenarios.
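The resampling and framing steps described above can be sketched as follows (a minimal NumPy illustration; the 16 kHz target and 25 ms/10 ms framing are common choices assumed here, and production pipelines would typically rely on a library such as librosa or torchaudio):

```python
import numpy as np

def resample(signal, orig_sr, target_sr):
    """Linear-interpolation resampling to a target sampling rate."""
    duration = len(signal) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.arange(len(signal)) / orig_sr
    new_t = np.arange(n_target) / target_sr
    return np.interp(new_t, old_t, signal)

def frame_log_energy(signal, sr, win_ms=25, hop_ms=10):
    """Slice the signal into overlapping frames and compute log energy per frame."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i * hop: i * hop + win] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

# One second of 8 kHz audio resampled to 16 kHz, then framed.
audio = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
audio16 = resample(audio, 8000, 16000)
feats = frame_log_energy(audio16, 16000)
```

A full front end would compute per-frame Mel filterbank or MFCC features instead of the plain log energy used here.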

2.2 Overview of ASR Models

In this section, a comprehensive overview of ASR models is provided, focusing on their fundamental concepts and principles. The discussion encompasses the key components that constitute ASR systems, namely acoustic modeling, language modeling, and decoding algorithms.


Figure 2.1: System Architecture of Automatic Speech Recognition [5].

2.2.1 Acoustic Modeling

The first component, acoustic modeling, pertains to the modeling of the relationship between acoustic signals and phonetic units, such as phonemes or subword units, in speech. Its objective is to accurately recognize speech by capturing the characteristics and variations in speech sounds. Acoustic models are typically constructed using machine learning techniques, such as hidden Markov models (HMMs) or deep neural networks (DNNs). These models learn to map acoustic features extracted from speech signals to the corresponding phonetic units. Additionally, Transformer-based acoustic models have gained popularity in ASR due to their ability to process information in parallel and leverage attention mechanisms. Transformer models employ self-attention mechanisms to capture global dependencies and selectively attend to relevant segments of the input speech. This enables them to effectively model long-range contextual information, resulting in improved ASR performance. Furthermore, the encoder component of the acoustic model, in its more advanced forms, adopts either the Convolution-augmented Transformer (Conformer) [6] for speech recognition or the Branchformer architecture [7].


This structure, referred to as a "sandwich structure," draws inspiration from Macaron-Net [8]. The Macaron-Net approach suggests replacing the original feed-forward layer within the Transformer block with two half-step feed-forward layers: one positioned before the attention layer, and the other positioned after it.
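The half-step arrangement can be sketched as follows (a toy NumPy sketch; the 1/2 scaling and the FFN-attention-FFN ordering follow the Macaron-Net description, while the dimensions and the zero-output attention placeholder are purely illustrative):

```python
import numpy as np

def feed_forward(x, w1, w2):
    # Position-wise feed-forward: linear -> ReLU -> linear.
    return np.maximum(x @ w1, 0.0) @ w2

def macaron_block(x, w1a, w2a, attention, w1b, w2b):
    """Transformer block with two half-step feed-forward layers
    sandwiching the attention layer, as in Macaron-Net."""
    x = x + 0.5 * feed_forward(x, w1a, w2a)   # first half-step FFN
    x = x + attention(x)                      # (self-)attention in the middle
    x = x + 0.5 * feed_forward(x, w1b, w2b)   # second half-step FFN
    return x

rng = np.random.default_rng(0)
d, d_ff, T = 8, 32, 5
w = [rng.normal(scale=0.1, size=s) for s in [(d, d_ff), (d_ff, d), (d, d_ff), (d_ff, d)]]
zero_attn = lambda x: np.zeros_like(x)  # placeholder: a real block uses self-attention
out = macaron_block(rng.normal(size=(T, d)), w[0], w[1], zero_attn, w[2], w[3])
```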

Branchformer Encoder

Despite the remarkable performance achieved by the Conformer architecture, certain limitations exist. Conformer employs a sequential combination of convolution and self-attention, resulting in a static single-branch structure that lacks interpretability and customizability. The utilization of local and global relationships in different encoder layers remains unclear. To address these concerns, a flexible, interpretable, and customizable alternative encoder called Branchformer is proposed for end-to-end ASR and Spoken Language Understanding (SLU) tasks [7]. Branchformer [9] presents a pioneering architecture comprising two parallel branches, specifically engineered to capture context at diverse ranges. One branch leverages self-attention or its variant to encapsulate long-range dependencies, while the other branch incorporates an advanced multi-layer perceptron with convolutional gating known as cgMLP. The cgMLP, initially introduced in the context of vision and language tasks, has demonstrated successful application in CTC-based speech recognition [10]. The Branchformer encoder model surpasses the performance of the Transformer and matches or even outperforms the state-of-the-art Conformer [7] in the ASR task. The overall architecture of the Branchformer encoder is depicted in Figure 2.3. The raw audio sequence undergoes initial processing by a frontend module to extract log-Mel features. Subsequently, a convolutional subsampling module downsamples the feature sequence in the temporal domain. Positional encodings are introduced to the subsampled feature sequence, enhancing the effectiveness of the self-attention modules in the subsequent Branchformer encoder blocks. A stack of N identical Branchformer blocks is employed to capture both global and local relationships within the feature sequence. Each Branchformer block consists of two parallel branches and a residual


Figure 2.3: The Architecture of the Branchformer Encoder Block [7].

connection. While the two branches share the same input, they focus on capturing relationships of different ranges, thereby complementing each other. Within each branch, the input is first normalized using layer normalization [11]. Then, global or local dependencies are extracted using attention or cgMLP, respectively, followed by dropout regularization [12]. The outputs of the two branches can be merged using concatenation or weighted average, with the original input added as a residual connection [7].
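The two-branch structure can be sketched as follows (a NumPy skeleton; the placeholder lambdas stand in for the real self-attention and cgMLP branches, and concatenation followed by a projection is one of the merge options mentioned above):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def branchformer_block(x, global_branch, local_branch, w_merge):
    """Two parallel branches over the same normalized input; outputs are
    concatenated, projected back to the model dimension, and added residually."""
    y_global = global_branch(layer_norm(x))  # e.g. self-attention (long-range)
    y_local = local_branch(layer_norm(x))    # e.g. cgMLP (local context)
    merged = np.concatenate([y_global, y_local], axis=-1) @ w_merge
    return x + merged

rng = np.random.default_rng(0)
T, d = 6, 8
w_merge = rng.normal(scale=0.1, size=(2 * d, d))
# Placeholder branches: a real model uses self-attention and cgMLP here.
out = branchformer_block(rng.normal(size=(T, d)),
                         lambda h: h, lambda h: np.tanh(h), w_merge)
```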

Hybrid CTC/Attention end-to-end ASR

Two prominent implementations of end-to-end ASR systems are connectionist temporal classification (CTC) [13], [14], [15] and attention-based encoder-decoder models [16], [17], [18]. These approaches differ in their methodologies for aligning acoustic frames with recognized symbols.

Attention-based methods employ an attention mechanism that dynamically aligns acoustic frames with the corresponding recognized symbols during the decoding process. This mechanism allows the ASR system to focus on relevant parts of the input


sequence while generating the output symbols. The attention mechanism enhances the ability of the ASR system to model long-range dependencies and effectively handle variations in the input speech. On the other hand, CTC is a framework that leverages Markov assumptions to efficiently solve sequential problems using dynamic programming. CTC-based ASR models do not explicitly require alignments between acoustic frames and output symbols during training. Instead, they estimate the alignment probabilities based on the input-output sequences, which are used to compute the overall likelihood of the correct output sequence. The CTC approach is particularly useful for handling variable-length input and output sequences. Both attention-based encoder-decoder models and CTC have demonstrated significant advancements in end-to-end ASR. Attention-based methods excel at capturing intricate dependencies between acoustic frames and output symbols, while CTC provides a computationally efficient framework for training ASR models without explicit alignments. The choice between these approaches depends on the specific requirements and constraints of the ASR task at hand.

CTC models, while effective in many cases, may face challenges in capturing fine-grained details and handling long-range dependencies. In contrast, attention-based models have shown proficiency in capturing contextual information and managing long-range dependencies. By employing an attention mechanism, these models selectively concentrate on different segments of the input sequence during the generation of the corresponding output sequence. This adaptability enables them to dynamically align the input and output sequences, attending to pertinent sections of the input when producing each output element. To harness the respective strengths of both CTC and attention-based models, a Hybrid CTC/Attention-based architecture has been developed [19]. This hybrid architecture combines the advantages of attention-based and CTC models in both training and decoding stages. Recent advancements have demonstrated promising performance using this hybrid approach within the realm of ASR, particularly in relation to the acoustic model component of ASR systems. By incorporating elements from both modeling paradigms, the hybrid architecture aims to combine their complementary strengths.
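The collapse rule underlying CTC (merge repeated symbols, then remove blanks) can be illustrated with a minimal greedy best-path decoder (a standard sketch, with index 0 assumed to be the blank symbol):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame best-path: merge repeats, then drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Per-frame argmax ids: blank blank 3 3 1 1 2 collapses to [3, 1, 2].
print(ctc_greedy_decode([0, 0, 3, 3, 1, 1, 2]))  # [3, 1, 2]
```

Note that a blank between two identical symbols keeps them distinct, e.g. `[1, 0, 1]` decodes to `[1, 1]`, which is how CTC represents repeated output tokens.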


2.2.2 Language Modeling

RNNs are a type of neural network architecture that can effectively capture sequential dependencies in data. In the context of language modeling, RNNs are designed to model the dependencies between words or phonetic units in a sequence of speech. They process the input sequence step by step, incorporating information from previous steps to generate predictions for the current step. Transformers, on the other hand, are a relatively newer architecture that has gained significant attention in the field of natural language processing (NLP), including language modeling. Transformers utilize self-attention mechanisms to capture global dependencies and relationships within the input sequence. This attention mechanism allows the model to attend to relevant parts of the input sequence when making predictions, enabling it to capture long-range contextual information effectively. By assigning probabilities to word sequences, language models assist in the decoding process by favoring more likely and coherent transcriptions.
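The scoring role of a language model can be illustrated with a toy count-based bigram model (a deliberately simple stand-in for the RNN and Transformer LMs discussed here; the vocabulary and sentences are invented):

```python
import math
from collections import Counter

class BigramLM:
    """Toy bigram language model with add-one smoothing; real ASR systems
    use neural LMs, but the scoring role in decoding is the same."""
    def __init__(self, corpus):
        self.vocab = {w for sent in corpus for w in sent} | {"<s>", "</s>"}
        self.bi = Counter()
        self.uni = Counter()
        for sent in corpus:
            tokens = ["<s>"] + sent + ["</s>"]
            for a, b in zip(tokens, tokens[1:]):
                self.bi[(a, b)] += 1
                self.uni[a] += 1

    def logprob(self, sent):
        tokens = ["<s>"] + sent + ["</s>"]
        V = len(self.vocab)
        return sum(math.log((self.bi[(a, b)] + 1) / (self.uni[a] + V))
                   for a, b in zip(tokens, tokens[1:]))

lm = BigramLM([["heavy", "traffic", "ahead"], ["traffic", "is", "heavy"]])
# A word order seen in training scores higher than a shuffled one.
likely = lm.logprob(["heavy", "traffic", "ahead"])
unlikely = lm.logprob(["ahead", "heavy", "traffic"])
```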

2.2.3 Decoding Algorithm

The third component of ASR systems is the decoding algorithms, which are responsible for searching and decoding the most probable transcription based on the acoustic and language models. Decoding involves aligning the acoustic input with the


language model to generate the final output. Various algorithms are utilized for this task, including Viterbi decoding [23] and beam search [24].

The Viterbi decoding algorithm is specifically employed in HMM-based speech recognition. It aims to determine the most likely sequence of hidden states given the observed features. In the HMM framework, hidden states represent phonemes or subword units, while observations denote the extracted acoustic features. The algorithm recursively computes state sequences, taking into account the transition probabilities between states and the emission probabilities of observations. It employs a trellis structure to efficiently explore the possible state sequences. At each time step, the algorithm evaluates all potential transitions from the previous step, calculating a score for each transition based on the previous state's score, the transition probability, and the emission probability of the current observation. The transition with the highest score is chosen as the most probable transition to reach the current state.
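The trellis recursion just described can be sketched as follows (a textbook Viterbi implementation over a toy two-state HMM; the states, probabilities, and observations are illustrative, not taken from this thesis, and practical systems work in log space to avoid numerical underflow):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence."""
    # best[t][s]: probability of the best path ending in state s at time t.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda x: x[1])
            best[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

path = viterbi(["x", "y", "y"], ["A", "B"],
               {"A": 0.6, "B": 0.4},
               {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}},
               {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}})
print(path)  # ['A', 'B', 'B']
```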

Beam search, on the other hand, is a widely used decoding algorithm in ASR systems. It explores multiple hypotheses or paths in parallel during the decoding process, keeping track of the most promising hypotheses based on a scoring criterion. The algorithm maintains a beam width, which determines the number of hypotheses considered at each decoding step. By considering multiple alternatives, the search space is expanded, and the most likely hypothesis is selected based on the scores from the acoustic and language models.
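The beam pruning just described can be sketched as follows (a generic sketch in which per-step symbol log-probabilities stand in for the combined acoustic and language-model scores):

```python
import math

def beam_search(step_log_probs, beam_width=2):
    """Keep the beam_width best partial hypotheses at each step; each entry of
    step_log_probs maps a symbol to its log-probability at that time step."""
    beams = [([], 0.0)]  # (hypothesis, cumulative log-probability)
    for dist in step_log_probs:
        candidates = [(hyp + [sym], score + lp)
                      for hyp, score in beams for sym, lp in dist.items()]
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"a": math.log(0.3), "b": math.log(0.7)}]
best_hyp, best_score = beam_search(steps, beam_width=2)[0]
print(best_hyp)  # ['a', 'b']
```

With a beam width equal to the vocabulary size this degenerates to exhaustive search; a width of 1 is greedy decoding.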

These decoding algorithms adopt different strategies to navigate the acoustic and language models, exploring alternative paths in order to identify the transcription that achieves the highest likelihood under the scores assigned by those models.

2.3 Challenges in Traffic-Related Speech Recognition

Traffic speech recognition poses unique challenges due to the type of ASR model used in the context of traffic speech recognition (TSR) and the characteristics of traffic-related speech data. Firstly, model complexity is a challenge. Traffic-related ASR models need to handle large vocabularies and complex linguistic structures specific to traffic domains, such as traffic signs, road names, and vehicle models. Designing and training models that can effectively handle this complexity while maintaining high accuracy can be challenging. Secondly, real-time performance is a concern. In traffic applications such as in-vehicle speech recognition or real-time traffic control, there is often a requirement for real-time processing and response. ASR models need to operate with low latency to provide timely transcriptions and enable quick decision-making. Balancing model complexity and computational efficiency to achieve real-time performance can be a challenge. Thirdly, scalability and deployment pose challenges. Deploying ASR models for traffic-related applications often involves large-scale systems that can handle high volumes of speech data in real time. Designing models that are scalable, can handle increased traffic loads, and can be efficiently deployed on various platforms and devices presents a challenge. Foremost, the paramount challenge in traffic-related ASR models is accuracy. Numerous factors intricately influence this challenge:

• Noisy environments in traffic scenarios introduce high levels of background noise, including road noise, engine sounds, honking, and other ambient sounds. These noisy environments degrade the speech signal quality, making it difficult for ASR models to accurately recognize and transcribe the spoken content.

• Variability in speech patterns adds to the accuracy challenge. Factors such as the driver's accent, speaking style, pronunciation variations, and speech impairments contribute to diverse speech patterns in traffic-related ASR. Robust ASR models capable of handling such variability and adapting to different speaking styles are essential.


• Acoustic variations in traffic environments also affect accuracy. Different in-car microphone configurations, varying distances between the speaker and the microphone, and different vehicle types introduce acoustic variations that impact ASR model performance. Adapting to these variations and maintaining consistent recognition performance poses a challenge.

• Limited training data specific to traffic-related speech is another challenge. Developing accurate traffic-related ASR models is hindered by the scarcity of labeled training data. Collecting and annotating large-scale, diverse, and representative datasets for traffic scenarios is time-consuming and costly. The limited training data can affect the generalization and accuracy of ASR models in traffic-related applications.

• Out-of-vocabulary (OOV) words pose a challenge in traffic-related ASR. ASR models may encounter words or terms that are not part of their vocabulary, such as new road names, traffic signs, or vehicle models. Handling OOV words and adapting the model to recognize these specialized terms is challenging, particularly when training data for these specific terms is limited.

Addressing these challenges requires the development of specialized ASR models tailored to traffic-related speech recognition tasks. Techniques such as domain adaptation, transfer learning, efficient model architectures, and deployment optimizations can help overcome these challenges and improve the performance of traffic-related ASR models. By presenting an extensive overview of traffic data processing techniques within ASR models, discussing the components and techniques employed in ASR, and addressing the specific challenges related to traffic-related speech recognition, this chapter establishes a robust foundation for comprehending the subsequent chapters and the research conducted in this thesis.


Chapter 3

RELATED WORK

Recently, there have been studies focusing on developing ASR systems specifically tailored for transcribing traffic-related communication. These studies aim to address the challenges of accurately transcribing various types of traffic-related audio data, including communication between traffic drivers [2] and communication between pilots and air-traffic controllers for air traffic control [25], [26]. To ensure precise transcriptions of traffic-related speech, these studies evaluate the performance of different ASR models by assessing acoustic and language modeling techniques [27] and noise reduction methods [28].

In recent years, the focus has also been on addressing the challenges of noise and environmental factors in transcribing traffic-related audio data [28]. While there are relatively few studies specifically on noise reduction for automatic speech recognition systems in road traffic, significant progress has been made in reducing background noise effects for ASR systems in other domains. These studies explore various noise-robust ASR techniques [29], including feature enhancement algorithms [30], noise-adaptive modeling approaches [31], and speech enhancement methods [32]. Statistical model-based approaches, such as Gaussian Mixture Models (GMMs) and HMMs [33], have been utilized to represent the statistical characteristics of clean speech and noise, enabling the estimation of clean speech from noisy signals. Deep learning-based approaches, particularly CNNs and RNNs, have also shown success in speech


enhancement tasks by learning complex mappings between noisy and clean speech using large-scale datasets [34]. Convolutional TasNet (Conv-TasNet) [35], a notable deep learning model used for speech enhancement (SE), excels in isolating speech from noise and interference, making it a compelling choice for improving the quality of audio data in real-time ASR models [36], [37].
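Conv-TasNet itself is a full neural separation network; as a much simpler illustration of the general idea of removing an estimated noise component from a noisy signal, a classical spectral-subtraction sketch is shown below (this is not the method used in this thesis; the frame size and signals are illustrative):

```python
import numpy as np

def spectral_subtraction(noisy, noise_estimate, n_fft=256):
    """Subtract an estimated noise magnitude spectrum from each noisy frame,
    flooring at zero, and reconstruct using the noisy phase."""
    out = np.zeros_like(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_estimate[:n_fft]))
    for start in range(0, len(noisy) - n_fft + 1, n_fft):
        frame = noisy[start:start + n_fft]
        spec = np.fft.rfft(frame)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[start:start + n_fft] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)),
                                                n=n_fft)
    return out

rng = np.random.default_rng(0)
t = np.arange(2048) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)   # a 440 Hz tone as the "speech"
noise = 0.3 * rng.normal(size=t.size)
enhanced = spectral_subtraction(clean + noise, noise)
```

Neural enhancers such as Conv-TasNet learn this kind of mapping end-to-end in the time domain rather than relying on a fixed noise spectrum estimate.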

Conventional automatic speech recognition (ASR) systems, based on hidden Markov models (HMMs) or deep neural networks (DNNs), are known for their high complexity. These systems are composed of multiple modules, including acoustic models, lexicons, and language models. However, recent advancements have introduced end-to-end ASR architectures with the aim of simplifying the traditional module-based approaches. These end-to-end ASR methods leverage paired acoustic and language data and do not rely on explicit linguistic knowledge. They train a single model using a unified algorithm, enabling the development of ASR systems without the need for expert knowledge [38]. In the domain of traffic-related speech recognition, there exist specialized end-to-end ASR approaches. For instance, studies have explored the feasibility and performance of an enhanced end-to-end architecture based on DeepSpeech2 [15] for transcribing traffic communication [39]. These investigations have conducted comparative analyses to evaluate transcription accuracy and computational efficiency in comparison to traditional hybrid models. Different types of end-to-end architectures for ASR have been studied, including connectionist temporal classification (CTC) [13], the recurrent neural network (RNN) transducer [40], attention-based encoder-decoder models [41], and their hybrid variants [21], [42]. Research [43] has also focused on hybrid CTC/attention end-to-end systems specifically for traffic ASR, aiming to leverage the alignment-based training approach of CTC and the attention-based decoding mechanism to improve accuracy and robustness in transcribing traffic-related speech. CTC and attention mechanisms represent two distinct approaches to modeling the acoustic component of an ASR system.

Regarding the acoustic model architectures used in traffic ASR, various types have been investigated. For instance, studies have delved into the utilization of convolutional neural networks (CNNs) in traffic ASR tasks [44], exploring different CNN architectures such as multiscale CNNs to effectively capture local and global acoustic features from traffic-related audio data. Another line of research has focused on Transformer-based acoustic models for traffic speech recognition. Transformers, which employ self-attention mechanisms, have demonstrated remarkable performance in capturing the long-range global context crucial for accurate speech recognition and comprehension [45]. However, self-attention exhibits quadratic time and memory complexity with respect to the sequence length. Therefore, research [39] has aimed to mitigate this limitation and assess the performance of Transformer-based models in transcribing traffic communication, comparing them against traditional acoustic models. Furthermore, the resurgence of multi-layer perceptrons (MLPs) [46] in language and vision tasks has attracted attention, offering promising avenues for exploration. MLP-based models have shown comparable performance to Transformers, although they are inherently limited to fixed-length inputs. To overcome this limitation, some MLP modules can be replaced with convolutions or similar operations. Additionally, researchers have explored ways to integrate both local and global contextual relationships within a unified model. Notable architectures in this regard include the Lite Transformer [47] and the Convolution-augmented Transformer (Conformer) [6], which combines self-attention and convolution in a unique manner. The Conformer architecture has emerged as the new state-of-the-art approach for speech recognition tasks, surpassing previous Transformer Transducer models [48]. However, the Conformer architecture's single-branch design presents challenges in analyzing how local and global interactions are utilized across different layers. It also enforces a fixed interleaving pattern between self-attention and convolution, which may not always be optimal. Recent studies in [49] have
proposed reordering the self-attention and feed-forward layers to enhance the performance of Transformers. Another limitation of the Conformer architecture lies in its quadratic complexity with respect to sequence length, primarily due to the self-attention mechanism. Consequently, research [2] has focused on Conformer-based encoders for acoustic models in traffic ASR, assessing the effectiveness of these encoder architectures in accurately transcribing traffic-related speech. However, these studies face the challenge of domain adaptation for ASR models, particularly in the context of road traffic. To address the limitations discussed above with the Conformer architecture, a novel architecture called Branchformer [7] was introduced in 2022. The primary objective of the Branchformer architecture is to redesign the existing structure to improve training stability, provide flexibility for accommodating various attention mechanisms, facilitate interpretability for insightful design analysis, and allow for varying levels of inference complexity within a single model. Branchformer serves as a more flexible, interpretable, and customizable encoder alternative to the Conformer encoder. It incorporates parallel branches to model different dependencies at various ranges, enabling effective end-to-end speech processing.

In recent advancements, the utilization of external language models has exhibited notable enhancements in accuracy for neural machine translation [50] and end-to-end ASR [21], [51]. This approach, referred to as shallow fusion, involves the integration of the decoder network with an external language model operating in the log-probability domain during the decoding process. The effectiveness of recurrent neural network language models (RNN-LMs) has been particularly demonstrated in Japanese and Mandarin Chinese tasks, achieving comparable or superior accuracy when compared to systems based on deep neural networks and hidden Markov models (DNN/HMM) [21]. To facilitate the joint prediction of subsequent characters, the RNN-LM has been designed as a character-based language model, effectively combined with the decoder network.
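Shallow fusion as described reduces to adding a weighted language-model log-probability to the ASR score during decoding; the sketch below illustrates this (the tokens, scores, and the 0.3 weight are all invented for illustration):

```python
import math

def shallow_fusion_score(asr_log_probs, lm_log_probs, hypothesis, lm_weight=0.3):
    """Combine per-token ASR and LM log-probabilities: log p_asr + lambda * log p_lm."""
    return sum(asr_log_probs[tok] + lm_weight * lm_log_probs[tok]
               for tok in hypothesis)

# Two acoustically near-tied Vietnamese candidates: the LM breaks the tie
# in favor of the more plausible word.
asr = {"kẹt": math.log(0.50), "két": math.log(0.48)}
lm = {"kẹt": math.log(0.30), "két": math.log(0.05)}
best = max(["kẹt", "két"], key=lambda t: shallow_fusion_score(asr, lm, [t]))
```

In a real decoder this combined score is what the beam search ranks hypotheses by at each expansion step.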

Overall, these related works contribute to the advancement of ASR in accurately transcribing traffic-related communication. By exploring various acoustic model architectures, noise reduction techniques, and ASR system designs, researchers aim to improve transcription accuracy, robustness, and adaptability to the unique challenges presented by traffic environments.

Chapter 4

APPROACH

Figure 4.1: Taxonomy of Methods for Constructing our End-to-End ASR System (figure reproduced as a list):

• Data source: VOH 95.6 MHz channel reports
• Data processing: manual transcription; conversion of audio files into single-channel format; sampling rate setting; Conv-TasNet speech enhancement
• Training and decoding: attention-based encoder-decoder; hybrid CTC/attention end-to-end; multiobjective training; joint decoding; use of a language model
• Evaluation metrics: WER; real-time factor; latency
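Of the evaluation metrics in Figure 4.1, WER is the primary one: the word-level Levenshtein distance (substitutions + deletions + insertions) normalized by the reference length. A standard implementation, shown as a sketch with invented example strings:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[-1][-1] / len(ref)

print(wer("ket xe tren cau", "ket xe tren duong"))  # 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions relative to the reference.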

To address the aforementioned challenges in deploying the ASR model within the UTraffic system, we propose several methods. However, prior to delving into the intricacies of our methodologies, it is crucial to emphasize our deliberate choice to
