VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
STUDENT RESEARCH REPORT
TOPIC: Algorithm for Sound Direction Detection and
Speech Recognition in Robotic Systems
Team Leader: Dao Tuan Minh ID: 20070749
Class: MIS2020B
Hanoi - April, 2024
TEAM LEADER INFORMATION
I Student profile
- Full name: Dao Tuan Minh
- Date of birth: 12/08/2002
- Place of birth: Dong Da, Ha Noi
- Class: MIS2020B
- Program: Management Information Systems
- Address: Dong Da, Ha Noi
- Phone no: 0352632756
- Email: 20070749@vnu.edu.vn
II Academic results:
- Academic year: 2020 - 2024
- Overall Score: 3.12
- Academic Rating: Good
III Other achievements:
………
………
………
1 Project Code: CN.NV.SV.23_19
2 Member List:
Phung Hoang, ID: 20070832, Email: phunghoangvnuit@gmail.com, Class: ICE2020B, Phone: 0338695456
3 Advisor: Dr. Nguyen Dang Khoa
TABLE OF CONTENTS
1 INTRODUCTION 2
  1.1 Literature review 2
  1.2 Problem statement 3
  1.3 Objectives 4
2 PROPOSED ALGORITHM (Algorithm for Sound Direction Detection - TDoA) 5
3 IMPLEMENTATION AND DISCUSSION 7
  3.1 Hardware Setup 8
  3.2 Software Configuration 8
  3.3 Testing and Calibration 8
  3.4 Discussion 8
  3.5 Summary of Findings 9
  3.6 Contributions 10
  3.7 Future Scope 10
4 CONCLUSION 10
REFERENCES 11
LIST OF FIGURES
Figure 1 Inter-class similarity and intra-class discrepancy
Advisor (Signed with full name): Dr. Nguyễn Đăng Khoa
Student (Signed with full name): Đào Tuấn Minh
Algorithm for Sound Direction Detection and Speech Recognition in Robotic Systems

Dao Tuan Minh¹, Phung Hoang¹, and Dr. Nguyen Dang Khoa¹

¹Vietnam National University, Hanoi – International School, Hanoi, Vietnam
Abstract. The increasing complexity of robotic systems calls for sophisticated capabilities in environmental interaction, particularly in sound-based navigation and communication. This paper presents a comprehensive system designed to enhance robotic abilities in determining sound direction and integrating speech recognition. Utilizing advanced sensor arrays and cutting-edge algorithms, the system accurately identifies sound origins and processes speech, enabling robots to respond to auditory stimuli effectively. The implementation involves a combination of directional microphones and real-time processing techniques, including beamforming and machine learning models for speech recognition. This paper details the design, development, and testing of the system, highlighting its dual functionality. The experimental results demonstrate the system's precision in sound localization and its efficacy in recognizing spoken commands under various conditions. The discussion emphasizes the system's potential to revolutionize interactions between robots and their operational environments, contributing to more autonomous and responsive robotic applications. The paper concludes with insights into the implications for future robotic systems and potential areas for further research in robotic auditory processing.
1 Introduction
Virtual assistants have become an integral part of everyday life, enhancing user interaction through voice-controlled technologies. As the sophistication of these devices increases, there is a notable demand for advanced auditory interaction technologies, particularly in precise audio positioning. This ability, which enables virtual assistants to accurately determine the direction of incoming sounds, is crucial for improving their responsiveness and utility. Accurately identifying the source of voice commands enhances the user experience by allowing the device to respond more effectively and interact in a contextually aware manner.

Significant research has been conducted on the integration of sound localization techniques into virtual assistants. According to Algazi et al. (2001), implementing sound localization methods enhances the interaction quality between virtual assistants and users by improving the accuracy of voice command detection, especially in environments with background noise (1). Additionally, Hawley et al. (2007) discuss how virtual assistants can utilize 3D audio cues to differentiate between multiple simultaneous speakers, increasing their effectiveness in dynamic and complex acoustic scenarios (2).

The system proposed in this paper builds upon these foundational insights, aiming to significantly improve audio positioning for virtual assistants through the use of advanced signal processing algorithms and spatial audio techniques. This enhancement is anticipated to revolutionize the interaction between virtual assistants and their human users, particularly in smart home settings and interactive learning environments. This paper details the development and practical implementation of a novel sound orientation system tailored for virtual assistants. It explores the design, functionality, deployment challenges, and potential broad-scale applications of this technology, emphasizing its impact on the future trajectory of virtual assistant capabilities.
1.1 Literature Survey
The literature on sound direction and speech recognition technologies is extensive and multifaceted, reflecting their critical roles in advancing human-computer interaction. A central aspect of these technologies is their reliance on advanced signal processing and machine learning algorithms. The development and enhancement of algorithms like Fast Fourier Transforms (FFT) and Hidden Markov Models (HMM) have significantly improved the accuracy and efficiency of these systems. However, most current implementations are typically optimized for controlled environments and may not perform well in noisy or dynamic settings. Studies have extensively evaluated various techniques for sound localization, which is pivotal for determining the direction of a sound source. Techniques such as beamforming and Time Difference of Arrival (TDoA) are widely used due to their effectiveness in multi-speaker environments. These methods, however, often require complex and expensive hardware setups, limiting their accessibility for widespread use.

In the realm of speech recognition, considerable advancements have been made in developing systems that can accurately convert spoken language into text. Systems like Google's speech-to-text API leverage deep neural networks to enhance their learning processes and accuracy rates over time. Nonetheless, these systems still face significant challenges in dealing with accents, dialects, and real-time processing of spontaneous speech. The integration of sound localization with speech recognition has been less explored but shows tremendous potential for creating more intuitive and interactive systems. For instance, identifying the direction of the speaker can significantly enhance the responsiveness of virtual assistants in environments with multiple potential speakers. This approach can also mitigate issues related to overlapping speech and background noise, which are common in public and domestic spaces. A particularly effective example of integrated systems comes from research on auditory scene analysis, where systems attempt to mimic human auditory capabilities to segregate and process sounds from different sources simultaneously. These systems use a combination of spatial cues and sound recognition algorithms to improve the selectivity and accuracy of voice-activated systems.

This paper seeks to build upon these foundational studies by proposing a novel system that combines advanced sound direction algorithms with robust speech recognition capabilities. The system is designed to be adaptable to various acoustic environments, enhancing its practical applications in fields ranging from interactive virtual assistants to emergency response systems. The proposed solution aims to address current limitations by implementing a scalable and cost-effective design, suitable for both high- and low-resource settings. The paper details the design and development processes, highlights the unique integration of localization and recognition technologies, and discusses the potential broad-scale impacts of this innovative approach on future human-computer interactions.
1. Sound Localization - Techniques like beamforming and Time Difference of Arrival (TDoA) are effective but often require complex, costly setups. Performance is typically optimized for controlled environments.
2. Speech Recognition - Advances in deep neural networks have significantly improved speech recognition accuracy. Challenges remain with accents, dialects, and real-time speech processing.
3. Integration Challenges - Few studies combine sound localization with speech recognition. There is potential to enhance system responsiveness in multi-speaker environments.
4. Technological Limitations - Existing systems struggle with noisy, dynamic environments. Systems are needed that can adapt to various acoustic settings.
5. Innovative Applications - Emerging research on auditory scene analysis mimics human auditory capabilities to segregate and process sounds from different sources. This improves voice system selectivity and accuracy.
6. Proposed System - Aims to combine sound direction algorithms with speech recognition in a scalable, cost-effective design. Adaptable to different environments, enhancing practical applications.

Table 1: Summary of Key Findings
1.2 Problem Statement
The integration of sophisticated audio processing technologies in environments where multiple speakers and various audio sources are present, such as public spaces or busy home settings, presents a unique set of challenges. While technologies to determine sound direction and recognize speech independently are well developed, their integration into a unified system that can operate effectively in noisy, dynamic environments is less explored and fraught with difficulties. Current systems often struggle with accurately identifying the direction of sound in real time, particularly in situations where multiple overlapping conversations occur, which is critical for the responsiveness of interactive systems like virtual assistants. Moreover, existing speech recognition technologies, although advanced in quiet, controlled settings, still face significant challenges when dealing with background noise, varying accents, and the natural flow of human speech in public or chaotic environments. This reduces their utility in practical applications where interaction with technology via voice commands is becoming more common. Additionally, the deployment of such integrated systems frequently involves complex and costly hardware setups or requires significant computational resources, which can be prohibitive for widespread implementation. There is also a notable gap in solutions that are both scalable and adaptable across different types of environments and devices, further complicating the deployment of effective sound localization and speech recognition systems.

Given these challenges, there is a pressing need for a robust, adaptable, and cost-effective solution that can enhance sound direction determination and speech recognition in diverse settings. Our proposed system aims to address these issues by utilizing advanced signal processing algorithms and machine learning techniques to improve the accuracy and efficiency of these technologies, making them more accessible and practical for everyday use in various environments. The proposed approach seeks to revolutionize how interactive systems perceive and respond to human speech, thereby enhancing user experience and broadening the application possibilities of voice-controlled technologies.
1.3 Objectives
The primary aim of this research is to develop and validate a system that integrates sound direction determination and speech recognition technologies to enhance human-computer interaction in various environments. Through the design and implementation of this novel system, we aim to achieve the following specific objectives:

• To engineer a unified system that effectively combines sound localization techniques with advanced speech recognition algorithms, enhancing the interactive capabilities of devices in environments with multiple speakers and background noise.
• To prototype and evaluate the system within real-world settings, assessing its accuracy and responsiveness in dynamic and potentially noisy environments such as public spaces and homes.
• To highlight the system's advantages in terms of processing efficiency, cost-effectiveness, and scalability, establishing it as a practical option for both commercial and personal applications.
• To provide a comprehensive description of the system's design and development processes, serving as a blueprint for future research and development in the field of integrated audio processing systems.
• To collect and analyze feedback from users interacting with the system in various environments, gauging its impact on user experience and its effectiveness in different situational contexts.
• To discuss potential future improvements and broader applications of the system, considering ongoing advancements in audio processing and machine learning technologies.

Ultimately, the goal is to contribute to the advancement of interactive technologies that understand and respond more intelligently to human speech, thereby improving the practical utility and accessibility of voice-controlled systems across diverse settings.
2 PROPOSED ALGORITHM (Algorithm for Sound Direction Detection - TDoA):
Step 1. Signal Acquisition
Let $s_i(t)$ be the signal received by the $i$-th microphone. If there are $N$ microphones, then $i = 1, 2, \dots, N$.
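As an illustration, a minimal acquisition sketch in Python, assuming the third-party sounddevice package and an array exposed as a single N-channel audio device; the sampling rate and channel count are illustrative assumptions, not values from this report:

```python
# Minimal multi-channel acquisition sketch (assumes the 'sounddevice'
# package and a microphone array exposed as one N-channel device).
import sounddevice as sd

FS = 16_000   # sampling rate in Hz -- an illustrative assumption
N_MICS = 4    # number of microphones N -- an illustrative assumption

def acquire(duration_s=1.0):
    """Record one frame; column i of the result is s_i(t), i = 1..N."""
    frames = sd.rec(int(duration_s * FS), samplerate=FS, channels=N_MICS)
    sd.wait()  # block until the recording is complete
    return frames
```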
Step 2. Preprocessing
Each signal $s_i(t)$ is pre-processed through filtering to reduce noise. A common filter used is a band-pass filter, which can be represented as
$$s_{i,\mathrm{filtered}}(t) = \mathrm{BPF}(s_i(t)),$$
where BPF denotes the band-pass filtering operation.
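A minimal sketch of this step, assuming a zero-phase Butterworth band-pass filter from SciPy; the 300-3400 Hz speech band and the filter order are illustrative choices, not specified in the report:

```python
# Band-pass preprocessing sketch (the BPF operation above).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_filter(signal, fs, low_hz=300.0, high_hz=3400.0, order=4):
    """Return s_{i,filtered}(t) for one channel sampled at fs Hz."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)  # forward-backward filtering avoids phase delay
```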
Step 3. Cross-Correlation
For each pair of microphones $(i, j)$, we compute the cross-correlation function $R_{ij}(\tau)$:
$$R_{ij}(\tau) = \int_{-\infty}^{\infty} s_{i,\mathrm{filtered}}(t) \cdot s_{j,\mathrm{filtered}}(t + \tau)\, dt,$$
where $\tau$ is the time lag.
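In discrete time the integral becomes a sum over samples. A NumPy sketch, with the indexing convention chosen to match the definition above (the peak lag is positive when channel $j$ receives the signal later):

```python
# Discrete-time version of R_ij(tau), sketched with NumPy.
import numpy as np

def cross_correlation(si, sj):
    """Return R_ij over all integer lags, plus the lag values in samples."""
    # np.correlate(sj, si, 'full')[k] ~ sum_t s_i(t) * s_j(t + tau)
    r = np.correlate(sj, si, mode="full")
    lags = np.arange(-(len(si) - 1), len(sj))
    return r, lags
```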
Step 4. Generalized Cross-Correlation with Phase Transform (GCC-PHAT)
To make the time-delay estimate robust in the presence of noise, we normalize the cross-spectrum and use only the phase information:
$$R_{ij}^{\mathrm{GCC\text{-}PHAT}}(\tau) = \mathcal{F}^{-1}\!\left[\frac{\mathcal{F}\{s_{i,\mathrm{filtered}}(t)\} \cdot \mathcal{F}\{s_{j,\mathrm{filtered}}(t)\}^{*}}{\left|\mathcal{F}\{s_{i,\mathrm{filtered}}(t)\} \cdot \mathcal{F}\{s_{j,\mathrm{filtered}}(t)\}^{*}\right|}\right],$$
where $\mathcal{F}$ denotes the Fourier transform and $*$ denotes the complex conjugate.
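A frequency-domain sketch of GCC-PHAT using NumPy FFTs. One note on conventions: to stay consistent with the definition of $R_{ij}(\tau)$ in Step 3 (peak at positive $\tau$ when microphone $j$ receives the signal later), the code conjugates the spectrum of channel $i$; texts differ on which factor carries the conjugate. The small epsilon guarding the division is an added safeguard, not part of the formula:

```python
# GCC-PHAT sketch: whiten the cross-spectrum to keep only phase, then
# invert the FFT and reorder so lags run from -max_lag to +max_lag.
import numpy as np

def gcc_phat(si, sj, max_lag, eps=1e-12):
    """Return R^GCC-PHAT_ij for lags -max_lag..+max_lag (in samples)."""
    nfft = 1 << int(np.ceil(np.log2(len(si) + len(sj) - 1)))  # no circular wrap
    Si = np.fft.rfft(si, nfft)
    Sj = np.fft.rfft(sj, nfft)
    cross = np.conj(Si) * Sj               # spectrum of R_ij from Step 3
    r = np.fft.irfft(cross / (np.abs(cross) + eps), nfft)
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return r, lags
```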
Step 5. Time-Delay Estimation
The time difference of arrival $\Delta t_{ij}$ between microphones $i$ and $j$ is estimated by finding the value of $\tau$ that maximizes $R_{ij}^{\mathrm{GCC\text{-}PHAT}}(\tau)$:
$$\Delta t_{ij} = \arg\max_{\tau} R_{ij}^{\mathrm{GCC\text{-}PHAT}}(\tau).$$
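Expressed in code, building on the hypothetical gcc_phat() sketch above; bounding the search to the lags physically allowed by the microphone spacing is an added refinement, not something stated in the text:

```python
# Step 5 sketch: Δt_ij is the lag (in seconds) that maximizes GCC-PHAT.
import numpy as np

def estimate_tdoa(si, sj, fs, mic_distance_m, c=343.0):
    """Return the estimated time difference of arrival Δt_ij in seconds."""
    max_lag = int(np.ceil(mic_distance_m / c * fs))  # physical bound on |Δt_ij|
    r, lags = gcc_phat(si, sj, max_lag)              # sketch from Step 4
    return lags[np.argmax(r)] / fs                   # arg max over tau
```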
Step 6. Direction Calculation
Assuming a planar array geometry and the speed of sound $c$, the angle of arrival $\theta$ can be estimated using the differences in time of arrival and the positions of the microphones $(x_i, y_i)$ and $(x_j, y_j)$:
$$\theta \approx \arcsin\!\left(\frac{c \cdot \Delta t_{ij}}{\sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}}\right).$$
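The formula translates directly into code; clipping the arcsine argument to [-1, 1] is an added safeguard, since noisy $\Delta t_{ij}$ estimates can push the ratio slightly out of range:

```python
# Step 6 sketch: far-field angle of arrival from Δt_ij and mic positions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def angle_of_arrival(dt_ij, pos_i, pos_j, c=SPEED_OF_SOUND):
    """Estimate theta (in radians) for the microphone pair (i, j)."""
    baseline = np.hypot(pos_j[0] - pos_i[0], pos_j[1] - pos_i[1])
    ratio = np.clip(c * dt_ij / baseline, -1.0, 1.0)  # guard arcsin domain
    return np.arcsin(ratio)

# Example: two mics 10 cm apart on the x-axis with Δt_ij = 0.2 ms gives
# angle_of_arrival(2e-4, (0.0, 0.0), (0.1, 0.0)) ≈ 0.76 rad (about 43°).
```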