VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
STUDENT RESEARCH REPORT
TOPIC: Algorithm for Sound Direction Detection and
Speech Recognition in Robotic Systems
Team Leader: Dao Tuan Minh ID: 20070749
Class: MIS2020B
Hanoi - April, 2024
TEAM LEADER INFORMATION
I Student profile
- Full name: Dao Tuan Minh
- Date of birth: 12/08/2002
- Place of birth: Dong Da, Ha Noi
- Class: MIS2020B
- Program: Management Information Systems
- Address: Dong Da, Ha Noi
- Phone no: 0352632756
- Email: 20070749@vnu.edu.vn
II Academic results:
- Academic year: 2020 - 2024
- Overall Score: 3.12
- Academic Rating: Good
III Other achievements:
………
………
………
1 Project Code: CN.NV.SV.23_19
2 Member List:
Phung Hoang, ID: 20070832, Email: phunghoangvnuit@gmail.com, Class: ICE2020B, Phone: 0338695456
3 Advisor: Dr. Nguyen Dang Khoa
TABLE OF CONTENTS
1 INTRODUCTION 2
  1.1 Literature review 2
  1.2 Problem statement 3
  1.3 Objectives 4
2 PROPOSED ALGORITHM (Algorithm for Sound Direction Detection - TDoA) 5
3 IMPLEMENTATION AND DISCUSSION 7
  3.1 Hardware Setup 8
  3.2 Software Configuration 8
  3.3 Testing and Calibration 8
  3.4 Discussion 8
  3.5 Summary of Findings 9
  3.6 Contributions 10
  3.7 Future Scope 10
4 CONCLUSION 10
REFERENCES 11
LIST OF FIGURES
Figure 1 Inter-class similarity and intra-class discrepancy
Advisor (Signed with full name): Dr. Nguyễn Đăng Khoa
Student (Signed with full name): Đào Tuấn Minh
Algorithm for Sound Direction Detection and Speech Recognition in Robotic Systems

Dao Tuan Minh¹, Phung Hoang¹, and Dr. Nguyen Dang Khoa¹

¹Vietnam National University, Hanoi – International School, Hanoi, Vietnam
Abstract. The increasing complexity of robotic systems calls for sophisticated capabilities in environmental interaction, particularly in sound-based navigation and communication. This paper presents a comprehensive system designed to enhance robotic abilities in determining sound direction and integrating speech recognition. Utilizing advanced sensor arrays and cutting-edge algorithms, the system accurately identifies sound origins and processes speech, enabling robots to respond to auditory stimuli effectively. The implementation involves a combination of directional microphones and real-time processing techniques, including beamforming and machine learning models for speech recognition. This paper details the design, development, and testing of the system, highlighting its dual functionality. The experimental results demonstrate the system's precision in sound localization and its efficacy in recognizing spoken commands under various conditions. The discussion emphasizes the system's potential to revolutionize interactions between robots and their operational environments, contributing to more autonomous and responsive robotic applications. The paper concludes with insights into the implications for future robotic systems and potential areas for further research in robotic auditory processing.
1 Introduction
Virtual assistants have become an integral part of everyday life, enhancing user interaction through voice-controlled technologies. As the sophistication of these devices increases, there is a notable demand for advanced auditory interaction technologies, particularly in precise audio positioning. This ability, which enables virtual assistants to accurately determine the direction of incoming sounds, is crucial for improving their responsiveness and utility. Accurately identifying the source of voice commands enhances the user experience by allowing the device to respond more effectively and interact in a contextually aware manner.

Significant research has been conducted on the integration of sound localization techniques into virtual assistants. According to Algazi et al. (2001), implementing sound localization methods enhances the interaction quality between virtual assistants and users by improving the accuracy of voice command detection, especially in environments with background noise (1). Additionally, Hawley et al. (2007) discuss how virtual assistants can utilize 3D audio cues to differentiate between multiple simultaneous speakers, increasing their effectiveness in dynamic and complex acoustic scenarios (2).

The system proposed in this paper builds upon these foundational insights, aiming to significantly improve audio positioning for virtual assistants through the use of advanced signal processing algorithms and spatial audio techniques. This enhancement is anticipated to revolutionize the interaction between virtual assistants and their human users, particularly in smart home settings and interactive learning environments. This paper details the development and practical implementation of a novel sound orientation system tailored for virtual assistants. It explores the design, functionality, deployment challenges, and potential broad-scale applications of this technology, emphasizing its impact on the future trajectory of virtual assistant capabilities.
1.1 Literature Survey
The literature on sound direction and speech recognition technologies is extensive and multifaceted, reflecting their critical roles in advancing human-computer interaction. A central aspect of these technologies is their reliance on advanced signal processing and machine learning algorithms. The development and enhancement of algorithms like Fast Fourier Transforms (FFT) and Hidden Markov Models (HMM) have significantly improved the accuracy and efficiency of these systems. However, most current implementations are typically optimized for controlled environments and may not perform well in noisy or dynamic settings. Studies have extensively evaluated various techniques for sound localization, which is pivotal for determining the direction of a sound source. Techniques such as beamforming and Time Difference of Arrival (TDoA) are widely used due to their effectiveness in multi-speaker environments. These methods, however, often require complex and expensive hardware setups, limiting their accessibility for widespread use.

In the realm of speech recognition, considerable advancements have been made in developing systems that can accurately convert spoken language into text. Systems like Google's speech-to-text API leverage deep neural networks to enhance their learning processes and accuracy rates over time. Nonetheless, these systems still face significant challenges in dealing with accents, dialects, and real-time processing of spontaneous speech. The integration of sound localization with speech recognition has been less explored but shows tremendous potential for creating more intuitive and interactive systems. For instance, identifying the direction of the speaker can significantly enhance the responsiveness of virtual assistants in environments with multiple potential speakers. This approach can also mitigate issues related to overlapping speech and background noise, which are common in public and domestic spaces. A particularly effective example of integrated systems comes from research on auditory scene analysis, where systems attempt to mimic human auditory capabilities to segregate and process sounds from different sources simultaneously. These systems use a combination of spatial cues and sound recognition algorithms to improve the selectivity and accuracy of voice-activated systems.

This paper seeks to build upon these foundational studies by proposing a novel system that combines advanced sound direction algorithms with robust speech recognition capabilities. The system is designed to be adaptable to various acoustic environments, enhancing its practical applications in fields ranging from interactive virtual assistants to emergency response systems. The proposed solution aims to address current limitations by implementing a scalable and cost-effective design, suitable for both high- and low-resource settings. The paper details the design and development processes, highlights the unique integration of localization and recognition technologies, and discusses the potential broad-scale impacts of this innovative approach on future human-computer interactions.
1. Sound Localization - Techniques like beamforming and Time Difference of Arrival (TDoA) are effective but often require complex, costly setups. Performance is typically optimized for controlled environments.
2. Speech Recognition - Advances in deep neural networks have significantly improved speech recognition accuracy. Challenges remain with accents, dialects, and real-time speech processing.
3. Integration Challenges - Few studies combine sound localization with speech recognition. There is potential to enhance system responsiveness in multi-speaker environments.
4. Technological Limitations - Existing systems struggle with noisy, dynamic environments. Systems are needed that can adapt to various acoustic settings.
5. Innovative Applications - Emerging research on auditory scene analysis mimics human auditory capabilities to segregate and process sounds from different sources. This improves voice system selectivity and accuracy.
6. Proposed System - Aims to combine sound direction algorithms with speech recognition in a scalable, cost-effective design. Adaptable to different environments, enhancing practical applications.

Table 1: Summary of Key Findings
1.2 Problem Statement
The integration of sophisticated audio processing technologies in environments where multiple speakers and various audio sources are present, such as public spaces or busy home settings, presents a unique set of challenges. While technologies to determine sound direction and recognize speech independently are well developed, their integration into a unified system that can operate effectively in noisy, dynamic environments is less explored and fraught with difficulties. Current systems often struggle with accurately identifying the direction of sound in real time, particularly in situations where multiple overlapping conversations occur, which is critical for the responsiveness of interactive systems like virtual assistants. Moreover, existing speech recognition technologies, although advanced in quiet, controlled settings, still face significant challenges when dealing with background noise, varying accents, and the natural flow of human speech in public or chaotic environments. This reduces their utility in practical applications where interaction with technology via voice commands is becoming more common. Additionally, the deployment of such integrated systems frequently involves complex and costly hardware setups or requires significant computational resources, which can be prohibitive for widespread implementation. There is also a notable gap in solutions that are both scalable and adaptable across different types of environments and devices, further complicating the deployment of effective sound localization and speech recognition systems.

Given these challenges, there is a pressing need for a robust, adaptable, and cost-effective solution that can enhance sound direction determination and speech recognition in diverse settings. Our proposed system aims to address these issues by utilizing advanced signal processing algorithms and machine learning techniques to improve the accuracy and efficiency of these technologies, making them more accessible and practical for everyday use in various environments. The proposed approach seeks to revolutionize how interactive systems perceive and respond to human speech, thereby enhancing user experience and broadening the application possibilities of voice-controlled technologies.
1.3 Objectives
The primary aim of this research is to develop and validate a system that integrates sound direction determination and speech recognition technologies to enhance human-computer interaction in various environments. Through the design and implementation of this novel system, we aim to achieve the following specific objectives:

• To engineer a unified system that effectively combines sound localization techniques with advanced speech recognition algorithms, enhancing the interactive capabilities of devices in environments with multiple speakers and background noise.
• To prototype and evaluate the system within real-world settings, assessing its accuracy and responsiveness in dynamic and potentially noisy environments such as public spaces and homes.
• To highlight the system's advantages in terms of processing efficiency, cost-effectiveness, and scalability, establishing it as a practical option for both commercial and personal applications.
• To provide a comprehensive description of the system's design and development processes, serving as a blueprint for future research and development in the field of integrated audio processing systems.
• To collect and analyze feedback from users interacting with the system in various environments, gauging its impact on user experience and its effectiveness in different situational contexts.
• To discuss potential future improvements and broader applications of the system, considering ongoing advancements in audio processing and machine learning technologies.

Ultimately, the goal is to contribute to the advancement of interactive technologies that understand and respond more intelligently to human speech, thereby improving the practical utility and accessibility of voice-controlled systems across diverse settings.
2 PROPOSED ALGORITHM (Algorithm for Sound Direction Detection - TDoA):
Step 1. Signal Acquisition
Let $s_i(t)$ be the signal received by the $i$-th microphone. If there are $N$ microphones, then $i = 1, 2, \dots, N$.
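As an illustration, a minimal acquisition sketch in Python, assuming the third-party sounddevice package and an array exposed as a single N-channel audio device; the sampling rate and channel count are illustrative assumptions, not values from this report:

```python
# Minimal multi-channel acquisition sketch (assumes the 'sounddevice'
# package and a microphone array exposed as one N-channel device).
import sounddevice as sd

FS = 16_000   # sampling rate in Hz -- an illustrative assumption
N_MICS = 4    # number of microphones N -- an illustrative assumption

def acquire(duration_s=1.0):
    """Record one frame; column i of the result is s_i(t), i = 1..N."""
    frames = sd.rec(int(duration_s * FS), samplerate=FS, channels=N_MICS)
    sd.wait()  # block until the recording is complete
    return frames
```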
Step 2. Preprocessing
Each signal $s_i(t)$ is pre-processed through filtering to reduce noise. A common filter used is a band-pass filter, which can be represented as
$$s_{i,\mathrm{filtered}}(t) = \mathrm{BPF}(s_i(t)),$$
where BPF denotes the band-pass filtering operation.
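A minimal sketch of this step, assuming a zero-phase Butterworth band-pass filter from SciPy; the 300-3400 Hz speech band and the filter order are illustrative choices, not specified in the report:

```python
# Band-pass preprocessing sketch (the BPF operation above).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_filter(signal, fs, low_hz=300.0, high_hz=3400.0, order=4):
    """Return s_{i,filtered}(t) for one channel sampled at fs Hz."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)  # forward-backward filtering avoids phase delay
```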
Step 3. Cross-Correlation
For each pair of microphones $(i, j)$, we compute the cross-correlation function $R_{ij}(\tau)$:
$$R_{ij}(\tau) = \int_{-\infty}^{\infty} s_{i,\mathrm{filtered}}(t) \cdot s_{j,\mathrm{filtered}}(t + \tau)\, dt,$$
where $\tau$ is the time lag.
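In discrete time the integral becomes a sum over samples. A NumPy sketch, with the indexing convention chosen to match the definition above (the peak lag is positive when channel $j$ receives the signal later):

```python
# Discrete-time version of R_ij(tau), sketched with NumPy.
import numpy as np

def cross_correlation(si, sj):
    """Return R_ij over all integer lags, plus the lag values in samples."""
    # np.correlate(sj, si, 'full')[k] ~ sum_t s_i(t) * s_j(t + tau)
    r = np.correlate(sj, si, mode="full")
    lags = np.arange(-(len(si) - 1), len(sj))
    return r, lags
```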
Step 4. Generalized Cross-Correlation with Phase Transform (GCC-PHAT)
To make the time-delay estimate robust in the presence of noise, we normalize the cross-spectrum and use only the phase information:
$$R_{ij}^{\mathrm{GCC\text{-}PHAT}}(\tau) = \mathcal{F}^{-1}\!\left[\frac{\mathcal{F}\{s_{i,\mathrm{filtered}}(t)\} \cdot \mathcal{F}\{s_{j,\mathrm{filtered}}(t)\}^{*}}{\left|\mathcal{F}\{s_{i,\mathrm{filtered}}(t)\} \cdot \mathcal{F}\{s_{j,\mathrm{filtered}}(t)\}^{*}\right|}\right],$$
where $\mathcal{F}$ denotes the Fourier transform and $*$ denotes the complex conjugate.
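A frequency-domain sketch of GCC-PHAT using NumPy FFTs. One note on conventions: to stay consistent with the definition of $R_{ij}(\tau)$ in Step 3 (peak at positive $\tau$ when microphone $j$ receives the signal later), the code conjugates the spectrum of channel $i$; texts differ on which factor carries the conjugate. The small epsilon guarding the division is an added safeguard, not part of the formula:

```python
# GCC-PHAT sketch: whiten the cross-spectrum to keep only phase, then
# invert the FFT and reorder so lags run from -max_lag to +max_lag.
import numpy as np

def gcc_phat(si, sj, max_lag, eps=1e-12):
    """Return R^GCC-PHAT_ij for lags -max_lag..+max_lag (in samples)."""
    nfft = 1 << int(np.ceil(np.log2(len(si) + len(sj) - 1)))  # no circular wrap
    Si = np.fft.rfft(si, nfft)
    Sj = np.fft.rfft(sj, nfft)
    cross = np.conj(Si) * Sj               # spectrum of R_ij from Step 3
    r = np.fft.irfft(cross / (np.abs(cross) + eps), nfft)
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return r, lags
```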
Step 5. Time-Delay Estimation
The time difference of arrival $\Delta t_{ij}$ between microphones $i$ and $j$ is estimated by finding the value of $\tau$ that maximizes $R_{ij}^{\mathrm{GCC\text{-}PHAT}}(\tau)$:
$$\Delta t_{ij} = \arg\max_{\tau} R_{ij}^{\mathrm{GCC\text{-}PHAT}}(\tau).$$
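Expressed in code, building on the hypothetical gcc_phat() sketch above; bounding the search to the lags physically allowed by the microphone spacing is an added refinement, not something stated in the text:

```python
# Step 5 sketch: Δt_ij is the lag (in seconds) that maximizes GCC-PHAT.
import numpy as np

def estimate_tdoa(si, sj, fs, mic_distance_m, c=343.0):
    """Return the estimated time difference of arrival Δt_ij in seconds."""
    max_lag = int(np.ceil(mic_distance_m / c * fs))  # physical bound on |Δt_ij|
    r, lags = gcc_phat(si, sj, max_lag)              # sketch from Step 4
    return lags[np.argmax(r)] / fs                   # arg max over tau
```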
Step 6. Direction Calculation
Assuming a planar array geometry and the speed of sound $c$, the angle of arrival $\theta$ can be estimated using the differences in time of arrival and the positions of the microphones $(x_i, y_i)$ and $(x_j, y_j)$:
$$\theta \approx \arcsin\!\left(\frac{c \cdot \Delta t_{ij}}{\sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}}\right).$$
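The formula translates directly into code; clipping the arcsine argument to [-1, 1] is an added safeguard, since noisy $\Delta t_{ij}$ estimates can push the ratio slightly out of range:

```python
# Step 6 sketch: far-field angle of arrival from Δt_ij and mic positions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def angle_of_arrival(dt_ij, pos_i, pos_j, c=SPEED_OF_SOUND):
    """Estimate theta (in radians) for the microphone pair (i, j)."""
    baseline = np.hypot(pos_j[0] - pos_i[0], pos_j[1] - pos_i[1])
    ratio = np.clip(c * dt_ij / baseline, -1.0, 1.0)  # guard arcsin domain
    return np.arcsin(ratio)

# Example: two mics 10 cm apart on the x-axis with Δt_ij = 0.2 ms gives
# angle_of_arrival(2e-4, (0.0, 0.0), (0.1, 0.0)) ≈ 0.76 rad (about 43°).
```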