MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
GRADUATION THESIS
MAJOR: DATA ENGINEERING
INSTRUCTOR: HOANG VAN DUNG
NGUYEN MINH TIEN
Ho Chi Min
INTRODUCTION
LITERATURE REVIEW
Existing Vietnamese Speech Corpora
Part-of-speech Tagging
METHODOLOGY
Audio Resampling
Addressing Non-Standard Words
3.3.3 Detection and Normalization of Non-Standard Words
Addressing Out-of-Vocabulary Words
Forced Alignment
CORPUS ANALYSIS
Fine-tuning a Text-to-Speech Model
No. Time Task Notes
1 04/03/2024 - 10/03/2024 o Review prior research o Analyze and assess the overall scope of the topic
2 11/03/2024 - 31/03/2024 o Survey and evaluate data-crawling tools o Build the data-crawling pipeline o Crawl the data
3 01/04/2024 - 14/04/2024 o Study existing text normalization tools o Study Vietnamese acoustics and syllable structure o Survey supporting models such as POSTagger
4 15/04/2024 - 19/05/2024 o Build dictionaries for punctuation, abbreviations, and transliterations o Build the data normalization pipeline o Run tests
5 20/05/2024 - 02/06/2024 o Study forced aligners and apply one to the corpus
6 03/06/2024 - 16/06/2024 o Build the pipeline for text normalization and forced alignment
7 17/06/2024 - 07/2024 o Evaluate and analyze the corpus o Test the data o Write the report
Day      Month      Year 2024
Advisor's comments      Proposal author
(signature and full name)
Tin Nguyen Huynh, Tien Minh Nguyen: A Vietnamese speech corpus for text to speech
(Under the direction of Van-Dung Hoang)
Vietnamese is an under-resourced language, and demand for a large-scale Vietnamese speech corpus is growing steadily as language models evolve.
This article introduces a cost-effective, low-dependency method for processing and curating a Vietnamese speech corpus sourced from publicly available YouTube videos. For low-resource languages like Vietnamese, acquiring high-quality data poses a significant challenge in developing large language models. Although existing public corpora feature manually annotated recordings from various speakers, this traditional approach is often expensive and time-consuming, making it impractical for students and independent speech researchers.
Our goal is to address this core issue by developing a scalable processing pipeline that curates an extensive Vietnamese speech corpus. We use publicly available Vietnamese news videos on YouTube that include closed captions as our primary data source. Additionally, we investigate different facets of Vietnamese text normalization and introduce a text normalizer that features a new approach to transliterating foreign words.
We would like to express our heartfelt appreciation to our thesis advisor, Assoc. Prof. Van-Dung Hoang, whose remarkable academic knowledge and unwavering commitment have played a crucial role in shaping this thesis.
We extend our sincere gratitude to Dr. Thanh-Son Nguyen, a valued member of our thesis committee, for dedicating time and effort to evaluate our work and for offering insightful questions that have enriched our research.
In our final academic years, we developed a keen interest in natural language processing, a rapidly evolving field that expanded our horizons and challenged us to step outside our comfort zones. This journey was made possible by the university's encouragement and support in our pursuit of knowledge and exploration of new frontiers.
3.3.3 Detection and Normalization of Non-Standard Words
3.4 Addressing Out-of-Vocabulary Words
4.3 Fine-tuning a Text-to-Speech Model
Table 1 Vietnamese speech corpora overview
Table 2 Old and modern methods for placing tone marks
Table 3 Examples of key-value pairs in the 5-syllable context dictionary
Table 4 Short description of RDR segmenter
Table 5 Sample of tag, plain text and the explanation
Table 6 Speed comparison between different resamplers in downsampling
Table 7 Categorization of Vietnamese non-standard words
Table 8 Accuracy and speed comparison between part-of-speech taggers
Table 9 Part-of-speech tag set of the MarMoT-based tagger
Table 10 Semiotic classes per tag
Table 11 Classification rule per semiotic class
Table 14 Behavior of Gorman’s syllabifier
Table 15 Comparison between Gorman’s and our syllabifier
Table 16 Transliterated ARPA vowel phones (left) and consonant phones (right)
Table 17 Transliterated two-phone nuclei
Table 18 Transliterated three-phone nuclei
Table 19 Nucleus correction of illicit rimes after transliteration
Table 20 Quantity of audios in each range of alignment score
Figure 1 Front-end of recording application
Figure 2 Pipeline architecture of VnCoreNLP
Figure 3 A sample of SCRDR tree for POS Tagging
Figure 4 A diagram of another approach to construct an SCRDR tree
Figure 5 Content layout of a SubRip file
Figure 6 Content of a converted SubRip file
Figure 7 Directory structure of the data collection folder
Figure 8 Comparison of anti-aliasing filters of different resamplers in upsampling
Figure 9 Subset of possible semiotic classes
Figure 11 Pseudo-code for Gorman’s syllabification algorithm
Figure 12 Modifications to Gorman’s syllabification algorithm
Figure 13 Content of the annotated CMU dictionary
Figure 15 Forced alignment on the word level
Figure 16 Trimming of non-participating words and silences
Figure 17 The distribution of gender of the raw corpus
Figure 18 The distribution of alignment score of the raw corpus
Figure 19 The distribution of audio durations in our clean corpus
High-quality speech corpora are becoming increasingly important in speech-related studies, including speech analysis and synthesis. English, being a widely spoken language, has a rich body of research focused on speech corpora.
Linguistic research in Vietnamese faces significant challenges due to ambiguity and a scarcity of foundational studies, even though the language is spoken by approximately 100 million people as of 2024. The limited availability of speech samples complicates the creation of a speech corpus, which is vital for tasks such as text-to-speech (TTS) and speech recognition. Developing a comprehensive speech database demands substantial resources, time, and effort, which makes it economically challenging, and its environmental impact is often disregarded. To address these issues, this paper introduces a Vietnamese speech corpus designed specifically for TTS tasks. It builds on methodologies from prior research, including non-Vietnamese studies, and adapts them to the characteristics of Vietnamese as a monosyllabic, tonal language.
Leveraging our experience from previous projects, we are currently developing a Vietnamese speech corpus that exceeds 100 hours in duration and comprises over 100,000 audio recordings. Our goal is to enhance the quality and accessibility of Vietnamese speech technology.
1 Crawling and downloading data: We explore how to crawl data from the website and use the crawled data to download speech and the corresponding transcripts.
2 Text normalization: We build a hybrid text normalizer for the transcripts, combining a rule-based approach with NLP models such as WordSegmenter and POSTagger, and we explain how it works.
3 Pre-processing audio: Audio sourced from the Internet has drawbacks. We explain how the audio can be denoised and how a forced aligner synchronizes the audio with its transcripts.
4 Evaluation: We measure the quality of the speech corpus by training a text-to-speech model such as SpeechT5.
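As a rough illustration of the rule-based side of step 2, the sketch below expands abbreviations through a lookup dictionary after stripping punctuation. It is a minimal example under stated assumptions, not the normalizer developed in this thesis: the dictionary entries, the punctuation set, and the function name `normalize` are all hypothetical, and the actual pipeline additionally relies on WordSegmenter and POSTagger.

```python
import unicodedata

# Hypothetical abbreviation dictionary, for illustration only.
# Keys are lowercase because the input is lowercased before lookup.
ABBREVIATIONS = {
    "ubnd": "ủy ban nhân dân",
    "tp.hcm": "thành phố hồ chí minh",
    "tp": "thành phố",
}

# Assumed set of punctuation marks stripped from both ends of each token.
EDGE_PUNCTUATION = ".,;:!?\"'()"


def normalize(text: str) -> str:
    """Lowercase the text, strip edge punctuation, expand known abbreviations."""
    # Compose characters first so that tone marks compare consistently.
    text = unicodedata.normalize("NFC", text).lower()
    words = []
    for token in text.split():
        core = token.strip(EDGE_PUNCTUATION)
        if not core:
            continue  # the token consisted of punctuation only
        words.append(ABBREVIATIONS.get(core, core))
    return unicodedata.normalize("NFC", " ".join(words))


print(normalize("UBND TP.HCM họp hôm nay."))
```

A pure dictionary lookup like this cannot resolve abbreviations whose reading depends on context, which is one reason a hybrid design that also consults a part-of-speech tagger is attractive.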
In recent years, numerous Vietnamese speech corpora have emerged, including VOV, MICA VNSpeechCorpus, AIlab VIVOS, and VAIS-1000; Table 1 provides an overview. Developing high-quality speech corpora demands substantial financial investment, time, and human resources. However, little research has focused on reducing these costs.
Corpus          Size         Style        Open/Closed
Viettel corpus  85.8 hours   Phone call   Closed
Corpus in [12]  25 hours     Reading      Closed
Corpus in [13]  100.5 hours  Spontaneous  Open
Table 1 Vietnamese speech corpora overview
CHAPTER 2: LITERATURE REVIEW
2.1 Existing Vietnamese Speech Corpora
Vietnamese speech synthesis began in the 2000s, but research in this area has been limited by technical challenges and the absence of a standard Vietnamese speech corpus. Demand for speech-related tasks such as text-to-speech (TTS) and speech recognition keeps growing, yet creating a comprehensive speech corpus remains resource-intensive, requiring significant labor, budget, and time. Even so, several Vietnamese speech corpora have been developed, including VOV, MICA VNSpeechCorpus, AIlab VIVOS, and VAIS-1000. Although these corpora are relatively small and may not be of the highest quality, they are currently the best available resources for Vietnamese speech synthesis.
Building a speech corpus in Vietnamese shares similarities with corpus research in other languages, but it presents unique challenges due to the high standards for data design. Researchers must ensure clean recordings, use soundproof environments, and include a diverse range of genders, ages, and regional accents. The complexity of Vietnamese speech and writing adds to these difficulties. Moreover, the methods for collecting and processing speech corpora vary with the specific objectives of each study.
The MICA VNSpeechCorpus research project collects data from Vietnamese web pages using web robots, including a custom-built robot developed by the researchers. These robots systematically scan and retrieve documents, and the project addresses content noise by normalizing and rewriting the text to produce a cohesive, standardized text corpus.