Data Engineering graduation thesis: A Vietnamese Speech Corpus for Text-to-Speech

DOCUMENT INFORMATION

Basic information

Title	A Vietnamese Speech Corpus For Text To Speech
Authors	Huỳnh Nguyễn Tín, Nguyễn Minh Tiến
Advisor	Assoc. Prof. Dr. Hoàng Văn Dũng
University	Ho Chi Minh City University of Technology and Education
Major	Data Engineering
Type	Graduation thesis
Year of publication	2024
City	Ho Chi Minh City

Format
Pages	85
File size	2.82 MB

Structure

CHAPTER 1: INTRODUCTION
- 1.1. Context
- 1.2. Objectives of the Thesis
- 1.3. Related Works
CHAPTER 2: LITERATURE REVIEW
- 2.1. Existing Vietnamese Speech Corpora
  - 2.1.1. Transcript Collection
  - 2.1.2. Speech Collection
  - 2.1.3. Processing Data
  - 2.1.4. Export and Publish
  - 2.1.5. Conclusion
- 2.2. Vietnamese Text Normalization
- 2.3. Part-of-speech Tagging
  - 2.3.1. RDRsegmenter for word segmentation
  - 2.3.2. VnMarMoT
CHAPTER 3: METHODOLOGY
- 3.1. Data Collection
- 3.2. Audio Resampling
  - 3.2.1. Sampling Rate
  - 3.2.2. Resampler
- 3.3. Addressing Non-Standard Words
  - 3.3.1. Premise
  - 3.3.2. Semiotic Classes
  - 3.3.3. Detection and Normalization of Non-Standard Words
- 3.4. Addressing Out-of-Vocabulary Words
  - 3.4.1. Premise
  - 3.4.2. Bridging the Gap
  - 3.4.3. Phonetic Transcription
  - 3.4.4. Syllabification
  - 3.4.5. Dictionary Annotation
  - 3.4.6. Transliteration
- 3.5. Forced Alignment
  - 3.5.1. Premise
  - 3.5.2. Forced Aligner
- 3.6. Post processing
CHAPTER 4: CORPUS ANALYSIS
- 4.1. Raw corpus
- 4.2. Cleaning
- 4.3. Fine-tuning a Text-to-Speech Model

Content

MINISTRY OF EDUCATION AND TRAINING HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION GRADUATION THESIS MAJOR: DATA ENGINEERING INSTRUCTOR: HOANG VAN DUNG NGUYEN MINH TIEN Ho Chi Min

No.	Time	Tasks	Notes
1	04/03/2024 - 10/03/2024	Study prior research; analyze and assess the topic as a whole
2	11/03/2024 - 31/03/2024	Study and evaluate data crawling tools; build the data crawling pipeline; crawl the data
3	01/04/2024 - 14/04/2024	Study existing text normalization tools; study the phonetics and syllable structure of Vietnamese; survey supporting models such as POSTagger
4	15/04/2024 - 19/05/2024	Build dictionaries for punctuation, abbreviations, and phonetic transcriptions; build the data normalization pipeline; run tests
5	20/05/2024 - 02/06/2024	Study forced aligners and apply one to the corpus
6	03/06/2024 - 16/06/2024	Build the pipeline for text normalization and forced alignment
7	17/06/2024 - 07/2024	Evaluate and analyze the corpus; test the data; write the report

Date: ... 2024. Advisor's comments. Proposal authors

(signature and full name)

Tin Nguyen Huynh, Tien Minh Nguyen: A Vietnamese speech corpus for text to speech

(Under the direction of Van-Dung Hoang)

Vietnamese is an under-resourced language, and demand for a large-scale Vietnamese speech corpus is growing day by day as language models evolve.

This article introduces a cost-effective and low-dependency method for processing and curating a Vietnamese speech corpus sourced from publicly available YouTube videos. For low-resource languages like Vietnamese, acquiring high-quality data poses a significant challenge in developing large language models. Although existing public corpora feature manually annotated recordings from various speakers, this traditional approach is often expensive and time-consuming, making it impractical for students and independent speech researchers.

Our goal is to address the core issue by developing a scalable processing pipeline to curate an extensive Vietnamese speech corpus. We utilize publicly available Vietnamese news videos on YouTube, which include closed captions, as our primary data source. Additionally, we investigate different facets of Vietnamese text normalization and introduce a text normalizer that features an innovative approach to transliterating foreign words.

We would like to express our heartfelt appreciation to our thesis advisor, Assoc. Prof. Van-Dung Hoang, whose remarkable academic knowledge and unwavering commitment have played a crucial role in shaping this thesis.

We extend our sincere gratitude to Dr. Thanh-Son Nguyen, a valued member of our thesis committee, for dedicating time and effort to evaluate our work and for offering insightful questions that have enriched our research.

In our final academic years, we developed a keen interest in natural language processing, a rapidly evolving field that expanded our horizons and challenged us to step outside our comfort zones. This journey into new possibilities and experiences was made possible by the university's unwavering motivation and support in our pursuit of knowledge and exploration of new frontiers.

Table 1 Vietnamese speech corpora overview 2

Table 2 Old and modern methods for placing tone marks 8

Table 3 Examples of key-value pairs in the 5-syllable context dictionary 12

Table 4 Short description of RDR segmenter 12

Table 5 Sample of tag, plain text and the explanation 15

Table 6 Speed comparison between different resamplers in downsampling 22

Table 7 Categorization of Vietnamese non-standard words 25

Table 8 Accuracy and speed comparison between part-of-speech taggers 29

Table 9 Part-of-speech tag set of the MarMoT-based tagger 30

Table 10 Semiotic classes per tag 32

Table 11 Classification rule per semiotic class 33

Table 14 Behavior of Gorman’s syllabifier 41

Table 15 Comparison between Gorman’s and our syllabifier 43

Table 16 Transliterated ARPA vowel phones (left) and consonant phones (right) 47

Table 17 Transliterated two-phone nuclei 49

Table 18 Transliterated three-phone nuclei 49

Table 19 Nucleus correction of illicit rimes after transliteration 52

Table 20 Quantity of audios in each range of alignment score 64

Figure 1 Front-end of recording application 5

Figure 2 Pipeline architecture of VnCoreNLP 9

Figure 3 A sample of SCRDR tree for POS Tagging 10

Figure 4 A diagram of another approach to construct an SCRDR tree 11

Figure 5 Content layout of a SubRip file 17

Figure 6 Content of a converted SubRip file 18

Figure 7 Directory structure of the data collection folder 19

Figure 8 Comparison of anti-aliasing filters of different resamplers in upsampling 22

Figure 9 Subset of possible semiotic classes 26

Figure 11 Pseudo-code for Gorman’s syllabification algorithm 41

Figure 12 Modifications to Gorman’s syllabification algorithm 43

Figure 13 Content of the annotated CMU dictionary 44

Figure 15 Forced alignment on the word level 58

Figure 16 Trimming of non-participating words and silences 59

Figure 17 The distribution of gender of the raw corpus 62

Figure 18 The distribution of alignment score of the raw corpus 62

Figure 19 The distribution of audio durations in our clean corpus 63

High-quality speech corpora are becoming increasingly important in speech-related studies, including speech analysis and synthesis. English, being a widely spoken language, has a rich array of research focused on speech corpora, highlighting its significance in the field.

Linguistic research in Vietnamese faces significant challenges due to ambiguity and a scarcity of foundational studies, despite the language being spoken by approximately 100 million people as of 2024. The limited availability of speech samples complicates the creation of a speech corpus, which is vital for tasks such as text-to-speech (TTS) and speech recognition. Developing a comprehensive speech database demands substantial resources, time, and effort, which makes it economically challenging, and its environmental impact is often disregarded. To address these issues, this paper introduces a Vietnamese speech corpus specifically designed for TTS tasks, utilizing methodologies from prior research, including non-Vietnamese studies, while adapting them to accommodate the unique characteristics of Vietnamese as a monosyllabic and tonal language.

Leveraging our extensive experience from previous projects, we are currently developing a Vietnamese speech corpus that exceeds 100 hours in duration and comprises over 100,000 audio recordings. Our goal is to enhance the quality and accessibility of Vietnamese speech technology.

1. Crawling and downloading data: explore how to crawl data from the website and use the crawled data to download the speech and its transcripts.

2. Text normalization: build our own hybrid text normalizer for the transcripts. We combine a rule-based approach with NLP models such as WordSegmenter and POSTagger, and we also examine how these models work.

3. Audio pre-processing: given the uneven quality of Internet audio, we explain how the audio can be denoised and build a forced aligner that synchronizes the audio with the corresponding transcripts.

4. Evaluation: measure the quality of the speech corpus by training a text-to-speech model such as SpeechT5.
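Since the transcripts come from closed captions, the first step after downloading is to turn each caption file into timed text segments. The following is a minimal sketch, assuming the captions are stored in the common SubRip (SRT) layout; the names `Cue`, `parse_srt`, and `SAMPLE` are hypothetical and are not taken from the thesis pipeline.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Cue:
    start: float  # cue start time in seconds
    end: float    # cue end time in seconds
    text: str     # caption text, with line breaks joined by spaces


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert the four timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse SubRip text into a list of cues (illustrative, not the thesis code)."""
    cues = []
    content = content.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                g = match.groups()
                # Everything after the timing line is caption text.
                text = " ".join(lines[i + 1:])
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


# Hypothetical caption content used only to exercise the parser.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
xin chào quý vị

2
00:00:03,500 --> 00:00:06,250
đây là bản tin
thời sự
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(f"{cue.start:.2f}-{cue.end:.2f}: {cue.text}")
```

Each cue can then be passed to text normalization, and its timestamps give a rough starting point that forced alignment later refines.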

In recent years, numerous Vietnamese speech corpora have emerged, including VOV, MICA VNSpeechCorpus, AIlab VIVOS, and VAIS-1000, as highlighted in Table 1, which provides an overview of these advanced resources. Developing high-quality speech corpora demands substantial financial investment, time, and human resources. However, there is a lack of research focused on minimizing the costs related to these essential elements.

Corpus	Size	Style	Open/Closed
Viettel corpus	85.8 hours	Phone call	Closed
Corpus in [12]	25 hours	Reading	Closed
Corpus in [13]	100.5 hours	Spontaneous	Open

Table 1 Vietnamese speech corpora overview

CHAPTER 2: LITERATURE REVIEW

2.1 Existing Vietnamese Speech Corpora

Vietnamese speech synthesis began in the 2000s, but research in this area has been limited due to challenges and the absence of a standard Vietnamese speech corpus. The increasing demand for speech-related tasks, such as text-to-speech (TTS) and speech recognition, has made the creation of comprehensive speech corpora a resource-intensive endeavor, requiring significant labor, budget, and time. Consequently, several Vietnamese speech corpora have been developed, including VOV, MICA VNSpeechCorpus, AIlab VIVOS, and VAIS-1000. Although these corpora are relatively small and may not represent the highest quality, they currently stand as the best available resources for Vietnamese speech synthesis.

Building a speech corpus in Vietnamese shares similarities with corpus research in other languages, but it presents unique challenges due to the high standards for data design. Researchers must ensure clean recordings, utilize soundproof environments, and include a diverse range of genders, ages, and regional accents. The complexity of Vietnamese speech and writing adds to these difficulties. Moreover, the methods for collecting and processing speech corpora vary based on the specific objectives of each study.

The MICA VNSpeechCorpus research project involves the collection of data from Vietnamese web pages using web robots, including a custom-built robot developed by the researchers. These robots systematically scan and retrieve documents, addressing content noise by normalizing and rewriting the text to ensure a cohesive and standardized text corpus.

Uploaded: 19/12/2024, 11:30

References

[1] Mai, Luong Chi, and Dang Ngoc Duc. "Design of Vietnamese speech corpus and current status." Proceedings of the International Symposium on Chinese Spoken Language Processing (ISCSLP). Vol. 6. 2006.

[2] Le, Viet Bac, et al. "Spoken and Written Language Resources for Vietnamese." LREC. Vol. 4. 2004.

[3] Phuong, Pham Ngoc, Quoc Truong Do, and Luong Chi Mai. "A high quality and phonetic balanced speech corpus for Vietnamese." arXiv preprint arXiv:1904.05569 (2019).

[4] Nhan, Do Tri, et al. "Vietnamese Speech Synthesis with End-to-End Model and Text Normalization."

[5] Binh, Nguyen Vu Le. nguyenvulebinh/visen: Vietnamese tone normalization, GitHub. Available at: https://github.com/nguyenvulebinh/visen (Accessed: October 2023)

[6] Mimino666. Mimino666/langdetect: Port of Google's language-detection library to Python, GitHub. Available at: https://github.com/Mimino666/langdetect (Accessed: October 2023)

[7] Vu, Thanh, et al. "VnCoreNLP: A Vietnamese natural language processing toolkit." arXiv preprint arXiv:1801.01331 (2018).

[8] Nguyen, Dat Quoc, et al. "A fast and accurate Vietnamese word segmenter." arXiv preprint arXiv:1709.06307 (2017).

[9] Nguyen, Dat Quoc, et al. "A robust transformation-based learning approach using ripple down rules for part-of-speech tagging." AI Communications 29.3 (2016): 409-422.

[10] Nguyen, Dat Quoc, et al. "From word segmentation to POS tagging for Vietnamese." arXiv preprint arXiv:1711.04951 (2017).

[15] Thiemann, J. (1970) Audio resampling in python, Audio Resampling in Python. Available at: https://signalsprocessed.blogspot.com/2016/08/audio-resampling-in-python.html

[16] Administration (2024) The waveforms of speech, Macquarie University. Available at: https://www.mq.edu.au/about/about-the-university/our-faculties/medicine-and-health-

[17] Jonashaag. Audio-resampling-in-python/Audio Resampling in Python.ipynb at master · jonashaag/audio-resampling-in-python, GitHub. Available at: https://github.com/jonashaag/audio-resampling-in-python/blob/master/Audio%20Resampling%20in%20Python.ipynb

[18] Brick Wall Filter (2010) Wikipedia. Available at: https://en.wikipedia.org/wiki/Brick_wall_filter (Accessed: June 2024)

[23] VLSP 2013 datasets, Association for Vietnamese Language and Speech Processing. Available at: https://vlsp.org.vn/resources-vlsp2013 (Accessed: June 2024)

[25] Kylebgorman (2013) Syllabify/syllabify.py at master · kylebgorman/syllabify, GitHub. Available at: https://github.com/kylebgorman/syllabify/blob/master/syllabify.py (Accessed: June 2024)

[26] Kyubyong (2019) Kyubyong/g2p: g2p: English grapheme to phoneme conversion, GitHub. Available at: https://github.com/Kyubyong/g2p (Accessed: June 2024)

[27] Cmusphinx. cmusphinx/g2p-seq2seq: G2P with TensorFlow, GitHub. Available at: https://github.com/cmusphinx/g2p-seq2seq (Accessed: June 2024)

[29] CTC forced alignment API tutorial, Torchaudio 2.4.0.dev20240628 documentation. Available at: https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html (Accessed: July 2024)

[32] Quy tắc đặt dấu thanh của chữ Quốc ngữ (2024) Wikipedia. Available at: https://vi.wikipedia.org/wiki/Quy_t%E1%BA%AFc_%C4%91%E1%BA%B7t_d%E1%BA%A5u_thanh_c%E1%BB%A7a_ch%E1%BB%AF_Qu%E1%BB%91c_ng%E1%BB%AF (Accessed: July 2024)