Research on adaptive speech synthesis for low-resourced languages, applied to the Muong language (Nghiên cứu tổng hợp tiếng nói cho ngôn ngữ ít nguồn tài nguyên theo hướng thích nghi, ứng dụng với tiếng Mường).
PART 1: BACKGROUND AND RELATED WORKS
Chapter 1, titled "Overview of Speech Synthesis and Speech Synthesis for Low-Resourced Languages": This chapter concisely reviews the existing literature to gain a comprehensive understanding of TTS. Research directions for low-resourced TTS are also detailed in this chapter.
Chapter 2, titled "Vietnamese and Muong Language": This chapter presents research on the phonology of the Vietnamese and Muong languages. Computational linguistic resources for Vietnamese speech processing, as applied in Vietnamese TTS, are described in detail.
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
Chapter 3, titled "Emulating Muong TTS Based on Input Transformation of Vietnamese TTS," presents the proposal to synthesize Muong speech by adapting existing Vietnamese TTS systems. This approach can be applied experimentally to create TTS systems for other Vietnamese ethnic minority languages quickly.
Chapter 4, titled "Cross-Lingual Transfer Learning for Muong Speech Synthesis": In this chapter, we use and experiment with approaches for Muong TTS that leverage Vietnamese resources. We focus on transfer learning by creating a Vietnamese TTS system, further training it with different Muong datasets, and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
Chapter 5, titled "Generating Unwritten Low-Resourced Language's Speech Directly from Rich-Resource Language's Text," presents our approach for addressing speech synthesis challenges for unwritten low-resourced languages by synthesizing L2 speech directly from L1 text. The proposed system is built using end-to-end neural network technology for text-to-speech.
We use Vietnamese as L1 and Muong as L2 in our experiments.
Chapter 6, titled "Speech Synthesis for Unwritten Low-Resourced Languages Using Intermediate Representation": This chapter proposes using phoneme representation due to its close relationship with speech within a single language. The proposed method is applied to the Vietnamese and Muong language pair. Vietnamese text is translated into an intermediate representation of two unwritten dialects of the Muong language: Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho. The evaluation reveals relatively high translation quality for both dialects.
In conclusion, speech synthesis for low-resourced languages is a significant research area with the potential to positively impact the lives of speakers of these languages. Despite challenges posed by limited data and linguistic knowledge, advancements in speech synthesis technology and innovative approaches enable the development of high-quality speech synthesis systems for low-resourced languages. The work presented in this dissertation contributes to this field by exploring novel methods and techniques for speech synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech synthesis for low-resourced languages, particularly in response to the growing demand for accessible technology. This can be achieved through ongoing research in transfer learning, unsupervised learning, and data augmentation. Additionally, there is a need for further investment in collecting and preserving linguistic data for low-resourced languages and in developing phonological studies for these languages. With these efforts, we can ensure that speech synthesis technology is accessible to everyone, regardless of their language.
PART 1: BACKGROUND AND RELATED WORKS
OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGES
Overview of speech synthesis
This section offers a brief introduction to the field of speech synthesis. It highlights the key concepts and techniques involved in converting written text into spoken language. It also provides a foundation for understanding the complexities and challenges of developing speech synthesis systems.
Speech synthesis is the artificial generation of human speech using technology. A computer system designed for this purpose, known as a speech computer or speech synthesizer, can be realized through software or hardware implementations. A text-to-speech (TTS) system specifically converts standard written language text into audible speech, whereas other systems transform symbolic linguistic representations, such as phonetic transcriptions, into speech [1]. TTS technology has evolved significantly over the years, incorporating advanced algorithms and machine learning techniques to produce more natural-sounding and intelligible speech output.
By simulating various aspects of human speech, including pitch, tone, and intonation, TTS systems strive to provide a seamless and user-friendly listening experience.
The development of TTS technology has undergone remarkable progress over time:
In the 1950s, pioneers like Homer Dudley with his "VODER" and Franklin S. Cooper's "Pattern Playback" laid the foundation for modern TTS systems.
The 1960s brought forth formant-based synthesis, utilizing models of vocal tract resonances to produce speech sounds.
The 1970s introduced linear predictive coding (LPC), enhancing speech signal modeling and producing more natural synthesized speech.
The 1980s saw the emergence of concatenative synthesis, a method that combined pre-recorded speech segments for the final output.
During the 1990s, unit selection synthesis became popular, using extensive databases to select the best-fitting speech units for more natural output.
The 2000s experienced the rise of statistical parametric synthesis techniques, such as Hidden Markov Models (HMMs), providing a data-driven and adaptable approach to TTS.
The 2010s marked the beginning of deep learning-based TTS with models like Google's WaveNet, revolutionizing speech synthesis by generating raw audio waveforms instead of relying on traditional signal processing.
End-to-end neural TTS systems like Tacotron streamlined the TTS process by directly converting text to speech without intermediate stages.
Transfer learning and multilingual TTS models have recently enabled the development of high-quality TTS systems for low-resourced languages, expanding the reach of TTS technology.
Today, TTS plays a vital role in everyday life, powering virtual assistants, accessibility tools, and various digital content types.
Some current applications of text-to-speech (TTS) technology include:
Assistive technology for the visually impaired: TTS systems help blind and visually impaired individuals by reading text from books, websites, and other sources, converting it into audible speech.
Learning tools: TTS systems are used in computer-aided learning programs, aiding language learners and students with reading difficulties or dyslexia by providing auditory reinforcement.
Voice output communication aids: TTS technology assists individuals with severe speech impairments by enabling them to communicate through synthesized speech.
Public transportation announcements: TTS provides automated announcements for passengers on buses, trains, and other public transportation systems.
E-books and audiobooks: TTS systems can read electronic books and generate audiobooks, making content accessible to a broader audience.
Entertainment: TTS technology is utilized in video games, animations, and other forms of multimedia entertainment to create realistic and engaging voiceovers.
Email and messaging: TTS systems can read emails, text messages, and other written content aloud, helping users stay connected and informed.
Call center automation: TTS is employed in automated phone systems, allowing users to interact with voice-activated menus and complete transactions through spoken commands.
Virtual assistants: TTS is a crucial component of popular voice-activated virtual assistants like Apple's Siri, Google Assistant, and Amazon's Alexa, enabling them to provide spoken responses to user queries.
Voice search applications: By integrating TTS with speech recognition, users can use speech as a natural input method for searching and retrieving information through voice search apps.
In conclusion, TTS technology has come a long way since its inception, with continuous advancements in algorithms, machine learning, and deep learning techniques. As a result, TTS systems now provide more natural-sounding and intelligible speech, enhancing the user experience across various applications such as assistive technology, learning tools, entertainment, virtual assistants, and voice search. The ongoing development and integration of TTS into our daily lives will continue to shape the future of human-computer interaction and digital accessibility.
The architecture of a TTS system is generally composed of several components, as depicted in Figure 1.1. The Text Processing component is responsible for preparing the input text for speech synthesis. The G2P Conversion component converts the written words into their corresponding phonetic representations. The Prosody Modeling component adds appropriate intonation, duration, and other prosodic features to the phonetic sequence. Lastly, the Speech Synthesis component generates the speech waveform based on the parameters derived from the fully tagged phonetic sequence [2].
Figure 1.1 Basic system architecture of a TTS system [22]
Text processing is crucial for identifying and interpreting all textual or linguistic information that falls outside the realms of phonetics and prosody. Its primary function is to transform non-orthographic elements into words that can be spoken aloud. Through text normalization, symbols, numbers, dates, abbreviations, and other non-orthographic text elements are converted into a standard orthographic transcription, facilitating subsequent phonetic conversion. Additionally, analyzing whitespace, punctuation, and other delimiters is vital for determining document structure and providing context for all subsequent steps. Certain text structure elements may also directly impact prosody. Advanced syntactic and semantic analysis can be achieved through effective text-processing techniques [2, p. 682]. The phonetic analysis aims to transform orthographic symbols of words into phonetic representations, complete with any diacritic information or lexical tones present in tonal languages. Although future TTS systems might rely on word-sounding units and possess increased storage capacity, homograph disambiguation and grapheme-to-phoneme (G2P) conversion for new words remain essential for accurate pronunciation of every word. G2P conversion is relatively straightforward in languages with a clear relationship between written and spoken forms. A small set of rules can effectively describe this direct correlation, which is characteristic of phonetic languages such as Spanish and Finnish. Conversely, English is not a phonetic language due to its diverse origins, resulting in less predictable letter-to-sound relationships. In these cases, employing general letter-to-sound rules and dictionary lookups can facilitate the conversion of letters to sounds, enabling the correct pronunciation of any word [2, p. 683].
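To make the text-normalization and G2P steps above concrete, the sketch below shows a minimal rule-based normalizer and a greedy grapheme-to-phoneme converter. It is an illustrative toy under assumed rule tables, function names, and a toy phoneme inventory; it is not the front end of any system described in this dissertation.

```python
import re

# Toy normalization table: non-orthographic tokens -> speakable words (illustrative only).
NORMALIZATION_RULES = {
    "%": "percent",
    "kg": "kilograms",
    "Dr.": "doctor",
}

# Toy grapheme-to-phoneme rules for a hypothetical phonetic language,
# longest graphemes first so digraphs are matched before single letters.
G2P_RULES = [
    ("ch", "tS"),
    ("a", "a"),
    ("e", "e"),
    ("o", "o"),
    ("t", "t"),
    ("n", "n"),
]

def normalize(text: str) -> str:
    """Expand symbols and abbreviations into orthographic words."""
    for token, spoken in NORMALIZATION_RULES.items():
        text = text.replace(token, f" {spoken} ")
    # Spell out digits one by one as a crude fallback.
    text = re.sub(r"\d", lambda m: f" {m.group(0)} ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def g2p(word: str) -> list[str]:
    """Greedy longest-match conversion of graphemes to phonemes."""
    phonemes, i = [], 0
    while i < len(word):
        for grapheme, phoneme in G2P_RULES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            i += 1  # skip characters not covered by the rules
    return phonemes

if __name__ == "__main__":
    sentence = normalize("Dr. Chan ate 2 kg")
    print([g2p(w) for w in sentence.split()])
```

In a full front end, dictionary lookups and homograph disambiguation would sit alongside such rules, as discussed above.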
In TTS systems, prosodic analysis involves examining prosodic features within the text input, such as stress, duration, pitch, and intensity. This information is then utilized to generate more natural and expressive speech. Prosodic analysis helps determine the appropriate stress, intonation, and rhythm for the synthesized speech, resulting in a more human-like output. Predicting prosodic features can be achieved through rule-based or machine-learning methods, including acoustic modeling and statistical parametric speech synthesis. By adjusting the synthesized speech, TTS systems can convey various emotions or speaking styles, enhancing their versatility and effectiveness across diverse applications. Speech synthesis employs the information anticipated from the fully tagged phonetic sequence to generate the corresponding speech waveform. Broadly, the two traditional speech synthesis techniques are concatenative and source/filter synthesizers. Concatenative synthesizers assemble pre-recorded human speech components to produce the desired utterance. In contrast, source/filter synthesizers create synthetic voices using a source/filter model based on the parametric description of speech. The first method necessitates assistance in generating high-quality speech using the input text's parametric representation and speech parameters. Meanwhile, the second approach requires a combination of algorithms and signal processing adjustments to ensure smooth and continuous speech, particularly at junctures.
Several improvements have been proposed for high-quality text-to-speech (TTS) systems, drawing from the two fundamental speech synthesis techniques Among the most prominent state-of-the-art methods are statistical parametric speech synthesis and unit selection techniques, which have been the subject of extensive debate among researchers in the field.
With the advancement of deep learning, neural network-based TTS (neural TTS) systems have been proposed, utilizing (deep) neural networks as the core model for speech synthesis. A neural TTS system comprises three fundamental components: a text analysis module, an acoustic model, and a vocoder. As illustrated in Figure 1.2, the text analysis module transforms a text sequence into linguistic features. The acoustic model then generates acoustic features from these linguistic features, and finally, the vocoder synthesizes the waveform from the acoustic features.
1.1.3 Evolution of TTS methods over time
The evolution of TTS methods has progressed significantly over time, with advancements in technology and research contributing to more natural and intelligible speech synthesis. Early TTS systems relied on rule-based methods and simple concatenation techniques, which have since evolved into sophisticated machine learning approaches, including neural network-based TTS systems. These modern systems offer improved speech quality, prosody, and adaptability, resulting in more versatile applications across various industries.
1.1.3.1 TTS using unit-selection method
The unit-selection approach allows for the creation of new, genuinely natural-sounding utterances by picking relevant sub-word units from a natural speech database [4], based on how well a chosen unit matches a specification (a target unit) and how well two chosen units join together. During synthesis, an algorithm chooses one unit from the available options to discover the best overall sequence of units that meets the specification [1]. The specification and the units are described by a feature set that includes linguistic and speech elements. The feature set is used to perform a Viterbi-style search to determine the sequence of units with the lowest total cost. Although they are theoretically quite similar, the review of Zen [4] suggests that there are two fundamental methods in unit-selection synthesis: (i) the selection model [5], shown in Figure 1.3a; and (ii) the clustering approach [6], shown in Figure 1.3b, which effectively enables the target cost to be pre-calculated. The second method asks questions about features available at the time of synthesis and groups units of the same type into a decision tree.
Figure 1.3 General and clustering-based unit-selection scheme: Solid lines represent target costs and dashed lines represent concatenation costs [13]
In the selection model for TTS synthesis, speech units are chosen based on a cost function calculated in real time during the synthesis process. This cost function considers the acoustic and linguistic similarity between the target text and the available speech units in the database, selecting the unit with the lowest cost for synthesis. Conversely, the clustering approach pre-calculates the cost for each speech unit, grouping similar units into a decision tree. This tree allows for rapid speech unit selection during synthesis based on available features, reducing real-time computation and resulting in faster, more efficient TTS synthesis. Both methods have their advantages and disadvantages, with the selection model offering greater flexibility for adapting to different languages and voices, and the clustering approach providing enhanced speed and efficiency. The choice between these methods depends on the specific needs of the TTS system being developed.
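As a concrete illustration of the cost-based search described above, the following sketch implements a minimal Viterbi-style unit selection over toy target and concatenation costs. The cost functions, unit representation, and candidate lists are invented for illustration only; they are not the cost model of any specific system cited here.

```python
# Minimal Viterbi-style unit selection (illustrative sketch).
# Each target position has candidate units; we pick the sequence minimizing
# total target cost + concatenation cost between adjacent chosen units.

def target_cost(target_spec: dict, unit: dict) -> float:
    """Toy target cost: mismatch in phone identity and pitch."""
    cost = 0.0 if unit["phone"] == target_spec["phone"] else 10.0
    cost += abs(unit["pitch"] - target_spec["pitch"]) / 50.0
    return cost

def concat_cost(left: dict, right: dict) -> float:
    """Toy concatenation cost: spectral discontinuity at the join."""
    return abs(left["spectrum_end"] - right["spectrum_start"])

def unit_selection(targets: list[dict], candidates: list[list[dict]]) -> list[dict]:
    """Dynamic-programming (Viterbi) search for the lowest-cost unit sequence."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            # choose the predecessor minimizing cumulative + concatenation cost
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + concat_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + tc, prev_j))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1] if best[i][j][1] is not None else j
    return list(reversed(path))
```

The clustering approach would replace the on-the-fly `target_cost` with costs pre-computed at the leaves of a decision tree, as noted above.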
In a typical statistical parametric speech synthesis system, a set of generative models is used to model the parametric speech representations extracted from a speech database, including spectral and excitation parameters (also known as vocoder parameters, since they are used as inputs of the vocoder). The model parameters are frequently estimated using the Maximum Likelihood (ML) criterion. Then, to maximize their output probabilities, speech parameters are constructed for a specific word sequence to be synthesized from the estimated models. Finally, a speech waveform is built from the parametric representations of speech [4]. Any generative model can be employed; however, HMMs are the most well-known. In HMM-based speech synthesis (HTS) [7], context-dependent HMMs statistically model and produce the speech parameters of a speech unit, such as the spectrum and excitation parameters (for example, fundamental frequency - F0). A typical HMM-based speech synthesis system's core architecture, as shown in Figure 1.4 [8], consists of two main processes: training and synthesis.
Figure 1.4 Core architecture of HMM-based speech synthesis system [25]
The Expectation Maximization (EM) algorithm is used to perform the ML estimation (MLE) during training, similar to speech recognition. The primary distinction is that the excitation and spectrum parameters are taken from a database of natural speech and modeled by a collection of multi-stream context-dependent HMMs. Excitation parameters include log F0 and its dynamic properties.
Another distinction is the addition of prosodic and linguistic circumstances to phonetic settings (called contextual features). The state-duration distribution for each HMM is also used to describe the temporal structure of speech. The Gamma distribution and the Gaussian distribution are options for state-duration distributions. In order to estimate them, the forward-backward method uses statistical data gathered during the previous iteration.
Figure 1.5 General HMM-based synthesis scheme [13, p. 5]
An inverse speech recognition procedure is carried out throughout the synthesis process. First, the utterance HMM is built by concatenating the context-dependent HMMs according to the label sequence after a given word sequence is transformed into a context-dependent label sequence. Second, the speech parameter generation algorithm creates spectral and excitation parameter sequences from the utterance HMM. The obtained spectral and excitation parameters are then used to create a speech waveform using a speech synthesis filter and a vocoder with a source-excitation/filter model [4].
Speech synthesis for low-resourced languages
The development of interactive systems for under-resourced languages [26] faces challenges due to the lack of data and the minimal research in this area. The SLTU-CCURL 2 workshops and SIGUL 3 meetings aim to gather researchers working on speech and NLP for these languages to exchange ideas and experiences. These events foster innovation and encourage cross-disciplinary collaboration between fields like computer science, linguistics, and anthropology. The focus is on promoting the development of spoken language technologies for low-resourced languages, covering topics like speech recognition, text-to-speech synthesis, and dialogue systems. By bringing together academic and industry researchers, these meetings help address the challenges faced in under-resourced language processing.
Many investigations for low-resourced languages have been conducted recently using a variety of methods, including applying speaker characteristics [27], modifying phonemic features [28], [29], and cross-lingual text-to-speech [30], [31]. Yuan-Jui Chen et al. introduced end-to-end TTS with cross-lingual transfer learning [32]. The authors proposed a method to learn a mapping between source and target linguistic symbols, because the model trained on the source language cannot be directly applied to the target language due to input space mismatches. By using this learned mapping, pronunciation information can be kept throughout the transfer process. Sahar Jamal et al. [33] used transfer learning in their experiments to take advantage of the low-resourced scenario. The information obtained then trains the model with a significantly smaller collection of Urdu training data. The authors created both standalone and transfer-learning-based Urdu systems by using pre-trained Tacotron models of English and Arabic as parent models. Marlene Staib et al. [34] improved or matched the performance of many baselines, including a resource-intensive expert mapping technique, by swapping out Tacotron 2's character input for a manageably small set of IPA-inspired features. This model architecture also enables the automated approximation of sounds that have not been seen in training. They demonstrated that a model trained on one language can produce intelligible speech in a target language even in the absence of acoustic training data. A similar approach [35] is used in transfer learning, where a high-resource English source model is fine-tuned with either 15 minutes or 4 hours of transcribed German data. Data augmentation is a different approach that researchers apply to solve the low-resourced language challenge [36]–[38]. An innovative three-step methodology has been developed for constructing expressive style voices using as little as 15 minutes of recorded target data, circumventing the costly operation of capturing large amounts of target data. Firstly, Goeric Huybrechts et al. [36] augment data by using recordings of other speakers whose speaking styles match the desired one. In the next step, they use this synthetic data to train a TTS model based on the available recordings. Finally, the model is fine-tuned to improve quality.

2 http://sltu-ccurl-2020.ilc.cnr.it/
3 https://sigul-2022.ilc.cnr.it/
Muthukumar and his colleagues have developed a technique for automatically constructing phonetics for unwritten languages [39]. Synthesis may be improved by switching to a representation closer to spoken language than written language.
The main challenges to address when developing TTS for under-resourced languages are: (1) synthesizing speech for languages with a writing system but limited data; and (2) synthesizing speech for languages without a writing system, using input text or speech from another language. Key research directions, such as adaptation and polyglot approaches, will be discussed in detail in the following sections to tackle these challenges.
1.2.1 TTS using emulating input approach
The rationale behind this approach is to leverage an existing TTS system for a base language (BL) to simulate TTS for an unsupported target language (TL). This strategy aims to assist individuals who speak unsupported languages when communicating in another language is inconvenient, such as when new immigrants visit a doctor. While TTS plays a role in translating doctor-patient conversations, text-based communication is also essential in healthcare. Consequently, TTS becomes necessary for users with limited English proficiency or literacy skills in their native language, as it enables them to access and understand vital information [40].
The first emulating idea was given by Evans et al. [41], whose team developed the simulator to fit a screen reader. They describe a method that enables the production of text-to-speech synthesizers for new languages with assistive apps. The method employs a straightforward rule-based text-to-phoneme step. The phonemes are then transmitted to a phoneme-to-speech system for another language. They demonstrate that the correspondence between the language to be synthesized and the language on which the phoneme-to-speech system is based is crucial for the perceived quality of speech, but not necessarily for speech comprehension. They report experiments in Greek, but the same method can be applied with equal success to Albanian, Czech, Welsh, and additional languages.
Three primary challenges exist in simulating a target language (TL) using a base language (BL). First, it is essential to choose BL phonemes that closely resemble those of the TL. Second, the goal is to minimize discrepancies in text-to-phoneme mapping. Lastly, we must select a BL with linguistic features that closely align with those of the TL. These three challenges can lead to different approaches, and ultimately, the balance achieved will be significantly influenced by the decisions made regarding the BL [40].
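To illustrate the emulating idea in the two paragraphs above, the sketch below rewrites target-language text using the closest units an existing base-language front end can pronounce. The Greek-to-English rows, the `emulate` function, and the `bl_tts.speak` call are hypothetical illustrations only; they are not the mapping of Evans et al. nor the Muong-to-Vietnamese mapping developed later in this dissertation.

```python
# Emulating a target language (TL) with a base language (BL) TTS system:
# TL graphemes are rewritten as strings the BL front end can pronounce.
# The Greek-to-English rows below are illustrative approximations only.
TL_TO_BL = [
    ("ου", "oo"),   # digraphs first (longest match)
    ("θ", "th"),
    ("χ", "h"),
    ("α", "a"),
    ("ν", "n"),
    ("ο", "o"),
    ("σ", "s"),
    ("ς", "s"),
]

def emulate(tl_text: str) -> str:
    """Rewrite TL text so an unmodified BL TTS system can pronounce it."""
    out, i = [], 0
    while i < len(tl_text):
        for tl_unit, bl_unit in TL_TO_BL:
            if tl_text.startswith(tl_unit, i):
                out.append(bl_unit)
                i += len(tl_unit)
                break
        else:
            out.append(tl_text[i])  # pass through unmapped characters
            i += 1
    return "".join(out)

if __name__ == "__main__":
    # A hypothetical call into an existing BL synthesizer would follow, e.g.:
    # bl_tts.speak(emulate("θάνος"))
    print(emulate("θάνος"))  # -> "thάnos" (the accented vowel passes through unmapped)
```

The quality of the resulting speech hinges on how well such a table balances the three challenges just listed, particularly the phoneme-similarity choice.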
In the study by Evans et al. [41], the evaluation process has a unique aspect compared to conventional TTS assessment. This distinction is essential to understand, as it highlights the tailored approach needed for evaluating TTS systems in under-resourced languages. The MRT (Modified Rhyme Test) differs from the traditional MOS (Mean Opinion Score) assessment: the conventional MOS is a subjective evaluation method used to gauge the overall quality of speech synthesis systems, whereas the MRT focuses on the intelligibility and usability of the synthesized speech in low-resource settings. This shift in focus makes the MRT a more suitable evaluation method for under-resourced languages. The study used nonsensical words and simple sentence structures as test cases to evaluate the Greek TTS system. This approach was chosen because, in under-resourced languages, it is crucial to ensure that the TTS system can generate clear and understandable speech even when faced with unusual or uncommon linguistic structures. By using these "fake" cases, the evaluation can better assess the system's performance and robustness in challenging situations.
Harold Somers and his colleagues proposed an "emulating" approach for developing TTS systems in under-resourced languages, as explored in their publications [40] and [42]. They aimed to create a TTS system for Somali, an under-resourced language, by leveraging an existing TTS system for a well-resourced language. The researchers also discussed various experimental designs to assess TTS systems developed using this approach, emphasizing the importance of evaluating speech quality, intelligibility, and usefulness. This method utilizes existing resources from well-resourced languages, showing potential for developing TTS systems for under-resourced languages. By investigating different experimental designs and evaluation methods, researchers can better comprehend the challenges, opportunities, and limitations of this approach.
The advantages and disadvantages of the "emulating" approach for low-resourced languages, as well as its applicability, can be summarized as follows:
Advantages:
Resource efficiency: By leveraging existing TTS systems for rich-resourced languages, the need for extensive data collection and development efforts can be reduced.
Faster development: Utilizing existing resources accelerates the development process for TTS systems in low-resourced languages.
Cross-disciplinary collaboration: The "emulating" approach fosters collaboration among researchers in various fields, such as computer science, linguistics, and anthropology.
Disadvantages:
Speech quality: Synthesized speech quality may be compromised due to the mismatch between the base and target languages.
Intelligibility: Depending on the similarity between the base and target languages, the intelligibility of the generated speech might be limited.
Customizability: The "emulating" approach might not be suitable for every low-resourced language, especially if there is no closely related rich-resourced language to use as a base.
Applicability:
Languages with similar phonetic or linguistic characteristics: The "emulating" approach is most applicable when the target low-resourced language shares phonetic or linguistic features with a well-resourced language.
Situations requiring rapid TTS system development: In cases where a TTS system is urgently needed for a low-resourced language, the "emulating" approach can provide a quicker solution than traditional methods.
Initial system development: The "emulating" approach can serve as a starting point for developing a more refined TTS system for low- resourced languages, allowing researchers to identify specific challenges and opportunities for improvement.
In summary, the "emulating" approach presents a promising direction for developing TTS systems for low-resourced languages. However, its success depends on selecting a suitable base language and overcoming the limitations inherent in this method.
1.2.2 TTS using the polyglot approach
Polyglot TTS and multilingual TTS are often used interchangeably, but they can have slightly different meanings depending on the context:
Polyglot TTS: In the polyglot approach, a single TTS model is trained to handle multiple languages simultaneously. The model can synthesize speech in various languages using the same architecture and shared parameters. The polyglot approach aims to leverage commonalities among languages and transfer knowledge from rich-resourced languages to low-resourced languages. This approach can be more resource-efficient and scalable compared to building separate TTS models for each language.
Multilingual TTS: Multilingual TTS is a broader term that refers to any TTS system capable of handling multiple languages, regardless of the specific architecture or method used. A multilingual TTS system can include separate TTS models for each language or use a shared model as in the polyglot approach. The main goal of multilingual TTS systems is to support speech synthesis in various languages.
In summary, polyglot TTS is a specific approach to building multilingual TTS systems where a single model is used for multiple languages. On the other hand, multilingual TTS is a more general term that encompasses any TTS system capable of handling multiple languages, whether it uses separate models for each language or a shared model as in the polyglot approach.
Figure 1.11 Scheme of a HMM-based polyglot synthesizer [48]
Machine translation
Introducing machine translation is essential for the development of TTS systems for unwritten low-resourced languages through a phoneme-level intermediate representation. This is because TTS systems generally require text input, which may not be readily available or standardized for unwritten languages. Machine translation can help bridge the gap between written and unwritten languages by converting the source text from a well-resourced language to a phoneme sequence in the target unwritten language. This phoneme sequence can then be used as the intermediate representation to guide the TTS system, enabling it to synthesize speech for the unwritten low-resourced language. Consequently, the combination of machine translation and intermediate phoneme representation can significantly contribute to developing TTS systems for these languages.
Machine translation is the process of using computational algorithms and techniques to automatically translate text or speech from one language to another. The primary goal is to generate translations that are both accurate and fluent, preserving the meaning and style of the original text. Several approaches have been developed for machine translation, each with its own set of principles. Rule-based machine translation relies on comprehensive linguistic rules and dictionaries to translate between languages. This method requires extensive knowledge of the source and target languages' grammatical structures and vocabulary. On the other hand, statistical machine translation employs probabilistic models based on the analysis of bilingual text corpora. These models learn to predict the most likely translation by observing patterns and co-occurrence frequencies of words and phrases in the training data.
In recent years, neural machine translation has gained prominence due to its ability to generate more fluent and accurate translations This approach utilizes deep neural networks, including encoder-decoder architectures and attention mechanisms, to capture complex linguistic relationships and generate translations in a more context-aware manner.
Each method has its advantages and drawbacks, often depending on the available data and computational resources While rule-based systems can perform well for languages with rich linguistic resources, statistical and neural approaches generally excel in scenarios with vast parallel corpora available for training.
Recall that the idea behind a neural language model is to model language as a sequential process. The neural machine translation model is constructed by extending a neural language model: given the preceding words as input, the model predicts the next word in the sequence. When the end of the input sentence is reached, the model is considered to predict the translation of the input sentence, with each output word produced sequentially. This is referred to as sequence-to-sequence modeling [54]. Figure 1.13 shows an example of converting one sequence into another: by extending the language model, the end of the English input is followed by the German output. The hidden state obtained after handling the end-of-sentence token contains an encoding of the entire input sentence.
Figure 1.13 Examples of sequence to sequence transformation [55]
The Encoder-Decoder [54], [56] approach in neural machine translation utilizes a neural network to create a meaningful representation of the input sentence and generate translations. It involves encoding the input sentence into a fixed-length vector, which is then decoded to produce the output sentence. This architecture is capable of handling variable-length input and output sequences, making it a popular choice for machine translation tasks. The encoder and decoder components can be deepened using multiple layers, further enhancing their learning capacity. Additionally, Convolutional Neural Networks (CNNs) have been applied to this approach, offering new possibilities for handling more complex translation tasks.
1.3.2 Attention in neural machine translation
Constructing a neural translation model with an encoder-decoder architecture has brought good initial results. In addition, proposals have been made to integrate additional structural models of the source and target languages into the neural translation system, such as incorporating alignment models, absolute and relative word-position models, the fertility model [57], [58], etc. However, the most important improvement for integrating bilingual structure-matching models into neural translation models is the use of an alignment model between input and output words.
The attention model is considered very important in a sequence-to-sequence processing model. The study in [59] added an alignment model (called the attention mechanism) to the decoder. The encoder provides a sequence of annotations $h_j = (\overleftarrow{h_j}, \overrightarrow{h_j})$, and the decoder requires a context vector $c_i$ at each step $i$. The attention mechanism connects these two components.
Figure 1.14 depicts the location of the attention model in neural machine translation, its input and output, and the connections between the components of the model. The idea is that the attention mechanism links information from all input representations $(h_1, \dots, h_j)$ with the previous hidden state of the decoder, thereby computing the input context $c_i$.
Figure 1.14 Location of the attention model in neural machine translation
Attention mechanisms are employed to enhance neural machine translation (NMT) model training. The attention mechanism facilitates the calculation of the connection between the decoding phase and each output word. This connection can be calculated using weight vectors and biases, and the attention values are then normalized using the Softmax function.
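The following sketch illustrates, under simplified assumptions, how attention weights and a context vector can be computed: scores between a decoder state and each encoder annotation are normalized with a softmax and used to form a weighted sum. The additive scoring function, matrix names, and dimensions are illustrative choices, not the exact formulation of the systems cited above.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(decoder_state: np.ndarray,
                      encoder_states: np.ndarray,
                      W_dec: np.ndarray,
                      W_enc: np.ndarray,
                      v: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Additive (alignment-model) attention sketch.

    encoder_states: (T, d) annotations h_1..h_T
    decoder_state:  (d,)   previous decoder hidden state
    Returns the context vector c_i and the attention weights.
    """
    # Unnormalized alignment scores e_ij = v^T tanh(W_dec s + W_enc h_j)
    scores = np.tanh(encoder_states @ W_enc.T + decoder_state @ W_dec.T) @ v
    weights = softmax(scores)            # normalized attention weights
    context = weights @ encoder_states   # weighted sum of annotations
    return context, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, a = 5, 8, 8                    # toy sequence length and layer sizes
    h = rng.normal(size=(T, d))          # encoder annotations
    s = rng.normal(size=(d,))            # previous decoder state
    W_enc, W_dec = rng.normal(size=(a, d)), rng.normal(size=(a, d))
    v = rng.normal(size=(a,))
    c, alpha = attention_context(s, h, W_dec, W_enc, v)
    print(alpha.round(3), c.shape)
```

Self-attention, discussed next, applies the same weighting idea among the input words themselves rather than between decoder and encoder states.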
The Self-Attention mechanism, which extends the Attention mechanism within the encoder, focuses on calculating connections between input words instead of between input and output words. A broader context is provided, and this mechanism supports parallel processing. The self-attention mechanism can also be incorporated into the decoding phase.
The training process for NMT models consists of shuffling the training data, dividing it into mini-batches, processing the mini-batches, accumulating gradients, and updating parameters. Training typically runs for 5-15 epochs. Advanced training can leverage existing bilingual structure-matching models to improve quality and speed, and self-attention mechanisms can further enhance the translation model.
1.3.3 Phrase-based statistical machine translation
Statistical machine translation views the translation problem as a machine learning problem: an algorithm extracts statistical information from an extensive database of previously translated texts (a database of parallel texts), and the system can then translate new sentences. Warren Weaver presented the first ideas about probabilistic statistical machine translation in 1949. The memo written by Warren Weaver in 1949 was perhaps the most influential publication in the early days of statistical machine translation. Weaver had first cited the possibility of using computers for translation in 1947. His proposals involved the applicability of coding methods to translation, the existence of universal language rules, and logic shared between languages, among others. Statistical machine translation was then reintroduced in 1991 by researchers at the IBM Research Center and has contributed to a renewed interest in statistical machine translation in recent years. Statistical machine translation is still one of the world's most studied machine translation methods.
Today, with the tools available and enough parallel text data, we can build a translation system for a new language pair in a relatively short time. This method can be applied to a large number of language pairs. The accuracy of these systems depends mainly on the quantity, quality, and suitability of the parallel text corpus used.
1.3.3.1 The phrase-based statistical machine translation problem
Speech synthesis evaluation metrics
The Mean Opinion Score (MOS) is a pivotal metric in the realm of audio and speech quality assessment. MOS is employed to evaluate the perceived quality of audio or speech by listeners. It stands as a standardized method for gauging user satisfaction with the quality of synthesized audio, be it through telecommunication systems, voice communication systems, or voice synthesis applications such as virtual assistants.
The formula for calculating MOS is typically represented as follows:
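Using the $N$ and $R_i$ defined below, the standard averaging form of this score can be written as:

\[ \mathrm{MOS} = \frac{1}{N}\sum_{i=1}^{N} R_i \]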
$N$: The number of listeners participating in the evaluation (typically chosen randomly).
$R_i$: The rating given by listener $i$, usually on a scale of 1 to 5, with the highest score indicating the highest satisfaction with the audio or speech quality.
The Mean Opinion Score (MOS) holds significant importance in the assessment and enhancement of audio and speech quality. Its primary significance encompasses:
Synthesized Quality Assessment: MOS allows researchers and developers to evaluate the performance of speech synthesis systems. By collecting MOS scores from real listeners, they can ascertain user satisfaction levels and identify weaknesses that require improvement.
System Comparison: MOS provides a standardized means to compare audio quality between different systems or variations of a system. This aids in selecting the best-performing system or developing improved iterations.
Quality Threshold Establishment: MOS can be used to establish acceptable quality thresholds. For example, a speech synthesis system with a MOS below a specific threshold may be deemed unacceptable and necessitate improvements.
In conclusion, MOS serves as a potent tool for the evaluation and improvement of audio and speech quality, ensuring that users have the best possible listening experience.
A Confidence Interval (CI) is a concept in statistics and science used to estimate a range within which the true value of a variable may lie. It quantifies the uncertainty in data and allows us to understand the level of confidence in a measurement or estimation. Below is the standard formula for calculating a Confidence Interval (CI):
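With $\bar{x}$ denoting the sample mean, $\sigma$ the sample standard deviation, and $N$ the number of observations (symbols assumed here, since only $Z$ is defined below), the standard interval is:

\[ \mathrm{CI} = \bar{x} \pm Z \cdot \frac{\sigma}{\sqrt{N}} \]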
$Z$: The value from the standard normal distribution table corresponding to the desired level of confidence (e.g., 1.96 for 95% confidence).
The Confidence Interval (CI) helps us gain a deeper understanding of the level of uncertainty in the data and the certainty of an estimation. The significance of CI includes:
Generalizing the estimate: CI allows us to ascertain a range of certainty around the estimated value. Instead of providing only specific figures, it enables the identification of a range within which the true value may lie.
Comparing and deciding: CI enables comparisons between estimates and assesses whether differences are statistically significant. It also supports decision-making based on the level of confidence.
Visualization and presentation of results: When reporting research findings, CI can be graphically represented or presented alongside the estimated value, aiding in the visual communication of data uncertainty.
Mel Cepstral Distortion (MCD) [87], [88] is a common metric in the field of speech synthesis and speech processing used to evaluate the similarity between two speech signals, typically a synthesized signal and a reference signal recorded from a human speaker.
The formula to calculate the Mel Cepstral Distortion (MCD) is typically represented as follows:
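Using the variable definitions below, a commonly used form of the metric is given here; the $\frac{10}{\ln 10}\sqrt{2}$ constant follows the usual dB convention, and the exact constant and frame averaging vary slightly across implementations:

\[ \mathrm{MCD} = \frac{10}{\ln 10}\cdot\frac{1}{N}\sum_{n=1}^{N} \sqrt{2\sum_{m=1}^{M}\left(C_m^{(t)}(n) - C_m^{(r)}(n)\right)^2} \]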
$N$: Number of frames in the speech signal.
$M$: Number of Mel cepstral coefficients (usually 12 or 24).
$C_m^{(t)}$: Mel cepstral coefficients of the synthesized (target) signal.
$C_m^{(r)}$: Mel cepstral coefficients of the reference (real) signal.
MCD measures the distance between two speech signals, one typically synthesized and the other real. A lower MCD value indicates a higher degree of similarity between the two signals. The significance of MCD includes:
Quality Assessment: MCD helps assess how closely the synthesized speech signal matches the reference, serving as an indicator of quality. A lower MCD implies a closer match.
Model Optimization: In the field of speech synthesis, researchers and developers can use MCD to adjust and improve synthesis models to make the synthesized speech sound more similar to the reference.
1.4.2.4 MCD with Dynamic Time Warping (MCD – DTW)
Dynamic Time Warping (DTW) [89] is applied in speech synthesis to align and normalize the Mel Cepstral Distortion (MCD) between a synthesized speech signal and a reference (real) speech signal. It is used to account for timing differences and make a meaningful comparison between the two signals. By employing DTW, MCD becomes a more robust metric for evaluating the quality and similarity of synthesized speech, ensuring that temporal variations are considered during the assessment process. This alignment and normalization process enhances the accuracy of speech synthesis evaluation and contributes to producing more natural and high-quality synthesized speech.
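The sketch below computes frame-level MCD after a simple DTW alignment of two mel-cepstral sequences. It assumes the cepstra have already been extracted as NumPy arrays of shape frames × coefficients; the plain dynamic-programming DTW, the exclusion of the energy coefficient, and the dB constant are common choices offered here only as an illustrative implementation, not the exact procedure used in this work.

```python
import numpy as np

def dtw_path(x: np.ndarray, y: np.ndarray) -> list[tuple[int, int]]:
    """Plain DTW over Euclidean frame distances; returns the alignment path."""
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # (n, m)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def mcd_dtw(target_mcep: np.ndarray, ref_mcep: np.ndarray) -> float:
    """Mean MCD (dB) over DTW-aligned frame pairs; c0 (energy) is excluded."""
    path = dtw_path(target_mcep, ref_mcep)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    diffs = [target_mcep[i, 1:] - ref_mcep[j, 1:] for i, j in path]
    return float(np.mean([const * np.sqrt(np.sum(d ** 2)) for d in diffs]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synth = rng.normal(size=(120, 25))   # e.g. c0 + 24 mel cepstra per frame
    real = rng.normal(size=(100, 25))
    print(f"MCD-DTW: {mcd_dtw(synth, real):.2f} dB")
```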
The Analysis of Variance, or ANOVA [90], [91], is a sophisticated statistical methodology employed to discern differences among group means. It is an instrumental technique used extensively across various domains, such as psychology, medicine, and the social sciences, predominantly in experimental research. The primary function of ANOVA is to analyze whether there exists any statistically significant disparity among the means of two or more groups by scrutinizing the variances both within and between these groups.
The fundamental philosophy underpinning ANOVA is the comparison of the between-group variance with the within-group variance. If the former substantially outweighs the latter, the conclusion drawn is that there are indeed noteworthy differences in the means of the groups under comparison. ANOVA's potency lies in its capacity to detect even minor deviations between groups. Multiple variants of ANOVA exist, each catering to different purposes and applications, such as one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. One-way and two-way ANOVA are both statistical methodologies used for comparing means across diverse groups or treatments; however, they differ in the number of factors or independent variables under analysis.
One-way ANOVA involves the analysis of a single factor or independent variable. This statistical method tests for significant differences in means across three or more groups. For instance, if the aim is to compare the average heights of students across three distinct schools, a one-way ANOVA would be the method of choice to determine any significant disparities in the mean heights among the schools.
The ANOVA formula for one-way ANOVA is as follows:
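Consistent with the definitions of MSB and MSW given below, the F-statistic is their ratio:

\[ F = \frac{MSB}{MSW} \]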
𝐹: The F-statistic, used to test the difference between groups.
𝑀𝑆𝐵: Calculated by dividing the sum of squares between groups by the degrees of freedom between groups. It measures the variability between groups. A higher MSB indicates significant differences between groups.
𝑀𝑆𝑊: Calculated by dividing the sum of squares within each group by the degrees of freedom within groups. It measures the variability within each group. A lower MSW indicates higher similarity within groups.
𝐹-statistic: This is a numerical value calculated based on the differences between groups in the data.
A high F-value suggests significant differences between groups.
Mean Square Between (𝑀𝑆𝐵): 𝑀𝑆𝐵 quantifies the variation between groups. It is computed by summing the squares of the differences between the group means and the overall mean, divided by the degrees of freedom between groups.
Mean Square Within (𝑀𝑆𝑊): 𝑀𝑆𝑊 quantifies the variation within each group. It is computed by summing the squares of the differences between individual data points and their respective group means, divided by the degrees of freedom within groups.
MSB and MSW are compared by dividing MSB by MSW to calculate the F-value; if the F-value is high, there is evidence to believe that there are significant differences between the groups. In other words, ANOVA helps us test whether the differences between groups are statistically significant.
This formula is used to assess the impact of an independent variable (such as a speech synthesis model) on a dependent variable (such as speech quality) across multiple groups or conditions.
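As a usage illustration of the scenario just described (comparing listener scores across several TTS systems), the short sketch below runs a one-way ANOVA with SciPy. The score arrays are invented toy data, not results from this dissertation.

```python
from scipy.stats import f_oneway

# Toy MOS-style listener scores for three hypothetical TTS systems.
system_a = [3.8, 4.1, 4.0, 3.9, 4.2]
system_b = [3.2, 3.5, 3.4, 3.6, 3.3]
system_c = [4.4, 4.5, 4.3, 4.6, 4.4]

# One-way ANOVA: F = MSB / MSW, with the p-value for the null hypothesis
# that all group means are equal.
f_stat, p_value = f_oneway(system_a, system_b, system_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would indicate that at least one system's mean score differs significantly from the others, matching the interpretation of the F-value given above.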
Conclusion
TTS technology converts written text into spoken speech and can be implemented using various methods, such as unit selection, statistical parametric speech synthesis, and neural speech synthesis. For low-resourced languages, approaches like emulating models, polyglot TTS, and adaptive methods have been used. Emulating models are quick and cost-effective but may suffer from low quality and limited accuracy. Polyglot TTS models offer multilingual capabilities, higher quality, and better accuracy but are computationally expensive and complex to implement. Adaptive methods provide high accuracy and dynamic adaptation but are also computationally expensive and challenging to implement.
Newer approaches like transfer learning, zero-shot learning, unsupervised learning, and adversarial training leverage machine learning advancements to improve TTS quality and accuracy for low-resourced languages. Transfer learning utilizes models pre-trained on rich-resource languages for fine-tuning TTS models in low-resourced languages. Zero-shot learning enables TTS models to generate speech for unseen languages, and unsupervised learning helps TTS models learn language characteristics without labelled data. Adversarial training aims to improve the realism of synthesized speech. These methods show promising results in enhancing speech synthesis for low-resourced languages.
In this thesis, the emulating method and cross-lingual transfer learning approach are chosen for several reasons:
Resource efficiency: The emulating method is quicker and more cost-effective than other approaches, as it requires less data and fewer computational resources. This makes it an attractive choice for low-resourced languages where acquiring substantial amounts of data is challenging.
Pre-trained models: Cross-lingual transfer learning leverages models pre-trained on rich-resource languages, which have already captured essential linguistic and acoustic features. This enables the model to gain a head start when fine-tuning for low-resourced languages, ultimately saving time and computational resources.
Adaptability: Cross-lingual transfer learning allows the model to adapt its knowledge from a rich-resource language to a low-resourced language, enabling it to generalize better across languages. This adaptability can help improve the overall performance of the TTS system in low-resourced languages.
Simplicity: The emulating method involves a relatively straightforward process of training a TTS model on a rich-resource language and fine-tuning it on a low-resourced language. This simplicity makes it easier to implement and maintain, especially in comparison to more complex approaches like polyglot TTS or adaptive methods.
Wider applicability: The combination of the emulating method and cross-lingual transfer learning can be applied to various low-resourced languages, making it a versatile solution for TTS systems that aim to support multiple languages.
By choosing the emulating method and adapting from the cross-lingual transfer learning approach, this thesis aims to develop an efficient, adaptable, and straightforward TTS system for low-resourced languages while maximizing the use of available resources and knowledge from rich-resource languages.
In the case of unwritten low-resourced languages, it is necessary to replace text with an intermediate phoneme-level representation. This issue can be addressed through research that combines the various machine translation methods presented earlier in Chapter 1. Thus, translation and TTS problems are integrated to create TTS systems for unwritten low-resourced languages.
Implementing the above content requires a careful study of the phonetic matching between the Vietnamese and Muong languages, and the next chapter details these studies.
VIETNAMESE AND MUONG LANGUAGE
Vietnamese language
The pronunciation of Vietnamese is considered simple and easy to learn, with a consistent and straightforward pronunciation system. The Vietnamese writing system combines the Latin script with diacritical marks. There are 29 letters in the Vietnamese alphabet, with some letters used for vowels and others for consonants. The most important feature of Vietnamese phonetics is its tone system, which consists of 6 tones that can change the meaning of a word. These tones are indicated by diacritical marks placed above or below the vowels.
Vietnamese [92] is the language of the Vietnamese (Kinh) people and the official language in Vietnam. It is the mother tongue of about 85% of the Vietnamese population, along with more than four million overseas Vietnamese. Vietnamese is also the second language of ethnic minorities in Vietnam. Although Vietnamese has some words borrowed from Chinese and previously used Nôm (a Chinese-based script) for writing, Vietnamese is considered the language of the Austroasiatic family with the largest number of speakers.
Vietnamese is officially recognized in the constitution as the national language of Vietnam. Vietnamese comprises its spoken form and the Quoc Ngu script for writing. However, there are no official documents defining Vietnamese standards and a national accent at the state level. Vietnamese is a native language derived from the agricultural civilization in what is now the northern region of the Red River and the Ma River in Vietnam. According to A. G. Haudricourt (1954), the Viet-Muong language group in the early Christian era was a toneless language or dialect. Later, through interaction with Chinese and especially with the Tai-Kadai languages, which have a highly developed tone system, the tone system of Vietnamese emerged and took the form it has today, following regular rules of tone formation. Tones began to appear around the 6th century (during the period of Chinese domination) with three tones, and developed stably around the 12th century (the Ly dynasty) with six tones. Some of the early consonants have continued to change up to the present day. In the process of change, the final consonants altered the syllable endings, and the initial consonants shifted as well.
Since the French invaded Vietnam in the second half of the 19th century, French gradually replaced Classical Chinese as the official language in education, administration, and diplomacy. The Quoc Ngu script, created by European missionaries, especially the two Portuguese missionaries Gaspar do Amaral and Antonio Barbosa, to write Vietnamese with Latin characters, was used increasingly widely and, at the same time, absorbed new terms and expressions from Western languages (mainly from French [93]) such as phanh, lốp, găng, pê đan, as well as Sino-Vietnamese terms such as chính đảng, kinh tế, giai cấp, bán kính. Gia Dinh Newspaper, the first newspaper published in the national script in 1865, affirmed the development of Quoc Ngu and its trend toward becoming the official writing of independent Vietnam later. Quoc Ngu uses only Latin letters and six diacritic marks; it is simple, convenient, scientific, and easy to learn and remember, and it completely replaced the French and Chinese scripts, which were difficult to read and remember and never became popular among the Vietnamese.
After the reunification of Vietnam in 1975, North-South relations were reconnected. Recently, the nationwide popularity of television and radio has helped to standardize Vietnamese to some extent. At the same time, with the advancement of the internet and globalization, the influence of English is growing in the media and among correspondents; many foreign words have been introduced into Vietnamese without much selectivity and are written in their foreign form.
Phonetic speech units can be classified into several categories, including syllables, sounds, phonemes, and prosodic elements such as stress, rhythm, and intonation. These units work together to create the unique sound and rhythm of speech, helping to convey meaning and express emotions.
Strings of words are broken down into different groups of sounds, varying in size from large to small. The smallest unit is the syllable, such as in the word "xà phòng," which is pronounced as "xà" and "phòng." This word is considered to have two syllables [94]. Syllables can be classified into different categories based on various criteria. The most common classification is based on the ending of the syllable. Using this criterion, syllables can be split into four types [95]: open syllables, semi-open syllables, half-closed syllables, and closed syllables.
Sound is the smallest natural unit of speech. Different criteria are used to distinguish between these sounds, such as their acoustic features and vocal characteristics. The distinction between vowels and consonants can also be based on the structure of the vocal organs. Specifically, consonants have a point of articulation, known as "phonological focus," while vowels do not [95]. In the context of the Vietnamese-Muong language pair, the differentiation between vowels and consonants is critical in determining the meaning of words. By understanding the unique characteristics of these sounds, TTS systems can be developed to produce natural-sounding speech in low-resourced languages.
The smallest unit of language is a phoneme, an abstract unit rather than a specific sound produced by an individual. For instance, "G" is a phoneme representing a cluster of distinct features expressed simultaneously [94]. These distinctive features are a social convention and allow for communication between speakers of the same language.
In Vietnamese, for example, the phoneme /d/ is pronounced with the same distinctive features, such as plosive and voiced, by all speakers. Although there may be differences in how individuals and dialects pronounce words, these differences are unimportant and do not affect the meaning. The phoneme represents a general social phonetic constraint that individuals must share for successful communication. So, the sound is the expression of the phoneme in speech [95].
Analyzing a language's syllable structure is crucial to understanding its phonemic system. In the case of Vietnamese, the complex syllable structure, which includes numerous nuclei/vowels, combinations of glide and vowel, and a prominent tone system, requires careful analysis to identify the distinct sounds and their role in the language.
Through analysis of the syllable structure, researchers can gain a deeper understanding of the relationships between the various components of Vietnamese syllables and how they contribute to the unique sound and rhythm of the language. This information can be used to inform the analysis of the phonemic system, helping to identify the distinct sounds and their role in the language.
Vietnamese phonetic and grammatical units, such as syllables and morphemes, coincide, and each syllable in Vietnamese has a stable and complete structure consisting of distinct sound units. The syllable structure in Vietnamese is much different from that of European languages, where each syllable typically contains a consonant and vowel sound.
In Vietnamese, each syllable has five components: tone, initial sound (onset), medial sound, nucleus, and coda. These components play different roles in the syllable, such as regulating the pitch, opening the syllable, changing the tone, and forming the nucleus and end of the syllable. Each component has its own function, and the combination of these components forms the syllable.
Muong language
2.2.1 Overview of Muong people and Muong language
The Muong people are an ethnic minority predominantly inhabiting the mountainous regions of northern Vietnam, particularly in Hoa Binh, Thanh Hoa, and Son La. They are the second-largest ethnic minority group in Vietnam (1,452,095 people according to the 2019 census) and are closely related to the majority ethnic Vietnamese, or Kinh people, in linguistic and cultural terms. Both groups are believed to share a common ancestry and historical roots.
Throughout history, the Muong people have maintained a distinct cultural identity shaped by their unique social structure, language, and way of life. The traditional Muong society was organized around a hierarchical system, with chieftains ruling various regions and maintaining power through family lineages. The nobility and commoners made up the remaining social strata. This social organization has evolved, reflecting changes in political and social environments.
The Muong language is part of the Austroasiatic family and is closely related to Vietnamese. Despite these linguistic similarities, the Muong language has unique features and characteristics, such as its tonal system and vocabulary. Muong folklore, music, and traditional rituals contribute to their cultural heritage and help to reinforce their identity.
Agriculture, particularly wet rice cultivation, has been an essential part of the Muong way of life, highlighting their strong connection to the natural environment. The Muong people have developed sophisticated agricultural techniques and practices over time to adapt to the challenges posed by the mountainous terrain.
The history of the Muong people is a testimony to their resilience and adaptability, as well as their ability to maintain their cultural identity while integrating and coexisting with other ethnic groups in Vietnam. Understanding Muong history provides valuable insights into the diversity and richness of the Vietnamese cultural landscape.
In terms of language family, Vietnamese and Muong belong to the same group: Viet-Muong, which belongs to the Mon-Khmer branch of the Austroasiatic family.
Figure 2.1 Mon-Khmer branch of the Austroasiatic family [109, pp 175–176]
The Viet Muong group is shown in Figure 2.2, according to the study of Ferlus [110].
The Muong language has unmistakable similarities with Vietnamese. This was pointed out by André-Georges Haudricourt in his article "Position of Vietnamese in South Asian Linguistics" [111]. Haudricourt examined and discussed 12 words from the basic vocabulary of the human body across various Mon-Khmer languages, including Viet, Muong, Phong, Kuy, Mon, Bahnar, Mnong, and some others. From Table A.6, it can be seen that the correspondences between Vietnamese and Muong are absolute.
Many more specific views have been put forward about the relationship between Vietnamese and Muong, the central issue in the study of the Muong language. Since the Vietnamese (or Kinh) and the Muong formally split into two groups, each group has tended to develop its own language. Muong and Vietnamese can therefore be regarded as independent languages, each with its own dialects. Linguistically, however, they are branches from the same root, the Viet-Muong sub-group, and research on their development has not yet established a clear point of division.
In a narrow sense, Viet-Muong refers to Northern Vietic languages, including Vietnamese, Muong, and Nguon, which share irregular tones in some basic vocabulary and phonetic features that distinguish them from southern Vietic languages. Ferlus [112] suggested that the Proto-Vietic expansion to the north was a key factor in developing these distinctions. Xinh Mun of the Khmuic branch was the only historical trace of this language group. Linguistic diversity in the northern regions of Vietnam results from rapid changes, with Vietnamese eventually becoming the dominant language.
The Muong language has developed more conservatively than Vietnamese, retaining phonetic features closer to Proto Viet-Muong. Notably, the Muong language retains pre-syllabic sounds that disappeared without substitution, while in Vietnamese these sounds changed, as in the word for 'chicken': Muong /ka/ and Vietnamese /ɣa/ [113].
Muong people live in three provinces: Hoa Binh (where Muong account for 63.3% of the province's population), Phu Tho (13.1%), and Thanh Hoa (9.5%). Figure 2.3 displays the geographical distribution of the various dialects of the Muong language.
Figure 2.3 The distribution of the Muong dialects [114, p 299]
In addition to the four main dialects of the Muong language (Muong Bi, Muong Vang, Muong Thang, and Muong Dong), it is important to mention the Muong Tan Son dialect, spoken in the Tan Son district of Phu Tho Province in northern Vietnam.
Muong Bi: Primarily spoken in Hoa Binh Province, Muong Bi is considered the most prestigious dialect and often serves as the standard for the Muong language.
Muong Vang: Predominantly spoken in Thanh Hoa Province, the Muong Vang dialect exhibits differences from Muong Bi in terms of vocabulary and pronunciation.
Muong Thang: Found in areas of both Hoa Binh and Thanh Hoa Provinces, Muong Thang shares similarities with the Muong Vang dialect.
Muong Dong: Mainly spoken in Son La Province, Muong Dong is the least studied and understood among the Muong dialects, with more distinct variations from the other three dialects.
Muong Tan Son: Spoken in the Tan Son district of Phu Tho Province, the Muong Tan Son dialect exhibits unique phonetic, lexical, and grammatical features that differentiate it from the other Muong dialects.
These five dialects together represent the linguistic diversity and richness of the Muong language, shaped by regional, historical, and cultural factors. Further research into each of these dialects, including Muong Tan Son, can provide valuable insights into the language's development and the cultural heritage of the Muong people.
The issue of the Muong written script is an interesting topic in the context of the Muong language and its development. Historically, the Muong people did not have a formal writing system. Instead, they relied on an oral tradition to transmit their culture, folklore, and knowledge from one generation to the next. Over time, the Muong people adopted various writing systems influenced by the languages and cultures they interacted with.
One notable example is the use of the Vietnamese Chữ Nôm script to transcribe the Muong language. Chữ Nôm is a logographic writing system used for the Vietnamese language from the 13th to the early 20th century, based on the Chinese script. Some Muong people, especially scholars and elites, learned Chữ Nôm and used it to record Muong texts, such as stories, poems, and historical documents.
Comparison between Vietnamese and Muong
This section draws on the studies mentioned above to compare the similarities and differences between Vietnamese and the two Muong dialects, Muong Bi and Muong Tan Son. In the following chapters, these comparative analyses will serve as a foundation for experimenting with TTS methods for low-resourced languages.
When comparing Vietnamese and Muong, the phonetic elements of the two languages can be divided into three groups:
Equivalent elements: Muong phonemes that coincide with Vietnamese phonemes, so direct equivalents can be used.
Closed elements: Muong phonemes that are similar to Vietnamese phonemes, so approximate substitutions by the closest Vietnamese phonemes can be used.
Distinct elements: Muong phonemes that have no counterpart among Vietnamese phonemes.
The following presents the proposed transformation rules between Muong and Vietnamese in terms of consonants, vowels, and tones. Firstly, regarding initial consonants, Muong Hoa Binh has 24 initial consonants [115], while Vietnamese has 20 initial consonants [108]. The close (but not identical) consonant pairs between Muong and Vietnamese are /b–ɓ, c–tɕ, g–ɣ, kʰ–x, pʰ–f/; they are part of the reason for the decline in the quality of the synthetic speech. Five consonant phonemes in Muong are absent from Vietnamese: /p, r, hr, tl, kl/, written respectively p, r, hr, tl, kl. The Muong Hoa Binh final consonants consist of 9 consonants and two approximants /w, j/ [115]. Hanoi Vietnamese licenses eight segments in coda position: three unreleased voiceless obstruents /p, t, k/, three nasals /m, n, ŋ/, and two approximants /j, w/. For the oral stop coda, Hanoi Vietnamese distinguishes the labial-velar /k͡p/ following o, u, ô and the fronted /k̟/ following i, ê, a (/sik̟˦˥/ - xích); for the nasal coda, it distinguishes /ŋ͡m/ following o, u, ô (/oŋ͡m˦/ - ông) and /ŋ̟/ following i, ê, a (/kiŋ̟˦/ - kinh) [108]. Thus, only seven Muong final consonants /p, t, k, m, n, ɲ, ŋ/ and the two semi-vowels /w, j/ have direct equivalents among the Vietnamese codas. Alternative Vietnamese equivalents must be found for the two remaining codas /c, l/; this issue is addressed in the following study.
One consonant, / /, is present in Vietnamese but not in Muong. One consonant, / /, is similar to a Vietnamese one but appears only in the Muong Tan Son dialect and not in Muong Bi. One consonant in the Muong Bi dialect, /w/, appears in neither Muong Tan Son nor Vietnamese.
Muong has one medial /w/, written as w, for example kwêl khwắn (smoking), khwắi (snack), kwa (we), kwải (throw), kwang (clean). Vietnamese also has a medial /w/, written with the two letters o and u, for example hoa quả [115].
In the vowel system, Muong Hoa Binh has 14 vowel sounds [115], while Vietnamese has 11 vowels and two diphthongs [94], [122, p. 58]. The Muong and Vietnamese vowel systems are therefore equivalent for 11 vowels and two diphthongs. One difference is the transcription of the diphthong ươ: in Muong it is transcribed as /ɯɤ/, whereas in Vietnamese it is transcribed as /ɯə/. There are also several orthographic differences that required adapting Muong spelling rules to Hanoi Vietnamese, for example changing êê to ê, oo to o, ôô to ô, uu to u, and ưư to ư [115]. Muong monophthongs written with the vowel letters a, e, i, o, u correspond reasonably well to the Hanoi Vietnamese vowels written with the same letters. Hanoi Vietnamese has quite complex letter-to-phoneme mappings, mainly related to voicing, so the Muong-to-"Hanoi Vietnamese" transliteration process is largely but not entirely automatable, and some manual revision of the texts is necessary. Finally, the Muong language does not have the two short vowels /ɛ/ and /ɔ/ that exist in Vietnamese.
Following the above analysis of the phonetic characteristics of Vietnamese and Muong, the phonetic mappings for consonants, vowels, and tones between Muong and Vietnamese are proposed in Table 2.12.
Table 2.12 Muong and Vietnamese phonetic comparison (orthography in normal, IPA in italic; Vi: Vietnamese; Mb: Muong Bi; Mts: Muong Tan Son)
Initial consonants (Mb | Mts | Vi):
tl /tl/ | – | –
w /w/ | – | –
ng /ŋ/ | ng /ŋ/ | ng, ngh /ŋ/
nh /ɲ/ | nh /ɲ/ | nh /ɲ/
ph /pʰ/ | ph /pʰ/ | ph /f/
t /t/ | t /t/ | t /t/
th /tʰ/ | th /tʰ/ | th /tʰ/
v /v/ | v /v/ | v /v/
x /s/ | x /s/ | x /s/
z /z/ | z /z/ | d, gi /z/
– | – | tr / /
– | – | s / /
Final consonants (Mb | Mts | Vi):
p | … | …
/c/ | … | –
t /t/ | t /t/ | t /t/
c /k/ | c /k/ | c /k/
m /m/ | m /m/ | m /m/
n /n/ | n /n/ | n /n/
nh /ɲ/ | nh /ɲ/ | nh /ɲ/
ng /ŋ/ | ng /ŋ/ | ng /ŋ/
w /w/ | w /w/ | w /w/
l /l/ | l /l/ | –
Vowels (Mb | Mts | Vi):
aa, a /a/ | aa, a /a/ | aa, a /a/
ă /ă/ | ă /ă/ | ă /ă/
â /ɤ̆/ | â /ɤ̆/ | â /ɤ̆/
e /ɛ/ | e /ɛ/ | e /ɛ/
êê, ê /e/ | êê, ê /e/ | êê, ê /e/
i /i/ | i /i/ | i /i/
oo, o /ɔ/ | oo, o /ɔ/ | oo, o /ɔ/
– | – | e /ɛ/
– | – | o /ɔ/
In this thesis, we use the Muong Hoa Binh tonal system described in [115], which contains five tones: 33 - Level, 42 - Falling, 323 - Falling-Rising, 34 - High Rising, and 342ʔ - Low Falling. Meanwhile, Hanoi Vietnamese has eight tones: a six-tone paradigm in open or sonorant-final syllables and a two-tone paradigm in syllables ending in an unreleased oral stop [108]. Muong does not have a falling tone like the one in Vietnamese. Details are given in Table 2.13.
Table 2.13 Comparing the tone of Vietnamese with Muong Tan Son and Muong Bi
No. | Muong Bi phonetic distinction criteria | Muong Tan Son phonetic distinction criteria | Vietnamese tone
1 | Level, medium frequency | Level, flat, medium frequency | A1 – Level
2 | Falling, low | Falling, low, long | A2 – Mid falling
3 | Rising, high | Rising, high, short | C1 – Low falling <Hỏi>
4 | Falling-rising, glottalization at the beginning of the syllable | … | …
5 | High level, ending with glottal closure | … | …
… | Falling to low, short, ending with glottal closure | … | …
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
Chapter 3, titled "Emulating Muong TTS Based on Input Transformation of Vietnamese TTS," presents the proposal to synthesize Muong speech by adapting existing Vietnamese TTS systems. This approach can be experimentally applied to quickly create TTS systems for other Vietnamese ethnic minority languages.
Chapter 4, titled "Cross-Lingual Transfer Learning for Muong Speech Synthesis": In this chapter, we use and experiment with approaches for Muong TTS that leverage Vietnamese resources. We focus on transfer learning by creating a Vietnamese TTS, further training it with different Muong datasets, and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
Chapter 5, titled "Generating Unwritten Low-Resourced Language's Speech Directly from Rich-resource Language's Text," presents our approach for addressing speech synthesis challenges for unwritten low-resourced languages by synthesizing L2 speech directly from L1 text. The proposed system is built using end-to-end neural network technology for text-to-speech.
We use Vietnamese as L1 and Muong as L2 in our experiments.
Chapter 6, titled "Speech Synthesis for Unwritten Low-Resourced Languages Using Intermediate Representation": This chapter proposes using a phoneme representation due to its close relationship with speech within a single language. The proposed method is applied to the Vietnamese and Muong language pair. Vietnamese text is translated into an intermediate representation of two unwritten dialects of the Muong language: Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho. The evaluation reveals relatively high translation quality for both dialects.
In conclusion, speech synthesis for low-resourced languages is a significant research area with the potential to positively impact the lives of speakers of these languages. Despite challenges posed by limited data and linguistic knowledge, advancements in speech synthesis technology and innovative approaches enable the development of high-quality speech synthesis systems for low-resourced languages. The work presented in this dissertation contributes to this field by exploring novel methods and techniques for speech synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech synthesis for low-resourced languages, particularly in response to the growing demand for accessible technology. This can be achieved through ongoing research in transfer learning, unsupervised learning, and data augmentation. Additionally, there is a need for further investment in collecting and preserving linguistic data for low-resourced languages and developing phonological studies for these languages. With these efforts, we can ensure that speech synthesis technology is accessible to everyone, regardless of their language.
PART 1: BACKGROUND AND RELATED WORKS
Chapter 1 Overview of speech synthesis and speech synthesis for low-resourced language
This section presents a concise overview of Text-to-Speech (TTS) synthesis and its application to low-resourced languages. It highlights the challenges faced in developing TTS systems for languages with limited resources and data. Additionally, it introduces various approaches and techniques to address these challenges and improve TTS quality for low-resourced languages.
This section offers a brief introduction to the field of speech synthesis. It highlights the key concepts and techniques in converting written text into spoken language. It also provides a foundation for understanding the complexities and challenges of developing speech synthesis systems.
Speech synthesis is the artificial generation of human speech using technology. A computer system designed for this purpose, known as a speech computer or speech synthesizer, can be realized through software or hardware implementations. A text-to-speech (TTS) system specifically converts standard written language text into audible speech, whereas other systems transform symbolic linguistic representations, such as phonetic transcriptions, into speech [1]. TTS technology has evolved significantly over the years, incorporating advanced algorithms and machine learning techniques to produce more natural-sounding and intelligible speech output.
By simulating various aspects of human speech, including pitch, tone, and intonation, TTS systems strive to provide a seamless and user-friendly listening experience.
The development of TTS technology has undergone remarkable progress over time:
Early pioneers such as Homer Dudley, with his "VODER" (demonstrated in 1939), and Franklin S. Cooper, with the "Pattern Playback" (around 1950), laid the foundation for modern TTS systems.
The 1960s brought forth formant-based synthesis, utilizing models of vocal tract resonances to produce speech sounds.
The 1970s introduced linear predictive coding (LPC), enhancing speech signal modeling and producing more natural synthesized speech.
The 1980s saw the emergence of concatenative synthesis, a method that combined pre- recorded speech segments for the final output.
During the 1990s, unit selection synthesis became popular, using extensive databases to select the best-fitting speech units for more natural output.
The 2000s experienced the rise of statistical parametric synthesis techniques, such as Hidden Markov Models (HMMs), providing a data-driven and adaptable approach to TTS.
The 2010s marked the beginning of deep learning-based TTS with models like Google's WaveNet, revolutionizing speech synthesis by generating raw audio waveforms instead of relying on traditional signal processing.
End-to-end neural TTS systems like Tacotron streamlined the TTS process by directly converting text to speech without intermediate stages.
Transfer learning and multilingual TTS models have recently enabled the development of high- quality TTS systems for low-resourced languages, expanding the reach of TTS technology.
Today, TTS plays a vital role in everyday life, powering virtual assistants, accessibility tools, and various digital content types.
Some current applications of text-to-speech (TTS) technology include:
Assistive technology for the visually impaired: TTS systems help blind and visually impaired individuals by reading text from books, websites, and other sources, converting it into audible speech.
Learning tools: TTS systems are used in computer-aided learning programs, aiding language learners and students with reading difficulties or dyslexia by providing auditory reinforcement.
Voice output communication aids: TTS technology assists individuals with severe speech impairments by enabling them to communicate through synthesized speech.
Public transportation announcements: TTS provides automated announcements for passengers on buses, trains, and other public transportation systems.
E-books and audiobooks: TTS systems can read electronic books and generate audiobooks, making content accessible to a broader audience.
Entertainment: TTS technology is utilized in video games, animations, and other forms of multimedia entertainment to create realistic and engaging voiceovers.
Email and messaging: TTS systems can read emails, text messages, and other written content aloud, helping users stay connected and informed.
Call center automation: TTS is employed in automated phone systems, allowing users to interact with voice-activated menus and complete transactions through spoken commands.
Virtual assistants: TTS is a crucial component of popular voice-activated virtual assistants like Apple's Siri, Google Assistant, and Amazon's Alexa, enabling them to provide spoken responses to user queries.
Voice search applications: By integrating TTS with speech recognition, users can use speech as a natural input method for searching and retrieving information through voice search apps.
In conclusion, TTS technology has come a long way since its inception, with continuous advancements in algorithms, machine learning, and deep learning techniques As a result, TTS systems now provide more natural-sounding and intelligible speech, enhancing the user experience across various applications such as assistive technology, learning tools, entertainment, virtual assistants, and voice search The ongoing development and integration of TTS into our daily lives will continue to shape the future of human-computer interaction and digital accessibility.
The architecture of a TTS system is generally composed of several components, as depicted in Figure 1.1. The Text Processing component is responsible for preparing the input text for speech synthesis. The G2P Conversion component converts the written words into their corresponding phonetic representations. The Prosody Modeling component adds appropriate intonation, duration, and other prosodic features to the phonetic sequence. Lastly, the Speech Synthesis component generates the speech waveform based on the parameters derived from the fully tagged phonetic sequence [2].

Figure 1.1 Basic system architecture of a TTS system [22]

Text processing is crucial for identifying and interpreting all textual or linguistic information that falls outside the realms of phonetics and prosody. Its primary function is to transform non-orthographic elements into words that can be spoken aloud. Through text normalization, symbols, numbers, dates, abbreviations, and other non-orthographic text elements are converted into a standard orthographic transcription, facilitating subsequent phonetic conversion. Additionally, analyzing whitespace, punctuation, and other delimiters is vital for determining document structure and providing context for all subsequent steps. Certain text structure elements may also directly impact prosody. Advanced syntactic and semantic analysis can be achieved through effective text-processing techniques [2, p. 682].

The phonetic analysis aims to transform orthographic symbols of words into phonetic representations, complete with any diacritic information or lexical tones present in tonal languages. Although future TTS systems might rely on word-sounding units and possess increased storage capacity, homograph disambiguation and grapheme-to-phoneme (G2P) conversion for new words remain essential for accurate pronunciation of every word. G2P conversion is relatively straightforward in languages with a clear relationship between written and spoken forms. A small set of rules can effectively describe this direct correlation, which is characteristic of phonetic languages such as Spanish and Finnish. Conversely, English is not a phonetic language due to its diverse origins, resulting in less predictable letter-to-sound relationships. In these cases, employing general letter-to-sound rules and dictionary lookups can facilitate the conversion of letters to sounds, enabling the correct pronunciation of any word [2, p. 683].
In TTS systems, prosodic analysis involves examining prosodic features within the text input, such as stress, duration, pitch, and intensity. This information is then utilized to generate more natural and expressive speech. Prosodic analysis helps determine the appropriate stress, intonation, and rhythm for the synthesized speech, resulting in a more human-like output. Predicting prosodic features can be achieved through rule-based or machine-learning methods, including acoustic modeling and statistical parametric speech synthesis. By adjusting the synthesized speech, TTS systems can convey various emotions or speaking styles, enhancing their versatility and effectiveness across diverse applications.

Speech synthesis employs the predicted information from the fully tagged phonetic sequence to generate the corresponding speech waveform. Broadly, the two traditional speech synthesis techniques are concatenative and source/filter synthesizers. Concatenative synthesizers assemble pre-recorded human speech components to produce the desired utterance. In contrast, source/filter synthesizers create synthetic voices using a source/filter model based on a parametric description of speech. The first method requires assistance in generating high-quality speech from the input text's parametric representation and speech parameters, while the second approach requires a combination of algorithms and signal processing adjustments to ensure smooth and continuous speech, particularly at junctures.
Several improvements have been proposed for high-quality text-to-speech (TTS) systems, drawing from the two fundamental speech synthesis techniques. Among the most prominent state-of-the-art methods are statistical parametric speech synthesis and unit selection techniques, which have been the subject of extensive debate among researchers in the field.
With the advancement of deep learning, neural network-based TTS (neural TTS) systems have been proposed, utilizing (deep) neural networks as the core model for speech synthesis. A neural TTS system comprises three fundamental components: a text analysis module, an acoustic model, and a vocoder. As illustrated in Figure 1.2, the text analysis module transforms a text sequence into linguistic features, the acoustic model then generates acoustic features from these linguistic features, and finally the vocoder synthesizes the waveform from the acoustic features.
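To make the data flow concrete, the following minimal Python sketch mirrors the three components with toy placeholder functions; the actual models and their feature formats are far more complex than this illustration, and the placeholder outputs here are random.

```python
# Toy illustration of the neural TTS data flow: text -> linguistic features
# -> acoustic features (mel frames) -> waveform. All three stages are
# placeholders; only the interfaces between the components are meaningful here.
import numpy as np

def text_analysis(text: str) -> list[str]:
    # placeholder text analysis: lowercase and split into pseudo-phonemes
    return list(text.lower().replace(" ", ""))

def acoustic_model(linguistic_features: list[str]) -> np.ndarray:
    # placeholder acoustic model: one random 80-dimensional mel frame per unit
    return np.random.rand(len(linguistic_features), 80)

def vocoder(acoustic_features: np.ndarray) -> np.ndarray:
    # placeholder vocoder: 256 waveform samples per mel frame
    n_frames = acoustic_features.shape[0]
    return np.random.uniform(-1.0, 1.0, size=n_frames * 256)

waveform = vocoder(acoustic_model(text_analysis("xin chào")))
print(waveform.shape)
```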
1.1.3 Evolution of TTS methods over time
EMULATING OF THE MUONG TTS BASED ON INPUT TRANSFORMATION OF THE VIETNAMESE TTS
Proposed method
The idea of the emulating approach for TTS is based on the phonetic relation between the base language (BL) and the target language (TL). The work of building an emulating TTS for an unsupported language includes the following tasks:
Choosing a BL which is linguistically close to the TL.
Proposing an orthography mapping between BL and TL, based on the phonetic similarity between the two languages.
Building the emulating TTS for the TL by applying the phonetic mapping to the available TTS of the BL.
This section will present our work in building an emulating TTS for Muong, the target language (TL), using Vietnamese as the base language (BL).
This work aims to find a simple and cheap way to generate Muong synthesized speech. Therefore, our approach is to develop an independent module that can convert the Muong transcript into suitable input for an available Vietnamese system. This approach allows the Muong TTS to be developed independently and to work with different Vietnamese TTS systems. Figure 3.1 shows the structure of the Muong emulating TTS system, which includes three main modules.
Figure 3.1 Emulating TTS for Muong
Initially, the Muong G2Phone Tool is used to transform the Muong script into Muong phonemes represented in the International Phonetic Alphabet (IPA). Following this, the "Emulating IPA Tool" is employed to convert the Muong phonemes into their Vietnamese counterparts. Once this conversion is complete, the Vietnamese phonemes are transcribed into Vietnamese text, which is then used as input for the Vietnamese speech synthesis system. The upcoming sections will elaborate on the rationale behind employing two different Vietnamese speech synthesis systems.
The G2P (grapheme-to-phoneme) module converts written text into the phonemes of a specific language. The transition from the Muong Hoa Binh script to Muong Hoa Binh phonemes is demonstrated in Figure 3.2. To develop this G2P module, we utilize the open-source vPhon [123] and gather statistics about the phonemes and tones that differ between Vietnamese and Muong.
Each character (or series of characters) is mapped to a specific phoneme (or phoneme string) in a process known as Character-Phone Mapping. Following this, the rules for constructing Muong phonemes based on the language's writing rules are applied. This approach enables us to create G2P transformation functions for the Muong language.
In more detail, for the graphemes in the Muong script that do not appear in Vietnamese, we generate corresponding phonemes for each of these unique graphemes. For example, for additional onsets, we introduce four Muong phonemes (hr, kl, tl, w), and for additional codas, we add two Muong phonemes (l, w). Furthermore, we omit the tone 6 component because it is not present in our research data or in the Muong script, as mentioned in Table 2.13. The detailed content is described in Table A.6 in the Appendix.
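A minimal sketch of such a rule-based G2P is shown below. The grapheme-to-IPA entries and the tone numbering are a small illustrative subset derived from the examples in Table 3.1, not the full mapping implemented in the actual tool.

```python
# Illustrative rule-based G2P: Muong script -> IPA phonemes with a numeric tone.
# Only a handful of mapping entries are shown; the real tool covers the full
# Muong grapheme inventory and the additional onsets/codas mentioned above.
import unicodedata

ONSETS = {          # grapheme -> IPA (longest match applied first)
    "tl": "tl", "kl": "kl", "hr": "hr", "ph": "pʰ", "th": "tʰ",
    "kh": "x", "ch": "c", "nh": "ɲ", "ng": "ŋ",
    "b": "ɓ", "c": "k", "h": "h", "t": "t", "m": "m", "w": "w",
}
RHYMES = {"a": "a", "ăn": "ăn", "o": "ɔ", "ơm": "əːm", "ôông": "oːŋ"}
TONE_MARKS = {"\u0300": "2", "\u0301": "3", "\u0309": "4", "\u0303": "6", "\u0323": "5"}

def syllable_to_ipa(syllable: str) -> str:
    """Convert one Muong syllable into an IPA string ending with a tone digit."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone, base = "1", ""                      # tone 1 (level) when no mark is found
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]             # tone diacritics are stripped off
        else:
            base += ch
    base = unicodedata.normalize("NFC", base)
    onset_ipa, rest = "", base
    for length in (2, 1):                     # longest-match-first for the onset
        if base[:length] in ONSETS:
            onset_ipa, rest = ONSETS[base[:length]], base[length:]
            break
    return onset_ipa + RHYMES.get(rest, rest) + tone

print(" ".join(syllable_to_ipa(s) for s in "ho ăn cơm".split()))  # hɔ1 ăn1 kəːm1
```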
Table 3.1 Examples of Muong G2P results
Vietnamese sentence | Muong text | Muong phonemes (IPA)
Tôi ăn cơm | Ho ăn cơm | hɔ1 ʔan1 kəːm1
Tôi nghe thấy tiếng kêu bên kia sông | Ho iếng e thiếng đi ớ bên các khôông đớ | hɔ1 ʔiəŋ3 ʔɛ1 thiəŋ3 ɗi1 ʔəː3 ɓen1 kaːk3 xoːŋ1 ɗəː3
Cha tôi vừa làm xong một ngôi nhà mới | Bác ho mới là xong một cái nhà mới | ɓaːk3 hɔ1 məːj3 laː2 sɔŋ1 mot5 kaːj3 ɲaː2 məːj3
Chị tôi đang nấu cơm trong bếp | Máng cải ho tang nố cơm ở tlong pếp | maːŋ3 kaːj4 hɔ1 taːŋ1 no3 kəːm1 ʔəː4 tlɔŋ1 pep3
Mẹ tôi cho tôi hai quả chuối | Máng ho cho ho hal tlái chuối | maːŋ3 hɔ1 cɔ1 hɔ1 haːl1 tlaːj3 cuəj3
Table 3.1 gives some examples of Muong G2P results. Our tool converted all cases, including phonemes that exist in Muong but not in Vietnamese.
Based on the phonetic comparisons in the previous section, transformation rules are proposed for mapping from Muong orthography to Vietnamese orthography that can be read by a Vietnamese TTS. For the equivalent and close cases, the transformation rules are simple replacements of Muong items by Vietnamese items, as listed in Table A.5. For the distinct cases, it is impossible to transform these Muong items into Vietnamese items; therefore, those cases are not considered in this study and will be dealt with in future work. Table 3.2 shows examples of applying the transformation rules to convert Muong text into input text for Vietnamese TTS.
Table 3.2 Examples of applying transformation rules to convert the Muong text into input text for Vietnamese TTS
Muong text | Emulating text for Vietnamese TTS | Meaning
… | … | 'I'm studying'
Ho phải za ty dộng bầy? | Ho phải da ty dộng bầy? | 'I'm with you go out?'
Nhà za chiếm từ cúi chăng? | Nhà da chiếm từ cúi chăng? | 'Your house has many pigs?'
Our approach for developing this module is quite straightforward; it is based on the close relationship and minor phonetic differences between the two languages, Vietnamese and Muong. Muong IPA is converted to Vietnamese IPA using a rule base and a phone mapping table. Muong phonemes, including onsets, medials, nuclei, codas, and tones, are mapped to the corresponding Vietnamese phonemes. Phonemes present in Muong but not in Vietnamese are mapped to another Vietnamese phoneme with a similar pronunciation. The phone mapping table is described in Table A.6 in the Appendix.
After obtaining the Vietnamese G2P module, we refer to James Kirby's vPhon and generate the corresponding IPA phoneme strings for about 7,000 Vietnamese words. We then build a Phoneme-to-Grapheme (P2G) dictionary of about 7,000 Vietnamese words, where the key is the string of Vietnamese IPA phonemes and the value is the orthography in Vietnamese. With this, we can translate Muong phonemes into Vietnamese phonemes and a Muong phoneme string into Vietnamese words.
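The sketch below illustrates the chain Muong IPA → Vietnamese IPA → Vietnamese orthography. Both dictionaries are tiny illustrative stand-ins for Table A.5/A.6 and the 7,000-word P2G dictionary, and the specific phoneme substitutions shown are examples only, not the exact rules adopted in the thesis.

```python
# Sketch of the Emulating IPA + P2G step: substitute Muong-only phonemes with
# Vietnamese ones, then look the resulting syllable up in a phoneme-to-grapheme
# dictionary. Both tables here are small illustrative excerpts.
MUONG_TO_VI_IPA = {"tl": "tɕ", "hr": "h", "kl": "k"}   # example substitutions only
P2G = {"hɔ1": "ho", "ʔan1": "ăn", "kəːm1": "cơm", "tɕuəj3": "chuối"}

def split_phonemes(syllable: str) -> list[str]:
    # naive split: try two-character phonemes first, then single characters
    i, out = 0, []
    while i < len(syllable):
        step = 2 if syllable[i:i + 2] in MUONG_TO_VI_IPA else 1
        out.append(syllable[i:i + step])
        i += step
    return out

def emulate(muong_ipa_sentence: str) -> str:
    words = []
    for syllable in muong_ipa_sentence.split():
        mapped = "".join(MUONG_TO_VI_IPA.get(p, p) for p in split_phonemes(syllable))
        words.append(P2G.get(mapped, mapped))  # fall back to IPA when the syllable is unseen
    return " ".join(words)

print(emulate("hɔ1 ʔan1 kəːm1"))   # -> "ho ăn cơm"
print(emulate("tlaːj3"))           # the tl onset is replaced before the P2G lookup
```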
Experiment
For the Vietnamese TTS system, thanks to the independent module structure, the system can be applied to different Vietnamese TTS systems. At the time I began working on this thesis in 2016, there were two main approaches in speech synthesis: the unit selection approach and the statistical parametric approach (HMM/DNN). The available Vietnamese TTS systems also follow these two approaches, for example VOS TTS (Voice Of Southern Vietnam) 4, MICA TTS 5, vnSpeak 6 (unit selection technique), and VAIS TTS 7, OpenFPT TTS 8 (statistical parametric technique). According to [124], the unit selection approach has the advantage of producing high quality at the waveform level because it concatenates speech waveforms directly. This technique is also easy to implement [125] and has been researched over a long history. By contrast, statistical parametric approaches, which generate the average of some set of similarly sounding speech segments [124], cannot be compared with unit selection in producing a natural voice [126]. For testing the Muong emulating TTS, we chose Vietnamese TTS systems of both techniques in order to examine whether the TTS technique affects the sound quality of the Muong synthesized speech. We also chose Vietnamese TTS systems offered as web services, which are convenient to use and to pair with the module. Finally, two TTS web services for Vietnamese, both of which support generating Hanoi Vietnamese speech, were chosen for our experiment: the MICA TTS service (unit selection technique - TTS1) [127]–[129] and OpenFPT TTS (statistical parametric technique - TTS2). The emulating results of the built system with the two Vietnamese TTS systems are tested in a later section.
4 http://www.ailab.hcmus.edu.vn/
5 http://mica.edu.vn/vova/
7 https://vais.vn/en/text-to-speech-service/
8 http://ngtts.stis.vn/#/demo
In this experiment, we aim to create a Muong Hoa Binh TTS using the emulating approach from the Vietnamese TTS, as previously described. We have selected two Vietnamese TTS systems as the foundation for our experiment, considering their compatibility with the emulating approach and the specific requirements of our project.
As for the data, we do not need to train the system, so we focus on gathering test data. This test data comprises 15 sample texts collected from Muong Hoa Binh documents and 15 basic communication sentences rewritten in the Muong Hoa Binh script.
The method for this experiment involves the following steps: First, we input the collected data into the constructed system, which consists of the Muong G2P and Muong Emulating IPA modules. Next, we generate the fake input Vietnamese text using these modules, and finally, we input the generated text into the two Vietnamese TTS systems for synthesis.
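The following sketch shows how this pipeline could be driven end to end, sending the generated emulating text to a Vietnamese TTS web service. The endpoint URL and its request/response format are hypothetical placeholders; the MICA and OpenFPT services used in the experiment each have their own interfaces.

```python
# Hypothetical driver for the experiment: the emulating text (already produced
# by the Muong G2P and Emulating IPA modules) is posted to a Vietnamese TTS
# web service and the returned audio is stored for the perceptual test.
import requests

def synthesize(emulating_text: str, out_path: str,
               endpoint: str = "https://example.org/vietnamese-tts") -> None:
    response = requests.post(endpoint, json={"text": emulating_text}, timeout=60)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)        # assume the service returns audio bytes

test_sentences = ["Ho phải da ty dộng bầy?", "Nhà da chiếm từ cúi chăng?"]
for i, sentence in enumerate(test_sentences):
    synthesize(sentence, f"muong_emulated_{i:02d}.wav")
```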
The testing material was designed to examine the transformation rules proposed in the section above. The test data is therefore divided into three groups:
Group 1 – Emulating tones testing: The goal of this test is to examine if the proposed Vietnamese tones can effectively "fake" the Muong tones. Five Muong tones are set within five syllables, each having a simple structure (Consonant-Vowel). These syllables are then placed within the sentence "Tứa có ưa chăng?" as shown in Table 3.3. After listening to the sentence, the listener is not required to write down the entire sentence but only the word containing the tone in question.
Table 3.3 Testing material for emulating tone
Muong tone | Vietnamese tone for emulating | Containing sentence (in Muong) | IPA | Emulating text | Vietnamese meaning
… | A1 – Level | ka | kaA1 | ca | gà
… | A2 – Mid falling | mè | mɛA2 | mè | mè
… | B1 – Rising | ná | naB1 | ná | nỏ
… | C1 – Low falling <Hỏi> | tẻ | tɛC1 | tẻ | đẻ
Falling | B2 – Low glottalized | mệ | meB2 | mệ | mẹ
Group 2 – Phone closed testing: In this test, five equivalent closed phonemes are assessed within the sentence "Tứa có ưa chăng." The listener is required to write down the words they have just heard and to evaluate the overall sentence quality. This method helps in understanding the effectiveness of the proposed approach for handling closed phonemes in synthesized speech.
Table 3.4 Testing material for emulating phone (the concerning phonemes in bold)
Muong word | Muong IPA | IPA transcription | Emulating text | Vietnamese meaning
bang | baŋ1 | ɓaŋA1 | bang | con hoẵng
cha | ca1 | tɕaA1 | cha | vườn
gế | ge4 | ɣeB1 | ghế | ghế
kha | kʰa1 | xaA1 | kha | vợt bắt cá
phui | phui1 | fuiA1 | phui | vui
Group 3 – General testing: For the remaining phonemes, a set of 5 sentences was prepared; Muong listeners were asked to write down the sentences they had just heard and to evaluate the quality of each sentence.
Fifteen sentences from the three groups above were set as input for the emulating TTS system for Muong with the 2 Vietnamese TTS systems (as mentioned in the section above). The total output is 30 synthesized utterances (15 sentences x 2 TTS techniques). These utterances are stored as audio files to be used in the perceptual test.
Table 3.5 Testing material for remaining phonemes
Muong sentence | Muong IPA | IPA transcription | Emulating text | Vietnamese sentence
Chú mua của oi? | cu4 muə1 cuə3 ɔi1 | cuB1 muəA1 cuəC1 ɔiA1 | Chú mua của oi? | Anh mua của ai?
Ho tang cúm lọ | hɔ1 taŋ1 cum4 lɔ5 | hɔA1 taŋA1 cumB1 lɔB2 | Ho tang cúm lọ | Tôi đang sẩy lúa
Cải chi ni? | kai3 ci1 ni1 | kaiC1 ciA1 niA1 | Cải chi ni? | Cái gì đây
Da bí thía nó à? | da1 bi4 tʰia4 nɔ4 a2 | daA1 biB1 tʰiəB1 nɔB1 aA2 | Da bí thía nó à? | Mày bị làm sao thế?
Ở lái ăn cơm hái | ɤ3 lai4 ăn1 kɤm1 hai4 | ɤC1 laiB1 ănA1 kɤmA1 haiB2 | Ở lái ăn cơm hái | Ở lại ăn cơm nhé
The testing protocol followed the testing method for synthesized speech proposed by the ITU [130]. The test was designed for two purposes:
Evaluating the intelligibility of the Muong emulating speech: whether the listener can understand precisely the content of the test sentence.
Evaluating the quality of the synthesized sound: how the listener judges the naturalness of the Muong emulating speech.
The evaluation was carried out with 50 native speakers of Muong Hoa Binh, ensuring gender balance within the cohort with an equal number of 25 male and 25 female participants. The average age of the participants was 23.33 years. Additionally, the cohort displayed a balanced educational background: half of them, 25 individuals, held university degrees, while the rest had attained high school level education.
To conduct the testing, each participant was asked to listen to each test sentence one to three times. Following this auditory assessment, the listeners were tasked with two primary actions: writing down the words they heard and rating the overall quality of the synthesized voice.
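As an illustration, the two measures could be aggregated as below. The response format is an assumption made for this sketch; the thesis's actual answer sheets and scoring procedure follow the ITU recommendation cited above.

```python
# Toy aggregation of the perceptual test: word-level intelligibility (share of
# target words the listener wrote down correctly) and the Mean Opinion Score
# (average of 1-5 quality ratings). The response records below are made-up examples.
from statistics import mean

def intelligibility(target: str, transcribed: str) -> float:
    target_words = target.lower().split()
    heard = transcribed.lower().split()
    return sum(w in heard for w in target_words) / len(target_words)

responses = [
    {"target": "tứa có ưa chăng", "transcribed": "tứa có ưa chăng", "mos": 4},
    {"target": "tứa có ưa chăng", "transcribed": "tứa có màng chăng", "mos": 3},
]

print("intelligibility:", mean(intelligibility(r["target"], r["transcribed"]) for r in responses))
print("MOS:", mean(r["mos"] for r in responses))
```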
Conclusion
The results indicate that the Muong emulating synthesis system is generally intelligible, although there are instances where improvements could be made. Some synthesized speech could be more accurate, but overall, the majority of the synthesized speech is understandable for listeners. Participants noticed that the emulating voice sounds similar to Vietnamese and lacks the intonation characteristic of some Muong dialects. This study closely followed the new writing system of Muong Hoa Binh, and the obtained results demonstrate that, due to the similarity of the phoneme systems, the fake-approach-based synthetic voices are understandable and perceptible. The Muong test participants' voice quality scores (MOS test) are relatively high. The study also employed two high-quality speech synthesis systems, TTS1 and TTS2. Experimental results show that the TTS1 system scores higher in some cases, while in other cases the TTS2 system scores higher. In general, both TTS systems produced satisfactory results. This outcome suggests that researchers can continue to explore this development direction to address more complex issues in the emulation technique.
In 2016, finding native Muong speakers was challenging. The study collaborated with seven Muong volunteers to evaluate the system. Comparing recorded Muong speech and synthesized signals from TTS1 and TTS2 validated the research. This approach can be applied to create TTS for other Vietnamese ethnic minority languages. Further details are given in subsequent chapters, and some results were presented at the FAIR 10 conference.
Chapter 4 Cross-lingual transfer learning for Muong speech synthesis
As touched upon in Chapter 1 and subsequently revisited in the proposal segment of Chapter 2, one promising method to construct a Text-to-Speech (TTS) system for low-resourced languages involves harnessing the power of transfer learning from a pre-trained model of a language that boasts abundant resources. This technique proves especially beneficial when the two languages share a close relationship.
The main objective of this section of the thesis is to evaluate the effectiveness of implementing and optimizing the transfer learning technique in the construction of a TTS system for the Muong language, with a particular focus on the Hoa Binh dialect. As discussed in Section 1.2.3, transfer learning has demonstrated potential for adaptation to new domains. In the context of the Vietnamese-Muong language relationship, this raises intriguing questions that our research intends to address.
Is it feasible to consider Muong, specifically the Muong Hoa Binh dialect, as a new domain for a Vietnamese TTS system? If so, what volume of data is appropriate for tuning this model, and what strategy should be employed to achieve this? Furthermore, we consider the potential of this tuning process to address certain limitations found in the emulation approach. Specifically, can this tuning process help to accurately reproduce sounds unique to the Muong language that are not present in Vietnamese?
We propose a strategic approach to answer these compelling questions, including rigorous experimentation and a thorough evaluation process. The subsequent sections will delve into these aspects, providing a detailed narrative on the selection and preparation of the training data, the challenges posed by the linguistic differences between Vietnamese and Muong, and how we gauge the TTS system's performance following the transfer learning application.
In our specific context, the low-resourced language is Muong Hoa Binh, while Vietnamese is the language rich in resources. In this chapter, our main focus is the exploration of this transfer learning strategy to construct a TTS system for the Hoa Binh Muong language, capitalizing on the knowledge from a pre-existing Vietnamese TTS model.
Proposed method
Transfer learning is a crucial technique in the fields of machine learning and deep learning. It allows a model to learn from previously acquired knowledge and apply it to a new task, often with limited training data. An essential aspect of transfer learning is the use of pretrained models, and a key concept in the transfer of knowledge between models is "model weights."
Model weights refer to the parameters of a neural network in a machine learning model. They include matrices and vectors that encode how the model represents data and performs predictions. In deep learning, neural networks typically have millions of model weights. These weights are adjusted and updated during training to enable the model to learn how to represent data and make predictions.
Transfer Learning Using Pretrained Models:
In transfer learning, pretrained models, i.e., models that have been trained on a similar or related task, are employed. These models have already learned how to represent data and perform a specific task. The weights of pretrained models contain knowledge about how to represent data for that task.
Benefits of Using Pretrained Model Weights:
Better Initialization: Using pretrained model weights allows a new model to start from a more favorable position in parameter space. Instead of initializing with random weights, the model has already "learned" from a prior task and can make initial predictions more effectively.
Time and Resource Savings: Training a model from scratch often requires a substantial amount of data and time. Using pretrained models can save time and resources, especially in situations where training data is limited.
Leveraging Shared Knowledge: Pretrained model weights contain shared knowledge on how to represent data and make predictions in a specific domain. They can be utilized and fine-tuned to suit a new task, helping the model learn how to perform that task based on shared knowledge.
Multi-Task Knowledge Integration: Transfer learning enables the integration of knowledge from multiple previous tasks into a new model. This can enhance generalization and automation for various applications.
In summary, model weights contain valuable information on how to represent data and make predictions. Employing transfer learning and pretrained model weights is an efficient way to leverage knowledge learned from previous tasks and apply it to new tasks, particularly in situations with limited training data.
In this chapter, we train a Tacotron 2 model on Vietnamese data, which we refer to as the pretrained model. Then, the Tacotron 2 model is fine-tuned on Muong language data. During the fine-tuning process, all model weights are updated with a smaller learning rate than when training on Vietnamese data, decreased from 1e-3 to 1e-4. Using a learning rate that is too high during fine-tuning can lead to overshooting and failure to converge; this can happen when the model weights are adjusted too quickly, causing oscillations and instability during the convergence process. Therefore, when fine-tuning a deep learning model, we need to decrease the learning rate to ensure stable and effective convergence. Reducing the learning rate decreases the magnitude of the weight updates, making fine-tuning slower but more stable, thereby improving the accuracy of the model.
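A minimal PyTorch sketch of this two-stage procedure is shown below. The tiny acoustic model stands in for Tacotron 2 (which is far larger), and the batches are random dummy data; only the mechanics of reloading the pretrained weights and lowering the learning rate from 1e-3 to 1e-4 are illustrated.

```python
# Transfer-learning mechanics: pretrain on Vietnamese with lr=1e-3, save the
# weights, reload them, and fine-tune on Muong with lr=1e-4.
import torch
from torch import nn

class TinyAcousticModel(nn.Module):
    def __init__(self, n_phonemes: int = 44, mel_dim: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, 128)   # shared Vietnamese/Muong IPA inventory
        self.proj = nn.Linear(128, mel_dim)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(phoneme_ids))

model = TinyAcousticModel()
criterion = nn.MSELoss()

# Stage 1: pretraining on Vietnamese (one dummy step shown).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
vi_phonemes, vi_mels = torch.randint(0, 44, (8, 50)), torch.randn(8, 50, 80)
optimizer.zero_grad()
criterion(model(vi_phonemes), vi_mels).backward()
optimizer.step()
torch.save(model.state_dict(), "pretrained_vietnamese.pt")

# Stage 2: fine-tuning on Muong, starting from the Vietnamese weights.
model.load_state_dict(torch.load("pretrained_vietnamese.pt"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # reduced from 1e-3
mu_phonemes, mu_mels = torch.randint(0, 44, (8, 50)), torch.randn(8, 50, 80)
optimizer.zero_grad()
criterion(model(mu_phonemes), mu_mels).backward()
optimizer.step()
```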
Figure 4.1 Low-resourced L2 TTS transfer learning from rich resource L1
Because Vietnamese and Muong are two closely related languages, their phonetic representations are similar, differing only in some phonemes such as certain initial and final consonants and tones, as already discussed in Chapter 2. The input representation for the Tacotron 2 model is therefore a phoneme representation, built from the combined International Phonetic Alphabet (IPA) inventories of the Vietnamese and Muong languages, which serves as a shared input phoneme representation for the Tacotron 2 model.
Another difference in our transfer learning approach compared to the original Tacotron 2 paper is the vocoder we used: instead of the WaveNet model, we used the HiFi-GAN model. WaveNet is a deep learning model for audio synthesis that predicts each audio sample from the previous audio samples, using an autoregressive architecture based on stacks of dilated causal convolutions to generate high-quality audio. However, because WaveNet is more complex and deeper than HiFi-GAN, it requires more time and resources to train and to synthesize audio.
The speech synthesis model we used is described in Figure 4.2. There are two differences compared to the Tacotron 2 model in the original paper: the input representation uses phonemes instead of characters, and the HiFi-GAN vocoder is used instead of the WaveNet network. The modules of the model, such as the encoder, decoder, vocoder, and attention, are kept unchanged in terms of architecture as well as their parameters.
Figure 4.2 Block diagram of the speech synthesis system architecture
Details of the model parameters are described in the following table:
Table 4.1 Parameters of acoustic model
Att_location_num_filters: 32
Att_location_kernel_size: 31
Postnet: …
The key factors related to the complexity of Tacotron 2, based on specific information about the model, are:
Number of Parameters: Tacotron 2 has approximately 28.2 million parameters. These parameters include weights and biases in the encoder, decoder, and other components of the model. The large number of parameters indicates a significant computational load during both training and inference.
At the same time, utilizing the HIFIGAN Vocoder model (~0.92 million parameters) as a replacement for the WaveNet Vocoder (~13.3 million parameters) significantly enhances both training and inference speed.
Architecture: Tacotron 2's architecture includes multiple components:
Encoder: The encoder consists of convolutional layers followed by LSTM layers. Typically, Tacotron 2 uses 3 convolutional layers and 1 bidirectional LSTM layer in the encoder.
Decoder: The decoder includes LSTM layers for generating mel spectrograms from the encoder's output. It usually has 2 LSTM layers, each with a dimension of 1024.
Attention Mechanism: Tacotron 2 employs an attention mechanism to focus on relevant parts of the input sequence during decoding. The attention mechanism has a dimension of 128, and it also uses convolutional layers with 32 filters and a kernel size of 31 for location-based attention.
Postnet: The postnet, used for refining mel spectrograms, has 5 convolutional layers with an embedding dimension of 512 and a kernel size of 5.
Input and Output Sizes: Tacotron 2 takes variable-length sequences of text as input and produces mel spectrograms as output. The length of the input text sequence and the desired output length (mel spectrogram sequence) can affect computational complexity, especially during inference.
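For reference, the figures quoted above can be collected into a single configuration summary; the field names below are our own shorthand, not identifiers from any particular Tacotron 2 implementation.

```python
# Summary of the architectural figures listed above (values taken from the text).
tacotron2_summary = {
    "total_parameters": 28_200_000,                       # ~28.2 M
    "encoder": {"conv_layers": 3, "bidirectional_lstm_layers": 1},
    "decoder": {"lstm_layers": 2, "lstm_dim": 1024},
    "attention": {"dim": 128, "location_filters": 32, "location_kernel_size": 31},
    "postnet": {"conv_layers": 5, "embedding_dim": 512, "kernel_size": 5},
    "vocoder_parameters": {"hifigan": 920_000, "wavenet": 13_300_000},
}
print(tacotron2_summary)
```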
Experiment
Firstly, regarding the training data for the pretrained model, we used approximately 20 hours of labeled Vietnamese audiobook data collected from various open websites. The audio data was obtained from NgheAudio 9, and the corresponding text data was obtained from dtruyen 10. The raw data was not divided into small segments (ranging from 1 second to under 15 seconds) with corresponding text but was rather in the form of long audio files (with an average duration of one hour) for each chapter of the story.
After considering the series selection and the voiceover based on the criteria of a clear voice, as little noise as possible, and at least 20 hours of audio, we chose Tran Van's voice reading the story Dai Mong Chu. The audio data undergoes the following processing steps after being downloaded from the internet:
Sample rate is initially set at 44100 Hz.
The audio data has a stereo channel.
Bitrate is set at 128 kb/s.
The audio data is then normalized to have a sample rate of 22050 Hz.
The audio data is changed to have a mono channel.
The codec used is pcm_s16le.
The downloaded data has a significantly long duration, with each file lasting approximately one hour. To generate input for the acoustic model, the original audio files were segmented into smaller units based on signal segments containing no voice, commonly referred to as silence. The length of each segment ranges from 1 second to 10 seconds. This process resulted in a total of around 19,000 sentences. The duration distribution over the entire dataset after slicing into segments is shown in Figure 4.3.
9 https://www.ngheaudio.org/truyen-audio-dai-mong-chu
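The preparation steps above could be realized roughly as follows, using ffmpeg for the format conversion and pydub for the silence-based segmentation. The silence thresholds are illustrative choices; the thesis does not specify the exact values used.

```python
# Audio preparation sketch: convert each chapter-long file to 22.05 kHz mono
# 16-bit PCM, then split it at silences and keep only 1-10 second segments.
import subprocess
from pydub import AudioSegment
from pydub.silence import split_on_silence

def normalize(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "22050", "-ac", "1", "-c:a", "pcm_s16le", dst],
        check=True,
    )

def segment(wav_path: str, out_prefix: str) -> None:
    audio = AudioSegment.from_wav(wav_path)
    chunks = split_on_silence(
        audio,
        min_silence_len=400,                 # ms of silence that triggers a cut (assumed)
        silence_thresh=audio.dBFS - 16,      # relative silence threshold (assumed)
        keep_silence=150,
    )
    for i, chunk in enumerate(chunks):
        if 1000 <= len(chunk) <= 10000:      # keep segments between 1 s and 10 s
            chunk.export(f"{out_prefix}_{i:05d}.wav", format="wav")

normalize("chapter_001.mp3", "chapter_001.wav")
segment("chapter_001.wav", "daimongchu_ch001")
```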
Figure 4.3 shows that segments with lengths from 1 s to 6 s make up most of the Vietnamese audio dataset. The processing steps for the dataset of clipped audio segments are as follows:
The dataset contains a mix of audio segments, including a small portion of segments with background noise (music, ambient noise) and mostly segments with only the reader's voice.
To filter out audio segments with background noise, the open-source inaSpeechSegmenter is used. This results in a selection of clean audio tracks that contain only the voice of the storyteller and no background noise.
The selected audio segments are then labeled using an open-source Vietnamese Automatic Speech Recognition (ASR) model (Whisper), with a WER of about 10% on Vietnamese, to obtain relatively accurate labels for each audio segment.
To correct any predicted label errors, the Levenshtein distance algorithm [131] is employed to calculate the distance between two strings. Additionally, listening to the beginning and end of each long audio file before segmentation is used to limit the text space for comparison.
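A compact sketch of the labeling step is given below, assuming the openai-whisper and python-Levenshtein packages; restricting the candidate sentences per chapter (as described above) is simplified here to a flat list.

```python
# Automatic labeling sketch: transcribe a segment with Whisper, then choose the
# closest reference sentence from the chapter text by Levenshtein distance.
import whisper
import Levenshtein

asr = whisper.load_model("medium")           # any multilingual Whisper checkpoint

def label_segment(wav_path: str, candidate_sentences: list[str]) -> str:
    hypothesis = asr.transcribe(wav_path, language="vi")["text"].strip().lower()
    return min(candidate_sentences,
               key=lambda ref: Levenshtein.distance(hypothesis, ref.lower()))

candidates = ["một câu trong chương truyện", "một câu khác trong chương truyện"]
print(label_segment("daimongchu_ch001_00001.wav", candidates))
```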
Total duration: 19 hours 58 minutes 30 seconds
The information on the entire audiobook dataset is described in Table 4.2. The number of distinct phones is 44, including phones that represent silence (sil), end-of-sentence (eos), and padding used for shorter sentences within a batch during training. The remaining 41 phonemes are represented in IPA format shared by the Vietnamese and Muong languages, as listed in Table A.6 in the Appendix.
The Vietnamese text database was built before recording the Muong speech database. The selected domain is news. To ensure the quality of the translation, the Vietnamese text has to balance phonemic and lexical distribution as in reality. Nearly 4 million Vietnamese text sentences were collected from the general Vietnamese news field (Vietnamnet, Dantri, etc.), and around 900,000 Vietnamese text sentences from Muong local news publishers (Hoa Binh newspaper 11, Phu Tho newspaper 12). The local news data helps to collect words commonly used in the Muong regions. The raw data is then preprocessed: sentences are separated, short sentences (under 5 tokens) and long sentences (over 120 tokens) are removed, and non-Vietnamese sentences (containing more than 50% foreign-language words) are removed. The original text collection contained around 4.9 million sentences and was considered representative of the real word distribution in the domain. A random extraction algorithm that preserves the phonemic and syllable distribution of the original text collection was applied to extract a set of 20,000 sentences. These 20,000 sentences are divided into 2 sets: one consists of 5,000 sentences extracted from the 900,000 sentences of Muong newspapers, and the other includes 15,000 sentences extracted from the 4 million sentences of general newspapers.
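The filtering rules just described (5 to 120 tokens, rejection of sentences with more than 50% foreign-language words) could be sketched as follows; the Vietnamese-character test is a simplified assumption and ignores digits and tone normalization.

```python
import re

# Characters that may appear in Vietnamese words (simplified assumption).
VIETNAMESE_CHARS = re.compile(
    r"^[a-zàáạảãâầấậẩẫăằắặẳẵèéẹẻẽêềếệểễìíịỉĩòóọỏõôồốộổỗơờớợởỡ"
    r"ùúụủũưừứựửữỳýỵỷỹđ]+$", re.IGNORECASE)

def keep_sentence(sentence: str) -> bool:
    """Apply the length and foreign-word filters described above."""
    tokens = sentence.split()
    if not (5 <= len(tokens) <= 120):
        return False
    foreign = sum(1 for t in tokens
                  if not VIETNAMESE_CHARS.match(t.strip(".,;:!?\"'")))
    return foreign / len(tokens) <= 0.5   # reject sentences with >50% foreign tokens
```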
Vietnamese text data is normalized using the Vietnamese normalization toolkit. All sentences are converted into readable word representations in Vietnamese, leaving no numeric representations, special characters, or acronyms. The speech corpus was recorded in sound-proof rooms. Four Muong native speakers, 2 males and 2 females, from 2 dialects (Muong Bi – Hoa Binh and Muong Tan Son – Phu Tho) were chosen to record the database. All speakers are Muong radio broadcasters with good, clear, and coherent voices. The speakers read each Vietnamese sentence in the collection of 5,000 sentences and then speak it in Muong. The male voices of the two dialects were used to train the system (the female voices are reserved for other phonetic studies).
Along with the recording, the speech data was processed to normalize energy and to remove noise, long pauses, pronunciation errors, and unexpected errors encountered during recording. Faulty speech was required to be recorded again, and each speaker was required to record all sentences at good quality. Audio files are recorded at a sampling rate of 44.1 kHz and are then converted to 22.05 kHz to match the system's training input. Each collection for the male voice corresponds to more than 1800 minutes of audio signal after post-processing. The Vietnamese text data was preprocessed (normalizing non-standard words: punctuation, numbers, acronyms, upper/lowercase characters, etc.) to obtain the appropriate representation of a sentence as a string of Vietnamese words. Voice data is trimmed at the beginning and end of each file, and its energy is normalized.
From the Muong language dataset of the Muong project, the Muong speech recorded by Bui Viet Cuong, a broadcaster from Hoa Binh Radio, was selected for the transfer learning implementation. The details of the recorded dataset are described in the table below:
11 http://www.baohoabinh.com.vn/en/
Dialect: Mường Bi – Hoa Binh (CauBaoMuong)
Speaker name: Bui Viet Cuong
Total duration: 4 hours 24 minutes 30 seconds
To investigate the relationship between the amount of training data and the quality of the synthesized speech output, we divided the high-quality recorded dataset into smaller training sets for fine-tuning purposes. The details of the smaller training sets are described in the table below:
Table 4.4 The Muong split data set
The training sets are constructed to maximize phoneme coverage, with sentences taken at random. Looking at the table above, we can see the total number of phonemes increasing across the M_15m, M_30m, and M_60m sets, corresponding to the datasets with durations of 15 minutes, 30 minutes, and 60 minutes, respectively.
Figure 4.4 Duration distribution across the M_15m, M_30m, and M_60m datasets.
In Figure 4.4, the durations are evenly distributed across the datasets and range from 1 to 15 seconds.
The validation set used during training is the same for all three training configurations: 50 sentences randomly taken from the Muong dataset and disjoint from the training data.
To convert Vietnamese or Muong written text into IPA phoneme sequences, we used the same method of mapping characters to phonemes, combined with the mapping rules presented in Section 3.1.1 (the Muong G2P module). The mapping tables are shown in Table A.5 in the Appendix.
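A rule-based G2P conversion of this kind can be implemented as a greedy longest-match lookup in a mapping table, as sketched below; the few entries shown are illustrative only (tones are omitted) and do not reproduce the full Table A.5.

```python
# Illustrative subset of a grapheme-to-IPA mapping table (not the full Table A.5).
G2P_TABLE = {
    "nh": "ɲ", "ng": "ŋ", "ph": "f", "th": "tʰ", "tr": "ʈ",
    "a": "a", "e": "ɛ", "ê": "e", "o": "ɔ", "ô": "o", "u": "u", "i": "i",
    "b": "ɓ", "d": "z", "đ": "ɗ", "t": "t", "m": "m", "n": "n",
}

def g2p(syllable: str) -> list[str]:
    """Greedy longest-match conversion of one written syllable to IPA phones."""
    phones, i = [], 0
    while i < len(syllable):
        for length in (2, 1):                 # try digraphs before single letters
            chunk = syllable[i:i + length]
            if chunk in G2P_TABLE:
                phones.append(G2P_TABLE[chunk])
                i += length
                break
        else:
            i += 1                            # skip characters not in the table
    return phones

# Example: g2p("than") -> ["tʰ", "a", "n"]
```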
4.2.3 Training the pretrained model using the Vietnamese dataset.
We used approximately 20 hours of Vietnamese audiobook data to train the acoustic model, which learns how to convert phoneme inputs into mel spectrogram features. The neural network optimization algorithm used for the acoustic model is Adam. The parameters of the Adam optimizer are described in the table below:
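For reference, the optimizer setup in PyTorch might look like the sketch below; the learning rate, betas, epsilon, and weight decay shown are values commonly used for Tacotron 2 training and are assumptions, not necessarily the exact values of the table.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Adam optimizer for the acoustic model; hyperparameter values are assumed."""
    return torch.optim.Adam(
        model.parameters(),
        lr=1e-3,               # initial learning rate
        betas=(0.9, 0.999),    # first/second moment decay rates
        eps=1e-6,              # numerical stability term
        weight_decay=1e-6,     # L2 regularization
    )
```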
Evaluation
With the aim of examining the effectiveness of the model when fine-tuning the pretrained model on Muong datasets of different durations, we used 50 in-domain test sentences and 50 out-of-domain test sentences. Details of the two test sets are described in the following table:
Table 4.7 The specifications of the in-domain and out-domain test sets
In-domain test set Out-domain test set
The in-domain test set is randomly selected from the recorded Muong dataset, ensuring that all phonemes are represented. The in-domain set consists of sentences collected from news sources, including newspapers, radio broadcasts, and current affairs. On the other hand, the out-of-domain test set comprises daily conversation sentences, primarily short phrases, that also cover all phonemes for a comprehensive assessment. The table below provides a few examples of sentences from both test sets.
In-domain
Sample in Vietnamese: Chủ động xây dựng kế hoạch hoạt động cho từng nội dung chuyên môn gắn với công tác thi đua khen thưởng
Sample in English: Actively develop plans for each professional content in conjunction with the emulation and commendation work
Sample in Vietnamese: Người dân vừa là người tổ chức, lãnh đạo và là lực lượng trực tiếp xung kích, đấu tranh giữ gìn an ninh tại cơ sở
Sample in English: People are both organizers, leaders, and the direct force to fight for maintaining security at the grassroots level
Sample in Vietnamese: Khi hỏi bí quyết học tập, Giang không ngần ngại chia sẻ, phương pháp học tập của em rất đơn giản
Sample in English: When asked about the secret to studying, Giang did not hesitate to share that her learning method is very simple
Out-of-domain
Sample in Vietnamese: Gia đình anh có khỏe không? Anh chị được mấy cháu rồi?
Sample in English: Is your family well? How many children do you and your spouse have?
Sample in Vietnamese: Gia đình anh vẫn khỏe, anh được hai cháu: một trai, một gái rồi.
Sample in English: My family is well; we have two children: a boy and a girl.
Sample in Vietnamese: Rất vui gặp anh ở đây
Sample in English: It's nice to meet you here
The Mean Opinion Score (MOS) was evaluated by a cohort of 50 Muong Hoa Binh native speakers. The cohort was balanced in terms of gender, with 25 males and 25 females participating in the study. The average age of the participants was 23.33 years. In terms of educational attainment, half of the participants (25) held university degrees, while the remaining 25 had high school diplomas.
As part of the evaluation process, each participant was instructed to listen to a total of 20 sentences, comprising two distinct sets. The first set included 10 in-domain sentences, covering topics like news, current affairs, and broadcasting. The second set consisted of 10 out-of-domain sentences, reflecting daily communication scenarios. Each set of 10 sentences was randomly selected from a larger pool of 50 test sentences to ensure a diverse representation of linguistic contexts.
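The MOS values reported below are averages over listeners and sentences. A minimal sketch of how a mean and its ± margin can be computed from the raw ratings is given here, assuming the margin is a normal-approximation 95% confidence interval, which may differ from the exact convention used.

```python
import numpy as np

def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    scores = np.asarray(ratings, dtype=float)
    mean = scores.mean()
    margin = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, margin

# Example: mos_with_ci([4, 5, 4, 3, 5, 4]) -> approximately (4.17, 0.60)
```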
For the quantitative evaluation, we use the MCD DTW 14 (Mel Cepstral Distortion with Dynamic Time Warping) score, which measures the difference between two sequences of mel cepstra. The smaller the score, the better the quality of the synthesized speech. While it is not a perfect metric for assessing synthetic speech quality, it can be useful when combined with other measures. The MCD DTW score is calculated between the synthesized audio file and the original audio file, and the final score is averaged over 50 pairs for each set.
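A sketch of how an MCD DTW score of this kind can be computed with librosa: MFCC sequences are extracted from the reference and synthesized files, aligned with DTW, and the standard MCD scaling is applied. The number of coefficients and the exclusion of the 0th (energy) coefficient follow common practice and are assumptions rather than the exact recipe of the footnoted implementation.

```python
import numpy as np
import librosa

def mcd_dtw(ref_wav: str, syn_wav: str, n_mfcc: int = 25) -> float:
    """Mel Cepstral Distortion between two utterances after DTW alignment."""
    def cepstra(path):
        y, sr = librosa.load(path, sr=22050)
        # Drop the 0th (energy) coefficient, as is common for MCD.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)[1:]

    ref, syn = cepstra(ref_wav), cepstra(syn_wav)
    # DTW over frame-wise Euclidean distances; wp is the optimal warping path.
    _, wp = librosa.sequence.dtw(X=ref, Y=syn, metric="euclidean")
    diffs = ref[:, wp[:, 0]] - syn[:, wp[:, 1]]
    # Standard MCD scaling constant: 10 * sqrt(2) / ln(10).
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)
    return const * np.mean(np.sqrt(np.sum(diffs ** 2, axis=0)))
```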
                 Test in-domain                 Test out-domain
Model            MOS           MCD (DTW)       MOS           MCD (DTW)
Ground Truth     4.36 ± 0.21   –               4.31 ± 0.22   –
M_15m            3.09 ± 0.45   6.875 ± 0.127   2.88 ± 0.45   7.125 ± 0.235
M_30m            3.27 ± 0.30   5.622 ± 0.214   3.08 ± 0.44   6.890 ± 0.161
M_60m            3.63 ± 0.36   5.133 ± 0.091   3.35 ± 0.36   6.521 ± 0.143
The MOS was used to evaluate the subjective quality of the speech samples from the different models. In the table provided, we observe a trend of improving MOS scores as the training duration increases. This implies that with more training data, the subjective quality of the synthesized speech increases.
For the in-domain test:
Ground Truth: As the reference point for natural speech, the Ground Truth yielded the highest MOS score (4.36 ± 0.21).
M_15m: With a MOS score of 3.09 ± 0.45, this model received the lowest score of the three, implying that the quality of the synthesized speech was not as good as the others.
M_30m: An improvement from the M_15m model is seen with a MOS score of 3.27 ± 0.30. This suggests that additional training time improved the subjective quality of the synthesized speech.
M_60m: This model achieved the highest MOS score (3.63 ± 0.36) among the synthesized models, indicating that the quality of the speech generated was the most appreciated by listeners, albeit not quite reaching the level of the natural speech.
For the out-of-domain test:
Ground Truth: Again, the Ground Truth demonstrated the highest MOS score (4.31 ± 0.22).
M_15m: The M_15m model had the lowest MOS (2.88 ± 0.45), suggesting that its synthesized speech was perceived as less satisfactory.
M_30m: An increase in MOS is observed with a score of 3.08 ± 0.44, indicating a better speech quality perception compared to M_15m.
M_60m: Mirroring the in-domain test, M_60m achieved the highest MOS score among the models (3.35 ± 0.36), though it still fell short of the natural speech.
14 https://github.com/SandyPanda-MLDL/ALGAN-VC-Generated-Audio-Samples
In conclusion, the MOS scores demonstrate a noticeable improvement in the subjective quality of synthesized speech as the training duration increases from 15 minutes to 30 minutes, and then to 60 minutes. However, there is still a noticeable gap between the models and natural speech, suggesting room for further improvement.
The Mel Cepstral Distortion (MCD) measured using Dynamic Time Warping (DTW) provides a quantitative metric comparing the synthesized speech with the natural reference speech. Lower MCD DTW values indicate a closer match to the natural reference speech, implying better synthesized speech quality.
From the data table, there is a clear trend of decreasing MCD DTW values as we move from the M_15m model to the M_60m model, for both in-domain and out-domain tests. This suggests that the quality of the synthesized speech improves with increased training duration, becoming more similar to natural speech.
In the in-domain tests:
The M_15m model exhibited the highest MCD DTW value (6.875 ± 0.127), suggesting its synthesized speech is most divergent from natural speech among the three models.
The M_30m model showed an improvement over the M_15m model, with a lower MCD DTW value (5.622 ± 0.214). This indicates that its synthesized speech is closer to natural speech than that of the M_15m model.
The M_60m model had the lowest MCD DTW value (5.133 ± 0.091) among the three models, suggesting its synthesized speech is closest to natural speech.
In the out-domain tests:
Once again, the M_15m model had the highest MCD DTW value (7.125 ± 0.235), suggesting its synthesized speech is most divergent from natural speech among the three models.
The M_30m model had a lower MCD DTW value (6.890 ± 0.161) compared to the M_15m model, indicating that its synthesized speech is closer to natural speech.
Consistent with the in-domain tests, the M_60m model had the lowest MCD DTW value (6.521 ± 0.143), indicating that its synthesized speech is closest to natural speech in the out-domain context as well.
The M_60m model achieved the best performance in terms of MCD DTW for both in-domain and out-domain tests, indicating that with increased training duration, the synthesized speech can approach the quality of natural speech more closely.
We can see that as the training data increases, the MOS scores increase and the MCD (DTW) scores decrease. Even when training with only 60 minutes of data, the quality of the synthesized audio approaches that of the original signal, although a gap remains.
MOS analysis by ANOVA
For the in-domain test set, applying a two-way ANOVA, called ANOVA5 in our research, provides the means to test three distinct null hypotheses. The hypotheses for the ANOVA5 analysis, which considers two independent variables, TTS_System and Subject (the Muong volunteers), are as follows (a statsmodels sketch is given after the list):
Null Hypothesis (H0) - TTS System: There is no significant variance in the mean MOS attributable to differences between the TTS systems being evaluated. In other words, the TTS system used does not significantly affect the MOS scores.
Null Hypothesis (H0) - Subject: There is no significant variation in MOS scores across the different Muong volunteers evaluating the synthesized speech. This implies that the subjectivity of the listeners does not significantly influence the MOS scores.
Null Hypothesis (H0) - Interaction effect: There is no significant interaction effect between the TTS systems and the subjects on the resulting MOS scores. This means that the combined effect of the TTS system and the subjectivity of the volunteers does not significantly affect the MOS scores.
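The ANOVA5 test (and, with Sentences in place of Subject, the ANOVA6 test) can be run with statsmodels, for example as in the sketch below; the data-frame column names are assumptions about how the individual ratings are organized.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# `ratings` is assumed to be a long-format table with one row per rating:
# columns MOS (1-5 score), TTS_System (e.g. GroundTruth/M_15m/M_30m/M_60m),
# and Subject (listener id).
def anova5(ratings: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA of MOS with TTS_System and Subject as factors."""
    model = ols("MOS ~ C(TTS_System) + C(Subject) + C(TTS_System):C(Subject)",
                data=ratings).fit()
    return sm.stats.anova_lm(model, typ=2)   # F and p values for each factor
```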
In our ANOVA6 analysis, we consider two independent variables: TTS_System and Sentences. The null hypotheses for this analysis are as follows:
Null Hypothesis (H0) - TTS System: There is no substantial difference in the Mean Opinion Scores (MOS) that can be ascribed to variations between the evaluated TTS systems. Essentially, the type of TTS system employed does not have a significant impact on the MOS.
Null Hypothesis (H0) - Sentences: There is no considerable variation in MOS scores across the different sentences used in the evaluation process. This suggests that the specific sentences chosen for the evaluation do not exert a significant influence on the MOS.
Null Hypothesis (H0) - Interaction Effect: There is no noteworthy interaction effect between the TTS systems and the sentences on the derived MOS scores. This implies that the combined influence of the TTS system and the sentences used in the evaluation does not significantly alter the MOS.
Table 4.10 ANOVA Results for in-domain MOS Test
Table 4.10 presents the results of the ANOVA5 analysis of the Mean Opinion Scores (MOS) based on the hypotheses stated earlier.
The first hypothesis being tested is whether there is a significant difference in MOS between TTS systems. The analysis shows that the factor "TTS_System" has a significant effect (F = 116.321, p < 0.001, η² = 0.162), indicating that there is a significant difference in MOS between TTS systems.
The second hypothesis being tested is whether there is a significant difference in MOS across different subjects. The analysis shows that the factor "Subject" does not have a significant effect (F = 1.292, p = 0.086, η² = 0.034), indicating that there is no significant difference in MOS across different subjects.
The third hypothesis being tested is whether there is an interaction effect between TTS systems and subjects on MOS. The analysis shows that there is no significant interaction effect between "TTS_System" and "Subject" on MOS (F = 0.789, p = 0.968, η² = 0.061), indicating that the effect of TTS systems on MOS does not depend on the subject.
These results suggest that the MOS scores are affected by the TTS systems used but not by the subjects listening to the synthesized speech. These findings could be useful in improving the overall performance of TTS systems by identifying the specific factors that affect MOS scores and addressing them accordingly.
In the two-way ANOVA6, the factors are TTS_System and Sentences. The ANOVA results for the MOS variable show that both the TTS_System and Sentences factors have significant effects on the MOS measurements, and that there is a significant interaction between the two factors:
The TTS_System factor has a significant effect on the MOS measurements (F = 122.822, p