Research on adaptive speech synthesis for low-resourced languages, applied to the Muong language (Nghiên cứu tổng hợp tiếng nói cho ngôn ngữ ít nguồn tài nguyên theo hướng thích nghi, ứng dụng với tiếng Mường).
PART 1: BACKGROUND AND RELATED WORKS
Chapter 1, titled "Overview of Speech Synthesis and Speech Synthesis for Low-Resourced Languages": This chapter concisely reviews the existing literature to gain a comprehensive understanding of TTS. Research directions for low-resourced TTS are also detailed in this chapter.
Chapter 2, titled "Vietnamese and Muong Language": This chapter presents research on the phonology of the Vietnamese and Muong languages. Computational linguistic resources for Vietnamese speech processing, as applied in Vietnamese TTS, are described in detail.
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
Chapter 3, titled "Emulating Muong TTS Based on Input Transformation of Vietnamese TTS," presents the proposal to synthesize Muong speech by adapting existing Vietnamese TTS systems. This approach can be applied experimentally to create TTS systems for other Vietnamese ethnic minority languages quickly.
Chapter 4, titled "Cross-Lingual Transfer Learning for Muong Speech Synthesis": In this chapter, we use and experiment with approaches for Muong TTS that leverage Vietnamese resources. We focus on transfer learning by creating a Vietnamese TTS system, further training it with different Muong datasets, and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
Chapter 5, titled "Generating Unwritten Low-Resourced Language's Speech Directly from Rich-Resource Language's Text," presents our approach for addressing speech synthesis challenges for unwritten low-resourced languages by synthesizing L2 speech directly from L1 text. The proposed system is built using end-to-end neural network technology for text-to-speech.
We use Vietnamese as L1 and Muong as L2 in our experiments.
Chapter 6, titled "Speech Synthesis for Unwritten Low-Resourced Languages Using Intermediate Representation": This chapter proposes using phoneme representation due to its close relationship with speech within a single language. The proposed method is applied to the Vietnamese and Muong language pair. Vietnamese text is translated into an intermediate representation of two unwritten dialects of the Muong language: Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho. The evaluation reveals relatively high translation quality for both dialects.
In conclusion, speech synthesis for low-resourced languages is a significant research area with the potential to positively impact the lives of speakers of these languages. Despite challenges posed by limited data and linguistic knowledge, advancements in speech synthesis technology and innovative approaches enable the development of high-quality speech synthesis systems for low-resourced languages. The work presented in this dissertation contributes to this field by exploring novel methods and techniques for speech synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech synthesis for low-resourced languages, particularly in response to the growing demand for accessible technology. This can be achieved through ongoing research in transfer learning, unsupervised learning, and data augmentation. Additionally, there is a need for further investment in collecting and preserving linguistic data for low-resourced languages and in developing phonological studies for these languages. With these efforts, we can ensure that speech synthesis technology is accessible to everyone, regardless of their language.
PART 1: BACKGROUND AND RELATED WORKS
OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGES
Overview of speech synthesis
This section offers a brief introduction to the field of speech synthesis. It highlights the key concepts and techniques involved in converting written text into spoken language. It also provides a foundation for understanding the complexities and challenges of developing speech synthesis systems.
Speech synthesis is the artificial generation of human speech using technology. A computer system designed for this purpose, known as a speech computer or speech synthesizer, can be realized through software or hardware implementations. A text-to-speech (TTS) system specifically converts standard written language text into audible speech, whereas other systems transform symbolic linguistic representations, such as phonetic transcriptions, into speech [1]. TTS technology has evolved significantly over the years, incorporating advanced algorithms and machine learning techniques to produce more natural-sounding and intelligible speech output.
By simulating various aspects of human speech, including pitch, tone, and intonation, TTS systems strive to provide a seamless and user-friendly listening experience.
The development of TTS technology has undergone remarkable progress over time:
In the 1950s, pioneers like Homer Dudley with his "VODER" and Franklin S. Cooper's "Pattern Playback" laid the foundation for modern TTS systems.
The 1960s brought forth formant-based synthesis, utilizing models of vocal tract resonances to produce speech sounds.
The 1970s introduced linear predictive coding (LPC), enhancing speech signal modeling and producing more natural synthesized speech.
The 1980s saw the emergence of concatenative synthesis, a method that combined pre-recorded speech segments for the final output.
During the 1990s, unit selection synthesis became popular, using extensive databases to select the best-fitting speech units for more natural output.
The 2000s experienced the rise of statistical parametric synthesis techniques, such as Hidden Markov Models (HMMs), providing a data-driven and adaptable approach to TTS.
The 2010s marked the beginning of deep learning-based TTS with models like Google's WaveNet, revolutionizing speech synthesis by generating raw audio waveforms instead of relying on traditional signal processing.
End-to-end neural TTS systems like Tacotron streamlined the TTS process by directly converting text to speech without intermediate stages.
Transfer learning and multilingual TTS models have recently enabled the development of high-quality TTS systems for low-resourced languages, expanding the reach of TTS technology.
Today, TTS plays a vital role in everyday life, powering virtual assistants, accessibility tools, and various digital content types.
Some current applications of text-to-speech (TTS) technology include:
Assistive technology for the visually impaired: TTS systems help blind and visually impaired individuals by reading text from books, websites, and other sources, converting it into audible speech.
Learning tools: TTS systems are used in computer-aided learning programs, aiding language learners and students with reading difficulties or dyslexia by providing auditory reinforcement.
Voice output communication aids: TTS technology assists individuals with severe speech impairments by enabling them to communicate through synthesized speech.
Public transportation announcements: TTS provides automated announcements for passengers on buses, trains, and other public transportation systems.
E-books and audiobooks: TTS systems can read electronic books and generate audiobooks, making content accessible to a broader audience.
Entertainment: TTS technology is utilized in video games, animations, and other forms of multimedia entertainment to create realistic and engaging voiceovers.
Email and messaging: TTS systems can read emails, text messages, and other written content aloud, helping users stay connected and informed.
Call center automation: TTS is employed in automated phone systems, allowing users to interact with voice-activated menus and complete transactions through spoken commands.
Virtual assistants: TTS is a crucial component of popular voice-activated virtual assistants like Apple's Siri, Google Assistant, and Amazon's Alexa, enabling them to provide spoken responses to user queries.
Voice search applications: By integrating TTS with speech recognition, users can use speech as a natural input method for searching and retrieving information through voice search apps.
In conclusion, TTS technology has come a long way since its inception, with continuous advancements in algorithms, machine learning, and deep learning techniques. As a result, TTS systems now provide more natural-sounding and intelligible speech, enhancing the user experience across various applications such as assistive technology, learning tools, entertainment, virtual assistants, and voice search. The ongoing development and integration of TTS into our daily lives will continue to shape the future of human-computer interaction and digital accessibility.
The architecture of a TTS system is generally composed of several components, as depicted in Figure 1.1. The Text Processing component is responsible for preparing the input text for speech synthesis. The G2P Conversion component converts the written words into their corresponding phonetic representations. The Prosody Modeling component adds appropriate intonation, duration, and other prosodic features to the phonetic sequence. Lastly, the Speech Synthesis component generates the speech waveform based on the parameters derived from the fully tagged phonetic sequence [2].
Figure 1.1 Basic system architecture of a TTS system [22]
Text processing is crucial for identifying and interpreting all textual or linguistic information that falls outside the realms of phonetics and prosody. Its primary function is to transform non-orthographic elements into words that can be spoken aloud. Through text normalization, symbols, numbers, dates, abbreviations, and other non-orthographic text elements are converted into a standard orthographic transcription, facilitating subsequent phonetic conversion. Additionally, analyzing whitespace, punctuation, and other delimiters is vital for determining document structure and providing context for all subsequent steps. Certain text structure elements may also directly impact prosody. Advanced syntactic and semantic analysis can be achieved through effective text-processing techniques [2, p. 682]. The phonetic analysis aims to transform orthographic symbols of words into phonetic representations, complete with any diacritic information or lexical tones present in tonal languages. Although future TTS systems might rely on word-sounding units and possess increased storage capacity, homograph disambiguation and grapheme-to-phoneme (G2P) conversion for new words remain essential for accurate pronunciation of every word. G2P conversion is relatively straightforward in languages with a clear relationship between written and spoken forms. A small set of rules can effectively describe this direct correlation, which is characteristic of phonetic languages such as Spanish and Finnish. Conversely, English is not a phonetic language due to its diverse origins, resulting in less predictable letter-to-sound relationships. In these cases, employing general letter-to-sound rules and dictionary lookups can facilitate the conversion of letters to sounds, enabling the correct pronunciation of any word [2, p. 683].
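To make the text-normalization and G2P steps above concrete, the sketch below shows a minimal rule-based normalizer and a greedy grapheme-to-phoneme converter. It is an illustrative toy under assumed rule tables, function names, and a toy phoneme inventory; it is not the front end of any system described in this dissertation.

```python
import re

# Toy normalization table: non-orthographic tokens -> speakable words (illustrative only).
NORMALIZATION_RULES = {
    "%": "percent",
    "kg": "kilograms",
    "Dr.": "doctor",
}

# Toy grapheme-to-phoneme rules for a hypothetical phonetic language,
# longest graphemes first so digraphs are matched before single letters.
G2P_RULES = [
    ("ch", "tS"),
    ("a", "a"),
    ("e", "e"),
    ("o", "o"),
    ("t", "t"),
    ("n", "n"),
]

def normalize(text: str) -> str:
    """Expand symbols and abbreviations into orthographic words."""
    for token, spoken in NORMALIZATION_RULES.items():
        text = text.replace(token, f" {spoken} ")
    # Spell out digits one by one as a crude fallback.
    text = re.sub(r"\d", lambda m: f" {m.group(0)} ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def g2p(word: str) -> list[str]:
    """Greedy longest-match conversion of graphemes to phonemes."""
    phonemes, i = [], 0
    while i < len(word):
        for grapheme, phoneme in G2P_RULES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            i += 1  # skip characters not covered by the rules
    return phonemes

if __name__ == "__main__":
    sentence = normalize("Dr. Chan ate 2 kg")
    print([g2p(w) for w in sentence.split()])
```

In a full front end, dictionary lookups and homograph disambiguation would sit alongside such rules, as discussed above.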
In TTS systems, prosodic analysis involves examining prosodic features within the text input, such as stress, duration, pitch, and intensity. This information is then utilized to generate more natural and expressive speech. Prosodic analysis helps determine the appropriate stress, intonation, and rhythm for the synthesized speech, resulting in a more human-like output. Predicting prosodic features can be achieved through rule-based or machine-learning methods, including acoustic modeling and statistical parametric speech synthesis. By adjusting the synthesized speech, TTS systems can convey various emotions or speaking styles, enhancing their versatility and effectiveness across diverse applications. Speech synthesis employs the information anticipated from the fully tagged phonetic sequence to generate the corresponding speech waveform. Broadly, the two traditional speech synthesis techniques are concatenative and source/filter synthesizers. Concatenative synthesizers assemble pre-recorded human speech components to produce the desired utterance. In contrast, source/filter synthesizers create synthetic voices using a source/filter model based on the parametric description of speech. The first method necessitates assistance in generating high-quality speech using the input text's parametric representation and speech parameters. Meanwhile, the second approach requires a combination of algorithms and signal processing adjustments to ensure smooth and continuous speech, particularly at junctures.
Several improvements have been proposed for high-quality text-to-speech (TTS) systems, drawing from the two fundamental speech synthesis techniques Among the most prominent state-of-the-art methods are statistical parametric speech synthesis and unit selection techniques, which have been the subject of extensive debate among researchers in the field.
With the advancement of deep learning, neural network-based TTS (neural TTS) systems have been proposed, utilizing (deep) neural networks as the core model for speech synthesis. A neural TTS system comprises three fundamental components: a text analysis module, an acoustic model, and a vocoder. As illustrated in Figure 1.2, the text analysis module transforms a text sequence into linguistic features. The acoustic model then generates acoustic features from these linguistic features, and finally, the vocoder synthesizes the waveform from the acoustic features.
1.1.3 Evolution of TTS methods over time
The evolution of TTS methods has progressed significantly over time, with advancements in technology and research contributing to more natural and intelligible speech synthesis. Early TTS systems relied on rule-based methods and simple concatenation techniques, which have since evolved into sophisticated machine learning approaches, including neural network-based TTS systems. These modern systems offer improved speech quality, prosody, and adaptability, resulting in more versatile applications across various industries.
1.1.3.1 TTS using unit-selection method
The unit-selection approach allows for the creation of new, genuinely natural-sounding utterances by picking relevant sub-word units from a natural speech database [4], based on how well a chosen unit matches a specification (a target unit) and how well two chosen units join together. During synthesis, an algorithm chooses one unit from the available options to discover the best overall sequence of units that meets the specification [1]. The specification and the units are described by a feature set that includes linguistic and speech elements. The feature set is used to perform a Viterbi-style search to determine the sequence of units with the lowest total cost. Although they are theoretically quite similar, the review of Zen [4] suggests that there are two fundamental methods in unit-selection synthesis: (i) the selection model [5], shown in Figure 1.3a; and (ii) the clustering approach [6], shown in Figure 1.3b, which effectively enables the target cost to be pre-calculated. The second method asks questions about features available at the time of synthesis and groups units of the same type into a decision tree.
Figure 1.3 General and clustering-based unit-selection scheme: Solid lines represent target costs and dashed lines represent concatenation costs [13]
In the selection model for TTS synthesis, speech units are chosen based on a cost function calculated in real time during the synthesis process. This cost function considers the acoustic and linguistic similarity between the target text and the available speech units in the database, selecting the unit with the lowest cost for synthesis. Conversely, the clustering approach pre-calculates the cost for each speech unit, grouping similar units into a decision tree. This tree allows for rapid speech unit selection during synthesis based on available features, reducing real-time computation and resulting in faster, more efficient TTS synthesis. Both methods have their advantages and disadvantages, with the selection model offering greater flexibility for adapting to different languages and voices, and the clustering approach providing enhanced speed and efficiency. The choice between these methods depends on the specific needs of the TTS system being developed.
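As a concrete illustration of the cost-based search described above, the following sketch implements a minimal Viterbi-style unit selection over toy target and concatenation costs. The cost functions, unit representation, and candidate lists are invented for illustration only; they are not the cost model of any specific system cited here.

```python
# Minimal Viterbi-style unit selection (illustrative sketch).
# Each target position has candidate units; we pick the sequence minimizing
# total target cost + concatenation cost between adjacent chosen units.

def target_cost(target_spec: dict, unit: dict) -> float:
    """Toy target cost: mismatch in phone identity and pitch."""
    cost = 0.0 if unit["phone"] == target_spec["phone"] else 10.0
    cost += abs(unit["pitch"] - target_spec["pitch"]) / 50.0
    return cost

def concat_cost(left: dict, right: dict) -> float:
    """Toy concatenation cost: spectral discontinuity at the join."""
    return abs(left["spectrum_end"] - right["spectrum_start"])

def unit_selection(targets: list[dict], candidates: list[list[dict]]) -> list[dict]:
    """Dynamic-programming (Viterbi) search for the lowest-cost unit sequence."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            # choose the predecessor minimizing cumulative + concatenation cost
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + concat_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + tc, prev_j))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1] if best[i][j][1] is not None else j
    return list(reversed(path))
```

The clustering approach would replace the on-the-fly `target_cost` with costs pre-computed at the leaves of a decision tree, as noted above.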
In a typical statistical parametric speech synthesis system, a set of generative models is used to model the parametric speech representations extracted from a speech database, including spectral and excitation parameters (also known as vocoder parameters, since they are used as inputs of the vocoder). The model parameters are frequently estimated using the Maximum Likelihood (ML) criterion. Then, to maximize their output probabilities, speech parameters are constructed for a specific word sequence to be synthesized from the estimated models. Finally, a speech waveform is built from the parametric representations of speech [4]. Any generative model can be employed; however, HMMs are the most well-known. In HMM-based speech synthesis (HTS) [7], context-dependent HMMs statistically model and produce the speech parameters of a speech unit, such as the spectrum and excitation parameters (for example, fundamental frequency - F0). A typical HMM-based speech synthesis system's core architecture, as shown in Figure 1.4 [8], consists of two main processes: training and synthesis.
Figure 1.4 Core architecture of HMM-based speech synthesis system [25]
The Expectation Maximization (EM) algorithm is used to perform the ML estimation (MLE) during training, similar to speech recognition. The primary distinction is that the excitation and spectrum parameters are taken from a database of natural speech and modeled by a collection of multi-stream context-dependent HMMs. Excitation parameters include log F0 and its dynamic properties.
Another distinction is the addition of prosodic and linguistic circumstances to phonetic settings (called contextual features). The state-duration distribution for each HMM is also used to describe the temporal structure of speech. The Gamma distribution and the Gaussian distribution are options for state-duration distributions. In order to estimate them, the forward-backward method uses statistical data gathered during the previous iteration.
Figure 1.5 General HMM-based synthesis scheme [13, p. 5]
An inverse speech recognition procedure is carried out throughout the synthesis process. First, the utterance HMM is built by concatenating the context-dependent HMMs according to the label sequence after a given word sequence is transformed into a context-dependent label sequence. Second, the speech parameter generation algorithm creates spectral and excitation parameter sequences from the utterance HMM. The obtained spectral and excitation parameters are then used to create a speech waveform using a speech synthesis filter and a vocoder with a source-excitation/filter model [4].
Speech synthesis for low-resourced languages
The development of interactive systems for under-resourced languages [26] faces challenges due to the lack of data and the minimal research in this area. The SLTU-CCURL 2 workshops and SIGUL 3 meetings aim to gather researchers working on speech and NLP for these languages to exchange ideas and experiences. These events foster innovation and encourage cross-disciplinary collaboration between fields like computer science, linguistics, and anthropology. The focus is on promoting the development of spoken language technologies for low-resourced languages, covering topics like speech recognition, text-to-speech synthesis, and dialogue systems. By bringing together academic and industry researchers, these meetings help address the challenges faced in under-resourced language processing.
Many investigations for low-resourced languages have been conducted recently using a variety of methods, including applying speaker characteristics [27], modifying phonemic features [28], [29], and cross-lingual text-to-speech [30], [31]. Yuan-Jui Chen et al. introduced end-to-end TTS with cross-lingual transfer learning [32]. The authors proposed a method to learn a mapping between source and target linguistic symbols, because the model trained on the source language cannot be directly applied to the target language due to input space mismatches. By using this learned mapping, pronunciation information can be kept throughout the transfer process. Sahar Jamal et al. [33] used transfer learning in their experiments to take advantage of the low-resourced scenario. The information obtained then trains the model with a significantly smaller collection of Urdu training data. The authors created both standalone and transfer-learning-based Urdu systems by using pre-trained Tacotron models of English and Arabic as parent models. Marlene Staib et al. [34] improved or matched the performance of many baselines, including a resource-intensive expert mapping technique, by swapping out Tacotron 2's character input for a manageably small set of IPA-inspired features. This model architecture also enables the automated approximation of sounds that have not been seen in training. They demonstrated that a model trained on one language can produce intelligible speech in a target language even in the absence of acoustic training data. A similar approach [35] is used in transfer learning, where a high-resource English source model is fine-tuned with either 15 minutes or 4 hours of transcribed German data. Data augmentation is a different approach that researchers apply to solve the low-resourced language challenge [36]–[38]. An innovative three-step methodology has been developed for constructing expressive style voices using as little as 15 minutes of recorded target data, circumventing the costly operation of capturing large amounts of target data. Firstly, Goeric Huybrechts et al. [36] augment data by using recordings of other speakers whose speaking styles match the desired one. In the next step, they use this synthetic data to train a TTS model based on the available recordings. Finally, the model is fine-tuned to improve quality.

2 http://sltu-ccurl-2020.ilc.cnr.it/
3 https://sigul-2022.ilc.cnr.it/
Muthukumar and his colleagues have developed a technique for automatically constructing phonetics for unwritten languages [39]. Synthesis may be improved by switching to a representation closer to spoken language than written language.
The main challenges to address when developing TTS for under-resourced languages are: (1) synthesizing speech for languages with a writing system but limited data; and (2) synthesizing speech for languages without a writing system, using input text or speech from another language. Key research directions, such as adaptation and polyglot approaches, will be discussed in detail in the following sections to tackle these challenges.
1.2.1 TTS using emulating input approach
The rationale behind this approach is to leverage an existing TTS system for a base language (BL) to simulate TTS for an unsupported target language (TL). This strategy aims to assist individuals who speak unsupported languages when communicating in another language is inconvenient, such as when new immigrants visit a doctor. While TTS plays a role in translating doctor-patient conversations, text-based communication is also essential in healthcare. Consequently, TTS becomes necessary for users with limited English proficiency or literacy skills in their native language, as it enables them to access and understand vital information [40].
The first emulating idea was given by Evans et al. [41], whose team developed the simulator to fit a screen reader. They describe a method that enables the production of text-to-speech synthesizers for new languages with assistive apps. The method employs a straightforward rule-based text-to-phoneme step. The phonemes are then transmitted to a phoneme-to-speech system for another language. They demonstrate that the correspondence between the language to be synthesized and the language on which the phoneme-to-speech system is based is crucial for the perceived quality of speech, but not necessarily for speech comprehension. They report experiments in Greek, but the same method can be applied with equal success to Albanian, Czech, Welsh, and additional languages.
Three primary challenges exist in simulating a target language (TL) using a base language (BL). First, it is essential to choose BL phonemes that closely resemble those of the TL. Second, the goal is to minimize discrepancies in text-to-phoneme mapping. Lastly, we must select a BL with linguistic features that closely align with those of the TL. These three challenges can lead to different approaches, and ultimately, the balance achieved will be significantly influenced by the decisions made regarding the BL [40].
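To illustrate the emulating idea in the two paragraphs above, the sketch below rewrites target-language text using the closest units an existing base-language front end can pronounce. The Greek-to-English rows, the `emulate` function, and the `bl_tts.speak` call are hypothetical illustrations only; they are not the mapping of Evans et al. nor the Muong-to-Vietnamese mapping developed later in this dissertation.

```python
# Emulating a target language (TL) with a base language (BL) TTS system:
# TL graphemes are rewritten as strings the BL front end can pronounce.
# The Greek-to-English rows below are illustrative approximations only.
TL_TO_BL = [
    ("ου", "oo"),   # digraphs first (longest match)
    ("θ", "th"),
    ("χ", "h"),
    ("α", "a"),
    ("ν", "n"),
    ("ο", "o"),
    ("σ", "s"),
    ("ς", "s"),
]

def emulate(tl_text: str) -> str:
    """Rewrite TL text so an unmodified BL TTS system can pronounce it."""
    out, i = [], 0
    while i < len(tl_text):
        for tl_unit, bl_unit in TL_TO_BL:
            if tl_text.startswith(tl_unit, i):
                out.append(bl_unit)
                i += len(tl_unit)
                break
        else:
            out.append(tl_text[i])  # pass through unmapped characters
            i += 1
    return "".join(out)

if __name__ == "__main__":
    # A hypothetical call into an existing BL synthesizer would follow, e.g.:
    # bl_tts.speak(emulate("θάνος"))
    print(emulate("θάνος"))  # -> "thάnos" (the accented vowel passes through unmapped)
```

The quality of the resulting speech hinges on how well such a table balances the three challenges just listed, particularly the phoneme-similarity choice.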
In the study by Evans et al. [41], the evaluation process has a unique aspect compared to conventional TTS assessment. This distinction is essential to understand, as it highlights the tailored approach needed for evaluating TTS systems in under-resourced languages. The MRT (Modified Rhyme Test) differs from the traditional MOS (Mean Opinion Score) assessment: the conventional MOS is a subjective evaluation method used to gauge the overall quality of speech synthesis systems, whereas the MRT focuses on the intelligibility and usability of the synthesized speech in low-resource settings. This shift in focus makes the MRT a more suitable evaluation method for under-resourced languages. The study used nonsensical words and simple sentence structures as test cases to evaluate the Greek TTS system. This approach was chosen because, in under-resourced languages, it is crucial to ensure that the TTS system can generate clear and understandable speech even when faced with unusual or uncommon linguistic structures. By using these "fake" cases, the evaluation can better assess the system's performance and robustness in challenging situations.
Harold Somers and his colleagues proposed an "emulating" approach for developing TTS systems in under-resourced languages, as explored in their publications [40] and [42]. They aimed to create a TTS system for Somali, an under-resourced language, by leveraging an existing TTS system for a well-resourced language. The researchers also discussed various experimental designs to assess TTS systems developed using this approach, emphasizing the importance of evaluating speech quality, intelligibility, and usefulness. This method utilizes existing resources from well-resourced languages, showing potential for developing TTS systems for under-resourced languages. By investigating different experimental designs and evaluation methods, researchers can better comprehend the challenges, opportunities, and limitations of this approach.
The advantages and disadvantages of the "emulating" approach for low-resourced languages, as well as its applicability, can be summarized as follows:
Advantages:
Resource efficiency: By leveraging existing TTS systems for rich-resourced languages, the need for extensive data collection and development efforts can be reduced.
Faster development: Utilizing existing resources accelerates the development process for TTS systems in low-resourced languages.
Cross-disciplinary collaboration: The "emulating" approach fosters collaboration among researchers in various fields, such as computer science, linguistics, and anthropology.
Disadvantages:
Speech quality: Synthesized speech quality may be compromised due to the mismatch between the base and target languages.
Intelligibility: Depending on the similarity between the base and target languages, the intelligibility of the generated speech might be limited.
Customizability: The "emulating" approach might not be suitable for every low-resourced language, especially if there is no closely related rich-resourced language to use as a base.
Applicability:
Languages with similar phonetic or linguistic characteristics: The "emulating" approach is most applicable when the target low-resourced language shares phonetic or linguistic features with a well-resourced language.
Situations requiring rapid TTS system development: In cases where a TTS system is urgently needed for a low-resourced language, the "emulating" approach can provide a quicker solution than traditional methods.
Initial system development: The "emulating" approach can serve as a starting point for developing a more refined TTS system for low- resourced languages, allowing researchers to identify specific challenges and opportunities for improvement.
In summary, the "emulating" approach presents a promising direction for developing TTS systems for low-resourced languages. However, its success depends on selecting a suitable base language and overcoming the limitations inherent in this method.
1.2.2 TTS using the polyglot approach
Polyglot TTS and multilingual TTS are often used interchangeably, but they can have slightly different meanings depending on the context:
Polyglot TTS: In the polyglot approach, a single TTS model is trained to handle multiple languages simultaneously. The model can synthesize speech in various languages using the same architecture and shared parameters. The polyglot approach aims to leverage commonalities among languages and transfer knowledge from rich-resourced languages to low-resourced languages. This approach can be more resource-efficient and scalable compared to building separate TTS models for each language.
Multilingual TTS: Multilingual TTS is a broader term that refers to any TTS system capable of handling multiple languages, regardless of the specific architecture or method used. A multilingual TTS system can include separate TTS models for each language or use a shared model as in the polyglot approach. The main goal of multilingual TTS systems is to support speech synthesis in various languages.
In summary, polyglot TTS is a specific approach to building multilingual TTS systems where a single model is used for multiple languages. On the other hand, multilingual TTS is a more general term that encompasses any TTS system capable of handling multiple languages, whether it uses separate models for each language or a shared model as in the polyglot approach.
Figure 1.11 Scheme of a HMM-based polyglot synthesizer [48]
Machine translation
Introducing machine translation is essential for the development of TTS systems for unwritten low-resourced languages through a phoneme-level intermediate representation. This is because TTS systems generally require text input, which may not be readily available or standardized for unwritten languages. Machine translation can help bridge the gap between written and unwritten languages by converting the source text from a well-resourced language to a phoneme sequence in the target unwritten language. This phoneme sequence can then be used as the intermediate representation to guide the TTS system, enabling it to synthesize speech for the unwritten low-resourced language. Consequently, the combination of machine translation and intermediate phoneme representation can significantly contribute to developing TTS systems for these languages.
Machine translation is the process of using computational algorithms and techniques to automatically translate text or speech from one language to another. The primary goal is to generate translations that are both accurate and fluent, preserving the meaning and style of the original text. Several approaches have been developed for machine translation, each with its own set of principles. Rule-based machine translation relies on comprehensive linguistic rules and dictionaries to translate between languages. This method requires extensive knowledge of the source and target languages' grammatical structures and vocabulary. On the other hand, statistical machine translation employs probabilistic models based on the analysis of bilingual text corpora. These models learn to predict the most likely translation by observing patterns and co-occurrence frequencies of words and phrases in the training data.
In recent years, neural machine translation has gained prominence due to its ability to generate more fluent and accurate translations This approach utilizes deep neural networks, including encoder-decoder architectures and attention mechanisms, to capture complex linguistic relationships and generate translations in a more context-aware manner.
Each method has its advantages and drawbacks, often depending on the available data and computational resources While rule-based systems can perform well for languages with rich linguistic resources, statistical and neural approaches generally excel in scenarios with vast parallel corpora available for training.
Recall that the idea behind a neural language model is to model language as a sequential process. The neural machine translation model is constructed by extending a neural language model: given the preceding words as input, the model predicts the next word in the sequence. When the end of the input sentence is reached, the model is considered to predict the translation of the input sentence, with each output word produced sequentially. This is referred to as sequence-to-sequence modeling [54]. Figure 1.13 shows an example of converting one sequence into another: by extending the language model, the end of the English input is followed by the German output. The hidden state obtained after handling the end-of-sentence token contains an encoding of the entire input sentence.
Figure 1.13 Examples of sequence to sequence transformation [55]
The Encoder-Decoder [54], [56] approach in neural machine translation utilizes a neural network to create a meaningful representation of the input sentence and generate translations. It involves encoding the input sentence into a fixed-length vector, which is then decoded to produce the output sentence. This architecture is capable of handling variable-length input and output sequences, making it a popular choice for machine translation tasks. The encoder and decoder components can be deepened using multiple layers, further enhancing their learning capacity. Additionally, Convolutional Neural Networks (CNNs) have been applied to this approach, offering new possibilities for handling more complex translation tasks.
1.3.2 Attention in neural machine translation
Constructing a neural translation model with an encoder-decoder architecture has brought good initial results. In addition, proposals have been made to integrate additional structural models of the source and target languages into the neural translation system, such as incorporating alignment models, absolute and relative word-position models, the fertility model [57], [58], etc. However, the most important improvement for integrating bilingual structure-matching models into neural translation models is the use of an alignment model between input and output words.
The attention model is considered very important in a sequence-to-sequence processing model. The study in [59] added an alignment model (called the attention mechanism) to the decoder. The encoder provides a sequence of annotations $h_j = (\overleftarrow{h_j}, \overrightarrow{h_j})$, and the decoder requires a context vector $c_i$ at each step $i$. The attention mechanism connects these two components.
Figure 1.14 depicts the location of the attention model in neural machine translation, its input and output, and the connections between the components of the model. The idea is that the attention mechanism links information from all input representations $(h_1, \dots, h_j)$ with the previous hidden state of the decoder, thereby computing the input context $c_i$.
Figure 1.14 Location of the attention model in neural machine translation
Attention mechanisms are employed to enhance neural machine translation (NMT) model training. The attention mechanism facilitates the calculation of the connection between the decoding phase and each output word. This connection can be calculated using weight vectors and biases, and the attention values are then normalized using the Softmax function.
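The following sketch illustrates, under simplified assumptions, how attention weights and a context vector can be computed: scores between a decoder state and each encoder annotation are normalized with a softmax and used to form a weighted sum. The additive scoring function, matrix names, and dimensions are illustrative choices, not the exact formulation of the systems cited above.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(decoder_state: np.ndarray,
                      encoder_states: np.ndarray,
                      W_dec: np.ndarray,
                      W_enc: np.ndarray,
                      v: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Additive (alignment-model) attention sketch.

    encoder_states: (T, d) annotations h_1..h_T
    decoder_state:  (d,)   previous decoder hidden state
    Returns the context vector c_i and the attention weights.
    """
    # Unnormalized alignment scores e_ij = v^T tanh(W_dec s + W_enc h_j)
    scores = np.tanh(encoder_states @ W_enc.T + decoder_state @ W_dec.T) @ v
    weights = softmax(scores)            # normalized attention weights
    context = weights @ encoder_states   # weighted sum of annotations
    return context, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, a = 5, 8, 8                    # toy sequence length and layer sizes
    h = rng.normal(size=(T, d))          # encoder annotations
    s = rng.normal(size=(d,))            # previous decoder state
    W_enc, W_dec = rng.normal(size=(a, d)), rng.normal(size=(a, d))
    v = rng.normal(size=(a,))
    c, alpha = attention_context(s, h, W_dec, W_enc, v)
    print(alpha.round(3), c.shape)
```

Self-attention, discussed next, applies the same weighting idea among the input words themselves rather than between decoder and encoder states.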
The Self-Attention mechanism, which extends the Attention mechanism within the encoder, focuses on calculating connections between input words instead of between input and output words. A broader context is provided, and this mechanism supports parallel processing. The self-attention mechanism can also be incorporated into the decoding phase.
The training process for NMT models consists of shuffling the training data, dividing it into mini-batches, processing the mini-batches, accumulating gradients, and updating parameters. Training typically runs for 5-15 epochs. Advanced training can leverage existing bilingual structure-matching models to improve quality and speed, and self-attention mechanisms can further enhance the translation model.
1.3.3 Phrase-based statistical machine translation
Statistical machine translation views the translation problem as a machine learning problem: an algorithm extracts statistical information from an extensive database of previously translated texts (a database of parallel texts), and the system can then translate new sentences. Warren Weaver presented the first ideas about probabilistic statistical machine translation in 1949. The memo written by Warren Weaver in 1949 was perhaps the most influential publication in the early days of statistical machine translation. Weaver had first cited the possibility of using computers for translation in 1947. His proposals involved the applicability of coding methods to translation, the existence of universal language rules, and logic shared between languages, among others. Statistical machine translation was then reintroduced in 1991 by researchers at the IBM Research Center and has contributed to a renewed interest in statistical machine translation in recent years. Statistical machine translation is still one of the world's most studied machine translation methods.
Today, with the tools available and enough parallel text data, we can build a translation system for a new language pair in a relatively short time. This method can be applied to a large number of language pairs. The accuracy of these systems depends mainly on the quantity, quality, and suitability of the parallel text corpus used.
1.3.3.1 The phrase-based statistical machine translation problem
Speech synthesis evaluation metrics
The Mean Opinion Score (MOS) is a pivotal metric in the realm of audio and speech quality assessment. MOS is employed to evaluate the perceived quality of audio or speech by listeners. It stands as a standardized method for gauging user satisfaction with the quality of synthesized audio, be it through telecommunication systems, voice communication systems, or voice synthesis applications such as virtual assistants.
The formula for calculating MOS is typically represented as follows:
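Using the $N$ and $R_i$ defined below, the standard averaging form of this score can be written as:

\[ \mathrm{MOS} = \frac{1}{N}\sum_{i=1}^{N} R_i \]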
$N$: The number of listeners participating in the evaluation (typically chosen randomly).
$R_i$: The rating given by listener $i$, usually on a scale of 1 to 5, with the highest score indicating the highest satisfaction with the audio or speech quality.
The Mean Opinion Score (MOS) holds significant importance in the assessment and enhancement of audio and speech quality. Its primary significance encompasses:
Synthesized Quality Assessment: MOS allows researchers and developers to evaluate the performance of speech synthesis systems. By collecting MOS scores from real listeners, they can ascertain user satisfaction levels and identify weaknesses that require improvement.
System Comparison: MOS provides a standardized means to compare audio quality between different systems or variations of a system. This aids in selecting the best-performing system or developing improved iterations.
Quality Threshold Establishment: MOS can be used to establish acceptable quality thresholds. For example, a speech synthesis system with a MOS below a specific threshold may be deemed unacceptable and necessitate improvements.
In conclusion, MOS serves as a potent tool for the evaluation and improvement of audio and speech quality, ensuring that users have the best possible listening experience.
A Confidence Interval (CI) is a concept in statistics and science used to estimate a range within which the true value of a variable may lie. It quantifies the uncertainty in data and allows us to understand the level of confidence in a measurement or estimation. Below is the standard formula for calculating a Confidence Interval (CI):
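With $\bar{x}$ denoting the sample mean, $\sigma$ the sample standard deviation, and $N$ the number of observations (symbols assumed here, since only $Z$ is defined below), the standard interval is:

\[ \mathrm{CI} = \bar{x} \pm Z \cdot \frac{\sigma}{\sqrt{N}} \]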
$Z$: The value from the standard normal distribution table corresponding to the desired level of confidence (e.g., 1.96 for 95% confidence).
The Confidence Interval (CI) helps us gain a deeper understanding of the level of uncertainty in the data and the certainty of an estimation. The significance of CI includes:
Generalizing the estimate: CI allows us to ascertain a range of certainty around the estimated value. Instead of providing only specific figures, it enables the identification of a range within which the true value may lie.
Comparing and deciding: CI enables comparisons between estimates and assesses whether differences are statistically significant. It also supports decision-making based on the level of confidence.
Visualization and presentation of results: When reporting research findings, CI can be graphically represented or presented alongside the estimated value, aiding in the visual communication of data uncertainty.
Mel Cepstral Distortion (MCD) [87], [88] is a common metric in the field of speech synthesis and speech processing used to evaluate the similarity between two speech signals, typically a synthesized signal and a reference signal recorded from a human speaker.
The formula to calculate the Mel Cepstral Distortion (MCD) is typically represented as follows:
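Using the variable definitions below, a commonly used form of the metric is given here; the $\frac{10}{\ln 10}\sqrt{2}$ constant follows the usual dB convention, and the exact constant and frame averaging vary slightly across implementations:

\[ \mathrm{MCD} = \frac{10}{\ln 10}\cdot\frac{1}{N}\sum_{n=1}^{N} \sqrt{2\sum_{m=1}^{M}\left(C_m^{(t)}(n) - C_m^{(r)}(n)\right)^2} \]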
$N$: Number of frames in the speech signal.
$M$: Number of Mel cepstral coefficients (usually 12 or 24).
$C_m^{(t)}$: Mel cepstral coefficients of the synthesized (target) signal.
$C_m^{(r)}$: Mel cepstral coefficients of the reference (real) signal.
MCD measures the distance between two speech signals, one typically synthesized and the other real. A lower MCD value indicates a higher degree of similarity between the two signals. The significance of MCD includes:
Quality Assessment: MCD helps assess how closely the synthesized speech signal matches the reference, serving as an indicator of quality. A lower MCD implies a closer match.
Model Optimization: In the field of speech synthesis, researchers and developers can use MCD to adjust and improve synthesis models to make the synthesized speech sound more similar to the reference.
1.4.2.4 MCD with Dynamic Time Warping (MCD – DTW)
Dynamic Time Warping (DTW) [89] is applied in speech synthesis to align and normalize the Mel Cepstral Distortion (MCD) between a synthesized speech signal and a reference (real) speech signal. It is used to account for timing differences and make a meaningful comparison between the two signals. By employing DTW, MCD becomes a more robust metric for evaluating the quality and similarity of synthesized speech, ensuring that temporal variations are considered during the assessment process. This alignment and normalization process enhances the accuracy of speech synthesis evaluation and contributes to producing more natural and high-quality synthesized speech.
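The sketch below computes frame-level MCD after a simple DTW alignment of two mel-cepstral sequences. It assumes the cepstra have already been extracted as NumPy arrays of shape frames × coefficients; the plain dynamic-programming DTW, the exclusion of the energy coefficient, and the dB constant are common choices offered here only as an illustrative implementation, not the exact procedure used in this work.

```python
import numpy as np

def dtw_path(x: np.ndarray, y: np.ndarray) -> list[tuple[int, int]]:
    """Plain DTW over Euclidean frame distances; returns the alignment path."""
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # (n, m)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def mcd_dtw(target_mcep: np.ndarray, ref_mcep: np.ndarray) -> float:
    """Mean MCD (dB) over DTW-aligned frame pairs; c0 (energy) is excluded."""
    path = dtw_path(target_mcep, ref_mcep)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    diffs = [target_mcep[i, 1:] - ref_mcep[j, 1:] for i, j in path]
    return float(np.mean([const * np.sqrt(np.sum(d ** 2)) for d in diffs]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synth = rng.normal(size=(120, 25))   # e.g. c0 + 24 mel cepstra per frame
    real = rng.normal(size=(100, 25))
    print(f"MCD-DTW: {mcd_dtw(synth, real):.2f} dB")
```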
The Analysis of Variance, or ANOVA [90], [91], is a sophisticated statistical methodology employed to discern differences among group means. It is an instrumental technique used extensively across various domains, such as psychology, medicine, and the social sciences, predominantly in experimental research. The primary function of ANOVA is to analyze whether there exists any statistically significant disparity among the means of two or more groups by scrutinizing the variances both within and between these groups.
The fundamental philosophy underpinning ANOVA is the comparison of the between-group variance with the within-group variance. If the former substantially outweighs the latter, the conclusion drawn is that there are indeed noteworthy differences in the means of the groups under comparison. ANOVA's potency lies in its capacity to detect even minor deviations between groups. Multiple variants of ANOVA exist, each catering to different purposes and applications, such as one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. One-way and two-way ANOVA are both statistical methodologies used for comparing means across diverse groups or treatments; however, they differ in the number of factors or independent variables under analysis.
One-way ANOVA involves the analysis of a single factor or independent variable. This statistical method tests for significant differences in means across three or more groups. For instance, if the aim is to compare the average heights of students across three distinct schools, a one-way ANOVA would be the method of choice to determine any significant disparities in the mean heights among the schools.
The ANOVA formula for one-way ANOVA is as follows:
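Consistent with the definitions of MSB and MSW given below, the F-statistic is their ratio:

\[ F = \frac{MSB}{MSW} \]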
𝐹: The F-statistic, used to test the difference between groups.
𝑀𝑆𝐵: Calculated by dividing the sum of squares between groups by the degrees of freedom between groups. It measures the variability between groups. A higher MSB indicates significant differences between groups.
𝑀𝑆𝑊: Calculated by dividing the sum of squares within each group by the degrees of freedom within groups. It measures the variability within each group. A lower MSW indicates higher similarity within groups.
𝐹-statistic: This is a numerical value calculated based on the differences between groups in the data.
A high F-value suggests significant differences between groups.
Mean Square Between (𝑀𝑆𝐵): 𝑀𝑆𝐵 quantifies the variation between groups. It is computed by summing the squares of the differences between the group means and the overall mean, divided by the degrees of freedom between groups.
Mean Square Within (𝑀𝑆𝑊): 𝑀𝑆𝑊 quantifies the variation within each group. It is computed by summing the squares of the differences between individual data points and their respective group means, divided by the degrees of freedom within groups.
MSB and MSW are compared by dividing MSB by MSW to calculate the F-value; if the F-value is high, there is evidence to believe that there are significant differences between the groups. In other words, ANOVA helps us test whether the differences between groups are statistically significant.
This formula is used to assess the impact of an independent variable (such as a speech synthesis model) on a dependent variable (such as speech quality) across multiple groups or conditions.
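As a usage illustration of the scenario just described (comparing listener scores across several TTS systems), the short sketch below runs a one-way ANOVA with SciPy. The score arrays are invented toy data, not results from this dissertation.

```python
from scipy.stats import f_oneway

# Toy MOS-style listener scores for three hypothetical TTS systems.
system_a = [3.8, 4.1, 4.0, 3.9, 4.2]
system_b = [3.2, 3.5, 3.4, 3.6, 3.3]
system_c = [4.4, 4.5, 4.3, 4.6, 4.4]

# One-way ANOVA: F = MSB / MSW, with the p-value for the null hypothesis
# that all group means are equal.
f_stat, p_value = f_oneway(system_a, system_b, system_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would indicate that at least one system's mean score differs significantly from the others, matching the interpretation of the F-value given above.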
Conclusion
TTS technology converts written text into spoken speech and can be implemented using various methods, such as unit selection, statistical parametric speech synthesis, and neural speech synthesis. For low-resourced languages, approaches like emulating models, polyglot TTS, and adaptive methods have been used. Emulating models are quick and cost-effective but may suffer from low quality and limited accuracy. Polyglot TTS models offer multilingual capabilities, higher quality, and better accuracy but are computationally expensive and complex to implement. Adaptive methods provide high accuracy and dynamic adaptation but are also computationally expensive and challenging to implement.
Newer approaches like transfer learning, zero-shot learning, unsupervised learning, and adversarial training leverage machine learning advancements to improve TTS quality and accuracy for low-resourced languages. Transfer learning utilizes models pre-trained on rich-resource languages for fine-tuning TTS models in low-resourced languages. Zero-shot learning enables TTS models to generate speech for unseen languages, and unsupervised learning helps TTS models learn language characteristics without labelled data. Adversarial training aims to improve the realism of synthesized speech. These methods show promising results in enhancing speech synthesis for low-resourced languages.
In this thesis, the emulating method and cross-lingual transfer learning approach are chosen for several reasons:
Resource efficiency: The emulating method is quicker and more cost-effective than other approaches, as it requires less data and fewer computational resources. This makes it an attractive choice for low-resourced languages where acquiring substantial amounts of data is challenging.
Pre-trained models: Cross-lingual transfer learning leverages models pre-trained on rich-resource languages, which have already captured essential linguistic and acoustic features. This enables the model to gain a head start when fine-tuning for low-resourced languages, ultimately saving time and computational resources.
Adaptability: Cross-lingual transfer learning allows the model to adapt its knowledge from a rich-resource language to a low-resourced language, enabling it to generalize better across languages. This adaptability can help improve the overall performance of the TTS system in low-resourced languages.
Simplicity: The emulating method involves a relatively straightforward process of training a TTS model on a rich-resource language and fine-tuning it on a low-resourced language. This simplicity makes it easier to implement and maintain, especially in comparison to more complex approaches like polyglot TTS or adaptive methods.
Wider applicability: The combination of the emulating method and cross-lingual transfer learning can be applied to various low-resourced languages, making it a versatile solution for TTS systems that aim to support multiple languages.
By choosing the emulating method and adapting from the cross-lingual transfer learning approach, this thesis aims to develop an efficient, adaptable, and straightforward TTS system for low-resourced languages while maximizing the use of available resources and knowledge from rich-resource languages.
In the case of unwritten low-resourced languages, it is necessary to replace text with an intermediate phoneme-level representation. This issue can be addressed through research that combines the various machine translation methods presented earlier in Chapter 1. Thus, translation and TTS problems are integrated to create TTS systems for unwritten low-resourced languages.
Implementing the above content requires a careful study of the phonetic matching between the Vietnamese and Muong languages, and the next chapter details these studies.
VIETNAMESE AND MUONG LANGUAGE
Vietnamese language
The pronunciation of Vietnamese is considered simple and easy to learn, with a consistent and straightforward pronunciation system. The Vietnamese writing system combines the Latin script with diacritical marks. There are 29 letters in the Vietnamese alphabet, with some letters used for vowels and others for consonants. The most important feature of Vietnamese phonetics is its tone system, which consists of 6 tones that can change the meaning of a word. These tones are indicated by diacritical marks placed above or below the vowels.
Vietnamese [92] is the language of the Vietnamese (Kinh) people and the official language in Vietnam. It is the mother tongue of about 85% of the Vietnamese population, along with more than four million overseas Vietnamese. Vietnamese is also the second language of ethnic minorities in Vietnam. Although Vietnamese has some words borrowed from Chinese and previously used Nôm (a Chinese-based script) for writing, Vietnamese is considered the language of the Austroasiatic family with the largest number of speakers.
Vietnamese is officially recognized in the constitution as the national language of Vietnam. Vietnamese comprises its spoken form and the Quoc Ngu script for writing. However, there are no official documents defining Vietnamese standards and a national accent at the state level. Vietnamese is a native language derived from the agricultural civilization in what is now the northern region of the Red River and the Ma River in Vietnam. According to A. G. Haudricourt (1954), the Viet-Muong language group in the early Christian era was a toneless language or dialect. Later, through interaction with Chinese and especially with the Tai-Kadai languages, which have a highly developed tone system, the tone system of Vietnamese emerged and took the form it has today, following regular rules of tone formation. Tones began to appear around the 6th century (during the period of Chinese domination) with three tones, and developed stably around the 12th century (the Ly dynasty) with six tones. Some of the early consonants have continued to change up to the present day. In the process of change, the final consonants altered the syllable endings, and the initial consonants shifted as well.
Since the French invaded Vietnam in the second half of the 19th century, French gradually replaced Classical Chinese as the official language in education, administration, and diplomacy. The Quoc Ngu script, created by European missionaries, especially the two Portuguese missionaries Gaspar do Amaral and Antonio Barbosa, to write Vietnamese with Latin characters, was used increasingly widely and, at the same time, absorbed new terms and expressions from Western languages (mainly from French [93]) such as phanh, lốp, găng, pê đan, as well as Sino-Vietnamese terms such as chính đảng, kinh tế, giai cấp, bán kính. Gia Dinh Newspaper, the first newspaper published in the national script in 1865, affirmed the development of Quoc Ngu and its trend toward becoming the official writing of independent Vietnam later. Quoc Ngu uses only Latin letters and six diacritic marks; it is simple, convenient, scientific, and easy to learn and remember, and it completely replaced the French and Chinese scripts, which were difficult to read and remember and never became popular among the Vietnamese.
After the reunification of Vietnam in 1975, North-South relations were reconnected. Recently, the nationwide popularity of television and radio has helped to standardize Vietnamese to some extent. At the same time, with the advancement of the internet and globalization, the influence of English is growing in the media and among correspondents; many foreign words have been introduced into Vietnamese without much selectivity and are written in their foreign form.
Phonetic speech units can be classified into several categories, including syllables, sounds, phonemes, and prosodic elements such as stress, rhythm, and intonation. These units work together to create the unique sound and rhythm of speech, helping to convey meaning and express emotions.
Strings of words are broken down into different groups of sounds, varying in size from large to small. The smallest unit is the syllable, such as in the word "xà phòng," which is pronounced as "xà" and "phòng." This word is considered to have two syllables [94]. Syllables can be classified into different categories based on various criteria. The most common classification is based on the ending of the syllable. Using this criterion, syllables can be split into four types [95]: open syllables, semi-open syllables, half-closed syllables, and closed syllables.
Sound is the smallest natural unit of speech. Different criteria are used to distinguish between these sounds, such as their acoustic features and vocal characteristics. The distinction between vowels and consonants can also be based on the structure of the vocal organs. Specifically, consonants have a point of articulation, known as "phonological focus," while vowels do not [95]. In the context of the Vietnamese-Muong language pair, the differentiation between vowels and consonants is critical in determining the meaning of words. By understanding the unique characteristics of these sounds, TTS systems can be developed to produce natural-sounding speech in low-resourced languages.
The smallest unit of language is a phoneme, an abstract unit rather than a specific sound produced by an individual. For instance, "G" is a phoneme representing a cluster of distinct features expressed simultaneously [94]. These distinctive features are a social convention and allow for communication between speakers of the same language.
In Vietnamese, for example, the phoneme /d/ is pronounced with the same distinctive features, such as plosive and voiced, by all speakers. Although there may be differences in how individuals and dialects pronounce words, these differences are unimportant and do not affect the meaning. The phoneme represents a general social phonetic constraint that individuals must share for successful communication. So, the sound is the expression of the phoneme in speech [95].
Analyzing a language's syllable structure is crucial to understanding its phonemic system. In the case of Vietnamese, the complex syllable structure, which includes numerous nuclei/vowels, combinations of glide and vowel, and a prominent tone system, requires careful analysis to identify the distinct sounds and their role in the language.
Through analysis of the syllable structure, researchers can gain a deeper understanding of the relationships between the various components of Vietnamese syllables and how they contribute to the unique sound and rhythm of the language. This information can be used to inform the analysis of the phonemic system, helping to identify the distinct sounds and their role in the language.
Vietnamese phonetic and grammatical units, such as syllables and morphemes, coincide, and each syllable in Vietnamese has a stable and complete structure consisting of distinct sound units. The syllable structure in Vietnamese is much different from that of European languages, where each syllable typically contains a consonant and vowel sound.
In Vietnamese, each syllable has five components: tone, initial sound (onset), medial sound, nucleus, and coda. These components play different roles in the syllable, such as regulating the pitch, opening the syllable, changing the tone, and forming the nucleus and end of the syllable. Each component has its own function, and the combination of these components forms the syllable.
Muong language
2.2.1 Overview of Muong people and Muong language
The Muong people are an ethnic minority predominantly inhabiting the mountainous regions of northern Vietnam, particularly in Hoa Binh, Thanh Hoa, and Son La. They are the second-largest ethnic minority group in Vietnam (1,452,095 people according to the 2019 census) and are closely related to the majority ethnic Vietnamese, or Kinh people, in linguistic and cultural terms. Both groups are believed to share a common ancestry and historical roots.
Throughout history, the Muong people have maintained a distinct cultural identity shaped by their unique social structure, language, and way of life. The traditional Muong society was organized around a hierarchical system, with chieftains ruling various regions and maintaining power through family lineages. The nobility and commoners made up the remaining social strata. This social organization has evolved, reflecting changes in political and social environments.
The Muong language is part of the Austroasiatic family and is closely related to Vietnamese. Despite these linguistic similarities, the Muong language has unique features and characteristics, such as its tonal system and vocabulary. Muong folklore, music, and traditional rituals contribute to their cultural heritage and help to reinforce their identity.
Agriculture, particularly wet rice cultivation, has been an essential part of the Muong way of life, highlighting their strong connection to the natural environment. The Muong people have developed sophisticated agricultural techniques and practices over time to adapt to the challenges posed by the mountainous terrain.
The history of the Muong people is a testimony to their resilience and adaptability, as well as their ability to maintain their cultural identity while integrating and coexisting with other ethnic groups in Vietnam. Understanding Muong history provides valuable insights into the diversity and richness of the Vietnamese cultural landscape.
In terms of language family, Vietnamese and Muong belong to the same group: Viet-Muong, which belongs to the Mon-Khmer branch of the Austroasiatic family.
Figure 2.1 Mon-Khmer branch of the Austroasiatic family [109, pp 175–176]
The Viet Muong group is shown in Figure 2.2, according to the study of Ferlus [110].
The Muong language has unmistakable similarities with Vietnamese. This was pointed out by André-Georges Haudricourt in his article "Position of Vietnamese in South Asian Linguistics" [111]. Haudricourt examined and discussed 12 words from the basic vocabulary of the human body across various Mon-Khmer languages, including Viet, Muong, Phong, Kuy, Mon, Bahnar, Mnong, and some others. From Table A.6, it can be seen that the correspondences between Vietnamese and Muong are absolute.
Many more specific views have been put forward about the relationship between Vietnamese and Muong, the central issue in the study of the Muong language. Since the Vietnamese (or Kinh) and the Muong formally split into two groups, each group has tended to develop its own language. Muong and Vietnamese can therefore be regarded as independent languages, each with its own dialects. Linguistically, however, they are branches from the same root, the Viet-Muong sub-group, and research on their development has not yet established a clear point of division.
In a narrow sense, Viet-Muong refers to Northern Vietic languages, including Vietnamese, Muong, and Nguon, which share irregular tones in some basic vocabulary and phonetic features that distinguish them from southern Vietic languages. Ferlus [112] suggested that the Proto-Vietic expansion to the north was a key factor in developing these distinctions. Xinh Mun of the Khmuic branch was the only historical trace of this language group. Linguistic diversity in the northern regions of Vietnam results from rapid changes, with Vietnamese eventually becoming the dominant language.
The Muong language has developed more conservatively than Vietnamese, retaining phonetic features closer to Proto Viet-Muong. Notably, the Muong language retains pre-syllabic sounds that disappeared without substitution, while in Vietnamese these sounds changed, as in the word for 'chicken': Muong /ka/ and Vietnamese /ɣa/ [113].
Muong people live in three provinces: Hoa Binh (where Muong account for 63.3% of the province's population), Phu Tho (13.1%), and Thanh Hoa (9.5%). Figure 2.3 displays the geographical distribution of the various dialects of the Muong language.
Figure 2.3 The distribution of the Muong dialects [114, p 299]
In addition to the four main dialects of the Muong language (Muong Bi, Muong Vang, Muong Thang, and Muong Dong), it is important to mention the Muong Tan Son dialect, spoken in the Tan Son district of Phu Tho Province in northern Vietnam.
Muong Bi: Primarily spoken in Hoa Binh Province, Muong Bi is considered the most prestigious dialect and often serves as the standard for the Muong language.
Muong Vang: Predominantly spoken in Thanh Hoa Province, the Muong Vang dialect exhibits differences from Muong Bi in terms of vocabulary and pronunciation.
Muong Thang: Found in areas of both Hoa Binh and Thanh Hoa Provinces, Muong Thang shares similarities with the Muong Vang dialect.
Muong Dong: Mainly spoken in Son La Province, Muong Dong is the least studied and understood among the Muong dialects, with more distinct variations from the other three dialects.
Muong Tan Son: Spoken in the Tan Son district of Phu Tho Province, the Muong Tan Son dialect exhibits unique phonetic, lexical, and grammatical features that differentiate it from the other Muong dialects.
These five dialects together represent the linguistic diversity and richness of the Muong language, shaped by regional, historical, and cultural factors. Further research into each of these dialects, including Muong Tan Son, can provide valuable insights into the language's development and the cultural heritage of the Muong people.
The issue of the Muong written script is an interesting topic in the context of the Muong language and its development. Historically, the Muong people did not have a formal writing system. Instead, they relied on an oral tradition to transmit their culture, folklore, and knowledge from one generation to the next. Over time, the Muong people adopted various writing systems influenced by the languages and cultures they interacted with.
One notable example is the use of the Vietnamese Chữ Nôm script to transcribe the Muong language. Chữ Nôm is a logographic writing system used for the Vietnamese language from the 13th to the early 20th century, based on the Chinese script. Some Muong people, especially scholars and elites, learned Chữ Nôm and used it to record Muong texts, such as stories, poems, and historical documents.
Comparison between Vietnamese and Muong
This section draws on the studies mentioned above to compare the similarities and differences between Vietnamese and the two Muong dialects, Muong Bi and Muong Tan Son. In the following chapters, these comparative analyses will serve as a foundation for experimenting with TTS methods for low-resourced languages.
When comparing Vietnamese and Muong, the phonetic elements of the two languages can be divided into three groups:
Equivalent elements: Muong phonemes that coincide with Vietnamese phonemes, so direct equivalents can be used.
Closed elements: Muong phonemes that are similar to Vietnamese phonemes, so approximate substitutions by the closest Vietnamese phonemes can be used.
Distinct elements: Muong phonemes that have no counterpart among Vietnamese phonemes.
The following presents the proposed transformation rules between Muong and Vietnamese in terms of consonants, vowels, and tones. Firstly, regarding initial consonants, Muong Hoa Binh has 24 initial consonants [115], while Vietnamese has 20 initial consonants [108]. The close (but not identical) consonant pairs between Muong and Vietnamese are /b–ɓ, c–tɕ, g–ɣ, kʰ–x, pʰ–f/; they are part of the reason for the decline in the quality of the synthetic speech. Five consonant phonemes in Muong are absent from Vietnamese: /p, r, hr, tl, kl/, written respectively p, r, hr, tl, kl. The Muong Hoa Binh final consonants consist of 9 consonants and two approximants /w, j/ [115]. Hanoi Vietnamese licenses eight segments in coda position: three unreleased voiceless obstruents /p, t, k/, three nasals /m, n, ŋ/, and two approximants /j, w/. For the oral stop coda, Hanoi Vietnamese distinguishes the labial-velar /k͡p/ following o, u, ô and the fronted /k̟/ following i, ê, a (/sik̟˦˥/ - xích); for the nasal coda, it distinguishes /ŋ͡m/ following o, u, ô (/oŋ͡m˦/ - ông) and /ŋ̟/ following i, ê, a (/kiŋ̟˦/ - kinh) [108]. Thus, only seven Muong final consonants /p, t, k, m, n, ɲ, ŋ/ and the two semi-vowels /w, j/ have direct equivalents among the Vietnamese codas. Alternative Vietnamese equivalents must be found for the two remaining codas /c, l/; this issue is addressed in the following study.
One consonant, / /, is present in Vietnamese but not in Muong. One consonant, / /, is similar to a Vietnamese one but appears only in the Muong Tan Son dialect and not in Muong Bi. One consonant in the Muong Bi dialect, /w/, appears in neither Muong Tan Son nor Vietnamese.
Muong has one medial /w/, written as w, for example kwêl khwắn (smoking), khwắi (snack), kwa (we), kwải (throw), kwang (clean). Vietnamese also has a medial /w/, written with the two letters o and u, for example hoa quả [115].
In the vowel system, Muong Hoa Binh has 14 vowel sounds [115], while Vietnamese has 11 vowels and two diphthongs [94], [122, p. 58]. The Muong and Vietnamese vowel systems are therefore equivalent for 11 vowels and two diphthongs. One difference is the transcription of the diphthong ươ: in Muong it is transcribed as /ɯɤ/, whereas in Vietnamese it is transcribed as /ɯə/. There are also several orthographic differences that required adapting Muong spelling rules to Hanoi Vietnamese, for example changing êê to ê, oo to o, ôô to ô, uu to u, and ưư to ư [115]. Muong monophthongs written with the vowel letters a, e, i, o, u correspond reasonably well to the Hanoi Vietnamese vowels written with the same letters. Hanoi Vietnamese has quite complex letter-to-phoneme mappings, mainly related to voicing, so the Muong-to-"Hanoi Vietnamese" transliteration process is largely but not entirely automatable, and some manual revision of the texts is necessary. Finally, the Muong language does not have the two short vowels /ɛ/ and /ɔ/ that exist in Vietnamese.
Following the above analysis of the phonetic characteristics of Vietnamese and Muong, the phonetic mappings for consonants, vowels, and tones between Muong and Vietnamese are proposed in Table 2.12.
Table 2.12 Muong and Vietnamese phonetic comparison (orthography in normal, IPA in italic; Vi: Vietnamese; Mb: Muong Bi; Mts: Muong Tan Son)
Initial consonants (Mb | Mts | Vi):
tl /tl/ | – | –
w /w/ | – | –
ng /ŋ/ | ng /ŋ/ | ng, ngh /ŋ/
nh /ɲ/ | nh /ɲ/ | nh /ɲ/
ph /pʰ/ | ph /pʰ/ | ph /f/
t /t/ | t /t/ | t /t/
th /tʰ/ | th /tʰ/ | th /tʰ/
v /v/ | v /v/ | v /v/
x /s/ | x /s/ | x /s/
z /z/ | z /z/ | d, gi /z/
– | – | tr / /
– | – | s / /
Final consonants (Mb | Mts | Vi):
p | … | …
/c/ | … | –
t /t/ | t /t/ | t /t/
c /k/ | c /k/ | c /k/
m /m/ | m /m/ | m /m/
n /n/ | n /n/ | n /n/
nh /ɲ/ | nh /ɲ/ | nh /ɲ/
ng /ŋ/ | ng /ŋ/ | ng /ŋ/
w /w/ | w /w/ | w /w/
l /l/ | l /l/ | –
Vowels (Mb | Mts | Vi):
aa, a /a/ | aa, a /a/ | aa, a /a/
ă /ă/ | ă /ă/ | ă /ă/
â /ɤ̆/ | â /ɤ̆/ | â /ɤ̆/
e /ɛ/ | e /ɛ/ | e /ɛ/
êê, ê /e/ | êê, ê /e/ | êê, ê /e/
i /i/ | i /i/ | i /i/
oo, o /ɔ/ | oo, o /ɔ/ | oo, o /ɔ/
– | – | e /ɛ/
– | – | o /ɔ/
In this thesis, we use the Muong Hoa Binh tonal system described in [115], which contains five tones: 33 - Level, 42 - Falling, 323 - Falling-Rising, 34 - High Rising, and 342ʔ - Low Falling. Meanwhile, Hanoi Vietnamese has eight tones: a six-tone paradigm in open or sonorant-final syllables and a two-tone paradigm in syllables ending in an unreleased oral stop [108]. Muong does not have a falling tone like the one in Vietnamese. Details are given in Table 2.13.
Table 2.13 Comparing the tone of Vietnamese with Muong Tan Son and Muong Bi
No. | Muong Bi phonetic distinction criteria | Muong Tan Son phonetic distinction criteria | Vietnamese tone
1 | Level, medium frequency | Level, flat, medium frequency | A1 – Level
2 | Falling, low | Falling, low, long | A2 – Mid falling
3 | Rising, high | Rising, high, short | C1 – Low falling <Hỏi>
4 | Falling-rising, glottalization at the beginning of the syllable | … | …
5 | High level, ending with glottal closure | … | …
… | Falling to low, short, ending with glottal closure | … | …
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
Chapter 3, titled "Emulating Muong TTS Based on Input Transformation of Vietnamese TTS," presents the proposal to synthesize Muong speech by adapting existing Vietnamese TTS systems. This approach can be experimentally applied to quickly create TTS systems for other Vietnamese ethnic minority languages.
Chapter 4, titled "Cross-Lingual Transfer Learning for Muong Speech Synthesis": In this chapter, we use and experiment with approaches for Muong TTS that leverage Vietnamese resources. We focus on transfer learning by creating a Vietnamese TTS, further training it with different Muong datasets, and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
Chapter 5, titled "Generating Unwritten Low-Resourced Language's Speech Directly from Rich-resource Language's Text," presents our approach for addressing speech synthesis challenges for unwritten low-resourced languages by synthesizing L2 speech directly from L1 text. The proposed system is built using end-to-end neural network technology for text-to-speech.
We use Vietnamese as L1 and Muong as L2 in our experiments.
Chapter 6, titled "Speech Synthesis for Unwritten Low-Resourced Languages Using Intermediate Representation": This chapter proposes using a phoneme representation due to its close relationship with speech within a single language. The proposed method is applied to the Vietnamese and Muong language pair. Vietnamese text is translated into an intermediate representation of two unwritten dialects of the Muong language: Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho. The evaluation reveals relatively high translation quality for both dialects.
In conclusion, speech synthesis for low-resourced languages is a significant research area with the potential to positively impact the lives of speakers of these languages. Despite challenges posed by limited data and linguistic knowledge, advancements in speech synthesis technology and innovative approaches enable the development of high-quality speech synthesis systems for low-resourced languages. The work presented in this dissertation contributes to this field by exploring novel methods and techniques for speech synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech synthesis for low-resourced languages, particularly in response to the growing demand for accessible technology. This can be achieved through ongoing research in transfer learning, unsupervised learning, and data augmentation. Additionally, there is a need for further investment in collecting and preserving linguistic data for low-resourced languages and developing phonological studies for these languages. With these efforts, we can ensure that speech synthesis technology is accessible to everyone, regardless of their language.
PART 1: BACKGROUND AND RELATED WORKS
Chapter 1 Overview of speech synthesis and speech synthesis for low-resourced language
This section presents a concise overview of Text-to-Speech (TTS) synthesis and its application to low-resourced languages. It highlights the challenges faced in developing TTS systems for languages with limited resources and data. Additionally, it introduces various approaches and techniques to address these challenges and improve TTS quality for low-resourced languages.
This section offers a brief introduction to the field of speech synthesis. It highlights the key concepts and techniques in converting written text into spoken language. It also provides a foundation for understanding the complexities and challenges of developing speech synthesis systems.
Speech synthesis is the artificial generation of human speech using technology. A computer system designed for this purpose, known as a speech computer or speech synthesizer, can be realized through software or hardware implementations. A text-to-speech (TTS) system specifically converts standard written language text into audible speech, whereas other systems transform symbolic linguistic representations, such as phonetic transcriptions, into speech [1]. TTS technology has evolved significantly over the years, incorporating advanced algorithms and machine learning techniques to produce more natural-sounding and intelligible speech output.
By simulating various aspects of human speech, including pitch, tone, and intonation, TTS systems strive to provide a seamless and user-friendly listening experience.
The development of TTS technology has undergone remarkable progress over time:
Early pioneers such as Homer Dudley, with his "VODER" (demonstrated in 1939), and Franklin S. Cooper, with the "Pattern Playback" (around 1950), laid the foundation for modern TTS systems.
The 1960s brought forth formant-based synthesis, utilizing models of vocal tract resonances to produce speech sounds.
The 1970s introduced linear predictive coding (LPC), enhancing speech signal modeling and producing more natural synthesized speech.
The 1980s saw the emergence of concatenative synthesis, a method that combined pre- recorded speech segments for the final output.
During the 1990s, unit selection synthesis became popular, using extensive databases to select the best-fitting speech units for more natural output.
The 2000s experienced the rise of statistical parametric synthesis techniques, such as Hidden Markov Models (HMMs), providing a data-driven and adaptable approach to TTS.
The 2010s marked the beginning of deep learning-based TTS with models like Google's WaveNet, revolutionizing speech synthesis by generating raw audio waveforms instead of relying on traditional signal processing.
End-to-end neural TTS systems like Tacotron streamlined the TTS process by directly converting text to speech without intermediate stages.
Transfer learning and multilingual TTS models have recently enabled the development of high- quality TTS systems for low-resourced languages, expanding the reach of TTS technology.
Today, TTS plays a vital role in everyday life, powering virtual assistants, accessibility tools, and various digital content types.
Some current applications of text-to-speech (TTS) technology include:
Assistive technology for the visually impaired: TTS systems help blind and visually impaired individuals by reading text from books, websites, and other sources, converting it into audible speech.
Learning tools: TTS systems are used in computer-aided learning programs, aiding language learners and students with reading difficulties or dyslexia by providing auditory reinforcement.
Voice output communication aids: TTS technology assists individuals with severe speech impairments by enabling them to communicate through synthesized speech.
Public transportation announcements: TTS provides automated announcements for passengers on buses, trains, and other public transportation systems.
E-books and audiobooks: TTS systems can read electronic books and generate audiobooks, making content accessible to a broader audience.
Entertainment: TTS technology is utilized in video games, animations, and other forms of multimedia entertainment to create realistic and engaging voiceovers.
Email and messaging: TTS systems can read emails, text messages, and other written content aloud, helping users stay connected and informed.
Call center automation: TTS is employed in automated phone systems, allowing users to interact with voice-activated menus and complete transactions through spoken commands.
Virtual assistants: TTS is a crucial component of popular voice-activated virtual assistants like Apple's Siri, Google Assistant, and Amazon's Alexa, enabling them to provide spoken responses to user queries.
Voice search applications: By integrating TTS with speech recognition, users can use speech as a natural input method for searching and retrieving information through voice search apps.
In conclusion, TTS technology has come a long way since its inception, with continuous advancements in algorithms, machine learning, and deep learning techniques As a result, TTS systems now provide more natural-sounding and intelligible speech, enhancing the user experience across various applications such as assistive technology, learning tools, entertainment, virtual assistants, and voice search The ongoing development and integration of TTS into our daily lives will continue to shape the future of human-computer interaction and digital accessibility.
The architecture of a TTS system is generally composed of several components, as depicted in Figure 1.1. The Text Processing component is responsible for preparing the input text for speech synthesis. The G2P Conversion component converts the written words into their corresponding phonetic representations. The Prosody Modeling component adds appropriate intonation, duration, and other prosodic features to the phonetic sequence. Lastly, the Speech Synthesis component generates the speech waveform based on the parameters derived from the fully tagged phonetic sequence [2].

Figure 1.1 Basic system architecture of a TTS system [22]

Text processing is crucial for identifying and interpreting all textual or linguistic information that falls outside the realms of phonetics and prosody. Its primary function is to transform non-orthographic elements into words that can be spoken aloud. Through text normalization, symbols, numbers, dates, abbreviations, and other non-orthographic text elements are converted into a standard orthographic transcription, facilitating subsequent phonetic conversion. Additionally, analyzing whitespace, punctuation, and other delimiters is vital for determining document structure and providing context for all subsequent steps. Certain text structure elements may also directly impact prosody. Advanced syntactic and semantic analysis can be achieved through effective text-processing techniques [2, p. 682].

The phonetic analysis aims to transform orthographic symbols of words into phonetic representations, complete with any diacritic information or lexical tones present in tonal languages. Although future TTS systems might rely on word-sounding units and possess increased storage capacity, homograph disambiguation and grapheme-to-phoneme (G2P) conversion for new words remain essential for accurate pronunciation of every word. G2P conversion is relatively straightforward in languages with a clear relationship between written and spoken forms. A small set of rules can effectively describe this direct correlation, which is characteristic of phonetic languages such as Spanish and Finnish. Conversely, English is not a phonetic language due to its diverse origins, resulting in less predictable letter-to-sound relationships. In these cases, employing general letter-to-sound rules and dictionary lookups can facilitate the conversion of letters to sounds, enabling the correct pronunciation of any word [2, p. 683].
In TTS systems, prosodic analysis involves examining prosodic features within the text input, such as stress, duration, pitch, and intensity. This information is then utilized to generate more natural and expressive speech. Prosodic analysis helps determine the appropriate stress, intonation, and rhythm for the synthesized speech, resulting in a more human-like output. Predicting prosodic features can be achieved through rule-based or machine-learning methods, including acoustic modeling and statistical parametric speech synthesis. By adjusting the synthesized speech, TTS systems can convey various emotions or speaking styles, enhancing their versatility and effectiveness across diverse applications.

Speech synthesis employs the predicted information from the fully tagged phonetic sequence to generate the corresponding speech waveform. Broadly, the two traditional speech synthesis techniques are concatenative and source/filter synthesizers. Concatenative synthesizers assemble pre-recorded human speech components to produce the desired utterance. In contrast, source/filter synthesizers create synthetic voices using a source/filter model based on a parametric description of speech. The first method requires assistance in generating high-quality speech from the input text's parametric representation and speech parameters, while the second approach requires a combination of algorithms and signal processing adjustments to ensure smooth and continuous speech, particularly at junctures.
Several improvements have been proposed for high-quality text-to-speech (TTS) systems, drawing from the two fundamental speech synthesis techniques. Among the most prominent state-of-the-art methods are statistical parametric speech synthesis and unit selection techniques, which have been the subject of extensive debate among researchers in the field.
With the advancement of deep learning, neural network-based TTS (neural TTS) systems have been proposed, utilizing (deep) neural networks as the core model for speech synthesis. A neural TTS system comprises three fundamental components: a text analysis module, an acoustic model, and a vocoder. As illustrated in Figure 1.2, the text analysis module transforms a text sequence into linguistic features, the acoustic model then generates acoustic features from these linguistic features, and finally the vocoder synthesizes the waveform from the acoustic features.
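To make the data flow concrete, the following minimal Python sketch mirrors the three components with toy placeholder functions; the actual models and their feature formats are far more complex than this illustration, and the placeholder outputs here are random.

```python
# Toy illustration of the neural TTS data flow: text -> linguistic features
# -> acoustic features (mel frames) -> waveform. All three stages are
# placeholders; only the interfaces between the components are meaningful here.
import numpy as np

def text_analysis(text: str) -> list[str]:
    # placeholder text analysis: lowercase and split into pseudo-phonemes
    return list(text.lower().replace(" ", ""))

def acoustic_model(linguistic_features: list[str]) -> np.ndarray:
    # placeholder acoustic model: one random 80-dimensional mel frame per unit
    return np.random.rand(len(linguistic_features), 80)

def vocoder(acoustic_features: np.ndarray) -> np.ndarray:
    # placeholder vocoder: 256 waveform samples per mel frame
    n_frames = acoustic_features.shape[0]
    return np.random.uniform(-1.0, 1.0, size=n_frames * 256)

waveform = vocoder(acoustic_model(text_analysis("xin chào")))
print(waveform.shape)
```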
1.1.3 Evolution of TTS methods over time
EMULATING OF THE MUONG TTS BASED ON INPUT TRANSFORMATION OF THE VIETNAMESE TTS
Proposed method
The idea of the emulating approach for TTS is based on the phonetic relation between the base language (BL) and the target language (TL). The work of building an emulating TTS for an unsupported language includes the following tasks:
Choosing a BL which is linguistically close to the TL.
Proposing an orthography mapping between BL and TL, based on the phonetic similarity between the two languages.
Building the emulating TTS for the TL by applying the phonetic mapping to the available TTS of the BL.
This section will present our work in building an emulating TTS for Muong, the target language (TL), using Vietnamese as the base language (BL).
This work aims to find a simple and cheap way to generate Muong synthesized speech. Therefore, our approach is to develop an independent module that can convert the Muong transcript into suitable input for an available Vietnamese system. This approach allows the Muong TTS to be developed independently and to work with different Vietnamese TTS systems. Figure 3.1 shows the structure of the Muong emulating TTS system, which includes three main modules.
Figure 3.1 Emulating TTS for Muong
Initially, the Muong G2Phone Tool is used to transform the Muong script into Muong phonemes represented in the International Phonetic Alphabet (IPA). Following this, the "Emulating IPA Tool" is employed to convert the Muong phonemes into their Vietnamese counterparts. Once this conversion is complete, the Vietnamese phonemes are transcribed into Vietnamese text, which is then used as input for the Vietnamese speech synthesis system. The upcoming sections will elaborate on the rationale behind employing two different Vietnamese speech synthesis systems.
The G2P (grapheme-to-phoneme) module converts written text into the phonemes of a specific language. The transition from the Muong Hoa Binh script to Muong Hoa Binh phonemes is demonstrated in Figure 3.2. To develop this G2P module, we utilize the open-source vPhon [123] and gather statistics about the phonemes and tones that differ between Vietnamese and Muong.
Each character (or series of characters) is mapped to a specific phoneme (or phoneme string) in a process known as Character-Phone Mapping. Following this, the rules for constructing Muong phonemes based on the language's writing rules are applied. This approach enables us to create G2P transformation functions for the Muong language.
In more detail, for the graphemes in the Muong script that do not appear in Vietnamese, we generate corresponding phonemes for each of these unique graphemes. For example, for additional onsets, we introduce four Muong phonemes (hr, kl, tl, w), and for additional codas, we add two Muong phonemes (l, w). Furthermore, we omit the tone 6 component because it is not present in our research data or in the Muong script, as mentioned in Table 2.13. The detailed content is described in Table A.6 in the Appendix.
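A minimal sketch of such a rule-based G2P is shown below. The grapheme-to-IPA entries and the tone numbering are a small illustrative subset derived from the examples in Table 3.1, not the full mapping implemented in the actual tool.

```python
# Illustrative rule-based G2P: Muong script -> IPA phonemes with a numeric tone.
# Only a handful of mapping entries are shown; the real tool covers the full
# Muong grapheme inventory and the additional onsets/codas mentioned above.
import unicodedata

ONSETS = {          # grapheme -> IPA (longest match applied first)
    "tl": "tl", "kl": "kl", "hr": "hr", "ph": "pʰ", "th": "tʰ",
    "kh": "x", "ch": "c", "nh": "ɲ", "ng": "ŋ",
    "b": "ɓ", "c": "k", "h": "h", "t": "t", "m": "m", "w": "w",
}
RHYMES = {"a": "a", "ăn": "ăn", "o": "ɔ", "ơm": "əːm", "ôông": "oːŋ"}
TONE_MARKS = {"\u0300": "2", "\u0301": "3", "\u0309": "4", "\u0303": "6", "\u0323": "5"}

def syllable_to_ipa(syllable: str) -> str:
    """Convert one Muong syllable into an IPA string ending with a tone digit."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone, base = "1", ""                      # tone 1 (level) when no mark is found
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]             # tone diacritics are stripped off
        else:
            base += ch
    base = unicodedata.normalize("NFC", base)
    onset_ipa, rest = "", base
    for length in (2, 1):                     # longest-match-first for the onset
        if base[:length] in ONSETS:
            onset_ipa, rest = ONSETS[base[:length]], base[length:]
            break
    return onset_ipa + RHYMES.get(rest, rest) + tone

print(" ".join(syllable_to_ipa(s) for s in "ho ăn cơm".split()))  # hɔ1 ăn1 kəːm1
```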
Table 3.1 Examples of Muong G2P results
Vietnamese sentence | Muong text | Muong phonemes (IPA)
Tôi ăn cơm | Ho ăn cơm | hɔ1 ʔan1 kəːm1
Tôi nghe thấy tiếng kêu bên kia sông | Ho iếng e thiếng đi ớ bên các khôông đớ | hɔ1 ʔiəŋ3 ʔɛ1 thiəŋ3 ɗi1 ʔəː3 ɓen1 kaːk3 xoːŋ1 ɗəː3
Cha tôi vừa làm xong một ngôi nhà mới | Bác ho mới là xong một cái nhà mới | ɓaːk3 hɔ1 məːj3 laː2 sɔŋ1 mot5 kaːj3 ɲaː2 məːj3
Chị tôi đang nấu cơm trong bếp | Máng cải ho tang nố cơm ở tlong pếp | maːŋ3 kaːj4 hɔ1 taːŋ1 no3 kəːm1 ʔəː4 tlɔŋ1 pep3
Mẹ tôi cho tôi hai quả chuối | Máng ho cho ho hal tlái chuối | maːŋ3 hɔ1 cɔ1 hɔ1 haːl1 tlaːj3 cuəj3
Table 3.1 gives some examples of Muong G2P results. Our tool converted all cases, including phonemes that exist in Muong but not in Vietnamese.
Based on the phonetic comparisons in the previous section, transformation rules are proposed for mapping from Muong orthography to Vietnamese orthography that can be read by a Vietnamese TTS. For the equivalent and close cases, the transformation rules are simple replacements of Muong items by Vietnamese items, as listed in Table A.5. For the distinct cases, it is impossible to transform these Muong items into Vietnamese items; therefore, those cases are not considered in this study and will be dealt with in future work. Table 3.2 shows examples of applying the transformation rules to convert Muong text into input text for Vietnamese TTS.
Table 3.2 Examples of applying transformation rules to convert the Muong text into input text for Vietnamese TTS
Muong text | Emulating text for Vietnamese TTS | Meaning
… | … | 'I'm studying'
Ho phải za ty dộng bầy? | Ho phải da ty dộng bầy? | 'I'm with you go out?'
Nhà za chiếm từ cúi chăng? | Nhà da chiếm từ cúi chăng? | 'Your house has many pigs?'
Our approach for developing this module is quite straightforward; it is based on the close relationship and minor phonetic differences between the two languages, Vietnamese and Muong. Muong IPA is converted to Vietnamese IPA using a rule base and a phone mapping table. Muong phonemes, including onsets, medials, nuclei, codas, and tones, are mapped to the corresponding Vietnamese phonemes. Phonemes present in Muong but not in Vietnamese are mapped to another Vietnamese phoneme with a similar pronunciation. The phone mapping table is described in Table A.6 in the Appendix.
After obtaining the Vietnamese G2P module, we refer to James Kirby's vPhon and generate the corresponding IPA phoneme strings for about 7,000 Vietnamese words. We then build a Phoneme-to-Grapheme (P2G) dictionary of about 7,000 Vietnamese words, where the key is the string of Vietnamese IPA phonemes and the value is the orthography in Vietnamese. With this, we can translate Muong phonemes into Vietnamese phonemes and a Muong phoneme string into Vietnamese words.
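The sketch below illustrates the chain Muong IPA → Vietnamese IPA → Vietnamese orthography. Both dictionaries are tiny illustrative stand-ins for Table A.5/A.6 and the 7,000-word P2G dictionary, and the specific phoneme substitutions shown are examples only, not the exact rules adopted in the thesis.

```python
# Sketch of the Emulating IPA + P2G step: substitute Muong-only phonemes with
# Vietnamese ones, then look the resulting syllable up in a phoneme-to-grapheme
# dictionary. Both tables here are small illustrative excerpts.
MUONG_TO_VI_IPA = {"tl": "tɕ", "hr": "h", "kl": "k"}   # example substitutions only
P2G = {"hɔ1": "ho", "ʔan1": "ăn", "kəːm1": "cơm", "tɕuəj3": "chuối"}

def split_phonemes(syllable: str) -> list[str]:
    # naive split: try two-character phonemes first, then single characters
    i, out = 0, []
    while i < len(syllable):
        step = 2 if syllable[i:i + 2] in MUONG_TO_VI_IPA else 1
        out.append(syllable[i:i + step])
        i += step
    return out

def emulate(muong_ipa_sentence: str) -> str:
    words = []
    for syllable in muong_ipa_sentence.split():
        mapped = "".join(MUONG_TO_VI_IPA.get(p, p) for p in split_phonemes(syllable))
        words.append(P2G.get(mapped, mapped))  # fall back to IPA when the syllable is unseen
    return " ".join(words)

print(emulate("hɔ1 ʔan1 kəːm1"))   # -> "ho ăn cơm"
print(emulate("tlaːj3"))           # the tl onset is replaced before the P2G lookup
```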
Experiment
For the Vietnamese TTS system, thanks to the independent module structure, the system can be applied to different Vietnamese TTS systems. At the time I began working on this thesis in 2016, there were two main approaches in speech synthesis: the unit selection approach and the statistical parametric approach (HMM/DNN). The available Vietnamese TTS systems also follow these two approaches, for example VOS TTS (Voice Of Southern Vietnam) 4, MICA TTS 5, vnSpeak 6 (unit selection technique), and VAIS TTS 7, OpenFPT TTS 8 (statistical parametric technique). According to [124], the unit selection approach has the advantage of producing high quality at the waveform level because it concatenates speech waveforms directly. This technique is also easy to implement [125] and has been researched over a long history. By contrast, statistical parametric approaches, which generate the average of some set of similarly sounding speech segments [124], cannot be compared with unit selection in producing a natural voice [126]. For testing the Muong emulating TTS, we chose Vietnamese TTS systems of both techniques in order to examine whether the TTS technique affects the sound quality of the Muong synthesized speech. We also chose Vietnamese TTS systems offered as web services, which are convenient to use and to pair with the module. Finally, two TTS web services for Vietnamese, both of which support generating Hanoi Vietnamese speech, were chosen for our experiment: the MICA TTS service (unit selection technique - TTS1) [127]–[129] and OpenFPT TTS (statistical parametric technique - TTS2). The emulating results of the built system with the two Vietnamese TTS systems are tested in a later section.
4 http://www.ailab.hcmus.edu.vn/
5 http://mica.edu.vn/vova/
7 https://vais.vn/en/text-to-speech-service/
8 http://ngtts.stis.vn/#/demo
In this experiment, we aim to create a Muong Hoa Binh TTS using the emulating approach from the Vietnamese TTS, as previously described. We have selected two Vietnamese TTS systems as the foundation for our experiment, considering their compatibility with the emulating approach and the specific requirements of our project.
As for the data, we do not need to train the system, so we focus on gathering test data. This test data comprises 15 sample texts collected from Muong Hoa Binh documents and 15 basic communication sentences rewritten in the Muong Hoa Binh script.
The method for this experiment involves the following steps: First, we input the collected data into the constructed system, which consists of the Muong G2P and Muong Emulating IPA modules. Next, we generate the fake input Vietnamese text using these modules, and finally, we input the generated text into the two Vietnamese TTS systems for synthesis.
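The following sketch shows how this pipeline could be driven end to end, sending the generated emulating text to a Vietnamese TTS web service. The endpoint URL and its request/response format are hypothetical placeholders; the MICA and OpenFPT services used in the experiment each have their own interfaces.

```python
# Hypothetical driver for the experiment: the emulating text (already produced
# by the Muong G2P and Emulating IPA modules) is posted to a Vietnamese TTS
# web service and the returned audio is stored for the perceptual test.
import requests

def synthesize(emulating_text: str, out_path: str,
               endpoint: str = "https://example.org/vietnamese-tts") -> None:
    response = requests.post(endpoint, json={"text": emulating_text}, timeout=60)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)        # assume the service returns audio bytes

test_sentences = ["Ho phải da ty dộng bầy?", "Nhà da chiếm từ cúi chăng?"]
for i, sentence in enumerate(test_sentences):
    synthesize(sentence, f"muong_emulated_{i:02d}.wav")
```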
The testing material was designed to examine the transformation rules proposed in the section above. The test data is therefore divided into three groups:
Group 1 – Emulating tones testing: The goal of this test is to examine if the proposed Vietnamese tones can effectively "fake" the Muong tones. Five Muong tones are set within five syllables, each having a simple structure (Consonant-Vowel). These syllables are then placed within the sentence "Tứa có ưa chăng?" as shown in Table 3.3. After listening to the sentence, the listener is not required to write down the entire sentence but only the word containing the tone in question.
Table 3.3 Testing material for emulating tone
Muong tone | Vietnamese tone for emulating | Containing sentence (in Muong) | IPA | Emulating text | Vietnamese meaning
… | A1 – Level | ka | kaA1 | ca | gà
… | A2 – Mid falling | mè | mɛA2 | mè | mè
… | B1 – Rising | ná | naB1 | ná | nỏ
… | C1 – Low falling <Hỏi> | tẻ | tɛC1 | tẻ | đẻ
Falling | B2 – Low glottalized | mệ | meB2 | mệ | mẹ
Group 2 – Phone closed testing: In this test, five equivalent closed phonemes are assessed within the sentence "Tứa có ưa chăng." The listener is required to write down the words they have just heard and to evaluate the overall sentence quality. This method helps in understanding the effectiveness of the proposed approach for handling closed phonemes in synthesized speech.
Table 3.4 Testing material for emulating phone (the concerning phonemes in bold)
Muong word | Muong IPA | IPA transcription | Emulating text | Vietnamese meaning
bang | baŋ1 | ɓaŋA1 | bang | con hoẵng
cha | ca1 | tɕaA1 | cha | vườn
gế | ge4 | ɣeB1 | ghế | ghế
kha | kʰa1 | xaA1 | kha | vợt bắt cá
phui | phui1 | fuiA1 | phui | vui
Group 3 – General testing: For the remaining phonemes, a set of 5 sentences was prepared; Muong listeners were asked to write down the sentences they had just heard and to evaluate the quality of each sentence.
Fifteen sentences from the three groups above were set as input for the emulating TTS system for Muong with the 2 Vietnamese TTS systems (as mentioned in the section above). The total output is 30 synthesized utterances (15 sentences x 2 TTS techniques). These utterances are stored as audio files to be used in the perceptual test.
Table 3.5 Testing material for remaining phonemes
Muong sentence | Muong IPA | IPA transcription | Emulating text | Vietnamese sentence
Chú mua của oi? | cu4 muə1 cuə3 ɔi1 | cuB1 muəA1 cuəC1 ɔiA1 | Chú mua của oi? | Anh mua của ai?
Ho tang cúm lọ | hɔ1 taŋ1 cum4 lɔ5 | hɔA1 taŋA1 cumB1 lɔB2 | Ho tang cúm lọ | Tôi đang sẩy lúa
Cải chi ni? | kai3 ci1 ni1 | kaiC1 ciA1 niA1 | Cải chi ni? | Cái gì đây
Da bí thía nó à? | da1 bi4 tʰia4 nɔ4 a2 | daA1 biB1 tʰiəB1 nɔB1 aA2 | Da bí thía nó à? | Mày bị làm sao thế?
Ở lái ăn cơm hái | ɤ3 lai4 ăn1 kɤm1 hai4 | ɤC1 laiB1 ănA1 kɤmA1 haiB2 | Ở lái ăn cơm hái | Ở lại ăn cơm nhé
The testing protocol followed the testing method for synthesized speech proposed by the ITU [130]. The test was designed for two purposes:
Evaluating the intelligibility of the Muong emulating speech: whether the listener can understand precisely the content of the test sentence.
Evaluating the quality of the synthesized sound: how the listener judges the naturalness of the Muong emulating speech.
The evaluation was carried out with 50 native speakers of Muong Hoa Binh, ensuring gender balance within the cohort with an equal number of 25 male and 25 female participants. The average age of the participants was 23.33 years. Additionally, the cohort displayed a balanced educational background: half of them, 25 individuals, held university degrees, while the rest had attained high school level education.
To conduct the testing, each participant was asked to listen to each test sentence one to three times. Following this auditory assessment, the listeners were tasked with two primary actions: writing down the words they heard and rating the overall quality of the synthesized voice.
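As an illustration, the two measures could be aggregated as below. The response format is an assumption made for this sketch; the thesis's actual answer sheets and scoring procedure follow the ITU recommendation cited above.

```python
# Toy aggregation of the perceptual test: word-level intelligibility (share of
# target words the listener wrote down correctly) and the Mean Opinion Score
# (average of 1-5 quality ratings). The response records below are made-up examples.
from statistics import mean

def intelligibility(target: str, transcribed: str) -> float:
    target_words = target.lower().split()
    heard = transcribed.lower().split()
    return sum(w in heard for w in target_words) / len(target_words)

responses = [
    {"target": "tứa có ưa chăng", "transcribed": "tứa có ưa chăng", "mos": 4},
    {"target": "tứa có ưa chăng", "transcribed": "tứa có màng chăng", "mos": 3},
]

print("intelligibility:", mean(intelligibility(r["target"], r["transcribed"]) for r in responses))
print("MOS:", mean(r["mos"] for r in responses))
```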
Conclusion
The results indicate that the Muong emulating synthesis system is generally intelligible, although there are instances where improvements could be made. Some synthesized speech could be more accurate, but overall, the majority of the synthesized speech is understandable for listeners. Participants noticed that the emulating voice sounds similar to Vietnamese and lacks the intonation characteristic of some Muong dialects. This study closely followed the new writing system of Muong Hoa Binh, and the obtained results demonstrate that, due to the similarity of the phoneme systems, the fake-approach-based synthetic voices are understandable and perceptible. The Muong test participants' voice quality scores (MOS test) are relatively high. The study also employed two high-quality speech synthesis systems, TTS1 and TTS2. Experimental results show that the TTS1 system scores higher in some cases, while in other cases the TTS2 system scores higher. In general, both TTS systems produced satisfactory results. This outcome suggests that researchers can continue to explore this development direction to address more complex issues in the emulation technique.
In 2016, finding native Muong speakers was challenging. The study collaborated with seven Muong volunteers to evaluate the system. Comparing recorded Muong speech and synthesized signals from TTS1 and TTS2 validated the research. This approach can be applied to create TTS for other Vietnamese ethnic minority languages. Further details are given in subsequent chapters, and some results were presented at the FAIR 10 conference.
Chapter 4 Cross-lingual transfer learning for Muong speech synthesis
As touched upon in Chapter 1 and subsequently revisited in the proposal segment of Chapter 2, one promising method to construct a Text-to-Speech (TTS) system for low-resourced languages involves harnessing the power of transfer learning from a pre-trained model of a language that boasts abundant resources. This technique proves especially beneficial when the two languages share a close relationship.
The main objective of this section of the thesis is to evaluate the effectiveness of implementing and optimizing the transfer learning technique in the construction of a TTS system for the Muong language, with a particular focus on the Hoa Binh dialect. As discussed in Section 1.2.3, transfer learning has demonstrated potential for adaptation to new domains. In the context of the Vietnamese-Muong language relationship, this raises intriguing questions that our research intends to address.
Is it feasible to consider Muong, specifically the Muong Hoa Binh dialect, as a new domain for a Vietnamese TTS system? If so, what volume of data is appropriate for tuning this model, and what strategy should be employed to achieve this? Furthermore, we consider the potential of this tuning process to address certain limitations found in the emulation approach. Specifically, can this tuning process help to accurately reproduce sounds unique to the Muong language that are not present in Vietnamese?
We propose a strategic approach to answer these compelling questions, including rigorous experimentation and a thorough evaluation process. The subsequent sections will delve into these aspects, providing a detailed narrative on the selection and preparation of the training data, the challenges posed by the linguistic differences between Vietnamese and Muong, and how we gauge the TTS system's performance following the transfer learning application.
In our specific context, the low-resourced language is Muong Hoa Binh, while Vietnamese is the language rich in resources. In this chapter, our main focus is the exploration of this transfer learning strategy to construct a TTS system for the Hoa Binh Muong language, capitalizing on the knowledge from a pre-existing Vietnamese TTS model.
Proposed method
Transfer learning is a crucial technique in the fields of machine learning and deep learning. It allows a model to learn from previously acquired knowledge and apply it to a new task, often with limited training data. An essential aspect of transfer learning is the use of pretrained models, and a key concept in the transfer of knowledge between models is "model weights."
Model weights refer to the parameters of a neural network in a machine learning model. They include matrices and vectors that encode how the model represents data and performs predictions. In deep learning, neural networks typically have millions of model weights. These weights are adjusted and updated during training to enable the model to learn how to represent data and make predictions.
Transfer Learning Using Pretrained Models:
In transfer learning, pretrained models, i.e., models that have been trained on a similar or related task, are employed. These models have already learned how to represent data and perform a specific task. The weights of pretrained models contain knowledge about how to represent data for that task.
Benefits of Using Pretrained Model Weights:
Better Initialization: Using pretrained model weights allows a new model to start from a more favorable position in parameter space. Instead of initializing with random weights, the model has already "learned" from a prior task and can make initial predictions more effectively.
Time and Resource Savings: Training a model from scratch often requires a substantial amount of data and time. Using pretrained models can save time and resources, especially in situations where training data is limited.
Leveraging Shared Knowledge: Pretrained model weights contain shared knowledge on how to represent data and make predictions in a specific domain. They can be utilized and fine-tuned to suit a new task, helping the model learn how to perform that task based on shared knowledge.
Multi-Task Knowledge Integration: Transfer learning enables the integration of knowledge from multiple previous tasks into a new model. This can enhance generalization and automation for various applications.
In summary, model weights contain valuable information on how to represent data and make predictions. Employing transfer learning and pretrained model weights is an efficient way to leverage knowledge learned from previous tasks and apply it to new tasks, particularly in situations with limited training data.
In this chapter, we train a Tacotron 2 model on Vietnamese data, which we refer to as the pretrained model. Then, the Tacotron 2 model is fine-tuned on Muong language data. During the fine-tuning process, all model weights are updated with a smaller learning rate than when training on Vietnamese data, decreased from 1e-3 to 1e-4. Using a learning rate that is too high during fine-tuning can lead to overshooting and failure to converge; this can happen when the model weights are adjusted too quickly, causing oscillations and instability during the convergence process. Therefore, when fine-tuning a deep learning model, we need to decrease the learning rate to ensure stable and effective convergence. Reducing the learning rate decreases the magnitude of the weight updates, making fine-tuning slower but more stable, thereby improving the accuracy of the model.
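A minimal PyTorch sketch of this two-stage procedure is shown below. The tiny acoustic model stands in for Tacotron 2 (which is far larger), and the batches are random dummy data; only the mechanics of reloading the pretrained weights and lowering the learning rate from 1e-3 to 1e-4 are illustrated.

```python
# Transfer-learning mechanics: pretrain on Vietnamese with lr=1e-3, save the
# weights, reload them, and fine-tune on Muong with lr=1e-4.
import torch
from torch import nn

class TinyAcousticModel(nn.Module):
    def __init__(self, n_phonemes: int = 44, mel_dim: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, 128)   # shared Vietnamese/Muong IPA inventory
        self.proj = nn.Linear(128, mel_dim)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(phoneme_ids))

model = TinyAcousticModel()
criterion = nn.MSELoss()

# Stage 1: pretraining on Vietnamese (one dummy step shown).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
vi_phonemes, vi_mels = torch.randint(0, 44, (8, 50)), torch.randn(8, 50, 80)
optimizer.zero_grad()
criterion(model(vi_phonemes), vi_mels).backward()
optimizer.step()
torch.save(model.state_dict(), "pretrained_vietnamese.pt")

# Stage 2: fine-tuning on Muong, starting from the Vietnamese weights.
model.load_state_dict(torch.load("pretrained_vietnamese.pt"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # reduced from 1e-3
mu_phonemes, mu_mels = torch.randint(0, 44, (8, 50)), torch.randn(8, 50, 80)
optimizer.zero_grad()
criterion(model(mu_phonemes), mu_mels).backward()
optimizer.step()
```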
Figure 4.1 Low-resourced L2 TTS transfer learning from rich resource L1
Because Vietnamese and Muong are two closely related languages, their phonetic representations are similar, differing only in some phonemes such as certain initial and final consonants and tones, as already discussed in Chapter 2. The input representation for the Tacotron 2 model is therefore a phoneme representation, built from the combined International Phonetic Alphabet (IPA) inventories of the Vietnamese and Muong languages, which serves as a shared input phoneme representation for the Tacotron 2 model.
Another difference in our transfer learning approach compared to the original Tacotron 2 paper is the vocoder we used: instead of the WaveNet model, we used the HiFi-GAN model. WaveNet is a deep learning model for audio synthesis that predicts each audio sample from the previous audio samples, using an autoregressive architecture based on stacks of dilated causal convolutions to generate high-quality audio. However, because WaveNet is more complex and deeper than HiFi-GAN, it requires more time and resources to train and to synthesize audio.
The speech synthesis model we used is described in Figure 4.2. There are two differences compared to the Tacotron 2 model in the original paper: the input representation uses phonemes instead of characters, and the HiFi-GAN vocoder is used instead of the WaveNet network. The modules of the model, such as the encoder, decoder, vocoder, and attention, are kept unchanged in terms of architecture as well as their parameters.
Figure 4.2 Block diagram of the speech synthesis system architecture
Details of the model parameters are described in the following table:
Table 4.1 Parameters of acoustic model
Att_location_num_filters: 32
Att_location_kernel_size: 31
Postnet: …
The key factors related to the complexity of Tacotron 2, based on specific information about the model, are:
Number of Parameters: Tacotron 2 has approximately 28.2 million parameters. These parameters include weights and biases in the encoder, decoder, and other components of the model. The large number of parameters indicates a significant computational load during both training and inference.
At the same time, utilizing the HIFIGAN Vocoder model (~0.92 million parameters) as a replacement for the WaveNet Vocoder (~13.3 million parameters) significantly enhances both training and inference speed.
Architecture: Tacotron 2's architecture includes multiple components:
Encoder: The encoder consists of convolutional layers followed by LSTM layers. Typically, Tacotron 2 uses 3 convolutional layers and 1 bidirectional LSTM layer in the encoder.
Decoder: The decoder includes LSTM layers for generating mel spectrograms from the encoder's output. It usually has 2 LSTM layers, each with a dimension of 1024.
Attention Mechanism: Tacotron 2 employs an attention mechanism to focus on relevant parts of the input sequence during decoding. The attention mechanism has a dimension of 128, and it also uses convolutional layers with 32 filters and a kernel size of 31 for location-based attention.
Postnet: The postnet, used for refining mel spectrograms, has 5 convolutional layers with an embedding dimension of 512 and a kernel size of 5.
Input and Output Sizes: Tacotron 2 takes variable-length sequences of text as input and produces mel spectrograms as output. The length of the input text sequence and the desired output length (mel spectrogram sequence) can affect computational complexity, especially during inference.
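For reference, the figures quoted above can be collected into a single configuration summary; the field names below are our own shorthand, not identifiers from any particular Tacotron 2 implementation.

```python
# Summary of the architectural figures listed above (values taken from the text).
tacotron2_summary = {
    "total_parameters": 28_200_000,                       # ~28.2 M
    "encoder": {"conv_layers": 3, "bidirectional_lstm_layers": 1},
    "decoder": {"lstm_layers": 2, "lstm_dim": 1024},
    "attention": {"dim": 128, "location_filters": 32, "location_kernel_size": 31},
    "postnet": {"conv_layers": 5, "embedding_dim": 512, "kernel_size": 5},
    "vocoder_parameters": {"hifigan": 920_000, "wavenet": 13_300_000},
}
print(tacotron2_summary)
```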
Experiment
Firstly, regarding the training data for the pretrained model, we used approximately 20 hours of labeled Vietnamese audiobook data collected from various open websites. The audio data was obtained from NgheAudio 9, and the corresponding text data was obtained from dtruyen 10. The raw data was not divided into small segments (ranging from 1 second to under 15 seconds) with corresponding text but was rather in the form of long audio files (with an average duration of one hour) for each chapter of the story.
After considering the series selection and the voiceover based on the criteria of a clear voice, as little noise as possible, and at least 20 hours of audio, we chose Tran Van's voice reading the story Dai Mong Chu. The audio data undergoes the following processing steps after being downloaded from the internet:
Sample rate is initially set at 44100 Hz.
The audio data has a stereo channel.
Bitrate is set at 128 kb/s.
The audio data is then normalized to have a sample rate of 22050 Hz.
The audio data is changed to have a mono channel.
The codec used is pcm_s16le.
The downloaded data has a significantly long duration, with each file lasting approximately one hour. To generate input for the acoustic model, the original audio files were segmented into smaller units based on signal segments containing no voice, commonly referred to as silence. The length of each segment ranges from 1 second to 10 seconds. This process resulted in a total of around 19,000 sentences. The duration distribution over the entire dataset after slicing into segments is shown in Figure 4.3.
9 https://www.ngheaudio.org/truyen-audio-dai-mong-chu
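The preparation steps above could be realized roughly as follows, using ffmpeg for the format conversion and pydub for the silence-based segmentation. The silence thresholds are illustrative choices; the thesis does not specify the exact values used.

```python
# Audio preparation sketch: convert each chapter-long file to 22.05 kHz mono
# 16-bit PCM, then split it at silences and keep only 1-10 second segments.
import subprocess
from pydub import AudioSegment
from pydub.silence import split_on_silence

def normalize(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "22050", "-ac", "1", "-c:a", "pcm_s16le", dst],
        check=True,
    )

def segment(wav_path: str, out_prefix: str) -> None:
    audio = AudioSegment.from_wav(wav_path)
    chunks = split_on_silence(
        audio,
        min_silence_len=400,                 # ms of silence that triggers a cut (assumed)
        silence_thresh=audio.dBFS - 16,      # relative silence threshold (assumed)
        keep_silence=150,
    )
    for i, chunk in enumerate(chunks):
        if 1000 <= len(chunk) <= 10000:      # keep segments between 1 s and 10 s
            chunk.export(f"{out_prefix}_{i:05d}.wav", format="wav")

normalize("chapter_001.mp3", "chapter_001.wav")
segment("chapter_001.wav", "daimongchu_ch001")
```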
Figure 4.3 shows that segments with lengths from 1 s to 6 s make up most of the Vietnamese audio dataset. The processing steps for the dataset of clipped audio segments are as follows:
The dataset contains a mix of audio segments, including a small portion of segments with background noise (music, ambient noise) and mostly segments with only the reader's voice.
To filter out audio segments with background noise, the open-source inaSpeechSegmenter is used. This results in a selection of clean audio tracks that contain only the voice of the storyteller and no background noise.
The selected audio segments are then labeled using an open-source Vietnamese Automatic Speech Recognition (ASR) model (Whisper), with a WER of about 10% on Vietnamese, to obtain relatively accurate labels for each audio segment.
To correct any predicted label errors, the Levenshtein distance algorithm [131] is employed to calculate the distance between two strings. Additionally, listening to the beginning and end of each long audio file before segmentation is used to limit the text space for comparison.
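A compact sketch of the labeling step is given below, assuming the openai-whisper and python-Levenshtein packages; restricting the candidate sentences per chapter (as described above) is simplified here to a flat list.

```python
# Automatic labeling sketch: transcribe a segment with Whisper, then choose the
# closest reference sentence from the chapter text by Levenshtein distance.
import whisper
import Levenshtein

asr = whisper.load_model("medium")           # any multilingual Whisper checkpoint

def label_segment(wav_path: str, candidate_sentences: list[str]) -> str:
    hypothesis = asr.transcribe(wav_path, language="vi")["text"].strip().lower()
    return min(candidate_sentences,
               key=lambda ref: Levenshtein.distance(hypothesis, ref.lower()))

candidates = ["một câu trong chương truyện", "một câu khác trong chương truyện"]
print(label_segment("daimongchu_ch001_00001.wav", candidates))
```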
Total duration: 19 hours 58 minutes 30 seconds
The information on the entire audiobook dataset is described in Table 4.2. The number of distinct phones is 44, including phones that represent silence (sil), end-of-sentence (eos), and padding used for shorter sentences within a batch during training. The remaining 41 phonemes are represented in IPA format shared by the Vietnamese and Muong languages, as listed in Table A.6 in the Appendix.
The Vietnamese text database was built before recording the Muong speech database. The selected domain is news. To ensure the quality of the translation, the Vietnamese text has to balance phonemic and lexical distribution as in reality. Nearly 4 million Vietnamese text sentences were collected from the general Vietnamese news field (Vietnamnet, Dantri, etc.), and around 900,000 Vietnamese text sentences from Muong local news publishers (Hoa Binh newspaper 11, Phu Tho newspaper 12). The local news data helps to collect words commonly used in the Muong regions. The raw data is then preprocessed: sentences are separated, short sentences (under 5 tokens) and long sentences (over 120 tokens) are removed, and non-Vietnamese sentences (containing more than 50% foreign-language words) are removed. The original text collection contained around 4.9 million sentences and was considered representative of the real word distribution in the domain. A random extraction algorithm that preserves the phonemic and syllable distribution of the original text collection was applied to extract a set of 20,000 sentences. These 20,000 sentences are divided into 2 sets: one consists of 5,000 sentences extracted from the 900,000 sentences of Muong newspapers, and the other includes 15,000 sentences extracted from the 4 million sentences of general newspapers.
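The filtering rules just described (5 to 120 tokens, rejection of sentences with more than 50% foreign-language words) could be sketched as follows; the Vietnamese-character test is a simplified assumption and ignores digits and tone normalization.

```python
import re

# Characters that may appear in Vietnamese words (simplified assumption).
VIETNAMESE_CHARS = re.compile(
    r"^[a-zàáạảãâầấậẩẫăằắặẳẵèéẹẻẽêềếệểễìíịỉĩòóọỏõôồốộổỗơờớợởỡ"
    r"ùúụủũưừứựửữỳýỵỷỹđ]+$", re.IGNORECASE)

def keep_sentence(sentence: str) -> bool:
    """Apply the length and foreign-word filters described above."""
    tokens = sentence.split()
    if not (5 <= len(tokens) <= 120):
        return False
    foreign = sum(1 for t in tokens
                  if not VIETNAMESE_CHARS.match(t.strip(".,;:!?\"'")))
    return foreign / len(tokens) <= 0.5   # reject sentences with >50% foreign tokens
```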
Vietnamese text data is normalized using the Vietnamese normalization toolkit. All sentences are converted into readable word representations in Vietnamese, leaving no numeric representations, special characters, or acronyms. The speech corpus was recorded in sound-proof rooms. Four Muong native speakers, 2 males and 2 females, from 2 dialects (Muong Bi – Hoa Binh and Muong Tan Son – Phu Tho) were chosen to record the database. All speakers are Muong radio broadcasters with good, clear, and coherent voices. The speakers read each Vietnamese sentence in the collection of 5,000 sentences and then speak it in Muong. The male voices of the two dialects were used to train the system (the female voices are reserved for other phonetic studies).
Along with the recording, the speech data was processed to normalize energy and to remove noise, long pauses, pronunciation errors, and unexpected errors encountered during recording. Faulty speech was required to be recorded again, and each speaker was required to record all sentences at good quality. Audio files are recorded at a sampling rate of 44.1 kHz and are then converted to 22.05 kHz to match the system's training input. Each collection for the male voice corresponds to more than 1800 minutes of audio signal after post-processing. The Vietnamese text data was preprocessed (normalizing non-standard words: punctuation, numbers, acronyms, upper/lowercase characters, etc.) to obtain the appropriate representation of a sentence as a string of Vietnamese words. Voice data is trimmed at the beginning and end of each file, and its energy is normalized.
From the Muong language dataset of the Muong project, the Muong speech recorded by Bui Viet Cuong, a broadcaster from Hoa Binh Radio, was selected for the transfer learning implementation. The details of the recorded dataset are described in the table below:
11 http://www.baohoabinh.com.vn/en/
Dialect: Mường Bi – Hoa Binh (CauBaoMuong)
Speaker name: Bui Viet Cuong
Total duration: 4 hours 24 minutes 30 seconds
To investigate the relationship between the amount of training data and the quality of the synthesized speech output, we divided the high-quality recorded dataset into smaller training sets for fine-tuning purposes. The details of the smaller training sets are described in the table below:
Table 4.4 The Muong split data set
The training sets are constructed to maximize phoneme coverage, with sentences taken at random. Looking at the table above, we can see the total number of phonemes increasing across the M_15m, M_30m, and M_60m sets, corresponding to the datasets with durations of 15 minutes, 30 minutes, and 60 minutes, respectively.
Figure 4.4 Duration distribution across the M_15m, M_30m, and M_60m datasets.
In Figure 4.4, the durations are evenly distributed across the datasets and range from 1 to 15 seconds.
The validation set used during training is the same for all three training configurations: 50 sentences randomly taken from the Muong dataset and disjoint from the training data.
To convert Vietnamese or Muong written text into IPA phoneme sequences, we used the same method of mapping characters to phonemes, combined with the mapping rules presented in Section 3.1.1 (the Muong G2P module). The mapping tables are shown in Table A.5 in the Appendix.
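A rule-based G2P conversion of this kind can be implemented as a greedy longest-match lookup in a mapping table, as sketched below; the few entries shown are illustrative only (tones are omitted) and do not reproduce the full Table A.5.

```python
# Illustrative subset of a grapheme-to-IPA mapping table (not the full Table A.5).
G2P_TABLE = {
    "nh": "ɲ", "ng": "ŋ", "ph": "f", "th": "tʰ", "tr": "ʈ",
    "a": "a", "e": "ɛ", "ê": "e", "o": "ɔ", "ô": "o", "u": "u", "i": "i",
    "b": "ɓ", "d": "z", "đ": "ɗ", "t": "t", "m": "m", "n": "n",
}

def g2p(syllable: str) -> list[str]:
    """Greedy longest-match conversion of one written syllable to IPA phones."""
    phones, i = [], 0
    while i < len(syllable):
        for length in (2, 1):                 # try digraphs before single letters
            chunk = syllable[i:i + length]
            if chunk in G2P_TABLE:
                phones.append(G2P_TABLE[chunk])
                i += length
                break
        else:
            i += 1                            # skip characters not in the table
    return phones

# Example: g2p("than") -> ["tʰ", "a", "n"]
```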
4.2.3 Training the pretrained model using the Vietnamese dataset.
We used approximately 20 hours of Vietnamese audiobook data to train the acoustic model, which learns how to convert phoneme inputs into mel spectrogram features. The neural network optimization algorithm used for the acoustic model is Adam. The parameters of the Adam optimizer are described in the table below:
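For reference, the optimizer setup in PyTorch might look like the sketch below; the learning rate, betas, epsilon, and weight decay shown are values commonly used for Tacotron 2 training and are assumptions, not necessarily the exact values of the table.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Adam optimizer for the acoustic model; hyperparameter values are assumed."""
    return torch.optim.Adam(
        model.parameters(),
        lr=1e-3,               # initial learning rate
        betas=(0.9, 0.999),    # first/second moment decay rates
        eps=1e-6,              # numerical stability term
        weight_decay=1e-6,     # L2 regularization
    )
```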
Evaluation
With the aim of examining the effectiveness of the model when fine-tuning the pretrained model on Muong datasets of different durations, we used 50 in-domain test sentences and 50 out-of-domain test sentences. Details of the two test sets are described in the following table:
Table 4.7 The specifications of the in-domain and out-domain test sets
In-domain test set Out-domain test set
The in-domain test set is randomly selected from the recorded Muong dataset, ensuring that all phonemes are represented. The in-domain set consists of sentences collected from news sources, including newspapers, radio broadcasts, and current affairs. On the other hand, the out-of-domain test set comprises daily conversation sentences, primarily short phrases, that also cover all phonemes for a comprehensive assessment. The table below provides a few examples of sentences from both test sets.
In-domain
Sample in Vietnamese: Chủ động xây dựng kế hoạch hoạt động cho từng nội dung chuyên môn gắn với công tác thi đua khen thưởng
Sample in English: Actively develop plans for each professional content in conjunction with the emulation and commendation work
Sample in Vietnamese: Người dân vừa là người tổ chức, lãnh đạo và là lực lượng trực tiếp xung kích, đấu tranh giữ gìn an ninh tại cơ sở
Sample in English: People are both organizers, leaders, and the direct force to fight for maintaining security at the grassroots level
Sample in Vietnamese: Khi hỏi bí quyết học tập, Giang không ngần ngại chia sẻ, phương pháp học tập của em rất đơn giản
Sample in English: When asked about the secret to studying, Giang did not hesitate to share that her learning method is very simple
Out-of-domain
Sample in Vietnamese: Gia đình anh có khỏe không? Anh chị được mấy cháu rồi?
Sample in English: Is your family well? How many children do you and your spouse have?
Sample in Vietnamese: Gia đình anh vẫn khỏe, anh được hai cháu: một trai, một gái rồi.
Sample in English: My family is well; we have two children: a boy and a girl.
Sample in Vietnamese: Rất vui gặp anh ở đây
Sample in English: It's nice to meet you here
The Mean Opinion Score (MOS) was evaluated by a cohort of 50 Muong Hoa Binh native speakers. The cohort was balanced in terms of gender, with 25 males and 25 females participating in the study. The average age of the participants was 23.33 years. In terms of educational attainment, half of the participants (25) held university degrees, while the remaining 25 had high school diplomas.
As part of the evaluation process, each participant was instructed to listen to a total of 20 sentences, comprising two distinct sets. The first set included 10 in-domain sentences, covering topics like news, current affairs, and broadcasting. The second set consisted of 10 out-of-domain sentences, reflecting daily communication scenarios. Each set of 10 sentences was randomly selected from a larger pool of 50 test sentences to ensure a diverse representation of linguistic contexts.
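The MOS values reported below are averages over listeners and sentences. A minimal sketch of how a mean and its ± margin can be computed from the raw ratings is given here, assuming the margin is a normal-approximation 95% confidence interval, which may differ from the exact convention used.

```python
import numpy as np

def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    scores = np.asarray(ratings, dtype=float)
    mean = scores.mean()
    margin = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, margin

# Example: mos_with_ci([4, 5, 4, 3, 5, 4]) -> approximately (4.17, 0.60)
```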
For the quantitative evaluation, we use the MCD DTW 14 (Mel Cepstral Distortion with Dynamic Time Warping) score, which measures the difference between two sequences of mel cepstra. The smaller the score, the better the quality of the synthesized speech. While it is not a perfect metric for assessing synthetic speech quality, it can be useful when combined with other measures. The MCD DTW score is calculated between the synthesized audio file and the original audio file, and the final score is averaged over 50 pairs for each set.
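A sketch of how an MCD DTW score of this kind can be computed with librosa: MFCC sequences are extracted from the reference and synthesized files, aligned with DTW, and the standard MCD scaling is applied. The number of coefficients and the exclusion of the 0th (energy) coefficient follow common practice and are assumptions rather than the exact recipe of the footnoted implementation.

```python
import numpy as np
import librosa

def mcd_dtw(ref_wav: str, syn_wav: str, n_mfcc: int = 25) -> float:
    """Mel Cepstral Distortion between two utterances after DTW alignment."""
    def cepstra(path):
        y, sr = librosa.load(path, sr=22050)
        # Drop the 0th (energy) coefficient, as is common for MCD.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)[1:]

    ref, syn = cepstra(ref_wav), cepstra(syn_wav)
    # DTW over frame-wise Euclidean distances; wp is the optimal warping path.
    _, wp = librosa.sequence.dtw(X=ref, Y=syn, metric="euclidean")
    diffs = ref[:, wp[:, 0]] - syn[:, wp[:, 1]]
    # Standard MCD scaling constant: 10 * sqrt(2) / ln(10).
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)
    return const * np.mean(np.sqrt(np.sum(diffs ** 2, axis=0)))
```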
                 Test in-domain                 Test out-domain
Model            MOS           MCD (DTW)       MOS           MCD (DTW)
Ground Truth     4.36 ± 0.21   –               4.31 ± 0.22   –
M_15m            3.09 ± 0.45   6.875 ± 0.127   2.88 ± 0.45   7.125 ± 0.235
M_30m            3.27 ± 0.30   5.622 ± 0.214   3.08 ± 0.44   6.890 ± 0.161
M_60m            3.63 ± 0.36   5.133 ± 0.091   3.35 ± 0.36   6.521 ± 0.143
The MOS was used to evaluate the subjective quality of the speech samples from the different models. In the table provided, we observe a trend of improving MOS scores as the training duration increases. This implies that with more training data, the subjective quality of the synthesized speech increases.
For the in-domain test:
Ground Truth: As the reference point for natural speech, the Ground Truth yielded the highest MOS score (4.36 ± 0.21).
M_15m: With a MOS score of 3.09 ± 0.45, this model received the lowest score of the three, implying that the quality of the synthesized speech was not as good as the others.
M_30m: An improvement from the M_15m model is seen with a MOS score of 3.27 ± 0.30. This suggests that additional training time improved the subjective quality of the synthesized speech.
M_60m: This model achieved the highest MOS score (3.63 ± 0.36) among the synthesized models, indicating that the quality of the speech generated was the most appreciated by listeners, albeit not quite reaching the level of the natural speech.
For the out-of-domain test:
Ground Truth: Again, the Ground Truth demonstrated the highest MOS score (4.31 ± 0.22).
M_15m: The M_15m model had the lowest MOS (2.88 ± 0.45), suggesting that its synthesized speech was perceived as less satisfactory.
M_30m: An increase in MOS is observed with a score of 3.08 ± 0.44, indicating a better speech quality perception compared to M_15m.
M_60m: Mirroring the in-domain test, M_60m achieved the highest MOS score among the models (3.35 ± 0.36), though it still fell short of the natural speech.
14 https://github.com/SandyPanda-MLDL/ALGAN-VC-Generated-Audio-Samples
In conclusion, the MOS scores demonstrate a noticeable improvement in the subjective quality of synthesized speech as the training duration increases from 15 minutes to 30 minutes, and then to 60 minutes. However, there is still a noticeable gap between the models and natural speech, suggesting room for further improvement.
The Mel Cepstral Distortion (MCD) measured using Dynamic Time Warping (DTW) provides a quantitative metric comparing the synthesized speech with the natural reference speech. Lower MCD DTW values indicate a closer match to the natural reference speech, implying better synthesized speech quality.
From the data table, there is a clear trend of decreasing MCD DTW values as we move from the M_15m model to the M_60m model, for both in-domain and out-domain tests. This suggests that the quality of the synthesized speech improves with increased training duration, becoming more similar to natural speech.
In the in-domain tests:
The M_15m model exhibited the highest MCD DTW value (6.875 ± 0.127), suggesting its synthesized speech is most divergent from natural speech among the three models.
The M_30m model showed an improvement over the M_15m model, with a lower MCD DTW value (5.622 ± 0.214). This indicates that its synthesized speech is closer to natural speech than that of the M_15m model.
The M_60m model had the lowest MCD DTW value (5.133 ± 0.091) among the three models, suggesting its synthesized speech is closest to natural speech.
In the out-domain tests:
Once again, the M_15m model had the highest MCD DTW value (7.125 ± 0.235), suggesting its synthesized speech is most divergent from natural speech among the three models.
The M_30m model had a lower MCD DTW value (6.890 ± 0.161) compared to the M_15m model, indicating that its synthesized speech is closer to natural speech.
Consistent with the in-domain tests, the M_60m model had the lowest MCD DTW value (6.521 ± 0.143), indicating that its synthesized speech is closest to natural speech in the out-domain context as well.
The M_60m model achieved the best performance in terms of MCD DTW for both in-domain and out-domain tests, indicating that with increased training duration, the synthesized speech can approach the quality of natural speech more closely.
We can see that as the training data increases, the MOS scores increase and the MCD (DTW) scores decrease. Even when training with only 60 minutes of data, the quality of the synthesized audio approaches that of the original signal, although a gap remains.
MOS analysis by ANOVA
For the in-domain test set, applying a two-way ANOVA, called ANOVA5 in our research, provides the means to test three distinct null hypotheses. The hypotheses for the ANOVA5 analysis, which considers two independent variables, TTS_System and Subject (the Muong volunteers), are as follows (a statsmodels sketch is given after the list):
Null Hypothesis (H0) - TTS System: There is no significant variance in the mean MOS attributable to differences between the TTS systems being evaluated. In other words, the TTS system used does not significantly affect the MOS scores.
Null Hypothesis (H0) - Subject: There is no significant variation in MOS scores across the different Muong volunteers evaluating the synthesized speech. This implies that the subjectivity of the listeners does not significantly influence the MOS scores.
Null Hypothesis (H0) - Interaction effect: There is no significant interaction effect between the TTS systems and the subjects on the resulting MOS scores. This means that the combined effect of the TTS system and the subjectivity of the volunteers does not significantly affect the MOS scores.
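The ANOVA5 test (and, with Sentences in place of Subject, the ANOVA6 test) can be run with statsmodels, for example as in the sketch below; the data-frame column names are assumptions about how the individual ratings are organized.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# `ratings` is assumed to be a long-format table with one row per rating:
# columns MOS (1-5 score), TTS_System (e.g. GroundTruth/M_15m/M_30m/M_60m),
# and Subject (listener id).
def anova5(ratings: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA of MOS with TTS_System and Subject as factors."""
    model = ols("MOS ~ C(TTS_System) + C(Subject) + C(TTS_System):C(Subject)",
                data=ratings).fit()
    return sm.stats.anova_lm(model, typ=2)   # F and p values for each factor
```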
In our ANOVA6 analysis, we consider two independent variables: TTS_System and Sentences. The null hypotheses for this analysis are as follows:
Null Hypothesis (H0) - TTS System: There is no substantial difference in the Mean Opinion Scores (MOS) that can be ascribed to variations between the evaluated TTS systems. Essentially, the type of TTS system employed does not have a significant impact on the MOS.
Null Hypothesis (H0) - Sentences: There is no considerable variation in MOS scores across the different sentences used in the evaluation process. This suggests that the specific sentences chosen for the evaluation do not exert a significant influence on the MOS.
Null Hypothesis (H0) - Interaction Effect: There is no noteworthy interaction effect between the TTS systems and the sentences on the derived MOS scores. This implies that the combined influence of the TTS system and the sentences used in the evaluation does not significantly alter the MOS.
Table 4.10 ANOVA Results for in-domain MOS Test
Table 4.10 presents the results of the ANOVA5 analysis of the Mean Opinion Scores (MOS) based on the hypotheses stated earlier.
The first hypothesis being tested is whether there is a significant difference in MOS between TTS systems. The analysis shows that the factor "TTS_System" has a significant effect (F = 116.321, p < 0.001, η² = 0.162), indicating that there is a significant difference in MOS between TTS systems.
The second hypothesis being tested is whether there is a significant difference in MOS across different subjects. The analysis shows that the factor "Subject" does not have a significant effect (F = 1.292, p = 0.086, η² = 0.034), indicating that there is no significant difference in MOS across different subjects.
The third hypothesis being tested is whether there is an interaction effect between TTS systems and subjects on MOS. The analysis shows that there is no significant interaction effect between "TTS_System" and "Subject" on MOS (F = 0.789, p = 0.968, η² = 0.061), indicating that the effect of TTS systems on MOS does not depend on the subject.
These results suggest that the MOS scores are affected by the TTS systems used but not by the subjects listening to the synthesized speech. These findings could be useful in improving the overall performance of TTS systems by identifying the specific factors that affect MOS scores and addressing them accordingly.
In the two-way ANOVA6, the factors are TTS_System and Sentences. The ANOVA results for the MOS variable show that both the TTS_System and Sentences factors have significant effects on the MOS measurements, and that there is a significant interaction between the two factors:
The TTS_System factor has a significant effect on the MOS measurements (F = 122.822, p