General Introduction
Machine Translation (MT) [1] is a major sub-field of Natural Language Processing (NLP) [2] that focuses on translating human languages automatically with a computer. In its early stages, machine translation relied heavily on manual translation rules and linguistic knowledge. However, because the nature of language is significantly complicated, it is impossible to cover all irregular cases with hand-crafted translation rules alone. During the development of MT, more and more large-scale parallel corpora appeared. With these data-driven approaches, Statistical Machine Translation (SMT) [3] replaced the original rule-based translation thanks to its ability to learn latent factors such as word alignments or phrases directly from corpora. However, SMT is still far from expectations because it cannot model long-distance word dependencies. With the emergence of deep learning in recent years, Neural Machine Translation (NMT) [4], [5] has become the new paradigm and replaced SMT as the mainstream of MT.
This project primarily aims to use NMT to translate Vietnamese to Bahnar, the language used by one of the ethnic minorities of Vietnam, the Bahnar people. The translation system can make communication easier between Bahnar people and others who use Vietnamese. Moreover, the system can be enhanced and developed into a more user-friendly application (web or mobile). Besides, under Circular No. 34/2020/TT-BGDĐT published by the Ministry of Education and Training in 2020 [6], Bahnar is a language subject that students from elementary to high school level can learn. Studying Bahnar is a way to conserve the national language and honor the spiritual values and culture of the Bahnar people. While the availability of large parallel corpora significantly impacts how a neural machine translation system performs, Bahnar is a low-resource language [7], which can make the system suffer from poor translation quality [8]. Therefore, Data Augmentation (DA) [9] needs to be involved in the project to generate extra data points from the empirically observed training set to train the NMT model. Data augmentation was first widely applied in the computer vision field and then used in natural language processing, achieving improvements in many tasks. DA helps to improve the diversity of training data, thereby helping the model anticipate unseen factors in the test data.
Problem Description
Machine translation is the task of automatically converting source text in one language into text in another language. Automatically translating a sequence of text from one language to another poses a significant challenge due to the inherent ambiguity and flexibility of human language. As a result, there is no singular, definitive translation that can be considered the best. This difficulty in achieving accurate and natural machine translation makes it one of the most complex problems in the field of artificial intelligence. Assume X is the source-language sentence and Y is the target-language sentence. In this context, X is a Vietnamese sentence that will be translated into Y, a Bahnar sentence (Bahnar Kriêm). Example:
"Năm sau, tôi sẽ đi học ở dưới huyện"
⇒"Sơnăm anô, inh năm hok u˘ei tơ huen"
In reality, besides the Bahnar Kriêm dialect, Bahnar has four other dialects (mentioned in Section 3.3). However, Bahnar can be considered a low-resource language due to its lack of documentation (textbooks, local educational materials, etc.). This issue can be explained by several factors, such as limited budget revenues, population dispersion, divided traffic, and slow zoning progress [7]. From the available resources, there exist some differences between Bahnar and Vietnamese, such as the position of exclamations, skipping the "to be" word, and wrong collocations when translating phrases and compound words. These issues could be observed and analyzed further by linguistics experts if enough materials were collected, but such analysis is costly; therefore, NMT needs to be involved in solving this problem. Moreover, Vietnam still has other ethnic minority languages (Thai, Jrai, Mnông, etc.) in the same situation as Bahnar, which makes NMT even more important for translating these low-resource languages in the future. Because the problem involves both NMT and Vietnamese, BARTPho, a state-of-the-art Seq2Seq model for Vietnamese, becomes a suitable option as a pre-trained model.
As mentioned before, the translation quality of NMT depends on the size of the parallel corpus, which is hard to achieve in the context of translating Vietnamese-Bahnar. So, DA becomes a vital method to improve the accuracy of the NMT approach. DA applications in NLP have been investigated in recent years, and the most well-known fields are text classification [10]–[12], text generation (including NMT) [13]–[15], and structure prediction [16], [17]. DA is still a very common and widespread approach in NMT, which samples a fake data distribution $P_f(X')$ using common methods (Figure 1.1) based on the real data distribution $P_r(X)$, where $X_f^1$, $X_f^2$ refer to augmented data generated from real data using common approaches such as replacing and swapping.
[Figure 1.1: The commonly used methods of DA for NMT. Common operations (replace, swap, etc.) turn the original sentence "Sơnăm anô, inh năm hok u˘ei tơ huen" into augmented sentences such as "Sơnăm anô, nhi năm hok u˘ei tơ sa" and "Sơnăm anô, huen năm hok u˘ei tơ inh".]
In this project, suitable DA methods will be investigated and applied to support low-resource Vietnamese-Bahnar translation. These methods will focus on some specific contexts to demonstrate their performance.
Objectives And Missions
The thesis aims to explore and utilize data augmentation strategies in the field of neural machine translation, with a particular focus on the low-resource language setting. Therefore, the main objectives of this thesis can be listed as:
• Understanding NMT and DA in NLP
• Studying and stating the features of the Bahnar language and the differences between Bahnar and Vietnamese to list the focused sentence types
• Proposing suitable solutions for augmenting low-resource dataset
• Constructing and expanding the original dataset based on proposed DA solutions
Based on the stated objectives, this thesis needs to accomplish the following tasks:
• Studying DA techniques in NLP, related works, DA methods applicable to low- resource NMT, and assessing their benefits and drawbacks.
• Researching the Bahnar language and observing some noticeable sentence subsets
• Proposing DA techniques that improve translation quality for low-resource NMT, especially on the specific focused dataset
• Experimenting and evaluating the proposed approaches
• Stating the contributions, the existing issues, and the future research direction
Scope Of Work
The project will focus strongly on a few aspects, which can be listed as:
• Data augmentation methods, especially for NMT
• Dataset languages: Bahnar and Vietnamese
• Applying data augmentation techniques and evaluating the translation results based on the BLEU score (a minimal scoring example is sketched below)
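For reference, a minimal BLEU computation might look like the following sketch, assuming the sacreBLEU library and using the Vietnamese example sentence from the introduction as a toy reference (the hypothesis string is invented for illustration):

```python
import sacrebleu

# Toy system outputs and references (one reference stream, aligned by index).
hypotheses = ["Năm sau tôi sẽ đi học ở huyện"]
references = [["Năm sau, tôi sẽ đi học ở dưới huyện"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```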
Contributions
Through the defined goals and all the work that the researcher has accomplished, some remarkable points of this thesis can be stated:
• Researching the knowledge background of general concepts of DA in NLP, espe- cially DA techniques in NMT
• Observing some different special points between Bahnar and Vietnamese
• Applying the researched techniques flexibly in the context of translating Vietnamese to Bahnar, in both general and special contexts
• Showing the effectiveness of the proposed DA approaches: the multi-task learning approach and sentence boundary augmentation
Thesis Structure
This thesis is structured into seven chapters:
• Chapter 1 INTRODUCTION: The introduction to this thesis, which is also the current section, provides a general view of the thesis
• Chapter 2 BACKGROUND: This chapter provides the necessary background knowledge for implementing the thesis
• Chapter 3 BAHNAR LANGUAGE: This chapter provides an overview of the
Bahnar language, including its grammar structure, dialects, and noticeable sentence types
• Chapter 4 RELATED WORKS: This chapter introduces related research of
DA methods and states the foundation approaches of this thesis.
• Chapter 5 APPROACHES: This chapter clearly describes the proposed meth- ods, their motivation, and how they work
• Chapter 6 EXPERIMENTS AND EVALUATIONS: This chapter states the environment, the tools, and the configurations to conduct the experiments, eval- uates and discusses the impact of the proposed method
• Chapter 7 CONCLUSION: The final chapter summarizes the contributions and remaining issues of the thesis and discusses future improvements
Neural Machine Translation
Assume a source sentence $\mathbf{x} = \{x_1, \ldots, x_S\}$ and a target sentence $\mathbf{y} = \{y_1, \ldots, y_T\}$ are given. By using the chain rule, a standard NMT model [18], [19] factorizes the sentence-level translation probability as a product of word-level probabilities from left to right (L2R):

$$P(\mathbf{y}\,|\,\mathbf{x};\theta) = \prod_{t=1}^{T} P(y_t\,|\,\mathbf{y}_{<t},\mathbf{x};\theta). \qquad (2.1)$$

NMT models that conform to Eq. 2.1 are referred to as L2R autoregressive NMT [4], [5], since the prediction at time step $t$ is taken as input at time step $t+1$. NMT normally uses maximum likelihood estimation (MLE) as the training objective, which is commonly used for estimating the parameters of a probability distribution. Given the training corpus $D = \{\langle \mathbf{x}^{(s)}, \mathbf{y}^{(s)} \rangle\}_{s=1}^{S}$, the goal of training is to find a set of model parameters that maximizes the log-likelihood on the training set:

$$\hat{\theta}_{MLE} = \operatorname*{argmax}_{\theta} \; \mathcal{L}(\theta), \qquad (2.2)$$

where the log-likelihood is defined as

$$\mathcal{L}(\theta) = \sum_{s=1}^{S} \log P(\mathbf{y}^{(s)}\,|\,\mathbf{x}^{(s)};\theta). \qquad (2.3)$$
By the back-propagation algorithm, the gradient of $\mathcal{L}$ can be computed with respect to $\theta$. NMT model training usually adopts the stochastic gradient descent (SGD) algorithm. Instead of computing gradients on the full training set, SGD computes the loss function and gradients on a mini-batch of the training set. The plain SGD optimizer updates the parameters of an NMT model with the following rule:

$$\theta \leftarrow \theta - \alpha \nabla \mathcal{L}(\theta), \qquad (2.4)$$

where $\alpha$ is the learning rate. The parameters of NMT are guaranteed to converge to a local optimum with a well-adjusted learning rate. In practice, adaptive learning rate optimizers such as Adam [20] are found to significantly reduce training time compared to the basic SGD optimizer.
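To make the training procedure concrete, the following is a minimal sketch of one mini-batch update, assuming PyTorch and a hypothetical encoder-decoder `model` that maps (source, shifted target) to per-token logits; the token-level cross-entropy below is the negative of the log-likelihood in Eq. 2.3 restricted to the mini-batch:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, src_batch, tgt_batch, pad_id=0):
    """One MLE update on a mini-batch (hypothetical model interface).

    src_batch: (B, S) source token ids
    tgt_batch: (B, T) target token ids with <bos> ... <eos>
    """
    model.train()
    optimizer.zero_grad()

    # Teacher forcing: predict tgt[:, 1:] from tgt[:, :-1] and the source.
    logits = model(src_batch, tgt_batch[:, :-1])        # (B, T-1, |V|)
    gold = tgt_batch[:, 1:]                              # (B, T-1)

    # Cross-entropy = negative log-likelihood of the gold tokens.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        gold.reshape(-1),
        ignore_index=pad_id,
    )
    loss.backward()    # back-propagation: gradient of L with respect to theta
    optimizer.step()   # SGD: theta <- theta - alpha * grad (or an Adam update)
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # Adam [20]
# loss = train_step(model, optimizer, src_ids, tgt_ids)
```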
Data Augmentation
Overview
Data augmentation refers to techniques for increasing training data diversity without collecting extra data. Most methods either produce synthetic data or add slightly modified versions of existing data, expecting that the augmented data can serve as a regularizer and lessen overfitting while training ML models [21], [22]. DA has been widely employed in computer vision, where model training typically includes operations like cropping, flipping, and color transforming. In NLP, where the input space is discrete, it is less evident how to create efficient augmented instances that capture the desired invariances.
Goals And Trade-offs
Despite challenges posed by text data, numerous data augmentation (DA) strategies have been devised for natural language processing (NLP), ranging from rule-based manipulations to complex generative systems. The objective of DA is to provide an alternative to data collection, thus the ideal DA technique should be both easy to implement and effective in improving model performance. However, most DA strategies involve trade-offs between simplicity and performance.

Rule-based methods yield minor enhancements despite their ease of implementation. In contrast, trained models enhance performance more substantially, but their implementation requires greater resources and introduces data variability. Tailored model-based techniques can yield significant performance gains for specific tasks, but their development and effective use pose challenges.

Additionally, it is important that the augmented data distribution strike a balance between being neither too similar nor too different from the original data. If the augmented data is too similar, it may result in overfitting. At the same time, if it is too different, it can lead to poor performance due to training on examples that do not accurately represent the intended domain. Therefore, effective data augmentation approaches should strive for a harmonious equilibrium.

Discussing the interpretation of DA, Dao, Gu, Ratner, et al. [25] note that "data augmentation is typically performed in an ad-hoc manner with little understanding of the underlying theoretical principles", and claim that it is insufficient to explain DA as regularization. In general, there is a noticeable absence of comprehensive research on the precise mechanisms underlying the effectiveness of DA. Existing studies mostly focus on superficial aspects and seldom delve into the theoretical foundations and principles involved.
Techniques
The general approaches of DA techniques are covered in the survey of Feng, Gangal, Wei, et al. [26], where DA techniques are grouped into rule-based techniques, interpolation techniques, and model-based techniques. Rule-based techniques do not require model components and use simple, preset transforms. A typical example of these methods is Easy Data Augmentation (EDA) [11], which performs a set of random perturbation operations at the token level, such as random insertion, swap, and deletion. A different category of DA techniques, called interpolation and initially introduced by MIXUP [27], involves interpolating the inputs and labels of multiple real examples. Model-based DA techniques, in turn, utilize Seq2seq models and language models. An example is the widely used "back-translation" approach [28], which involves translating a sequence into a different language and then translating it back to the original language.

However, DA approaches can be categorized more specifically based on the characteristics of the techniques and the diversity of the augmented data. Li, Hou, and Che [29] frame DA methods into three categories: paraphrasing, noising, and sampling. The more specific classification is shown in Figure 2.1.
• The paraphrasing-based methods generate augmented data that retains a strong semantic resemblance to the original data by making controlled and limited modifications to the sentences. The augmented data effectively conveys almost identical information as the original data.
• The noising-based methods aim to enhance the model’s robustness by introducing discrete or continuous noise while ensuring the validity of the data. These methods focus on adding noise in a controlled manner to improve the model’s ability to handle different scenarios.
• The sampling-based methods excel at understanding the data distributions and generating novel data samples from within these distributions. By employing artificial heuristics and trained models, these techniques produce more diverse data that effectively caters to a wider range of requirements for downstream tasks.
[Figure 2.1: Taxonomy of DA NLP methods, organized into paraphrasing, noising, and sampling, with representative techniques and references such as thesauruses (Wei and Zou [11]; Coulombe et al. [30]), language models (Jiao et al. [31]), back-translation (Xie et al. [32]; Fabbri et al. [33]), noising (Kang et al. [38]; Zhang et al. [39]), pretrained models (Anaby-Tavor et al.), and self-training (Thakur et al. [41]).]
Paraphrases, commonly observed in natural language, offer alternative expressions that convey identical information as the original text [43], [44]. Given their inherent nature, generating paraphrases is a suitable data augmentation approach. Paraphrasing encompasses various levels, such as lexical paraphrasing, phrase paraphrasing, and sentence paraphrasing. Consequently, the paraphrasing-based data augmentation techniques described below can be classified within these three levels.

Zhang, Zhao, and LeCun [15] are the first to apply a thesaurus in data augmentation. They employ a WordNet-derived thesaurus that groups synonyms of words based on their similarities. In this method, they identify all the replaceable words within each sentence and randomly select a subset of $r$ words to be replaced. The probability of selecting $r$ is determined by a geometric distribution with parameter $p$, in which $P[r] \sim p^r$. EDA also replaces the original words with their synonyms using WordNet: it randomly chooses $n$ words, which are not stop words, from the original sentence, and each of these words is replaced with a random synonym. Apart from synonyms, Coulombe [30] suggests incorporating hypernyms as replacements for the original words. Additionally, they propose arranging the augmented words based on increasing difficulty, starting with adverbs, followed by adjectives, nouns, and finally verbs.
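The following snippet is a rough sketch of this thesaurus-based replacement, assuming NLTK's WordNet data has been downloaded and ignoring part-of-speech and stop-word handling: $r$ is drawn from a geometric-style distribution with parameter $p$, and each selected word is swapped for a random synonym.

```python
import random
from nltk.corpus import wordnet  # assumes nltk.download("wordnet") has been run

def synonyms(word):
    """Collect WordNet synonyms of `word`, excluding the word itself."""
    cands = {lemma.name().replace("_", " ")
             for synset in wordnet.synsets(word)
             for lemma in synset.lemmas()}
    cands.discard(word)
    return sorted(cands)

def sample_r(p):
    """Sample r so that P[r] decays roughly like p^r (simple coin-flip scheme)."""
    r = 1
    while random.random() < p:
        r += 1
    return r

def thesaurus_augment(tokens, p=0.5):
    """Replace r randomly chosen replaceable words with random synonyms."""
    replaceable = [i for i, w in enumerate(tokens) if synonyms(w)]
    if not replaceable:
        return list(tokens)
    r = min(len(replaceable), sample_r(p))
    out = list(tokens)
    for i in random.sample(replaceable, r):
        out[i] = random.choice(synonyms(out[i]))
    return out

# print(" ".join(thesaurus_augment("the film was a terrible waste of time".split())))
```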
Semantic embeddings overcome the limitations of replacement range and parts of speech in the thesaurus-based method. They replace the sentence’s original word with its nearest neighbor in the embedding space using pre-trained word embeddings such as GloVe, Word2Vec, or FastText. In the Twitter message classification task, Wang and Yang [10] use both word embeddings and frame embeddings instead of discrete words. As for word embeddings, each original word in the tweet is replaced with one of its k-nearest-neighbor words using cosine similarity. Regarding frame semantic embeddings, the authors undertake semantic parsing of 3.8 million tweets and construct a continuous bag-of-frame model utilizing Word2Vec [45] to represent each semantic frame.
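A comparable sketch using pre-trained word vectors (here via gensim; the vector file name is only an assumption, e.g., fastText Vietnamese vectors) replaces a word with one of its k nearest neighbours by cosine similarity:

```python
import random
from gensim.models import KeyedVectors

# Assumed: a pre-trained embedding file in word2vec text format is available locally.
kv = KeyedVectors.load_word2vec_format("cc.vi.300.vec", binary=False)

def embedding_replace(tokens, k=5, n=1):
    """Replace n random in-vocabulary words with one of their k nearest neighbours."""
    out = list(tokens)
    candidates = [i for i, w in enumerate(out) if w in kv.key_to_index]
    for i in random.sample(candidates, min(n, len(candidates))):
        neighbours = [w for w, _ in kv.most_similar(out[i], topn=k)]
        out[i] = random.choice(neighbours)
    return out
```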
Recently, pretrained language models have gained popularity as they exhibit outstanding performance. Masked language models (MLMs) like BERT and RoBERTa are capable of predicting masked words within a text by leveraging contextual cues. This ability can be effectively utilized for augmenting text data. Jiao, Yin, Shang, et al. [31] utilize both word embeddings and masked language models to generate augmented data. They employ the BERT tokenizer to split words into multiple word pieces. Each word piece has a 0.4 probability of being replaced. If a word piece does not represent a complete word, it is substituted with its K-nearest-neighbor words in the GloVe embedding space. Complete word pieces are replaced with [MASK], and BERT is employed to predict K words to fill in the blank.
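A minimal sketch of this masked-language-model augmentation using the HuggingFace `fill-mask` pipeline is shown below; the checkpoint choice and the strategy of masking a single random word are assumptions made for illustration:

```python
import random
from transformers import pipeline

# Assumed checkpoint; any masked language model with a fill-mask head works.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def mlm_augment(sentence, top_k=5):
    """Mask one random word and let the MLM propose context-aware replacements."""
    words = sentence.split()
    i = random.randrange(len(words))
    masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
    predictions = fill_mask(masked, top_k=top_k)
    # Each prediction contains the full sequence with the mask filled in.
    return [p["sequence"] for p in predictions]

# print(mlm_augment("Năm sau tôi sẽ đi học ở dưới huyện"))
```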
A natural way to paraphrase is through translation. Machine translation is well-known as an augmentation technique in various tasks thanks to the development of machine translation models and the availability of online APIs. Back-translation is the process of translating the original text into a new language and then translating it back into the original language to produce the augmented text [28]. Unlike word-level approaches, back-translation rewrites the entire sentence rather than replacing individual words. Xie, Dai, Hovy, et al. [32] and Fabbri, Han, Li, et al. [33] use English-French translation models (in both directions) to perform back-translation on each sentence and obtain their paraphrases.
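Sketched below is a round-trip (back-translation) paraphraser in the same spirit, assuming the publicly available MarianMT English-French checkpoints on the HuggingFace hub; any other pivot language pair would work the same way:

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

# English -> French and French -> English checkpoints (as in the cited setup).
tok_fwd, mt_fwd = load("Helsinki-NLP/opus-mt-en-fr")
tok_bwd, mt_bwd = load("Helsinki-NLP/opus-mt-fr-en")

def translate(sentences, tok, mt):
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    out = mt.generate(**batch, max_length=128)
    return tok.batch_decode(out, skip_special_tokens=True)

def back_translate(sentences):
    """Paraphrase by translating to the pivot language and back."""
    return translate(translate(sentences, tok_fwd, mt_fwd), tok_bwd, mt_bwd)

# print(back_translate(["The weather was nice, so we went for a walk."]))
```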
Unlike back-translation, the unidirectional translation method involves directly translating the original text into other languages without translating it back to the original language. This approach is typically employed within a multilingual context. In the task of unsupervised cross-lingual word embeddings (CLWEs), Nishikawa, Ri, and Tsuruoka [34] build a pseudo-parallel corpus with an unsupervised machine translation model. Initially, the authors train unsupervised machine translation (UMT) models by utilizing the source and target training corpora. These models are then employed to translate the corpora. The resulting machine-translated corpus is combined with the original corpus to train monolingual word embeddings separately for each language. Subsequently, the learned monolingual word embeddings are aligned and mapped to a shared CLWE space.
The semantics of natural language is sensitive to text order, while slight order change is still readable for humans [46]. Hence, employing random swapping of words or even sentences within a feasible range can serve as an effective approach for data augmentation.

Wei and Zou [11] randomly choose two words in the sentence and swap their positions. This process is repeated n times, in which n is proportional to the sentence length l. Apart from swapping words at the word level, certain studies have also suggested swapping at the sentence or even instance level. For instance, in the context of tweet sentiment analysis, Luque [35] divides tweets into two halves and combines randomly sampled first and second halves with the same label. While the data generated through this approach may lack grammaticality and semantic coherence, it still retains relatively complete semantic meaning and emotional polarity compared to individual words. Besides, the noising model has been applied in unsupervised NMT [47], [48] to make the model able to reconstruct any source sentence given a noisy translation of the same sentence in the target domain.
The deletion method means randomly deleting words in a sentence or deleting sentences in a document. As for word-level deletion, Wei and Zou randomly remove each word in the sentence with probability p. Longpre, Wang, and DuBois [17] and Zhang, Li, Zhang, et al. [12] also apply the same method. As for sentence-level deletion, Yan, Li, Zhang, et al. [36] propose randomly removing each sentence in a legal document based on a specific probability. This approach is motivated by the fact that legal cases often contain numerous irrelevant statements, and removing them does not compromise the overall understanding of the case.
The insertion method means randomly inserting words into a sentence or inserting sentences into a document. As for word-level insertion, Wei and Zou select a random synonym of a random word in a sentence that is not a stop word, then insert that synonym into a random position in the sentence. This process is repeated n times. In the task of legal document classification, where documents sharing the same label often contain similar sentences, Yan, Li, Zhang, et al. [36] utilize sentence-level random insertion. This technique involves randomly selecting sentences from other legal documents that share the same label, thereby generating augmented data.
Substitution means randomly replacing words or sentences with other strings. Unlike the above paraphrasing methods, this method usually avoids using strings semantically similar to the original data. Coulombe [30] and Regina, Meyer, and Goutal [49] introduce a list of the most common misspellings in English to generate augmented texts containing common misspellings. Xie, Wang, Li, et al. [37] draw inspiration from the concept of "word-dropout" and enhance generalization by reducing the amount of information within the sentence. Wang, Pham, Dai, et al. [14] propose a method that randomly replaces words in the input and target sentences with other words in the vocabulary.
Strategies
The above section provided information on three types of DA methods: paraphrasing, noising, and sampling, along with their respective characteristics. However, it is important to note that the effectiveness of these DA methods can be influenced by various factors in real-world applications. This section will mention the factors that should be considered to build suitable DA methods.

The methods described in Section 2.2.3 are not restricted to being used independently. They can be combined to achieve improved performance. Some common combinations include:

• The Same Type of Methods: Certain studies integrate various approaches based on paraphrasing and generate diverse paraphrases to enhance the diversity of augmented data. For example, Liu, Lee, and Lee [50] use both thesauruses and semantic embeddings. Regarding methods based on noising, they often combine different techniques that were previously considered unlearnable, as demonstrated in [51]. This is because these methods are straightforward, efficient, and mutually beneficial. Some methods also adopt different sources of noising or paraphrasing, like [37]. The combination of different resources could also improve the robustness of the model.
In situations where task-agnostic methods are required for data augmentation, unsupervised methods come into play. EDA, a widely used unsupervised method, leverages techniques such as synonym replacement, random insertion, swap, and deletion to augment data without requiring labeled examples or specific task information. Unsupervised methods are advantageous in scenarios where labeled data is scarce or not readily available. They offer a generalizable approach to data augmentation, making them suitable for a wide range of tasks.

To improve the robustness of augmented data, multi-granularity is used, which involves applying the same method at different levels to add diverse changes. This approach enhances the data by incorporating varying degrees of granularity. For instance, Wang and Yang [10] leveraged Word2Vec to train both word and frame embeddings.
The augmentation methods mentioned earlier rely on hyperparameters that significantly influence the impact of augmentation. Some common hyperparameters are listed in Figure 2.2.
[Figure 2.2: Hyperparameters that affect the augmentation effect in each DA method, e.g., the choice of thesauruses, semantic embeddings, or language models; the number and proportion of replacements or operations; the number and types of (intermediate) languages; and the parameters of the neural network.]
Applications on NLP tasks
Due to the diverse nature of tasks, metrics, datasets, architectures, and experimental setups in NLP, it is challenging to compare the performance of various data augmentation methods directly. Therefore, instead of conducting direct comparisons, the effectiveness of data augmentation methods will be analyzed in different NLP tasks such as text classification, text generation, and structured prediction.
• Text classification is a fundamental and straightforward task in NLP. It involves assigning a given input text to a specific category from a predefined set of categories. The objective is to determine the appropriate category that best represents the content of the text.
• Text generation, as its name suggests, involves producing or generating textual content based on given input data. A well-known example of text generation is machine translation, where the goal is to generate the translated version of a given input text in a different language.
• The structured prediction problem, which is often specific to NLP, differs from text classification in that it involves output categories that exhibit strong correla- tions and have specific formatting requirements.
DA methods generally find wider application in text classification than other NLP tasks, including within each category. Additionally, each specific DA method can be effectively applied to text classification. This advantage stems from the straightforward nature of text classification, which focuses on predicting labels based on the input text, thus enabling DA to primarily focus on preserving the semantic meaning of crucial words for accurate classification.

In text generation, sampling-based methods enhance semantic diversity due to their randomness. Conversely, structured prediction tasks prioritize data validity, making paraphrasing-based methods more suitable. This is because paraphrasing maintains the original data structure, which is crucial for tasks sensitive to data format.

When comparing different DA methods, it becomes evident that simple and efficient unsupervised techniques, such as machine translation, paraphrasing using thesauruses, and random word substitution, have gained significant popularity. Additionally, learnable methods like paraphrasing-based model generation and sampling-based pre-trained models have also garnered considerable attention due to their versatility and effectiveness.
Overview
The Bahnar language, or Ba-na language, is a Mon-Khmer language belonging to the Bahnaric group and used by the Bahnar people, who live mainly in central Vietnam [52]. Bahnaric can be divided into two sub-groups: Northern Bahnaric (Xêđăng, Halăng, Jeh, ...) and Southern Bahnaric (Kơho, Mnông, Chrau Jro, ...). The Bahnar language is intermediate between these two groups. Bahnar does not have register, which is similar to Southern Bahnaric. Besides, the structure of its phonemes is simpler than in Northern Bahnaric. However, it shares more common features in the vocabulary with Northern Bahnaric. Therefore, the Bahnar language can be classified as Central Bahnaric.
In terms of word structure, the Bahnar language has several solid rules that connect firmly with lexemes. Words in the Bahnar language can be constructed using "affix", "reduplication", and "compound" [53]. The affix method is the most complicated of the three. These methods can create derivational words with typical features such as varied word meanings and changed grammatical function (e.g., nouns can become verbs).
Grammar Rules
Firstly, in the Bahnar language, a pre-syllable (sub-syllable or weak syllable, which can be considered sesquisyllabic) is the sound before the main syllable. It is only viewed from the angle of language learning because it is not considered in terms of meaning. Some typical pre-syllables are a, bơ, dơ, hơ, jơ, etc. (Example: ame (chăm nom), bơbah (cuối nguồn), hơhoi (không có), jơnap (đầy đủ), etc.) In Bahnar, there is no word with two pre-syllables.
Secondly, the sentence structure of Bahnar is similar to that of Vietnamese. An ordinary simple sentence consists of two main components, Subject and Predicate. The order of subject and predicate is the same as in Vietnamese.
The verb consists of 3 main tenses: Present (Oei - đang), Past(Тı - đã), and Future (Gô - sẽ).
Thirdly, Bahnar also has the rules for using "Quantifiers" in a sentence which can be stated as follows:
• Book/notebook: Using "sâp" and "hlak" (Sâp - quyển, cuốn; Hlak - tờ).
• Currency: Using "hlak" or "long"
Finally, there are a few rules that must be obeyed when writing in Bahnar:
• Handling pairs of letters that sound the same (s/x, w/v)
– The letters s and w are normally used like other letters in the Bahnar ethnic alphabet. Example: pơsat (mồ mả), pơwih (dọn dẹp), wao (hiểu), etc.
– The letters x and v are used for words borrowed from Vietnamese or foreign languages. Example: xi măng (xi măng), oxi (oxi), vi la (biệt thự), etc.
• For pre-syllabic (affix) words, it is necessary to write that affix word together with the main word
• When mentioning specific locations, administrative areas, or other entities with local designations, it is imperative to utilize the appropriate Bahnar language names. For instance, instead of "Đăk Đoa," the correct term in Bahnar is "Đak Đoa." Similarly, "Con Dơng" should be replaced with "Kon Dơng" to ensure accuracy and cultural sensitivity.
• When two distinct words are identified, they cannot be written together
• For reduplication and compound words
– For the reduplication words: Partial reduplication (or whole reduplication), words must all be written separately.
– For compound words:
* main and sub-pair compound words: Example: hơtaih hơt˘o (xa xôi, xa lắc), etc.
* isocoupled compound word: Example: Hơrih sa(ăn ở), etc.
• For words borrowed from foreign languages or other ethnic languages, when writing in Bahnar, it is necessary to adapt them to Bahnar. Example: đèn => kơđen, cái bàn => kơbang
Dialects In Bahnar Language
The division into dialects can be based on geographical regions, distinctive differences in certain habits and traditions, and different sounds in the language. Currently, Bahnar speakers can be divided into five main groups:
• Bahnar Roh: Lives mainly in Dak Doa and Mang Yang districts
• Bahnar TơLô: K’Bang, Đak Pơ, Kông Cro districts
• Bahnar Kon KơĐeh: Lives in a part of the southeast of K’Bang district and part of Kontum province
• Bahnar Bơnâm: Lives in part of the west of K’Bang district and An Khe town
• Bahnar Kriêm: Lives mainly in Vinh Thanh district, Binh Dinh province, and some districts in Phu Yen
Despite the presence of numerous dialects, Bahnar can be categorized into two primary groups: Group 1 (Bahnar Roh and Bahnar TơLô) and Group 2 (Bahnar Kon KơĐeh, Bahnar Bơnâm, and Bahnar Kriêm). The primary distinction between these dialects is confined to vocabulary. However, upon closer examination, this difference manifests in the following aspects:
• Bahnar group 1 usually retains the full pre-syllable while group 2 tends to abbreviate, losing the pre-syllable. Examples are shown in Table 3.1.
Table 3.1: Examples of Bahnar dialect differences

Vietnamese | Group 1 | Group 2
người | bơngai | ngai
con trâu | kơpô | pô
• Similarities and differences in phonetics (sound variation) (shown in Table 3.2)
Table 3.2: Similarity level of the two groups in the Bahnar language

Degree of similarity | Group 1 | Group 2
100% similar | đak (nước), hnam (nhà) | đak (nước), hnam (nhà)
50% similar | hơgei (giỏi), anăn (tên) | rơgei (giỏi), hnăn (tên)
100% different | năm (đi), gơng (cái cầu) | bôk (đi), kơtua (cái cầu)
In the third case, the words are entirely different phonetically (synonyms). Therefore, speech from Group 1 can be confusing for people in Group 2.
Vietnamese-Bahnar Translating Notices
Translating from one language to another requires careful consideration of numerous factors that significantly influence the accuracy and quality of the translation. These factors hold true when undertaking the translation process from Vietnamese to Bahnar as well, demanding keen attention and careful handling.
• Spelling: A spelling error refers to a deviation from the standard or accepted way of spelling a word While it is not a major issue, it could occur due to errors in the input files.
• Collocation: Concerning the question of whether a specific phrase consists of words that naturally occur together or co-occur.
• Grammatical: Grammatical error refers to an occurrence of incorrect, unconven- tional, or disputed language usage, such as a misplaced modifier or an inappro- priate verb tense.
• Typo: A typographical error, commonly known as a typo, refers to a mistake made during the typing or printing of written or electronic material.
• Word-by-word translation: Word-for-word translation is commonly understood as the process of translating text from one language to another by directly using the exact words from the original text This issue can create grammatical issues and collocation issues.
Besides, in normal conversation, there exist some points of difference between Bahnar and Vietnamese.
• Some sentences in Bahnar tend to skip words in simple sentences. For example,
"Ở Vịnh Thạnh là đông nhất" in Vietnamese can be translated to "Uei Vinh Thanh lư loi"; in this case, the word "là" in Vietnamese can be skipped during translating.
• The position of exclamations in Bahnar is also different from Vietnamese. In Vietnamese, exclamation words usually stay behind the predicate, but it is the opposite in Bahnar. For example, "Mẹ ơi" in Vietnamese will be translated to "ƠMi" in Bahnar.
Data Augmentation in NMT
Overview
In the survey of Li, Hou, and Che [29], the utilization of DA in these tasks has witnessed a notable increase in recent years. Text classification, being the pioneering task to adopt DA, has garnered a larger number of research papers compared to the other two tasks: text generation and structure prediction. In Section 2.2, it has been noted that each specific DA method has the potential to be implemented in text classification tasks. Basically, DA methods that apply to text classification can also apply to neural machine translation. However, due to the different nature of these tasks, some methods that have shown powerful improvements in text classification cannot perform well in neural machine translation. EDA is an example of the above statement: this method can create abnormalities in the sentence context, such as producing new vocabulary, changing word order, and skipping words. EDA will be presented in the next section and analyzed further in Section 6.3 to assess its effectiveness. So, DA in NMT needs suitable approaches, which are still based on the foundation of the original DA methods but require novel modifications. Section 4.1.2 will state the popular DA methods applied in NMT in recent years.
Approaches
The first method applied in this project is EDA. EDA, or Easy Data Augmentation, is a simple set of data augmentation techniques for NLP. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. In the original research of Wei and Zou, EDA was used for boosting the performance of the text classification task. Basically, for any sentence, one can randomly choose and apply one of the four following operations:
1. Synonym Replacement (SR): Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
2. Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
3. Random Swap (RS): Randomly choose two words in the sentence and swap their positions. Do this n times.
4. Random Deletion (RD): Randomly remove each word in the sentence with probability p.
Since long sentences have more words than short ones, they can absorb more noise while maintaining their original class label. To compensate for this, the number of words changed, $n$, for SR, RI, and RS is varied based on the sentence length $l$ with the formula $n = \alpha l$, where $\alpha$ is a parameter that indicates the percentage of the words in a sentence that are changed ($p = \alpha$ for RD). Furthermore, for each original sentence, $n_{aug}$ augmented sentences are generated. EDA was applied in this project as a baseline for comparison with the other proposed methods.
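A compact sketch of the four EDA operations is given below; the synonym source is left as a pluggable function, since for Vietnamese or Bahnar a WordNet-style thesaurus would have to be replaced by a suitable lexical resource:

```python
import random

def synonym_replacement(words, n, synonym_fn):
    """SR: replace n random words (ideally non-stop-words) with a synonym."""
    out = list(words)
    for i in random.sample(range(len(out)), min(n, len(out))):
        syn = synonym_fn(out[i])
        if syn:
            out[i] = syn
    return out

def random_insertion(words, n, synonym_fn):
    """RI: insert a synonym of a random word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        syn = synonym_fn(random.choice(out))
        if syn:
            out.insert(random.randrange(len(out) + 1), syn)
    return out

def random_swap(words, n):
    """RS: swap two random positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p):
    """RD: drop each word independently with probability p (keep at least one)."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def eda(sentence, alpha=0.1, n_aug=4, synonym_fn=lambda w: None):
    """Generate n_aug augmented sentences; n = alpha * l words are perturbed."""
    words = sentence.split()
    n = max(1, int(alpha * len(words)))
    ops = [lambda ws: synonym_replacement(ws, n, synonym_fn),
           lambda ws: random_insertion(ws, n, synonym_fn),
           lambda ws: random_swap(ws, n),
           lambda ws: random_deletion(ws, alpha)]
    return [" ".join(random.choice(ops)(words)) for _ in range(n_aug)]
```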
Semantic embeddings provide an alternative approach for text classification, utilizing pre-trained word embeddings to replace words with their closest neighbors in the embedding space. Wang and Yang exploited a large tweet dataset to identify #petpeeve tweets describing annoying behaviors. They incorporated lexical, syntactic, and semantic features, resulting in improved classification accuracy. Additionally, they introduced a data augmentation approach using lexical and semantic embeddings, generating additional instances by leveraging neighboring words in continuous representations. Employing Word2Vec to train various embedding models, they expanded the training data by fivefold. Their data augmentation strategy included three types of lexical embeddings, yielding a significant 6.1% relative F1 improvement when utilizing Google News lexical embeddings.
Data augmentation (DA) approaches using Neural Machine Translation (NMT) have gained prominence, with back-translation being a popular method. Sennrich et al. [13] demonstrated significant BLEU score improvements (4.3–11.2) by utilizing automatic back-translation of monolingual data, dropout, and target-bidirectional models. However, the focus in this section is on DA methods that do not require additional resources beyond the parallel training corpus.
Li, Liu, Huang, et al. [54] conducted an evaluation of back-translation and forward translation in a similar context. They trained NMT systems in both forward and backward directions using the existing parallel data, and then utilized these models to generate synthetic samples by translating either the target side (following the approach of [13]) or the source side (following the approach of Zhang and Zong [55]) of the original training corpus. They approach data augmentation from two perspectives, namely input sensitivity and prediction margin, which are defined in a manner that is independent of a specific test set. This allows them to make findings that have relatively low variance. They point out that DA can have a positive impact on the performance of a model with improved sensitivity or prediction margin, especially for low-frequency words.
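In the NMT setting itself, the target-side variant can be sketched as follows, assuming a reverse (target-to-source) model with a hypothetical `translate()` method has already been trained on the same parallel data; the synthetic pairs are simply appended to the real corpus (forward translation is the mirror image, translating source sentences with the forward model):

```python
def back_translate_corpus(reverse_model, target_sentences):
    """Create synthetic (source, target) pairs from target-side sentences.

    reverse_model: a trained target->source NMT model (hypothetical interface).
    """
    synthetic_sources = [reverse_model.translate(t) for t in target_sentences]
    return list(zip(synthetic_sources, target_sentences))

# parallel_data: list of (vietnamese, bahnar) pairs from the real corpus
# synthetic = back_translate_corpus(bahnar_to_vi_model, [ba for _, ba in parallel_data])
# augmented_corpus = parallel_data + synthetic
```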
Two related approaches are reward-augmented maximum likelihood (RAML) [56] and its extension to the source language called SwitchOut [14]. These methods aim to expand the support of the empirical data distribution while maintaining its smoothness, ensuring that similar sentence pairs have similar probabilities. They achieve this by replacing words with other words sampled from a uniform distribution over the vocabulary. This approach tends to overrepresent infrequent words in practice.

For RAML, by establishing a connection between log-likelihood and expected reward objectives, they demonstrate that the optimal regularized expected reward can be attained when the conditional distribution of outputs, given inputs, is proportional to their exponentiated scaled rewards. Based on this, they propose a framework to enhance the predictive probability of outputs by incorporating their corresponding rewards. They optimize the conditional log-probability of augmented outputs, which are sampled proportionally to their exponentiated scaled rewards. They demonstrate consistent improvement in both speech recognition (using the TIMIT dataset) and machine translation (using the WMT'14 dataset) by sampling output sequences based on their edit distance to the ground truth outputs. Given a dataset of input-output pairs $D = \{(x^{(i)}, y^{*(i)})\}_{i=1}^{N}$, structured output models learn a parametric score function $p_\theta(y|x)$, which scores different output hypotheses $y \in \mathcal{Y}$. They define a distribution in the output space, termed the exponentiated payoff distribution:

$$q(y|y^*;\tau) = \frac{1}{Z(y^*,\tau)} \exp\{r(y,y^*)/\tau\}, \qquad (4.1)$$

where $Z(y^*,\tau) = \sum_{y\in\mathcal{Y}} \exp\{r(y,y^*)/\tau\}$ and $r(y,y^*)$ is the reward function. RAML generalizes maximum likelihood (ML) by allowing a non-zero temperature parameter in the exponentiated payoff distribution while still optimizing the KL divergence in the ML direction,

$$\mathcal{L}_{RAML}(\theta;\tau) = \sum_{(x,y^*)\in D} \Big\{ -\sum_{y\in\mathcal{Y}} q(y|y^*;\tau)\, \log p_\theta(y\,|\,x) \Big\}. \qquad (4.2)$$
For SwitchOut, they approach the design of a data augmentation policy as an optimization problem and develop a generic analytical solution. This solution encompasses existing augmentation methods and also offers a straightforward data augmentation strategy for Neural Machine Translation (NMT). The strategy involves randomly replacing words in both the source and target sentences with other random words from their respective vocabularies. Experimental results on three translation datasets of varying sizes, including the large-scale WMT 15 English-German dataset and two medium-scale datasets, IWSLT 2016 German-English and IWSLT 2015 English-Vietnamese, demonstrate that this method, called SwitchOut, consistently improves performance by approximately 0.5 BLEU. Given $X, Y$, etc. as random variables and $x, y$, etc. as their actual values, their augmented versions can be written as $\hat{X}, \hat{Y}, \hat{x}, \hat{y}$, etc. The boldfaced characters $p$, $q$ represent probability distributions. The researchers enhance their discussion by utilizing a probabilistic framework that provides support and justification for data augmentation algorithms. With $X, Y$ being the sequences of words in the source and target languages, the objective maximized by the canonical MLE framework can be defined as:

$$J_{MLE}(\theta) = \mathbb{E}_{x,y \sim \hat{p}(x,y)}\big[\log p_\theta(y\,|\,x)\big]. \qquad (4.3)$$

In their work, they focus on a specific family of $q$, which depends on the empirical observations by

$$q(\hat{X},\hat{Y}) = \mathbb{E}_{x,y \sim \hat{p}(x,y)}\big[q(\hat{X},\hat{Y}\,|\,x,y)\big]. \qquad (4.4)$$

In order to enhance the validity of training data, a selection process is used. This process evaluates the deviation between an augmented pair $(\hat{x}, \hat{y})$ and the observed data. Significant deviations are indicative of potential invalidity, rendering the pair detrimental to the training process, and such pairs are subsequently excluded from the selection.
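A simplified SwitchOut-style corruption can be sketched as below; the exact distribution over the number of replaced positions is derived analytically in the paper, and the exponential decay used here is only a rough stand-in for it:

```python
import math
import random

def switchout(tokens, vocab, tau=1.0):
    """Replace a temperature-controlled number of positions with uniform-random words."""
    length = len(tokens)
    # P(n) decays with the number of replacements n (simplified sampler).
    weights = [math.exp(-n / tau) for n in range(length + 1)]
    n = random.choices(range(length + 1), weights=weights, k=1)[0]
    out = list(tokens)
    for i in random.sample(range(length), n):
        out[i] = random.choice(vocab)
    return out

# Applied independently to both sides of each training pair (RAML corresponds
# to corrupting only the target side):
# src_aug = switchout(src_tokens, src_vocab, tau=0.9)
# tgt_aug = switchout(tgt_tokens, tgt_vocab, tau=0.9)
```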
Additionally, Guo, Kim, and Rush [57] proposed a related approach that promotes compositional behavior, where replaced words are selected from another sentence rather than the vocabulary. Their method, SeqMix, generates additional synthetic examples by smoothly blending input/output sequences from the training set. SeqMix consistently achieves an improvement of approximately 1.0 BLEU on five different translation datasets (IWSLT ’14 German-English, IWSLT ’14 English-German, Italian, Spanish, and WMT ’14 English-German) when compared to strong Transformer baselines.
Auxiliary tasks have been employed primarily for data augmentation (DA), mostly on the source side. Zhang et al. [58] used token replacement and detection as auxiliary tasks to prevent overfitting and enhance generalization. Their Token Drop method substitutes tokens with special placeholder tokens, so that the overall semantic information is retained while the model must learn from the surrounding context. Token Drop therefore serves both as data augmentation, allowing diverse sentences, and as regularization, introducing natural noise into input sentences. They consider three drop variants:
• Zero-Out: During training, the method eliminates complete words by assigning a zero value to their word embeddings. However, this approach is limited because the zero vector cannot capture contextual information in the self-attention layer.
• Drop-Tag: This method replaces tokens with a dedicated drop tag. The tag is then treated as a regular word in the vocabulary and has an associated word embedding that undergoes training.
• Unk-Tag: This approach replaces tokens with the generic unknown word token "<unk>". The researchers found this method particularly suitable for NMT systems, especially on self-attention layers. Unlike the Drop-Tag method, it requires no additional token or parameters (a minimal sketch follows this list).
They first evaluated Token Drop through the three drop methods on NIST Chinese-English and WMT16 English-Romanian. In general, they achieved improvements of 2.37, 1.15, and 1.73 BLEU across the three tasks, respectively. Similarly, Xie, Wang, Li, et al. [37] evaluated the impact of replacements on the target data but did not follow an MTL approach. By establishing a relationship between input perturbation in neural network language models and smoothing techniques in n-gram models, they developed effective strategies for introducing noise. Drawing inspiration from smoothing methods, they applied these schemes to language modeling and machine translation tasks, demonstrating performance improvements. They also mentioned two noising methods:
• Unigram noising: For each $x_i$ in $x$