Data Augmentation for Low-Resource Vietnamese-Bahnaric Translation Using Transformer Architecture

TABLE OF CONTENTS

Problem Description

Moreover, Vietnam has other ethnic minority languages (Thai, Jrai, Mnông, etc.) in the same situation as Bahnar, which makes NMT for these low-resource languages increasingly important in the future. DA remains a common and widely used approach in NMT: it samples a synthetic data distribution $P_f(X')$ from the real data distribution $P_r(X)$ using simple operations (Figure 1.1), where $X^1_f$ and $X^2_f$ denote augmented data generated from real data through common operations such as replacement and swapping.
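To make the sampling idea concrete, here is a minimal, hypothetical Python sketch of the two operations named above (replacement and swapping); the function names, probabilities, and example vocabulary are illustrative assumptions, not the thesis's implementation:

    import random

    def random_swap(tokens, n_ops=1):
        """Swap two random token positions n_ops times (a common DA operation)."""
        tokens = list(tokens)
        for _ in range(n_ops):
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return tokens

    def random_replace(tokens, vocab, p=0.1):
        """Replace each token with a random vocabulary word with probability p."""
        return [random.choice(vocab) if random.random() < p else t for t in tokens]

    # Example: two augmented variants X_f^1, X_f^2 drawn from one real sentence X_r.
    x_r = "con mèo ngồi trên thảm".split()
    x_f1 = random_swap(x_r)
    x_f2 = random_replace(x_r, vocab=["nhà", "cây", "sông"])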

Objectives And Missions

In this project, suitable DA methods will be investigated and applied to support low-resource Vietnamese-Bahnar translation.

• Studying DA techniques in NLP, related works, and DA methods applicable to low-resource NMT, and assessing their benefits and drawbacks.

Scope Of Work

• Proposing DA techniques that improve translation quality for low-resource NMT, especially on the specific dataset of interest.

Contributions

• Applying the researched techniques flexibly to translating Vietnamese to Bahnar, in both general and specialized contexts.
• Showing the effectiveness of the proposed DA approaches: the multi-task learning approach and sentence boundary augmentation.

Thesis Structure

NMT normally uses maximum likelihood estimation (MLE) as the training objective, a standard method for estimating the parameters of a probability distribution. In practice, adaptive learning rate optimizers such as Adam [20] are found to significantly reduce training time compared to the basic SGD optimizer.
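As a reference point, the MLE objective for a parallel corpus $D$ of sentence pairs $(x, y)$ is typically written as the sum of token-level log-probabilities; this is the standard formulation rather than a quotation from the thesis:

    \[
    \theta^{*} = \arg\max_{\theta} \sum_{(x,\,y)\in D} \sum_{t=1}^{|y|} \log p_{\theta}\!\left(y_{t} \mid y_{<t},\, x\right)
    \]

Adam then maximizes this objective (equivalently, minimizes the negative log-likelihood) using per-parameter adaptive learning rates.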

Data Augmentation

This advantage stems from the straightforward nature of text classification, which focuses on predicting labels from the input text, thus enabling DA to primarily focus on preserving the semantic meaning of the words crucial for accurate classification. When comparing different DA methods, it becomes evident that simple and efficient unsupervised techniques, such as machine translation, thesaurus-based paraphrasing, and random word substitution, have gained significant popularity.

    Dialects In Bahnar Language

Therefore, speakers in Group 1 may be difficult for speakers in Group 2 to understand.

    Vietnamese-Bahnar Translating Notices

    Overview

Text classification, being the pioneering task to adopt DA, has garnered a larger number of research papers than the other two tasks: text generation and structured prediction. However, due to the different nature of these tasks, some methods that have shown powerful improvements in text classification do not perform well in neural machine translation. Such methods can create abnormalities in the context of a sentence, such as producing new vocabulary, changing word order, and skipping words.

Therefore, DA in NMT requires suitable approaches that build on the foundations of the original DA methods but introduce novel modifications.

    Approaches

They explored three types of lexical embeddings, namely GoogleNews Lexical Embeddings trained on 100 billion words, Twitter Lexical Embeddings trained on 51 million words, and Urban Dictionary Lexical Embeddings trained on 53 million words from slang definitions and examples, as part of their data augmentation approach. They trained NMT systems in both forward and backward directions using the existing parallel data, and then utilized these models to generate synthetic samples by translating either the target side (following the approach of [13]) or the source side (following the approach of Zhang and Zong [55]) of the original training corpus. Experimental results on three translation datasets of varying sizes, including the large-scale WMT 15 English-German dataset and the two medium-scale datasets IWSLT 2016 German-English and IWSLT 2015 English-Vietnamese, demonstrate that this method, called SwitchOut, consistently improves performance by approximately 0.5 BLEU.
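As an illustration of the SwitchOut idea (randomly replacing a few tokens on both the source and target side with words sampled from the respective vocabularies), a simplified Python sketch follows; the published method samples the number of replacements from a tempered distribution, which is abbreviated here to a fixed per-token probability:

    import random

    def switchout_like(src, tgt, src_vocab, tgt_vocab, tau=0.1):
        """Simplified SwitchOut-style augmentation: with probability tau,
        replace each token with a word sampled uniformly from the same
        side's vocabulary. (The published method instead samples how many
        tokens to corrupt from a tempered distribution.)"""
        aug_src = [random.choice(src_vocab) if random.random() < tau else w for w in src]
        aug_tgt = [random.choice(tgt_vocab) if random.random() < tau else w for w in tgt]
        return aug_src, aug_tgt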

    By removing specific semantic information within a sentence, these augmented samples encourage the model to consider a wider range of features that may contribute to accurately predicting the sentiment of the original input rather than relying solely on a limited set of prominent features.

    Discussion

For both pipelines, BARTpho will always be the pre-trained model used for training, and the BLEU score (described in section 5.3) will be the main evaluation metric throughout. In the baseline pipeline, the training set (either the whole training set or the subset of focused sentence forms) is passed directly through the training process, and its results are evaluated and marked as the baseline results. To investigate the effect of DA on translation performance, the training set is instead augmented in a DA module that contains several DA methods, including MTL DA (described in section 5.4) and sentence boundary augmentation (described in section 5.5).

Each approach produces its own version of the augmented dataset, and each version is passed through the same training process as the baseline.
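The two pipelines can be summarized in a short, hypothetical sketch; the training and scoring functions are passed in as placeholders for the actual BARTpho fine-tuning and BLEU evaluation code, not the thesis's implementation:

    from typing import Callable, Dict, List, Tuple

    Pair = Tuple[str, str]  # (Vietnamese sentence, Bahnar sentence)

    def run_experiments(
        train_set: List[Pair],
        test_set: List[Pair],
        train_fn: Callable[[List[Pair]], object],       # e.g. fine-tunes BARTpho
        bleu_fn: Callable[[object, List[Pair]], float], # e.g. corpus BLEU
        augmenters: Dict[str, Callable[[List[Pair]], List[Pair]]],
    ) -> Dict[str, float]:
        """Train a baseline once, then once per DA variant, under identical settings."""
        results = {"baseline": bleu_fn(train_fn(train_set), test_set)}
        for name, augment in augmenters.items():
            augmented = train_set + augment(train_set)  # original + synthetic pairs
            results[name] = bleu_fn(train_fn(augmented), test_set)
        return results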

    BARTpho

    Architecture

    Pre-training data

    Optimization

    BLEU Score

Sometimes the decimal values are converted to a 0 to 100 scale so they can be read more easily; for example, 0.7 can be written as 70. Even comparing BLEU scores within the same corpus but with varying numbers of reference translations can lead to highly misleading results.

0.20 - 0.29: The gist is clear, but there are substantial grammatical errors.
0.30 - 0.40: Understandable to good translations.

• First, compute the geometric average of the modified n-gram precisions, $p_n$, using n-grams up to length $N$ and positive weights $w_n$ summing to one.
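For completeness, this step combines with a brevity penalty $BP$ into the textbook BLEU formula of Papineni et al., where $c$ is the candidate length and $r$ the reference length:

    \[
    \mathrm{BLEU} = BP \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
    \qquad
    BP =
    \begin{cases}
    1 & \text{if } c > r \\
    e^{\,1 - r/c} & \text{if } c \le r
    \end{cases}
    \]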

    Multi-task Learning Data Augmentation

This framework requires no preprocessing steps, no additional trained systems, and no data beyond the available parallel training corpora, which makes it well suited to the low-resource target language, Bahnar. By reversing the sentence order, the system is expected to learn to rely more on the encoder, which carries additional information, when generating words that typically occur towards the end of the sentence. Unlike the original MTL DA, this approach does not use the "mono" operation, which reorders the tokens of the target language to match the token order of the source language.

This operation is unnecessary when translating Vietnamese to Bahnar because Vietnamese and Bahnar share the same sentence structure (see section 3.2).
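To make the auxiliary tasks concrete, here is a minimal Python sketch of four of them as this thesis describes their intent (swap, token, reverse, source); the exact sampling scheme and the reserved token symbol are assumptions, not the reference implementation:

    import random

    RESERVED = "<tok>"  # placeholder symbol; the actual reserved token is an assumption

    def aux_swap(tgt, alpha=0.1):
        """'swap' (assumed behavior): exchange a proportion alpha of random word pairs."""
        tgt = list(tgt)
        if len(tgt) < 2:
            return tgt
        for _ in range(max(1, int(alpha * len(tgt)))):
            i, j = random.sample(range(len(tgt)), 2)
            tgt[i], tgt[j] = tgt[j], tgt[i]
        return tgt

    def aux_token(tgt, alpha=0.1):
        """'token' (assumed behavior): replace a proportion alpha of words with a reserved token."""
        return [RESERVED if random.random() < alpha else w for w in tgt]

    def aux_reverse(tgt):
        """'reverse': emit the target sentence in reversed word order."""
        return list(reversed(tgt))

    def aux_source(src, tgt):
        """'source': use a copy of the source sentence as the target."""
        return list(src)

In this sketch, each auxiliary task leaves the source sentence unchanged and pairs it with the transformed target, so the augmented corpus is the union of the original pairs and one transformed copy per task.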

    Sentence Boundary Augmentation

By design, the truncation keeps most of the first sentence (lines 7-8 for $S_1$, $T_1$: at most 30% of the start of the first sentence is discarded) while discarding the bulk of the second sentence (lines 7-8 for $S_2$, $T_2$: at most 30% of the start of the second sentence is retained); a brief sketch follows below.

The researcher has conducted experiments on translation from Vietnamese to Bahnar to evaluate the effect of each MTL DA auxiliary task, as well as the combination of the best-performing tasks with the sentence boundary approach. The researcher also evaluated two strong DA methods that aim to extend the support of the empirical data distribution: semantic embedding (using Word2Vec), which replaces some words with random samples from a vocabulary built with Continuous Bag of Words (CBOW) (applied on the target side), and EDA, which stacks four simple operations (random replacement, random insertion, random swap, and random deletion) (applied on the target side).
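The following is a minimal sketch of the truncation step under the stated 30% bounds; the thesis's actual algorithm (its lines 7-8) may sample the offsets differently:

    import random

    def boundary_augment(s1, s2, p=0.3):
        """Build one augmented sample spanning a sentence boundary:
        drop at most a fraction p of the start of the first sentence (s1)
        and keep at most a fraction p of the start of the second (s2)."""
        drop = random.randint(0, int(p * len(s1)))  # words removed from s1's start
        keep = random.randint(0, int(p * len(s2)))  # words kept from s2's start
        return s1[drop:] + s2[:keep]

In practice the corresponding truncation would be applied to both the source pair ($S_1$, $S_2$) and the target pair ($T_1$, $T_2$) so that the augmented sample remains parallel.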

Each sub-dataset consists of two text files: the Vietnamese sentences are stored in a .vi file, and the Bahnar sentences are stored in a .ba file.
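Reading such a parallel pair of files line by line might look like the following sketch; the file stem is an example in the stated format, not a path from the thesis:

    from pathlib import Path

    def load_parallel(stem: str):
        """Load aligned sentence pairs from <stem>.vi and <stem>.ba,
        where line i of each file holds the same sentence in each language."""
        vi_lines = Path(f"{stem}.vi").read_text(encoding="utf-8").splitlines()
        ba_lines = Path(f"{stem}.ba").read_text(encoding="utf-8").splitlines()
        assert len(vi_lines) == len(ba_lines), "files must be line-aligned"
        return list(zip(vi_lines, ba_lines))

    # e.g. pairs = load_parallel("train")  # reads train.vi and train.ba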

    Experimental Settings

The dataset was divided into three sub-datasets: a training set, a test set, and a validation set, used for training, testing, and validation, respectively. Colab provides every user with a free Tesla K80 GPU (T4, V100, and A100 for the premium Colab version), which helps users accelerate their deep-learning applications. The one-to-one word alignments required by the replace operation of MTL DA were obtained using SimAlign; this alignment mechanism leverages multilingual word embeddings, both static and contextualized, for word alignment [78].
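As an illustration of how SimAlign can produce the alignments needed by replace, the sketch below follows the library's public interface as I understand it (SentenceAligner and get_word_aligns); the model choice, matching method, and the placeholder Bahnar tokens are assumptions:

    from simalign import SentenceAligner

    # Multilingual BERT embeddings; "i" selects the IterMax matching heuristic.
    aligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="i")

    src_tokens = ["tôi", "yêu", "ngôn", "ngữ"]  # Vietnamese example tokens
    tgt_tokens = ["ba1", "ba2", "ba3", "ba4"]   # placeholder Bahnar tokens

    alignments = aligner.get_word_aligns(src_tokens, tgt_tokens)
    print(alignments["itermax"])  # list of (source_index, target_index) pairs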

The proportion of words affected by the swap, token, and replace transformations of MTL DA is controlled by a hyperparameter α, whereas sentence boundary augmentation is controlled by a separate hyperparameter p.

    Results And Discussion

    Overall

While all five methods show improved results, source performs worst, which indicates that the translation task can be adversely affected by introducing a completely different vocabulary on the target side. Interestingly, using any two of the three best auxiliary tasks together further improves performance, achieving strong results in all translation tasks with BLEU scores between 8.53 (replace+swap) and 11.09 (token+swap) points over the baseline. This issue arises for several reasons: random insertion and random replacement rely on an external dictionary resource that may introduce new words into the vocabulary; random swap and random deletion can accidentally corrupt words; and combining these four operations can compound the negative effect.

Semantic embedding performs well but still cannot match the MTL DA combinations and sentence boundary augmentation. This suggests that word replacement is a viable and promising solution, but it requires sensible strategies for how the method is applied, such as which language side to augment, the hyperparameter settings, whether to use a language model, and the strategy for choosing which words to replace.

    Oriented Sentences

Besides, further improvements could be achieved by implementing more sophisticated approaches to multi-task learning, such as changing the proportion of data for the different tasks and evaluating different ways of sharing parameters between the tasks (e.g., sharing the encoder but not the decoder).

References

W. Y. Wang and D. Yang, "That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal: Association for Computational Linguistics, Sep. 2015.

J. Wei and K. Zou, "EDA: Easy data augmentation techniques for boosting performance on text classification tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China: Association for Computational Linguistics, Nov. 2019.

M. Fadaee, A. Bisazza, and C. Monz, "Data augmentation for low-resource neural machine translation," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada: Association for Computational Linguistics, Jul. 2017.

D. Dong, H. Wu, W. He, D. Yu, and H. Wang, "Multi-task learning for multiple language translation," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China: Association for Computational Linguistics, Jul. 2015.

E. Voita, R. Sennrich, and I. Titov, "Analyzing the source and target contributions to predictions in neural machine translation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online: Association for Computational Linguistics, Aug. 2021.

Example of Bahnar dialect differences

    Similarity level of two groups in Bahnar language

    Interpretation of BLEU scores [72]

    Example of Multi-task Learning Data Augmentation

    Example of Sentence Boundary Augmentation

    Original Dataset Information

    Training Hyperparameters

    Total sentence pairs of the baseline and augmented training sets

BLEU scores obtained with the baseline and MTL DA approach, using different auxiliary tasks and combinations of them

BLEU scores obtained in evaluation and prediction, using different values of p in the sentence boundary augmentation approach

BLEU scores obtained with the baseline, the MTL DA approach combination, sentence boundary, EDA, and semantic embedding

Translation issues of chosen sentences in the test set

Predicted BLEU scores of collocation and word-by-word sentences with the baseline and other DA methods