Research on solutions to improve the quality of Vietnamese machine translation (English summary)

THE UNIVERSITY OF DA NANG
UNIVERSITY OF SCIENCE AND TECHNOLOGY

NGUYEN VAN BINH

RESEARCH ON SOLUTIONS TO IMPROVE THE VIETNAMESE MACHINE TRANSLATION QUALITY

Major: Computer science
Code: 9480101

SUMMARY OF TECHNICAL DOCTORAL THESIS

Đà Nẵng - 2021

The doctoral dissertation has been finished at UNIVERSITY OF SCIENCE AND TECHNOLOGY

Advisors:
Associate Professor Dr Huynh Cong Phap
Professor Vincent Berment

Reviewer 1: ………………………………………………
Reviewer 2: ………………………………………………
Reviewer 3: ………………………………………………

The dissertation is defended before The Assessment Committee at University of Science and Technology - The University of Danang
Time … h … Date: ……/……./2021

The dissertation is available at:
- National Library of Vietnam
- Center for Learning Information Resources and Communication, The University of Da Nang

PREAMBLE

Introduction
The increasing need to exchange information among countries, cultures and people in modern society makes translation important and necessary. Human translation is manual work that delivers high quality but is slow, has low productivity and high cost, and its results are not reusable. Machine translation (MT), that is, automatic translation by computer, can bring efficiency at lower cost if it gives good results, and can quickly translate large numbers of documents in different fields of expertise. MT systems will thereby become a tool that helps people access a huge store of knowledge written in different languages.

When using an MT system, users are concerned with the quality of the translation. However, at present, the quality of automatic translation between less popular language pairs is quite low, including translation from Vietnamese to English and other languages, so the translation result serves mainly as a reference for understanding the main idea of a document. In some cases, the translation makes readers misunderstand part or all of the main content of the text. Although MT systems have been widely used, many improvements are needed to provide translation results of better quality.
Therefore, scientific evaluations are needed to provide specific data on the quality of Vietnamese translation systems, as a basis for analyzing and proposing solutions to improve the quality of Vietnamese automatic translation. To contribute to solving the above issues, the PhD candidate selected the topic "Research on solutions to improve the Vietnamese machine translation quality" as the research content of the doctoral thesis in engineering.

Research objectives
General objective: propose specific solutions to improve the quality of Vietnamese translation systems, specifically for the Vietnamese - English language pair.
Specific objectives include:
- Evaluate the current state of active Vietnamese machine translation systems;
- Propose solutions to improve the quality of translation systems for the English - Vietnamese language pair;
- Build an English - Vietnamese machine translation system in the specific field of legal documents.

Research subject and scope
The research subject of the thesis includes:
- Methods to evaluate the quality of MT systems
- Corpora and machine translation methods
- Active Vietnamese machine translation systems
The research scope of the thesis:
- Focus on researching and evaluating currently popular machine translation systems, and propose solutions to improve machine translation quality for the Vietnamese - English language pair
- Build an experimental MT application from English to Vietnamese and vice versa in the narrow field of legal documents
- Deploy the application on a website platform so that users can access it conveniently

Research methodology
- Theoretical and experimental methods

The thesis's layout
The thesis is organized into three chapters:
Chapter 1: Overview of automatic translation and the quality of Vietnamese automatic translation at present. This chapter presents an overview of the issues researched in the thesis, including translation methods, corpora, quality evaluation methods and an
overview of general research on improving machine translation quality.
Chapter 2: Solutions to improve the quality of Vietnamese MT. This chapter evaluates the quality of some popular English - Vietnamese automatic translation systems, proposes solutions to improve the quality of Vietnamese corpora and to apply the neural network translation model to the English - Vietnamese language pair, and proposes a solution to implement a context-based semantic translation system.
Chapter 3: Experiment and result evaluation. This chapter implements the experimental steps to build a big corpus and a neural machine translation model for the English - Vietnamese language pair.

Main contributions of the thesis
The thesis makes the following main contributions related to solutions to improve the quality of Vietnamese MT:
(1) Implement a campaign to evaluate the quality of active Vietnamese MT systems in a full and detailed manner. Propose solutions to evaluate translation system quality through post-editing.
(2) Propose solutions to improve the quality of Vietnamese translation by improving the corpus. The specific solutions are to extend and consolidate the corpus, to build a big corpus, to identify proper nouns, and to identify the boundaries of compound words.
(3) Propose solutions to improve the quality of Vietnamese translation with an artificial intelligence translation method, applying the neural network machine learning model. This was considered a new solution, and the best available, at the time of the research (2017) for improving the quality of Vietnamese automatic translation.
(4) Propose a new solution to build a contextual, semantic-oriented automatic translation system by improving the neural network translation model and combining it with a big semantic-enriched corpus.
(5) Contribute in terms of experiment and actual product: build an automatic English-Vietnamese translation system, VIKI Translator, which shows good results when the quality of Vietnamese translation is tested in a narrow field (legal
documents).

CHAPTER 1. OVERVIEW OF MT AND QUALITY OF VIETNAMESE MT AT PRESENT

1.1 Introduction
According to the Cambridge dictionary, machine translation (often abbreviated in English as MT) is the process of converting text from one language to another by computer. In MT research, the input text to be translated is called the source text and the text that has been translated by the computer is called the target text.
An MT system is a computer program that receives text in the source language and then uses its algorithms to predict the translation in the target language. MT algorithms operate by synthesizing and processing knowledge about natural language, such as dictionaries, pairs of sample translated sentences, grammatical rules, word statistics and language models.

1.2 A general study on MT, corpus, automatic translation quality improvement and evaluation method

1.2.1 Automatic translation methods

1.2.1.1 Example-based machine translation
The example-based machine translation method (EBMT) was first proposed in 1984, with the following main idea: translating a simple sentence does not need to rely on deep linguistic analysis. Instead, the input sentence is separated into discrete phrases, these phrases are translated into the other language, and finally the translated phrases are combined in the correct order to generate a complete sentence. The discrete phrases are translated on the principle of translation by similarity, using sample examples for reference.
The three important components of an example-based translation method are: separating phrases by matching them against data from actual examples, identifying the corresponding translated texts, and combining phrases to generate the target text.

1.2.1.2 Statistical Machine Translation
Statistical Machine Translation (SMT) in recent years has been a potential development direction because of its
outstanding advantages compared to other methods. Instead of building dictionaries and conversion rules manually, this kind of translation system automatically builds dictionaries and rules from statistics obtained from corpora. Therefore, statistical machine translation is highly portable and applicable to any language pair.
a. Word-based statistical machine translation
b. Phrase-based statistical machine translation
c. Syntax-based statistical machine translation
Among open source applications in the field of statistical machine translation, the most prominent is Moses (http://www.statmt.org/moses/), a complete open source phrase-based SMT system.

1.2.2 Corpus in machine translation
A corpus is understood as a collection of monolingual, bilingual or multilingual texts. In the definition of the Cambridge Dictionary, a corpus can be a collection of resources in the form of text or speech. A bilingual corpus is a collection of data consisting of corresponding translated text pairs.

1.2.2.1 Current corpora
Many international corpora have been researched and published, covering a relatively large number of languages and a large data volume, such as EuroParl (11 languages, 34-55 million words), JRC-Acquis (22 languages, 11-22 million words), XinHua News (2 languages, 12-14 million words), EuroMatrix (9 languages, sourced from the proceedings of the European Parliament from 1996-2006), Canadian Hansard (bilingual English - French, 2.8 million sentence pairs), WaCky (more than a billion words collected from the Internet)… In addition, there are some big bilingual corpora such as:

Corpus name      Number of languages   Data size
Wikipedia        21                    25.90M
OpenSubtitles    62                    3.35G
TED2013          15                    3.81M
EUbookshop       48                    173.20M

1.2.2.2 Basic structure of bilingual corpus
A bilingual corpus contains texts in two different languages, so in addition to the content, it also holds processed information such as alignment, word labeling, etc.
- Primary data: information about text, information about structure
and content
- Linguistic annotation
- Information about alignment

1.2.3 Evaluation of machine translation system quality
Evaluating the quality of an automatic translation system means determining the completeness of a computer-generated translation, or comparing the translation quality of different automatic translation systems.

1.2.3.1 Subjective evaluation method
Subjective evaluation is performed directly by humans, based on a rating scale for pre-built criteria. The subjective evaluation method gives reliable results, but it is time-consuming and expensive and depends on the ability of the evaluator.
a. Evaluation on a scale of fluency and adequacy
The two most common criteria in human subjective evaluation are fluency and adequacy. Each is rated on the following scale, from best to worst:

Adequacy          Fluency
all meaning       flawless English
most meaning      good English
much meaning      non-native English
little meaning    disfluent English
none              incomprehensible

b. Ranking-based evaluation
c. Translation proofreading-based evaluation

1.2.3.2 Objective evaluation method (automatic evaluation)
Objective evaluation uses programs instead of humans. The programs match the output of the translation system against an available reference translation, or measure its error rate relative to that reference.
a. Word Error Rate (WER)
b. Multi-Reference WER (MWER)
c. Position-independent Error Rate (PER)
d. Translation Error Rate (TER)
e. BLEU
f. NIST

1.3 Research on building and improving the quality of Vietnamese machine translation

Research on building a translation system and evaluating translation quality
- Research on building an English - Vietnamese translation system using the MOSES source code on a statistical translation platform. The authors use the training and testing datasets of IWSLT 2015 and evaluate the result using the BLEU indicator.
- Research on building a corpus consisting of 880,000 pairs of English-Vietnamese bilingual sentences and more than 11 million Vietnamese sentences, then using statistical translation model
and MOSES source code to build an English-Vietnamese translation system. The result of the translation system is evaluated and compared with the translation results of Google and Microsoft.
- Research on building a translation system using a neural network and the evaluation dataset of IWSLT 2015 for some less popular languages, including the English - Vietnamese language pair.
- Research on an approach to building a translation system for the Czech - Vietnamese language pair, using English as an intermediate language.

Research on building and improving Vietnamese corpus
To solve Vietnamese language processing problems, including MT, many research groups have built corpora dedicated to Vietnamese and offered solutions to improve the quality of these corpora.
- Vietlex's Vietnamese corpus contains about 80,000,000;
- Project KC01.01/06-10, project branch "Vietnamese text processing", conducts research on and builds a Vietnamese corpus and an English - Vietnamese bilingual corpus;
- The Computational Linguistics Center, VNU Ho Chi Minh University of Science, built Vietnamese corpora (named VTB and VCor). VTB has 201,594 sentences and 5,501,225 words; the VCor corpus has 17,095,994 sentences (42 fields).

CHAPTER 2. SOLUTIONS TO IMPROVE VIETNAMESE MACHINE TRANSLATION QUALITY

2.1 Introduction
The translation model is the result of the training process of the algorithms. It represents the statistical data, principles and rules that have been optimized during this process. Given a trained translation model, we input source sentences and the model predicts the target sentences as output. Therefore, the translation model plays a decisive role in the quality of the translation system.
From the above, it can be seen that building a good translation model and a qualified translation system requires two key factors, the data source and the translation method:
- The data source must be of good quality and large quantity
- The translation method must be effective, suitable for the language, and must minimize semantic ambiguity

2.2 Evaluation of the quality of Vietnamese MT systems
This evaluation was conducted in 2017, using the translation results of two systems, namely Google Translate and Microsoft Translator.

2.2.1 Evaluation organization

2.2.1.1 Objective evaluation
The English sentences of each dataset are translated into Vietnamese through the API functions of the Google and Microsoft systems, using a tool built by the author. The results obtained are shown in the following table.

                               Google                  Microsoft
Evaluation data   Language   BLEU   NIST   WER      BLEU   NIST   WER
tst2013           en-vi      32     7.54   0.51     27     6.82   0.58
1000-cau          en-vi      06     2.88   0.75     04     2.53   0.82
tpp-tomtat        en-vi      42     8.29   0.46     40     7.90   0.51
tpp-chuong28      en-vi      44     7.29   0.47     33     6.11   0.58

2.2.1.2 Subjective evaluation
The result shows that in the conversation dataset, there are only 516 sentences (for Google) and 308 sentences (for Microsoft), accounting for 52% and 30% respectively. Some sentences also make readers misunderstand the meaning.

Comments and evaluation
The quality of Vietnamese translation systems is not good for a number of reasons: the translation method is not suitable and the corpus is incomplete.

2.2.3 Proposal of solutions to evaluate the quality based on the translation post-editing process

2.2.3.1 Some limitations of translation quality evaluation methods
Evaluation of the quality of automatic translation systems by the above methods and indicators has been widely studied and applied, but in some cases there are still limitations.

2.2.3.2 Proposal of quality evaluation indicators
Time indicator: Tpe = T/N
Operation indicator: Ope = (D + I) / N

2.2.3.3 Solution of combining machine translation post-editing and quality evaluation
It is proposed to combine the proofreading of machine translations with quality evaluation, which helps reduce costs and improve accuracy.

2.2.3.4 Experiment
The experimental result shows that Tpe and Ope behave similarly to Edit Distance and Word Error Rate.

2.3 Solution to improve the quality of Vietnamese translation based on big corpuses

Overview
The
corpuses exist discretely and differ widely in structure and format, which at present makes them difficult to use and exploit. Many corpora have been built but cannot be applied and shared for research on, and processing of, Vietnamese.

2.3.2 Related research on improving the quality of corpora

2.3.2.1 Overview of the current state of research on extending corpora in terms of volume
- Linguistics-directed corpus extension
- Corpus extension directed at building and supplementing data

2.3.2.2 Overview of the current state of research on extending corpora in terms of quality

2.3.3 Solutions to improve the quality of corpora

2.3.3.1 Extension of the corpus volume
The research proposes that a corpus consist of two parts:
- The header contains information about the linguistic material.
- The body contains information on the document types. Each document contains a description of its hierarchical structure (chapters, pages, sections) and a segment description.
a) Consolidation of corpora
The proposed algorithm to consolidate two corpora R1 and R2, which contain datasets of languages L1 and L2:
- Consolidate the data
- Consolidate the format and structure of the corpora
Research and build a tool that converts existing corpora into a corpus with the proposed standard structure and format.
b) Extend the languages of the corpus
c) Add data to the corpus

2.3.3.2 Improvement of the corpus's quality
a) Improvement through post-editing
It is proposed to research and build a support system for post-processing that can load big corpora and display the data visually so that it can be checked and improved. In addition, the system needs to act as a collaborative environment, allowing multiple users to take part in improving the data.
b) Build a semantic-enriched corpus
Step 1: Define context-based layer types
Step 2: Build properties for the defined layers
Step 3: Identify the entities belonging to the defined layers
Step 4: Build information for the entities
c) Identify and classify
proper noun entities
The thesis proposes a solution that combines the Maximum Matching algorithm with an analysis of the relationships between textual elements. It includes two steps: word separation and proper noun recognition.
d) Solution to define Vietnamese word boundaries
A solution is proposed to calculate a score for monosyllabic words standing next to each other, in order to predict whether these words form a compound word:

$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$

In which, score(w_i, w_j) is the score of two words standing next to each other; count(w_i w_j) is the number of occurrences of the phrase w_i w_j; count(w_i) is the number of occurrences of the word w_i; and δ is a coefficient that excludes low-frequency phrases.

Evaluation of the role of the corpora
Research and deploy the experimental building of translation systems with corpora of different sizes, showing that the larger the amount of data, the better the translation quality.

2.4 Solutions to improve Vietnamese translation quality based on neural network machine learning model

Overview
There have recently been many research works on solutions to improve the quality of SMT translation models, but the evaluation results show that their quality is still low.

2.4.2 Solutions to improve the quality of Vietnamese translation based on neural network machine learning model
An NMT model is usually a large trained neural network that stores vectors representing the associations among words in context, so it is capable of translating long sentences well.
The RNN model includes a hidden state h and generates the output y when receiving the input sequence $x = (x_1, x_2, \ldots, x_T)$. At each time step t, the hidden state of the RNN model is updated according to the formula $h_t = f(h_{t-1}, x_t)$, where f is a nonlinear activation function.
From the input training data, the RNN can learn the probability distribution of the sequences and predict the following word in a given sequence.

2.4.2.1 Steps to build NMT translation system
a. Represent input data
b. Build an encoder
c.
Build a decoder

Result of building translation system
The research applies the statistical translation method and the neural network translation method to train translation models, using the Moses and OpenNMT source code. The results are as follows:

           BLEU   NIST
OpenNMT    25.4   5.61
Moses      23.8   5.10

The above data show that, with the same input dataset, the neural network translation model gives better results than the statistical translation model on both the BLEU and NIST evaluation scores.

Proposal of solutions to build a semantic translation system
The thesis proposes solutions that combine the neural network-based MT system with an ontology corpus, in order to enrich the translation semantics and present the most complete information about the MT result.
To connect the functions of the translation system, the following steps are necessary:
- Build the translation system with a machine learning model using a neural network: follow the proposal in Section 2.4.2
- Find and separate concepts from the translated text: follow the solution proposed in Section 2.3.4
- Link the concepts of the enriched corpus
- Build a semantic-enriched corpus
- Build an intuitive interface to express semantics

Conclusion of Chapter 2
Experimentally, the proposals on improving the corpora and improving the translation method have contributed to increasing the quality of the automatic translation model compared with the statistical translation model and some other systems.

CHAPTER 3. EXPERIMENT AND RESULT EVALUATION

3.1 Introduction
With the solutions for improving the corpora and the translation model proposed in Chapter 2, the thesis combines them to experimentally build a specialized translation system in the field of legal documents and to evaluate the result. The system is tested with users so that their reviews are recorded in addition to the other quality evaluation indicators. The process is as follows:

3.2 Corpus building

Process of implementation steps
The process of building a corpus:

Building of a big bilingual corpus
(1) Find suitable resources: websites providing legal
documents, learning materials, English learning materials, scientific materials provided on the Internet, dictionary websites, websites that provide bilingual sentence samples, English - Vietnamese bilingual movie websites, news websites providing translations in different languages, and Vietnamese-localized documents of open source software and web applications, including translations of functions, user manuals, terms of use, etc.
(2) Perform the data pre-processing steps. As a result of the corpus building process, 1,479,000 pairs of English-Vietnamese bilingual sentences are obtained, including 460,000 pairs of bilingual sentences in the field of legal normative documents.

Field             Number of sentences   English sentence length   Vietnamese sentence length
Legal documents   460,000               25.8                      31.2
Conversation      180,000               7.2                       8.4
Other areas       839,000               18.5                      24.1

(3) Normalize the data and make it more accurate by identifying Vietnamese word boundaries and identifying proper nouns.

Building of tool for linguistic and semantic extension
Build a collaborative working environment that can call automatic translation systems to extend the languages of the corpus, collect data in parallel from multilingual websites, and enable the data to be improved through a post-editing function.

Building of ontology corpus
Step 1: Define layers based on the context or domain of the corpus to build the ontology: identify the domain; list and define the concepts; define the layers and the layer hierarchy.
There are a total of 179 layers, including 14 main layers and 165 sublayers. The figure below is an illustration of some layers and their hierarchical structure.
Step 2: Build properties for the defined layers.
Step 3: Identify specific words in the corpus that are expressions of the defined layers. Layer word identification is based on context.
Step 4: Construct values for the properties of the identified entities.

3.3 Experimental result of building an English-Vietnamese translation application in the field of legal documents (VIKI
Translator)

Process of implementation steps
Build an English-Vietnamese translation system in the field of administrative and legal documents, using a neural network model combined with the big corpus that has been collected. The process of developing the translation model includes the following steps:

Organization of translation model training and model parameter adjustment
Neural network building: the research uses the open source OpenNMT toolkit, with a designed neural network and the components of the translation system, to train the translation model.
Translation model training:
- Number of hidden layers of the neural network and number of nodes per layer: enc_layers = 2, dec_layers = 2, rnn_size = 500
- Vocabulary size: src_vocab_size = 50,000, tgt_vocab_size = 50,000
At the final training epoch (end_epoch = 21), the parameter representing the quality of the model (perplexity) reaches 4.80 for translation from English to Vietnamese and 4.66 for translation from Vietnamese to English.

Building of modules of the translation system
Build the components of the translation system and connect them to the automatic translation engine. The VIKI Translator translation system works on a web platform, connects directly to the server and installs the translation module in the following manner:

3.4 Result evaluation

Experimental result
The datasets described in Chapter 2 are used to evaluate the quality of the system. As a result, the BLEU score is 29.1.
The same dataset is used for an experimental comparison with similar English-Vietnamese translation systems: the Co Viet text translation system reaches a BLEU score of 27.1 and the Evtran system reaches 11.3.

                   BLEU   NIST   WER
VIKI Translator    29.1   5.78   0.63
Co Viet system     27.1   5.62   0.68
Evtran system      11.3   3.32   0.93

These comparisons show that, by using a large and good-quality corpus, the neural network model-based translation system built in this research gives good results. Besides, because the corpus focuses on the collected legal documents, the translation system can
translate most of the terms related to this field, while some other systems still mistranslate important phrases.

User's evaluation
The VIKI Translator translation system has been deployed since November 2017, providing users with online translation from English to Vietnamese and from Vietnamese to English over the Internet at https://vikitranslator.com
The interface of the VIKI Translator system is shown in the following figure.
Summary of some results obtained through the experimental deployment of the system:
- Total app visits and use counts on all platforms: over 1,500,000 users
- Monthly website visits: nearly 70,000 users
- App downloads on Windows: more than 30,000 times
- Total number of introductory articles and user manuals on other websites: more than 30 articles
- Total number of backlinks from other websites: 582,561 backlinks
Figure: Graph of monthly users
Figure: Statistics of total users

3.5 Conclusion of Chapter 3
Chapter 3 presents the experimental steps to build an English-Vietnamese MT system by combining the innovative solutions for corpora and translation methods proposed in the previous chapters. The VIKI Translator system that has been built shows better results than the compared Vietnamese translation systems on the BLEU, NIST and WER scores. The English-Vietnamese translation system has been in practical deployment since November 2017, has more than 1.5 million users and receives positive reviews from users. This shows that the innovative solutions proposed by the research have contributed to building a translation system of good quality, suitable for deployment and for continued research and development on the Vietnamese MT problem.

CONCLUSION AND DEVELOPMENT ORIENTATION

Conclusion
The thesis has researched the important factors affecting the quality of the results of Vietnamese MT systems, including the corpora and the translation methods, thereby proposing specific solutions to improve the quality
of Vietnamese translation systems. The specific research contents are as follows:
- Research on methods of evaluating the quality of machine translation; implement a general and detailed evaluation of the quality of active Vietnamese translation systems; and provide data as a basis for analysis and comparison among translation systems, among different fields within the same system, and with the quality of translation systems for other languages. On that basis, give an overview of the quality of existing Vietnamese translation systems. The research also proposes a new method and new measurements to calculate the quality of translation results while users edit the target texts. This method ensures accuracy and at the same time saves the resources needed to organize the evaluation.
- Research on the corpora for Vietnamese MT and propose solutions to improve their quality. These innovative solutions cover both qualitative and quantitative improvement, including solutions to extend and consolidate the corpora; a solution for building a big corpus; a solution to identify proper nouns by combining the Maximum Matching algorithm with an analysis of the relationships between text elements; and a solution to identify the boundaries of Vietnamese compound words from the distribution model of words and phrases in the text. On that basis, the research builds software modules to simulate the proposed solutions, and the tests show good results. Using the above solutions to consolidate and extend the corpora, the research also collects a large and good-quality corpus of 1,479,000 pairs of bilingual English - Vietnamese sentences to serve Vietnamese automatic translation systems.
- Research on automatic translation methods and propose solutions to apply the neural network machine learning model to the Vietnamese MT problem in order to improve the quality of the translation system. The research also organizes the installation and training of
statistical and neural network-based translation models and compares the results of these translation models, thereby showing the suitability of the neural network-based translation model for the Vietnamese MT problem. It also researches and proposes a model of a semantic-oriented automatic translation system, with which translation systems can provide the full contextual semantics of the text to be translated and help readers fully understand its content.
- Build and deploy an English - Vietnamese MT system called VIKI Translator that is provided to users through the Internet. This translation system is a product that applies the solutions proposed in the research, and was built after re-evaluating the effectiveness of the solutions to improve the quality of Vietnamese translation. The system has more than 1.5 million users and has received positive reviews.

Development orientation
To perfect the solutions for Vietnamese MT systems and help these systems achieve better quality, the PhD candidate will continue to focus on researching the following main contents:
- Continue to research and innovate the neural network-based translation method to achieve higher efficiency
- Build a richer corpus by different methods, while describing the semantics of the data and combining semantic analysis with the translation method
- Extend the building of the corpus to different fields and implement evaluation, analysis and comparison
- Evaluate the contextual elements of the entire text as input parameters for the translation system, thereby improving the quality of the translation results
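The WER indicator listed in Section 1.2.3.2 and reported in Sections 2.2.1.1 and 3.4 can be illustrated with a minimal sketch. It assumes the standard definition of WER: the word-level edit distance (insertions, deletions and substitutions) between a system translation and a single reference, divided by the number of words in the reference. The function name `word_error_rate` and whitespace tokenization are choices made for this illustration only; the thesis does not describe the evaluation tool it used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption for this sketch: tokens are separated by whitespace and
    there is exactly one reference translation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words read so
    # far and the first j hypothesis words (classic dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentence pair, not taken from the thesis datasets.
    print(word_error_rate("the law takes effect today", "the law take effect"))
```

A lower value means the system output is closer to the reference, which is why the better systems in the tables above have the smaller WER.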

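The word-boundary score of Section 2.3.3.2 d) can also be sketched in a few lines. The sketch assumes the score of two adjacent syllables is (count(w_i w_j) − δ) / (count(w_i) × count(w_j)), as the symbol definitions in that section suggest. The function name `compound_scores`, the value of δ and the sample sentences are hypothetical and serve only as an illustration; they are not taken from the thesis.

```python
from collections import Counter


def compound_scores(sentences, delta=1.0):
    """Score every pair of adjacent syllables in the given sentences.

    Assumed formula: (count(wi wj) - delta) / (count(wi) * count(wj)).
    A higher score suggests that the pair behaves like a compound word.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # Vietnamese syllables are space-separated
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    return {
        pair: (count - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


if __name__ == "__main__":
    # Hypothetical toy corpus; a real corpus would be far larger.
    corpus = ["học sinh đi học", "học sinh giỏi", "sinh viên đi làm"]
    scores = compound_scores(corpus, delta=1.0)
    for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(" ".join(pair), round(score, 3))
```

With δ = 1, a pair that occurs only once scores zero, which matches the stated role of δ as a coefficient that excludes low-frequency phrases. In practice a threshold on the score would decide which pairs are merged, and that threshold would have to be tuned on the corpus.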