Enhancing the Quality of Machine Translation Systems Using Cross-Lingual Word Embedding Models

Nguyen Minh Thuan
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Associate Professor Nguyen Phuong Thai

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science

November 2018

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

Hanoi, November 15th, 2018
Signed

ABSTRACT

In recent years, Machine Translation has shown promising results and received much interest from researchers. Two approaches that have been widely used for machine translation are Phrase-Based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT). During translation, both approaches rely heavily on large amounts of bilingual corpora, which require much effort and financial support. The lack of bilingual data leads to a poor phrase-table, which is one of the main components of PBSMT, and to the unknown word problem in NMT. In contrast, monolingual data are available for most languages. Thanks to this advantage, many word embedding and cross-lingual word embedding models have appeared to improve the quality of various tasks in natural language processing. The purpose of this thesis is to propose two models that use cross-lingual word embeddings to address the above impediment. The first model enhances the quality of the phrase-table in SMT, and the second model tackles the unknown word problem in NMT.

Publications:

Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen and Chi-Mai Luong. Enhancing the Quality of Phrase-table in Statistical Machine Translation for Less-Common and Low-Resource Languages. In the 2018 International Conference on Asian Language Processing (IALP 2018).

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my lecturers at the university, and especially to my supervisors, Assoc. Prof. Nguyen Phuong Thai, Dr. Nguyen Van Vinh and MSc. Vu Huy Hien. They are my inspiration, guiding me past many obstacles in the completion of this thesis. I am grateful to my family, who constantly encourage and motivate me and create the best conditions for me to accomplish this thesis. I would also like to thank my brother, Nguyen Minh Thong, and my friends, Tran Minh Luyen and Hoang Cong Tuan Anh, for giving me much useful advice and for supporting my thesis, my studies and my life. Finally, I sincerely acknowledge Vietnam National University, Hanoi and especially the TC.02-2016-03 project, "Building a machine translation system to support translation of documents between Vietnamese and Japanese to help managers and businesses in Hanoi approach the Japanese market", for financially supporting my master's study.
To my family ♥

Table of Contents

1 Introduction
2 Literature Review
  2.1 Machine Translation
    2.1.1 History
    2.1.2 Approaches
    2.1.3 Evaluation
    2.1.4 Open-Source Machine Translation
      2.1.4.1 Moses - an Open Statistical Machine Translation System
      2.1.4.2 OpenNMT - an Open Neural Machine Translation System
  2.2 Word Embedding
    2.2.1 Monolingual Word Embedding Models
    2.2.2 Cross-Lingual Word Embedding Models
3 Using Cross-Lingual Word Embedding Models for Machine Translation Systems
  3.1 Enhancing the Quality of the Phrase-table in SMT Using Cross-Lingual Word Embedding
    3.1.1 Recomputing Phrase-table Weights
    3.1.2 Generating New Phrase Pairs
  3.2 Addressing the Unknown Word Problem in NMT Using Cross-Lingual Word Embedding Models
4 Experiments and Results
  4.1 Settings
  4.2 Results
    4.2.1 Word Translation Task
    4.2.2 Impact of Enriching the Phrase-table on the SMT System
    4.2.3 Impact of Removing the Unknown Words on the NMT System
5 Conclusion

List of Figures

2.1 The CBOW model predicts the current word based on the context, and the Skip-gram model predicts surrounding words based on the current word
2.2 Toy illustration of the cross-lingual embedding model
3.1 Flow of the training phase
3.2 Flow of the testing phase
3.3 Example in the testing phase

List of Tables

3.1 A sample of new phrase pairs generated by using projections of word vector representations
4.1 Monolingual corpora
4.2 Bilingual corpora
4.3 Bilingual dictionaries
4.4 The precision of word translation retrieval with top-k nearest neighbors in Vietnamese-English and Japanese-Vietnamese language pairs
4.5 Results on the UET and TED datasets in the PBSMT system for Vietnamese-English and Japanese-Vietnamese respectively
4.6 Translation examples of the PBSMT system in Vietnamese-English
4.7 Results of removing unknown words on the UET and TED datasets in the NMT system for Vietnamese-English and Japanese-Vietnamese respectively
4.8 Translation examples of the NMT system in Vietnamese-English

4.1 Settings

… words for k = 1, 5, 10, 20, 50. The sizes of the testing sets are shown in Table 4.3.

Table 4.3: Bilingual dictionaries

                        Automatic dictionary   Manual dictionary   Testing
Vietnamese-English      7000                   7000                238
Vietnamese-Japanese     5000                   5000                300

Generating new phrase pairs

In order to generate new phrase pairs (Section 3.1.2), we used the monolingual vector models for the Vietnamese, English, and Japanese languages mentioned above. To select a proper phrase pair from the many possible phrase pairs, we use the Viterbi algorithm (Ryan and Nudd, 1993) to calculate the best path with the highest probability. In our experiments, we consider Vietnamese as the source language for extracting phrases, and English and Japanese as the target languages. We also used the British National Corpus and the Japanese text in the Leipzig Corpora to filter phrases in the target language for English and Japanese respectively. For learning the linear mapping from the source to the target language space, we used the method of (Xing et al., 2015). As a result, we obtained 100,625 new Vietnamese-English phrase pairs and 84,004 new Vietnamese-Japanese phrase pairs.
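The mapping step above is only referenced here, not spelled out: (Xing et al., 2015) length-normalize the embeddings and learn an orthogonal source-to-target transform from a seed dictionary. The sketch below is a minimal illustration of that idea, not the code used in the thesis; `src_vecs`, `tgt_vecs` (word-to-vector dictionaries) and `seed_pairs` (the bilingual seed dictionary) are assumed inputs, and the transform is obtained by solving the orthogonal Procrustes problem with an SVD (Schönemann, 1966).

```python
import numpy as np

def normalize(v):
    """Length-normalize a vector so that cosine similarity reduces to a dot product."""
    return v / (np.linalg.norm(v) + 1e-8)

def learn_mapping(src_vecs, tgt_vecs, seed_pairs):
    """Learn an orthogonal matrix W such that normalize(src) @ W lies close to
    normalize(tgt) for every (src, tgt) pair in the seed dictionary."""
    X = np.stack([normalize(src_vecs[s]) for s, t in seed_pairs])
    Y = np.stack([normalize(tgt_vecs[t]) for s, t in seed_pairs])
    U, _, Vt = np.linalg.svd(X.T @ Y)   # orthogonal Procrustes solution
    return U @ Vt

def translate(word, W, src_vecs, tgt_vecs, k=5):
    """Return the top-k target words closest (by cosine) to the mapped source word."""
    query = normalize(src_vecs[word]) @ W
    tgt_words = list(tgt_vecs)
    T = np.stack([normalize(tgt_vecs[w]) for w in tgt_words])
    scores = T @ query
    best = np.argsort(-scores)[:k]
    return [(tgt_words[i], float(scores[i])) for i in best]
```

The `translate` helper retrieves top-k nearest neighbors in the target space, which is also the retrieval used in the word translation task of Section 4.2.1.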
Statistical Machine Translation System

In all our experiments using the statistical approach, we trained our phrase-based statistical machine translation models with the Moses system (Koehn et al., 2007). For the Vietnamese-English PBSMT system, we consider Vietnamese as the source language and English as the target language. For the Japanese-Vietnamese PBSMT system, we consider Japanese as the source language and Vietnamese as the target language. The parallel data are shown in Table 4.2. The translation system settings are as follows: the maximum sentence length and the maximum phrase length are 80 and …, respectively, and we followed the default settings of Moses (Koehn et al., 2007). We used KenLM (Heafield, 2011) to construct two …-gram language models based on the British National Corpus and the Vietnamese text in the Leipzig Corpora for English and Vietnamese respectively. We also used minimum error rate training (MERT) (Och, 2003) to tune our model weights.

To clarify the impact of our proposed model on the PBSMT system, we conducted the following experiments (each experiment was conducted for both the Vietnamese-English and Japanese-Vietnamese language pairs):

• baseline: the phrase-based SMT baseline using only the Moses system.
• r: we recomputed the weights of the original phrase-table and used the new weights to replace the original weights in the phrase-table.
• base + r: we recomputed the weights of the original phrase-table and then combined the new weights with the original weights.
• base + r + n: we added the new phrase pairs generated by our proposed method (Section 3.1.2) to the phrase-table obtained in the experiment base + r.

Neural Machine Translation

In all our experiments using the neural approach, we trained our neural machine translation models with the OpenNMT system (Klein et al., 2017). For the Vietnamese-English NMT system, we consider Vietnamese as the source language and English as the target language. For the Japanese-Vietnamese NMT system, we consider Japanese as the source language and Vietnamese as the target language. The parallel data are shown in Table 4.2. The translation system settings are as follows: the word embedding dimension is 512 for all source and target words, and the number of hidden units is 512 for both the encoder and the decoder. We used a 2-layer bidirectional RNN for the encoder and a 2-layer RNN for the decoder. Luong attention (Luong et al., 2015a) is used as the attention mechanism. The mini-batch size is 128, and the other hyperparameters follow the OpenNMT default settings.

To clarify the impact of our proposed model on handling the unknown word problem in the NMT system, we conducted the following experiments (each experiment was conducted for both the Vietnamese-English and Japanese-Vietnamese language pairs; a sketch of the replace-and-restore idea follows this list):

• baseline: the NMT baseline using only the OpenNMT system.
• base + unk Xing: this experiment resolved the unknown word problem of the baseline NMT system with our proposed model (Section 3.2), using the cross-lingual embedding model of (Xing et al., 2015) trained on the manual dictionaries.
• base + unk Conneau: this experiment addressed the unknown word problem of the baseline NMT system with our proposed model (Section 3.2), using the cross-lingual embedding model of (Conneau et al., 2017).
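Section 3.2 itself is not part of this excerpt, so the following is only a rough sketch of the replace-and-restore idea as it is described here: out-of-vocabulary source words are replaced by similar in-vocabulary words before decoding, and attention weights are later used to put a direct translation of the original word back into the output. All names (`src_vecs`, `xlingual_translate`, the `attention` matrix assumed to be returned by the decoder) are hypothetical, not the thesis's API.

```python
import numpy as np

def nearest_in_vocab(word, vocab, src_vecs):
    """Most similar in-vocabulary source word by cosine similarity."""
    q = src_vecs[word] / np.linalg.norm(src_vecs[word])
    best, best_sim = None, -1.0
    for w in vocab:
        if w not in src_vecs:
            continue
        v = src_vecs[w] / np.linalg.norm(src_vecs[w])
        sim = float(q @ v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

def replace_unknowns(tokens, vocab, src_vecs):
    """Swap OOV source words for similar in-vocabulary words; remember what was replaced."""
    replaced, memo = [], {}
    for i, tok in enumerate(tokens):
        if tok in vocab or tok not in src_vecs:
            replaced.append(tok)          # in-vocabulary, or no embedding to fall back on
        else:
            memo[i] = tok                 # original unknown word at source position i
            replaced.append(nearest_in_vocab(tok, vocab, src_vecs))
    return replaced, memo

def restore(output_tokens, attention, memo, xlingual_translate):
    """Map each output token back to its source position via the attention argmax and
    re-insert a direct translation of the original unknown word."""
    restored = list(output_tokens)
    for i, tok in enumerate(output_tokens):
        j = int(np.argmax(attention[i]))  # source position this output token attends to
        if j in memo:
            restored[i] = xlingual_translate(memo[j])
    return restored
```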
4.2 Results

In this section, we first present the results on word translation to choose the best approach for the Vietnamese-English and Japanese-Vietnamese language pairs. We then present the results of the PBSMT system in terms of the BLEU score to evaluate the effect of our proposed model for enhancing the quality of the phrase-table. Finally, we report the results of the NMT system, which incorporates our unknown-word replacement model, and show some translation examples.

4.2.1 Word Translation Task

Table 4.4 presents the precision of the word translation task using various models on different datasets for the Vietnamese-English and Japanese-Vietnamese language pairs. In this table, the three columns Mikolov, Xing, and Conneau show the results of the three cross-lingual word embedding models proposed by (Mikolov et al., 2013b), (Xing et al., 2015), and (Conneau et al., 2017) respectively. The sub-columns auto dict show the results of the cross-lingual models trained on dictionaries extracted automatically from the bilingual corpus, while the sub-columns manual dict show the results of the cross-lingual models trained on the manual dictionaries. The sub-column labeled null presents the results of the method that learns cross-lingual embeddings without bilingual data.

In Table 4.4, we observe that using the manual dictionaries gives better results than using the automatic dictionaries. The reason is that the dictionaries extracted from the small parallel data are inaccurate because of the lack of data. Looking at the Conneau column, one can see that the method without bilingual data obtains promising results, better than the methods trained on the automatic dictionaries and only slightly lower than the methods trained on the manual dictionaries. This means that by using only monolingual data, we can learn a relatively accurate cross-lingual word embedding model. This is very useful because obtaining a good dictionary is costly and time-consuming, especially for less-common and low-resource languages. In short, the results of the word translation task show that the method of (Xing et al., 2015) trained on a small manual bilingual dictionary is the best approach for learning cross-lingual word embeddings for the Vietnamese-English and Vietnamese-Japanese language pairs.

Table 4.4: The precision of word translation retrieval with top-k nearest neighbors in Vietnamese-English and Japanese-Vietnamese language pairs

                Mikolov                    Xing                       Conneau
                auto dict   manual dict    auto dict   manual dict    null
Precision of top-k nearest neighbors in Vietnamese-English
  Top-1         0.09        0.39           0.14        0.39           0.27
  Top-5         0.19        0.50           0.30        0.53           0.42
  Top-10        0.24        0.53           0.35        0.56           0.46
  Top-20        0.29        0.57           0.39        0.60           0.50
  Top-50        0.35        0.63           0.45        0.69           0.59
Precision of top-k nearest neighbors in Japanese-Vietnamese
  Top-1         0.09        0.07           0.15        0.16           0.15
  Top-5         0.19        0.22           0.30        0.32           0.32
  Top-10        0.23        0.27           0.36        0.38           0.36
  Top-20        0.26        0.34           0.41        0.47           0.45
  Top-50        0.33        0.50           0.54        0.58           0.56
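The thesis does not spell out the exact scoring protocol for Table 4.4, so the helper below is just one common way to measure top-k precision: a test word counts as correct if any of its reference translations appears among the k nearest neighbors. `translate_topk` is a hypothetical wrapper that could be built around the `translate` sketch given earlier.

```python
def precision_at_k(test_dict, k, translate_topk):
    """test_dict: source word -> set of acceptable target translations.
    translate_topk(word, k): returns a list of the top-k candidate target words."""
    hits = 0
    for src_word, gold in test_dict.items():
        candidates = set(translate_topk(src_word, k))
        if gold & candidates:          # at least one acceptable translation retrieved
            hits += 1
    return hits / len(test_dict)
```

For instance, precision_at_k(test_dict, 5, translate_topk) would correspond to a Top-5 cell of Table 4.4, computed over the 238 Vietnamese-English or 300 Japanese-Vietnamese test entries of Table 4.3.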
4.2.2 Impact of Enriching the Phrase-table on the SMT System

The results of the experiments on the PBSMT system are shown in Table 4.5 in terms of the BLEU score (Papineni et al., 2002). The experiment r shows that weights recomputed from word vector representation similarity in the phrase-table attain 83% and 80% of the BLEU score of the Moses system for Vietnamese-English and Japanese-Vietnamese respectively. This means that by using mostly monolingual data and a small amount of bilingual data, we can create a relatively accurate system compared to the original Moses, which only uses bilingual data. In the experiment base + r, our translation results are higher than the baseline for both language pairs, indicating that combining the original Moses phrase-table with the phrase-table of the experiment r improves the accuracy of the phrase-table weights. In the remaining experiment, we use both the recomputed phrase-table weights and the new phrase pairs to enhance the quality of the phrase-table. This approach obtains better results than the other experiments and the baseline. Notably, the experiment base + r + n achieves the highest BLEU score, 0.23 and 1.16 points higher than the baseline for Vietnamese-English and Japanese-Vietnamese respectively. The reason is that integrating the new phrase pairs created by our method improves the quality of the phrase-table in the PBSMT system.

Table 4.5: Results on the UET and TED datasets in the PBSMT system for Vietnamese-English and Japanese-Vietnamese respectively

                                   baseline   r       base+r   base+r+n
Vietnamese-English (UET data)      28.02      23.25   28.21    28.25 (+0.23)
Japanese-Vietnamese (TED data)     12.35      9.88    12.89    13.51 (+1.16)

Table 4.6 shows some translation examples of our PBSMT system, which uses both the recomputed phrase-table weights and the new phrase pairs, for the Vietnamese-English language pair. (We use the character '_' to denote a multi-word phrase; for example, hồi_phục is written as hồi phục in normal text, and '_' distinguishes the two words hồi and phục from the phrase hồi phục.)

Table 4.6: Translation examples of the PBSMT system in Vietnamese-English

Example 1
  source      cô hồi_phục tháng
  reference   she will recover in a month
  baseline    she will be recovered in a month
  base+r      she will be recovered in a month
  base+r+n    she will recover in a month

Example 2
  source      bác_sĩ phan_văn_nghiệm trưởng_phòng cấp_cứu người tận_tâm với bệnh_nhân
  reference   dr phan văn nghiệm , the chief of the emergency department , is very dedicated to patients
  baseline    the doctor phan_văn_nghiệm emergency bureau ’s a very dedicated to the patient
  base+r      the doctor phan_văn_nghiệm emergency bureau is a very dedicated to the patient
  base+r+n    emergency the chief of the department ’s a very dedicated to the patient

Example 3
  source      anh quay lại thi_đấu cho đội_tuyển quốc_gia
  reference   he ’ll be back to the competition for the national team
  baseline    he ’ll be back at the national team
  base+r      he ’ll be back at the national team
  base+r+n    he ’ll be back to the competition for the national team

Example 4
  source      cổ_động_viên arsenal buồn đội nhà liên_tiếp thất_bại
  reference   arsenal fans was very sad when the home team was in a row of failure
  baseline    fans arsenal was upset when the home team consecutive defeat
  base+r      fans arsenal was upset when the home team consecutive defeat
  base+r+n    arsenal fans was very sad when the home team in a row of failure

In Example 1, it can be seen that the result of base+r+n is similar to the reference sentence while the remaining results are incorrect. The explanation is that in our approach, the new phrase pair (sẽ hồi_phục; will recover in), which does not appear in the original phrase-table, was created in the step of generating new phrase pairs (Section 3.1.2). Likewise, in Examples 3 and 4, the translations of base+r+n are nearly identical to the reference sentences thanks to the creation of the new phrase pairs (quay lại thi_đấu; back to the competition) and (cổ_động_viên arsenal; arsenal fans). However, in Example 2, all results are incorrect and the result of base+r+n is the worst. In our analysis, bác_sĩ phan_văn_nghiệm trưởng_phòng was translated to "the chief of the department". The reason for this incorrect translation is that the new pair (bác_sĩ phan_văn_nghiệm trưởng_phòng; the chief of the department) was added directly to the phrase-table; our method cannot produce good enough phrase pairs in this case.
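Sections 3.1.1 and 3.1.2 (the actual weighting and generation procedures behind experiments r and base+r+n) are not included in this excerpt, so the exact formula is not shown here. The sketch below is only one plausible way to score a candidate phrase pair from word-level cross-lingual similarity, assuming each mapped source word is matched to its best target word and the similarities are averaged; it should not be read as the thesis's formula.

```python
import numpy as np

def phrase_pair_score(src_phrase, tgt_phrase, W, src_vecs, tgt_vecs):
    """src_phrase / tgt_phrase: lists of tokens; W: orthogonal source-to-target mapping."""
    sims = []
    for s in src_phrase:
        if s not in src_vecs:
            continue
        x = src_vecs[s] / np.linalg.norm(src_vecs[s])
        mapped = x @ W                      # project the source word into the target space
        best = 0.0
        for t in tgt_phrase:
            if t not in tgt_vecs:
                continue
            y = tgt_vecs[t] / np.linalg.norm(tgt_vecs[t])
            best = max(best, float(mapped @ y))
        sims.append(best)                   # best cosine match inside the target phrase
    return sum(sims) / len(sims) if sims else 0.0
```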
4.2.3 Impact of Removing the Unknown Words on the NMT System

Table 4.7 shows the results of our experiments on the NMT system in terms of the BLEU score for Vietnamese-English and Japanese-Vietnamese. As described in the Neural Machine Translation settings, the column base+unk Xing indicates the results of our unknown-word model based on the method of (Xing et al., 2015), while the column base+unk Conneau presents the results of our model based on the method of (Conneau et al., 2017). Overall, the results show that our approach to the unknown word problem in the NMT system obtains better results than the baseline, indicating that incorporating an external module that replaces unknown words with semantically close in-vocabulary words enhances the quality of the NMT system. In particular, the experiment base+unk Xing achieves the highest BLEU score, 0.56 and 1.66 points higher than the baseline for Vietnamese-English and Japanese-Vietnamese respectively. This means that by using a better cross-lingual word embedding model, we can create a better unknown-word replacement module; in this case, the cross-lingual model of (Xing et al., 2015) is better than the model of (Conneau et al., 2017), as shown in the Word Translation Task.

Table 4.7: Results of removing unknown words on the UET and TED datasets in the NMT system for Vietnamese-English and Japanese-Vietnamese respectively

                                   baseline   base+unk Xing   base+unk Conneau
Vietnamese-English (UET data)      25.87      26.43 (+0.56)   26.12 (+0.25)
Japanese-Vietnamese (TED data)     9.4        11.06 (+1.66)   10.65 (+1.25)
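All scores in Tables 4.5 and 4.7 are corpus-level BLEU (Papineni et al., 2002). The thesis does not say which scorer produced them; purely as an illustration, corpus BLEU for a set of hypotheses and references can be computed with the sacrebleu package:

```python
import sacrebleu

# One hypothesis per test sentence, and one (or more) reference streams of the same length.
hypotheses = ["she will recover in a month"]
references = [["she will recover in a month"]]   # a single reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))   # BLEU on the 0-100 scale used in Tables 4.5 and 4.7
```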
For more detail on the impact of our model on the unknown word problem in the NMT system, we show some translation examples of our NMT system for Vietnamese-English in the testing phase in Table 4.8. In each example, source is the sentence in the source language (here, Vietnamese), reference is the standard translation of the source sentence in the target language (here, English), source_replaced is the source sentence with unknown words replaced by the most similar in-vocabulary words, baseline is the translation of the baseline OpenNMT system, NMT output is the translation of the OpenNMT system that incorporates our unknown-word replacement model based on the cross-lingual model of (Xing et al., 2015), and post-processed is the final output obtained by applying the restoration module to the NMT output.

Table 4.8: Translation examples of the NMT system in Vietnamese-English

Example 1
  source           anh bật kêu lên tiếng ú_ớ ngã xuống sàn
  reference        he gave a strangled cry and fell to the floor
  source_replaced  anh bật kêu lên tiếng lẩm_bẩm ngã xuống sàn
  baseline         he turned on a <unk> and landed on the floor
  NMT output       he turned on to a mutter and fell into the floor
  post-processed   he turned on to a yelp and fell into the floor

Example 2
  source           anh có ghi tên vào danh_sách trúng_tuyển không ?
  reference        have you been short-listed for the post ?
  source_replaced  anh có ghi tên vào danh_sách thi_tuyển không ?
  baseline         have you got the name on the list ?
  NMT output       are you marked on the entrance list ?
  post-processed   are you marked on the selected list ?

Example 3
  source           nếu_như anh làm_việc gian_lao anh bị kiệt_sức
  reference        you will crack up if you go on working so hard
  source_replaced  nếu_như anh làm_việc gian_truân anh bị kiệt_sức
  baseline         if you work you’ll get exhausted
  NMT output       if you work hard you’ll be exhausted
  post-processed   if you work hard you’ll be exhausted

Example 4
  source           nhà_ăn dọn suất ăn ít_ỏi
  reference        this restaurant serves very mingy portions
  source_replaced  căng_tin dọn suất ăn ít_ỏi
  baseline         this provides little food
  NMT output       the <unk> made some small meals
  post-processed   the restaurant made some small meals

In Example 1, it can be seen that the post-processed output is more meaningful than the baseline output, which includes an <unk> token. The explanation is that replacing the unknown word ú_ớ in the source sentence with the similar in-vocabulary word lẩm_bẩm makes the NMT system generate an output without <unk> tokens. Afterward, the proper output is created by using the restoration module to replace the word mutter in the NMT output with the word yelp, a translation of the source unknown word ú_ớ. Similarly, in Example 2, the post-processed output is better than the other outputs. This means our proposed model of using a replacement and a restoration module helps the NMT system generate the expected output. In Example 3, the output of the post-processing step and the NMT output are the same because, in the replacement module, we have chosen gian_truân, a synonym of the unknown word gian_lao. This indicates that if the replaced word is a synonym of the unknown word, the NMT system can immediately generate a proper translation. In Example 4, although the unknown word nhà_ăn is replaced with the similar word căng_tin, the NMT system still generates a poor output. However, by using the attention information, the restoration module identifies that the <unk> symbol in the NMT output is the translation of căng_tin in the source_replaced sentence. After that, the <unk> symbol is replaced by restaurant, the translation of the source unknown word nhà_ăn.

Chapter 5

Conclusion

In this thesis, we proposed two models to enhance the quality of a machine translation system by using cross-lingual word embedding models. The first model enriches the phrase-table in the PBSMT system by recomputing the phrase weights and generating new phrase pairs for the phrase-table. The second model addresses the unknown word problem in the NMT system by replacing the unknown words with the most appropriate in-vocabulary words. The analyses and experimental results show that our models help translation systems overcome the sparse data of less-common and low-resource languages. In the PBSMT system, using both the recomputed phrase weights and the new phrase pairs, the phrase-table quality has been significantly improved; as a result, the BLEU score increased by 0.23 and 1.16 for Vietnamese-English and Japanese-Vietnamese respectively. Likewise, in the NMT system, integrating our proposed model to address unknown words improved the BLEU score by 0.56 and 1.66 for Vietnamese-English and Japanese-Vietnamese respectively. However, there are some drawbacks in our approach, since our methods created incorrect entries for the phrase-table and bad translations for some unknown words. In the future, we will work on the specific cases that generate bad phrase pairs for the phrase-table and bad translations for the unknown words to enhance the quality of the translation. We would also like to continue experimenting with different cross-lingual word embedding models to enhance the quality of the machine translation system.

Bibliography

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014. URL https://arxiv.org/abs/1409.0473.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944966.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. URL http://aclweb.org/anthology/Q17-1010.

Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May 2012.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1179. URL http://www.aclweb.org/anthology/D14-1179.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. CoRR, abs/1710.04087, 2017.

Yiming Cui, Conghui Zhu, Xiaoning Zhu, Tiejun Zhao, and Dequan Zheng. Phrase table combination deficiency analyses in pivot-based SMT. In Elisabeth Métais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera, editors, Natural Language Processing and Information Systems, pages 355–358, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-38824-8.

George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 138–145, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1289189.1289273.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. IRSTLM: an open source toolkit for handling large scale language models. In INTERSPEECH, 2008.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969125.
Kenneth Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 873–882, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=2390524.2390645.

Hien Vu Huy, Phuong-Thai Nguyen, Tung-Lam Nguyen, and M. L. Nguyen. Bootstrapping phrase-based statistical machine translation via WSD integration. In IJCNLP, 2013.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/P17-4012.

Philipp Koehn. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521874157, 9780521874151.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1073445.1073462. URL https://doi.org/10.3115/1073445.1073462.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1557769.1557821.

Taku Kudo. MeCab: Yet another part-of-speech and morphological analyzer. 2005.

Xiaoqing Li, Jiajun Zhang, and Chengqing Zong. Towards zero unknown word in neural machine translation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2852–2858. AAAI Press, 2016. ISBN 978-1-57735-770-4. URL http://dl.acm.org/citation.cfm?id=3060832.3061020.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics, 2015a. doi: 10.18653/v1/D15-1166. URL http://www.aclweb.org/anthology/D15-1166.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19. Association for Computational Linguistics, 2015b. doi: 10.3115/v1/P15-1002. URL http://www.aclweb.org/anthology/P15-1002.

Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013a.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. 2013b.
Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 160–167, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1075096.1075117. URL https://doi.org/10.3115/1075096.1075117.

Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, March 2003. ISSN 0891-2017. doi: 10.1162/089120103321337421. URL http://dx.doi.org/10.1162/089120103321337421.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://doi.org/10.3115/1073083.1073135.

Peyman Passban, Chris Hokamp, Andy Way, and Qun Liu. Improving phrase-based SMT using cross-granularity embedding similarity. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 129–140, 2016. URL http://www.aclweb.org/anthology/W16-3403.

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.

Lê Hồng Phương, Nguyễn Thị Minh Huyền, Azim Roussanaly, and Hồ Tường Vinh. A hybrid approach to word segmentation of Vietnamese texts. In Language and Automata Theory and Applications, pages 240–249. Springer-Verlag, Berlin, Heidelberg, 2008. ISBN 978-3-540-88281-7. doi: 10.1007/978-3-540-88282-4_23. URL http://dx.doi.org/10.1007/978-3-540-88282-4_23.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A survey of cross-lingual word embedding models. 2017.

Matthew S. Ryan and Graham R. Nudd. The Viterbi algorithm. Technical report, Coventry, UK, 1993.

Peter H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, March 1966. ISSN 1860-0980. doi: 10.1007/BF02289451. URL https://doi.org/10.1007/BF02289451.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 539–549, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. ISBN 978-1-937284-19-0. URL http://dl.acm.org/citation.cfm?id=2380816.2380881.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics, 2016. doi: 10.18653/v1/P16-1162. URL http://www.aclweb.org/anthology/P16-1162.

Andreas Stolcke. SRILM - an extensible language modeling toolkit. Pages 901–904, 2002.

Stephan Vogel and Christian Monson. Augmenting manual dictionaries for statistical machine translation systems. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04). European Language Resources Association (ELRA), 2004. URL http://www.aclweb.org/anthology/L04-1334.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING '96, pages 836–841, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics. doi: 10.3115/993268.993313. URL https://doi.org/10.3115/993268.993313.
Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In HLT-NAACL, 2015.

Xiaoning Zhu, Zhongjun He, Hua Wu, Conghui Zhu, Haifeng Wang, and Tiejun Zhao. Improving pivot-based statistical machine translation by pivoting the co-occurrence count of phrase pairs. In EMNLP, 2014.

Copyright © 2018 by Nguyen Minh Thuan. Printed and bound by Nguyen Minh Thuan.
