Doctoral thesis: Research on building Vietnamese-English bilingual resources for machine translation

VIETNAM NATIONAL UNIVERSITY, HANOI
VNU UNIVERSITY OF SCIENCE

NGUYỄN TIẾN HÀ

RESEARCH ON BUILDING VIETNAMESE-ENGLISH BILINGUAL RESOURCES FOR DOMAIN-SPECIFIC MACHINE TRANSLATION

Specialization: Mathematical Foundations for Informatics
Code: 9460117.02

DOCTORAL THESIS IN MATHEMATICS

SUPERVISORS:
Dr. Nguyễn Thị Minh Huyền
Assoc. Prof. Dr. Nguyễn Hữu Ngự

Hà Nội - 2020

DECLARATION

I hereby declare that the content presented in this thesis is the result of my own research, carried out under the supervision of Dr. Nguyễn Thị Minh Huyền and Assoc. Prof. Dr. Nguyễn Hữu Ngự. All content cited from the research of other authors is clearly attributed in the references section.

Nguyễn Tiến Hà

ACKNOWLEDGEMENTS

I would like to express my deep gratitude to Dr. Nguyễn Thị Minh Huyền and Assoc. Prof. Dr. Nguyễn Hữu Ngự, who directly supervised me, guided me with dedication, supported me, and created the best conditions for me throughout my studies and research.

I would like to thank the lecturers of the Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi, and especially the lecturers of the Department of Informatics, who directly taught and helped me during my studies and research at the university.

I would like to thank Dr. Nguyễn Văn Vinh, Assoc. Prof. Dr. Nguyễn Phương Thái, and Assoc. Prof. Dr. Phan Xuân Hiếu of the VNU University of Engineering and Technology, Vietnam National University, Hanoi; Dr. Trần Thị Oanh of the International School, Vietnam National University, Hanoi; Assoc. Prof. Dr. Lê Thanh Hương and Dr. Đỗ Thị Ngọc Diệp of Hanoi University of Science and Technology; and Assoc. Prof. Dr. Đỗ Trung Tuấn, Dr. Đỗ Thanh Hà, Dr. Lê Hồng Phương, Assoc. Prof. Dr. Lê Trọng Vĩnh, Dr. Nguyễn Thị Bích Thủy, and Dr. Vũ Tiến Dũng of the VNU University of Science, Vietnam National University, Hanoi, whose comments and corrections helped me complete this thesis.

I would like to thank all my colleagues in the Department of Informatics, Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi, and in the Department of Computer Science, Faculty of Information Technology, VNU University of Engineering and Technology, Vietnam National University, Hanoi, for their help during my time as a doctoral candidate.
Finally, I would like to thank all the members of my family, my friends, and my colleagues at my workplace for supporting, sharing with, and encouraging me in my studies and research.

CONTENTS

List of abbreviations
Introduction
1 Overview of machine translation and language resources
1.1 Overview of machine translation
1.1.1 History of machine translation
1.1.2 Architecture of machine translation systems
1.1.3 Machine translation methods
1.1.4 Machine translation systems used in the experiments
1.1.5 Evaluating machine translation systems
1.2 Language resources for machine translation systems
1.2.1 Multilingual resources for machine translation
1.2.2 Vietnamese-English bilingual resources
1.3 Domain adaptation in machine translation
1.4 Text preprocessing tools
1.5 Chapter conclusion
2 Building a domain-specific, sentence-aligned Vietnamese-English bilingual corpus
2.1 Building a domain-specific Vietnamese-English bilingual corpus
2.1.1 Method for collecting sentence-aligned bilingual data
2.1.2 Building a Vietnamese-English bilingual corpus for the tourism domain
2.2 Aligning Vietnamese-English bilingual texts
2.2.1 Methods for sentence-level alignment of bilingual texts
2.2.2 Improving the XAlign sentence alignment tool
2.3 Applying the Vietnamese-English bilingual tourism corpus to machine translation systems
2.3.1 Experimental results
2.3.2 Some errors of the translation systems
2.4 Chapter conclusion
3 Building a Vietnamese-English bilingual corpus of words and phrases
3.1 Automatically building a Vietnamese-English bilingual lexicon
3.1.1 Building a bilingual lexicon
3.1.2 Method for automatically building a Vietnamese-English bilingual lexicon
3.1.3 Method for automatically building a Vietnamese-English bilingual lexicon for the tourism domain
3.1.4 Experiments and results
3.2 Rule-based extraction of Vietnamese-English bilingual terms from monolingual Vietnamese text
3.2.1 Related work
3.2.2 Method for extracting Vietnamese-English bilingual terms from monolingual Vietnamese text
3.2.3 Experiments
3.3 Chapter conclusion
4 Exploiting the Vietnamese-English bilingual corpus for machine translation
4.1 Preprocessing training data for neural machine translation
4.1.1 Method for preprocessing long sentences in neural machine translation
4.1.2 The ExtPhrase phrase extraction method
4.1.3 Experiments and results
4.2 Method for automatically generating Vietnamese captions for images
4.2.1 Related work on image caption generation
4.2.2 Proposed process for building a Vietnamese image captioning system
4.3 Chapter conclusion
Conclusion
List of the author's scientific publications related to the thesis
References

LIST OF ABBREVIATIONS

ALPAC	Automatic Language Processing Advisory Committee
BiTES	Bilingual Term Extraction System
BLEU	BiLingual Evaluation Understudy (a metric for evaluating translation quality)
CNN	Convolutional Neural Network
DTW	Dynamic Time Warping
GRU	Gated Recurrent Unit
LSTM	Long Short-Term Memory
MI	Mutual Information
NLP	Natural Language Processing
NMT	Neural Machine Translation
OPUS	The open parallel corpus
PBSMT	Phrase-Based Statistical Machine Translation
PER	Position-independent word Error Rate
RNN	Recurrent Neural Network
SMT	Statistical Machine Translation
SALM	Suffix Array tool kit for empirical Language Manipulations (a tool for filtering the Moses phrase table)
TER	Translation Error Rate
TV	Television
VLSP	Vietnamese Language and Speech Processing
WER	Word Error Rate

LIST OF FIGURES

1.1 The Vauquois triangle
1.2 Direct translation model
1.3 Interlingua-based translation model
1.4 Statistical machine translation model
1.5 Structure of a neural-network-based machine translation system
1.6 Structure of the MOSES machine translation system
3.1 Method for automatically building a Vietnamese-English lexicon
3.2 Method for automatically building a Vietnamese-English dictionary for the tourism domain
3.3 Model for extracting Vietnamese-English bilingual terms from Vietnamese text
3.4 Model for applying rules to select Vietnamese-English bilingual term candidates
4.1 Global attention model
4.2 Local attention model
4.3 Preprocessing of long sentences (length threshold of 30 words) when training the machine translation system
4.4 BLEU scores of the systems by the Vietnamese word count at which a Vietnamese sentence is considered long
4.5 Vietnamese image captioning model
4.6 Comparison of machine translation quality with Google

APPENDIX

Some terms used in corpus construction:

Corpus: a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

Digital corpus: a corpus encoded according to a defined, consistent standard so that it can be exploited by different applications.

Bilingual corpus: a set of texts written in two languages.

Multilingual parallel corpus: a set of texts written in multiple languages.

Bilingual text alignment:
• Document-level alignment: the texts in the corpus are mapped to one another; one document is the translation of another document.
• Paragraph-level alignment: the paragraphs of the two texts are mapped to one another; one or several paragraphs are the translation of one or several paragraphs.
• Sentence-level alignment: the sentences of the two texts are mapped to one another; one sentence is the translation of another sentence.
• Phrase-level alignment: the phrases of the two texts are mapped to one another; one phrase is the translation of another phrase.
• Word-level alignment: the words of the two texts are mapped to one another; one word is the translation of another word.
Word-level and phrase-level alignment are the most detailed levels of alignment in a bilingual corpus.

... above, the thesis studies the construction of a domain-specific Vietnamese-English bilingual corpus for machine translation systems; the data domains given priority are tourism and healthcare. Specific objectives of the thesis: • Build a Vietnamese-English bilingual corpus ...

... a specific language pair. 1.2.2 Vietnamese-English bilingual resources. For Vietnamese at present, the available Vietnamese-English bilingual resources are limited, especially resources for building and developing Vietnamese-English machine translation systems ...
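The alignment levels listed in the appendix glossary can be pictured as links between index positions in the two texts. The following is only an illustrative sketch, not the representation used in the thesis: the names `Level`, `Link`, and `aligned_units`, and the example sentences, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Tuple


class Level(Enum):
    """Alignment granularities named in the appendix glossary."""
    DOCUMENT = "document"
    PARAGRAPH = "paragraph"
    SENTENCE = "sentence"
    PHRASE = "phrase"
    WORD = "word"


@dataclass(frozen=True)
class Link:
    """One alignment link (hypothetical structure).

    `source` and `target` hold unit indices; tuples allow
    one-to-many or many-to-many links, e.g. several paragraphs
    translated by several paragraphs.
    """
    level: Level
    source: Tuple[int, ...]
    target: Tuple[int, ...]


def aligned_units(
    src_units: Sequence[str],
    tgt_units: Sequence[str],
    links: Sequence[Link],
    level: Level,
) -> List[Tuple[str, str]]:
    """Return (source text, target text) pairs for links of one level."""
    pairs = []
    for link in links:
        if link.level is not level:
            continue
        src = " ".join(src_units[i] for i in link.source)
        tgt = " ".join(tgt_units[j] for j in link.target)
        pairs.append((src, tgt))
    return pairs


# Made-up example data, for illustration only.
vi_sentences = ["Tôi thích du lịch.", "Hà Nội rất đẹp."]
en_sentences = ["I like travelling.", "Hanoi is very beautiful."]
sentence_links = [
    Link(Level.SENTENCE, (0,), (0,)),
    Link(Level.SENTENCE, (1,), (1,)),
]

# Word-level links inside the first sentence pair
# (the Vietnamese side is assumed to be word-segmented).
vi_words = ["Tôi", "thích", "du lịch"]
en_words = ["I", "like", "travelling"]
word_links = [
    Link(Level.WORD, (0,), (0,)),
    Link(Level.WORD, (1,), (1,)),
    Link(Level.WORD, (2,), (2,)),
]

for vi, en in aligned_units(vi_sentences, en_sentences, sentence_links, Level.SENTENCE):
    print(vi, "|", en)
```

Storing indices rather than copied strings keeps the same link structure usable at every level; only the unit lists (documents, paragraphs, sentences, phrases, or words) change.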

Title	Research on Building Vietnamese-English Bilingual Resources for Machine Translation
Author	Nguyễn Tiến Hà
Supervisors	Dr. Nguyễn Thị Minh Huyền, Assoc. Prof. Dr. Nguyễn Hữu Ngự
Institution	Vietnam National University, Hanoi
Specialization	Mathematical Foundations for Informatics
Type	Doctoral thesis
Year of publication	2020
City	Hà Nội

Format
Pages	158
File size	2.96 MB