Integrated linguistic to Statistical Machine Translation

Vương Hoài Thu
University of Technology (Đại học Công nghệ)
Major: Computer Science; Code: 60 48 01
Supervisor: Dr. Nguyễn Văn Vinh
Year of defense: 2012

Abstract: In Natural Language Processing, machine translation is an attractive application that helps users translate sentences from one language into another. Today, phrase-based statistical machine translation is the state of the art, with benefits in word choice and in distortion modeling based on the distance between words. However, it still has problems with the global distortion model between different languages, that is, with reordering over long distances between words. Previous studies have used linguistic information such as syntax trees, morphological information, or hierarchical phrases. Similarly, we use a syntax tree to help the distortion model. However, instead of a full parse tree, we use a shallow syntax tree, in which the height of the tree is limited. By applying transformation rules, we rearrange the order of nodes in the shallow syntax tree and thereby reorder the words in the sentence. A distinctive point of our study is that the transformation rules are applied to the source-language sentence to obtain a new sentence whose word order is similar to that of the target language, as a preprocessing step before training the translation model or decoding with beam search and a log-linear model. Experimental results on an English-Vietnamese pair show that our approach achieves significant improvements over MOSES, the state-of-the-art phrase-based system.

Keywords: Computer science; Natural language processing; Linguistic information; Machine translation

Contents

Introduction
  1.1 Overview
    1.1.1 A Short Comparison Between English and Vietnamese
  1.2 Machine Translation Approaches
    1.2.1 Interlingua
    1.2.2 Transfer-based Machine Translation
    1.2.3 Direct Translation
  1.3 The Reordering Problem and Motivations
  1.4 Main Contributions of this Thesis
  1.5 Thesis Organization
Related works
  2.1 Phrase-based Translation Models
  2.2 Type of orientation phrases
    2.2.1 The Distance Based Reordering Model
  2.3 The Lexical Reordering Model
  2.4 The Preprocessing Approaches
  2.5 Translation Evaluation
    2.5.1 Automatic Metrics
    2.5.2 NIST Scores
    2.5.3 Other scores
    2.5.4 Human Evaluation Metrics
  2.6 Moses Decoder
Shallow Processing for SMT
  3.1 Our proposal model
  3.2 The Shallow Syntax
    3.2.1 Definition of the shallow syntax
    3.2.2 How to build the shallow syntax
  3.3 The Transformation Rule
  3.4 Applying the transformation rule into the shallow syntax tree
Experiments
  4.1 The bilingual corpus
  4.2 Implementation and Experiments Setup
  4.3 BLEU Score and Discussion
Conclusion and Future Work
  5.1 Conclusion
  5.2 Future work
Bibliography
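The preprocessing step described in the abstract (reordering source-language words by applying transformation rules to a shallow syntax tree) can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the chunk representation, the names `Chunk`, `RULES`, and `reorder`, and the single rule shown (moving adjectives after the head noun of a noun phrase, as Vietnamese typically places them) are hypothetical and are not taken from the thesis's actual rule set.

```python
from typing import Callable, Dict, List, Tuple

# A shallow tree of limited height: a sentence is a list of chunks,
# and each chunk is a label plus its (word, POS tag) leaves.
Token = Tuple[str, str]
Chunk = Tuple[str, List[Token]]


def noun_before_adjectives(tokens: List[Token]) -> List[Token]:
    """Hypothetical rule: move adjectives (JJ) that precede the last noun
    in the chunk to the position right after that noun."""
    noun_positions = [i for i, (_, tag) in enumerate(tokens) if tag.startswith("NN")]
    if not noun_positions:
        return tokens  # no noun in this chunk, leave it unchanged
    head = noun_positions[-1]
    adjectives = [t for t in tokens[:head] if t[1] == "JJ"]
    others = [t for t in tokens[:head] if t[1] != "JJ"]
    return others + [tokens[head]] + adjectives + tokens[head + 1:]


# Hypothetical rule table: chunk label -> reordering function.
RULES: Dict[str, Callable[[List[Token]], List[Token]]] = {
    "NP": noun_before_adjectives,
}


def reorder(sentence: List[Chunk]) -> List[str]:
    """Apply the matching rule (if any) to each chunk and return the
    reordered source words, ready to be fed to training or decoding."""
    words: List[str] = []
    for label, tokens in sentence:
        rule = RULES.get(label)
        new_tokens = rule(tokens) if rule else tokens
        words.extend(word for word, _ in new_tokens)
    return words


example: List[Chunk] = [
    ("NP", [("I", "PRP")]),
    ("VP", [("bought", "VBD")]),
    ("NP", [("a", "DT"), ("red", "JJ"), ("book", "NN")]),
]
reordered = reorder(example)
print(" ".join(reordered))
```

For the example sentence, the sketch prints `I bought a book red`, a source-side word order closer to Vietnamese. In a pipeline of the kind the abstract describes, the same reordering would be applied to the source side of the training corpus and to every input sentence before decoding, so that the phrase-based system only has to handle the remaining local distortion.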