Lexicalized statistical parsing for Vietnamese

Author: Phạm Thị Minh Thu
Institution: University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis, major: Information Technology
Supervisor: Dr. Lê Anh Cường
Year of defense: 2010
Keywords: Statistical parsing; Lexicon; Syntax; Vietnamese; Informatics

Table of Contents

Acknowledgements

Introduction
  1.1 What is syntactic parsing?
  1.2 Current Studies in Parsing
  1.3 Vietnamese syntactic parsing
  1.4 Objective of the Thesis
  1.5 Thesis structure

Parsing approaches
  2.1 Context Free Grammar (CFG)
  2.2 Parsing Algorithms
    2.2.1 Top-down parsing
    2.2.2 Bottom-up parsing
    2.2.3 Comparison between top-down parsing and bottom-up parsing
    2.2.4 CYK algorithm (Cocke-Younger-Kasami)
    2.2.5 Earley algorithm
  2.3 Probabilistic Context Free Grammar (PCFGs)
    2.3.1 The concept of PCFG
    2.3.2 Disadvantages of PCFGs
  2.4 Lexical Probabilistic Context Free Grammar (LPCFGs)
    2.4.1 Head structure
    2.4.2 The concept of Lexical Probabilistic Context Free Grammar (LPCFGs)
    2.4.3 Three models of Collins

Vietnamese parsing and our approach
  3.1 Vietnamese characteristics
    3.1.1 Penn Treebank
          POS tagging
    3.1.2 Bracketing
  3.2 Viet Treebank
    3.2.1 Objectives
    3.2.2 The POS tagset and syntax tagset for Vietnamese
  3.3 Our approach in building a Vietnamese parser
    3.3.1 Adapting Bikel's parser for Vietnamese
    3.3.2 Analyzing errors and proposing heuristic rules

Experiments and Discussion
  3.4 Data
  3.5 Bikel's parsing tool
  3.6 Adapting Bikel's tool to Vietnamese
    3.6.1 Investigating different configurations
    3.6.2 Training
    3.6.3 Parsing
    3.6.4 Evaluation of the parser
    3.6.5 Results
  3.7 Experimental results on using heuristic rules

Conclusions and Future Work
  3.8 Summary
  3.9 Contribution
  3.10 Future work

References:

1. Agirre, E., & Baldwin, T. (2008). Improving parsing and PP attachment performance with sense information. Proceedings of ACL-08: HLT (pp. 317–325). Columbus, Ohio: Association for Computational Linguistics.
2. Anh-Cuong, L., Phuong-Thai, N., Hoai-Thu, V., Minh-Thu, P., & Tu-Bao, H. (2009). Experimental study on lexicalized statistical parsing for Vietnamese. KSE 2009: International Conference on Knowledge and Systems Engineering (pp. 162–167). Hanoi, Vietnam.
3. Bikel, D. M. (2004). On the parameter space of generative lexicalized statistical parsing models. Doctoral dissertation, Philadelphia, PA, USA. Supervisor: Mitchell P. Marcus.
4. Candito, M., & Crabbé, B. (2009). Improving generative statistical parsing with semisupervised word clustering. IWPT '09: Proceedings of the 11th International Conference on Parsing Technologies (pp. 138–141). Morristown, NJ, USA: Association for Computational Linguistics.
5. Carreras, X., Collins, M., & Koo, T. (2008). TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. CoNLL '08: Proceedings of the Twelfth Conference on Computational Natural Language Learning (pp. 9–16). Morristown, NJ, USA: Association for Computational Linguistics.
6. Collins, M. (1997). Three generative, lexicalised models for statistical parsing. ACL-35: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (pp. 16–23). Morristown, NJ, USA: Association for Computational Linguistics.
7. Collins, M. (1999). Head-driven statistical models for natural language parsing. Doctoral dissertation, University of Pennsylvania.
8. Collins, M. (2003). Head-driven statistical models for natural language parsing. Computational Linguistics, 29, 589–637.
9. Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
10. Phuong-Thai, N., & Xuan-Luong, V. (2009). Building a large syntactically-annotated corpus of Vietnamese. Proceedings of the Third Linguistic Annotation Workshop (pp. 182–185). Suntec, Singapore: Association for Computational Linguistics.
11. Quoc-The, N., & Thanh-Huong, L. (2008). Vietnamese syntactic parsing using the lexicalized probabilistic context-free grammar. Proceedings of FAIR conference 2007 (pp. 9–10). Nha Trang, Vietnam.
12. Rafferty, A. N., & Manning, C. D. (2008). Parsing three German treebanks: Lexicalized and unlexicalized baselines. PaGe '08: Proceedings of the Workshop on Parsing German (pp. 40–46). Morristown, NJ, USA: Association for Computational Linguistics.
13. Watson, R., Briscoe, T., & Carroll, J. (2007). Semi-supervised training of a statistical parser from unlabeled partially-bracketed data. IWPT '07: Proceedings of the 10th International Conference on Parsing Technologies (pp. 23–32). Morristown, NJ, USA: Association for Computational Linguistics.
14. Xiong, D., Li, S., Liu, Q., Lin, S., & Qian, Y. (2005). Parsing the Penn Chinese Treebank with semantic knowledge. In Proceedings of IJCNLP 2005 (pp. 70–81).