Extraction of Vietnamese collocation from text corpora

Đỗ Thị Ngọc Quỳnh
University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Dr. Lê Anh Cường
Year of defense: 2011

Abstract

Collocations are widely used in linguistics, in dictionary compilation, and in natural language processing (NLP). Extracting the collocations of each language is therefore necessary, both to improve the accuracy and naturalness of NLP applications and to make learning a new language easier. In Vietnam, however, the study of collocations is still a fairly new field. This thesis examines several collocation extraction methods in order to find an efficient model for extracting Vietnamese collocations. The methods are based on commonly used classical statistical measures: frequency, the t-test, the chi-square test, and mutual information. We also propose a general method that uses a linguistic measure to increase the accuracy of extraction. The input data consists of POS-tagged text and parsed text. By running the program with individual methods and with combinations of several methods, and comparing their accuracy, we identify an efficient method for extracting Vietnamese collocations from text corpora.

Keywords

Language processing; Data processing; Natural language; Artificial intelligence

Content

Table of Contents

1 Introduction
  1.1 Definitions
  1.2 Related works and motivation
  1.3 Contribution of the thesis
2 Collocation: concept, roles and applications
  2.1 Collocations' characteristics
    2.1.1 Recurrent
    2.1.2 Arbitrary
    2.1.3 Domain-dependent
    2.1.4 Non-substitutability (closely linked in terms of vocabulary)
  2.2 Classification of collocations
    2.2.1 Idiomatic Phrases
    2.2.2 Support Verb Construction
    2.2.3 Fixed Phrases
  2.3 Applications
  2.4 Vietnamese collocations
3 Basic methods in Collocation extraction
  3.1 Frequency
  3.2 Hypothesis testing
    3.2.1 T-Test
    3.2.2 Chi-Square
  3.3 Point-wise Mutual Information (PMI)
4 Our proposal for extracting Vietnamese collocation
  4.1 Patterns for Vietnamese collocation
  4.2 The Linguistic Measure
  4.3 Designed model
5 Experiments
  5.1 Data preparation
    5.1.1 Collecting corpora
    5.1.2 Extracting bi-grams
    5.1.3 Adding syntactic information to bi-grams
  5.2 The test models
  5.3 Experimental results with statistical methods
    5.3.1 Bi-grams with syntactic information
  5.4 The experiments of our proposal
6 Conclusion
Bibliography

Thesis-related publications

- Le Anh Cuong, Do Thi Ngoc Quynh, and Cao Van Viet. Building and Evaluating Vietnamese Language Models. VNU Journal of Science (under revision).
- Le Anh Cuong and Do Thi Ngoc Quynh. Vietnamese collocation extraction (to be submitted).
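
Illustrative sketch of the statistical measures

The four statistical measures named in the abstract (frequency, t-test, chi-square, and point-wise mutual information) can be illustrated with a minimal Python sketch. It assumes the standard textbook formulations for adjacent bi-grams and is not necessarily the exact variant implemented in the thesis. The function name `bigram_scores` and the toy token list are hypothetical, and the t-score uses the common approximation of the sample variance by the bi-gram probability.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score every adjacent word pair with four classic association measures.

    Hypothetical helper for illustration; textbook formulas are assumed.
    """
    n = len(tokens) - 1                      # number of bi-gram positions
    if n < 1:
        return {}
    first = Counter(tokens[:-1])             # word counts in first position
    second = Counter(tokens[1:])             # word counts in second position
    pairs = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (w1, w2), c12 in pairs.items():
        c1, c2 = first[w1], second[w2]
        p12 = c12 / n                        # observed bi-gram probability
        expected = (c1 / n) * (c2 / n)       # probability under independence

        # t-test: variance approximated by p12 (valid when p12 is small).
        t = (p12 - expected) / math.sqrt(p12 / n)

        # Chi-square on the 2x2 contingency table of (w1 or not, w2 or not).
        o11, o12 = c12, c1 - c12
        o21, o22 = c2 - c12, n - c1 - c2 + c12
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

        # Point-wise mutual information in bits.
        pmi = math.log2(c12 * n / (c1 * c2))

        scores[(w1, w2)] = {"freq": c12, "t": t, "chi2": chi2, "pmi": pmi}
    return scores


# Toy usage: rank the bi-grams of a tiny, made-up token sequence by PMI.
tokens = "strong tea and strong tea or weak tea".split()
ranked = sorted(bigram_scores(tokens).items(),
                key=lambda item: item[1]["pmi"], reverse=True)
for pair, measures in ranked:
    print(pair, measures)
```

On a real corpus the input would be the word-segmented, POS-tagged text described in the abstract, and candidates would normally be filtered by a minimum frequency before ranking, because PMI and chi-square are unreliable for rare pairs.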