Vietnamese Word Clustering and Antonym Identification (Phân cụm từ Tiếng Việt và nhận diện từ trái nghĩa)

Nguyễn Kim Anh
Trường Đại học Công nghệ
Major: Computer Science; Code: 60 48 01
Supervisor: Dr. Nguyễn Phương Thái
Year of defense: 2013

Abstract: Automatically constructing clusters of similar words has many important applications in Natural Language Processing (NLP) tasks, such as dictionary construction, statistical machine translation, named-entity recognition, functional labeling, and word segmentation. In recent years, word clustering has been widely studied for languages such as English, German, and Chinese. For Vietnamese, however, the task is more recent. In this thesis, I use a large unlabeled Vietnamese corpus of about 15 million words, equivalent to approximately 700 thousand sentences. The corpus is extracted from the newspapers Lao dong, PC World, and Tuoi tre and then part-of-speech tagged. I investigate several approaches to constructing word clusters in Vietnamese, focusing mainly on two methods, by Brown and by Dekang Lin. I use the same Vietnamese corpus and the same evaluation tool for both methods so that I can compare and evaluate their effects on certain NLP tasks. In addition, I use a statistical method to propose 20 antonym frames that can be used to identify antonym classes within clusters.

Keywords: Computer science; Natural language processing; Word clusters; Antonyms

Table of Contents

Acknowledgements
Abstract
Chapter I - Introduction
  1.1. Word Similarity
  1.2. Hierarchical Clustering of Word
  1.3. Function tags
  1.4. Objectives of the Thesis
  1.5. Our Contributions
  1.6. Thesis structure
Chapter II - Related Works
  2.1. Word Clustering
    2.1.1. The Brown algorithm
    2.1.2. Sticky Pairs and Semantic Classes
  2.2. Word Similarity
    2.2.1. Approach
    2.2.2. Grammar Relationships
    2.2.3. Results
  2.3. Clustering By Committee
    2.3.1. Motivation
    2.3.2. Algorithm
    2.3.3. Results
Chapter III - Our approach
  3.1. Word clustering in Vietnamese
    3.1.1. Brown's algorithm
    3.1.2. Word similarity
  3.2. Evaluating Methodology
  3.3. Antonym classes
    3.3.1. Ancillary antonym
    3.3.2. Coordinated antonym
    3.3.3. Minor classes
  3.4. Vietnamese functional labeling
Chapter IV - Experiment
  4.1. Results and Comparison
  4.2. Antonym frames
  4.3. Effectiveness of Word Cluster feature in Vietnamese Functional labeling
  4.4. Error analyses
  4.5. Summarization
Chapter V - Conclusion and Future works
  5.1. Conclusion
  5.2. Future works
Bibliography

Bibliography

[1] A. L. Berger, S. A. D. Pietra, V. J. D. Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 1996.
[2] S. Abney. Understanding the Yarowsky Algorithm. Computational Linguistics, 30(3), 2004.
[3] Anh-Cuong Le, Phuong-Thai Nguyen, Hoai-Thu Vuong, Minh-Thu Pham, Tu-Bao Ho. An Experimental Study on Lexicalized Statistical Parsing for Vietnamese. In Proceedings of KSE 2009, pages 162-167.
[4] A. Blum, S. Chawla. Learning from Labeled and Unlabeled Data Using Graph Mincuts. In Proceedings of ICML 2001.
[5] A. Blum, T. Mitchell. Combining Labeled and Unlabeled Data with Co-training. In Proceedings of the Workshop on Computational Learning Theory, 1998.
[6] Caixia Yuan, Fuji Ren, Xiaojie Wang. Accurate Learning for Chinese Function Tags from Minimal Features. 2009.
[7] M. Collins, Y. Singer. Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[8] Dekang Lin, Xiaoyun Wu. Phrase Clustering for Discriminative Learning. ACL/AFNLP 2009, pages 1030-1038.
[9] Dekang Lin, Patrick Pantel. Induction of Semantic Classes from Natural Language Text. KDD 2001, pages 317-322.
[10] D. Lin. Automatic Retrieval and Clustering of Similar Words. COLING-ACL98, Montreal, Canada, August 1998.
[11] D. Lin. Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In Proceedings of ACL-97, Madrid, Spain, July 1997.
[12] Dekang Lin. Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In Proceedings of ACL/EACL-97, pages 64-71, Madrid, Spain, July 1997.
[13] Don Blaheta. Function Tagging. PhD thesis, 2003.
[14] Don Blaheta, Eugene Charniak. Assigning Function Tags to Parsed Text. In Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
[15] Ido Dagan, Shaul Marcus, Shaul Markovitch. Contextual Word Similarity and Estimation from Sparse Data. In Proceedings of ACL-93, pages 164-171, Columbus, Ohio, June 1993.
[16] S. Jones. Antonymy: A Corpus-Based Perspective. Routledge, 2002. eScholarID: 4b966
[17] Katz, Fodor. The Structure of a Semantic Theory. 1963.
[18] J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML 2001, pages 282-289, Williamstown, USA.
[19] S. Miller, J. Guinness, A. Zamanian. Name Tagging with Word Clusters and Discriminative Training. In Proceedings of HLT-NAACL 2004, pages 337-342.
[20] Nianwen Xue, Martha Palmer. Automatic Semantic Role Labeling for Chinese Verbs. CIS Department, University of Pennsylvania, 2004.
[21] Nguyen Thanh Huy, Nguyen Kim Anh, Nguyen Phuong Thai. Building an Efficient Functional-Tag Labeling System for Vietnamese. KSE 2011, pages 92-97.
[22] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, R. L. Mercer. Class-based n-gram Models of Natural Language. Computational Linguistics, 18(4):467-479, 1992.
[23] Patrick Pantel. Clustering by Committee. Ph.D. dissertation, Department of Computing Science, University of Alberta, 2003.
[24] Percy Liang. Semi-supervised Learning for Natural Language. Massachusetts Institute of Technology, 2005.
[25] Phuong-Thai Nguyen, Xuan-Luong Vu, Minh-Huyen Nguyen, Van-Hiep Nguyen, Hong-Phuong Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. The 3rd Linguistic Annotation Workshop (LAW), ACL-IJCNLP 2009.
[26] T. Koo, X. Carreras, M. Collins. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL 2008, pages 595-603.