VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN KIM ANH

VIETNAMESE WORD CLUSTERING AND ANTONYM IDENTIFICATION

Major: Computer science
Code: 60 48 01

MASTER THESIS OF INFORMATION TECHNOLOGY

SUPERVISOR: PhD Nguyen Phuong Thai

Hanoi - 2013

Table of Contents

Acknowledgements
Abstract
Chapter I - Introduction 10
1.1 Word Similarity 11
1.2 Hierarchical Clustering of Words 11
1.3 Function Tags 12
1.4 Objectives of the Thesis 13
1.5 Our Contributions 13
1.6 Thesis Structure 14
Chapter II - Related Works 15
2.1 Word Clustering 15
2.1.1 The Brown Algorithm 15
2.1.2 Sticky Pairs and Semantic Classes 17
2.2 Word Similarity 18
2.2.1 Approach 18
2.2.2 Grammar Relationships 19
2.2.3 Results 20
2.3 Clustering By Committee 20
2.3.1 Motivation 21
2.3.2 Algorithm 21
2.3.3 Results 23
Chapter III - Our Approach 25
3.1 Word Clustering in Vietnamese 25
3.1.1 Brown's Algorithm 25
3.1.2 Word Similarity 26
3.2 Evaluating Methodology 28
3.3 Antonym Classes 31
3.3.1 Ancillary Antonyms 31
3.3.2 Coordinated Antonyms 32
3.3.3 Minor Classes 33
3.4 Vietnamese Functional Labeling 34
Chapter IV - Experiment 37
4.1 Results and Comparison 37
4.2 Antonym Frames 40
4.3 Effectiveness of the Word Cluster Feature in Vietnamese Functional Labeling 42
4.4 Error Analyses 43
4.5 Summarization 44
Chapter V - Conclusion and Future Works 45
5.1 Conclusion 45
5.2 Future Works 45
Bibliography 46

List of Figures

Figure 1. An example of Brown's clustering algorithm 16
Figure 2. An example of a Vietnamese word cluster 26
Figure 3. The syntax tree of a sentence 26
Figure 4. An example of Vietnamese word similarity 28
Figure 5. Selecting word clusters by dictionary 30
Figure 6. An example of sentence parses 35
Figure 7. The correctness of k-clusters 38

List of Tables

Table 1. Results of CBC with word sense discovery 24
Table 2. Results of CBC with document clustering 24
Table 3. Ancillary antonym frames 32
Table 4. Coordinated antonym frames 33
Table 5. Transitional antonym frames 34
Table 6. An unlabeled corpus in Vietnamese 37
Table 7. The results of five initial clusters 39
Table 8. The comparison between word clustering and word similarity 40
Table 9. The relations of w1 and w2 pairs (antonym frames) 41
Table 10. The relation of w1 and w2 pairs 42
Table 11. The effectiveness of the word cluster feature 43

Chapter I
Introduction

In recent years, statistical learning methods have been highly successful in natural language processing tasks. Most machine learning algorithms used in these tasks are supervised, and they require labeled data. Such labeled data are often made by hand or in other ways that are time-consuming and expensive. However, while labeled data are difficult to create by hand, unlabeled data are essentially free on the Internet in the form of raw text. This raw text can easily be preprocessed for use in an unsupervised or semi-supervised learning algorithm. Previous work has shown that using unlabeled data to supplement traditional labeled data can improve performance (Miller et al., 2004; Abney, 2004; Collins and Singer, 1999) [19][2][7].

In this thesis, I focus on word clustering algorithms for unlabeled data, mainly applying two methods: word clustering by Brown's algorithm [22] and word similarity by Dekang Lin [10]. Both methods are used to cluster the words of a corpus. While Brown's method clusters words based on the relationships between the words standing immediately before and after the clustered word, Dekang Lin's method uses the relationships among those three words. To compare the advantages and disadvantages of these two methods, I ran them on the same corpus, using the same evaluation method and the same main words in the clusters. The result of word clustering contained different clusters, and each cluster
included words appearing in the same contexts. This result was used as a feature set for an application: Vietnamese functional labeling. I also evaluated the influence of the word clusters when they were used as features in this application; for example, word clusters were used to alleviate the data sparseness problem of the head-word feature. In addition, I used a statistical method to extract 20 antonym frames that can be used to identify antonym classes in the clusters.

In this chapter, I describe word similarity, hierarchical clustering of words, and their applications in natural language processing. I also introduce function tags, the word segmentation task, the objectives of the thesis, and our contributions. Finally, I describe the structure of the thesis.

1.1 Word Similarity

The meaning of an unknown word can be inferred from its context. Consider the following examples:

A bottle of Beer is on the table.
Everyone likes Beer.
Beer makes you drunk.

The contexts in which the word Beer is used suggest that Beer could be a kind of alcoholic beverage. This means that other alcoholic beverages may occur in the same contexts as Beer, and they may be related. Consequently, two words are similar if they appear in similar contexts or are exchangeable to some extent. For example, "Tổng_thống" (President) and "Chủ_tịch" (Chairman) are similar according to this definition. In contrast, the two words "Kéo" (Scissors) and "Cắt" (Cut) are not similar under this definition, although they are semantically related. Intuitively, if I can generate a good clustering, the words in each cluster should be similar.

1.2 Hierarchical Clustering of Words

In recent years, some algorithms have been proposed to automatically cluster words based on a large unlabeled corpus, such as (Brown et al., 1992; Lin, 1998) [22][10]. Consider a corpus of T words, a vocabulary of V words, and a partition π of the vocabulary. The likelihood L(π) of a bigram class model generating the corpus is given by:

L(π) = I − H
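As a minimal sketch of how this likelihood can be evaluated for a fixed hard partition, the following computes H (the entropy of the 1-gram word distribution) and I (the average mutual information of adjacent classes), both defined just below; the toy corpus and class assignment here are hypothetical, not data from this thesis:

```python
import math
from collections import Counter

def likelihood(corpus, assign):
    """Evaluate L(pi) = I - H for a hard partition `assign` (word -> class id)."""
    T = len(corpus)
    word_counts = Counter(corpus)
    # H: entropy of the 1-gram word distribution (in bits)
    H = -sum((c / T) * math.log2(c / T) for c in word_counts.values())

    # Class unigram counts and adjacent-class bigram counts
    classes = [assign[w] for w in corpus]
    c_counts = Counter(classes)
    bi_counts = Counter(zip(classes, classes[1:]))
    N = len(classes) - 1  # number of adjacent class pairs

    # I: average mutual information of adjacent classes
    I = 0.0
    for (c1, c2), n in bi_counts.items():
        p12 = n / N
        p1 = c_counts[c1] / T
        p2 = c_counts[c2] / T
        I += p12 * math.log2(p12 / (p1 * p2))
    return I - H

# Hypothetical toy corpus with two classes
L = likelihood(["a", "b", "a", "b"], {"a": 0, "b": 1})
```

A greedy Brown-style clusterer would repeatedly merge the pair of classes whose merge costs the least in this quantity; the sketch only shows the objective itself.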
Here, H is the entropy of the 1-gram word distribution, and I is the average mutual information of adjacent classes in the corpus:

I = ∑_{c1,c2} Pr(c1 c2) · log( Pr(c1 c2) / (Pr(c1) · Pr(c2)) )

where Pr(c1 c2) is the probability that a word in class c1 is followed by a word in class c2. Since H does not depend on π, the partition that maximizes the average mutual information also maximizes the likelihood L(π) of the corpus. Thus, I can use the average mutual information to construct word clusters by repeating a merging step until the number of clusters is reduced to a predefined number C.

1.3 Function Tags

Functional tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Accordingly, several studies have focused on the function tagging problem to recover additional semantic information that is more useful than syntactic labels alone.

There are two kinds of tags in linguistics: syntactic tags and functional tags. For syntactic tags, there are many theories and projects in English, Spanish, Chinese, and other languages; the main task of this research is to find the parts of speech and tag the constituents with them. Functional tags can be understood as more abstract labels because they are unlike syntactic labels. While a syntactic label assigns one notation to a span of words regardless of the paragraph, a functional tag represents the relationship between a phrase and its utterance in each different context. The functional tag of a phrase may therefore change depending on the context of its neighbors. For example, consider the phrase "baseball bat". The syntactic label of this phrase is "noun phrase" (in most research annotated as NP). But its functional tag might be a subject, as in: This baseball bat is very expensive. In another case, its functional tag might be a direct object: I bought this baseball bat last month. Or an instrument in a passive sentence: That man
was attacked by this baseball bat. Functional tags were directly addressed by Blaheta (2003) [13], and since then much research has focused on how to assign functional tags to a sentence. This kind of research problem is called the functional tag labeling problem, a class of problems aiming at finding the semantic information of phrases. To sum up, functional tag labeling is defined as the problem of finding the semantic information of a group of words and then tagging them with a given annotation in their context.

1.4 Objectives of the Thesis

Most successful machine learning algorithms are supervised, and they usually require labeled data. Such labeled data are often created by hand, which is time-consuming and expensive. Unlabeled data, in contrast, are free: they can be obtained from newspapers, websites, etc., and exist as raw text on the Internet. In this thesis, I investigate methods of clustering words from unlabeled data that can easily be extracted from online sources. Among automatic clustering methods, I focus on two: hierarchical word clustering by Brown's algorithm and word similarity by Dekang Lin. I also propose a common evaluation tool for both methods when they are applied to the same Vietnamese corpus. The output of the word clustering was used as features in natural language processing tasks such as Vietnamese functional labeling, and I evaluated the influence of the word clusters when they were used as features in this task.

1.5 Our Contributions

As discussed above, the main aim of this thesis is to cluster unlabeled Vietnamese words. The contributions of this thesis are as follows:
• Firstly, I performed automatic word clustering for unlabeled Vietnamese data on a corpus of about 700,000 sentences.
• Secondly, I proposed a qualified evaluation method for the resulting clusters, using a thesaurus dictionary with five criteria.
• Thirdly, I compared two clustering methods for Vietnamese, namely word clustering by Brown and
word similarity by Dekang Lin. I used the resulting clusters as features in the Vietnamese functional labeling task to increase its accuracy. In addition, I used a statistical method to extract 20 antonym frames that can be used to identify antonym classes in the clusters. In conclusion, I have implemented word clustering on about 700,000 Vietnamese sentences with a hierarchical word clustering algorithm, using a Vietnamese thesaurus dictionary and five criteria to evaluate the correctness of the clusters, and using ...

... recent years in word clustering and word similarity tasks, such as class-based n-gram models by Brown's algorithm [22], word similarity [10], and clustering by committee [23].

2.1 Word Clustering

2.1.1 ... corpus given by:

Pr(w | c) = C(w) / C(c) and Pr(c) = C(c) / T,

where C(c) is the number of tokens in the T-word corpus that belong to class c. Since c = π(w):

Pr(w) = Pr(w | c) · Pr(c) = C(w) / T

For a 1-gram class ...
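The relative-frequency estimates Pr(w | c) = C(w)/C(c) and Pr(c) = C(c)/T for a class-based model can be sketched in a few lines; the toy corpus and partition below are purely illustrative, not drawn from the thesis data:

```python
from collections import Counter

def class_model_probs(corpus, assign):
    """Return Pr(w) under a class-based 1-gram model with hard partition
    `assign` (word -> class id): Pr(w) = Pr(w|c) * Pr(c) = C(w)/T."""
    T = len(corpus)
    C_w = Counter(corpus)                       # word token counts C(w)
    C_c = Counter(assign[w] for w in corpus)    # class token counts C(c)

    def pr(w):
        c = assign[w]
        p_w_given_c = C_w[w] / C_c[c]  # Pr(w | c) = C(w) / C(c)
        p_c = C_c[c] / T               # Pr(c)     = C(c) / T
        return p_w_given_c * p_c       # collapses to C(w) / T

    return pr

# Hypothetical toy corpus: two classes over four tokens
corpus = ["chó", "mèo", "chó", "bàn"]
assign = {"chó": 0, "mèo": 0, "bàn": 1}
pr = class_model_probs(corpus, assign)
```

Note that for a hard partition the two factors always collapse to C(w)/T, which is exactly why the 1-gram term H in the likelihood is independent of the partition π.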