Extraction of Vietnamese collocation from text corpora

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

DO THI NGOC QUYNH

EXTRACTION OF VIETNAMESE COLLOCATION FROM TEXT CORPORA

Major: Computer Science
Code: 60 48 01

MASTER THESIS

SUPERVISOR: Doctor Le Anh Cuong

Hanoi – 2011

Table of Contents

1 Introduction
1.1 Definitions
1.2 Related works and motivation
1.3 Contribution of the thesis
2 Collocation: concept, roles and applications
2.1 Collocations’ characteristics
2.1.1 Recurrent
2.1.2 Arbitrary
2.1.3 Domain-dependent
2.1.4 Non-substitutability (the closely linked in terms of vocabulary)
2.2 Classification of collocations
2.2.1 Idiomatic Phrases
2.2.2 Support Verb Construction
2.2.3 Fixed Phrases
2.3 Applications
2.4 Vietnamese collocations
3 Basic methods in Collocation extraction
3.1 Frequency
3.2 Hypothesis testing
3.2.1 T-Test
3.2.2 Chi-Square
3.3 Point-wise Mutual Information (PMI)
4 Our proposal for extracting Vietnamese collocation
4.1 Patterns for Vietnamese collocation
4.2 The Linguistic Measure
4.3 Designed model
5 Experiments
5.1 Data preparation
5.1.1 Collecting corpora
5.1.2 Extracting bi-grams
5.1.3 Adding syntactic information to bi-grams
5.2 The test models
5.3 Experimental results with statistical methods
5.3.1 Bi-grams with syntactic information
5.4 The experiments of our proposal
Conclusion
Bibliography

ABSTRACT

Collocations are widely applied in linguistics, in dictionary compilation, and in natural language processing. Extracting collocations for each language is therefore necessary, both to improve the accuracy and naturalness of natural language processing applications and to make learning a new language easier. However, in Vietnam, the study of 
collocation is quite a new field. This thesis studies several collocation extraction methods in order to find an efficient model for extracting Vietnamese collocations. The methods are based on classic, commonly used statistical measures: frequency, the t-test, chi-square, and mutual information. We also propose a general method that uses a linguistic measure to increase the accuracy of extraction. The input data consist of POS-tagged data and parsed data. By running the program with individual methods and with combinations of several methods, and comparing their accuracy, we identify an efficient method for extracting Vietnamese collocations from text corpora.

CHAPTER 1: Introduction

Firth [7] defines collocation as an abstract syntactic concept, not directly related to the meaning of the words that constitute it. Choueka [5] describes a collocation as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose meaning cannot be inferred directly from the meanings of its component words. According to Benson [2], a collocation is a fixed and recurrent combination of words. Thus, Firth focuses on the lexical aspect of collocation, while Choueka emphasizes its syntactic function in text. Benson's definition is among the most widely used, but it ignores several properties that matter for applications such as machine translation, for example that a collocation cannot be translated word by word from English into Vietnamese.

A collocation is an expression of two or more words that corresponds to a conventional way of saying things. Collocations are also described as a class of word groups that lies between idioms and free word combinations [4]. However, a line is typically drawn between a phrase and a collocation. An idiom or phrase may be defined as an expression in the language that is peculiar to itself either grammatically 
or especially in having a meaning that cannot be derived from the sum of the meanings of its elements. It is often nearly impossible to guess the meaning of an idiom from the words it contains. Moreover, the meanings of idioms are often stronger than those of non-idiomatic phrases.

Many studies of collocation have been conducted for English, but no standard definition has emerged; the definition adopted depends on the viewpoint and purpose of each researcher. In this thesis, we adopt the following definition: a collocation is a combination of words that often appear together within a normal span of text and whose positions and grammatical relations are relatively fixed.

Collocations are widely applied in linguistics [2, 21, 23], in dictionary compilation [11], and in natural language processing [4, 16, 18, 25, 27]. Extracting collocations for each language is therefore necessary, both to improve the accuracy and naturalness of natural language processing applications and to make learning a new language easier. In addition, collocation translation improves the quality of machine translation. Automatically identifying important collocations to be listed in a dictionary is a task of computational lexicography. Knowledge of collocations can also improve the performance of information retrieval systems.

Statistical methods have a prominent place in collocation extraction. The frequency measure has been used to identify a particular type of collocation. Mutual information has been used to extract word pairs that tend to co-occur within a fixed-size window (normally words), in which the extracted words may not be directly related. The t-test has been suggested as a way to find words whose co-occurrence patterns best distinguish between two words. The likelihood ratio test has also been applied to collocation discovery.

CHAPTER 2: Related works

A good example of the type of problem is Halliday's 
example of strong vs. powerful tea (Halliday 1966: p150). It is a convention in English to talk about strong tea, not powerful tea, although any speaker of English would also understand the latter, unconventional expression. In this view, a collocation is a combination of words that does not follow any general rule of grammar or semantics. From some points of view, collocations are fixed and inflexible: the meaning of a collocation usually cannot be inferred from the meanings of its parts, and replacing one component with a synonym can completely change the meaning of the collocation.

Collocations are also understood as idiosyncratic syntagmatic combinations of lexical items (Fontenelle, 1992, p222): heavy rain, light breeze, great difficulty, grow steadily, meet requirement, reach consensus, pay attention, ask a question. Unlike idioms (kick the bucket, lend a hand, pull someone’s leg), their meaning is fairly transparent and easy to decode. Unlike regular productions (big house, cultural activity, read a book), collocational expressions are highly idiosyncratic, since the lexical items a headword combines with in order to express a given meaning are contingent upon that word (Mel'čuk, 2003).

As has been pointed out by many researchers (Cruse, 1986; Benson, 1990; McKeown and Radev, 2000), collocations cannot be described by means of general syntactic and semantic rules. They are arbitrary and unpredictable, and therefore need to be memorized. They constitute the so-called semi-finished products of language (Hausmann, 1985) or the islands of reliability (Lewis, 2000) on which speakers build their utterances.

In addition, collocation poses a special problem for linguistics. Syntax imposes constraints on word order or on the occurrence of particular phrasal types such as PPs or NPs, and lexical semantics imposes

Joachim Wermter and Udo Hahn [1] 
introduced a linguistic measure for identifying PP-verb collocations in German, which is based on the property of non- or limited modifiability.

A large body of collocation extraction work concerns the English language (Choueka, 1988; Church et al., 1989; Church and Hanks, 1990; Smadja, 1993; Justeson and Katz, 1995; Kjellmer, 1994; Sinclair, 1995; Lin, 1998), among many others. Choueka (1988) detects (consecutive) n-grams simply by computing co-occurrence frequency. Justeson and Katz (1995) apply a POS filter to the pairs they extract (Kjellmer, 1994). Smadja (1993) uses the z-score together with several heuristics (e.g., the systematic occurrence of two lexical items at the same distance in the text) and extracts predicative collocations, rigid noun phrases and phrasal templates. He then uses a parser to validate the results; parsing is shown to increase accuracy from 40% to 80%. Church et al. (1989) and Church and Hanks (1990) use POS information and a parser to extract verb-object pairs, which are then ranked according to the mutual information (MI) measure. Lin (1998) also proposes a hybrid approach based on a dependency parser; the extracted candidates are then compared with the MI results.

Collocations are also important in text production tasks such as machine translation [2, 21, 23] and in natural language processing [4, 16, 18, 25, 27]. Furthermore, they are useful in a variety of other applications, such as word sense disambiguation (Brown et al., 1991) and parsing (Alshawi and Carter, 1994). Collocations are particularly important because of their prevalence in language, across all domains and genres. According to Jackendoff (1997, 156) and Mel'čuk (1998, 24), a large number of collocations appear in the vocabulary of a language.

The past decade has witnessed a considerable development of collocation extraction techniques that concern both monolingual and 
(parallel) multilingual corpora. We can mention here only a part of this work: (Berry-Rogghe, 1973; Church et al., 1989; Smadja, 1993; Lin, 1998; Krenn and Evert, 2001) for monolingual extraction, and (Kupiec, 1993; Wu, 1994; Smadja et al., 1996; Kitamura and Matsumoto, 1996; Melamed, 1997) for bilingual extraction via alignment.

In a first paper on fuzzy decision making, Raj Kishor Bisht and H. S. Dhami [3] suggest a way to assess whether a word combination can be considered a collocation. Fuzzy logic allows the formation of a logic-based model by utilizing the reasoning behind the existing methods. The resulting model has the simplicity of a logic-based model and performs better than the existing statistical models.

In the study of collocation, German is the second most investigated language, thanks to the early work of Breidt (1993) and, more recently, the work of Krenn and Evert (Krenn and Evert, 2001; Evert and Krenn, 2001; Evert, 2004). Breidt uses MI and the t-score and compares the accuracy obtained when various parameters change, such as window size, the presence or absence of lemmatization, corpus size, and the presence or absence of POS and syntactic information. Krenn and Evert (2001) then used a German chunker to extract syntactic pairs such as P-N-V. Their work laid the basis for formal and systematic evaluation methods in collocation extraction. Zinsmeister and Heid (2003, 2004) focused on N-V and A-N-V combinations identified using a stochastic parser.

Thanks to the outstanding work of Gross on lexicon-grammar (1984), French is one of the most studied languages in terms of the distributional and transformational potential of words. This work was done before the computer era and the advent of corpus linguistics, while automatic extraction was later performed in, for example, (Lafon, 1984; Daille, 1994; Bourigault, 1992; Goldman et al., 2001). Collocation extraction methods have also been studied for a number of other languages.

For over 20 years 
now, the field of natural language processing has achieved many accomplishments (such as labelling, topic detection, or information retrieval). However, most of these advances were made for Western languages, and their value is lost when they are applied to other languages. Only very recently have Vietnamese researchers been attracted to linguistics and to the standardization of Vietnamese. The necessary corpora and term resources have not been built to a common standard, and so far almost no resources are publicly available. This makes it difficult for newcomers to learn or do research in this field.

In [26], which concerns a discovery scheme for classifying and clustering Vietnamese web documents, the author applies n-gram testing to extract meaningful phrases (or collocations) from n-grams on the basis of test statistics. That work names several statistical methods for identifying collocations, such as mutual information and hypothesis testing techniques, with the null hypothesis that the words of an n-gram are independent; the author used hypothesis testing methods for n-grams (n
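
The statistical measures named in this chapter, mutual information and hypothesis testing against a null hypothesis of independence, can be illustrated with a short sketch. The Python below is a minimal illustration and not the implementation used in the thesis. It scores adjacent word pairs with point-wise mutual information and a t-score, assuming maximum-likelihood probabilities estimated from raw counts and using the token count as the sample size for both single words and bi-grams. The function name bigram_scores and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): {"count", "pmi", "t"}} for adjacent word pairs.

    Hypothetical helper for illustration; probabilities are plain
    maximum-likelihood estimates from the counts.
    """
    n = len(tokens)  # sample size, used for single words and bi-grams alike
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (w1, w2), c12 in bigrams.items():
        # Count expected under the null hypothesis that w1 and w2
        # occur independently of each other.
        expected = unigrams[w1] * unigrams[w2] / n
        # PMI: log ratio of the observed count to the expected count.
        pmi = math.log2(c12 / expected)
        # t-score: observed minus expected, scaled by an estimate of the
        # standard deviation (variance approximated by the observed count).
        t = (c12 - expected) / math.sqrt(c12)
        scores[(w1, w2)] = {"count": c12, "pmi": pmi, "t": t}
    return scores


if __name__ == "__main__":
    toy = "strong tea and strong tea or weak tea".split()
    ranked = sorted(
        bigram_scores(toy).items(), key=lambda kv: kv[1]["t"], reverse=True
    )
    for pair, s in ranked:
        print(pair, s["count"], round(s["pmi"], 3), round(s["t"], 3))
```

On a sample this small, a pair seen only once, such as ("or", "weak"), receives the highest PMI, which suggests why PMI is usually combined with a frequency threshold or with a test statistic such as the t-score.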
