Lexicalized statistical parsing for Vietnamese

Lexicalized Statistical Parsing for Vietnamese Pham Thi Minh Thu Faculty of Information Technology Hanoi University of Engineering and Technology Vietnam National University, Hanoi Supervised by Doctor Le Anh Cuong A thesis submitted in fulfillment of the requirements for the degree of Master of Computer Science June, 2010 Table of Contents Acknowledgements ii Introduction 1.1 What is syntactic parsing? 1.2 Current Studies in Parsing 1.3 Vietnamese syntactic parsing 1.4 Objective of the Thesis 1.5 Thesis structure Parsing approaches 2.1 Context Free Grammar (CFG) 2.2 Parsing Algorithms 2.2.1 Top-down parsing 2.2.2 Bottom-up parsing 2.2.3 Comparison between top-down parsing and bottom-up parsing 2.2.4 CYK algorithm (Cocke-Younger-Kasami) 2.2.5 Earley algorithm 2.3 Probabilistic context-free grammar (PCFGs) 11 13 2.3.1 The concept of PCFG 13 2.3.2 Disadvantages of PCFGs 14 2.4 Lexical Probabilistic Context Free Grammar (LPCFGs) 2.4.1 Head structure 15 16 2.4.2 The concept of Lexical Probabilistic Context Free Grammar (LPCFGs) 16 2.4.3 Three models of Collins 18 Vietnamese parsing and our approach 21 3.1 Vietnamese characteristics 21 3.2 Penn Treebank 22 iii TABLE OF CONTENTS 3.2.1 POS tagging 23 3.2.2 Bracketing 23 3.3 Viet Treebank 3.3.1 Objectives iv 25 25 3.3.2 The POS tagset and Syntax tagset for Vietnamese 27 3.4 Our approach in building a Vietnamese parser 27 3.4.1 Adapting Bikel's parser for Vietnamese 29 3.4.2 Analyze error and propse using heuristic rules 30 Experiments and Discussion 33 4.1 Data 33 4.2 Bikel's parsing tool 34 4.3 Adaptating Bikel's tool to Vietnamese 35 4.3.1 Investigate different configurations 35 4.3.2 Training 38 4.3.3 Parsing 39 4.3.4 Evaluation of the parser 39 4.3.5 Results 40 4.4 Experimental results on using heuristic rules 42 Conclusions and Future Work 46 5.1 Summary 46 5.2 Contribution 46 5.3 Futurework 47 List of Figures 1.1 The parse tree of sentence "I go to school" 1.2 A parse tree in Vietnamese 2.1 The parse tree of the Vietnamese sentence "mÌo b¾t cht" 15 2.2 Two derivations of the sentence "T«i hiĨu Lan h¬n Nga" 16 2.3 A parse tree of Vietnamese in LPCFG 17 2.4 A tree with the "C" suffix used to identify 19 3.1 Set of tag in Penn Treebank 24 3.2 A sample of labeled data in Penn Treebank before manually treatment 25 3.3 A sample of labeled data in Penn Treebank after manually treatment 25 3.4 Tagset of Penn Treebank 26 3.5 A sample of complete data in English and Vietnamese 27 4.1 The Bikel's system overview 34 4.2 Result of testing standard Collins' model with training data's size change from 60% to 100% of the full data Where series and series stand for testing on sentences with length less equal 40 and 100 respectively v 43 List of Tables 2.1 Analysis table with CYK algorithm 11 3.1 POS tagset in Viet Treebank 28 3.2 Phrase tagset in Viet Treebank 28 3.3 Clause tagset in Viet Treebank 29 3.4 Syntax function tagset in Viet Treebank 29 4.1 The initial results on Viet Treebank with different configurations Key:CB = average crossing brackets, 0CB = zero crossing brackets, ≤ 2CB =≤ crossing brackets All results are percentages, except for those in the CB column 41 4.2 Number of sentence for training 42 4.3 The results with the change of the training data set 42 4.4 The error rate We use 520 sentences for development testing Then filtering sentences which have the F-score less than 70 % As the result, we collect 147 sentences into the set of error sentences The Percentage of a error is calculated by the number of sentences commit this error divide 147 Because a sentence may be some errors so the total percentage may exceed 100 44 4.5 The obtained results after applying some proposal rules to correct some wrong syntactic parsing vi 44 Chapter Introduction For a long time, human being have always dreamed of an intelligent machine which can listen to, understand and implement humans' requirements Many scientists have tried to make that dream and devoted many achievements for the science of artificial intelligence In artificial intelligence, natural language processing (NLP) is a field which studies on how to understand and generate automatically human language NLP has many practical applications such as machine translation, information extraction, discourse analysis, text summarization These applications have the same basic problems such as lexical analysis, syntactic parsing and semantic analysis In which, syntactic parsing is the central role and it is also the goal of this thesis 1.1 What is syntactic parsing? Syntactic parsing (parsing or syntactic analysis ) is the process of analyzing a given sequence of tokens (i.e a sentence) to identify their grammatical structure with respect to a given grammar The grammatical structure is often represented in the form which displays visually the dependence of components as a tree (is called parse tree or syntactic tree) In other words, parsing is the problem to get a given sequence of words as input and output is the parse trees corresponding to that sequence Figure 1.1 shows examples for parse tree: a) a English parse tree in usual form and b) a Vietnamese tree in other form Parsing is the major module of a grammar checking system In order to check grammar, we need to parse input sentences, then examine the correctness of the structures in the output Furthermore, a sentence which cannot be parsed may have grammatical errors 1.1 What is syntactic parsing? Figure 1.1: The parse tree of sentence "I go to school" Figure 1.2: A parse tree in Vietnamese Parsing is also the important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question answering, and information extraction For example, in transfer-based machine translation the system will analyze the source sentence to output a parse tree and then construct the equivalent parse tree in the target language The output sentence will be generated mainly based on this equivalent parse tree It is to understand that in a question answering system we need parsing to find out which is the subject, object, or action It is also interesting that parsing can help speech processing It supports to correct the fault of the speech recognition process On the other hand, in speech synthesis parsing help put stress on the correct position in the sentence 1.2 Current Studies in Parsing Through these above example we can see that construct an accurate and effective parser will bring great benefits to many applications of natural language processing 1.2 Current Studies in Parsing As one of the basic and central problem of NLP, parsing attracts many studies They belong to one of the two approaches: rule-based and statistics-based In conventional parsing systems, a grammar is hand-crafted, often involves a large amount of lexically specific information in the form of sub-categorization information In there, ambiguity, a major problem in parsing, is solved through selectional restrictions For example, a lexicon might specify that "eat" must take an object with the feature + "food" In (Collins, 1999), the author has showed several problems with selectional restrictions such as increasing the volume of information required when the vocabulary size becomes so large In the other word, the biggest challenge is the large amount of vocabulary to require both selectional restrictions and structural preference should be encoded as the soft preferences instead of hard constraints To overcome these obstacles, the researchers began to explore machine-learning approaches to parsing problem, primary through statistical models In these approaches, a set of example pairs of sentence and the corresponding syntactic tree is annotated by hand and used to train parsing models A set of trees is called a " treebank" Several parts of the treebank are reserved as test data for evaluating the model's accuracy Early works investigate the use of probabilistic context free grammar (PCFG) Using PCFG is considered as the next generation of parsing and is also as a beginning step in statistical parsing In a PCFG, each grammar rule is associated with a probability The probability of a parse tree is the product of the probabilities of all rules used in that tree In the case, parsing is essentially the process of searching the tree that has the maximum probability However, a simple PCFG often fail due to its lack of sensitivity to lexical information and structural preferences Then some solutions were proposed to resolve this problem Several directions were listed in (Collins, 1999) such as: towards probabilistic version of lexicalized grammars; using supervised training algorithms; to construct models that had increased structural sensitivity; to look into history-based models Among them lexicalized probabilistic context free grammar (LPCFG) is a promising approach It can solve many ambiguity phenomena in parsing Some works that based on this approach achieved high performance, such as in (Collins, 1997) After this research, Daniel M Bikel and 1.3 Vietnamese syntactic parsing his coworker have developed Collins models and designed a parser for multiple languages It has been applied successfully for some languages as English, Chinese and Arabic (Bikel, 2004) The concrete results of the parser for these languages is reported in (Bikel, 2004): For English, F-measure is 90.01 %; for Chinese, F-measure is 81.2 % and in Arabic F-measure is 75.7 % According to these results and the comparison between current parsers (e.g Charniak parser, Bekelley parser, Standford parser), Bikels parser is still rated one of the best parser at present Recently, this approach of using LPCFG continues being applied for many languages Moreover, a number of new strategies have proposed to improve the accuracy of parsers In several researches, using semi-supervised training methods becomes a promising approach Their experimental results show that this approach outperforms the supervised one, without much additional computational cost Some other studies has integrated semantic information into parsing in order to fully exploit the benefits of lexical resources and upgrade the parser, such as in (Xiong et al., 2005), (Agirre & Baldwin, 2008) (Xiong et al., 2005) described the way of incorporating semantic knowledge as follow: Firstly, they used two Chinese electronic semantics dictionaries and heuristic rules in order to extract semantic categories Then they built a selection preference sub-model based on extracted semantic categories Similarly, in (Agirre & Baldwin, 2008), the sense information was added to parsing by substituting the original words with their semantic tags which correspond with their semantic classes, for example knife and scissors belong to TOOL class, cake and pork are assigned to FOOD class In addition, some other suggested tactics have been enhanced of the performance of parsing as a powerful learning technique (sample selection) for reducing the amount of human-labeled training data (Carreras et al., 2008); or, a strategy for utilizing POS tags resources to annotate parser input in (Watson et al., 2007) Through the review of approaches in parsing and especially some recent studies, we found that LPCFG appears in all of state-of-the-art parsing systems LPCFG Therefore in our opinion, LPCFG is a good choice for Vietnamese parsing 1.3 Vietnamese syntactic parsing In Vietnam, works in natural language processing (i.e computational linguistics) in general and in parsing in particular have been only motivated very recently A few of the parsers which follow the knowledge-based approach are constructed with the manual 1.4 Objective of the Thesis grammar rules Since the construction of grammar rules is manual, the accuracy of the parser is not high It only analyzes a limited number of sentences generated by the grammar The approach using statistics has been also studied, but also only at brief and has no experimental results For example, (Quoc-The & Thanh-Huong, 2008) presented about LPCFG but surprisingly it did not provide any experiment, only some examples were provided to illustrate the syntactic ambiguity of Vietnamese With such restricted results, no Vietnamese parser has been published widely It can say that while many countries in the world has gone forward a long way in parsing, Vietnam has just been at the stage to start The precondition for deployment these models for Vietnamese is a corpus containing parsed sentences which is the crucial resources for statistical parsing Since lack of corpus, the previous works on Vietnamese parsing have not had the significant experimental results Fortunately, at present there is a standard Vietnamese parsed corpus, called Viet Treebank, which developed in a project supported by Vietnamese government This corpus involves about 10,000 parsed sentences followed Penn Treebanks format, and therefore we can apply it to the Bikel's tool As mentioned above, lexicalized models have been applied successfully for multiple languages Among these languages, Chinese obtained 81.2 % F-score on average as Bikel shown in (Bikel, 2004) This result also motivated our study since the syntactic structure of Chinese is similarities to the syntactic structure of Vietnamese 1.4 Objective of the Thesis This thesis focuses on building a syntactic parser for Vietnamese using LPCFG approach We will use Viet Treebank as the parsed corpus and adapt Bikel's parsing tool for Vietnamese Then, we will also investigate some common errors appearing in the parsing, and propose a solution to improve accuracy of the parser To achieve this objective, this study has to find the answers for following questions: How to adapt Bikel's system for Vietnamese? How is the initial result? Which model/configure is appropriate for Vietnamese? How to improve performance of the system based on analyzing errors? In summary, in this study we try to carry out the following tasks - Study the basic techniques and methods in parsing, focusing on lexicalized statistical approaches; ... precondition for deployment these models for Vietnamese is a corpus containing parsed sentences which is the crucial resources for statistical parsing Since lack of corpus, the previous works on Vietnamese. .. in parsing and especially some recent studies, we found that LPCFG appears in all of state-of-the-art parsing systems LPCFG Therefore in our opinion, LPCFG is a good choice for Vietnamese parsing. .. the answers for following questions: How to adapt Bikel''s system for Vietnamese? How is the initial result? Which model/configure is appropriate for Vietnamese? How to improve performance of

Định dạng
Số trang	54
Dung lượng	0,99 MB