Ripple Down Rules for Part-of-Speech Tagging

Dat Quoc Nguyen 1, Dai Quoc Nguyen 1, Son Bao Pham 1,2, and Dang Duc Pham 1

1 Human Machine Interaction Laboratory, Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
{datnq,dainq,sonpb,dangpd}@vnu.edu.vn
2 Information Technology Institute, Vietnam National University, Hanoi

Abstract. This paper presents a new approach to learning a rule-based system for the task of part-of-speech tagging. Our approach is based on an incremental knowledge acquisition methodology in which rules are stored in an exception structure and new rules are only added to correct errors of existing rules, thus allowing systematic control of the interaction between rules. Experimental results of our approach on English show that we achieve the best accuracy published to date: 97.095% on the Penn Treebank corpus. We also obtain the best performance on the Vietnamese VietTreeBank corpus.

1 Introduction

Part-of-speech (POS) tagging is one of the most important tasks in Natural Language Processing; it assigns to each word in a text a tag representing the word's lexical category. Once a text is tagged or annotated, it can be used in many applications such as machine translation and information retrieval. A number of approaches to this task have been proposed that achieve state-of-the-art results, including Hidden Markov Model-based approaches [1], Maximum Entropy Model-based approaches [2][3][4], Support Vector Machine-based approaches [5], and Perceptron learning algorithms [2][6]. All of these are complex statistics-based approaches, and their results are approaching the performance limit. A combination that exploits the advantages of simple rule-based systems [7] could surpass this limit; however, it is difficult to control the interaction among a large number of rules.

Brill [7] proposed a method to automatically learn transformation rules for the POS tagging problem. In Brill's learning, the rule with the highest score is selected and learned on the context generated by all preceding rules. In addition, rules interact only in front-back order, which means that a later rule, once applied, changes the output of all earlier rules over the whole text. Hepple [8] presented an approach with two assumptions that disable interactions between rules to reduce the training time, at the cost of a small drop in accuracy. Ngai and Florian [9] presented a method that greatly reduces the training time by recomputing the scores of transformation rules while keeping the accuracy.

In this paper, we propose a failure-driven approach to automatically restructure transformation rules in the form of a Single Classification Ripple Down Rules (SCRDR) tree [10][11][12]. Our approach allows interactions between rules, but a rule only changes the results of selected previous rules in a controlled context. All rules are structured in an SCRDR tree, which allows a new exception rule to be added whenever the system returns an incorrect classification. Moreover, our system can easily be combined with an existing part-of-speech tagger to obtain an even better result. For Vietnamese, we obtain the highest accuracy reported to date on the VietTreebank corpus [13]. In addition, our approach obtains promising results in terms of training time in comparison with Brill's learning.
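To make the exception structure concrete, the following is a minimal sketch of an SCRDR node in Python. The class, field, and method names, and the representation of a rule condition as a callable, are illustrative assumptions for exposition and are not taken from the authors' implementation.

```python
# Minimal sketch of a Single Classification Ripple Down Rules (SCRDR) node.
# Assumption: a rule condition is any callable taking a context (e.g. a dict
# describing the current word and its neighbours) and returning True/False.

class SCRDRNode:
    def __init__(self, condition, conclusion):
        self.condition = condition      # callable(context) -> bool
        self.conclusion = conclusion    # e.g. the POS tag to assign
        self.except_child = None        # tried when this rule fires
        self.if_not_child = None        # tried when this rule does not fire

    def evaluate(self, context):
        """Return the conclusion of the last rule that fires on the evaluation path."""
        if self.condition(context):
            # This rule fires; a deeper exception rule may override its conclusion.
            if self.except_child:
                overridden = self.except_child.evaluate(context)
                if overridden is not None:
                    return overridden
            return self.conclusion
        # This rule does not fire; fall through to the if-not branch.
        if self.if_not_child:
            return self.if_not_child.evaluate(context)
        return None

    def add_exception(self, condition, conclusion):
        """Attach a new exception rule to correct an incorrect classification."""
        node = SCRDRNode(condition, conclusion)
        if self.except_child is None:
            self.except_child = node
        else:
            # Further exceptions are chained along the if-not branch.
            current = self.except_child
            while current.if_not_child:
                current = current.if_not_child
            current.if_not_child = node
        return node
```

In this sketch, a new exception rule is attached under the node whose conclusion was returned for the misclassified case, so the correction only affects cases that reach that node; this is how the interaction between rules stays local and controlled.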
The rest of the paper is organized as follows: in section 2, we provide some related works, including Brill's learning and the SCRDR tree, and we describe our approach in section 3. We describe our experiments in section 4 and the discussion in section 5. The conclusion and future work are presented in section 6.

2 Related Works

2.1 Transformation-Based Learning

The well-known transformation-based error-driven learning method was introduced by Brill [7] for the POS tagging problem, and it has been used in many natural language processing tasks, for example text chunking, parsing, and named entity recognition. The key idea of the method is to compare the golden corpus, which is correctly tagged, with the current corpus created by an initial tagger, and then to automatically generate rules that correct errors, based on predefined templates. For example, the template "transfer the tag of the current word from A to B if the next word is W" yields rules such as "transfer the tag of the current word from JJ to NN if the next word is of" or "transfer the tag of the current word from VBD to VBN if the next word is by".

The transformation-based learning algorithm runs over multiple iterations as follows (the scoring in step 3 is illustrated in the sketch below):

– Input: the raw corpus, which contains the entire untagged text extracted from the golden corpus of manually tagged word/tag pairs.
– Step 1: The annotated corpus is generated by running an initial tagger on the raw corpus.
– Step 2: The annotated corpus is compared with the golden corpus to determine the tagging errors in the annotated corpus. From these errors, all templates are used to create potential rules.
– Step 3: Each rule is applied to a copy of the annotated corpus. The score of a rule is computed by subtracting the number of additional errors it introduces from the number of tags it correctly changes. The rule with the best score is selected.
– Step 4: The annotated corpus is updated by applying the selected rule.
– Step 5: Stop if the best score is smaller than a predefined threshold T; otherwise repeat from step 2.
– Output: a front-back ordered list of transformation rules.

The training process of Brill's tagger includes two phases:

– The first phase is used to assign the most likely tag to unknown words. Initially, the most likely tag for an unknown word starting with a capital letter is NNP, and otherwise it is NN. In this phase, lexical transformation rules are used to predict the most likely tag for unknown words. The transformation templates in this phase depend on the character(s), prefix, and suffix of a word and only on the preceding/following word. For example, "change the most likely tag of an unknown word to Y if the word has suffix x, |x|
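To make the rule scoring in step 3 above concrete, the sketch below scores a candidate transformation rule against the golden corpus in Python. The data layout (the corpus treated as a single flat list of word/tag pairs) and the function names are simplifying assumptions for illustration, not Brill's actual implementation.

```python
# Illustrative sketch of scoring one candidate transformation rule (step 3).
# Simplifying assumptions: the corpus is one flat list of (word, tag) pairs,
# and a rule is a callable mapping (position, tagged sequence) to a
# replacement tag or None.

def score_rule(rule, annotated, golden):
    """Score = (number of correctly changed tags) - (number of additional errors)."""
    corrected, broken = 0, 0
    for i, ((_, current_tag), (_, gold_tag)) in enumerate(zip(annotated, golden)):
        new_tag = rule(i, annotated)
        if new_tag is None or new_tag == current_tag:
            continue                 # rule leaves this position unchanged
        if current_tag != gold_tag and new_tag == gold_tag:
            corrected += 1           # an existing error is fixed
        elif current_tag == gold_tag and new_tag != gold_tag:
            broken += 1              # a previously correct tag is broken
    return corrected - broken

def make_rule(tag_a, tag_b, next_word):
    """Rule from the template: transfer tag from A to B if the next word is W."""
    def rule(i, tagged):
        if tagged[i][1] == tag_a and i + 1 < len(tagged) and tagged[i + 1][0] == next_word:
            return tag_b
        return None
    return rule

# Example: "transfer tag of current word from VBD to VBN if the next word is by",
# scored on a tiny toy corpus constructed here purely for illustration.
vbd_to_vbn = make_rule("VBD", "VBN", "by")
annotated = [("invaded", "VBD"), ("by", "IN"), ("Rome", "NNP")]
golden    = [("invaded", "VBN"), ("by", "IN"), ("Rome", "NNP")]
print(score_rule(vbd_to_vbn, annotated, golden))   # prints 1: one error fixed, none broken
```

In each iteration of the main loop, every candidate rule generated from the templates would be scored in this way, and the highest-scoring rule is then applied to the annotated corpus before the next iteration begins.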