Combining statistical machine learning with transformation rule learning for Vietnamese Word Sense Disambiguation

Phu-Hung Dinh, Ngoc-Khuong Nguyen, and Anh-Cuong Le
Dept. of Computer Science, University of Engineering and Technology
Vietnam National University, Ha Noi
144 Xuan Thuy, Cau Giay, Ha Noi, Viet Nam
{hungdp@wru.edu.vn, khuongnn.mcs09@vnu.edu.vn, cuongla@vnu.edu.vn}

Abstract—Word Sense Disambiguation (WSD) is the task of determining the right sense of a word depending on the context in which it appears. Among the various approaches developed for this task, statistical machine learning methods have shown advantages in comparison with the others. However, there are some cases that cannot be solved by a general statistical model. This paper proposes a novel framework in which we use the rules generated by transformation-based learning (TBL) to improve the performance of a statistical machine learning model. This framework can be considered a combination of a rule-based method and a statistics-based method. We have developed this method for the problem of Vietnamese WSD and achieved some promising results.

Index Terms—Machine Learning, Transformation Based Learning, Naive Bayesian classification

I. INTRODUCTION

Because of the ambiguity of natural languages, a word may have multiple meanings (senses). Practically speaking, an ambiguous word has ambiguity regarding both its part-of-speech and its meaning; WSD usually aims to disambiguate the meaning of a word within a specific part-of-speech. A word that has several meanings in a specific part-of-speech is called polysemous. For example, the noun "bank" has at least two different meanings: "bank" in "Bank of England" and "bank" in "river bank". Polysemous words also exist in Vietnamese. For instance, consider the following sentences:

• Anh ta câu cá ở ao. (He is fishing in a pond.)
• Đại bác câu trúng lô cốt. (The guns lobbed home shells on the blockhouse.)

The occurrences of the word "câu" in the two sentences clearly denote different meanings: "to fish" and "to lob". WSD means determining the right sense of such a word in its particular context. Success at this task benefits many Natural Language Processing (NLP) problems such as information retrieval, machine translation, human-computer communication, and so on.

The automatic disambiguation of word senses has received attention since the 1950s [1]. Since that time, many studies have investigated various methods for this problem, but the performance of available WSD systems and published results is still limited. These methods can be divided mainly into two approaches: knowledge-based and machine learning (using corpora).

Knowledge-based methods rely on previously acquired linguistic knowledge: the WSD task is performed by matching the context in which a word appears against information from an external knowledge source. Methods in this approach are based on knowledge resources such as the WordNet thesaurus, as well as grammar rules or hand-coded rules for disambiguation (see [1] for more detail and discussion).

In the machine learning approach, empirical and statistical methods have attracted most studies in the NLP field since the 1990s. Many machine learning methods have been applied to a large variety of NLP tasks (including WSD) with remarkable success. The methods in this approach use techniques from statistics and machine learning to induce models of language usage from large samples of text. Depending on whether they use labeled data, unlabeled data, or both, machine learning methods can be divided into three groups: supervised, unsupervised, and semi-supervised.
Because supervised systems are based on annotated data, they usually achieve better results. Many machine learning methods have been applied in WSD systems, such as maximum entropy models [2], [3], support vector machines (SVM) [4], decision lists [5], [6], and the Naive Bayesian (NB) classifier [7], [8]. Other studies have tried to use linguistic knowledge from dictionaries and thesauri, as in [9], [10].

The machine learning approach seems to show advantages in comparison with the knowledge-based approach. While the knowledge-based approach relies on rules generated by experts, and therefore on their ability, and meets difficulties when a large number of cases must be covered, the machine learning approach can solve the problem on a large scale without paying much attention to linguistic aspects. However, the results obtained for WSD (e.g., in English) are still far from applicable in a real system. Although the average accuracy on Senseval-2 and Senseval-3 is around 70%, some other studies such as [13] achieve higher accuracy (about 90% for several words) when implemented on large training data. (See http://www.senseval.org/ for more detail about these corpora.)

From our observation, the first reason for the unexpected results of statistical machine learning WSD systems is a sparse corpus. The second reason is that, for any NLP problem (and particularly for WSD), there are usually exceptional cases that do not follow a general principle (or model). Therefore, in this paper we focus on correcting the cases that may be misclassified by a statistical machine learning system. Borrowing the idea from the knowledge-based approach, but instead of having the rules written by an expert, we apply the techniques of TBL to produce the rules automatically. Firstly, a machine learning model is trained on the training corpus and used as the initial classification system for a TBL-based error-driven learning procedure, which amends the initial predictions on a development corpus; consequently, a set of TBL rules is produced. Secondly, in the final model, we first use the machine learning model to detect senses of the polysemous words and then apply the obtained transformation rules to the results of that first step to obtain the final senses.

The paper is organized into six parts, including this introduction. In Section II, we present the background, including TBL and a statistical machine learning method. The detail of our proposed model is then presented in Section III. In Section IV, we present feature selection and rule template selection. Data preparation and experiments are presented in Section V. Finally, we conclude the paper in Section VI.

II. BACKGROUND

In this section we introduce NB classification (in the corpus-based approach) and TBL (in the rule-based approach), which are the two basic methods used in the combination method we propose.

A. NB model

The NB method has been used in much classification work and was first used for WSD by Gale [7]. NB classifiers work on the assumption that all the feature variables representing a problem are conditionally independent given the classes. Assume that the polysemous word w is being disambiguated. Suppose that w has a set of potential senses (classes) S = {s_1, ..., s_c}, and that a given context of w is represented by a set of features F = {f_1, ..., f_n}. Bayesian theory suggests that the word w should be assigned to the class s_k whose posterior probability is maximum, namely

    s_k = \arg\max_{s_j} P(s_j | F), \quad j \in \{1, \ldots, c\}

where the value of P(s_j | F) is computed by the following equation:

    P(s_j | F) = \frac{P(s_j) P(F | s_j)}{P(F)}

P(F) is constant over all senses and therefore does not influence the maximization of P(s_j | F). Using the conditional independence assumption, the sense s_k of w is then:

    s_k = \arg\max_{s_j} P(s_j | F)
        = \arg\max_{s_j} \frac{P(s_j) P(F | s_j)}{P(F)}
        = \arg\max_{s_j} P(s_j) \prod_{i=1}^{n} P(f_i | s_j)
        = \arg\max_{s_j} \left[ \log P(s_j) + \sum_{i=1}^{n} \log P(f_i | s_j) \right]

The values of P(s_j) and P(f_i | s_j) are computed via maximum-likelihood estimation as:

    P(s_j) = \frac{C(s_j)}{N} \quad \text{and} \quad P(f_i | s_j) = \frac{C(f_i, s_j)}{C(s_j)}

where C(f_i, s_j) is the number of occurrences of f_i in a context of sense s_j in the training corpus, C(s_j) is the number of occurrences of s_j in the training corpus, and N is the total number of occurrences of the polysemous word w, i.e., the size of the training dataset. To avoid the effect of zero counts when estimating the conditional probabilities of the model, we set P(f_i | s_j) equal to 1/N for each sense s_j when meeting a new feature f_i in a context of the test dataset.

The NB algorithm for WSD is presented as follows:

Training:
  for all senses s_j of w
    for all features f_i extracted from the training data
      P(f_i | s_j) = C(f_i, s_j) / C(s_j)
    end
  end
  for all senses s_j of w
    P(s_j) = C(w, s_j) / C(w)
  end

Disambiguation:
  for all senses s_j of w
    score(s_j) = log(P(s_j))
    for all features f_i in the context window c
      score(s_j) = score(s_j) + log(P(f_i | s_j))
    end
  end
  choose s_k = \arg\max_{s_j} score(s_j)
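For illustration, the following is a minimal Python sketch of this classifier. The class name, the representation of a context as a precomputed feature tuple, and the toy data are illustrative choices rather than part of the original system:

```python
# Minimal sketch of the NB disambiguator for a single polysemous word.
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    def train(self, examples):
        """examples: list of (features, sense) pairs, one per labeled context."""
        self.n = len(examples)                   # N: size of the training set
        self.sense_counts = Counter()            # C(s_j)
        self.feat_counts = defaultdict(Counter)  # C(f_i, s_j)
        for features, sense in examples:
            self.sense_counts[sense] += 1
            for f in features:
                self.feat_counts[sense][f] += 1

    def disambiguate(self, features):
        """Return argmax_s [log P(s) + sum_i log P(f_i | s)]."""
        best_sense, best_score = None, float("-inf")
        for sense, c_s in self.sense_counts.items():
            score = math.log(c_s / self.n)       # log P(s_j) = log C(s_j)/N
            for f in features:
                c_fs = self.feat_counts[sense][f]
                # Unseen feature: fall back to 1/N, as described above.
                p = c_fs / c_s if c_fs else 1.0 / self.n
                score += math.log(p)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

# Toy usage with two contexts of the word "câu" and hypothetical features:
nb = NaiveBayesWSD()
nb.train([(("cá", "ao"), "to_fish"), (("đại_bác", "trúng"), "to_lob")])
print(nb.disambiguate(("cá",)))  # -> to_fish
```

Note that scores are accumulated in log space, exactly as in the disambiguation loop above, which avoids numerical underflow when many feature probabilities are multiplied.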
B. Transformation-Based Learning

TBL is known as one of the most successful methods in the rule-based approach for many NLP tasks because it provides a way of learning the rules automatically. Brill [11] introduced TBL and showed that it can perform part-of-speech tagging with fairly high accuracy. The same method can be applied to many natural language processing tasks, for example text chunking, parsing, named entity recognition, and word sense disambiguation. The method's key idea is to compare the golden-corpus, which is correctly tagged, with the current-corpus created by an initial tagger, and then automatically generate rules that correct the errors, based on predefined templates. The transformation-based learning algorithm runs in multiple iterations, as follows:

Input: Raw-corpus containing the entire raw text without labels, extracted from the golden-corpus that contains manually labeled context/label pairs.
• Step 1: Generate the initial-corpus by running an initial labeler on the raw-corpus.
• Step 2: Compare the initial-corpus with the golden-corpus to determine the initial-corpus's label errors, from which all rule templates are used for creating potential rules.
• Step 3: Apply each potential rule to a copy of the initial-corpus. The score of a rule is computed by subtracting the number of additional errors it introduces from the number of labels it correctly changes. The rule with the best score is selected.
• Step 4: Update the initial-corpus by applying the selected rule, and move this rule to the list of transformation rules.
• Step 5: Stop if the best score is smaller than a predefined threshold T; otherwise repeat from Step 2.
Output: List of transformation rules.
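The greedy selection loop can be sketched in Python as follows. This is a simplified rendering under the assumption that a corpus is a list of parallel contexts and labels, and that a rule object exposes an `applies(context, label)` test and a `target` label; these names are our own, not fixed by [11]:

```python
# Sketch of the greedy TBL loop: repeatedly pick the rule with the best
# score (corrections minus newly introduced errors) and apply it.
def tbl_select_rules(contexts, initial_labels, gold_labels,
                     generate_rules, threshold=1):
    current = list(initial_labels)
    selected = []
    while True:
        # Candidate rules are generated from the current errors via templates.
        errors = [i for i, (c, g) in enumerate(zip(current, gold_labels))
                  if c != g]
        best_rule, best_score = None, 0
        for rule in generate_rules(contexts, current, gold_labels, errors):
            s1 = s2 = 0  # s1: right -> wrong, s2: wrong -> right
            for ctx, cur, gold in zip(contexts, current, gold_labels):
                if rule.applies(ctx, cur):
                    if cur == gold and rule.target != gold:
                        s1 += 1
                    elif cur != gold and rule.target == gold:
                        s2 += 1
            if s2 - s1 > best_score:
                best_rule, best_score = rule, s2 - s1
        if best_rule is None or best_score < threshold:
            break  # no rule clears the threshold T
        for i, ctx in enumerate(contexts):  # apply the winner, then iterate
            if best_rule.applies(ctx, current[i]):
                current[i] = best_rule.target
        selected.append(best_rule)
    return selected
```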
III. OUR APPROACH

In this section, we describe our approach to inducing a model that corrects the mis-tagged senses of a statistical machine learning model (note that we choose the NB classification model here). The model includes a training phase and a test phase. Notice that in this model we use a training dataset for generating an NB classifier and then a development dataset for learning the transformation rules. These two sets of tagged data are constructed by manually labeling a set of selected contexts of the polysemous word.

A. The training phase

The training process consists of two stages. In the first stage, the list error is determined based on the NB model. This stage is described as follows (shown in Figure 1).

Input: Training-corpus and developing-corpus containing manually labeled context/label pairs.
• Step 1: Obtain the raw developing-corpus by removing the labels from the developing-corpus.
• Step 2: Use the training-corpus to train an NB classification model. This classification model is then run on the raw developing-corpus obtained in Step 1; the obtained result is called the initial-corpus.
• Step 3: Compare the initial-corpus with the developing-corpus to determine the list of all contexts that received wrong labels from the NB classification model.
Output: List of contexts with wrong labels (we call it the list error, as shown in Figure 1).

Figure 1. The diagram describing the training algorithm (first stage).

In the second stage, the set of TBL rules is determined by applying the TBL algorithm to the list error obtained in the first stage. Notice that in this stage we use predefined templates for generating potential TBL rules (described in detail in Section IV.B). This stage proceeds as follows (shown in Figure 2).

Input: Developing-corpus, initial-corpus, and the list error.
• Step 1: Apply the rule templates to the list error to generate a list of transformation rules (called the list of potential rules).
• Step 2: Apply each rule in the list of potential rules to a copy of the initial-corpus. The score of each rule is calculated as s2 − s1, where s1 is the number of cases in which right labels are transformed into wrong labels, and s2 is the number of cases that are corrected. The rule with the highest score is selected.
• Step 3: Update the initial-corpus by applying the rule with the highest score, and move this rule to the selected TBL rules. The list error is also updated accordingly by comparing the initial-corpus with the developing-corpus.
• Step 4: Stop if the highest score is smaller than a predefined threshold T; otherwise go back to Step 1.
Output: List of transformation rules (i.e., the selected TBL rules).

Figure 2. The diagram describing the training algorithm (second stage).

B. The test phase

The proposed approach uses the selected TBL rules obtained in the training phase for testing, as follows (shown in Figure 3).

Input: Test-corpus and the selected TBL rules.
• Step 1: Obtain the raw test-corpus by removing the labels from the test-corpus.
• Step 2: Run the NB classification model on the raw test-corpus to obtain the so-called initial-corpus.
• Step 3: Apply the selected TBL rules to the initial-corpus to create the labeled corpus.
• Step 4: Compare the labeled corpus with the test-corpus to evaluate the system (i.e., to get the accuracy).
Output: Accuracy of the proposed model.

Figure 3. The diagram describing the test phase.
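Assuming the `NaiveBayesWSD` model and rule objects from the sketches in Section II (again, illustrative names only), the per-context decision of the test phase reduces to a few lines:

```python
# Sketch of the final model's decision for one context: NB proposes a
# sense, then the selected TBL rules correct it.
def classify(context, features, nb_model, tbl_rules):
    label = nb_model.disambiguate(features)  # Step 2: initial NB prediction
    for rule in tbl_rules:                   # Step 3: transformation rules,
        if rule.applies(context, label):     # in the order they were learned
            label = rule.target
    return label
```

Applying the rules in the order they were selected matters: each later rule was learned on a corpus already corrected by the earlier ones.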
IV. FEATURES AND RULE TEMPLATES

A. Feature Selection

One of the most important tasks in WSD is the determination of useful information related to word senses. In the corpus-based approach, most studies have considered only the information extracted from the context in which the target word appears. Context is the only means of identifying the meaning of a polysemous word; therefore, all work on sense disambiguation relies on the context of the target word to provide the information used for its disambiguation. For corpus-based methods, context also provides the prior knowledge with which the current context is compared to achieve disambiguation.

Suppose that w is the polysemous word to be disambiguated and S = {s_1, s_2, ..., s_m} is the set of its potential senses. A given context W of w is represented as:

    W = { w_{-3}, w_{-2}, w_{-1}, w_0, w_1, w_2, w_3 }

i.e., W is a context of w within a window (−3, +3) in which w_0 = w is the target word and, for each i ∈ {−3, ..., +3}, w_i is the word appearing at position i relative to w. Based on previous studies [12], [13], [14] and our experiments, we propose to use two kinds of knowledge and represent them as subsets of features, as follows:

• Bag-of-words, F1(l, r) = {w_{-l}, ..., w_{+r}}: We choose F1(−3, +3), which has seven elements (features):
  F1(−3, +3) = {w_{-3}, w_{-2}, w_{-1}, w_0, w_1, w_2, w_3}
• Collocations of words, F2 = {w_{-l} ... w_{+r}}: We choose collocations whose lengths (including the target word) are less than or equal to 4, i.e., (l + r + 1) ≤ 4. This gives nine elements (features):
  F2 = {w_{-1} w_0, w_0 w_1, w_{-2} w_{-1} w_0, w_{-1} w_0 w_1, w_0 w_1 w_2, w_{-2} w_{-1} w_0 w_1, w_{-1} w_0 w_1 w_2, w_{-3} w_{-2} w_{-1} w_0, w_0 w_1 w_2 w_3}

In summary, we obtain 16 features, denoted (f_1, f_2, ..., f_16); a sketch of their extraction is given at the end of this section. These features are used in the NB classification model and for building the TBL rules.

B. Rule templates for building TBL rules

Rule templates are an important part of the TBL algorithm: they are used for automatically generating TBL rules. Based on previous studies [15], [16] and the features presented above, we propose rule templates of the following forms (shown in Figure 4):

    A → B word C @ [ -1 ]
    A → B word C @ [ 1 ]
    A → B word C @ [ -2 ] & word D @ [ -1 ]
    A → B word C @ [ -1 ] & word D @ [ 1 ]
    A → B word C @ [ 1 ] & word D @ [ 2 ]

Figure 4. The rule templates.

For example, some of the rule templates are read as follows. The template "A → B word C @ [ 1 ]" means "change the label of the current word from A to B if the next word is C", and the template "A → B word C @ [ -1 ] & word D @ [ 1 ]" means "change the label of the current word from A to B if the previous word is C and the next word is D".
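As a concrete illustration of this feature set and of an instantiated template, the sketch below builds the 7 + 9 = 16 features over a padded (−3, +3) window and encodes the template "A → B word C @ [k]" as a rule object compatible with the TBL sketch of Section II. The encodings (feature tuples, padding symbols, the pair (words, pos) as a context) are our own assumptions; the paper does not fix a concrete representation:

```python
# Sketch of the 16-feature context representation (F1 and F2).
def extract_features(words, pos):
    """words: a word-segmented context; pos: index of the target word w_0."""
    w = ["<s>"] * 3 + list(words) + ["</s>"] * 3   # pad so w_-3..w_3 exist
    window = {i: w[pos + 3 + i] for i in range(-3, 4)}
    # F1(-3,+3): the seven bag-of-words features w_-3 .. w_3.
    f1 = [("bow", i, window[i]) for i in range(-3, 4)]
    # F2: the nine collocations of length <= 4 that contain the target word.
    spans = [(-1, 0), (0, 1), (-2, 0), (-1, 1), (0, 2),
             (-2, 1), (-1, 2), (-3, 0), (0, 3)]
    f2 = [("col", l, r, " ".join(window[i] for i in range(l, r + 1)))
          for l, r in spans]
    return tuple(f1 + f2)                          # 7 + 9 = 16 features

class WordAtRule:
    """Instantiation of the template 'A -> B word C @ [k]'."""
    def __init__(self, source, target, word, k):
        self.source, self.target, self.word, self.k = source, target, word, k

    def applies(self, context, label):
        words, pos = context  # a segmented sentence plus target position
        j = pos + self.k
        return (label == self.source and 0 <= j < len(words)
                and words[j] == self.word)

# Example: features for "câu" in the segmented sentence "Anh ta câu cá ở ao".
feats = extract_features(["anh_ta", "câu", "cá", "ở", "ao"], 1)
```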
V. EXPERIMENT

A. Data preparation

Since we treat WSD as a classification problem, we need an annotated corpus for this task. For English, many studies use corpora such as Senseval-1, Senseval-2, Senseval-3, and so on. Because no standard corpus exists for Vietnamese, it is necessary to build a training corpus. To this end, we first used a crawler to collect data from web sites and obtained about 1.2GB of raw data (approximately 120,000 articles from more than 50 Vietnamese web sites such as www.vnexpress.net, www.dantri.com.vn, etc.).

We then extract from this corpus the contexts (containing several sentences around the ambiguous word) for 10 ambiguous words. For example, a context for the ambiguous word "bạc" is shown in Figure 5.

    Trọng tâm tháng hòa hợp gia đình, thành viên đồng thuận đường nghiệp bạn. Giữa tháng 3, tình hình tài bạn cải thiện nhiều. Tiền bạc đổ dồn về, phải biết cách chi tiêu hợp lý. Đây khoảng thời gian thích hợp để bạn đầu tư vào tài sản cố định. Nếu may mắn, bạn thu khoản tiền lớn.

Figure 5. A context of the word "bạc".

After that, these contexts of the 10 ambiguous words are manually labeled to obtain the labeled corpus. Table I describes in detail the number of samples and senses of the ambiguous words.

Table I. STATISTICS ON THE LABELED DATA

No  Word  Part of speech  Senses  Examples
1   Bạc   Noun            –       1224
2   Bạc   Adj             –       552
3   Cất   Verb            –       1203
4   Câu   Noun            –       3142
5   Câu   Verb            –       295
6   Cầu   Noun            –       1174
7   Khai  Verb            –       3459
8   Pha   Verb            –       592
9   Phát  Verb            –       2151
10  Sắc   Noun            –       2000

To conduct the experiment we build several datasets as follows. Firstly, we divide the labeled corpus into two parts at a ratio of 3:1, called data-corpus 1 and data-corpus 2 respectively. Data-corpus 1 is used for training and data-corpus 2 is used for testing in the NB, TBL, and SVM models as well as in our proposed model. Secondly, data-corpus 1 is used for building the TBL rules, so it is divided randomly, 10 times, into two parts at a ratio of 3:1: one part is used for training (called the training-corpus) and the other for development (called the developing-corpus). Notice that the training phase for building TBL rules is thus run 10 times, once for each training/developing split, to cover as many TBL rules as possible. Table II shows the sizes of the training, developing, and test sets.

Table II. DATA SETS

No  Word  Part of speech  Training  Developing  Test
1   Bạc   Noun            687       230         307
2   Bạc   Adj             308       105         139
3   Cất   Verb            673       229         301
4   Câu   Noun            1767      589         786
5   Câu   Verb            163       57          75
6   Cầu   Noun            659       220         295
7   Khai  Verb            1944      650         865
8   Pha   Verb            331       112         149
9   Phát  Verb            1205      408         538
10  Sắc   Noun            1124      376         500

B. Experimental results

In this section, we present experimental results on four models: the NB model, the TBL model, the SVM model, and the proposed model combining NB and TBL. (We use libsvm for the SVM model; see http://www.csie.ntu.edu.tw/~cjlin/libsvm/ for details.) From the datasets above, we first evaluate the accuracy of the NB model and obtain the results shown in Table III. The average accuracy of this model is about 86.5%.

Table III. NB MODEL RESULTS

No  Word      Part of Speech  Training  Test  Accur. (%)
1   Bạc       Noun            917       307   81.8
2   Bạc       Adj             413       139   85.6
3   Cất       Verb            902       301   84.4
4   Câu       Noun            2356      786   97.6
5   Câu       Verb            220       75    85.3
6   Cầu       Noun            879       295   95.6
7   Khai      Verb            2594      865   90.4
8   Pha       Verb            443       149   79.2
9   Phát      Verb            1613      538   73.6
10  Sắc       Noun            1500      500   91.6
    Averages                  1328      444   86.5

Secondly, for each ambiguous word, using the training algorithm of Section III, we obtain the lists of TBL rules. Since this phase is run 10 times, we obtain 10 lists of TBL rules. Table IV shows the experimental results when each of these TBL lists is tested separately (with the combination model) for the word "bạc" as an adjective. Moreover, if we combine all the rules into one list, we obtain a better accuracy of 92.8%. Some of the TBL rules obtained for the word "bạc" are shown in Figure 6; for example, rules that change sense 2 to sense 3 when the word at position −1 is "vàng" or "sới", or when neighboring words such as "tiền", "tờ", "két", "triệu", or "cao"/"cấp" appear in the window.

Figure 6. Some TBL rules for the word "bạc".
Finally, we show the experimental results of our system for the 10 ambiguous words. It can be seen from Table V that the results obtained from the proposed model (combining NB classification and TBL) are better than the results obtained from the NB classification model, the TBL model, and the SVM model. The average accuracy of the proposed model is about 91.3% over the 10 ambiguous words, which is 4.8%, 7.4%, and 3.1% more accurate than the NB classification model, the TBL model, and the SVM model respectively.

Table IV. NB & RULE-BASED MODEL'S RESULTS FOR THE AMBIGUOUS WORD "BẠC"

No  List of rules   Accuracy of NB & TBL (%)
1   rules list 1    89.2
2   rules list 2    89.9
3   rules list 3    89.2
4   rules list 4    89.9
5   rules list 5    89.9
6   rules list 6    89.9
7   rules list 7    89.2
8   rules list 8    90.6
9   rules list 9    92.1
10  rules list 10   89.2
11  combined rules  92.8

Table V. NB, TBL, SVM, AND OUR PROPOSED MODEL RESULTS

No  Word      Part of Speech  Accur1 (%)  Accur2 (%)  Accur3 (%)  Accur4 (%)
1   Bạc       Noun            81.8        82.4        84.4        88.6
2   Bạc       Adj             85.6        83.5        88.5        92.8
3   Cất       Verb            84.4        79.7        86.4        89.7
4   Câu       Noun            97.6        97.3        97.8        98.3
5   Câu       Verb            85.3        88.0        86.7        96.0
6   Cầu       Noun            95.6        85.4        95.6        95.9
7   Khai      Verb            90.4        88.2        91.2        92.9
8   Pha       Verb            79.2        76.5        81.2        83.9
9   Phát      Verb            73.6        75.2        77.1        80.9
10  Sắc       Noun            91.6        83.2        92.8        94.0
    Averages                  86.5        83.9        88.1        91.3

(Accur1: accuracy of the NB model; Accur2: accuracy of the TBL model; Accur3: accuracy of the SVM model; Accur4: accuracy of the NB & TBL model.)

VI. CONCLUSIONS

This paper has proposed a new method that combines the advantages of the machine learning approach and the rule-based approach for the task of word sense disambiguation. In particular, we have used NB classification as the machine learning method and combined it with TBL. We have experimented on some Vietnamese polysemous words, and the obtained accuracy increased by 4.8%, 7.4%, and 3.1% compared with the results of the NB model, the TBL model, and the SVM model respectively. This also shows that TBL can be utilized to correct wrong results from statistical machine learning models. The model can be applied to other languages for the task of WSD, and we believe it can also be applied to other natural language processing tasks such as part-of-speech tagging, syntactic parsing, and so on.

ACKNOWLEDGMENT

This work is partially supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED), project code 102.99.35.09.

REFERENCES

[1] N. Ide and J. Véronis, "Introduction to the special issue on word sense disambiguation: the state of the art," Comput. Linguist., vol. 24, pp. 2–40, March 1998.
[2] A. Suárez and M. Palomar, "A maximum entropy-based word sense disambiguation system," in Proceedings of the 19th International Conference on Computational Linguistics (COLING '02), Volume 1. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 1–7.
[3] A. L. Berger, S. A. D. Pietra, and V. J. D. Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, pp. 39–71, 1996.
[4] Y. K. Lee, H. T. Ng, and T. K. Chia, "Supervised word sense disambiguation with support vector machines and multiple knowledge sources," in Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004, pp. 137–140.
[5] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95). Stroudsburg, PA, USA: Association for Computational Linguistics, 1995, pp. 189–196.
[6] T. Pedersen, "A decision tree of bigrams is an accurate predictor of word sense," in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL '01). Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 1–8.
[7] W. A. Gale, K. W. Church, and D. Yarowsky, "A method for disambiguating word senses in a large corpus," Computers and the Humanities, vol. 26, pp. 415–439, 1992.
[8] T. Pedersen, "A simple approach to building ensembles of naive bayesian classifiers for word sense disambiguation," 2000.
[9] M. Lesk, "Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone," in Proceedings of the 5th Annual International Conference on Systems Documentation (SIGDOC '86). New York, NY, USA: ACM, 1986, pp. 24–26.
[10] R. Navigli and P. Velardi, "Structural semantic interconnections: A knowledge-based approach to word sense disambiguation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 1075–1086, July 2005.
[11] E. Brill, "Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging," Comput. Linguist., vol. 21, pp. 543–565, December 1995.
[12] C. A. Le, "A study of classifier combination and semi-supervised learning for word sense disambiguation," Ph.D. dissertation, School of Information Science, Japan Advanced Institute of Science and Technology, 2007.
[13] C. A. Le and A. Shimazu, "High word sense disambiguation using naive bayesian classifier with rich features," in The 18th Pacific Asia Conference on Language, Information and Computation (PACLIC-2004), 2004, pp. 105–113.
[14] R. F. Mihalcea, "Word sense disambiguation with pattern learning and automatic feature selection," Nat. Lang. Eng., vol. 8, pp. 343–358, December 2002.
[15] G. Ngai and R. Florian, "Transformation-based learning in the fast lane," in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL '01). Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 1–8.
[16] R. L. Milidiú, J. C. Duarte, and C. Nogueira Dos Santos, "TBL template selection: An evolutionary approach," in Current Topics in Artificial Intelligence, D. Borrajo, L. Castillo, and J. M. Corchado, Eds. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 180–189.
