Building a semantic role labeling system for Vietnamese sentences


UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI

NGUYEN THANH HUY

BUILDING A SEMANTIC ROLE LABELING SYSTEM FOR VIETNAMESE SENTENCES

Major: Computer Science
Code: 604801

MASTER THESIS
Supervised by: Dr. Nguyen Phuong Thai
Hanoi - 2011

Table of Contents

Acknowledgements
Abstract
List of Figures
List of Tables
Chapter I: Introduction
  1. Function tags
  2. Corpora for function tag labeling
  3. Current studies on Function tagging
  4. Objective of the thesis
  5. Our contributions
  6. Thesis structure
Chapter II: Related works
  1. Function Tag Labeling by Parsing
     1.1 Motivation
     1.2 Approach
     1.3 Result
  2. Sequential Function Tag Labeling
     2.1 Features
     2.2 Learning model
  3. Function Tag Labeling by Classification
     3.1 Features
     3.2 Model
Chapter III: The proposed approach
  1. System Architecture
  2. Function Tags in Vietnamese
  3. Selected Features
  4. Word Clustering
  5. Classification Model
     5.1 Maximum Entropy Motivation
     5.2 Maximum Entropy Modeling
     5.3 Training data
     5.4 Features and Constraints
     5.5 Maximum Entropy Principle
  6. Summarization
Chapter IV: Experiment
  1. Corpora and Tools
  2. Functional Labeling Precisions
  3. Error Analyses
  4. Effectiveness of Word Cluster Feature
  5. Summary
Chapter V: Conclusion and Future work
  1. Contributions
  2. Future work
Bibliography
Publications

List of Figures

Figure 1. Sample domain and frame elements of FrameNet
Figure 2. A parse with function tags in the Viet Treebank
Figure 3. The perceptron model for the function tag labeling problem
Figure 4. Model of the function tag labeling system for Vietnamese sentences
Figure 5. An example of selected features in the Viet Treebank
Figure 6. Example of a word cluster hierarchy
Figure 7. Scenarios in constrained optimization
Figure 8. Pseudo-code for extracting function labels
Figure 9. An example of a word cluster
Figure 10. Learning curve
Figure 11. The dependency between two function labels

List of Tables

Table 1. Functional labeling approaches
Table 2. Results of the labeling-by-parsing approach following the Collins model
Table 3. Function tags in the Viet Treebank
Table 4. Vietnamese Treebank statistics
Table 5. Evaluation of the Vietnamese functional labeling system
Table 6. Increases in precision when using the word cluster feature

Chapter I: Introduction

In this chapter, I introduce function tags and their value for NLP applications, some current approaches, the objective of the thesis, and our contributions. Finally, I describe the structure of the thesis.

1. Function tags

There are two kinds of tags in linguistics: syntactic tags and function tags. For syntactic tags there are several theories and research projects for English, Spanish, Chinese, and other languages [4][13][14][18]; this research mainly focuses on finding the part-of-speech and constituent tags. Function tags can be seen as more abstract labels because they behave differently from syntactic labels: while a syntactic label gives one notation for a span of words regardless of context, a function tag expresses the relationship between a phrase and its utterance in each particular context. The function tag of a phrase can therefore change depending on the context of its neighbors. For example, consider the phrase "baseball bat". Its syntactic category is "noun phrase" (annotated as NP in most research), but its function tag might be subject in the sentence "This baseball bat is very expensive", direct object in "I bought this baseball bat last month", or instrument or agent in the passive sentence "That man was attacked by this baseball bat". Function tags were directly addressed by Blaheta (2003) [2]. There is a lot of research on how to assign function tags to a sentence; this research problem is called the function tag labeling problem, a class of problems that aims to find semantic information for phrases. To sum up, function tag labeling is the problem of finding the semantic information of a span of words and tagging it with a given annotation in its context.
2. Corpora for function tag labeling

Nowadays machine learning is the most popular method for modern problems, especially in Natural Language Processing. To build a machine learning system we need a training data set, and there are several function tag labeling corpora for languages such as English and Chinese. In English, two main corpora are used for semantic role labeling and function tag labeling: FrameNet (Baker, 1998; Fillmore and Baker, 2000) and PropBank (Palmer et al., 2005). The main idea of FrameNet is to group all similar words together and then represent the relationships of each group with other groups in a network of frames; that is why it is called FrameNet. Figure 1 shows a small example branch of FrameNet.

[Figure 1. Sample domain and frame elements of FrameNet: domain "Communication", with frame elements such as confer (v), talk (v), discussion (n), debate (v).]

The second corpus is PropBank (Palmer et al., 2005), a modification of the Penn Treebank that adds extra annotation: function tags. The Penn Treebank and PropBank are organized as sets of trees; each tree represents a sentence tagged with syntactic labels (Penn Treebank) or with both syntactic and functional labels (PropBank). PropBank and the Chinese Treebank are linguistic resources that have been available for research purposes for a long time. The Viet Treebank, in contrast, has been developed only recently by Nguyen et al. [17], using experience and approaches from the Penn Treebank. Hence, the Viet Treebank has the same structure as the Penn Treebank and the Chinese Treebank: each word is a leaf node of a tree, and non-terminal nodes are tagged with a syntactic label or a functional label. The Viet Treebank is available at http://vlsp.vietlp.org:8080/demo/?page=resources. Figure 2 shows an example from the Viet Treebank with function tags.

[Figure 2. A parse with function tags in the Viet Treebank.]
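The tree of Figure 2 cannot be reproduced here, so the following hand-made bracketing merely illustrates the style of annotation: constituent labels (S, NP, VP, V) combined with function tags such as SUB (subject), DOB (direct object), and TMP (time), which appear in the tag set used later in this thesis. The sentence and its segmentation are illustrative only and are not taken from the corpus.

    (S (NP-SUB Tôi)
       (VP (V mua)
           (NP-DOB gậy bóng_chày này)
           (NP-TMP tháng_trước)))

    "I bought this baseball bat last month" - the same noun phrase would be
    tagged NP-SUB rather than NP-DOB in "This baseball bat is very expensive".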
3. Current studies on Function tagging

Function tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Consequently, several lines of research have focused on the function tagging problem in order to recover additional semantic information that is more useful than syntactic labels alone. In 1997, Collins [7] introduced the idea of adding this kind of information and proposed a parser able to guess complement tags. This parser, known as Collins's parser, is considered the first system to tag such labels. Function tag labeling was defined precisely by Blaheta (2003) [2], whose research used data from the Penn Treebank II, which includes extra function tags. Following Blaheta's proposal, various investigations have focused on function tag labeling, such as Merlo and Musillo (2005), Blaheta and Charniak (2004), Chrupala and van Genabith (2006), and Sun and Sui (2009). These studies extend the function tag labeling topic by addressing new languages such as Chinese, proposing new approaches, or investigating new features.

Nowadays, there are three main approach strategies for the function tag labeling problem. The first approach, called the parsing approach, tags function labels during the parsing process and is a modification of Collins's parser; the studies of Gabbard et al. [9] and Marcus et al. [16] follow this approach. The second approach is the labeling (classification) approach, which includes two phases: extracting features and classifying function labels. It offers more techniques because of the diversity of classification methods; the most typical work is Blaheta's [3], which applied several techniques to show the impact of each one on the function tag labeling problem. The third approach is the sequential labeling approach, in which function tags are predicted from an observed chain of words (Yuan et al. [22]). It is similar to the classification approach in its feature selection, but it uses a sequence prediction model instead of a classification model. These approaches are discussed in detail in the next chapter.

Today there is also a broader class of problems that covers function tagging. This class was introduced by Carreras and Marquez (2004) [5] and is called Semantic Role Labeling. Semantic Role Labeling is similar to function tag labeling but works at a more abstract level: when building a Semantic Role Labeling system, the training data carry more information, including not only time, location, manner, etc., but also object, instrument, agent, etc. This problem is a promising research direction for NLP applications that need to understand the meaning of sentences.

4. Objective of the thesis

As mentioned above, assigning function tags has been widely researched, especially for English, and recently some studies have been carried out for Spanish and Chinese. All of these function tagging systems have contributed, for their corpora, a layer of semantic classes that is very useful for other NLP applications such as question answering, summarization, and information retrieval. In recent years, Natural Language Processing research in Vietnam has developed rapidly. For Vietnamese in particular, many studies have focused on recognizing the syntax of sentences, for example with POS tagging systems. Unfortunately, these applications do not provide semantic information for a sentence, whereas some NLP applications need such information to answer questions of the form who, where, what, and whom. To deal with this problem, our research focuses on building an automatic function tag labeling system. In this thesis, as a first stage, our research builds a function tagging system, a problem that is shallower than Semantic Role Labeling.

[...]

5.5 Maximum Entropy Principle

Assume that there are n given feature functions fi which capture the important statistics of the modeling process. Our goal is to build a model that accords with these statistics; that is, we would like p to lie in the subset C of P defined by

    C = { p in P : p(fi) = p~(fi) for i = 1, 2, ..., n },

where p(fi) is the expected value of fi under the model and p~(fi) is its empirical expectation in the training data. This situation is illustrated in Figure 7, where P is the space of all probability distributions over three points (a simplex), with no constraints imposed a priori. In triangle (a), no constraints are applied, so every p in P is allowed. If one constraint C1 is imposed (b), the set of allowable models narrows to those lying on the line defined by the linear constraint. Two consistent constraints C1 and C2 (c) define a single model, whereas two inconsistent constraints (d) cannot be satisfied simultaneously, so no allowable model remains.

[Figure 7. Scenarios in constrained optimization.]

Among the allowed models, the maximum entropy philosophy dictates that we choose the distribution which is most uniform. The term "uniform" is made precise by a mathematical measure of the uniformity of a conditional distribution p(y|x): the conditional entropy

    H(p) = - sum_x p~(x) sum_y p(y|x) log p(y|x).

The entropy is bounded from below by zero (the entropy of a model with no uncertainty at all) and from above by log|Y| (the entropy of the uniform distribution over all |Y| possible values of y). With these definitions, the principle of maximum entropy states: from the set C of allowed probability distributions, select the model p* with maximum entropy H(p):

    p* = argmax_{p in C} H(p).    (7)

It can be shown that p* is always well defined; that is, there is always a unique model p* with maximum entropy in any constraint set C.
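For reference, the standard result from Berger et al. [1] behind the model described above is that the maximum entropy distribution subject to the constraints in C has a log-linear (exponential) form. The LaTeX fragment below states it; the notation here is mine and is only assumed to match what the thesis uses in Section 5.2.

    % Parametric form of the maximum entropy solution (after Berger et al. [1]).
    \begin{align}
      p_{\lambda}(y \mid x) &= \frac{1}{Z_{\lambda}(x)}
          \exp\!\Bigl( \sum_{i=1}^{n} \lambda_{i} f_{i}(x, y) \Bigr), \\
      Z_{\lambda}(x)        &= \sum_{y} \exp\!\Bigl( \sum_{i=1}^{n} \lambda_{i} f_{i}(x, y) \Bigr).
    \end{align}
    % The weights \lambda_i are chosen to maximise the conditional
    % log-likelihood of the training data; this maximum-likelihood solution
    % coincides with the maximum entropy model p* of equation (7).

In practice the weights are estimated with iterative optimization methods, which is the role of the parameter-estimation routine in the maximum entropy library used in Chapter IV.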
6. Summarization

Our system has been described in the sections above; here I give a brief overview of our work on the function tag labeling system for Vietnamese. As mentioned in the introduction, our work includes two main phases: training and testing. In the training phase, we execute an algorithm that extracts functional labels together with their constituents from the Viet Treebank. In parallel, we use a word clustering toolbox (Percy Liang, 2005) to build, from a large corpus, a set of clusters in which each cluster contains synonyms or words on the same topic. During the training phase we use depth-first search (DFS), a standard algorithm for traversing a tree or graph, to collect the function tags; the procedure is given as pseudo-code in Figure 8. The extracted features, together with their functional labels, are then fed to the Maximum Entropy Model to build a classification model for the function tag labeling problem.

[Figure 8. Pseudo-code for extracting function labels.]

In the testing phase, we run the same extraction algorithm again to obtain constituents. These constituents are passed to the model built in the training phase, and the predicted labels are matched against the functional labels extracted from the test trees in order to evaluate the performance of the model. This step constitutes the evaluation of our system.
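The pseudo-code of Figure 8 is not reproduced in this excerpt, so the following Python sketch gives one plausible reading of the traversal described above. The nested-tuple tree format, the partial tag list, and the helper names are assumptions made for illustration, and the extracted features are reduced to a minimum; this is not the thesis's actual code.

    # A minimal sketch of depth-first extraction of (function label, constituent)
    # pairs from a constituency tree. Trees are assumed to be nested tuples:
    #   node = (label, [child, child, ...])   for internal nodes
    #   node = word                           for leaves
    # A label such as "NP-SUB" carries the constituent tag "NP" and the
    # function tag "SUB"; constituents without a function tag get "NoneLBL".

    FUNCTION_TAGS = {"SUB", "DOB", "IOB", "TMP", "LOC", "MNR", "PRP", "TPC"}  # subset, illustrative

    def split_label(label):
        """Split a node label like 'NP-SUB' into (constituent, function_tag)."""
        parts = label.split("-")
        if len(parts) > 1 and parts[-1] in FUNCTION_TAGS:
            return "-".join(parts[:-1]), parts[-1]
        return label, "NoneLBL"

    def leaves(node):
        """Collect the words under a node."""
        if isinstance(node, str):
            return [node]
        _, children = node
        words = []
        for child in children:
            words.extend(leaves(child))
        return words

    def extract_samples(node, samples=None):
        """Depth-first traversal emitting one training sample per constituent."""
        if samples is None:
            samples = []
        if isinstance(node, str):          # leaf: nothing to label
            return samples
        label, children = node
        constituent, function_tag = split_label(label)
        samples.append({
            "function_tag": function_tag,        # classification target
            "constituent": constituent,          # e.g. NP, PP, S
            "words": " ".join(leaves(node)),     # surface string of the phrase
        })
        for child in children:                   # recurse depth-first
            extract_samples(child, samples)
        return samples

    # Example (illustrative sentence, not from the Viet Treebank):
    tree = ("S", [("NP-SUB", ["Tôi"]),
                  ("VP", [("V", ["mua"]),
                          ("NP-DOB", ["gậy", "bóng_chày", "này"]),
                          ("NP-TMP", ["tháng_trước"])])])
    for s in extract_samples(tree):
        print(s["constituent"], s["function_tag"], "|", s["words"])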
Chapter IV: Experiment

In this chapter we discuss our system's effectiveness with respect to several factors, such as the results of the classification model, the number of function tags, and the distribution of each tag in our data. From this we can assess which labels are identified most effectively in Vietnamese and in which cases the system tends to go wrong.

1. Corpora and Tools

The most important resource in our experiments is the hand-crafted Vietnamese Treebank [17]. This Treebank contains 10,471 sentences tagged with both constituent and functional labels. It has been developed since 2006 and is still updated regularly to support Vietnamese language processing research. We used about 9,000 trees for training and the remaining 1,471 trees for testing. Table 4 shows statistics of the Treebank, and Table 3 presents the Vietnamese function tags in four groups: clause types, syntactic roles, adverbials, and miscellaneous.

Table 4. Vietnamese Treebank statistics
    Sentences: 10,471
    Words:     225,085
    Syllables: 271,268

The MEM tool we used (http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/) is a library for maximum entropy classification from Tsujii's laboratory. The current version (2006) has several advanced features such as fast parameter estimation using the BLVM algorithm and smoothing with Gaussian priors. For word clustering, we used the open-source tool of Liang (2005) (http://www.cs.berkeley.edu/~pliang), an efficient implementation of Brown's algorithm [4]. An unlabeled corpus containing about 700,000 Vietnamese sentences was collected from online newspapers including Lao Dong, PC World, and Tuoi Tre. This corpus was pre-processed with sentence splitting and word segmentation (vnTokenizer, http://www.loria.fr/~lehong/tools/vnTokenizer.php) before word clustering.

We ran the word clustering tool on the 700,000 sentences of the collected corpus. After clustering we obtained 700 raw clusters, from which we removed clusters whose center words were repeated, leaving 670 clusters usable for function tag labeling. Among these 670 clusters we judged 473 to consist of synonyms. We used five criteria to decide whether a cluster is a good one, i.e. whether there is a strong relationship between the words inside it:
  1. Complete synonyms: two or more words that can replace one another in some context, e.g. "vua" (king), "hoàng đế" (emperor).
  2. Antonyms: words with opposite meanings, e.g. "đẹp" (beautiful), "xấu" (ugly).
  3. Semantic relation specific-abstract: a relation between an object and its category, e.g. "nhạc" (music) - pop, rock.
  4. Semantic relation abstract-specific: the reverse of criterion 3.
  5. Similar meaning: words that do not fall into the four criteria above but have related semantics; these are considered weak synonyms, e.g. "bàn" (table), "ghế" (chair).

Figure 9 shows an example of a good cluster. The first line of each cluster gives its name and identification. Each word in a cluster carries a bit string; this bit string is used when we want to obtain larger clusters by merging pairs of clusters into new ones (refer back to the word cluster hierarchy in Figure 6 to see how a bigger cluster is built). The next field of each entry is the word itself, as segmented in pre-processing, and the last field is the frequency of this word in the training data for the word clustering step.

[Figure 9. An example of a word cluster.]
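As an illustration of how such a cluster file can feed the classifier, here is a small sketch that maps each head word to a prefix of its cluster bit string (using a prefix follows the merging idea described above). The exact file layout, the prefix length, and all names are assumptions for illustration, not the thesis's implementation.

    # A minimal sketch (not the thesis's code) of turning Brown-style word
    # clusters into a feature. Each data line is assumed to hold a cluster
    # bit string, a segmented word, and a frequency, separated by tabs;
    # header or malformed lines are skipped.

    def load_clusters(path, prefix_len=8):
        """Map each word to the first prefix_len bits of its cluster bit string."""
        word2cluster = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 2 or not parts[0] or set(parts[0]) - {"0", "1"}:
                    continue                  # skip cluster headers / malformed lines
                bits, word = parts[0], parts[1]
                word2cluster[word] = bits[:prefix_len]
        return word2cluster

    def cluster_feature(head_word, word2cluster):
        """Back off to 'UNK' when the head word never appeared in the clustering corpus."""
        return "cluster=" + word2cluster.get(head_word, "UNK")

    # Illustrative usage (the file name is hypothetical):
    # word2cluster = load_clusters("vn_word_clusters.txt")
    # features = ["constituent=NP", "head=vua", cluster_feature("vua", word2cluster)]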
2. Functional Labeling Precisions

To evaluate the system we used the precision measure familiar from classification studies: precision is the proportion of labels correctly predicted by the system among all input labels,

    precision = (number of correctly predicted labels) / (number of input labels).

In NLP research authors often report two kinds of measure, F-score and precision. In our research we do not use the F-score because the number of constituents in a syntactic tree is fixed. In addition, constituents that do not carry any function tag (as described in Table 3) are assigned a special "NoneLBL" label, so our system does not ignore any input constituent. For these reasons the F-score measure is not necessary for our system.

[Table 5. Evaluation of the Vietnamese functional labeling system: precision and frequency for the labels ADV, CND, CNC, CMD, DIR, DOB, EXC, EXT, IOB, LOC, MDP, MNR, PRD, PRP, SPL, SUB, SQ, TMP, TPC, TTL, VOC, and NoneLBL, together with the overall result.]

Among 16,997 testing samples (note that from one syntactic tree many classification examples can be extracted, depending on the number of constituents in that tree), 14,913 were predicted correctly and 2,084 incorrectly. The overall precision was 87.77%. The precision and frequency of each functional label are presented in Table 5. To investigate the relation between training corpus size and precision, we ran our system with training corpora whose size was doubled at each step. The learning curve in Figure 10 shows that the precision increased fastest, by around 2%, when the number of training sentences grew from 4,000 to 8,000.

[Figure 10. Learning curve.]

3. Error Analyses

In this section we analyze the errors that occurred in the testing phase. Table 5 shows that several functional labels, such as CNC and CMD, obtain zero precision because there are too few testing examples in these categories. Another type of error is caused by the dependency between functional labels; note that we do not use a functional label as a feature. According to the Vietnamese Treebank guidelines, in some cases there is a dependency between two functional labels. For example, Figure 11 shows a dependency between the TC and TPC labels: if a clause (S) contains a topical phrase (a PP in this example), the clause is labeled with the TC tag. In the general procedure by which we extract features, an output sample has the form "function label, value of feature 1, value of feature 2, ...". This dependency therefore deprives the main functional label of information, and as a consequence that label is tagged as "NoneLBL". In this example the function label in question is PP-TPC, but one of its features contains another functional label, S-TC.

[Figure 11. The dependency between two function labels.]

4. Effectiveness of Word Cluster Feature

The experimental results in Table 5 were obtained using seven features. To evaluate the effectiveness of the word cluster feature, we carried out an experiment using the other features only. The overall precision of our system decreased by 0.5% when we experimented without the word cluster feature (the seventh feature). Table 6 shows the increases in precision for some functional labels when the word cluster feature is used; labels with no increase (or with a decrease) are omitted. Although the overall increase (0.5%) is not large, there are individual changes that are relatively high, for labels such as manner (MNR), vocative (VOC), and exclamation (EXC). According to our observations, the head word feature was important in identifying these functional labels, but it was sparse in our training corpus. The word cluster feature, trained on a large corpus, was therefore very useful in reducing the sparseness of the head word feature.

[Table 6. Increases in precision when using the word cluster feature, for the labels TTL, SUB, TPC, MNR, SPL, IOB, VOC, and CMD.]

5. Summary

Because our system is the first automatic function tagging system for Vietnamese, there are no similar systems to compare our results with. Still, with a precision of 87.77% and a training corpus of adequate size, our results show that the function tagging problem for Vietnamese remains a challenge for researchers. Moreover, with our results, other Vietnamese NLP applications can now obtain the semantic information they were missing.
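To make the evaluation protocol of this chapter concrete, here is a small sketch of how overall and per-label precision could be computed from gold and predicted labels; the sample data are made up for illustration and do not reproduce the figures in Table 5.

    # Sketch of the precision computation used in this chapter: every constituent
    # receives exactly one predicted label (including "NoneLBL"), so precision is
    # simply the fraction of constituents whose predicted label equals the gold one.
    from collections import Counter

    def precision_report(gold, predicted):
        """Return overall precision and a per-label breakdown."""
        assert len(gold) == len(predicted)
        correct = Counter()
        total = Counter()
        for g, p in zip(gold, predicted):
            total[g] += 1
            if g == p:
                correct[g] += 1
        overall = sum(correct.values()) / len(gold)
        per_label = {lab: correct[lab] / total[lab] for lab in total}
        return overall, per_label

    # Made-up example (not the thesis's data):
    gold      = ["SUB", "DOB", "TMP", "NoneLBL", "SUB", "LOC"]
    predicted = ["SUB", "DOB", "LOC", "NoneLBL", "DOB", "LOC"]
    overall, per_label = precision_report(gold, predicted)
    print(f"overall = {overall:.2%}")        # 66.67%
    for label, p in sorted(per_label.items()):
        print(f"{label:8s} {p:.2%}")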
Chapter V: Conclusion and Future work

In conclusion, function tag labeling is one of the many problems in the Natural Language Processing field that are still waiting to be explored. As mentioned in the previous chapters, function tag labeling is a shallower task than semantic role labeling. Neither is an application in itself, but their outputs are very useful for most NLP applications. In English, many applications such as question answering and information retrieval have used the output of Blaheta's research as input data for their systems, and for English and Chinese some research has already addressed the full semantic role labeling problem. Finding semantic information thus remains a hard challenge for researchers who want to build NLP applications, especially for Vietnamese. In this chapter I summarize our contributions and mention future work to improve the performance of the system.

1. Contributions

In this research we have investigated the Vietnamese functional labeling problem, a new problem for Natural Language Processing applications that work with Vietnamese. With our system we believe we have made a number of significant contributions:
· First, we built the first Vietnamese functional labeling system, with high precision.
· Second, we carried out various experiments, such as the learning curve and the error analyses, to give a better understanding of this system.
· Third, we built an automatic function tag labeling system that can be used to enrich sub-trees of the Vietnamese Treebank with more functional labels.
· Fourth, we contributed a new baseline system for research moving towards the Semantic Role Labeling problem.
· Additionally, we showed the effectiveness of the word cluster feature for each function tag.
Again, although we believe our selected features are not yet optimal, the high precision obtained shows that our system is reliable.

2. Future work

Although our results are reliable, there are some deficiencies in our project. These are the tasks we are going to work on in the future:
· First, our research needs more semantic tags in order to develop towards the Semantic Role Labeling problem. We will investigate adding role tags such as theme, patient, and instrument.
· Second, our training data are limited in quantity. With approximately ten thousand sentences hand-annotated with function tags in the Viet Treebank, our research has enough training data, but we can still reduce the overlapping cases. We will therefore build a larger training corpus to smooth our model.
· Finally, we will approach function tag labeling with other strategies to discover the effect of each model on the function tagging problem. We believe that with more function tagging systems the quality of the output semantic information will improve, and as a result other NLP applications will have more choices for their data.
Bibliography

[1] A. L. Berger, S. A. Della Pietra, V. J. Della Pietra. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics, 1996.
[2] Don Blaheta. "Function Tagging." PhD thesis, 2003.
[3] Don Blaheta, Eugene Charniak. "Assigning Function Tags to Parsed Text." Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
[4] P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, R. L. Mercer. "Class-Based n-gram Models of Natural Language." Computational Linguistics, 18(4):467-479, 1992.
[5] Xavier Carreras, Lluís Màrquez (TALP Research Centre, Technical University of Catalonia). "Introduction to the CoNLL-2004 Shared Task: Semantic Role Labeling." CoNLL-2004 Shared Task, 2004.
[6] Grzegorz Chrupala, Nicolas Stroppa, Josef van Genabith. "Better Training for Function Labeling." 2007.
[7] Michael Collins. "Three Generative, Lexicalised Models for Statistical Parsing." Proceedings of the ACL, 1997.
[8] Charles J. Fillmore. "Frame Semantics and the Nature of Language." Annals of the New York Academy of Sciences, Conference on the Origin and Development of Language and Speech, 280:20-32, 1976.
[9] Ryan Gabbard, Mitchell Marcus, Seth Kulick. "Fully Parsing the Penn Treebank." 2006.
[10] J. Katz, J. Fodor. "The Structure of a Semantic Theory." 1963.
[11] T. Koo, X. Carreras, M. Collins. "Simple Semi-supervised Dependency Parsing." Proceedings of the ACL, 2008, pp. 595-603.
[12] J. Lafferty, A. McCallum, F. Pereira. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." Proceedings of ICML 2001, pp. 282-289, Williamstown, USA, 2001.
[13] Anh-Cuong Le, Phuong-Thai Nguyen, Hoai-Thu Vuong, Minh-Thu Pham, Tu-Bao Ho. "An Experimental Study on Lexicalized Statistical Parsing for Vietnamese." Proceedings of KSE 2009, pp. 162-167, 2009.
[14] Percy Liang. "Semi-Supervised Learning for Natural Language." Master's thesis, Massachusetts Institute of Technology, 2005.
[15] P. Merlo, G. Musillo. "Accurate Function Parsing." Proceedings of EMNLP 2005, pp. 620-627, Vancouver, Canada, 2005.
[16] Mitchell P. Marcus et al. "Building a Large Annotated Corpus of English: The Penn Treebank." Computational Linguistics, 1993.
[17] Phuong-Thai Nguyen, Xuan-Luong Vu, Minh-Huyen Nguyen, Van-Hiep Nguyen, Hong-Phuong Le. "Building a Large Syntactically-Annotated Corpus of Vietnamese." The 3rd Linguistic Annotation Workshop (LAW), ACL-IJCNLP 2009.
[18] L. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE, 77(2):257-286, 1989.
[19] Weiwei Sun, Zhifang Sui. "Chinese Function Tag Labeling." 2009.
[20] Honglin Sun, Daniel Jurafsky. "Shallow Semantic Parsing of Chinese." In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings.
[21] Nianwen Xue, Martha Palmer (CIS Department, University of Pennsylvania). "Automatic Semantic Role Labeling for Chinese Verbs." 2004.
[22] Caixia Yuan, Fuji Ren, Xiaojie Wang. "Accurate Learning for Chinese Function Tags from Minimal Features." 2009.

Publications

[1] Nguyen Thanh Huy, Nguyen Kim Anh, Nguyen Phuong Thai. "Building an Efficient Functional-Tag Labeling System for Vietnamese." The Third International Conference on Knowledge and System Engineering, 2011.

... To present the model as an equation, a rigorous notation is proposed to distinguish a random variable from a particular value it may assume; in this case, random variables are notated by capital letters ...

... As an overview, Table 1 provides a summary of approaches to the function tag labeling problem (1st: Gabbard et al.; 2nd: Blaheta; 3rd: Yuan et al.; ours). Note that ...

... The learning model used in that paper is HM-SVM because of its advantages in learning labels; as a result, the tagger reached a 96.18% accuracy rate. Function tag labeling by classification: this approach carries out functional ...
