DSpace at VNU: Named entity recognition for vietnamese documents using semi-supervised learning method of CRFs with generalized expectation criteria

2012 International Conference on Asian Language Processing

Named Entity Recognition for Vietnamese Documents Using a Semi-Supervised Learning Method of CRFs with Generalized Expectation Criteria

Thi-Ngan Pham (1,2,a), Le Minh Nguyen (3,b), Quang-Thuy Ha (2,c)
(1) The Vietnamese People's Police Academy, Hanoi, Vietnam
(2) KTLab, College of Technology (VNU-UET), Vietnam National University, Hanoi (VNU), Hanoi, Viet Nam
(3) Japan Advanced Institute of Science and Technology (JAIST), Japan
Email: (a) ptngan2012@gmail.com, (b) minhnl@jaist.ac.jp, (c) thuyhq@vnu.edu.vn

978-0-7695-4886-9/12 $26.00 © 2012 IEEE. DOI 10.1109/IALP.2012.54

Abstract: Named Entity Recognition (NER) is an important and useful task in many natural language processing applications, and much previous work on NER has been done for languages such as English, Japanese and Chinese. However, the Vietnamese NER task is still relatively new and challenging because of the characteristics of Vietnamese and the lack of a large annotated corpus. This paper presents a new approach for Vietnamese NER: a semi-supervised training method for conditional random field (CRF) models that uses generalized expectation criteria to express a preference for parameter settings. We perform several experiments using different feature settings and different training data to show the high performance of this method and to compare it with the other method.

Keywords: Generalized Expectation criteria, CRFs, semi-supervised learning

I. INTRODUCTION

NER aims to identify and classify certain proper nouns into predefined target entity classes such as Person, Organization, Location, Numeral expressions, Temporal expressions, Monetary values and Percentages. Many previous studies on NER have been done for languages such as English, Japanese and Chinese, and they achieve high performance. Supervised learning methods require labeled instances, which are often costly to obtain, while semi-supervised learning methods are an appealing solution for reducing labeling effort. In recent years, several approaches have been proposed for semi-supervised learning, among which the approach using generalized expectation criteria has received much attention.

Generalized expectation (GE) criteria [1] are terms in a training objective function that assign scores to values of a model expectation. GE provides a method for incorporating prior knowledge into model training, so that humans can directly express preferences to the parameter estimator naturally and easily using the language of expectations, rather than the often complex and counter-intuitive language of the parameters. In this paper, the expectations are model-predicted class distributions conditioned on the presence of selected features, and the score function is the Kullback-Leibler divergence from reference distributions that are estimated using existing resources.

We apply a semi-supervised training method for conditional random fields (CRFs) with generalized expectation criteria that incorporates both labeled and unlabeled sequence data to estimate a discriminative structured predictor. The CRF model [9] is a flexible and powerful model for structured prediction, based on undirected graphical models that have been globally conditioned on a set of input covariates. Here, unlabeled data is combined with limited supervision, provided by the human trainer in the form of expected prior label distributions or class associations for features. We find that GE performs better than both the supervised method and several alternative semi-supervised methods, and provides better accuracy given the same amount of labeling effort. We perform several experiments using various feature configurations. We also investigate the effects of using different sizes of training data.

The rest of this paper is organized as follows. Recent NER studies related to our work are introduced in Section II. Section III briefly introduces GE criteria and how they can be applied to CRFs given conditional probability distributions of labels given features. Our proposed model is described in Section IV, which considers how to design a feature set, create a set of constraints and preprocess data for the model. Experimental results and related remarks are presented in Section V. Conclusions are given in the last section.

II. RELATED WORKS

In the past few years, there have been some studies on NER for Vietnamese that achieved certain results. T. Q. Tran et al. [14] presented a Support Vector Machine (SVM) based NER model which obtained an overall F-measure of 87.77%. Hoang-Quynh Le et al. [8] presented an integrated model for Vietnamese that recognizes person entities and simultaneously extracts relevant values of a predefined set of properties related to each person. This model used various kinds of knowledge resources, applied the well-known CRF machine learning method and obtained an F-measure of 83.39%. These two models obtained acceptable results, but they used supervised learning methods that require a large amount of labeled training data to reach high performance.

We also mention some recent semi-supervised learning methods for the NER task. B. Mohit and R. Hwa [11] used the Expectation Maximization (EM) algorithm along with their Naïve Bayes classifier to form a semi-supervised learning framework. Y. Grandvalet and Y. Bengio [15] initially proposed a method for semi-supervised training using entropy regularization, and F. Jiao et al. [2] then extended this model to linear-chain CRFs. In general, entropy regularization is fragile, and accuracy gains come only with precise settings. Rathany Chan Sam et al. [12] introduced a semi-supervised learning method for the NER task in Vietnamese that combines proper name coreference and named-ambiguity heuristics with a CRF model. The F-measures of this model for extracting "Person", "Location" and "Organization" entities are 93.13%, 88.15% and 79.35%, respectively.

A few different approaches have tried to learn from alternative forms of labeled resources. R. Schapire et al. [13] presented a method in which features are annotated with their
associated majority labels and used this information to bootstrap a parameterized text classification model. However, this method required some labeled data in order to train the model. A. McCallum et al. [1] introduced a special case of GE, label regularization, and demonstrated its effectiveness for training maximum entropy classifiers. G. Druck et al. [3] also used GE with full distributions for semi-supervised learning of maximum entropy models, except that there the distributions are on labels conditioned on features. G. S. Mann and A. McCallum [4] presented GE criteria for linear-chain conditional random fields, a new semi-supervised training method that makes use of labeled features rather than labeled instances. This model uses conditional probability distributions of labels given features and can dramatically reduce annotation time. Our research builds on the semi-supervised learning method for CRFs using GE criteria [4], applies it to Vietnamese, and obtains a better result than [8], [12], [14]. L. Yao et al. [10] also built on this model, applied it to biomedical NER and obtained high performance. This approach is discussed in detail in Section III.

III. GENERALIZED EXPECTATION CRITERIA FOR CONDITIONAL RANDOM FIELDS

Conditional random fields (CRFs), first introduced in 2001 by J. Lafferty et al. [9], are a statistical sequence modeling framework for labeling and segmenting sequential data. This model overcomes the weaknesses of HMMs and MEMMs, so it has been used in many approaches to labeling tasks. In this paper, we focus on GE criteria and how to use GE for CRFs.

Let $X$ be some set of variables, with assignments denoted $x \in \mathcal{X}$. Let $\theta$ be the parameters of some model that defines a probability distribution over $X$, $p_\theta(X)$. The expectation of some function $f(X)$ according to the model is

$$E_\theta[f(X)] = \sum_{x \in \mathcal{X}} p_\theta(x)\, f(x),$$

where $f(x)$ is an arbitrary function of the variables $x$ producing some scalar or vector value. This function of course may depend only on a subset of the variables in $x$. A set of assignments to input variables (training data instances), $\tilde{X}$, may be provided, and the conditional expectation is then the model expectation averaged over these instances:

$$E_\theta[f(X) \mid \tilde{X}] = \frac{1}{|\tilde{X}|} \sum_{x \in \tilde{X}} E_\theta[f(X) \mid x].$$

A generalized expectation (GE) criterion [1] is a term in a parameter estimation objective function that expresses some preference about the model's expectation of variable values. That is, a generalized expectation criterion is a function, $G$, that takes as an argument the model's expectation of $f(X)$ and returns a scalar, which is added as a term in the parameter estimation objective function:

$$G(E_\theta[f(X)]) \rightarrow \mathbb{R}.$$

In some cases $G$ might be defined based on a distance to some "target value" for $E_\theta[f(X)]$. Let $\tilde{f}$ be the target value and let $\Delta(\cdot,\cdot)$ be some distance function. Then $G$ can be defined as

$$G(E_\theta[f(X)]) = -\Delta(\tilde{f}, E_\theta[f(X)]).$$

Thus a GE criterion may be specified independently of the parameterization, and independently of the choice of any conditioning data sets. A GE criterion may operate on some arbitrary subset of the variables in $x$. The scoring function $G$ and the distance function $\Delta$ may be based on information theory, or be arbitrary functions.

For the purposes of this paper, the GE criterion objective function term is $-\Delta(\tilde{f}, E_\theta[f(X)])$, in which we set the functions $\tilde{f}$ and $f$ to be conditional probability distributions and $\Delta(p,q) = D(p \,\|\, q)$, the KL-divergence between two distributions. For semi-supervised training of CRFs, we augment the objective function with the regularization term

$$-\lambda\, D(\hat{p} \,\|\, \tilde{p}_\theta),$$

where $\lambda$ weights the term, $\hat{p}$ is given as a target distribution, and $\tilde{p}_\theta = \tilde{p}_\theta(y_j \mid f_m(x,j)=1)$ is the model's predicted label distribution at the positions where feature $f_m$ is present, obtained by normalizing the unnormalized potential

$$\tilde{q}_\theta(y_j \mid f_m(x,j)=1) = \sum_{x \in U_m} \sum_{j^*} p_\theta(y_{j^*} \mid x),$$

where $f_m(x,j)$ is a feature that depends only on the observation sequence $x$, $j^*$ is defined as $\{j : f_m(x,j)=1\}$, and $U_m$ is the set of sequences where $f_m(x,j)$ is present for some $j$.

IV. OUR PROPOSED MODEL

A. Analyzing the proposed model

Our general NER system includes two main phases, as illustrated in Fig. 1:
• Training
• Testing

Figure 1. The semi-supervised CRF model with Generalized Expectation. (Phase 1: the training data and the GE constraints feed CRF learning, which produces the NER model. Phase 2: the NER model decodes the testing data to produce the output.)

Both the training and testing processes are conducted using the Mallet toolkit (http://mallet.cs.umass.edu/).

1) Phase 1: Training phase
In this phase, the input includes training data with correct labels and the GE criteria. The GE criteria are expressed in terms of a set of GE constraints that take advantage of conditional probability distributions of labels given features. We extract features of the words in the training data, combine them with the probability distributions over labels in the constraints, and then train a CRF to obtain an NER model.

2) Phase 2: Testing phase
In the testing phase, the entities in the testing data are recognized by the NER model constructed in the training phase.

B. Preprocessing data, feature set and constraints for the proposed model

1) Preprocessing data
We recognize entities according to the CoNLL2003 shared task, in which we concentrate on three types of named entities: Person, Location and Organization. We also add the part-of-speech (POS) tag as a feature, so each token in the data carries the word, its POS tag and its NER label.

The process of making the training and testing data is as follows. First, we collect and pick hundreds of articles from some popular Vietnamese newspapers such as VnExpress (http://vnexpress.net), Vietnamnet (http://vietnamnet.vn) and Tienphongonline (http://www.tienphong.vn). Then, vnTagger (Le Hong Phuong: http://www.loria.fr/~lehong/tools/vnTagger.php) is used for word segmentation and POS label assignment. After that, NER labels are assigned manually.

2) Feature set
Apart from the orthographic features of the current word (see Table I), we also use the POS tag and lexical information as features. Let Pos0 and Lex0 (written P0 and L0 in Table II) denote respectively the POS tag and the lexical information at the current position, and let Posn and Lexn denote the POS tag and lexical information at offset n within the window. In our experiments, the window covers positions -2 to +2 (see Table II).

TABLE I. THE ORTHOGRAPHIC FEATURES

Feature       Meaning                                                  Example
INITCAPS      The initial letter of the current word is capitalized    Thành_phố
CAPSMIX       Some letters in the current word are capitalized         Hà_Nội
ALLCAPS       All the letters of the current word are capitalized      TPHCM
HASDIGIT      The current word includes numbers                        PC14
SINGLEDIGIT   The current word is a number between … and 10
DOUBLEDIGIT   The current word is a number between 10 and 99           50

TABLE II. THE FEATURES OF POS TAG AND LEXICAL INFORMATION

Lexical       L-2, L-1, L0, L1, L2
              L-2L-1, L-1L0, L0L1, L1L2
              L-2L-1L0, L-1L0L1, L0L1L2
POS           P-2, P-1, P0, P1, P2
              P-2P-1, P-1P0, P0P1, P1P2
              P-2P-1P0, P-1P0P1, P0P1P2
Combination   L-2P-2, L-1P-1, L0P0, L1P1, L2P2

3) Constraints
In order to use GE criteria in the model, we build a set of GE constraints which express the conditional probability distributions of labels given features. For example, we know that the feature "Hồ_Chí_Minh"/"Ho_Chi_Minh" (in this case, the feature is a word) may be a person's name, or may appear in the name of an organization, "Đoàn thanh_niên cộng_sản Hồ_Chí_Minh"/"The Communist Youth Union of Ho_Chi_Minh", or in the name of a location, "Thành_phố Hồ_Chí_Minh"/"Ho_Chi_Minh city". We calculate the probability that the feature "Hồ_Chí_Minh" belongs to each group based on the context (the relation with the previous and the next position) and the overall frequency of this feature in the documents. A constraint has the following general format:

Feature_name label_name1 = probability1 label_name2 = probability2 …

We can obtain constraints manually or automatically. There are some automated methods for obtaining constraints, such as user-provided labeled features and machine-provided candidate features using a Latent Dirichlet Allocation (LDA) based method [6], [10]. In our experiments, we use the second method, LDA. LDA is a generative probabilistic model for extracting latent topics from collections of discrete data such as text corpora. To build the corpus for LDA, we extract the features of each word within the unlabeled data set to get LDA samples. We use LDA to cluster the unlabeled data into latent topics, sort the frequencies of features for each topic, and select the most prominent features in each cluster. After the statistical analysis of feature labeling, we obtain a set of constraints. Example of one constraint:

WORD=Việt_Nam B-LOC:0.33333331111111303 B-MISC:1.3333332177777873E-8 B-ORG:1.3333332177777873E-8 B-PER:1.3333332177777873E-8 I-LOC:1.3333332177777873E-8 I-ORG:0.33333331111111303 I-PER:0.33333331111111303 O:1.3333332177777873E-8

The constraints have a great influence on the system results. We try several sets of constraints and get different results. How to build an effective set of constraints is still a topic of active research. The constraints should be balanced among labels and cover many documents.

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Experimental setup

• We use three training data sets: data1 (500 tokens), data2 (1000 tokens) and data3 (1500 tokens).
• The testing data is about 500 tokens.
• Three sets of constraints are used, with sizes of 614, 669 and 914 constraints.

The training and testing data sets are picked randomly from processed data of 5000 tokens. We also create three sets randomly from 1300 objects of unlabeled data to build the three constraint sets mentioned above.

In order to evaluate the effect of the training data and the constraint set on the result of recognizing entities, we run experiments using the same testing data with the three training data sets and the three sets of constraints. In each experiment, we assign NER labels using two models: the supervised learning CRF model and the semi-supervised learning CRF model using GE criteria. We use precision, recall and F-measure as the evaluation measures.

B. Experimental results and discussions

Table III shows the results of the experiments with the three training data sets and the set of 914 constraints. Generally, it is a positive result. The CRF model using GE performs better than the supervised CRF model in the two experiments using training data of 500 tokens and of 1500 tokens. Only in the experiment using training data of 1000 tokens do the two models get the same F-measure. The best result of the CRF model using GE is 88.89% precision, 91.43% recall and 90.14% F-measure, showing the superiority of this model over the supervised CRF model.

TABLE III. EXPERIMENTS WITH THREE SAMPLE SETS

              CRF                          CRF-GE
ER      PR%      RE%      F1%        PR%      RE%      F1%
500 tokens of training data
ORG     90.00    75.00    81.82      90.00    100.00   94.74
PER     100.00   66.67    80.00      100.00   66.67    80.00
LOC     12.50    100.00   22.22      25.00    100.00   40.00
ALL     58.33    72.41    64.62      63.89    82.14    71.88
1000 tokens of training data
ORG     90.00    100.00   94.74      90.00    100.00   94.74
PER     100.00   83.33    90.91      100.00   90.91    95.24
LOC     56.25    81.82    66.67      56.25    75.00    64.29
ALL     77.78    87.50    82.35      77.78    87.50    82.35
1500 tokens of training data
ORG     100.00   71.43    83.33      100.00   83.33    90.91
PER     100.00   100.00   100.00     100.00   90.91    95.24
LOC     75.00    100.00   85.71      75.00    100.00   85.71
ALL     88.89    88.89    88.89      88.89    91.43    90.14

Fig. 2 shows the experimental results of the semi-supervised learning CRF using GE with three different sets of constraints. The performance of the model with constraint set … and set … is better than that of the model with constraint set … in all three training data sets. The performance of the model with constraint set … is better than that of the model with constraint set … on training data1 and training data2, but not on training data3. These experiments show the influence of the constraints on the performance of the system.

Figure 2. The F-measure of the semi-supervised CRF model using GE with three different sets of constraints.

VI. CONCLUSIONS

In this paper, we have proposed a named entity recognition model for Vietnamese documents based on semi-supervised CRF learning using GE criteria. By applying "feature" labeling instead of "instance" labeling to obtain training data, the approach saves much annotation work and incorporates unlabeled data. One of the most important components in our NER system is building the set of constraints used to calculate the generalized expectation criteria. A set of constraints has been determined. The better the set of constraints, the better the NER result, so we are going to build more effective sets of constraints.

ACKNOWLEDGEMENTS

This work was partly supported by MOET project B2012-01-24 and VNU-UET project CN-12.01.

REFERENCES

[1] Andrew McCallum, Gideon Mann, Gregory Druck (2007). Generalized Expectation Criteria. Technical Report UM-CS-2007-60, University of Massachusetts Amherst, August 2007.
[2] F. Jiao, S. Wang, C.-H. Lee, R. Greiner, D. Schuurmans (2006). Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL.
[3] Gregory Druck, Gideon Mann, Andrew McCallum (2007). Leveraging Existing Resources using Generalized Expectation Criteria. NIPS WS, 2007.
[4] Gideon S. Mann, Andrew McCallum (2008). Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields. ACL-08 (HLT): 870-878.
[5] Gideon S. Mann, Andrew McCallum (2010). Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. Journal of Machine Learning Research 11: 955-984.
[6] Gregory Druck, Gideon Mann, Andrew McCallum (2008). Learning from Labeled Features using Generalized Expectation Criteria. SIGIR 08.
[7] Gregory Druck, Gideon Mann, Andrew McCallum (2009). Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria. The 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP: 360-368.
[8] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha (2011). An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. IALP 2011: 115-118, Penang, Malaysia.
[9] John Lafferty, Andrew McCallum, Fernando Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. of the Eighteenth International Conference on Machine Learning (ICML-2001).
[10] Lin Yao, Chengjie Sun, Yan Wu, Xiaolong Wang, Xuan Wang (2011). Biomedical named entity recognition using generalized expectation criteria. Int. J. Mach. Learn. & Cyber. (2011) 2: 235-243.
[11] B. Mohit, R. Hwa (2005). Syntax-based semi-supervised Named Entity Tagging. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 57-60, Michigan.
[12] Rathany Chan Sam, HuongThanh Le, ThuyThanh Nguyen, ThienHuu Nguyen (2011). Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text. The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 512-525.
[13] R. Schapire, M. Rochery, M. Rahim, N. Gupta (2002). Incorporating prior knowledge into boosting. In ICML.
[14] T. Q. Tran, T. X. T. Pham, H. Q. Ngo, D. Dinh, C. Nigel (2007). Named Entity Recognition in Vietnamese Documents. Journal of "Progress in Informatics", NII (National Institute for Informatics), Tokyo, Japan, Vol. 2007, No. 4, pp. 1-9.
[15] Y. Grandvalet, Y. Bengio (2004). Semi-supervised learning by entropy minimization. In NIPS.
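
As an illustration of the constraint format and the KL-divergence score described in the sections on GE criteria and on constraints, the following minimal Python sketch parses the example constraint for WORD=Việt_Nam and scores a model distribution against it. The function names parse_constraint and kl_divergence and the uniform model distribution are hypothetical, introduced only for this illustration; this is not the Mallet implementation used in the experiments.

```python
import math

# Hypothetical helpers for illustration; not part of Mallet or the paper's code.


def parse_constraint(line):
    """Parse 'FEATURE LABEL:prob LABEL:prob ...' into (feature, {label: prob})."""
    feature, *pairs = line.split()
    target = {}
    for pair in pairs:
        # Split on the last colon so labels such as B-LOC stay intact.
        label, _, prob = pair.rpartition(":")
        target[label] = float(prob)
    return feature, target


def kl_divergence(target, model):
    """Return D(target || model), summing over labels with non-zero target mass."""
    return sum(p * math.log(p / model[label])
               for label, p in target.items() if p > 0.0)


# Example constraint copied from the paper.
LINE = ("WORD=Việt_Nam B-LOC:0.33333331111111303 B-MISC:1.3333332177777873E-8 "
        "B-ORG:1.3333332177777873E-8 B-PER:1.3333332177777873E-8 "
        "I-LOC:1.3333332177777873E-8 I-ORG:0.33333331111111303 "
        "I-PER:0.33333331111111303 O:1.3333332177777873E-8")

feature, target = parse_constraint(LINE)

# Assumed model prediction: a uniform distribution over the eight labels.
uniform = {label: 1.0 / len(target) for label in target}

ge_penalty = kl_divergence(target, uniform)  # positive: model disagrees with target
no_penalty = kl_divergence(target, target)   # zero: model matches target exactly
```

During training, a penalty of this kind would be weighted and subtracted from the objective for every constrained feature, so that parameter settings whose predicted label distribution is close to the target are preferred.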

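
The F-measure reported in the experiments combines precision and recall. A minimal sketch, assuming the standard balanced definition (the harmonic mean of precision and recall), recomputes the ALL rows of Table III. The function name f_measure and the dictionary TABLE_III_ALL are illustrative names, not taken from the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the ALL rows of Table III,
# keyed by (model, number of training tokens).
TABLE_III_ALL = {
    ("CRF", 500): (58.33, 72.41, 64.62),
    ("CRF-GE", 500): (63.89, 82.14, 71.88),
    ("CRF", 1000): (77.78, 87.50, 82.35),
    ("CRF-GE", 1000): (77.78, 87.50, 82.35),
    ("CRF", 1500): (88.89, 88.89, 88.89),
    ("CRF-GE", 1500): (88.89, 91.43, 90.14),
}

# Largest absolute difference between the recomputed and the reported F1.
max_gap = max(abs(f_measure(p, r) - f1) for p, r, f1 in TABLE_III_ALL.values())
```

The recomputed values agree with the reported F1 column to within rounding; small differences are expected because the reported figures were presumably computed from raw counts rather than from the rounded percentages.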