2011 International Conference on Asian Language Processing

An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text

Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha
KTLab, College of Technology, Vietnam National University, Hanoi (VNU), Hanoi, Viet Nam
E-mail: {lhquynh, vutranmai, nambn_52, cuongpn_52, hqthuy}@gmail.com

Abstract—Personal names are among the most frequently searched items in web search engines, and a person entity is always associated with numerous properties. In this paper, we propose an integrated model that recognizes person entities and extracts the relevant values of a pre-defined set of properties related to each person simultaneously for Vietnamese. We also design a rich feature set using various kinds of knowledge resources and apply the well-known machine learning method of Conditional Random Fields (CRFs) to improve the results. The obtained results show that our method is suitable for Vietnamese, with an average of 84% precision, 82.56% recall and 83.39% F-measure. Moreover, the processing time is good, and the results also show the effectiveness of our feature set.

Keywords—person named entity; property relation; property extraction; person property extraction; conditional random fields

I. INTRODUCTION

Nowadays, personal names are among the most frequently searched items in web search engines, and a person entity is always associated with numerous properties (also called attributes) [4, 10]. A property is a characteristic or quality of an entity, and property extraction is the extraction of the properties corresponding to an entity from text [3]. In person property extraction, we predefine a fixed set of property types and try to extract the values of these properties for a person mentioned in text. Extracting the properties of a particular person is important for uniquely identifying that person on the web; consequently, extracting various properties has been shown to be useful for personal name disambiguation [10]. Property relation extraction is also used in object/entity analysis in text and plays an important role in expanding databases and ontologies.

A system that attempts to extract person properties from text must solve several sub-problems: named entity recognition, language ambiguity, grammatical complexity, etc. Named entity recognition (of person names, locations, dates, etc.) is a mandatory pre-processing step for property extraction. Doing the two tasks in turn would require much effort; moreover, because the two problems share many features, a pipeline model may repeat some steps twice. In this paper, we focus on recognizing person entities and extracting the properties related to these persons in Vietnamese text. Our model integrates entity recognition and person property extraction based on CRFs, using the rich Vietnamese feature set we propose.

The remainder of this paper is organized as follows. First, we present some related work in Section II. Section III introduces the machine learning method CRFs and its application to our problem. Then, in Section IV, we explain our proposed model and design a rich feature set using various kinds of knowledge resources. In the next section, we present experimental results and offer some discussion. Finally, Section VI gives the conclusions.
II. RELATED WORKS

In the past few years, this research topic has received considerable interest from the NLP community. From 2007 to 2010, the Web People Search (WePS) evaluation campaigns [4, 10] (http://nlp.uned.es/weps/) aimed at searching for people on the web, and this campaign series contributed many important studies on property extraction. The first WePS introduced a name disambiguation task and found that properties such as date of birth, nationality, affiliation, occupation, etc. are particularly useful as features for identifying namesakes [10]. Consequently, a property extraction subtask was introduced in the second WePS [10] and was continued in the third WePS [4]. The subtask in WePS-2 was to extract 18 "attribute values" of target individuals whose names appear on each provided web page. The problem was tackled by combining many technologies, such as named entity recognition and classification, text mining, pattern matching, relation discovery, information extraction and more. However, the results on the test set of 2,883 documents were quite low; the highest result, F = 12.2, was obtained by the PolyUHK system [10]. The WePS-3 attribute extraction task differs from WePS-2 in that systems are requested to relate each attribute to a person (a cluster of documents); the best system achieved F-measure = 0.18, precision = 0.22 and recall = 0.24 [4]. The WePS results also show that some properties occur more frequently than others, such as work, occupation and affiliation [10]. Based on the most frequent properties of WePS-2, we use 10 property types in our experiments: other name, date of birth, date of death, birth place, death place, sex, occupation, nationality, affiliation and relatives.

In 2008, Banko and Etzioni built a system called O-CRF [6] that used the components between two entities to discover their relation, with a precision of 88.3% and a recall of 45.2%. The effectiveness of CRFs for relation extraction in this system is one of the reasons why we chose to apply this method in our model.

The integration of two NLP sub-problems has also received some interest from the community. Two problems that have been integrated in many works are word segmentation and POS tagging, and most of these works achieved positive results (e.g., the research of Tran Thi Oanh et al. [11] in 2010 for Vietnamese). Because both named entity recognition and person property extraction can be solved as sequential labeling problems, we propose a model that integrates them.

There is quite a lot of research on semantic relations in Vietnamese, but only a few works study property relations. In 2010, Rathany Chan Sam et al. [8] developed a relation extraction system for Vietnamese person names and other entities based on CRFs; the average F-measures were 82.10% for Person-Organization, 86.91% for Person-Position and 87.71% for Person-Location.

Among published Vietnamese NLP studies, many researchers have used CRFs, and this has led to good results. In 2006, Cam Tu Nguyen et al. used CRFs for Vietnamese word segmentation [1], with average recall, precision and F1 of 93.76%, 94.28% and 94.05%, respectively. Our previous research [7] presented an experimental study on Vietnamese POS tagging using three machine learning methods (2009); the results showed that CRFs gave the best result (90.17% average precision). Recently, in 2010, Rathany Chan Sam et al. [8] also built their Vietnamese relation extraction system on CRFs, as mentioned above. Because of the good results obtained when applying CRFs to the Vietnamese NLP problems described above, we decided to use them to resolve our sequential labeling problem.

III. CONDITIONAL RANDOM FIELDS

Conditional Random Fields (CRFs) were first introduced in 2001 by Lafferty, McCallum and Pereira [5]; they form a statistical sequence modeling framework for labeling and segmenting sequential data. Several advanced convex optimization techniques can be used to train CRFs. Because L-BFGS and Newton methods have been found to converge much faster [9], we chose the L-BFGS method for CRF optimization in our system. For smoothing, a Gaussian prior is a well-known method that has been used by many researchers (such as Chen and Rosenfeld (1999) and Sha and Pereira (2003)). In the study of C. Sutton and A. McCallum (2006) [2], the best results were obtained when the CRF parameter GaussianPriorVariance was set to a factor of 10. In our proposed model, we therefore used a Gaussian prior for smoothing and set GaussianPriorVariance = 10. To train the CRFs on the given training data, we used multi-threaded CRF training so that it operates faster, with the NumberThreads parameter set accordingly.

As noted at the end of Section II, we use CRFs to resolve our sequential labeling problem: assume X = (x1, ..., xT) is an input sentence consisting of T words; we must determine the corresponding sequence of tags Y = (y1, ..., yT). We used a tag set of 43 tags derived from 21 labels (the key person entity, 10 property types and 10 property values); in each tag, B denotes the beginning of a label and I denotes the inside of a label.
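As an illustration of how such a tag set can be organized, the sketch below expands a label set into B-/I- tags in Java (the language the authors report using for their system). The label names are a small subset of those listed in Table II, and the extra O tag for tokens outside any label is our assumption; it is consistent with the reported count, since 21 labels × 2 prefixes + O = 43 tags.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: expanding a label set into a B/I tag set.
// The label names are a subset of those in Table II; the O tag is an
// assumption (21 labels * 2 prefixes + O = 43 tags, as reported above).
public class TagSet {

    public static List<String> buildTags(List<String> labels) {
        List<String> tags = new ArrayList<>();
        tags.add("O");                      // token belongs to no label (assumed)
        for (String label : labels) {
            tags.add("B-" + label);         // beginning of a label span
            tags.add("I-" + label);         // inside of a label span
        }
        return tags;
    }

    public static void main(String[] args) {
        // A few example labels: key person entity, property values, property types.
        List<String> labels = List.of("OPer", "VBornLoc", "VJob", "R_WhenBorn", "R_Job");
        List<String> tags = buildTags(labels);
        System.out.println(tags.size() + " tags: " + tags);
    }
}
```

With the full set of 21 labels, the same expansion yields the 43 tags used in our experiments.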
IV. OUR PROPOSED MODEL

A. Analyzing the proposed model

We propose an integrated model in which named entity recognition and person property extraction are processed simultaneously, for three main reasons. Firstly, the common pipeline approaches that recognize named entities and then extract relations in turn have some limitations. Secondly, both named entity recognition and person property extraction can be solved as sequential labeling problems. Thirdly, after surveying the data, we observed that the tags for the key person entity, the property types and the property values are not highly ambiguous, so they can share the same tag set. Our model consists of three main phases, as illustrated in Fig. 1.

1) Phase 1 – Sentence tagger training: This phase receives a training set of sentences as input and generates the tagger model; we used a tagged training set to train a tagger model with CRFs. Pre-processing includes tokenization, word segmentation, chunking, etc. We annotated the training set manually with named entities and person properties, using the 43 tags of 21 labels. Note that some of these properties may themselves be entities such as dates, locations or organizations; unlike in conventional entity recognition, we used the appropriate tags to record whether such an entity belongs to any property type. In this phase, we also extracted and selected the rich feature set obtained from various kinds of knowledge; these features are described in Section IV.B. To improve the tagging results, the Freebase English person name dictionary and our three Vietnamese supporting dictionaries (a Vietnamese person name dictionary, a Vietnamese location dictionary, and a prefix dictionary for people, locations and organizations) were used.

2) Phase 2 – Sentence tagging: The input of this phase is the test set and the output is the set of tagged sentences. In this phase, we use the tagger model obtained in Phase 1 to tag the test set. The test data were also annotated with named entities and person properties in order to evaluate the results; these gold tags are not used during tagging.

3) Phase 3 – Sentence filtering: Using the tagged data obtained in Phase 2, Phase 3 retains only the appropriate sentences. A person property relation always includes the following three elements: the key person entity, the property type (such as other name, date of birth, etc.) and the property value (a specific value of the property relation expressed in words; for example, "May 2nd" might be a value of the date of birth property). The property type can be expressed by words or left implicit, but the other two elements (the key person entity and the property value) must always appear in the sentence. Because of this rule, in this phase we removed every sentence that does not contain any key person entity or any property value.
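The following minimal sketch illustrates the Phase 3 filter described above. The tag conventions it relies on (B-/I- prefixes, OPer as the key person entity label, property-value labels beginning with V) are our assumptions for illustration, based on the tag names in Table II, not a description of the authors' actual implementation.

```java
import java.util.List;

// Hedged sketch of the Phase 3 filter: keep only sentences whose predicted
// tag sequence contains at least one key person entity tag and at least one
// property value tag. The label conventions (OPer = key person entity,
// labels starting with "V" = property values) are assumptions made for
// illustration, following the tag names listed in Table II.
public class SentenceFilter {

    public static boolean keep(List<String> predictedTags) {
        boolean hasKeyPerson = false;
        boolean hasPropertyValue = false;
        for (String tag : predictedTags) {
            if (tag.equals("O")) continue;          // outside any label
            String label = tag.substring(2);        // strip "B-" / "I-"
            if (label.equals("OPer")) hasKeyPerson = true;
            else if (label.startsWith("V")) hasPropertyValue = true;
        }
        return hasKeyPerson && hasPropertyValue;
    }

    public static void main(String[] args) {
        // A sentence with a key person and a birth-date value -> kept
        System.out.println(keep(List.of("B-OPer", "I-OPer", "O", "O", "B-VBornTime", "I-VBornTime")));
        // A sentence with no key person entity -> removed
        System.out.println(keep(List.of("O", "O", "B-VJob", "O")));
    }
}
```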
Figure 1. The proposed model (Phase 1: pre-processing, feature extraction and selection, and CRF training over the training set with supporting dictionaries, producing the CRF model; Phase 2: pre-processing, feature extraction and CRF tagging of the test set; Phase 3: filtering of the tagged data to produce the results).

B. Designing the feature set for the proposed model

Our previous work in 2010 [11] showed that using various kinds of knowledge resources can contribute to improving the results of NLP problems. In this paper, we designed a rich feature set, listed in Table I, by integrating the kinds of features described below.

Features of the context and of the current word itself are used; they are quite similar to the feature sets used in [7, 11]. The research of Cam Tu Nguyen et al. [1] summarized the general structure of Vietnamese word formation (including the structure of syllables, words and Vietnamese new words); based on this summarization, they proposed several types of context predicate templates from which various features are generated. Following this work, we used syllable conjunction, regular expression and Vietnamese syllable detection features. In addition, we used the Freebase person name dictionary (1,397,865 words; http://www.freebase.com/) and our three supporting dictionaries to extract more useful features:
- The Vietnamese person name dictionary has 20,669 words.
- The Vietnamese location dictionary has 18,331 words.
- The prefix dictionary includes person prefixes (such as "ngài" (Mr.), "PGS." (Assoc.), etc.), location prefixes (such as "Quận" (district), "Thành phố" (city), etc.) and organization prefixes (such as "trường đại học" (university), "công ty" (company), etc.). This dictionary has 790 Vietnamese words.

TABLE I. THE PROPOSED FEATURE SET

No | Features type                                                                          | Notation
1  | Current word                                                                           | W0
2  | POS tag of the current word                                                            | POS(W0)
3  | Is the current word lowercase, initial-capitalized or all-capitalized?                 | Is_Lower(0,0), Is_Initial_Cap(0,0), Is_All_Cap(0,0)
4  | Context words                                                                          | Wi (i = -2, -1, 1, 2)
5  | Syllable conjunction                                                                   | Syllable_Conj(-2,2)
6  | Regular expression: tries to capture expressions describing date/time, numbers, marks, etc. | Regex(0,0)
7  | Vietnamese syllable detection                                                          | Is_Valid_Vietnamese_Syllable(0,0)
8  | Is this word a valid entry in the name dictionaries?                                   | dict:name, dict:first_name, dict:vname, dict:vfirst_name
9  | Is the previous word of the considered word a valid entry in the prefix dictionaries?  | prefix:per, prefix:loc, prefix:org
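To make the feature templates of Table I more concrete, the sketch below extracts a few of them for a single token. The dictionary representation, the regular expression and the feature-string format are simplifying assumptions of ours; the paper does not specify these implementation details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified sketch of per-token feature extraction following Table I.
// The regular expression, dictionary representation and feature-string
// format are illustrative assumptions, not the authors' actual code.
public class FeatureExtractor {

    private final Set<String> nameDict;    // e.g., Vietnamese person name dictionary
    private final Set<String> prefixDict;  // person/location/organization prefixes

    public FeatureExtractor(Set<String> nameDict, Set<String> prefixDict) {
        this.nameDict = nameDict;
        this.prefixDict = prefixDict;
    }

    public List<String> extract(List<String> words, List<String> posTags, int i) {
        List<String> f = new ArrayList<>();
        String w = words.get(i);
        f.add("W0=" + w);                                   // current word
        f.add("POS(W0)=" + posTags.get(i));                 // POS of current word
        f.add("Is_Lower=" + w.equals(w.toLowerCase()));     // capitalization features
        f.add("Is_Initial_Cap=" + (!w.isEmpty() && Character.isUpperCase(w.charAt(0))));
        f.add("Is_All_Cap=" + w.equals(w.toUpperCase()));
        for (int d : new int[]{-2, -1, 1, 2}) {             // context words W-2..W2
            int j = i + d;
            if (j >= 0 && j < words.size()) f.add("W" + d + "=" + words.get(j));
        }
        f.add("Regex_Number_Date=" + w.matches("[0-9]+([/\\-][0-9]+)*")); // crude date/number pattern (assumption)
        f.add("dict:vname=" + nameDict.contains(w));        // name dictionary lookup
        if (i > 0) f.add("prefix=" + prefixDict.contains(words.get(i - 1).toLowerCase())); // previous word in prefix dictionary
        return f;
    }

    public static void main(String[] args) {
        FeatureExtractor fe = new FeatureExtractor(
                Set.of("Nguyễn", "Trãi"), Set.of("ngài", "thành phố"));
        List<String> words = List.of("Ngài", "Nguyễn", "Trãi", "sinh", "năm", "1380");
        List<String> pos = List.of("N", "Np", "Np", "V", "N", "M");
        System.out.println(fe.extract(words, pos, 1)); // features for "Nguyễn"
    }
}
```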
V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Experimental setup
• The data set consisted of 2,700 sentences crawled from the Vietnamese Wikipedia (http://vi.wikipedia.org/). These sentences were tagged manually.
• We used 10-fold cross validation in the experiments.
• The comparison used recall, precision and F-measure, both for the overall results and for each property.

B. Experimental results and discussions

1) Experimental results of the whole system: The average results were 84% precision, 82.56% recall and 83.39% F-measure; the best results were 92.19%, 90.22% and 91.19%, respectively. Preliminarily, this is quite a good result, but since other similar studies do not use the same corpus as ours, we cannot compare our results with theirs directly. Although our task and the WePS-2 attribute extraction task [10] both extract person properties, there is no basis for comparison either: the WePS-2 subtask is to extract 18 attribute values of target individuals at document level, and the differences lie not only in the property type set but also in the complexity. At document level, many sub-problems of ambiguity have to be solved (some of them are mentioned in [10]); our work is conducted at sentence level, where ambiguities are fewer and easier to resolve. Thus, the low WePS results show that there are still many complex problems to be solved and that our research must be improved. Nevertheless, the results we achieved are satisfactory, and there is great potential for development; solving this problem well at sentence level is a precondition for solving it well at document level.

2) Experimental results for each tag: The average experimental results for each tag are given in Table II.

TABLE II. EXPERIMENTAL RESULTS FOR EACH TAG

No | Tag            | P (%)  | R (%)  | F (%)
1  | OPer           | 91.35  | 90.33  | 90.84
2  | NickPer        | 89.88  | 90.44  | 90.16
3  | RPer           | 80.46  | 78.65  | 79.54
4  | VBornLoc       | 83.45  | 87.91  | 85.62
5  | VDeadLoc       | 80.35  | 80.09  | 80.22
6  | VHomeLoc       | 93.39  | 91.77  | 92.57
7  | VJobOrg        | 78.25  | 83.69  | 80.88
8  | VJob           | 81.49  | 78.22  | 79.82
9  | VSex           | 90.45  | 87.56  | 88.98
10 | VBornTime      | 83.77  | 90.39  | 86.95
11 | VDeadTime      | 80.40  | 87.28  | 83.70
12 | R_OtherName    | 91.67  | 85.19  | 88.31
13 | R_Relationship | 81.98  | 83.30  | 82.63
14 | R_WhereBorn    | 80.89  | 81.74  | 81.31
15 | R_WhereDead    | 80.23  | 85.36  | 82.72
16 | R_WhenDead     | 85.65  | 85.99  | 85.82
17 | R_Job          | 77.35  | 75.64  | 76.49
18 | R_WhereJob     | 75.92  | 73.21  | 74.54
19 | R_Sex          | 73.29  | 65.30  | 69.06
20 | R_WhenBorn     | 85.75  | 83.22  | 84.47
21 | R_WhenDead     | 76.10  | 72.77  | 74.40

Generally, this is a positive result. The results for property values were often better than those for property types, because property types are sometimes implicit (not expressed by any word) and are influenced by language complexity more than other tags. Moreover, among both property values and property types, the achieved results were uneven across tags: some tags tend to appear in more complex grammatical structures and are therefore harder to find, while tags that take advantage of useful features such as dictionaries and Vietnamese characteristics (e.g., OPer, NickPer, VHomeLoc) achieve better results.
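The per-tag precision, recall and F-measure in Table II follow the standard definitions. The sketch below computes them from per-tag true positive, false positive and false negative counts; whether the counts are collected per token or per labeled span is an assumption of this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Standard per-tag precision / recall / F-measure from TP, FP, FN counts,
// as reported in Table II. Whether counts are collected per token or per
// labeled span is an assumption of this sketch.
public class TagEvaluator {

    private final Map<String, int[]> counts = new HashMap<>(); // label -> {TP, FP, FN}

    public void add(String label, int tp, int fp, int fn) {
        int[] c = counts.computeIfAbsent(label, k -> new int[3]);
        c[0] += tp; c[1] += fp; c[2] += fn;
    }

    public double[] prf(String label) {
        int[] c = counts.getOrDefault(label, new int[3]);
        double p = c[0] + c[1] == 0 ? 0 : (double) c[0] / (c[0] + c[1]); // precision
        double r = c[0] + c[2] == 0 ? 0 : (double) c[0] / (c[0] + c[2]); // recall
        double f = p + r == 0 ? 0 : 2 * p * r / (p + r);                 // F-measure
        return new double[]{p, r, f};
    }

    public static void main(String[] args) {
        TagEvaluator eval = new TagEvaluator();
        eval.add("VHomeLoc", 93, 7, 8); // hypothetical counts for one fold
        double[] prf = eval.prf("VHomeLoc");
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", 100 * prf[0], 100 * prf[1], 100 * prf[2]);
    }
}
```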
3) Performance evaluation: Because Phase 1 (model training) can be done offline, we only measured the processing time of Phase 2 (tagging) and Phase 3 (filtering). On a personal computer with an Intel(R) Core Duo T7700 CPU @ 2.4 GHz, 2.00 GB of RAM and Microsoft Windows 7, with the system programmed in Java (Eclipse SDK), the average processing time for an input sentence was 0.173 seconds. Most previous works did not report their processing time, so there is no basis for comparison, but 0.173 s per sentence on an average personal computer is a good result, suitable for realistic applications.

VI. CONCLUSIONS

In this paper, we have proposed an integrated model that recognizes person entities and extracts the relevant values of a pre-defined set of properties related to each person simultaneously for Vietnamese. The machine learning method CRFs was applied to resolve this problem as a sequential labeling problem. The proposed model consists of three phases: training the CRF model, CRF tagging and filtering. We also exploited a rich feature set built from various kinds of knowledge resources. Experiments were conducted on 2,700 manually annotated sentences, with 10 frequent property types chosen for extraction. The obtained results show that our method is suitable for Vietnamese, with a best result of 92.19% precision, 90.22% recall and 91.19% F-measure. Moreover, the processing time is good enough for realistic applications. In addition, the per-tag evaluation shows that some tags achieve better results because they take advantage of the useful features we proposed.

ACKNOWLEDGEMENTS

This work was partly supported by Vietnam National University, Hanoi research project QG.10.38 and TRIG-B.

REFERENCES

[1] Cam Tu Nguyen, Trung Kien Nguyen, Xuan Hieu Phan, Le Minh Nguyen, and Quang Thuy Ha, "Vietnamese Word Segmentation with CRFs and SVMs: An Investigation", The 20th Pacific Asia Conference on Language, Information, and Computation (PACLIC), 1st-3rd November 2006, Wuhan, China.
[2] Charles Sutton and Andrew McCallum, "An Introduction to Conditional Random Fields for Relational Learning", in Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar, MIT Press, 2006.
[3] Girju R., "Semantic relation extraction and its applications", ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.
[4] Javier Artiles, Andrew Borthwick, Julio Gonzalo, Satoshi Sekine, and Enrique Amigó, "WePS-3 Evaluation Campaign: Overview of the Web People Search Clustering and Attribute Extraction Tasks", in the 3rd Web People Search Evaluation Workshop (WePS 2010).
[5] John D. Lafferty, Andrew McCallum, Fernando C. N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", ICML 2001: 282-289.
[6] Michele Banko, Oren Etzioni, "The Tradeoffs Between Open and Traditional Relation Extraction", ACL 2008: 28-36.
[7] Oanh Thi Tran, Cuong Anh Le, Quang-Thuy Ha and Quynh Hoang Le, "An Experimental Study on Vietnamese POS tagging", International Conference on Asian Language Processing (IALP 2009): 23-27, Dec 7-9, 2009, Singapore.
[8] Rathany Chan Sam, Huong Thanh Le, Thuy Thanh Nguyen, The Minh Trinh, "Relation Extraction in Vietnamese Text Using Conditional Random Fields", AAIRS 2010: 330-339.
[9] Robert Malouf, "A comparison of algorithms for maximum entropy parameter estimation", in Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49-55.
[10] Satoshi Sekine and Javier Artiles, "WePS2 Attribute Extraction Task", in the 2nd Web People Search Evaluation Workshop (WePS 2, 2009).
[11] Tran Thi Oanh, Le Anh Cuong, Ha Quang Thuy, "Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources", Journal of Natural Language Processing, 17(3): 41-60, 2010.