2011 International Conference on Asian Language Processing

Co-reference Resolution in Vietnamese Documents Based on Support Vector Machines

Duc-Trong Le, Mai-Vu Tran
KTLab, College of Technology, Vietnam National University, Hanoi (VNU)
144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
{mr.trongld, vutranmai}@gmail.com

Tri-Thanh Nguyen, Quang-Thuy Ha
KTLab, College of Technology, Vietnam National University, Hanoi (VNU)
144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
{ntthanh, thuyhq}@vnu.edu.vn

Abstract - Co-reference resolution still poses many challenges due to the complexity of the Vietnamese language and the lack of standard Vietnamese linguistic resources. Based on the mention-pair model of Rahman and Ng (2009) and the characteristics of Vietnamese, this paper proposes a model that uses support vector machines (SVM) to resolve co-reference in Vietnamese documents. The corpus used to evaluate the proposed model was constructed from 200 articles in the cultural and social categories of the vnexpress.net newspaper website. In initial experiments, the proposed model achieved 76.51% accuracy, compared with 73.79% for the baseline model with similar features.

Keywords: Co-reference resolution, Vietnamese co-reference, support vector machines

978-0-7695-4554-7/11 $26.00 © 2011 IEEE. DOI 10.1109/IALP.2011.63

I. INTRODUCTION

Co-reference resolution is the task of identifying phrases (in one or more documents) that refer to the same real-world entity or concept. In spite of its simple definition, co-reference resolution is generally considered a difficult NLP task, typically because it involves the use of sophisticated knowledge resources and inference procedures [9, 10]. It has received great attention from researchers and at annual conferences such as ACL and IJCAI. Co-reference information has been shown to be beneficial in many other tasks [15], including information extraction [5], question answering [8], and text summarization [14].

The study of co-reference resolution dates back to the 1960s and 1970s, when the initial approaches were based on experience. Many different approaches have since been proposed, among which machine learning approaches are the most prominent.

In this paper, we focus on building a model for co-reference resolution in Vietnamese documents using SVM, based on the model proposed by Rahman and Ng [13] in 2009. The main idea of the approach is to consider each pair of potential phrases as a semantic relation, where each relation is represented by a feature vector. Using an SVM classifier to determine the label of each feature vector, we can determine whether the two potential phrases are co-referent. Finally, the co-referent pairs are grouped into the same group.

II. RELATED WORK

For English, many approaches have been proposed to solve the co-reference task. In this paper, we review only the most closely related work, on which we base our algorithm.

Ranking based approach: This is a traditional approach based on linguistic and domain knowledge, proposed in 1998 by Mitkov [6]. It is an approach to pronoun resolution with limited knowledge. The input is checked against 10 agreement and antecedent indicators. Each indicator assigns scores to the candidates, and the candidate with the highest score is returned as the antecedent. The most recent version of Mitkov's Anaphora Resolution System (MARS) incorporates several advancements over the original system. MARS was improved to cater for several frequent causes of apparent number disagreement: (i) collective nouns, (ii) NPs whose gender is under-specified, (iii) quantified nouns and indefinite pronouns, and (iv) organisation names. These cases were handled by a combination of gazetteers, the integration of an animacy recognition module, and named entity recognition [7].

Classification based approach: In 1995, Joseph F. McCarthy proposed transforming the co-reference problem into a classification problem and used a decision tree as the classifier. Zoran Dzunic et al. revised this approach in 2006 with some improvements to the matching algorithms [2]. The main idea is to use a decision tree (based on features) to classify the relation between two phrases into one of two classes: co-reference and non-co-reference. The decision tree algorithm is then applied to find co-references in a document. In 2005, Thomas Finley and Thorsten Joachims proposed using SVM as the classifier [4]. In 2009, this approach was inherited and expanded with 39 features by Rahman and Ng [13]. Important features such as pronoun, subject, nested, gender, number, and string match were added. In this approach, the authors considered a pair of an active mention mk and a candidate antecedent mj as a semantic relation. Each relation is represented by a vector constructed from 39 properties. The class associated with a training relation is either positive or negative, depending on whether mj and mk are co-referent or not. After converting all multi-valued features into an equivalent set of binary-valued features, the authors used an SVM classifier (from the SVMlight package) to identify co-references.

Clustering based approach: In 1999, Claire Cardie and Kiri Wagstaff [1] solved the co-reference task by clustering. This approach begins with the assumption that each co-reference group defines a class. It is therefore natural to view the problem as one of partitioning, or clustering, the noun phrases in documents. Intuitively, all the noun phrases used to describe a specific concept will be "near", or related to each other in some way, i.e., their conceptual "distance" will be short. Given a description of each noun phrase and a method for measuring the distance between two noun phrases using 11 features, a clustering algorithm can group noun phrases into groups. Noun phrases whose distance is greater than a clustering radius r are not placed into the same partition, so they are not considered co-references. The distance metric between two noun phrases is defined as:

$$\mathrm{dist}(NP_i, NP_j) = \sum_{f \in F} w_f \cdot \mathrm{incompatibility}_f(NP_i, NP_j)$$

where F is the set of NP features mentioned above; incompatibility_f is a function that returns a value between 0 and 1 indicating the degree of incompatibility between NP_i and NP_j in terms of f; and w_f denotes the relative importance of feature f. The incompatibility functions and their corresponding weights are listed in [1].

For Vietnamese, there are a few published articles related to co-reference, such as the papers of Tru H. Cao and Hien T. Nguyen on named entity disambiguation (NED) [11, 12]. These papers focus only on determining whether named entities (NE) are co-referent; co-reference between an NE and a pronoun, or between an NE and a noun phrase, has not yet been addressed.

III. PROPOSED MODEL

Based on the main idea of the mention-pair model of Rahman and Ng [13] and the characteristics of Vietnamese, we propose the model depicted in Figure 1. The model has three main phases.

[Figure 1: The proposed model. Pre-processing phase: Vietnamese document, sentence tokenizer, raw sentences, POS tagging and NER, potential phrases. Feature vector generation phase: pairing, feature selection, feature vectors. Recognition phase: SVM classifier, co-referent chains.]

- The first phase, the pre-processing phase, focuses on recognizing potential phrases in sentences. The potential phrases include named entities, pronouns, and noun phrases. At the end of this phase, a set of potential phrases is returned.
- In the second phase, the feature vector generation phase, the phrases in the current sentence and the three preceding sentences are paired together. The feature selection module represents each pair by a feature vector.
- In the last phase, the recognition phase, an SVM classifier is used to determine the label of each feature vector. Thereby the co-referent nature of the phrases is determined. Finally, the co-referent pairs are grouped into the same group.

The details of each phase are described in the next subsections.

A. Pre-processing phase

The purpose of this phase is to generate a set of potentially co-referent phrases from the input documents. This phase uses the Vietnam Semantic Web (VSW) toolkit (http://code.google.com/p/vsw/), an open source toolkit that supports sentence tokenization, word segmentation, entity recognition, etc. The phase consists of two steps:

Step 1: The input Vietnamese documents are parsed by the VSW sentence tokenizer tool, which produces a set of raw sentences.
Step 2: The set is pushed through the VSW POS-tagging and VSW NER tools in order to recognize the potentially co-referent phrases, which are used in the next phase.

For example, consider the Vietnamese paragraph: “Bách Dương tên thật Quách Định Sanh, sinh năm 1920 Hà Nam, Trung Quốc Ông nhà văn, nhà thơ, nhà phê bình, nhà nghiên cứu, nhà hoạt động xã hội tích cực …” (“Bach Duong, whose real name is Quach Dinh Sanh, was born in 1920 in Henan, China. He is a writer, a poet, a critic, a researcher, an active social activist, …”). After processing, a set of potential phrases is returned: “Bách Dương”, “Quách Định Sanh”, “Hà Nam”, “Trung Quốc” in the first sentence, and “Ông”, “nhà văn”, “nhà thơ”, “nhà phê bình”, “nhà nghiên cứu”, “nhà hoạt động xã hội” (he, writer, poet, critic, researcher, social activist) in the second sentence.

B. Feature vector generation phase

Based on the characteristics of the Vietnamese language, we use only 13 of the 39 features from the model of Rahman and Ng. In addition, we propose four new features: subject (SUBJECT_i), job (JOB_i), both NPs are subjects (BOTH_SUBJECT), and token distance (TOK_DISTANCE). In total, our model uses 17 features, which are shown in Tables 1 and 2.

Table 1: Features of a potential phrase, NPi

Feature      Meaning
TYPE_i       Type of NPi: named entity, noun phrase, or pronoun
SUBJECT_i    Whether NPi is a subject
JOB_i        Whether NPi is a job
GENDER_i     Gender of NPi: MALE or FEMALE, otherwise -1
NUMBER_i     Number of NPi: PLURAL or not

Table 2: Features describing the relation between NP1 and NP2

Feature         Value          Meaning
BOTH_SUBJECT    0, 1           Whether NP1 and NP2 are both subjects
BOTH_NE         0, 1           Whether NP1 and NP2 are the same named entity
BOTH_N          0, 1           Whether NP1 and NP2 are the same noun
BOTH_PRO        0, 1           Whether NP1 and NP2 are the same pronoun
STR_MATCH       0, 1           Whether NP1 and NP2 are the same string
SUB_STR_MATCH   0, 1           Whether one phrase is a substring of the other
GENDER          -1, 0, 1       Whether NP1 and NP2 have the same gender
NUMBER          -1, 0, 1       Whether NP1 and NP2 have the same number
AGREEMENT       -1, 0, 1       Whether NP1 and NP2 have the same gender and number
APPOSITIVE      0, 1           Whether NP1 and NP2 have an appositive relation
SEN_DISTANCE    0, 1, 2, …     The sentence distance between NP1 and NP2
TOK_DISTANCE    -1, 0, 1, …    The token distance between NP1 and NP2 if they are in the same sentence; -1 if the two NPs are from different sentences

The purpose of this phase is to generate feature vectors for pairs of phrases for later classification. There are two steps in this phase.

The pairing step: We put potential phrases into pairs based on the observations below:

- In co-referent pairs, a named entity is often mentioned before its co-references in the text. In other words, a pronoun or a noun phrase can be paired with a preceding named entity. From the above example, we can produce the pairs (“Bách Dương”, “Quách Định Sanh”), (“Bách Dương”, “Ông”), (“Bách Dương”, “nhà văn”), etc.
- In Vietnamese documents, in order to keep semantic coherence among the sentences of a paragraph, co-references are usually used within a range of no more than three sentences. In other words, a noun phrase is not paired with a named entity occurring in the fourth preceding sentence.
- In some cases, two co-referent phrases occur in two sentences separated by a sentence that does not contain potential phrases. For example: “Bé Huy quận bị điện giật giẫm lên sợi dây điện mắc nối từ ổ cắm sang quạt máy Nguyên nhân sau xác định dây bị hở lõi đồng Vì tai nạn bé phải nằm viện tuần” (“The kid Huy in District was electrocuted when stepping on wires from the socket to the fans. The cause was later identified as the exposed copper core wire. It took her a week in hospital because of the accident.”) In this example, “Bé Huy” and “Bé” are separated by the sentence “Nguyên nhân sau xác định dây bị hở lõi đồng”, which does not contain potential phrases.

We used the following strategy to produce pairs of phrases, which consists of two sub-steps:

Step 1: Starting from the result set of the first phase, process the sentences from the end of each document upwards.
Step 2: For each sentence, pair the potential phrases in this sentence with those in its three preceding sentences.

The feature selection step: Based on the result of the previous step, each pair (NP1, NP2) is represented by a feature vector of 22 properties: five properties of NP1, five properties of NP2 (as shown in Table 1), and 12 properties describing the relation between NP1 and NP2 (as shown in Table 2). These features are extracted automatically by a module.

C. Recognition phase

The last phase includes two steps:

Step 1: The feature vectors from the previous phase are classified to determine their labels, i.e., their co-referent nature. A pair of phrases is assigned the label 1 if the two phrases are considered co-referent; otherwise it is assigned the negative label.
Step 2: The co-referent phrases are grouped together. For example, if (A, B) and (B, C) are two pairs classified as 1, then A, B and C are put into the same group.

IV. EXPERIMENTS AND DISCUSSIONS

Besides the program implementing our proposed model (denoted PModel), we also built a baseline model (denoted BModel) following the mention-pair model of Rahman and Ng [13], which pairs each phrase with all of its preceding phrases.
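The difference between the two pairing strategies, and the grouping of positively classified pairs into chains, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names make_pairs and group_chains, the list-of-lists input format, the use of plain strings to identify phrases, and the choice to skip sentences without potential phrases when counting the three-sentence window are all hypothetical choices made for the example.

```python
from itertools import combinations


def make_pairs(sentences, window=3):
    """Pair potential phrases, antecedent first.

    sentences: list of sentences, each a list of potential phrases (strings).
    window:    number of preceding sentences to pair with (3 mimics PModel);
               None pairs with all preceding sentences (mimics BModel).
    """
    # Assumption: sentences with no potential phrases do not count
    # toward the window, so they are dropped before pairing.
    kept = [s for s in sentences if s]
    pairs = []
    # Process sentences from the end of the document upwards.
    for i in range(len(kept) - 1, -1, -1):
        # Pairs inside the current sentence (earlier phrase first).
        pairs.extend(combinations(kept[i], 2))
        start = 0 if window is None else max(0, i - window)
        for j in range(start, i):
            for antecedent in kept[j]:
                for mention in kept[i]:
                    pairs.append((antecedent, mention))
    return pairs


def group_chains(positive_pairs):
    """Merge pairs labelled as co-referent into groups (union-find).

    Simplification: phrases are identified by their string, so two
    different mentions with the same text are treated as one node.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in positive_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for phrase in parent:
        groups.setdefault(find(phrase), set()).add(phrase)
    return list(groups.values())


if __name__ == "__main__":
    doc = [["A"], ["B"], [], ["C"], ["D"], ["E"]]
    print(len(make_pairs(doc, window=3)), "pairs with a 3-sentence window")
    print(len(make_pairs(doc, window=None)), "pairs with all preceding phrases")
    print(group_chains([("A", "B"), ("B", "C"), ("D", "E")]))
```

Restricting the window reduces the number of candidate pairs that must be labelled and classified, which is in line with the smaller number of vectors produced by PModel than by BModel.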
The details of our experiments are described below.

Building the classifier: After surveying Vietnamese data domains, we found that articles in the cultural and social categories of the Vnexpress.net newspaper contain a reasonable number of co-reference relations in comparison with other categories. These articles are therefore suitable for constructing the training data for our experiments. After crawling, we selected 200 articles from this source to build the corpus. The potential phrases in the corpus were manually tagged to ensure that all potential phrases in the text are recognized after the first phase. In the next step, the potential phrases were paired and feature vectors were generated following the strategy described in Section III.B. About 2500 vectors were generated by PModel, compared with over 3000 vectors by BModel. The labels of the feature vectors were tagged manually according to whether the two phrases are co-referent. Finally, we used the libSVM toolkit (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) to build the classifier from the labeled vectors.

Evaluation and discussion: We used 10-fold cross-validation to evaluate the two classifiers of PModel and BModel. Under this method, the training data is randomly divided into 10 approximately equal parts. Each part in turn is used as the test set while the remaining parts are used as the training set. The average result is taken as the final evaluation result. The libSVM toolkit already supports this method. In the second experiment, we randomly selected 10 documents (from the 200 documents) as the test set. In this experiment we also evaluated precision (P), recall (R), and the F1 measure. The results of the two experiments are shown in Tables 3 and 4.

Table 3: Results of cross-validation and second experiment (accuracy, %)

Experiment                              PModel   BModel
10-fold cross-validation with LibSVM    76.51    73.79
Second experiment                       77.83    71.12

Table 4: Detailed results of the second experiment (%)

Measure      PModel   BModel
Precision    77.83    71.12
Recall       34.34    32.06
F1           45.9     42.49

As shown in Table 3, in the initial experiments the proposed model achieved 76.51% accuracy, while the accuracy of the baseline model is 73.79% with similar features. This suggests that our approach is reasonable and applicable to Vietnamese data. In addition, the results of the second experiment in Table 4 reinforce that the proposed model is feasible. However, from our observation, the corpus used in our experiments is rather small and does not cover all cases, which may affect the results of our model. We hope that when the training data is large enough, the proposed model can achieve a higher accuracy.

V. CONCLUSION AND FUTURE WORKS

In this paper, we proposed and built a model to resolve the co-reference task in Vietnamese documents using support vector machines (SVM). The results show that our approach is reasonable and feasible. However, the proposed model still has some limitations: a lack of training cases, an evaluation whose input depends on the output of the POS-tagging and named entity recognition phase, and the absence of a user interface. Hence, our future work will focus on building a much larger training dataset, investigating and applying semantic relation features in Vietnamese to improve the accuracy of the model, and building a user interface to interact with users.

ACKNOWLEDGEMENT

This work is partly supported by the research project No. QG.10.38 granted by Vietnam National University, Hanoi (VNUH), the TRIG project, and the Nafosted foundation.

REFERENCES

[1] C. Cardie, K. Wagstaff: Noun Phrase Coreference as Clustering. Empirical Methods in Natural Language Processing Conference (EMNLP), 1999, pp. 3-5.
[2] Z. Dzunic, S. Momcilovic, B. Todorovic: Coreference Resolution Using Decision Tree. Neural Network Applications in Electrical Engineering, 2006, pp. 6-10.
[3] P. Denis, J. Baldridge: A Ranking Approach to Pronoun Resolution. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1-2.
[4] T. Finley, T. Joachims: Supervised Clustering with Support Vector Machines. Proceedings of the 22nd International Conference on Machine Learning, Germany, 2005.
[5] J. F. McCarthy: A Trainable Approach to Coreference Resolution for Information Extraction, 1996.
[6] R. Mitkov: Robust Pronoun Resolution with Limited Knowledge. The 17th International Conference on Computational Linguistics (COLING), 1998, pp. 1-3.
[7] R. Mitkov, R. Evans, C. Orasan, L. A. Ha, V. Pekar: Anaphora Resolution: To What Extent Does It Help NLP Applications? 6th DAARC, 2007, pp. 179-190.
[8] T. S. Morton: Using Coreference for Question Answering. Proceedings of the 8th Text Retrieval Conference, 1999.
[9] V. Ng: Machine Learning for Coreference Resolution: From Local Classification to Global Ranking. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005, pp. 1-3.
[10] V. Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), 2010, pp. 1-8.
[11] H. T. Nguyen, T. H. Cao: A Knowledge-Based Approach to Named Entity Disambiguation in News Articles. Australian Conference on Artificial Intelligence, 2007.
[12] H. T. Nguyen, T. H. Cao: Named Entity Disambiguation: A Hybrid Statistical and Rule-Based Incremental Approach. Asian Semantic Web Conference (ASWC), 2008.
[13] A. Rahman, V. Ng: Supervised Models for Coreference Resolution. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-09), 2009.
[14] J. Steinberger, M. Poesio, M. A. Kabadjov, K. Jezek: Two Uses of Anaphora Resolution in Summarization. Information Processing and Management: an International Journal, 2007.
[15] Y. Versley, S. P. Ponzetto, M. Poesio: BART: A Modular Toolkit for Coreference Resolution. 6th Language Resources and Evaluation Conference (LREC), 2008, pp. 1-6.