Co-reference Resolution in Vietnamese Documents Based on
Support Vector Machines
Duc-Trong Le, Mai-Vu Tran
KTLab, College of Technology, Vietnam National University, Hanoi (VNU)
144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
{mr.trongld, vutranmai}@gmail.com
Tri-Thanh Nguyen, Quang-Thuy Ha
KTLab, College of Technology, Vietnam National University, Hanoi (VNU)
144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
{ntthanh, thuyhq}@vnu.edu.vn
Abstract - Co-reference resolution in Vietnamese still poses
many challenges because of the complexity of the language
and the lack of standard Vietnamese linguistic resources.
Based on the mention-pair model of Rahman and Ng (2009) and
the characteristics of Vietnamese, this paper proposes a
model using support vector machines (SVM) to resolve
co-reference in Vietnamese documents. The corpus used to
evaluate the proposed model was constructed from 200
articles in the cultural and social categories of the
vnexpress.net news website. In initial experiments, the
proposed model achieved 76.51% accuracy, compared with
73.79% for the baseline model using similar features.
Keywords: Co-reference resolution, Vietnamese co-reference,
support vector machines
I. INTRODUCTION
Co-reference resolution is the task of identifying
phrases (in one or more documents) that refer to the same
real-world entity or concept. In spite of its simple
definition, co-reference resolution is generally considered
a difficult NLP task, typically because it involves the use
of sophisticated knowledge resources and inference
procedures [9, 10]. It has received great attention from
researchers and at annual conferences around the world,
such as ACL and IJCAI. Co-reference information has
been shown to be beneficial in many other tasks [15],
including information extraction [5], question answering
[8] and text summarization [14].
The study of co-reference resolution dates back to the
1960s-1970s, when the initial approaches were based on
experience (heuristics). Many different approaches have
since been proposed to solve the problem, but machine
learning approaches are the most prominent. In this paper,
we focus on building a model for co-reference resolution in
Vietnamese documents using SVM, based on the model proposed
by Rahman and Ng [13] in 2009. The main idea of the approach
is to consider each pair of potential phrases as a semantic
relation, where each relation is represented by a feature
vector. By using an SVM classifier to determine the label of
each feature vector, we can determine whether the two
potential phrases are co-referent. Finally, the co-referent
pairs are grouped into the same group.
II. RELATED WORK
For English, many approaches have been proposed to solve
the co-reference task. In this paper, however, we only
review the most closely related work, on which we base
our algorithm.
Ranking based approach: This is a traditional approach based on linguistic and domain knowledge, proposed in 1998 by Mitkov [6]. It is an approach to pronoun resolution with limited knowledge. The input is checked against 10 agreement and antecedent indicators. Candidates are assigned scores by each indicator, and the candidate with the highest score is returned as the antecedent. The most recent version of Mitkov's Anaphora Resolution System (MARS) incorporates several advances over the original system. MARS was improved to cater for several frequent causes of apparent number disagreement. These consist of (i) collective nouns, (ii) NPs whose gender is under-specified, (iii) quantified nouns/indefinite pronouns, and (iv) organisation names. These cases were handled by a combination of gazetteers, the integration of
an animacy recognition module, and named entity
recognition [7].
Classification based approach: In 1995, Joseph F. McCarthy proposed transforming the co-reference problem into a classification problem, and used a decision tree as the classifier. Zoran Dzunic et al. revised this approach in 2006 with some improvements in the matching algorithms [2]. The main idea of this approach is to use a decision tree (based on 9 features) to classify the relation between phrases into two classes: co-reference and non-coreference. The decision tree is then applied to find co-references in a document.
In 2005, Thomas Finley and Thorsten Joachims proposed using SVM as the classifier [4]. In 2009, this approach was adopted and expanded to 39 features by Rahman and Ng [13]. Some important features, such as pronoun, subject, nested, gender, number and string match, were added. In this approach, the authors considered a pair of an active mention m_k and a candidate antecedent m_j as a semantic relation. Each relation is represented by a vector constructed from 39 properties. The class associated with a training relation is either positive or negative, depending on whether m_j and m_k are co-referent or not. After converting all multi-valued features into an equivalent set of binary-valued features, the authors used an SVM classifier (from the SVMlight package) to identify co-references.
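The conversion of a multi-valued feature into binary-valued features can be illustrated with a minimal sketch. The function name and the example value set below are hypothetical and chosen only for illustration; they are not taken from [13].

```python
def binarize(value, allowed_values):
    """Expand one multi-valued feature into a list of 0/1 indicator features."""
    return [1 if value == v else 0 for v in allowed_values]


# Hypothetical three-valued feature (illustrative values only).
GENDER_VALUES = ["male", "female", "unknown"]

# Each possible value becomes its own binary feature.
print(binarize("female", GENDER_VALUES))
```

Each original feature with n possible values thus contributes n binary dimensions to the vector given to the classifier.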
Clustering based approach: In 1999, Claire Cardie
and Kiri Wagstaff [1] solved the co-reference task by clustering. This approach begins with the assumption that each co-reference group defines a class. Therefore, it is natural to view the problem as one of partitioning, or clustering, the noun phrases in documents. Intuitively, all the noun phrases used to describe a specific concept will be
"near" or related to each other in some way, i.e. their conceptual "distance" will be short. Given a description
2011 International Conference on Asian Language Processing
of each noun phrase and a method for measuring the
distance between two noun phrases using 11 features,
a clustering algorithm can group noun phrases into
groups. Noun phrases whose distance is greater than a
clustering radius r are not placed into the same partition,
so they are not considered co-referent. The distance
metric between two noun phrases is defined as:
\[
\mathrm{dist}(NP_i, NP_j) = \sum_{f \in F} w_f \cdot \mathrm{incompatibility}_f(NP_i, NP_j)
\]
where F is the set of NP features mentioned above;
incompatibility_f is a function that returns a value between
0 and 1 indicating the degree of incompatibility between
NP_i and NP_j in terms of f; and w_f denotes the relative
importance of feature f. The incompatibility functions and
their corresponding weights are listed in [1].
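The distance metric and the clustering radius test can be sketched as follows. This is only an illustration: the two incompatibility functions, their weights and the dictionary representation of a noun phrase are assumptions made here, not the 11 features or weights listed in [1].

```python
def distance(np_i, np_j, features):
    """Weighted sum of per-feature incompatibilities between two noun phrases.

    `features` is a list of (weight, incompatibility_function) pairs, where
    each function returns a value between 0 and 1.
    """
    return sum(w * incompat(np_i, np_j) for w, incompat in features)


# Two hypothetical incompatibility functions (illustrative only).
def number_mismatch(a, b):
    return 0.0 if a["number"] == b["number"] else 1.0


def head_mismatch(a, b):
    return 0.0 if a["head"] == b["head"] else 1.0


FEATURES = [(2.0, number_mismatch), (1.0, head_mismatch)]  # assumed weights


def may_share_cluster(np_i, np_j, radius):
    """NPs farther apart than the clustering radius r are never grouped."""
    return distance(np_i, np_j, FEATURES) <= radius


a = {"head": "writer", "number": "singular"}
b = {"head": "poet", "number": "singular"}
print(distance(a, b, FEATURES), may_share_cluster(a, b, radius=1.5))
```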
For Vietnamese, there are a few published articles
related to co-reference, such as the papers of Tru H. Cao
and Hien T. Nguyen on named entity disambiguation
(NED) [11, 12]. In these papers, the authors focused only
on determining co-reference between named entities (NE);
co-reference between an NE and a pronoun, or between an
NE and a noun phrase, has not been addressed yet.
III. PROPOSED MODEL
Based on the main idea of the mention-pair model of
Rahman and Ng [13] and the characteristics of Vietnamese,
we propose a model as depicted in Figure 1. The proposed
model consists of three main phases.
• The first phase, the pre-processing phase, focuses
on recognizing potential phrases in sentences. The
potential phrases include named entities, pronouns
and noun phrases. At the end of this phase, a set of
potential phrases is returned.
• The second phase, the feature vector generation
phase, pairs the phrases in the current sentence and
the three preceding sentences. The feature selection
module represents each pair by a feature vector.
• The last phase, the recognition phase, uses an SVM
classifier to determine the label of each feature
vector, and thereby the co-referent nature of the
phrases. Finally, the co-referent pairs are grouped into the same group.
The details of each phase are described in the next subsections.
A. Pre-processing phase
The purpose of this phase is to generate a set of potentially co-referent phrases from the input documents. This phase uses the Vietnam Semantic Web toolkit (VSW, see footnote 1), an open source toolkit that supports sentence tokenization, word segmentation, entity recognition, etc. This phase is conducted in two steps:
Step 1: The input Vietnamese documents are parsed by the VSW sentence tokenizer tool. A set of raw sentences is produced.
Step 2: The set is passed through the VSW POS-tagging and VSW NER tools in order to recognize the potentially co-referent phrases, which are used in the next phase.
For example, consider the Vietnamese paragraph:
“Bách Dương tên thật là Quách Định Sanh, sinh
năm 1920 tại Hà Nam, Trung Quốc. Ông là nhà văn, nhà thơ, nhà phê bình, nhà nghiên cứu, nhà hoạt động
xã hội tích cực …” (“Bach Duong, whose real name is
Quach Dinh Sanh, was born in 1920 in Henan, China. He is a writer, a poet, a critic, a researcher, an active social activist, …”)
After processing, a set of potential phrases is returned:
“Bách Dương”, “Quách Định Sanh”, “Hà Nam”, “Trung
Quốc” in the first sentence, and “Ông”, “nhà văn”, “nhà
thơ”, “nhà phê bình”, “nhà nghiên cứu”, “nhà hoạt động
xã hội” (he, writer, poet, critic, researcher, social activist) in the second sentence.
B. Feature vector generation phase
Based on the characteristics of the Vietnamese language, we use only 13 of the 39 features from the model of Rahman and Ng. In addition, we propose 4 new features: subject (SUBJECT_i), job (JOB_i), both NPs are subjects (BOTH_SUBJECT), and token distance (TOK_DISTANCE).
In total, our model uses 17 features, which are shown in Tables 1 and 2.
1 http://code.google.com/p/vsw/
Figure 1: The proposed model. A Vietnamese document passes through the pre-processing phase (sentence tokenizer, raw sentences, POS tagging and NER) to yield potential phrases, then through the feature vector generation phase (pairing, feature selection) to yield feature vectors, and finally through the recognition phase (SVM classifier) to yield co-referent chains.
Table 1: Features of a potential phrase, NPi
Feature | Values | Description
TYPE_i | 1, 2, 3 | Type of NPi: 1 if NPi is a named entity, 2 if NPi is a noun phrase, 3 if NPi is a pronoun
SUBJECT_i | 0, 1 | 1 if NPi is a subject, otherwise 0
JOB_i | 0, 1 | 1 if NPi is a job, otherwise 0
GENDER_i | -1, 0, 1 | Gender of NPi: 1 if MALE, 0 if FEMALE, otherwise -1
NUMBER_i | 1, 2 | Number of NPi: 2 if PLURAL, otherwise 1
Table 2: Features describing the relation between NP1 and NP2
Feature | Values | Description
SUB_STR_MATCH | 0, 1 | 1 if one phrase is a substring of the other, otherwise 0
TOK_DISTANCE | -1, 0, 1, … | The token distance between NP1 and NP2 if they are in the same sentence; -1 if the two NPs are from different sentences
The purpose of this phase is to generate feature
vectors for pairs of phrases for later classification. There
are two steps in this phase.
The pairing step: We put potential phrases into pairs
based on the three observations below:
• In co-referent pairs, a named entity is often
mentioned before its co-references in the text. In
other words, a pronoun or a noun phrase can be
paired with a preceding named entity. From the
above example, we can produce the pairs (“Bách
Dương”, “Quách Định Sanh”), (“Bách Dương”,
“Ông”), (“Bách Dương”, “nhà văn”), etc.
• In Vietnamese documents, in order to keep
semantic coherence among the sentences of a
paragraph, co-references are usually used within a
range of no more than three sentences. In other
words, a noun phrase is not paired with a named
entity occurring in the fourth preceding sentence.
• In some cases, two co-referent phrases occur in
two sentences separated by a sentence that does
not contain any potential phrase.
For example: “Bé Huy ở quận 7 bị điện giật khi giẫm
lên sợi dây điện mắc nối từ ổ cắm sang chiếc quạt máy.
Nguyên nhân sau đó được xác định vì dây bị hở lõi đồng.
Vì tai nạn đó bé phải nằm viện mất một tuần” (“The kid
Huy in District 7 was electrocuted when stepping on the
wire running from the socket to the electric fan. The cause
was later identified as an exposed copper core in the wire.
Because of that accident, the kid had to stay in hospital
for a week.”)
In this example, “Bé Huy” and “bé” are separated by the sentence “Nguyên nhân sau đó được xác định vì dây
bị hở lõi đồng”, which does not contain any potential phrase.
We used the following strategy, consisting of two sub-steps, to produce pairs of phrases:
Step 1: Taking the result set of the first phase, we process the sentences of each document from the last sentence upwards.
Step 2: For each sentence, we pair the potential phrases in this sentence with those in its three preceding
sentences.
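The pairing strategy above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name and data layout are hypothetical, sentences without potential phrases are assumed not to count towards the three-sentence window (one reading of the third observation), and the phrase-type constraint of the first observation is omitted for brevity.

```python
def generate_pairs(sentences, window=3):
    """Pair phrases within a sentence and with the `window` preceding sentences.

    `sentences` is a list of lists of phrase strings, in document order.
    Each pair is ((sentence_index, antecedent), (sentence_index, mention)).
    Assumption: sentences with no potential phrase are skipped entirely.
    """
    # Keep only sentences that contain at least one potential phrase.
    non_empty = [(idx, phrases) for idx, phrases in enumerate(sentences) if phrases]
    pairs = []
    for pos in range(len(non_empty) - 1, -1, -1):  # from the last sentence upwards
        sent_idx, phrases = non_empty[pos]
        # Pairs inside the current sentence, earlier phrase first.
        for k in range(1, len(phrases)):
            for j in range(k):
                pairs.append(((sent_idx, phrases[j]), (sent_idx, phrases[k])))
        # Pairs with phrases from up to `window` preceding sentences.
        for prev_idx, prev_phrases in non_empty[max(0, pos - window):pos]:
            for antecedent in prev_phrases:
                for mention in phrases:
                    pairs.append(((prev_idx, antecedent), (sent_idx, mention)))
    return pairs


# The "Bé Huy" example: the middle sentence has no potential phrase.
print(generate_pairs([["Bé Huy"], [], ["bé"]]))
```

Limiting the window keeps the number of generated vectors lower than pairing each phrase with all of its preceding phrases.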
The feature selection step: Based on the result of the
previous step, each pair (NP1, NP2) is represented by a
feature vector containing 22 properties: 5 properties of
NP1, 5 properties of NP2 (as shown in Table 1), and 12
properties describing the relation between NP1 and NP2
(as shown in Table 2). These features are extracted automatically by a module.
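The layout of a feature vector can be sketched as below. This is a partial, hypothetical illustration: it covers the five per-phrase features of Table 1 and only three relation features named in the text (SUB_STR_MATCH, BOTH_SUBJECT, TOK_DISTANCE), so it yields 13 values rather than 22. The `Phrase` class, its `sentence` and `token` fields, the case-insensitive substring test and the use of an absolute token offset difference are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    text: str
    np_type: int   # 1 named entity, 2 noun phrase, 3 pronoun
    subject: int   # 1 if the phrase is a subject, otherwise 0
    job: int       # 1 if the phrase is a job, otherwise 0
    gender: int    # 1 male, 0 female, otherwise -1
    number: int    # 2 plural, otherwise 1
    sentence: int  # index of the containing sentence (assumed field)
    token: int     # token offset inside the sentence (assumed field)


def phrase_features(p):
    """The five Table 1 features of a single phrase."""
    return [p.np_type, p.subject, p.job, p.gender, p.number]


def pair_features(p1, p2):
    """Three of the relation features; the others are omitted in this sketch."""
    a, b = p1.text.lower(), p2.text.lower()
    sub_str_match = 1 if (a in b or b in a) else 0
    both_subject = 1 if (p1.subject == 1 and p2.subject == 1) else 0
    same_sentence = p1.sentence == p2.sentence
    tok_distance = abs(p1.token - p2.token) if same_sentence else -1
    return [sub_str_match, both_subject, tok_distance]


def feature_vector(p1, p2):
    return phrase_features(p1) + phrase_features(p2) + pair_features(p1, p2)


np1 = Phrase("Bé Huy", 1, 1, 0, -1, 1, sentence=0, token=0)
np2 = Phrase("bé", 2, 1, 0, -1, 1, sentence=2, token=3)
print(feature_vector(np1, np2))
```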
C. Recognition phase
The last phase includes two steps:
Step 1: The set of feature vectors from the previous phase is classified to determine the label, i.e. the co-referent nature, of each pair. A pair of phrases is assigned label 1 if the two phrases are considered to be co-referent; otherwise it is assigned 0.
Step 2: The co-referent phrases are grouped together. For example, if (A, B) and (B, C) are two pairs classified as 1, then A, B and C are put into the same group.
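The grouping in Step 2 amounts to taking the transitive closure of the pairs labelled 1. A minimal sketch using a union-find structure is given below; the paper does not state which data structure was used, so this choice and the function name are assumptions.

```python
def group_coreferent(pairs):
    """Merge pairs labelled 1 into groups (transitive closure via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# (A, B) and (B, C) are co-referent pairs, so A, B and C share one group.
print(group_coreferent([("A", "B"), ("B", "C"), ("D", "E")]))
```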
IV. EXPERIMENTS AND DISCUSSIONS
In addition to the program implementing our proposed
model (denoted PModel), we built a baseline model
(denoted BModel) following the mention-pair model of
Rahman and Ng [13], in which each phrase is paired with
all of its preceding phrases. The details of our
experiments are described below.
Building the classifier: Through a survey of Vietnamese
data, we found that articles in the cultural and social
categories of the Vnexpress.net newspaper contain a
reasonable number of co-reference relations in comparison
with other categories. These articles are therefore
suitable for constructing the training data for our
experiments. After crawling, we selected 200 articles from
this source to build the corpus. The potential phrases in
the corpus were manually tagged to ensure that every
potential phrase in the text is available after the first
phase. In the next step, the potential phrases were paired
and feature vectors were generated following the strategy
described in subsection III.B.
About 2500 vectors were generated by PModel, compared
with over 3000 vectors by BModel. The labels of the
feature vectors were manually tagged: the label is 1 if
the two phrases are co-referent, and 0 otherwise. Finally,
we used the libSVM toolkit to build the classifier from
the labeled vectors.
Evaluation and discussion: We used the 10-fold
cross-validation method to evaluate the two classifiers of
PModel and BModel. Under this method, the training
data is randomly divided into 10 approximately equal
parts. Each part, in turn, is used as the test set while the
remaining 9 parts are used as the training set. The
average result is used as the final evaluation result. The
libSVM toolkit (see footnote 2) already supports this method.
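The 10-fold procedure can be sketched in plain Python as below. This is an illustration of the evaluation scheme only: the experiments relied on libSVM's built-in cross-validation, whereas here the helper names are hypothetical and a majority-label predictor stands in for the SVM so that the sketch stays self-contained. It assumes there are at least k labelled vectors.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k approximately equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(vectors, labels, train_fn, k=10, seed=0):
    """Average accuracy over k folds; each fold is used once as the test set."""
    folds = k_fold_indices(len(vectors), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predict = train_fn([vectors[j] for j in train_idx],
                           [labels[j] for j in train_idx])
        correct = sum(predict(vectors[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def majority_trainer(train_vectors, train_labels):
    """Stand-in for the SVM: always predicts the most frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda vector: majority


vectors = [[i] for i in range(50)]
labels = [1] * 50
print(cross_validate(vectors, labels, majority_trainer))
```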
In the second experiment, we randomly selected 10
documents (from the 200 documents) as the test set. In this
experiment we also evaluated precision (P), recall (R) and
the F1 measure. The results of the two experiments are
shown in Table 3 and Table 4.
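For reference, precision, recall and F1 over binary pair labels can be computed as in the sketch below. It assumes the measures are taken over the 0/1 labels of phrase pairs; the paper does not state whether a pair-based or a chain-based scoring scheme was used, and the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Pairwise P, R and F1 for the co-referent class (label 1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy labels for five pairs (not data from the experiments).
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1]
print(precision_recall_f1(gold, predicted))
```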
Table 3: Results of cross-validation and second experiment
Experiment | PModel accuracy (%) | BModel accuracy (%)
10-fold cross-validation with LibSVM | 76.51 | 73.79
Table 4: Detailed results of the second experiment
From Table 3, the proposed model achieved 76.51%
accuracy in the initial experiments, while the baseline
model achieved 73.79% with similar features. This suggests
that our approach is reasonable and applicable to
Vietnamese data. In addition, the results of the second
experiment in Table 4 reinforce that the proposed model is
feasible. However, the corpus used in our experiments is
rather small and does not cover all cases, which may affect
the results of our model. We hope that when the training
data is large enough, the proposed model can achieve a
higher accuracy.
2 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
V. CONCLUSION AND FUTURE WORKS
In this paper, we proposed and built a model to resolve co-reference in Vietnamese documents using support vector machines (SVM). The results indicate that our approach is reasonable and feasible. However, the proposed model still has some limitations: a shortage of training cases, an evaluation that depends on the output of the POS-tagging and named entity recognition tools, and the lack of a user interface. Hence, our future work will focus on building a much larger training dataset, investigating and applying semantic relation features in Vietnamese to improve the accuracy of the model, and building a user interface.
ACKNOWLEDGEMENT
This work is partly supported by research project No. QG.10.38 granted by Vietnam National University, Hanoi (VNUH), the TRIG project and the Nafosted foundation.
REFERENCES
[1] C. Cardie, K. Wagstaff: Noun Phrase Coreference as Clustering. Empirical Methods in Natural Language Processing Conference (EMNLP), 1999, pp. 3-5.
[2] Z. Dzunic, S. Momcilovic, B. Todorovic: Coreference Resolution Using Decision Tree. Neural Network Applications in Electrical Engineering, 2006, pp. 6-10.
[3] P. Denis, J. Baldridge: A Ranking Approach to Pronoun Resolution. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1-2.
[4] T. Finley, T. Joachims: Supervised Clustering with Support Vector Machines. Proceedings of the 22nd International Conference on Machine Learning, Germany, 2005.
[5] J. F. McCarthy: A Trainable Approach to Coreference Resolution for Information Extraction, 1996.
[6] R. Mitkov: Robust Pronoun Resolution with Limited Knowledge. The 17th International Conference on Computational Linguistics (COLING), 1998, pp. 1-3.
[7] R. Mitkov, R. Evans, C. Orasan, L. A. Ha, V. Pekar: Anaphora Resolution: To What Extent Does It Help NLP Applications? 6th DAARC, 2007, pp. 179-190.
[8] T. S. Morton: Using Coreference for Question Answering. Proceedings of the 8th Text Retrieval Conference, 1999.
[9] V. Ng: Machine Learning for Coreference Resolution: From Local Classification to Global Ranking. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005, pp. 1-3.
[10] V. Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), 2010, pp. 1-8.
[11] H. T. Nguyen, T. H. Cao: A Knowledge-Based Approach to Named Entity Disambiguation in News Articles. Australian Conference on Artificial Intelligence, 2007.
[12] H. T. Nguyen, T. H. Cao: Named Entity Disambiguation: A Hybrid Statistical and Rule-Based Incremental Approach. Asian Semantic Web Conference (ASWC), 2008.
[13] A. Rahman, V. Ng: Supervised Models for Coreference Resolution. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-09), 2009.
[14] J. Steinberger, M. Poesio, M. A. Kabadjov, K. Jezek: Two Uses of Anaphora Resolution in Summarization. Information Processing and Management: an International Journal, 2007.
[15] Y. Versley, S. P. Ponzetto, M. Poesio: BART: A Modular Toolkit for Coreference Resolution. 6th Language Resources and Evaluation Conference (LREC), 2008, pp. 1-6.