

Co-reference Resolution in Vietnamese Documents Based on Support Vector Machines

Duc-Trong Le, Mai-Vu Tran KTLab, College of Technology Vietnam National University, Hanoi (VNU)

144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

{mr.trongld, vutranmai}@gmail.com

Tri-Thanh Nguyen, Quang-Thuy Ha

KTLab, College of Technology Vietnam National University, Hanoi (VNU)

144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

{ntthanh, thuyhq}@vnu.edu.vn

Abstract - The co-reference resolution task still poses many challenges due to the complexity of the Vietnamese language and the lack of standard Vietnamese linguistic resources. Based on the mention-pair model of Rahman and Ng (2009) and the characteristics of Vietnamese, this paper proposes a model using support vector machines (SVM) to resolve co-reference in Vietnamese documents. The corpus used in the experiments to evaluate the proposed model was constructed from 200 articles in the cultural and social categories of the vnexpress.net newspaper website. In initial experiments, the proposed model achieved 76.51% accuracy, compared with 73.79% for the baseline model with similar features.

Keywords: Co-reference resolution, Vietnamese co-reference, support vector machines

I INTRODUCTION

Co-reference resolution is the task of identifying phrases (in one or more documents) that refer to the same real-world entity or concept. In spite of its simple definition, co-reference resolution is generally considered a difficult NLP task, typically because it involves the use of sophisticated knowledge resources and inference procedures [9, 10]. It has received great attention from researchers and at annual conferences around the world, such as ACL and IJCAI. Co-reference information has been shown to be beneficial in many other tasks [15], including information extraction [5], question answering [8], and text summarization [14].

The study of co-reference resolution dates back to the 1960s-1970s, when the initial approaches were heuristic, based on experience. Many different approaches have since been proposed to solve the problem, among which machine learning approaches are the most prominent. In this paper, we focus on building a model for co-reference resolution in Vietnamese documents using SVM, based on the model proposed by Rahman and Ng [13] in 2009. The main idea of the approach is to treat each pair of potential phrases as a semantic relation, where each relation is represented by a feature vector. By using an SVM classifier to determine the label of each feature vector, we can determine whether the two potential phrases are co-referent. Finally, the co-referent pairs are grouped into the same group.

II RELATED WORK

In English, many approaches have been proposed to solve the co-reference task. In this paper, however, we review only the most closely related work, on which we base our algorithm.

Ranking based approach: This is a traditional approach based on linguistic and domain knowledge, proposed in 1998 by Mitkov [6]. It is in fact an approach to pronoun resolution when limited knowledge is available. The input is checked against 10 agreement and antecedent indicators. Each indicator assigns scores to the candidates, and the candidate with the highest score is returned as the antecedent. The most recent version of Mitkov’s Anaphora Resolution System (MARS) incorporates several advancements over the original system. MARS was improved to cater for several frequent causes of apparent number disagreement. These consist of (i) collective nouns, (ii) NPs whose gender is under-specified, (iii) quantified nouns/indefinite pronouns, and (iv) organisation names. These cases were handled by a combination of gazetteers, the integration of an animacy recognition module, and named entity recognition [7].

Classification based approach: In 1995, Joseph F. McCarthy proposed transforming the co-reference problem into a classification problem, using a decision tree as the classifier. Zoran Dzunic et al. revised this approach in 2006 with some improvements in the matching algorithms [2]. The main idea of this approach is to use a decision tree (based on 9 features) to classify the relation between phrases into two classes: co-reference and non-co-reference. The decision tree algorithm is then applied to find co-references in a document.

In 2005, Thomas Finley and Thorsten Joachims proposed using SVM as the classifier [4]. In 2009, this approach was adopted and extended to 39 features by Rahman and Ng [13]. Important features such as pronoun, subject, nested, gender, number, and string match were added. In this approach, the authors considered a pair consisting of an active mention m_k and a candidate antecedent m_j as a semantic relation. Each relation is represented by a vector constructed from 39 properties. The class associated with a training relation is either positive or negative, depending on whether m_j and m_k are co-referent. After converting all multi-valued features into an equivalent set of binary-valued features, the authors used an SVM classifier (from the SVMlight package) to identify co-references.
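The conversion of multi-valued features into binary-valued ones can be illustrated with a minimal Python sketch. The helper `binarize`, the feature name `GENDER`, and its value set are assumptions chosen for illustration; they are not the actual feature inventory of Rahman and Ng [13].

```python
def binarize(name, value, allowed_values):
    """Expand one multi-valued feature into one binary indicator per allowed value."""
    if value not in allowed_values:
        raise ValueError(f"unexpected value {value!r} for feature {name}")
    return {f"{name}={v}": int(v == value) for v in allowed_values}


# Hypothetical three-valued feature: exactly one indicator is set to 1.
encoded = binarize("GENDER", 0, [-1, 0, 1])
print(encoded)
```

Each multi-valued feature thus contributes as many binary dimensions as it has possible values, which is the form a linear SVM input usually takes.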

2011 International Conference on Asian Language Processing

Clustering based approach: In 1999, Claire Cardie and Kiri Wagstaff [1] solved the co-reference task by clustering. This approach begins with the assumption that each co-reference group defines a class. Therefore, it is natural to view the problem as one of partitioning, or clustering, the noun phrases in documents. Intuitively, all the noun phrases used to describe a specific concept will be "near", or related to, each other in some way, i.e. their conceptual "distance" will be short. Given a description of each noun phrase and a method for measuring the distance between two noun phrases using 11 features, a clustering algorithm can group the noun phrases into groups. Noun phrases whose distance is greater than a clustering radius r are not placed into the same partition, so they are not considered co-referent. The distance metric between two noun phrases is defined as:

$$\mathrm{dist}(NP_i, NP_j) = \sum_{f \in F} w_f \cdot \mathrm{incompatibility}_f(NP_i, NP_j)$$

where F is the set of NP features mentioned above; incompatibility_f is a function that returns a value between 0 and 1 indicating the degree of incompatibility between NP_i and NP_j in terms of feature f; and w_f denotes the relative importance of feature f. The incompatibility functions and their corresponding weights are listed in [1].
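As an illustration, the weighted sum can be sketched in Python. The two features, their weights, the incompatibility functions, and the radius value below are invented for the example; the actual functions and weights are those listed in [1].

```python
def distance(np_i, np_j, features):
    """Weighted sum of per-feature incompatibilities between two noun phrases.

    `features` maps a feature name to (weight, incompatibility function);
    each incompatibility function returns a value between 0 and 1.
    """
    return sum(w * incompat(np_i, np_j) for w, incompat in features.values())


# Hypothetical features and weights, for illustration only.
FEATURES = {
    "head_noun": (1.0, lambda a, b: 0.0 if a["head"] == b["head"] else 1.0),
    "number": (5.0, lambda a, b: 0.0 if a["number"] == b["number"] else 1.0),
}

a = {"head": "writer", "number": "sg"}
b = {"head": "writer", "number": "pl"}
r = 2.0  # clustering radius (assumed value)
print(distance(a, b, FEATURES), distance(a, b, FEATURES) > r)
```

Here the two phrases agree on the head noun but disagree on number, so only the second term contributes and the pair falls outside the assumed radius.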

In Vietnamese, there are a few published articles related to co-reference, such as the papers of Tru H. Cao and Hien T. Nguyen on named entity disambiguation (NED) [11, 12]. In these papers, the authors focused only on determining co-reference between named entities (NE); co-reference between NEs and pronouns, or between NEs and noun phrases, has not yet been addressed.

III PROPOSED MODEL

Based on the main idea of the mention-pair model of Rahman and Ng [13] and the characteristics of Vietnamese, we propose the model depicted in Figure 1. The proposed model consists of three main phases.

• The first phase, the pre-processing phase, focuses on recognizing potential phrases in sentences. The potential phrases include named entities, pronouns, and noun phrases. At the end of this phase, a set of potential phrases is returned.

• The second phase is the feature vector generation phase, in which the phrases in the current sentence and the three preceding sentences are paired together. The feature selection module represents each pair by a feature vector.

• The last phase is the recognition phase, in which an SVM classifier is used to determine the label of each feature vector, and thereby whether the two phrases are co-referent. Finally, the co-referent pairs are grouped into the same group.

The details of each phase are described in the following subsections.

A Pre-processing phase

The purpose of this phase is to generate a set of potentially co-referent phrases from the input documents. This phase uses the Vietnam Semantic Web (VSW¹) toolkit, an open-source toolkit that supports sentence tokenization, word segmentation, entity recognition, etc. The phase is carried out in two steps:

Step 1: The input Vietnamese documents are parsed by the VSW sentence tokenizer tool, producing a set of raw sentences.

Step 2: The set is passed through the VSW POS-tagging and VSW NER tools to recognize the potentially co-referent phrases, which are used in the next phase.

For example, given the Vietnamese paragraph:

“Bách Dương tên thật là Quách Định Sanh, sinh năm 1920 tại Hà Nam, Trung Quốc. Ông là nhà văn, nhà thơ, nhà phê bình, nhà nghiên cứu, nhà hoạt động xã hội tích cực …” (“Bach Duong, whose real name is Quach Dinh Sanh, was born in 1920 in Henan, China. He is a writer, a poet, a critic, a researcher, and an active social activist, …”)

After processing, the following set of potential phrases is returned: “Bách Dương”, “Quách Định Sanh”, “Hà Nam”, “Trung Quốc” in the first sentence, and “Ông”, “nhà văn”, “nhà thơ”, “nhà phê bình”, “nhà nghiên cứu”, “nhà hoạt động xã hội” (he, writer, poet, critic, researcher, social activist) in the second sentence.

B Feature vector generation phase

Based on the characteristics of the Vietnamese language, we use only 13 of the 39 features from the model of Rahman and Ng. In addition, we propose 4 new features: subject (SUBJECT_i); job (JOB_i); both NPs are subjects (BOTH_SUBJECT); and token distance (TOK_DISTANCE). In total, our model uses 17 features, which are shown in Tables 1 and 2.

¹ http://code.google.com/p/vsw/

[Figure 1: The proposed model. A Vietnamese document passes through the pre-processing phase (sentence tokenizer → raw sentences → POS-tagging, NER), which yields potential phrases; the feature vector generation phase (pairing → feature selection), which yields feature vectors; and the recognition phase (SVM classifier), which yields co-referent chains.]


Table 1: Features of a potential phrase, NPi

Feature      Values      Description
TYPE_i       1, 2, 3     Type of NPi: 1 if NPi is a named entity, 2 if NPi is a noun phrase, 3 if NPi is a pronoun
SUBJECT_i    0, 1        1 if NPi is a subject, otherwise 0
JOB_i        0, 1        1 if NPi is a job, otherwise 0
GENDER_i     -1, 0, 1    Gender of NPi: 1 if MALE, 0 if FEMALE, otherwise -1
NUMBER_i     1, 2        Number of NPi: 2 if PLURAL, otherwise 1

Table 2: Features describing the relation between NP1 and NP2

Feature          Values         Description
SUB_STR_MATCH    0, 1           1 if one phrase is a substring of the other, otherwise 0
TOK_DISTANCE     -1, 0, 1, …    Token distance between NP1 and NP2 if they are in the same sentence; -1 if the two NPs are from different sentences
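The two relation features listed in Table 2 could be computed roughly as in the following sketch. The `Phrase` record and its fields are assumptions made for the example. The substring test is made case-insensitive here, and the token distance is taken as the number of tokens lying between the two phrases; the table does not spell out either detail.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    text: str       # surface string of the phrase
    sentence: int   # index of the sentence containing the phrase
    start: int      # index of the first token within the sentence
    end: int        # index one past the last token within the sentence


def sub_str_match(np1, np2):
    """1 if one phrase is a substring of the other, otherwise 0."""
    a, b = np1.text.lower(), np2.text.lower()
    return int(a in b or b in a)


def tok_distance(np1, np2):
    """Tokens between the phrases in the same sentence; -1 across sentences."""
    if np1.sentence != np2.sentence:
        return -1
    first, second = sorted((np1, np2), key=lambda p: p.start)
    return max(0, second.start - first.end)


huy = Phrase("Bé Huy", 0, 0, 2)
be = Phrase("bé", 2, 4, 5)
print(sub_str_match(huy, be), tok_distance(huy, be))
```

The token offsets in the usage lines are made up; in practice they would come from the word segmentation step of the pre-processing phase.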

The purpose of this phase is to generate feature vectors for pairs of phrases for later classification. There are two steps in this phase.

The pairing step: We put potential phrases into pairs based on the 3 observations below:

• In co-referent pairs, a named entity is often mentioned before its co-references in the text. In other words, a pronoun or a noun phrase can be paired with a preceding named entity. From the above example, we can produce the pairs (“Bách Dương”, “Quách Định Sanh”), (“Bách Dương”, “Ông”), (“Bách Dương”, “nhà văn”), etc.

• In Vietnamese documents, in order to keep semantic coherence among the sentences of a paragraph, co-references are usually used within a range of no more than three sentences. In other words, a noun phrase is not paired with a named entity occurring in the fourth preceding sentence.

• In some cases, two co-referent phrases occur in two sentences separated by a sentence that does not contain any potential phrases.

For example: “Bé Huy ở quận 7 bị điện giật khi giẫm lên sợi dây điện mắc nối từ ổ cắm sang chiếc quạt máy. Nguyên nhân sau đó được xác định vì dây bị hở lõi đồng. Vì tai nạn đó bé phải nằm viện mất một tuần.” (“The kid Huy in District 7 was electrocuted when stepping on the wire running from the socket to the fan. The cause was later identified as the wire's exposed copper core. Because of that accident, the kid had to stay in hospital for a week.”)

In this example, “Bé Huy” and “bé” are separated by the sentence “Nguyên nhân sau đó được xác định vì dây bị hở lõi đồng”, which does not contain potential phrases.

We used the following strategy to produce pairs of phrases, which includes two sub-steps:

Step 1: Starting from the result set of the first phase, process the sentences of each document from the last sentence upwards.

Step 2: For each sentence, pair the potential phrases in this sentence with the potential phrases in the same sentence and in its three preceding sentences.
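The pairing strategy can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the pre-processing output is a list of sentences, each given as a list of potential phrases in textual order, and it pairs each phrase with every earlier phrase in the same sentence and in the three preceding sentences. The restriction that the antecedent be a named entity (the first observation above) is omitted for brevity, and the function name `generate_pairs` is hypothetical.

```python
def generate_pairs(sentences, window=3):
    """Pair each phrase with earlier phrases in its sentence and the `window` preceding ones.

    `sentences` is a list of sentences, each a list of potential phrases in
    textual order. Returns (antecedent, phrase) tuples.
    """
    pairs = []
    # Step 1: walk the document from the last sentence upwards.
    for s in range(len(sentences) - 1, -1, -1):
        # Step 2: collect candidates from the preceding sentences in the window.
        candidates = [p for prev in sentences[max(0, s - window):s] for p in prev]
        for k, phrase in enumerate(sentences[s]):
            # Earlier phrases of the same sentence are candidates as well.
            for antecedent in candidates + sentences[s][:k]:
                pairs.append((antecedent, phrase))
    return pairs


doc = [["Bách Dương", "Quách Định Sanh"], ["Ông", "nhà văn"]]
for pair in generate_pairs(doc):
    print(pair)
```

Limiting the window keeps the number of generated pairs well below that of a scheme that pairs every phrase with all of its predecessors, at the cost of missing antecedents more than three sentences back.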

The feature selection step: Based on the result of the previous step, each pair (NP1, NP2) is represented by a feature vector containing 22 properties: 5 properties of NP1 and 5 properties of NP2 (as shown in Table 1), and 12 properties describing the relation between NP1 and NP2 (as shown in Table 2). These features are extracted automatically by a module.

C Recognition phase

The last phase includes two steps:

Step 1: The feature vectors from the previous phase are classified to determine their labels, i.e. the co-referent nature of each pair. A pair of phrases is assigned label 1 if the two phrases are considered co-referent; otherwise it is assigned 0.

Step 2: The co-referent phrases are grouped together. For example, if (A, B) and (B, C) are two pairs classified as 1, then A, B and C are put into the same group.
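The grouping step amounts to taking the transitive closure of the positively classified pairs. One common way to do this is a union-find structure, sketched below; the paper does not state which grouping algorithm was used, so this choice and the function name `group_coreferent` are assumptions.

```python
def group_coreferent(pairs):
    """Merge pairs labeled as co-referent into groups (transitive closure)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two phrases of every positive pair.
    for a, b in pairs:
        parent[find(a)] = find(b)

    # Collect phrases by their representative.
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


positive_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(group_coreferent(positive_pairs))
```

With the pairs above, A, B and C end up in one group and D and E in another, matching the example in Step 2.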


IV EXPERIMENTS AND DISCUSSIONS

In addition to the program implementing our proposed model (denoted PModel), we also built a baseline model (denoted BModel) following the mention-pair model of Rahman and Ng [13], which pairs each phrase with all of its preceding phrases. The details of our experiments are described below.

Building the classifier: Through a survey of Vietnamese data, we found that articles in the cultural and social categories of the Vnexpress.net newspaper contain a reasonable number of co-reference relations in comparison with other categories. Thus these articles are suitable for constructing the training data for our experiments. After crawling, we selected 200 articles from this source to build the corpus. The potential phrases in the corpus were manually tagged to ensure that all potential phrases in the text are recognized at the end of the first phase. In the next step, the potential phrases were paired and the feature vectors were generated following the strategy described in Section III.B.

About 2500 vectors were generated by PModel, compared with over 3000 vectors for BModel. The feature vectors were labeled manually: the label is 1 if the two phrases are co-referent, and 0 otherwise. Finally, we used the libSVM toolkit to build the classifier from the labeled vectors.

Evaluation and discussion: We used 10-fold cross-validation to evaluate the two classifiers of PModel and BModel. Under this method, the training data is randomly divided into 10 approximately equal parts. Each part, in turn, is used as the test set while the remaining 9 parts are used as the training set. The average result is used as the final evaluation result. The libSVM² toolkit already supports this method.
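The fold construction can be made concrete with a short sketch. LIBSVM provides cross-validation itself (the `-v` option of `svm-train`), so the code below only illustrates the procedure; the function names and the dummy evaluation function are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k approximately equal parts
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by `evaluate(train, test)` over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Dummy evaluation function standing in for training and testing a classifier.
print(cross_validate(2500, lambda train, test: len(test) / 2500))
```

In a real run, `evaluate` would train the classifier on the vectors indexed by `train` and return its accuracy on those indexed by `test`.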

In the second experiment, we randomly selected 10 documents (from the 200 documents) as the test set. In this experiment we also evaluated precision (P), recall (R), and the F1 measure. The results of the two experiments are shown in Table 3 and Table 4.
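For reference, the three measures can be computed from gold and predicted pair labels as in the sketch below. It assumes the measures are taken over pair labels with the co-referent class (label 1) as the positive class, which the text does not state explicitly; the labels in the usage lines are toy values, not data from the experiments.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for the positive (co-referent) class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, not data from the paper.
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
print(precision_recall_f1(gold, pred))
```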

Table 3: Results of cross-validation and second experiment

Experiment                              PModel (%)    BModel (%)
10-fold cross-validation with LibSVM    76.51         73.79

Table 4: Detail results of the second experiment

From Table 3, in the initial experiments the proposed model achieved 76.51% accuracy, while the accuracy of the baseline model is 73.79% with similar features. This suggests that our approach is reasonable and applicable to Vietnamese data. In addition, the results of the second experiment in Table 4 reinforce that the proposed model is feasible. However, from our observation, the corpus used in our experiments is rather small, and it does not cover all cases. This may affect the results of our model. We hope that when the training data is large enough, the proposed model can achieve a higher accuracy.

² http://www.csie.ntu.edu.tw/~cjlin/libsvm/

V CONCLUSION AND FUTURE WORKS

In this paper, we proposed and built a model to resolve co-reference in Vietnamese documents using support vector machines (SVM). The results show that our approach is reasonable and feasible. However, the proposed model still has some limitations, such as the lack of training cases, the fact that the evaluation takes the output of the POS-tagging and named entity recognition phase as its input data, and the absence of a user interface. Hence, our future work will focus on building a large training dataset, investigating and applying semantic relation features in Vietnamese to improve the accuracy of the model, and building a user interface to interact with users.

ACKNOWLEDGEMENT

This work is partly supported by research project No. QG.10.38 granted by Vietnam National University, Hanoi (VNUH), the TRIG project, and the Nafosted foundation.

REFERENCES

[1] C. Cardie, K. Wagstaff: Noun Phrase Coreference as Clustering. Empirical Methods in Natural Language Processing Conference (EMNLP), 1999, pp. 3-5.

[2] Z. Dzunic, S. Momcilovic, B. Todorovic: Coreference Resolution Using Decision Tree. Neural Network Applications in Electrical Engineering, 2006, pp. 6-10.

[3] Pascal Denis, Jason Baldridge: A ranking approach to pronoun resolution. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1-2.

[4] T. Finley, T. Joachims: Supervised clustering with Support Vector Machines. Proceedings of the 22nd International Conference on Machine Learning, Germany, 2005.

[5] Joseph F. McCarthy: A trainable approach to coreference resolution for information extraction, 1996.

[6] Ruslan Mitkov: Robust pronoun resolution with limited knowledge. The 17th International Conference on Computational Linguistics (COLING), 1998, pp. 1-3.

[7] Ruslan Mitkov, Richard Evans, Constantin Orasan, Le An Ha, Viktor Pekar: Anaphora Resolution: To What Extent Does It Help NLP Applications? 6th DAARC, 2007, pp. 179-190.

[8] Thomas S. Morton: Using coreference for question answering. In Proceedings of the 8th Text Retrieval Conference, 1999.

[9] Vincent Ng: Machine Learning for Coreference Resolution: From Local Classification to Global Ranking. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005, pp. 1-3.

[10] Vincent Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), 2010, pp. 1-8.

[11] Hien T. Nguyen, Tru H. Cao: A Knowledge-Based Approach to Named Entity Disambiguation in News Articles. Australian Conference on Artificial Intelligence, 2007.

[12] Hien T. Nguyen, Tru H. Cao: Named Entity Disambiguation: A Hybrid Statistical and Rule-Based Incremental Approach. Asian Semantic Web Conference (ASWC), 2008.

[13] Altaf Rahman, Vincent Ng: Supervised Models for Coreference Resolution. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-09), 2009.

[14] Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, Karel Jezek: Two Uses of Anaphora Resolution in Summarization. Information Processing and Management: an International Journal, 2007.

[15] Yannick Versley, Simone Paolo Ponzetto, Massimo Poesio: BART: A Modular Toolkit for Coreference Resolution. 6th Language Resources and Evaluation Conference (LREC), 2008, pp. 1-6.
