Master's thesis, VNU UET: Tagset evaluation and automatic error verification in POS-tagged corpora

Characteristics of Vietnamese language

Every language has its own features, and Vietnamese is no exception. To give a clearer picture of Vietnamese, we list some of its salient features and compare it with other languages such as Chinese and English.

According to M. Ferlus and other researchers in Vietnam and abroad, Vietnamese is an indigenous language of the Mon-Khmer branch of the Austroasiatic (South Asian) family and is closely related to the Muong language. Vietnamese is an isolating language with three prominent features. Firstly, the syllable is the basic unit for forming words and sentences: a syllable may be a word by itself or an element of a complex word, a compound word or a reduplicative word. Secondly, Vietnamese words are not inflected. In particular, singular and plural nouns have the same form, as in “hai cuốn sách” (two books) and “một cuốn sách” (one book). Thirdly, grammatical meaning is expressed mainly through word order and function words (expletives). For example, adding the function words “sẽ”, “đã” and “không” to the sentence “Tôi ra ngoài” (I go out) yields three sentences with different meanings: “Tôi sẽ ra ngoài” (I will go out), “Tôi đã ra ngoài” (I went out) and “Tôi không ra ngoài” (I do not go out).

Figure 1 Typological features of Vietnamese: (1) the syllable is the basic unit for forming words and sentences; (2) Vietnamese words are not inflected; (3) grammatical meaning is expressed mainly through word order and function words (expletives).

Other isolating languages include Chinese and Thai, whereas English, French and Russian are inflectional languages. The differences become visible when Vietnamese, Chinese and English sentences are compared, as in Table 1.

Table 1 The expression of grammatical meaning in Vietnamese

Means | Vietnamese | Chinese | English
Word order | Tôi yêu anh ấy | |
Function word (expletive) | Tôi không yêu anh ấy | Wo bu ai ta | I do not love him

Unlike Vietnamese and Chinese, English changes the form of the pronoun when the word order changes: the object pronoun "him" becomes the subject pronoun "he".

Vietnamese part of speech

In European languages, the notion of POS is tied to morphological categories such as gender, number and mood. For Vietnamese, there are two views:

- Firstly, POS does not exist in Vietnamese because Vietnamese has no morphological variation (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung).

- Secondly, Vietnamese has POS just as European languages do, but classifying words into tags, that is, defining the POS of a word, must be based on certain criteria.

So far, Vietnamese linguists have largely agreed on the following criteria (Diep Quang Ban, Hoang Van Thung, 2010):

a. General meaning: “The meaning of a POS is the general meaning of a group of words, based on lexical generalization, which forms a generalized common grammatical category (lexical-grammatical category)”. POSs are thus classification categories: large groups of words, each sharing a classifying feature such as object, quality, action or state. For example, nhà, bàn, chim, học sinh, con, quyển and sự are classified as nouns because their lexical meaning is generalized and abstracted as objects. Their grammatical category is noun.

b. Combination ability: Given their general meaning, words can take part in meaningful combinations. Some words can replace each other in a certain position of a combination, while the rest of the combination provides the context in which the replacement is possible. For example, nhà, bàn, chim and cát can replace each other in combinations such as nhà này, chim này and cát này, and are therefore classified as nouns.

c. Syntactic function: When they take part in a sentence, words that can stand in one or several particular positions, can replace each other in those positions, and express the same syntactic relation to the other parts of the sentence can be classified into one POS. For instance, nhà, bàn, chim and cát are nouns: they can act as the subject of a sentence, and the subject function is one of the syntactic functions used to classify them as nouns.

1.2.2 Ways to build a tagset

Two kinds of POS tagsets have been developed so far, of which the first has received far more attention from linguists.

The first kind is based on the 8 basic POS tags widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection and emotive word. Finer tagsets are built from these 8 basic tags, with each researcher relying on certain criteria to refine the tagset (the criteria are discussed in section 1.2.1). Notable examples are the VnPOS tagset of Tran Thi Oanh with 14 tags, VietTreeBank with 17 tags and VnQtag with 59 tags (see appendix).

The second kind is built by mapping a tagset from another language onto Vietnamese, based on the correspondence between words of the two languages (Dinh Dien and Hoang Kiem).

Corpora

Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important role in current work in computational linguistics, so great attention has gone into developing them. Many countries have corpora of their own. Well-known examples include the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997) and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). Notable Vietnamese corpora are VnQtag, VnPos and VTB.

A corpus must satisfy some obligatory criteria (McEnery and Wilson):

- Sampling and representativeness: the elements of a corpus must be general, diverse and plentiful. A sample is representative if what we find for the sample also holds for the general population.

- Finite size: the bigger a corpus is, the more highly it is valued, but its size is still finite.

Building a large corpus by hand takes a lot of time because it requires extensive linguistic knowledge, and even then the quality of a manually built corpus is not guaranteed. This thesis therefore looks for errors in such corpora and for ways to improve them.

The two corpora used in our experiments are VietTreeBank and VnQtag. The rest of this section describes how they were built.

VietTreeBank (VTB) is a result of the national VLSP project and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen and annotators). The corpus comprises 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 Vietnamese sentences with syntactic annotation (word segmentation, POS tagging and syntactic structure). The group used maximum entropy models (MEMs) and conditional random fields (CRFs) to assign POS tags, with an accuracy of over 93%. VTB was developed to support the building of tools for word segmentation, POS tagging, syntactic parsing and so on. The VTB group chose two criteria for classifying POS: combination ability and syntactic function. For instance, a noun can act as the subject or object of a sentence, and it can combine with numerals (three, four) and attributes (each, every).

A POS tag can carry information about the basic word class (noun, verb, adjective and so on), morphological information (countable or uncountable), subcategory (verb taking a noun, verb taking a clause, etc.), semantic information or other syntactic information. The VTB group built its tagset on the basic word classes alone, without other information such as morphology or subcategory (see the tagset in the appendix).

In addition to POS information, the group describes basic syntactic constituents such as phrases and clauses. Syntactic tags are the most basic information in a syntax tree and form its spine. A7 and A8 in the appendix list the phrase and clause tagsets, respectively.

The function tag of a syntactic constituent expresses its role in the higher-level constituent that contains it. Function tags are assigned to the main elements of a sentence, such as subject, predicate and object, and provide information that helps identify basic grammatical relations.

The tagging of each sentence in the corpus consists of three steps: word segmentation, POS tagging and syntactic parsing.

The VnQtag tagset was built within the national KC01 project by a development group including Nguyen Thi Minh Huyen, Vu Xuan Luong and Le Hong Phuong. The group based its work on a printed dictionary (the Vietnamese dictionary of the Institute of Linguistics, 2000). First, they segmented sentences into words with a syllable automaton and a lexical automaton. Then they used the Qtag tagger to assign POS labels to Vietnamese words. There are 59 POS labels (see appendix). In addition to grammatical information, the group used semantic information (the general meaning of a word) to define the 59 word-class labels. For example, words are considered verbs if they express the general meaning of a process. Process meaning is expressed directly in the action of an object; this is action meaning. State meaning is generalized from the relation between the object's action and time and space (Vietnamese grammar of Diep Quang Ban and Hoang Van Thung). The automatic tagging experiment was carried out on the 7 documents listed in Table 2. The annotated corpus plays an important role in NLP: it is a database of high-quality linguistic resources and follows international standards for data representation.

The resulting corpus has the following format: each lexical unit and its POS stand on one line, with a space between the syllables of a unit and a tab between the unit and its POS. Punctuation marks and other symbols in the text are treated as lexical units and labelled with the corresponding punctuation label. The corpus consists of 7 documents of different types (story, novel, science and press). It covers words commonly used in daily life and in the press, as well as words typical of literary works and scientific and technical terms.
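The line format described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name `parse_tagged_lines` is hypothetical, the sample tags are illustrative, and tagging a full stop with "." is our reading of "the corresponding punctuation label", not something the corpus documentation confirms.

```python
def parse_tagged_lines(lines):
    """Parse lines of the form 'lexical unit<TAB>POS' into (unit, tag) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab: syllables inside a unit are separated by spaces.
        unit, sep, tag = line.rpartition("\t")
        if not sep:
            raise ValueError(f"missing tab separator: {line!r}")
        pairs.append((unit.strip(), tag.strip()))
    return pairs


# Illustrative input only; the tags here are assumptions.
sample = ["xe đạp\tNa", "nặng\tAa", ".\t."]
pairs = parse_tagged_lines(sample)
```

Splitting on the last tab keeps multi-syllable units such as "xe đạp" intact as a single lexical unit.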

Table 2 Corpus annotated with the VnQtag tagset

No. | Document | Type | Number of lexical units | Number of processing units (including punctuation)
2 | Chuyen tinh ke truoc luc rang dong, part I | Novel | 14277 | 16787
3 | Chuyen tinh ke truoc luc rang dong, part II | Novel | 12499 | 14698
4 | Luoc su thoi gian | Science | 10598 | 11626
6 | Nhung bai hoc nong thon | Story | 6682 | 8244
7 | Cong nghe va he thong phong thu quoc gia | Press | 1028 | 1162

Motivation

So far we have not stated which problems this thesis addresses or why we chose them. This section discusses both.

Linguistic theories were first developed to describe Indo-European languages, and there have been many significant achievements since. In Vietnam, work on NLP has been under way since 1990, but the results achieved so far are still limited. Vietnamese language processing is first of all the responsibility of Vietnamese researchers; we cannot expect foreign researchers to solve it for us (Ho Tu Bao, 2001). This thesis therefore aims to contribute to Vietnamese language processing by concentrating on improving tagsets and detecting errors in tagging.

Natural language processing is carried out in five stages:

- Morphological and lexical analysis: The lexicon of a language is its vocabulary, including its words and expressions. Morphology is the identification, analysis and description of the structure of words. Words are generally accepted as the smallest units of syntax, and syntax refers to the rules and principles that govern sentence structure in a given language. The aim of lexical analysis is to divide the text into paragraphs, sentences and words. It cannot be performed in isolation from morphological and syntactic analysis.

- Syntactic analysis: The words in a sentence are analysed to determine the grammatical structure of the sentence. The words are transformed into structures that show how they relate to each other. Some word sequences may be rejected if they violate the rules of the language for how words may be combined.

- Semantic analysis: This stage derives the meaning of a sentence, determining its possible meanings in a context.

- Discourse integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it.

- Pragmatic analysis: This stage derives knowledge from external commonsense information. It means understanding the purposeful use of language in situations, particularly those aspects of language which require world knowledge. For example, the sentence "Do you know what time it is?" should be interpreted as a request.

Our thesis concentrates on the first stage (i.e. morphological analysis) of natural language processing. It is a very important preprocessing step for the following stages, such as syntactic analysis and semantic analysis.

Our thesis addresses two major problems and two minor ones. The major problems are evaluating tagsets and detecting tagging errors automatically; the minor ones are checking the convertibility of tagsets and detecting segmentation errors automatically.

a. Evaluation and convertibility of tagsets

The previous section mentioned several tagsets: VietTreeBank (17 tags), VNPOS (15 tags) and VNQTag (59 tags). Such inconsistent tagsets raise several questions: which tagset is better, what methods can be used to evaluate tagsets, and how can we choose the right set of POS tags for a given application? The first part of this thesis focuses on these questions.

Another aspect discussed here is the convertibility of tagsets. The choice of tagset strongly affects the difficulty of POS tagging: a large tagset makes tagging harder, while a small one may not be informative enough for a given purpose. It is therefore necessary to balance quality and quantity in a tagset, that is:

- Quality of information: the tagset should be as informative as possible (i.e. words are classified into more parts of speech based on concrete meaning).

- Ease of tagging: the number of POS tags should be as small as possible.

We look for a method that balances these two requirements. We carry out experiments on a source tagset (ST) and a target tagset (TT), count the number of words that become ambiguous under the conversion, and draw conclusions from these counts.

b. Detecting POS tagging and word segmentation errors

- If each word belonged to only one label, a finite dictionary of words and their labels would solve POS tagging completely. In fact, however, a word can belong to more than one label, which leads to ambiguity and errors in POS tagging. Fixing these errors by hand costs a lot of time and money, so we want a method that detects them automatically.

- Vietnamese word segmentation is also a thorny issue. A sentence may be segmented in several ways. For example, "chiếc xe đạp nặng quá" can be segmented as:

Way 1: chiếc/ xe/ đạp/ nặng/ quá
Way 2: chiếc/ xe đạp/ nặng/ quá

Here “/” separates words. Both segmentations are acceptable because each yields a meaningful sentence.
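To illustrate why a single sentence admits several segmentations, the following sketch enumerates every split of a syllable sequence into words of a lexicon. It is a toy illustration, not the segmenter used for the corpora: the function name `segmentations`, the `max_len` parameter and the six-entry lexicon are assumptions made for this example.

```python
def segmentations(syllables, lexicon, max_len=3):
    """Return every way to split a list of syllables into lexicon words."""
    if not syllables:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for k in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:k])
        if word in lexicon:
            for rest in segmentations(syllables[k:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon: "xe đạp" is a word, and so are "xe" and "đạp".
lexicon = {"chiếc", "xe", "đạp", "xe đạp", "nặng", "quá"}
ways = segmentations("chiếc xe đạp nặng quá".split(), lexicon)
for way in ways:
    print("/ ".join(way))
```

With this lexicon the function returns exactly the two segmentations discussed in the text, because the syllables "xe" and "đạp" are words on their own and also form the compound "xe đạp".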

One reason for these differences is shown in the following table, and word segmentation is the last problem addressed in our thesis:

Table 3 Principal differences between Vietnamese and English

Feature | Vietnamese | English
Prefix or suffix | No | Yes
Part of speech | No agreement | Clearly defined
Word boundary | Context-dependent meaningful combination of syllables | Blank or delimiters

All of the above reasons motivated us to look for answers to these problems.

Organization of the thesis

The thesis is organized into four main chapters as follows:

Chapter 1 gives a general picture of Vietnamese, including the features of the language and its parts of speech. It also discusses the reasons for choosing the topic of the thesis.

Chapter 2: Evaluating distributional properties and conversion possibility of tagsets in Vietnamese

This chapter looks more deeply at tagsets, for instance how to build a tagset and how to merge labels, and introduces the basic notions needed to evaluate the properties of tagsets.

Chapter 3: Automatic error verification of POS-tagged corpus

This chapter introduces the notions related to the error detection method, then presents the algorithm and discusses how to classify variation as error or ambiguity.

The final chapter discusses three issues: the theoretical contributions of the thesis, its experimental contributions and directions for further work. It sums up what we have achieved and discusses some work that remains to be done.

Tagset evaluation

Tagset evaluation has received much attention from NLP researchers for over 20 years. Tagset evaluation allows us to test and assess the impact of tagset modifications on results by using different versions of a given tagset on the same texts (Martin Volk and Gerold Schneider, 1998). In 2000, Sašo Džeroski, Tomaž Erjavec and Jakub Zavrel compared the accuracy obtained with tagsets designed by decreasing the cardinality of the original tagset, either omitting certain attributes or omitting almost all attributes except certain ones. Accuracies were computed using a black-box combiner (Halteren, Dzeroski). In the same year, Hervé Déjean presented two kinds of tagset evaluation, a global one and a local one. The global evaluation consists of evaluating the initial grammar generated by ALLiS. The local evaluation uses the notion of reliability: the reliability of an element is the ratio between its frequency in the structure and its total frequency in the corpus.

For Indian languages, Madhav Gopal, Diwakar Mishra and Devi Priyanka Singh (2010) discussed the evaluation of several tagsets: the ILMT tagset, the JNU-Sanskrit tagset, the LDCIL tagset and the Sanskrit consortium tagset.

Vietnamese is an isolating language in which word order is an important source of syntactic information. To evaluate Vietnamese tagsets, this chapter introduces a simple method that uses an internal criterion and an external criterion. The internal criterion uses frequent frames and purity to check whether tags are assigned accurately. The external criterion reduces the cardinality of the tagset and checks whether the quality of information is retained. Indeed, a number of evaluations have shown that many tagging errors are caused by overly fine distinctions within major categories (Eugenie Giesbrecht,

A POS is a set of words that share some grammatical characteristic(s), and each POS differs in its grammatical characteristics from every other POS. For example, nouns have different properties from verbs, which in turn have different properties from adjectives, and so on.

A tagset is a set of POS tags built according to the criteria described in 1.2. Tagsets therefore vary in the number of tags and are used in various applications.

Properties of a tagset: a tagset should guarantee the following properties: it retains linguistic features, reflects syntactic structure, can be assigned accurately, and reduces the number of ambiguous words during tagging.

2.1.3 A method for evaluating distributional properties of tagsets

2.1.3.1 Internal criterion

Among the properties of a tagset, we value most the possibility of assigning tags accurately in a corpus. This is the internal criterion. We assess it through the notion of a frame and the purity formula. A frame represents the local context under consideration and indicates which tags can appear in it. The purity formula then measures how strongly the tags converge in that local context.

As mentioned previously, we evaluate tagsets with the purity formula, which is used as an external evaluation criterion for cluster quality (Stanford natural language processing, 357). Purity is a simple and transparent measure that is widely used to evaluate cluster quality. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert \qquad (1)$$

where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters and $C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes. In equation (1), we interpret $\omega_k$ as the set of documents in cluster $\omega_k$ and $c_j$ as the set of documents in class $c_j$.

High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document gets its own cluster.

Figure 2 Purity as an external evaluation criterion for cluster quality. The majority class and the number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
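The purity computation of equation (1) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the thesis. The function name `purity` and the exact cluster contents are assumptions, chosen so that the majority counts match those given for Figure 2 (5, 4 and 3), with 17 items in total and the letter "d" standing in for the third class symbol.

```python
from collections import Counter


def purity(clusters):
    """Purity of a clustering, given one list of class labels per cluster."""
    n = sum(len(cluster) for cluster in clusters)
    if n == 0:
        raise ValueError("purity is undefined for an empty clustering")
    # For each non-empty cluster, count the members of its majority class.
    majority_total = sum(
        Counter(cluster).most_common(1)[0][1] for cluster in clusters if cluster
    )
    return majority_total / n


# Assumed cluster contents: majority classes x (5), o (4) and d (3).
clusters = [
    list("xxxxxo"),           # cluster 1
    list("xoooo") + ["d"],    # cluster 2
    list("xx") + ["d"] * 3,   # cluster 3
]
value = purity(clusters)  # (5 + 4 + 3) / 17
```

Putting every item into its own cluster makes each cluster trivially pure, which is why purity alone cannot be used to compare clusterings with very different numbers of clusters.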

The notion of a frame was introduced by Mintz in 2006. In 2010, Dickinson and Jochim redefined it as follows: in a local context, a frame spans three words, namely the two words surrounding a target word, which together lead to the categorization of the target. We use frames to test the quality of distributional mappings. In English, for example, the frame “you_it” generally predicts a verbal category for the target (i.e. the target word may be hit, beat, eat or kiss). In Vietnamese, the frame “mẹ_là” predicts that the target word is a pronoun (Pp), e.g. “tôi”, “anh”, “chị” or “bác”. To obtain more reliable results, we use frequency and the notion of a frequent frame. Frequent frames supply category information in child language corpora. Frames do not all play the same role in a corpus: the more often a frame appears, the more linguistic information it carries. We set the frequency threshold at 0.03% of the total number of frames. For example, if a corpus contains 10000 frames, the threshold is 3 (10000 × 0.03%), so a frame that appears more than 3 times is considered a frequent frame.
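Frame extraction and the frequency threshold described above can be sketched as follows. The function names, the toy corpus and its tags are hypothetical, and two points are our assumptions rather than statements from the source: the "total number of frames" is taken to be the number of frame occurrences, and "above" is read as strictly greater than the threshold.

```python
from collections import Counter, defaultdict


def collect_frames(tagged_sentences):
    """Map each frame (left word, right word) to a Counter of target-word tags."""
    frames = defaultdict(Counter)
    for sentence in tagged_sentences:
        for i in range(1, len(sentence) - 1):
            left_word = sentence[i - 1][0]
            target_tag = sentence[i][1]
            right_word = sentence[i + 1][0]
            frames[(left_word, right_word)][target_tag] += 1
    return frames


def frequent_frames(frames, ratio=0.0003):
    """Keep frames that occur more often than ratio * total frame occurrences."""
    total = sum(sum(tags.values()) for tags in frames.values())
    threshold = total * ratio
    return {f: tags for f, tags in frames.items() if sum(tags.values()) > threshold}


# Toy corpus with illustrative tags; each sentence is a list of (word, tag) pairs.
corpus = [
    [("mẹ", "N"), ("tôi", "Pp"), ("là", "V"), ("nông dân", "N")],
    [("mẹ", "N"), ("anh", "Pp"), ("là", "V"), ("bác sĩ", "N")],
]
frames = collect_frames(corpus)
# A large ratio is used here only because the toy corpus is tiny.
frequent = frequent_frames(frames, ratio=0.3)
```

In this toy corpus only the frame mẹ_là occurs twice, so it is the only frame that passes the threshold, and both of its targets carry the tag Pp.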

Next, the purity formula is applied to measure how tags are distributed within a frame, since each tag accounts for a different share of the occurrences of a frame. To calculate the purity value, we only consider the most frequent tag in each frame. The higher the purity value, the more likely it is that words can be tagged accurately. For instance, take the two frames Tôi_ở and mẹ_bảo. The first frame appears 4 times in a corpus, and the target word carries two tags, Vits and Vitn (1 time and 3 times, respectively). The second frame appears 8 times; the POS of the target word is Np 7 times and Pp once. Applying equation (1) to these two frames, the purity value is

$$\mathrm{purity} = \frac{3 + 7}{4 + 8} = \frac{10}{12} \approx 0.83$$
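The worked example above can be checked with a short sketch. The function name `frame_purity` is hypothetical, and pooling the two frames as the clusters of equation (1) is our assumption about how the per-frame counts are combined.

```python
from collections import Counter


def frame_purity(frames):
    """Purity over frames: majority-tag occurrences divided by all occurrences."""
    total = sum(sum(tags.values()) for tags in frames.values())
    majority = sum(max(tags.values()) for tags in frames.values())
    return majority / total


# Tag counts of the two frames from the example in the text.
frames = {
    "Tôi_ở": Counter({"Vits": 1, "Vitn": 3}),
    "mẹ_bảo": Counter({"Np": 7, "Pp": 1}),
}
value = frame_purity(frames)  # (3 + 7) / (4 + 8)
```

Each frame plays the role of a cluster and each tag the role of a class, so a frame whose targets always receive the same tag contributes its full frequency to the numerator.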

To evaluate a tagset, linguists usually map it onto a reduced one, because this makes it possible to check which linguistic features are retained. A reduced tagset is built by merging tags, but how should tags be merged? This is a difficult question that we need to address.

Hervé Déjean (Universität Tübingen, 2000) discussed the theoretically minimal tagset. He argued that the quality of a tagset does not depend on the number of tags and built the minimal tagset needed to parse sentences whatever the domain. Starting from a tagset with one tag per structure (NP-VP), he estimated that a tagset of about 20 tags is enough to parse a sentence into PS and clause structures.

There are many ways to merge labels, which is why tagsets with different numbers of tags coexist. English is a morphologically marked language, so it is fairly easy to identify tags that can be merged, such as the base form verb (VB) and the non-third person singular present tense verb (VBP); this is not the case for Vietnamese. The tagsets used in our thesis are of two kinds:

Firstly, tagsets built by previous NLP researchers, for instance VnQtag and VietTreeBank.

Secondly, tagsets that we derive ourselves by conflating some labels based on features of Vietnamese. VnQtag has the largest number of tags, so we use it as the source tagset from which the other tagsets are generated.

To make the above concrete, we introduce an algorithm of 5 steps applied to a tagged corpus:

1 Identify all the words in the corpus and their POS tags, and store them together with their positions.

2 Count the frames in the corpus, then derive the frequency threshold from the total number of frames.

3 Find the frequent frames and compute their purity value.

4 Map the original tagset onto the new reduced tagsets.

5 Finally, compute the purity value under the new tagsets and count the ambiguous words whose distinctions are lost.
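Steps 4 and 5 above can be sketched as follows: the tag counts of each frame are re-labelled with a reduced tagset and purity is recomputed. This is an illustrative sketch, not the thesis implementation; the function names and the mapping (merging Vits and Vitn into a single tag V) are hypothetical.

```python
from collections import Counter


def map_frames(frames, mapping):
    """Re-label the tag counts of every frame with a reduced tagset."""
    mapped = {}
    for frame, tags in frames.items():
        merged = Counter()
        for tag, count in tags.items():
            # Tags missing from the mapping are kept unchanged.
            merged[mapping.get(tag, tag)] += count
        mapped[frame] = merged
    return mapped


def frame_purity(frames):
    """Majority-tag occurrences divided by all frame occurrences."""
    total = sum(sum(tags.values()) for tags in frames.values())
    return sum(max(tags.values()) for tags in frames.values()) / total


frames = {
    "Tôi_ở": Counter({"Vits": 1, "Vitn": 3}),
    "mẹ_bảo": Counter({"Np": 7, "Pp": 1}),
}
# Hypothetical reduction: both intransitive verb tags collapse into V.
mapping = {"Vits": "V", "Vitn": "V", "Np": "N", "Pp": "P"}

before = frame_purity(frames)                      # (3 + 7) / 12
after = frame_purity(map_frames(frames, mapping))  # (4 + 7) / 12
```

Merging tags can only keep or raise the purity of a frame, so the purity gain has to be weighed against the distinctions lost by the reduced tagset.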

We applied this method to a corpus annotated with the VnQtag tagset.

The experiments were performed on the VnQtag corpus, using four documents annotated with VnQtag. We then merged some VnQtag tags to form new tagsets, tagset 3 and tagset 4. This gives: VietTreeBank (18 tags), basic tagset 2 (8 tags), tagset 3 (25 tags) and tagset 4 (40 tags) (see appendix). To merge tags we relied on the book Ngữ pháp tiếng Việt by Diệp Quang Ban, which organizes the Vietnamese POS system into two groups:

- Group 1: noun, verb, adjective, numeral

- Group 2: adjunct (determiner, adverb), conjunction, particle

Each major category is divided into finer classes. For example, nouns are of two main kinds, proper nouns and common nouns. Common nouns comprise synthetic and non-synthetic nouns, both of which are further divided into countable and uncountable nouns, and so on.

To obtain the 25-POS and 30-POS tagsets, we merged some noun and verb tags. Nouns and verbs are basic categories and contain the largest numbers of words in Vietnamese. In the VnQtag tagset, nouns are divided into 8 finer tags and verbs into 10. We used the four annotated VnQtag documents and the four tagsets to obtain the results in Table 4 and Table 5.

Table 4 Some frames found in the corpus

Frame (frequency) | POS (frequency)
mẹ_là (4) | Pp (4)
Tôi_ở (4) | Vits (1)
Tôi_nông dân (3) | Vla (3)
nhà_ở (3) | Np (1), No (2)
Còn_sinh (3) | Pp (3)
ba_Phúc (2) | Nh (2)
với_đứa (2) | Nn (2)
sinh_nông thôn (3) | Cm (3)
Con_nhỏ (2) | No (2)
dăm_trẻ (2) | Nu (2)
đứa_dâng (2) | Nh (2)
trẻ_đào (2) | Vta (2)
có_người (3) | Aa (2)
 | Np (7)
đây_lần (2) | Vla (2)
là_đầu (2) |

Possibility of tagset convertibility

The existence of different tagsets for the same language gives linguists more options. For English, there are the following tagsets: the Brown tagset from 1967 (87 tags), the Susanne tagset from 1987 (353 wordtags), the Penn Treebank tagset from 1991 (36 tags) and the IBM Lancaster tagset from 1993 (132 tags). To choose among them, researchers have studied the relationships between tagsets as well as the requirements of specific applications.

For Vietnamese, there are three notable tagsets: VnQtag (59 tags), VnPos (15 tags) and VietTreeBank (18 tags). Some Vietnamese linguists advocate a minimal tagset, that is, they prefer smaller tagsets. With a small tagset, tagging is easier and costs less time and money. We therefore want to test the conversion of a large tagset into a small one. Of course, the reverse direction always holds. In the first direction, some words lose tag distinctions and become ambiguous, which is undesirable. However, if the number of such words is small, contextual or syntactic information can be added to resolve them. For instance, Daniel Zeman (2008) used Interset (tagset drivers) to convert a source tagset into a target one, and Bartosz Zaborowski and Adam Przepiórkowski (2012) used a set of rules for converting particular tags.

In our thesis, we focus on the ability to convert one tagset into another. What we want to find out is whether a large tagset can always be converted easily into a small tagset with a minimal number of ambiguous words. Ambiguous words are words that lose a distinction made by the finer tags when they are mapped to the target tagset. In particular, we proceeded as follows:

- Identify the tagsets to be checked.

- Identify the annotated corpus and the tagger.

- Count the number of words belonging to each POS tag of the tagset.

- Collect statistics on:
  o the number of ambiguous tokens in the corpus (when a large tagset is converted into a small one, some tags of the large tagset are merged to correspond to tags of the small tagset);
  o the number of ambiguous word types in the corpus.

- Compute the percentages of ambiguous tokens and word types.

The input to this method is two tagsets, VnQtag and VietTreeBank. We used the Qtag probabilistic tagger and Vn Tagger to tag a folder of 7 documents (Hoàng tử bé, Chuyện tình, Lược sử thời gian, Những bài học nông thôn, Chiến tranh cục bộ, Muối của rừng, An Dương Vương) with the two tagsets, respectively.

We then compared the outputs to draw our conclusions.
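The counting steps of the conversion procedure can be sketched as follows. This is a minimal sketch under our reading of the definition: a word is counted as ambiguous when it occurs with two or more source tags that are merged into the same target tag. The function name, the toy tokens and the mapping are hypothetical, and the input list is assumed to be non-empty.

```python
from collections import defaultdict


def ambiguity_stats(tagged_tokens, mapping):
    """Count word types and tokens whose source tags collapse into one target tag."""
    source_tags = defaultdict(set)   # (word, target tag) -> source tags seen
    token_counts = defaultdict(int)  # (word, target tag) -> number of tokens
    for word, tag in tagged_tokens:
        key = (word, mapping.get(tag, tag))
        source_tags[key].add(tag)
        token_counts[key] += 1
    ambiguous = {key for key, tags in source_tags.items() if len(tags) > 1}
    total_types = len({word for word, _ in tagged_tokens})
    ambiguous_types = len({word for word, _ in ambiguous})
    ambiguous_tokens = sum(token_counts[key] for key in ambiguous)
    return {
        "ambiguous_types": ambiguous_types,
        "type_pct": 100 * ambiguous_types / total_types,
        "ambiguous_tokens": ambiguous_tokens,
        "token_pct": 100 * ambiguous_tokens / len(tagged_tokens),
    }


# Toy data: "bé" occurs as Aa and An, "ai" as Pi and Pp.
tokens = [("bé", "Aa"), ("bé", "An"), ("cao", "Aa"),
          ("ai", "Pi"), ("ai", "Pp"), ("nhà", "Na")]
mapping = {"Aa": "A", "An": "A", "Pi": "P", "Pp": "P", "Na": "N"}
stats = ambiguity_stats(tokens, mapping)
```

On the toy data, 2 of the 4 word types and 4 of the 6 tokens lose a distinction under the mapping.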

Table 6 Some properties of the tagset convertibility method on Hoangtube

Note that "word" in the table above means word type, not token: each word is counted only once. The experiment was performed on a single document (Hoangtube), so the percentage of ambiguous words is quite small. The number of ambiguous words is sometimes large, so the table lists only some of the cases.

Table 7 Statistics of ambiguous word types in the VnQtag corpus

POS | Ambiguity | Total of word types | Percentage

In which:
V3 tag is merged from the following tags: Vo, Vs, Vb, Va, Vc, V la, Vm, Vim, Vla, Vv
V1 tag: Vitb, Vits, Vitc, Vitm, Vitim
V2 tag: Vta, Vtb, Vtc, Vtv, Vtim, Vto, Vts, Vtm, Vtv

Table 8 Statistics of ambiguous tokens in the VnQtag corpus

POS | Ambiguity | Total of tokens | Percentage

Table 9 Detailed statistics of ambiguous word types in the VnQtag corpus

Number of ambiguous word types

A Aa/An 12 bé, cao, con, cái, gần, hun hút, lớn, nhè nhẹ, nhỏ, sâu, ít, đầy

Na/Np 6 Nguyên Đán, chúa, elip, thuyết tương đối, thuỷ, đường

Na/Nt 15 ban, cuộc đời, công nguyên, khoảnh khắc, một khi, phép, sớm, thuở, thế kỷ, thời bình, thời gian, thời kỳ, tuổi thơ, tương lai, tết

Nt/Nu 13 buổi, bông, bọn, bữa, canh, con, cuộc, cái, kỳ, lát, lần, mồng, phiên

Na/Ng 33 bài học, chư hầu, chứng cứ, công việc, cảnh vật, của, cử chỉ, dân chúng, gia cảnh, gia đình, giải phóng, huyện đội, hình ảnh, hệ thống, hội nghị, luận chứng, lý lẽ, lỗi lầm, lời nói, mầu sắc, ngày tháng, nhà nước, quân đội, sinh vật, thiên nhiên, thân hình, triều đại, vật chất, xã hội, đám ma, đường nét, đất nước, đồ

Nl/Nu 3 bên, khu, nơi

Na/Nu 41 bước, bộ, chừng, câu, cõi, cú, cấp, dòng, gợn, hàng, hình, hương, khối, loại, màu, món, mảng, mối, niềm, nước, nền, nụ, phần, thằng, tiếng, tiền, trận, tên, tụi, vì, vòng, vòng tròn, vị, vở, vụ, điều, điệu, đàn, đám, đơn vị, đạo

Nl/Nt 20 bận, chiều, cuối, giữa, hiện tại, hôm sau, lâu nay, mai, một hơi, ngày nay, năm ngoái, sau, trong, trưa, trước, trước hết, trước đây, đông, đầu, đầu tiên

Na/Nl 19 bề mặt, chân trời, căn cứ, góc, không gian, lòng, mặt, nguồn gốc, nước ngoài, phương, phạm vi, quân khu, trước mắt, trời, tầm, ven, vùng, vương quốc, đất

Na/Nl/Nu 3 chân, khoảng cách, nguồn

Ng/Nu 6 chốc lát, cậu, cỗ, em, tập hợp, đội

Na/Nn 4 con số, hai, số, tí

Nn/Nu 6 các, lũ, ngôi, rưỡi, từng, độ

Ng/Nn 4 cặp, dăm ba, năm tháng, toàn bộ

Ng/Nt/Nu 1 giây phút

Na/Ng/Nt/Nu 1 giấc

Nl/Nt/Nu 4 giờ, hồi, khoảng, lúc

Na/Ng/Nu 8 kích thước, loài, lớp, sự, thứ, việc, đoàn, đại đội

Na/Nt/Nu 3 lượt, lứa, nỗi

Na/Ng/Nl 1 thế giới

Na/Nl/Nt 2 trước tiên, điểm

Pi/Pp 2 ai, những ai

Pd/Pi 2 bao giờ, đó

Jr/Jt 2 bỗng dưng, bỗng nhiên

Jd/Jr 3 còn, cứ, luôn

Vitm/Vtm 16 bắn, co, leo, luồn, lùi, mỉm, ngược, rúc, rẽ, thót, vòng, văng, vật, xuyên, đuổi, động

Vits/Vts 4 bắt đầu, cách, lả, phí

Vitm/Vits 18 chan hoà, chơi, cuồn cuộn, dậy, dồn, hé, loà xoà, mở, ngồi, nằm, quỳ, rời, sập, thấm, vượt, úp, đổ, đứng

Vta/Vtv 3 chấp nhận, chịu, phải

Vitm/Vta 3 du nhập, dâng, dẫn

Vitm/Vto 4 lên, lại, qua, vào

Vitm/Vtm/Vto 6 ra, sang, về, xuống, đi, đến

V1 Vitm/Vits 18 chan hoà, chơi, cuồn cuộn, dậy, dồn, hé, loà xoà, mở, ngồi, nằm, quỳ, rời, sập, thấm, vượt, úp, đổ, đứng

Vta/Vtv 3 chấp nhận, chịu, phải

Vtm/Vto 6 ra, sang, về, xuống, đi, đến

The percentage of ambiguous words in Table 8 is quite large (mostly above 15%). At this step we can therefore conclude that it is hard to convert the original tagset into another tagset with the tags V, V1, V2 and V3. A deeper analysis, however, reveals some other important aspects. Within one POS, ambiguous words usually concentrate in certain groups. For example, in the V category the (Vitm/Vtm) group includes 16 words and the (Vitm/Vits) group consists of 18 words, whereas the (Vits/Vtc) and (Vitm/Vtc) groups have only 1 word each. In other words, the percentage we calculated varies from group to group. Table 9 shows which groups have more ambiguous words than others. For example, among nouns the (Na/Ng/Nu) group has 8 ambiguous words compared with 4 in the (Nn/Nu/Nt) group.

Concepts related to the variation n-gram method

The method relies on several notions that the reader should grasp: n-gram, variation and variation n-gram.

N-gram: An n-gram is a contiguous sequence of n items from a given sequence of text or speech. In computational linguistics, the items in question are words. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on. For example: I have just gone out

Unigram: I; have; just; gone; out
Bigram: I have; have just; just gone; gone out
Trigram: I have just; have just gone; just gone out
Four-gram: I have just gone; have just gone out
Five-gram: I have just gone out
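The n-grams of the example sentence can be enumerated mechanically. The following is a minimal sketch, assuming the sentence has already been split into word tokens; the function name ngrams is ours and does not come from the thesis.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I have just gone out".split()

# Print unigrams, bigrams, ... up to the single five-gram.
for n in range(1, len(tokens) + 1):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
```

A sentence of k words has k - n + 1 n-grams, so the five-word example yields five unigrams and exactly one five-gram.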

Variation: A particular word that occurs more than once in a corpus can be assigned different tags in different occurrences. We refer to this as variation (Markus Dickinson, 2005).

For instance, in Hoangtube.txt, the word “sau” in an 11-gram local context varies between Nl (locative noun) and Nt (classifier noun):

sau (Nl) một lát im lặng, em lại nói: -

sau (Nt) một lát im lặng, em lại nói: -

As Markus Dickinson mentioned, variation in corpus annotation is caused by one of two reasons: ambiguity or error. Ambiguity occurs when a word has several possible lexical tags and different corpus occurrences of that word happen to realize the different options. An error can be detected because the tags of one word are inconsistent across comparable occurrences.

Variation n-grams: The notion of a variation n-gram (of words) plays an important role in our thesis; it is the key to detecting errors in corpus annotation. If the same n-gram occurs at different positions of the corpus and contains a word that is annotated differently in those occurrences, we call it a variation n-gram. The word that causes the variation is referred to as the variation nucleus.

For example: sau (Nl) một lát im lặng, em lại nói: -
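To make the definition concrete, the sketch below collects the variation n-grams of one fixed length n by brute force. It assumes the corpus is available as a flat list of (word, tag) pairs; the function name and the tags given to “một” and “lát” are illustrative assumptions, and only the Nl/Nt variation of “sau” comes from the example above.

```python
from collections import defaultdict


def variation_ngrams(tagged, n):
    """Map each word n-gram to its variation nuclei.

    tagged: list of (word, tag) pairs for the whole corpus.
    Returns {word_ngram: {offset: set_of_tags}}, keeping only the
    offsets whose word received more than one tag across occurrences.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        words = tuple(word for word, _ in window)
        for offset, (_, tag) in enumerate(window):
            seen[words][offset].add(tag)

    result = {}
    for words, by_offset in seen.items():
        nuclei = {k: tags for k, tags in by_offset.items() if len(tags) > 1}
        if nuclei:
            result[words] = nuclei
    return result


# Toy corpus: "sau" is tagged Nl in one occurrence and Nt in the other.
corpus = [("sau", "Nl"), ("một", "Nn"), ("lát", "Nt"), (",", ","),
          ("sau", "Nt"), ("một", "Nn"), ("lát", "Nt")]
print(variation_ngrams(corpus, 3))
```

On this toy corpus the only variation trigram is (sau, một, lát), and its variation nucleus sits at offset 0.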

Types of Vietnamese tagging error

An error is defined simply as a case in which the POS label of a word is inconsistent in its present context.

If we want to improve the results of an algorithm, we can identify the mistakes that still exist in those results. In other words, we can determine how many errors remain in the result in order to correct as many of them as possible. Kübler and Wagner (2000) classified tagging errors into four types:

Ambiguity: A word is ambiguous if it has more than one possible tag. This is a common error, partly because a word can have many meanings; e.g., light is tagged as a noun (N) if it denotes an object that can shine, and as an adjective (A) if it means the opposite of heavy. However, this kind of error can also be caused by the tagger, and these are the errors we try to find.

No ambiguity: The wrong tag is not a possible tag for the word.

Major category: In Vietnamese grammar there are eight major categories, namely noun, adjective, verb, adjunct, pronoun, conjunction, introductory word and emotivity word. In fact, these consist of four major parts and four minor parts. The four main POS are noun, adjective, adverb and verb. Each main POS can be divided into finer-grained classes; for example, the noun has 8 subcategories: proper noun, countable noun, collective noun, classifier noun, concrete noun, abstract noun, numeral and locative noun. This error type therefore applies when the major part-of-speech category is correct but the finer-grained distinction is not. For example, the word “tờ” is tagged as noun (N), but more precisely it is a classifier noun (NC).

No major category: This error occurs when the major part-of-speech category is incorrect; e.g., the word “vang” in the same context “Phạm Huỳnh Tam lang-ký ức một thời vang bóng Kỳ 10” of the VietTreeBank corpus is tagged both as noun (N) and as verb (V).

Using the variation n-gram algorithm, we try to find errors that belong to one of these four error types.
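One way to read the four error types is as two yes/no questions: is the wrong tag a possible tag for the word, and does it share the major category with the correct tag? The sketch below follows that reading. It is an assumption-laden illustration: the lexicon of possible tags is a hypothetical resource, the major category is approximated by the first letter of the tag, and the tags chosen as correct in the examples are for illustration only.

```python
def classify_error(word, correct_tag, wrong_tag, lexicon, major=lambda tag: tag[0]):
    """Describe a tagging error along the two dimensions of the four types.

    lexicon: dict mapping a word to the set of tags it can take
             (hypothetical resource, not provided by the thesis).
    major:   maps a fine-grained tag to its major category; using the
             first letter of the tag is an assumption.
    """
    possible = lexicon.get(word, set())
    ambiguity = "ambiguity" if wrong_tag in possible else "no ambiguity"
    same_major = major(correct_tag) == major(wrong_tag)
    category = "major category" if same_major else "no major category"
    return ambiguity, category


# Illustrative lexicon entries; the possible tags are assumed here.
lexicon = {"tờ": {"N", "Nc"}, "vang": {"V"}}

print(classify_error("tờ", "Nc", "N", lexicon))
print(classify_error("vang", "V", "N", lexicon))
```

Keeping the two questions separate means a single error can be reported, for example, as both an ambiguity error and a major category error.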

An algorithm for detecting errors

Having introduced the basic notions, we now present the algorithm for detecting errors. The algorithm builds on the observation that a variation n-gram must contain a variation (n-1)-gram. Therefore, we start from n=1 and proceed to the longest n. In particular, the steps are as follows (Dickinson, 2003):

Step 1: Compute all variation unigrams and store them together with their positions.

Step 2: Based on the positions of the variation n-grams stored last, extend each n-gram to either side (unless the corpus ends there). For each resulting (n+1)-gram, test whether it has another instance in the corpus. If it does and there is variation in the way the different occurrences of the (n+1)-gram are tagged, store it as a variation (n+1)-gram together with its positions.

Step 3: Repeat step 2 until we reach an n for which there are no variation n-grams in the corpus.
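The three steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in the thesis: it assumes the corpus is a flat list of (word, tag) pairs, and the function names are ours.

```python
from collections import defaultdict


def find_variation_ngrams(tokens):
    """Return {n: {word_ngram: [start positions]}} for all variation n-grams."""
    words = [word for word, _ in tokens]
    tags = [tag for _, tag in tokens]

    def varies(starts, n):
        # True if some offset carries more than one tag across the occurrences.
        return any(len({tags[s + k] for s in starts}) > 1 for k in range(n))

    def keep_variation(candidates, n):
        # Group candidate start positions by their word n-gram and keep the
        # n-grams that recur and are tagged differently somewhere.
        groups = defaultdict(list)
        for s in sorted(candidates):
            groups[tuple(words[s:s + n])].append(s)
        return {g: ss for g, ss in groups.items() if len(ss) > 1 and varies(ss, n)}

    # Step 1: variation unigrams and their positions.
    n = 1
    current = keep_variation(range(len(words)), n)
    result = {}
    # Step 3: repeat until no variation n-gram is left.
    while current:
        result[n] = current
        # Step 2: extend every stored occurrence one word to the left and right.
        candidates = set()
        for starts in current.values():
            for s in starts:
                if s > 0:
                    candidates.add(s - 1)   # extension to the left
                if s + n < len(words):
                    candidates.add(s)       # extension to the right
        n += 1
        current = keep_variation(candidates, n)
    return result


corpus = [("sau", "Nl"), ("một", "Nn"), ("lát", "Nt"), (",", ","),
          ("sau", "Nt"), ("một", "Nn"), ("lát", "Nt"), (".", ".")]
for n, grams in find_variation_ngrams(corpus).items():
    print(n, grams)
```

Only positions that already belong to a variation n-gram are extended, so the search space shrinks as n grows instead of requiring every (n+1)-gram of the corpus to be enumerated.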

Classifying variations

After applying the algorithm to find all variation n-grams and their variation nuclei in the corpus, we need to identify which variations are errors and which are ambiguities. To do so, we rely on the following observations.

The basic idea for detecting errors in a corpus is that the longer the context, the more likely the variation nucleus is an error. Moreover, Vietnamese is an isolating language in which grammatical meaning is expressed mainly through word order. Context therefore plays an important role in detecting errors. We illustrate this with the following example.

Context: ! - Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ? - - Đêm trắng theo “thiếu_gia

Here words are separated by spaces and an underscore connects the syllables within one word. This 18-gram context appears 3 times; in these occurrences “về” is tagged once as adverb (R) and twice as preposition (E). Moreover, there are 8 tokens to the left of the variation nucleus and 9 to the right (including punctuation). This variation therefore turns out to be an error.

A question that arises is how long a variation n-gram must be to be worth reviewing. Here we start with n=5. This number is not fixed and can be changed, but the results achieved with it are quite reasonable.

Another observation concerns structural boundaries, i.e., if a variation nucleus occurs within a complete sentence then it is likely to be an error.

In addition to the length of the context, we look more closely at the position of the variation nucleus within the context. If the variation nucleus appears on the fringe of a context, i.e., at the beginning or the end of the context, then the variation is more likely to be an ambiguity.
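The length and fringe observations can be combined into a simple filter. The sketch below is an assumption-based illustration: the function name and the parameter min_n are ours (min_n=5 follows the starting value mentioned above), and the structural boundary observation is not modelled.

```python
def looks_like_error(n, nucleus_offset, min_n=5):
    """Heuristic filter for one variation n-gram.

    n:              length of the variation n-gram (context).
    nucleus_offset: 0-based position of the variation nucleus in the n-gram.
    min_n:          shortest context considered long enough to review.
    """
    on_fringe = nucleus_offset == 0 or nucleus_offset == n - 1
    return n >= min_n and not on_fringe


# The 18-gram with "về": 8 tokens to its left, so the nucleus is at offset 8.
print(looks_like_error(18, 8))
# "sau" at the very beginning of an 11-gram: on the fringe.
print(looks_like_error(11, 0))
```

Raising or lowering min_n trades precision against the number of candidates that have to be reviewed by hand.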

Result of detecting errors in POS tagging

For each variation nucleus, we count only its longest context. In other words, when several contexts are developed from the same variation nucleus, we take only the last (longest) context into account and ignore the others.

Figure 3 N-gram and variation nuclei in VTB corpus with n up to 29

Applying the variation n-gram method to the VTB corpus, we obtained first figure 3 and then table 12. The longest n we found is 29, with a total of 11,428 variation nuclei. The number of variation nuclei differs greatly between unigrams (n=1) and bigrams (1691 and 5515, respectively). This happens because both words in a bigram can be variation nuclei.

Table 10 Statistic errors in corpus

N-gram Variation nuclei Errors N-gram Variation nuclei Errors

We value the idea of using variation n-grams to detect errors highly. Table 10 shows that 67 (0.107%) errors were found. As mentioned in the theory, if a variation nucleus appears on the fringe of its context we do not consider it an error, so the results in the table exclude those cases. Some rows show zero errors because their errors are counted in the longer contexts that follow. Here we counted errors starting from n=6; however, we believe that a lower n could be used, in which case the number of errors would change.

Table 11 The detail n-gram in tagged corpus

N-gram Context Nuclei Label Line File

, một lượt vé máy_bay Lượt NC 42 105055.seg.pos

Sau ba năm làm_việc ở Ở E 42 109898.seg.pos

Bộ Kế_hoạch - đầu_tư cấp phép Kế_hoạch NP 28 86375.seg.pos

Bộ Lao_động - thương_binh & xã_hội Lao_động

! - Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ? Về

- Kỳ 1 : Kỳ 2 : Lênh_đênh chìm_nổi đời Lênh_đênh A 79 104884.seg.pos

- Kỳ 1 : Vào bộ_đội 06:04:00 Chào_mừng Đại_hội thi_đua Đại_hội Np 2 82711.seg.pos

- Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ? - Về

- ký_ức một thời vang bóng Kỳ 10 : “ vang

Bữa nào khao nặng phải mất hai cục ( Cục N 21 104395.seg.pos

1 : Kỳ 2 : Lênh_đênh chìm_nổi đời người Kỳ Lênh_đênh A 79 104884.seg.pos

20 triệu hả ? ” Tuấn nhả khói thuốc hả I 19 104395.seg.pos

: Gặp lại “ kỳ_quan bóng_bàn thế_giới ”

6 : Danh_thủ Trương_Tấn_Nghĩa : tài_năng và đào_hoa ! Kỳ

Cái kết buồn của VĐV VN xuất_sắc nhất thế_kỷ 20 Kết

Ngã rẽ của ông Weigang và số_phận chiếc cúp vô_địch vô_địch

Phạm_Huỳnh_Tam_Lang - ký_ức một thời vang bóng Kỳ 10 : vang

_Văn_Hòa và chữ_ký giải nợ Chà_và Kỳ

3 : Gặp lại “ kỳ_quan bóng_bàn thế_giới Gặp

“ Lưỡng_thủ vạn_năng ” Phạm_Văn_Rạng Kỳ 6 : Danh_thủ Trương_Tấn_Nghĩa : tài_năng và đào_hoa ! Kỳ 7 : Danh_thủ

: “ Có đêm ông tiêu hết 20 triệu hả ? ” Tuấn nhả khói thuốc lạnh_lùng bảo Hả

Weigang và số_phận chiếc cúp vô_địch

Kỳ 9 : Phạm_Huỳnh_Tam_Lang - ký_ức một thời vang bóng Kỳ 10 : “ Nữ_hoàng

” không ngai môn bóng nhựa Kỳ 11 vang

V 57 108804.seg.pos bắt , lắc vẫn cứ lắc ! - Thanh_niên thác_loạn : trách_nhiệm thuộc về ai ? - - Đêm trắng theo “ thiếu_gia ” đi thác_loạn -

E 66 84143.seg.pos thế_giới ” Kỳ 5 : “ Lưỡng_thủ vạn_năng

” Phạm_Văn_Rạng Kỳ 6 : Danh_thủ Trương_Tấn_Nghĩa : tài_năng và đào_hoa ! Kỳ 7 : Danh_thủ Thể_Công làm HLV trên đất

Mai_Văn_Hòa và chữ_ký giải nợ Chà_và Kỳ 3 : Gặp lại “ kỳ_quan bóng_bàn thế_giới ” Kỳ 4 : Gặp lại “ kỳ_quan bóng_bàn

The Vietnamese treebank tagset

The tagset contains 59 part-of-speech tags, which are distributed into 7 classes, and 10 tags for punctuation and symbols.

Id POS English Vietnamese

1 Countable noun Danh từ đơn thể
2 Vitc Comparative intransitive verb Động từ nội động so sánh
3 Np Proper noun Danh từ riêng
4 Vitm Moving intransitive verb Động từ nội động chuyển động
5 Ng Collective noun Danh từ tổng thể
6 Pd Time and space pronoun Đại từ không gian, thời gian
7 Nt Classifier noun Danh từ loại thể
8 Pn Quantity pronoun Đại từ số lượng
9 Nu Concrete noun Danh từ đơn vị
10 Pi Interrogative pronoun Đại từ nghi vấn
11 Na Abstract noun Danh từ trừu tượng
12 Pp Personal pronoun Đại từ xưng hô
13 Nn Numeral Danh từ số lượng
14 Pa Quality pronoun Đại từ hoạt động, tính chất
15 Nl Locative noun Danh từ vị trí
16 An Quantity adjective Tính từ hàm lượng
17 Vt Transitive verb Động từ ngoại động
18 Aa Quality adjective Tính từ hàm chất
19 Vit Intransitive verb Động từ nội động
20 Jt Time adverb Phụ từ thời gian
21 Vim Impression verb Động từ cảm nghĩ
22 Jd Degree adverb Phụ từ mức độ
23 Vo Orientation verb Động từ phương hướng
24 Jr Comparative adverb Phụ từ so sánh
25 Vs State verb Động từ tồn tại
26 Ja Negation or acceptation adverb
27 Vb Transformation verb Động từ biến hoá
28 Ji Imperative adverb Phụ từ mệnh lệnh
29 Va Acceptation verb Động từ tiếp thụ
30 Cm Major/minor conjunction Liên từ chính phụ
31 Vc Comparative verb Động từ so sánh
32 Cc Combination conjunction Liên từ liên hợp
33 Vla Verb 'là' Động từ 'là'
34 I Introductory word Trợ từ
35 Vm Moving verb Động từ chuyển động
36 E Emotivity word Cảm từ
37 Vv Volitive verb Động từ ý chí
38 X Unknown/Uncertain Không xác định
39 Vtim Impression transitive verb Động từ ngoại động cảm nghĩ
40 # Pound sign Dấu thăng
41 Vta Acceptation transitive verb Động từ ngoại động tiếp thụ
42 $ Dollar sign Dấu đô-la
43 Vtb Transformation transitive verb Động từ ngoại động biến hóa
44 Sentence-final punctuation Dấu chấm hết câu
45 Vtc Comparative transitive verb Động từ ngoại động so sánh
46 , Comma Dấu phẩy
47 Vto Orientation transitive verb Động từ ngoại động chỉ hướng
48 : Colon Dấu hai chấm
49 Vts State transitive verb Động từ ngoại động tồn tại
50 ; Semi-colon Dấu chấm phảy
51 Vtm Moving transitive verb Động từ ngoại động chuyển động
52 ( Left bracket character Dấu mở ngoặc đơn trái
53 Vtv Volitive transitive verb Động từ ngoại động chỉ ý chí
54 ) Right bracket character Dấu đóng ngoặc đơn phải
55 Vitim Impression intransitive verb Động từ nội động cảm nghĩ
56 ' Single quote Dấu nháy đơn
57 Vitb Transformation intransitive verb Động từ nội động biến hóa
58 " Double quote Dấu nháy kép
59 Vits State intransitive verb Động từ nội động tồn tại
60.

Vietnamese Tagset (VietTreeBank)

1 Np Proper noun Danh từ riêng

2 Nc Classifier Danh từ chỉ loại

3 Nu Unit noun Danh từ đơn vị

4 N Noun other Danh từ khác

8 L Determiner (e.g mot, nhung, cac) Định từ

11 E Preposition Giới từ (Liên kết chính phụ)

12 C Conjunction Liên kết từ (Liên kết đẳng lập)

14 T Particle Trợ từ, tình thái từ (tiểu từ )

15 U Bound morpheme Từ tiếng nước ngoài

17 X Unknown Các từ không phân loại được

18 Symbol Symbol Các ký hiệu đặc biệt khác (? / # $)

Tagset 3 (25tags)

Countable noun Abstract noun Collective noun

Danh từ đơn thể Danh từ tổng thể Danh từ trừu tượng

3 Np Proper noun Danh từ riêng 4 Aa Quality adjective Tính từ hàm chất

5 Nt Classifier noun Danh từ loại thể 6 Jt Time adverb Phụ từ thời gian

Danh từ đơn vị Danh từ số lượng

8 Jd Degree adverb Phụ từ mức độ

9 Nl Locative noun Danh từ vị trí 10 Jr Comparative adverb Phụ từ so sánh

Vt/Vt im/V ta/Vt b/Vtc /Vto/

Vt Động từ nội động 12 Ja

Vit Động từ nội động 14 Ji Imperative adverb Phụ từ mệnh lệnh

Vr Động từ còn lại

16 Cm Major/minor conjunction Liên từ chính phụ

17 Cc Combination conjunction Liên từ liên hợp

18 Pd Time and space pronoun Đại từ không gian, thời gian

20 Pn Quantity pronoun Đại từ số lượng 21 E Emotivity word Cảm từ

22 Pi Interrogative pronoun Đại từ nghi vấn 23 X Unknown/Uncertain Không xác định

24 Pp Personal pronoun Đại từ xưng hô 25 Pa Quality pronoun Đại từ hoạt động, tính chất

Countable noun Danh từ đơn thể 2 Vitb Transformation intransitive verb Động từ nội động biến hóa

Pronoun noun Numeral Classifier noun

Danh từ loại thể Danh từ số lượng

4 Vits State intransitive verb Động từ nội động tồn tại

5 Ng Collective noun Danh từ tổng thể 6 Vitc Comparative intransitive verb Động từ nội động so sánh

7 Nu Concrete noun Danh từ đơn vị 8 Vitm Moving intransitive verb Động từ nội động chuyển động

9 Na Abstract Noun Danh từ trừu tượng 10 Pd Time and space pronoun Đại từ không gian, thời gian

11 Nl Locative noun Danh từ vị trí 12 Pi Interrogative pronoun Đại từ nghi vấn

Intransitive verb/Comparati ve verb/Verb 'là'/Volitive verb/

Acceptation intransitive verb Động từ ngoại động /Động từ nội động /Động từ so sánh /Động từ 'là' /Động từ ý chí /Động từ ngoại động tiếp thụ

14 Pp Personal pronoun Đại từ xưng hô

15 Vim Impression verb Động từ cảm nghĩ 16 Pa Quality pronoun Đại từ hoạt động, tính chất

17 Vo Orientation verb Động từ phương hướng 18 An Quantity adjective Tính từ hàm lượng

19 Vs State verb Động từ tồn tại 20 Aa Quality adjective Tính từ hàm chất

21 Vb Transformation verb Động từ biến hoá 22 Jt Time adverb Phụ từ thời gian

23 Va Acceptation verb Động từ tiếp thụ 24 Jd Degree adverb Phụ từ mức độ

25 Vm Moving verb Động từ chuyển động 26 Jr Comparative adverb Phụ từ so sánh

27 Vtim Impression transitive verb Động từ ngoại động cảm nghĩ 28 Ja Negation or acceptation adverb

29 Vtb Transformation transitive verb Động từ ngoại động biến hóa 30 Ji Imperative adverb Phụ từ mệnh lệnh

31 Vtc Comparative transitive verb Động từ ngoại động so sánh 32 Cm Major/minor conjunction Liên từ chính phụ

33 Vto Orientation transitive verb Động từ ngoại động chỉ hướng 34 Cc Combination conjunction Liên từ liên hợp

35 Vts State transitive verb Động từ ngoại động tồn tại 36 I Introductory word Trợ từ

37 Vtm Moving transitive verb Động từ ngoại động chuyển động 38 E Emotivity word Cảm từ

39 Vtv Volitive transitive verb Động từ ngoại động chỉ ý chí 40 Vitim Impression intransitive verb Động từ nội động cảm nghĩ

A5 Syntax function tags in VTB

1 H The head element of phrase

3 DOB Direct object function label

4 IOB Indirect object function label

6 PRD Predicate function label not verb phrase

7 LGS Logic subject function label of passive voice sentence

8 EXT Complement function label expressing the range or frequency of an action

9 VOC Complain component function label

A6 Adverbial classification tag of verb in VTB

1 TMP Adverbial function label expresses time

2 LOC Adverbial function label expresses location

3 DIR Adverbial function label expresses direction

4 MNR Adverbial function label expresses manner

5 PRP Adverbial function label expresses purpose or reason

6 CND Adverbial function label expresses condition

7 CNC Adverbial function label expresses concession

8 ADV Adverb function label (the rest of the situations)

Title	Tagset Evaluation and Automatical Error Verrification in POS Tagged Corpus
Author	Thi-Thanh-Tam Do
Supervisor	Dr. Nguyen Phuong Thai
University	Vietnam National University Hanoi University of Engineering and Technology
Major	Computer science
Type	Master Thesis
Year of publication	2012
City	Ha Noi
