Tagset Evaluation and Automatic Error Verification in POS-Tagged Corpus

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

THI-THANH-TAM DO

TAGSET EVALUATION AND AUTOMATIC ERROR VERIFICATION IN POS-TAGGED CORPUS

Branch of knowledge: Information technology
Major: Computer science
Code: 60 48 01

MASTER THESIS (Natural language processing)

Supervisor: Dr Nguyen Phuong Thai

Ha Noi - 2012

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT
CHAPTER 1: INTRODUCTION AND MOTIVATION
  1.1 Characteristics of Vietnamese language
  1.2 Vietnamese part of speech
    1.2.1 Criteria to classify
    1.2.2 The ways to build up tagset
  1.3 Corpora
    1.3.1 VietTreeBank
    1.3.2 VnQtag
  1.4 Motivation
  1.5 Organization of the thesis
CHAPTER 2: EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
  2.1 Tagset evaluation
    2.1.1 Introduction
    2.1.2 Tagset
    2.1.3 A method for evaluating distributional properties of tagsets
    2.1.4 Result of tagset evaluation
  2.2 Possibility of tagset convertibility
    Result of tagset convertibility
CHAPTER 3: AUTOMATIC ERROR VERIFICATION OF POS-TAGGED CORPUS
  3.1 Concepts related to the variation n-gram method
  3.2 Types of Vietnamese tagging error
  3.3 An algorithm for detecting errors
  3.4 Classifying variations
  3.5 Result of detecting errors in POS tagging
  3.6 Word segmentation
    3.6.1 Word in Vietnamese
    3.6.2 N-gram in word segmentation
    3.6.3 Result of detecting errors in word segmentation
CHAPTER 4: CONCLUSION AND SUMMARY
BIBLIOGRAPHY
APPENDIX
  A.1 The Vietnamese treebank tagset
  A.2 Vietnamese Tagset (VietTreeBank)
  A.3 Tagset (25 tags)
  A.4 Tagset (40 tags)
  A.5 Syntax function tags in VTB
  A.6 Adverbial classification tag of verb in VTB
  A.7 Phrase tagset in VTB
  A.8 Clause tagset in VTB

LIST OF FIGURES

Figure 1.1 The features of Vietnamese type
Figure: Purity as an external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71
Figure: N-grams and variation nuclei in the VTB corpus with n up to 29

LIST OF TABLES

Table 1 The expression of grammatical meaning in Vietnamese
Table 2 Corpus with VnQtag tagset annotation
Table 3 Principal differences between Vietnamese and English
Table 4 Some frames found in the corpus
Table 5 Result of tagset evaluation method
Table 6 Some properties in tagset convertibility method in Hoangtube
Table 7 Statistics of ambiguous word types in the VnQtag corpus
Table 8 Statistics of ambiguous tokens in the VnQtag corpus
Table 9 Detailed statistics of ambiguous word types in the VnQtag corpus
Table 10 Statistics of errors in the corpus
Table 11 Details of n-grams in the tagged corpus
Table 12 Error and ambiguity statistics of the word segmentation algorithm
Table 13 Details of contexts and variations in the VTB corpus

CHAPTER 1: INTRODUCTION AND MOTIVATION

1.1 Characteristics of Vietnamese language

Every language has its own features, and Vietnamese is no exception. To understand Vietnamese better, we list some of its salient features and compare it with other languages such as Chinese and English. According to M. Ferlus and other domestic and international researchers, Vietnamese is a language of native origin that belongs to the Mon-Khmer branch of the South Asian (Austroasiatic) language family and is closely related to the Muong language.

Vietnamese is also an isolating language, with three prominent features. Firstly, the syllable is the basic unit from which words and sentences are formed. A syllable may be a single word by itself, or an element of a complex word, a compound word or a reduplicative word. Secondly, Vietnamese words are not inflected. In particular, there is no difference
between singular and plural nouns; for example, “hai sách” (two books) and “một sách” (one book). Thirdly, grammatical meaning is expressed mainly through word order and expletives (function words). Given expletives such as “sẽ, đã, không” and the sentence “Tôi ngồi”, we can make three sentences with different meanings: “Tôi sẽ ngồi”; “Tôi đã ngồi”; “Tôi không ngồi”.

Figure 1.1 The features of Vietnamese type (the syllable as the basic unit forming words and sentences)

Some other languages, such as Chinese and Thai, are also isolating languages, whereas English, French and Russian are inflectional languages. The two types therefore differ in some features, as a comparison of Vietnamese, English and Chinese sentences shows.

Table 1 The expression of grammatical meaning in Vietnamese (columns: Word order, Expletive)

Unlike in Vietnamese and Chinese, in the English sentence above a change of word order turns the object pronoun into a subject pronoun (him → he).

1.2 Vietnamese part of speech

1.2.1 Criteria to classify

In European languages, the notion of POS is tied to morphological categories such as gender, number, mood, and so on. In Vietnam, there are two views:
- Firstly, POS does not exist in Vietnamese because Vietnamese has no morphological modification (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung).
- Secondly, Vietnamese has POS just as European languages do, but classifying words into tags, or defining the POS of a word, must rest on certain criteria.

So far, Vietnamese linguists have largely agreed on the following criteria (Diep Quang Ban, Hoang Van Thung, 2010):

a. General meaning: “The meaning of a POS is the general meaning of a group of words, based on generalizing vocabulary to form a common grammatical category (lexical-grammatical category).” POSs thus fit the definition of a classification category. They are groups containing very large numbers of words, each group sharing a classifying feature: object, quality, action or state, and so on. For example, nhà, bàn, chim, học sinh, con, quyển, sự and similar words are classified as nouns because their lexical meaning is generalized and abstracted as objects; their grammatical category is noun.

b. Combination ability: Given their general meaning, words can enter into meaningful combinations: some words can replace each other in a certain position of a combination, while the rest of the combination provides the context in which this substitution is possible. For example, nhà, bàn, chim, cát and so on can appear in, and replace each other in, combinations of the type nhà này, chim này, cát này, etc., and are therefore classified as nouns.

c. Syntactic function: Words that take part in sentence composition, that can stand in one or several particular positions in a sentence or replace each other in those positions, and that express the same syntactic relation to the other parts of the sentence can be classified into one POS. For instance, words such as nhà, bàn, chim, cát are nouns: they can be the subject of a sentence, and the subject function is one of the syntactic functions used to classify them as nouns.

1.2.2 The ways to build up tagset

Two kinds of POS tagset have been developed so far, of which the first has received much more attention from linguistic researchers. The first kind is based on the basic POS tags that are widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection and emotive word. From these basic tags, finer sets of POS tags are built up. Each researcher relies on certain criteria to refine the tagset (the criteria are discussed in section 1.2.1). Notably, the VnQtag tagset of Tran Thi Oanh contains 14 tags; VietTreeBank consists of 17 tags; VnQtag has 59 tags (see appendix). The second kind is built by mapping a tagset from another language to Vietnamese, based on the correspondence between words of the two languages (Dinh Dien and Hoang Kiem 2003).

1.3 Corpora

Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important
role in current work in computational linguistics, so great attention has gone into developing such corpora. Many countries have their own corpora. Some well-known corpora are the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997) and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). In Vietnam, the notable corpora are VnQtag, VnPos and VTB. To build a corpus, some obligatory criteria must be met (McEnery and Wilson, 2001, p.29):
- Sampling and representativeness: the elements of a corpus must be general, diverse and plentiful. A sample is representative if what we find for the sample also holds for the general population.
- Finite size: the bigger a corpus is, the more highly it is valued, but its size is still finite.
- Machine-readable form.
- Standard reference.

We must admit that building a large corpus manually takes a lot of time and requires extensive linguistic knowledge. Even then, the quality of a large manually built corpus is not guaranteed. Our thesis therefore examines this quality and tries to improve it. The two corpora used in our experiments are VietTreeBank and VnQtag. Below, we discuss in more detail how these corpora were built.

1.3.1 VietTreeBank

VietTreeBank is the result of the national project VLSP and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen and annotators). The corpus includes 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 Vietnamese sentences with syntactic annotation (word segmentation, POS tagging, syntactic structure). The group used MEM and CRF machine learning models to assign POS tags. The accuracy of the model is over 93%. VTB was developed to support the building of programs for word segmentation, POS tagging, syntactic parsing, and so on.

The VTB group chose two criteria to classify POS: combination ability and syntactic function. For instance,
a noun can serve as the subject or object of a sentence. Besides, a noun can combine with numerals (three, four) and attributes (each, every). A POS tag can carry information about the basic word class (noun, verb, adjective, and so on), morphological information (countable or uncountable), subcategory (verb that goes with a noun, verb that goes with a clause, etc.), semantic information or other syntactic information. The VTB group built its tagset on the basic word class alone, without other information such as morphological information or subcategory (see the tagset in the appendix).

In addition to POS information, the group describes basic syntactic elements such as phrases and clauses. Syntactic tags are the most fundamental information in a syntax tree; they form the spine of the tree. A.7 and A.8 in the appendix list the phrase and clause tagsets, respectively.

BIBLIOGRAPHY

Dinh Dien, Hoang Kiem and Nguyen Van Toan (2001), “Vietnamese Word Segmentation”, Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium, November 27-30, 2001, Hitotsubashi Memorial Hall, National Center of Sciences, Tokyo, Japan, pp. 749-756.

Markus Dickinson and W. Detmar Meurers (2003), “Detecting errors in Part of speech annotation”, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1, EACL '03, (1-333-56789-0), pp. 107-114.

Daniel Jurafsky and James H. Martin (2010), Speech and Language Processing, Upper Saddle River, New Jersey 07458. Stanford University and University of Colorado at Boulder.

Markus Dickinson (2008), Representations for category disambiguation, Indiana University.

Cambridge University Press, evaluation of clustering (2009), pp. 356-360.

Johanna Geiss (2009), Gold Standard for Sentence clustering in Multi-document Summarization, Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pp. 101, Suntec, Singapore, August 2009.

Kübler and Wagner (2000), Evaluating POS tagging under Sub-Optimal Conditions or: Does meticulousness pay? University of Tübingen.

Dickinson, Markus and Charles Jochim (2008), “A simple method for Tagset Comparison”, In Proceedings of LREC 2008, Marrakech, Morocco.

http://www.loria.fr/~lehong/projects/vntreebank/tagset.utf8.html

http://trac.sketchengine.co.uk/wiki/tagsets/vietnamese

http://www.myreaders.info/10_Natural_Language_Processing.pdf

http://wac.colostate.edu/books/sound/chapter6.pdf

http://ezinearticles.com/?Recognizing-the-Four-Major-Parts-of-Speech&id=986159

Hervé Déjean (2000), How to Evaluate and Compare Tagsets? A Proposal, Seminar für Sprachwissenschaft, Universität Tübingen.

Zeman, Daniel (2010), Hard Problems of Tagset Conversion, Hong Kong, China: ICGL 2010: Proceedings of the Second International Conference on Global Interoperability for Language Resources (978-962-442-323-5), pp. 181-185.

Dinh Dien and Hoang Kiem (2003), POS-Tagger for English-Vietnamese Bilingual corpus, In HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pp. 88-95, Edmonton, May-June 2003.

Dzeroski Saso, Erjavec Tomaz and Zavrel Jakub, Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets, Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000, pp. 1099-1104.

Daniel Zeman (2008), “Reusable Tagset Conversion Using Tagset Drivers”, Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco.

Cam-Tu Nguyen, Trung-Kien Nguyen, Xuan-Hieu Phan, Le-Minh Nguyen and Quang-Thuy Ha (2006), “Vietnamese Word Segmentation with CRFs and SVMs: An Investigation”, Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation (PACLIC20), pp. 215-222, China.

Nguyen Thanh Bon, Nguyen Thi Minh Huyen, Romary Laurent and Vu Xuan Luong (2004), “Developing Tools and Building Linguistic Resources for Vietnamese Morpho-Syntactic Processing”, 4th International Conference on Language Resources and Evaluation
- LREC'04.

APPENDIX

A.1 The Vietnamese treebank tagset

The tagset contains 59 part-of-speech tags, which are distributed into classes, and 10 tags for punctuation and symbols.

Id  POS    English
1   Nc     Countable noun
3   Np     Pronoun noun
5   Ng     Collective noun
7   Nt     Classifier noun
9   Nu     Concrete noun
11  Na     Abstract noun
13  Nn     Numeral
15  Nl     Locative noun
17  Vt     Transitive verb
19  Vit    Intransitive verb
21  Vim    Impression verb
23  Vo     Orientation verb
25  Vs     State verb
27  Vb     Transformation verb
29  Va     Acceptation verb
31  Vc     Comparative verb
33  Vla    Verb 'là'
35  Vm     Moving verb
37  Vv     Volitive verb
39  Vtim   Impression transitive verb
41  Vta    Acceptation intransitive verb
43  Vtb    Transformation transitive verb
45  Vtc    Comparative transitive verb
47  Vto    Orientation transitive verb
49  Vts    State transitive verb
51  Vtm    Moving transitive verb
53  Vtv    Volitive transitive verb
55  Vitim  Impression intransitive verb
57  Vitb   Transformation intransitive verb
59  Vits   State intransitive verb

A.2 Vietnamese Tagset (VietTreeBank)

Id  POS
1   N
2   N
3   N
4   N
5   V
6   A
7   P
8   L
9   M
10  R
11  E
12  C
13  I
14  T
15  U
16  Y
17  X
18  Sym

A.3 Tagset (25 tags)

Id  POS                                  English
    Nc/Ng/Na                             Countable noun, Abstract noun, Collective noun
    Np                                   Pronoun noun
    Nt                                   Classifier noun
    Nu/Nn                                Concrete noun, Numeral noun
    Nl                                   Locative noun
    Vt/Vtim/Vta/Vtb/Vtc/Vto/Vts/Vtm/Vtv  Vt
    Vit/Vitim/Vitb/Vits/Vitc/Vitm        Vit
    Vim/Vo/Vs/Vb/Vr
    Vv/Va/Vc/Vla
18  Pd                                   Time and space pronoun
20  Pn                                   Quantity pronoun
22  Pi                                   Interrogative pronoun
24  Pp                                   Personal pronoun

A.4 Tagset (40 tags)

Id  POS                    English
1   Nc                     Countable noun
3   Np/Nt/Nn               Pronoun noun, Numeral, Classifier noun
5   Ng                     Collective noun
7   Nu                     Concrete noun
9   Na                     Abstract noun
11  Nl                     Locative noun
13  Vt/Vit/Vc/Vla/Vv/Vta   Transitive verb / Intransitive verb / Comparative verb / Verb 'là' / Volitive verb / Acceptation intransitive verb
15  Vim                    Impression verb
17  Vo                     Orientation verb
19  Vs                     State verb
21  Vb                     Transformation verb
23  Va                     Acceptation verb
25  Vm                     Moving verb
27  Vtim                   Impression intransitive verb
29  Vtb                    Transformation intransitive verb
31  Vtc                    Comparative intransitive verb
33  Vto                    Orientation intransitive verb
35  Vts                    State intransitive verb
37  Vtm                    Moving intransitive verb
39  Vtv                    Volitive intransitive verb

A.5 Syntax function tags in VTB

ID  NAME
1   H
2   SUB
3   DOB
4   IOB
5   TPC
6   PRD
7   LGS
8   EXT
9   VOC

A.6 Adverbial classification tag of verb in VTB

Id  Name
1   TMP
2   LOC
3   DIR
4   MNR
5   PRP
6   CND
7   CNC
8   ADV

A.7 Phrase tagset in VTB

Name: NP, VP, AP, RP, PP, QP, MDP, LST, WHNP, WHRP, WHPP, UCP, WHAP

A.8 Clause tagset in VTB

ID  NAME
1   S
2   SQ
3   S-EXC
4   S-CMD
5   SBAR

... linguistic sources; it obeys international standards and data express. The resulting corpus has the following format: each lexical unit and its corresponding POS stand on one line, in which using space in...

... part in improving Vietnamese processing by concentrating on enhancing tagsets and detecting errors in tagging. Natural language processing is done at five stages. These are: Morphological and ...
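The merged rows in the 40-tag table of the appendix (for example Np/Nt/Nn, or Vt/Vit/Vc/Vla/Vv/Vta) can be read as a many-to-one mapping from fine-grained tags to a coarser label. The following is only a minimal sketch of that reading, not the procedure used in the thesis: the names MERGED_GROUPS, build_tag_map and convert_tags are hypothetical, only two merged groups are included, and the tagged sample in the usage note is invented for illustration.

```python
# Hypothetical sketch: collapse fine-grained POS tags into merged labels.
# Only two merged groups from the 40-tag table are listed (an assumption);
# every tag that is not listed keeps its original label.
MERGED_GROUPS = [
    ("Np", "Nt", "Nn"),
    ("Vt", "Vit", "Vc", "Vla", "Vv", "Vta"),
]


def build_tag_map(groups):
    """Map each fine-grained tag to the slash-joined label of its group."""
    tag_map = {}
    for group in groups:
        label = "/".join(group)
        for tag in group:
            tag_map[tag] = label
    return tag_map


def convert_tags(tagged_tokens, tag_map):
    """Return (word, tag) pairs with tags replaced by their merged label."""
    return [(word, tag_map.get(tag, tag)) for word, tag in tagged_tokens]


TAG_MAP = build_tag_map(MERGED_GROUPS)
```

For an invented input such as `convert_tags([("chim", "Nc"), ("là", "Vla")], TAG_MAP)`, the Nc tag passes through unchanged, while Vla is replaced by the merged label Vt/Vit/Vc/Vla/Vv/Vta. Because the mapping is many-to-one, it cannot be inverted, which is one reason conversion from a coarse tagset back to a finer one needs extra information.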
