VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

THI-THANH-TAM DO

TAGSET EVALUATION AND AUTOMATICAL ERROR VERIFICATION IN POS TAGGED CORPUS

Branch of knowledge: Information technology
Major: Computer science
Code: 60 48 01

MASTER THESIS (Natural language processing)

Supervisor: Dr Nguyen Phuong Thai

Ha Noi - 2012

TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT
CHAPTER 1: INTRODUCTION AND MOTIVATION
1.1 Characteristics of Vietnamese language
1.2 Vietnamese part of speech
1.2.1 Criteria to classify
1.2.2 The ways to build up tagset
1.3 Corpora
1.3.1 VietTreeBank
1.3.2 VnQtag
1.4 Motivation
1.5 Organization of the thesis
CHAPTER 2: EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS IN VIETNAMESE
2.1 Tagset evaluation
2.1.1 Introduction
2.1.2 Tagset
2.1.3 A method for evaluating distributional properties of tagsets
2.1.3.1 Internal criterion
2.1.3.2 External benchmark
2.1.3.3 Algorithm
2.1.4 Result of tagset evaluation
2.2 Possibility of tagset convertibility
Result of tagset convertibility
CHAPTER 3: AUTOMATIC ERROR VERIFICATION OF POS-TAGGED CORPUS
3.1 Concepts related to the variation n-gram method
3.2 Types of Vietnamese tagging error
3.3 An algorithm for detecting errors
3.4 Classifying variations
3.5 Result of detecting errors in POS tagging
3.6 Word segmentation
3.6.1 Word in Vietnamese
3.6.2 N-gram in word segmentation
3.6.3 Result of detecting errors in word segmentation
CHAPTER 4: CONCLUSION AND SUMMARY
BIBLIOGRAPHY
APPENDIX
A.1 The Vietnamese treebank tagset
A.2 Vietnamese Tagset (VietTreeBank)
A.3 Tagset (25 tags)
A4 Tagset (40 tags)
A5 Syntax function tags in VTB
A6 Adverbial classification tag of verb in VTB
A7 Phrase tagset in VTB
A8 Clause tagset in VTB

LIST OF FIGURES
Figure 1.1 The features of Vietnamese type
Figure 2 Purity as an external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71
Figure 3 N-gram and variation nuclei in VTB corpus with n up to 29

LIST OF TABLES
Table 1. The expression of grammatical meaning in Vietnamese
Table 2. Corpus with VnQtag tagset annotation
Table 3. Principal differences between Vietnamese and English
Table 4. Some frames found in the corpus
Table 5. Result of the tagset evaluation method
Table 6. Some properties in tagset convertibility method in Hoangtube
Table 7. Statistics of ambiguous word types in the VnQtag corpus
Table 8. Statistics of ambiguous tokens in the VnQtag corpus
Table 9. Detailed statistics of ambiguous word types in the VnQtag corpus
Table 10. Statistics of errors in the corpus
Table 11. Details of n-grams in the tagged corpus
Table 12. The error and ambiguity statistics of the word segmentation algorithm
Table 13. Details of context and variation in the VTB corpus

CHAPTER 1: INTRODUCTION AND MOTIVATION

1.1 Characteristics of Vietnamese language
Every language in the world has its own features, and so does Vietnamese. To understand Vietnamese better, we list some of its prominent features and compare it with other languages such as Chinese and English. According to M. Ferlus and other researchers in Vietnam and abroad, Vietnamese is an indigenous language that belongs to the Austroasiatic (South Asian) language family, Mon-Khmer branch, and is closely related to the Muong language. Vietnamese is also an isolating language with three prominent features. Firstly, the syllable is the basic unit from which words and sentences are formed. A syllable may be a single word or an element used to compose a
complex word, a compound word, or a reduplicative word. Secondly, Vietnamese words are not inflected. In particular, there is no difference between singular and plural nouns; for example, “hai sách” (two books) and “một sách” (one book). Thirdly, grammatical meaning is expressed mainly through word order and expletives (function words). Given the expletives “sẽ”, “đã”, “không” and the sentence “Tôi ngồi”, we can make three sentences with different meanings: “Tôi sẽ ngồi”; “Tôi đã ngồi”; “Tôi không ngồi”.

The characteristics of Vietnamese:
- The syllable is the basic unit from which words and sentences are formed
- Vietnamese words are not inflected
- Grammatical meaning is expressed mainly through word order and expletives
Figure 1. The features of Vietnamese type

Some other languages are also isolating, such as Chinese and Thai, whereas English, French, and Russian are inflectional languages. The difference shows when we compare Vietnamese, Chinese, and English sentences.

Table 1. The expression of grammatical meaning in Vietnamese
Means | Vietnamese | Chinese | English
Word order | Tôi yêu anh | Wo ai ta | I love him
Word order | Anh yêu tôi | Ta ai wo | He loves me
Expletive | Tôi không yêu anh | Wo bu ai ta | I do not love him

Unlike in Vietnamese and Chinese, in the English sentences above a change in word order also changes the form of the pronoun: the object form becomes the subject form (him → he).

1.2 Vietnamese part of speech
1.2.1 Criteria to classify
In European languages, the notion of POS is tied to morphological categories such as gender, number, mood, and so on. In Vietnam, there are two views. Firstly, POS does not exist in Vietnamese because Vietnamese has no morphological modification (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung). Secondly, like European languages, Vietnamese also has POS, but classifying words into tags, or defining the POS of a word, must be based on certain criteria. So far, Vietnamese linguists have largely agreed on the following criteria (Diep Quang Ban, Hoang Van Thung, 2010):
a. General meaning: “The meaning of a POS is the
general meaning of a group of words, based on the generalization of vocabulary into a common grammatical category (lexical-grammatical category)”. POSs therefore correspond to classification categories. These are groups containing a very large number of words, where each group shares one classifying feature: object, quality, action or state, and so on. Therefore, nhà, bàn, chim, học sinh, con, quyển, sự, and so on are classified as nouns because their lexical meaning is generalized and abstracted as objects; their grammatical category is noun.
b. Combination ability: Given their general meaning, words can take part in meaningful combinations. Some words can replace each other in a certain position of a combination, while the rest of the combination provides the setting in which this replacement is possible. For example, nhà, bàn, chim, cát, and so on can appear in and replace each other in combinations of the type nhà này, chim này, cát này, etc., and are therefore classified as nouns.
c. Syntactic function: Words that, when taking part in a sentence, can stand in one or more certain positions, can replace each other in those positions, and express the same syntactic relation with the other parts of the sentence can be classified into one POS. For instance, words such as nhà, bàn, chim, cát are nouns: they may be the subjects of sentences, and the subject function is one of the syntactic functions used to classify them as nouns.

1.2.2 The ways to build up tagset
Currently, two kinds of POS tagsets have been developed, of which the first has received much more attention from linguistic researchers. The first kind is based on the basic POS tags that are widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection, and emotive word. From these basic tags, finer tagsets are built. Each researcher relies on certain criteria to refine the tagset (the criteria are discussed in Section 1.2.1). Notably, the VnQtag
tagset of Tran Thi Oanh contains 14 tags; VietTreeBank consists of 17 tags; VnQtag 59 tags (see appendix). The second kind is built by mapping a tagset from another language to Vietnamese, based on the association between the words of the two languages (Dinh Dien and Hoang Kiem, 2003).

1.3 Corpora
Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important role in current work in computational linguistics, so great attention has gone into developing them. Many countries have their own corpora. Some well-known corpora are the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997), and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). In Vietnam, the notable corpora are VnQtag, VnPos, and VTB. To build a corpus, some obligatory criteria must be met (McEnery and Wilson, 2001, p. 29):
- Sampling and representativeness: the elements of a corpus must be general, diverse, and plentiful. A sample is representative if what we find for the sample also holds for the general population.
- Finite size: the bigger a corpus is, the more highly it is valued, but its size is still finite.
- Machine-readable form.
- Standard reference.
Building a large corpus manually takes a lot of time because it requires extensive linguistic knowledge. Even then, the quality of a large, manually built corpus is not guaranteed. Therefore, this thesis looks for errors in such corpora so that they can be improved. The two corpora used in our experiments are VietTreeBank and VnQtag. Below, we discuss in more detail how these corpora were built.

1.3.1 VietTreeBank
VietTreeBank is the result of the national project VLSP and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen, and annotators). The corpus includes 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 Vietnamese sentences annotated with syntax (word
segmentation, POS tagging, syntactic structure). The group used the MEMs and CRFs machine learning models to assign POS tags. The accuracy of the models is over 93%. VTB was developed to support the building of programs for word segmentation, POS tagging, syntactic parsing, and so on. The VTB group chose two criteria to classify POS: combination ability and syntactic function. For instance, a noun can act as the subject or object of a sentence. Besides, a noun can combine with numerals (three, four) and determiners (each, every). A POS tag can contain information about the basic class of a word (noun, verb, adjective, and so on), morphological information (countable or uncountable), subcategory (verb taking a noun, verb taking a clause, etc.), semantic information, or other syntactic information. The VTB group built its tagset from the basic word classes only, without other information such as morphological information or subcategory (see the tagset in the appendix). In addition to POS information, the group describes basic syntactic elements such as phrases and clauses. Syntactic tags are the most fundamental information in a syntax tree; they form the spine of the tree. A7 and A8 in the appendix list the phrase and clause tagsets, respectively.

In sum, the syllable is a basic unit of Vietnamese grammatical composition. A syllable can have a full meaning, a faded meaning, or no meaning, and these cases can shift into one another. A “tiếng” may be:
- A word that contains one “tiếng” is called a single word, e.g. “nhà” (house), “bàn” (table), “tôi” (I).
- A word that contains two or more “tiếng” (mostly two) is called a complex word, e.g. “công ty” (company), “nghiên cứu” (studying), “công việc” (job).
For simplicity, we can consider “tiếng” as “Vietnamese morpheme” or “Vietnamese syllable”, or “syllable” for short.

3.6.2 N-gram in word segmentation
There are two main approaches to word segmentation: the rule-based approach and the statistics-based approach. Rule-based methods include maximum matching, longest matching, and greedy
matching. Statistics-based methods include HMM, SVM, CRFs, and ME. In Vietnam, several studies on word segmentation have been carried out. In 2001, Dinh Dien, Hoang Kiem, and Nguyen Van Toan proposed a new model that combines WFST and a neural network to segment Vietnamese words. The accuracy of the model is 98.36% on a 550-sentence corpus of science and technology texts and 94.67% on a 150-sentence corpus of novels. Nguyen Cam Tu, Nguyen Trung Kien, Phan Xuan Hieu, Nguyen Le Minh, and Ha Quang Thuy (2006) used two models, CRFs and SVMs, to solve this problem. The accuracy of the CRFs model is 93.76% and that of the SVMs model is 94%. In 2010, the VLSP national project combined a dictionary with n-grams to segment words with an accuracy of over 97%. The accuracy of these methods is rather high; however, better results are still needed. The variation n-gram method detects inconsistent word segmentation in a word-segmented corpus, so that the inconsistencies can be fixed to improve accuracy. Some basic notions differ from the n-gram notions in the previous section. In particular:
- N-gram: an n-gram is a contiguous sequence of n syllables from a given sequence of text or speech.
- Variation: a particular syllable that occurs more than once in a corpus can combine with the same neighbouring syllables to form different words. That is, the same string of syllables can be segmented into different words. We refer to this as a variation.
- Variation n-gram: if the same n-gram (of syllables) is found at different positions in a corpus and its occurrences contain a syllable that is segmented differently, then we call that n-gram a variation n-gram. The element that causes the variation is referred to as the variation nucleus.
For example, take the 5-gram “Hội Chữ_thập_đỏ” and “Hội_Chữ_thập_đỏ”: in the first case there are two words (“hội” and “chữ thập đỏ”) and in the second case there is one word (“hội chữ thập đỏ”). Therefore, this is a variation 5-gram with the nucleus “_chữ”.

3.6.3 Result of detecting errors in word segmentation
We ran the algorithm on the
VTB corpus. Following the same idea as in the detection of POS tagging errors, we computed the number of variation n-grams, variation nuclei, errors, and ambiguous words.

Table 12. The error and ambiguity statistics of the word segmentation algorithm
N-gram: gram gram gram gram gram 10 gram 11 gram 12 gram
Variation nuclei: 66 26 16
Errors / Ambiguous: 1 0 1 25 7 0 0

The table shows that the VTB corpus contains 0.01% word segmentation errors and 0.034% ambiguous words.

Table 13. Details of context and variation in the VTB corpus
N-gram | Context | Variation (nucleus)
5-gram | , phó chủ tịch xã | , phó_chủ_tịch xã (_chủ_) ; , phó chủ_tịch xã (chủ_)
5-gram | , trước mắt tôi | , trước_mắt (_mắt) ; , trước mắt (mắt)
5-gram | Cứ như thế , | Cứ như_thế , (_thế) ; Cứ như thế , (thế)
5-gram | Hội Chữ thập đỏ | Hội Chữ_thập_đỏ (Chữ_) ; Hội_Chữ_thập_đỏ (_Chữ_)
5-gram | Sau này , | Sau này , (này) ; Sau_này , (_này)
5-gram | Sáng hôm sau , | Sáng hôm_sau , (_sau) ; Sáng hôm sau , (sau)
5-gram | gật đầu đồng ý | gật đầu đồng_ý (đầu) ; gật_đầu đồng_ý (_đầu)
5-gram | tay vợt số một | tay_vợt số_một (_một) ; tay_vợt số một (một)
5-gram | thâu đêm suốt sáng , | thâu_đêm_suốt_sáng , (_sáng) ; thâu đêm suốt sáng , (sáng)
gram : 26 | sức lao động | _sức lao_động (lao_) ; sức_lao_động (_lao_)
gram : | chống buôn lậu xăng dầu qua biên giới | _chống buôn_lậu xăng_dầu qua biên_giới (_dầu) ; chống buôn_lậu xăng dầu qua biên_giới (dầu)
11 gram | cuốn hút khách du lịch từ khắp nơi thế giới | cuốn_hút khách du_lịch từ khắp nơi thế_giới (nơi) ; cuốn_hút khách du_lịch từ khắp_nơi thế_giới (_nơi)
12 gram | rẻ cuốn hút khách du lịch từ khắp nơi thế giới | rẻ cuốn_hút khách du_lịch từ khắp nơi thế_giới (khắp) ; rẻ cuốn_hút khách du_lịch từ khắp_nơi thế_giới (khắp_)

For the same context “phó chủ tịch xã”, we have two instances: “, phó_chủ_tịch xã” and “, phó chủ_tịch_xã”. The correct segmentation is “phó chủ_tịch_xã”.

CHAPTER 4: CONCLUSION AND SUMMARY

This thesis has investigated some problems related to tagset evaluation and automatic error detection in Vietnamese annotated corpora. Its results are as follows:
Method: we investigated the characteristics of Vietnamese
as well as the differences between Vietnamese and other languages, in particular English. We then took methods that have been put into practice for English and applied them to Vietnamese. Beyond that, we propose a new direction: using variation n-grams for the word segmentation problem. Here, a variation is interpreted as a difference between the ways a string can be segmented into words.
Experiment: based on the theory investigated, the thesis carried out a series of experiments. Firstly, by evaluating the properties of tagsets, we showed which tagsets are rated higher. In particular, these are VietTreeBank, basic tagset and tagset. Secondly, the thesis provided a multi-faceted view of Vietnamese tagsets: if one tagset is converted into another, some information is lost. Thirdly, we applied the variation n-gram algorithm to the error detection problem and obtained significant results: 0.107% errors were found in POS tagging and 0.02 errors in word segmentation.
Further research directions: due to limits in time and knowledge, some issues in this thesis remain to be solved in the future:
- Studying more deeply how to merge tags in Vietnamese tagsets. We expect this to improve precision. The work depends mainly on semantic criteria for classification.
- Changing the threshold used to find frames in the tagset evaluation algorithm, in order to obtain a more comprehensive view.
- Finding rules to convert one tagset into other tagsets.
- In this thesis, we found errors in POS tagging as well as in word segmentation; in the future, we will improve on this by studying rules or methods to correct them.

BIBLIOGRAPHY

Vietnamese
Diệp Quang Ban, Hoàng Văn Thung (2010). Ngữ pháp tiếng Việt. Nhà xuất giáo dục Việt Nam.
Mai Ngọc Chừ, Vũ Đức Nghiệu, Hoàng Trọng Phiến (2008). Cơ sở ngôn ngữ học tiếng Việt. Nhà xuất giáo dục 1997, tr 142–152.
Nguyễn Thị Minh Huyền, Vũ Xuân Lương, Lê Hồng Phương (2003). Sử dụng gán nhãn loại xác suất Qtag cho văn tiếng Việt. Hội thảo khoa
học quốc gia lần thứ nghiên cứu, phát triển ứng dụng công nghệ thông tin truyền thông, tr 271-281.
Bùi Minh Toán (chủ biên), Nguyễn Thị Lương (2008). Giáo trình ngữ pháp tiếng Việt. Nhà xuất đại học Sư phạm.
Nguyễn Kim Thản (1997). Nghiên cứu ngữ pháp Tiếng Việt. Nhà xuất giáo dục.
Nguyễn Phương Thái, Vũ Xuân Lương, Nguyễn Thị Minh Huyền (2008). Xây dựng Treebank tiếng Việt.
Nguyễn Phương Thái, Vũ Lương, Nguyễn Thị Minh Huyền, nhóm liệu (2007). Hướng dẫn gán nhãn từ loại. SP 7.3-VLSP.

English
Dinh Dien, Hoang Kiem (2003). “Building an Annotated English-Vietnamese Parallel Corpus for Training Vietnamese-related NLPs”.
Markus Dickinson (2005). Error detection and correction in annotated corpora. The Ohio State University. Alternately-formatted version of author's PhD thesis.
Markus Dickinson and Charles Jochim (2010). “Evaluating distributional properties of tagsets”. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), (2-9517408-6-7), pp. 19-21.
Dinh Dien, Hoang Kiem and Nguyen Van Toan (2001). “Vietnamese Word Segmentation”. Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium, November 27-30, 2001, Hitotsubashi Memorial Hall, National Center of Sciences, Tokyo, Japan, pp. 749-756.
Markus Dickinson and W. Detmar Meurers (2002). “Detecting errors in Part of speech annotation”. Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume EACL '03, (1-333-56789-0), pp. 107-114.
Daniel Jurafsky and James H. Martin (2010). Speech and Language Processing. Upper Saddle River, New Jersey 07458. Stanford University and University of Colorado at Boulder.
Markus Dickinson (2008). Representations for category disambiguation. Indiana University.
Cambridge University Press, evaluation of clustering (2009), pp. 356-360.
Johanna Geiss (2009). Gold Standard for Sentence clustering in Multi-document Summarization. Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pp101, Suntec, Singapore, August 2009. © 2009 ACL and AFNLP.
Kübler and Wagner (2000). Evaluating POS tagging under Sub-Optimal Conditions or: Does meticulousness pay? University of Tübingen.
Dickinson, Markus and Charles Jochim (2008). “A simple method for Tagset Comparison”. In Proceedings of LREC 2008. Marrakech, Morocco.
http://www.loria.fr/~lehong/projects/vntreebank/tagset.utf8.html
http://trac.sketchengine.co.uk/wiki/tagsets/vietnamese
http://www.myreaders.info/10_Natural_Language_Processing.pdf
http://wac.colostate.edu/books/sound/chapter6.pdf
http://ezinearticles.com/?Recognizing-the-Four-Major-Parts-of-Speech&id=986159
Hervé Déjean (2000). How to Evaluate and Compare Tagsets? A Proposal. Seminar für Sprachwissenschaft, Universität Tübingen.
Zeman, Daniel (2010). Hard Problems of Tagset Conversion. ICGL 2010: Proceedings of the Second International Conference on Global Interoperability for Language Resources, Hong Kong, China (978-962-442-323-5), pp. 181-185.
Dinh Dien, Hoang Kiem. POS-Tagger for English-Vietnamese Bilingual corpus. In HLT-NAACL 2003 Workshop: Building and Using Parallel texts data driven machine translation and beyond, pp. 88-95, Edmonton, May-June 2003.
Dzeroski Saso, Erjavec Tomaz and Zavrel Jakub. Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000, pp. 1099–1104.
Daniel Zeman (2008). “Reusable Tagset Conversion Using Tagset Drivers”. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - June 2008, Marrakech, Morocco.
Cam-Tu Nguyen, Trung-Kien Nguyen, Xuan-Hieu Phan, Le-Minh Nguyen and Quang-Thuy Ha (2006). “Vietnamese Word Segmentation with CRFs and SVMs: An Investigation”. Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation (PACLIC20): 215-222, China.
Nguyen Thanh Bon, Nguyen Thi Minh Huyen, Romary Laurent and Vu Xuan Luong (2004). “Developing
Tools and Building Linguistic Resources for Vietnamese Morpho-Syntactic Processing”. 4th International Conference on Language Resources and Evaluation - LREC'04.

APPENDIX

A.1 The Vietnamese treebank tagset
The tagset contains 59 part-of-speech tags, which are distributed into classes, and 10 tags for punctuation and symbols.
Id | POS | English | Vietnamese
1 | Nc | Countable noun | Danh từ đơn thể
2 | Vitc | Comparative intransitive verb | Động từ nội động so sánh
3 | Np | Proper noun | Danh từ riêng
4 | Vitm | Moving intransitive verb | Động từ nội động chuyển động
5 | Ng | Collective noun | Danh từ tổng thể
6 | Pd | Time and space pronoun | Đại từ không gian, thời gian
7 | Nt | Classifier noun | Danh từ loại thể
8 | Pn | Quantity pronoun | Đại từ số lượng
9 | Nu | Concrete noun | Danh từ đơn vị
10 | Pi | Interrogative pronoun | Đại từ nghi vấn
11 | Na | Abstract noun | Danh từ trừu tượng
12 | Pp | Personal pronoun | Đại từ xưng hô
13 | Nn | Numeral | Danh từ số lượng
14 | Pa | Quality pronoun | Đại từ hoạt động, tính chất
15 | Nl | Locative noun | Danh từ vị trí
16 | An | Quantity adjective | Tính từ hàm lượng
17 | Vt | Transitive verb | Động từ ngoại động
18 | Aa | Quality adjective | Tính từ hàm chất
19 | Vit | Intransitive verb | Động từ nội động
20 | Jt | Time adverb | Phụ từ thời gian
21 | Vim | Impression verb | Động từ cảm nghĩ
22 | Jd | Degree adverb | Phụ từ mức độ
23 | Vo | Orientation verb | Động từ phương hướng
24 | Jr | Comparative adverb | Phụ từ so sánh
25 | Vs | State verb | Động từ tồn
26 | Ja | Negation or acceptation adverb | Phụ từ khẳng định
27 | Vb | Transformation verb | Động từ biến hoá
28 | Ji | Imperative adverb | Phụ từ mệnh lệnh
29 | Va | Acceptation verb | Động từ tiếp thụ
30 | Cm | Major/minor conjunction | Liên từ phụ
31 | Vc | Comparative verb | Động từ so sánh
32 | Cc | Combination conjunction | Liên từ liên hợp
33 | Vla | Verb 'là' | Động từ 'là'
34 | I | Introductory word | Trợ từ
35 | Vm | Moving verb | Động từ chuyển động
36 | E | Emotivity word | Cảm từ
37 | Vv | Volitive verb | Động từ ý chí
38 | X | Unknown/Uncertain | Không xác định
39 | Vtim | Impression transitive verb | Động từ ngoại động cảm nghĩ
40 | # | Pound sign | Dấu thăng
41 | Vta | Acceptation transitive verb | Động từ ngoại động tiếp thụ
42 | $ | Dollar sign | Dấu đô-la
43 | Vtb | Transformation transitive verb | Động từ ngoại động biến hóa
44 | . | Sentence-final punctuation | Dấu chấm hết câu
45 | Vtc | Comparative transitive verb | Động từ ngoại động so sánh
46 | , | Comma | Dấu phẩy
47 | Vto | Orientation transitive verb | Động từ ngoại động hướng
48 | : | Colon | Dấu hai chấm
49 | Vts | State transitive verb | Động từ ngoại động tồn
50 | ; | Semi-colon | Dấu chấm phảy
51 | Vtm | Moving transitive verb | Động từ ngoại động chuyển động
52 | ( | Left bracket character | Dấu mở ngoặc đơn trái
53 | Vtv | Volitive transitive verb | Động từ ngoại động ý chí
54 | ) | Right bracket character | Dấu đóng ngoặc đơn phải
55 | Vitim | Impression intransitive verb | Động từ nội động cảm nghĩ
56 | ' | Single quote | Dấu nháy đơn
57 | Vitb | Transformation intransitive verb | Động từ nội động biến hóa
58 | " | Double quote | Dấu nháy kép
59 | Vits | State intransitive verb | Động từ nội động tồn

A.2 Vietnamese Tagset (VietTreeBank)
Id | POS | English | Vietnamese
1 | Np | Proper noun | Danh từ riêng
2 | Nc | Classifier | Danh từ loại
3 | Nu | Unit noun | Danh từ đơn vị
4 | N | Other noun | Danh từ khác
5 | V | Verb | Động từ
6 | A | Adjective | Tính từ
7 | P | Pronoun | Đại từ
8 | L | Determiner (e.g. mot, nhung, cac) | Định từ
9 | M | Numeral | Số từ
10 | R | Adverb | Phó từ
11 | E | Preposition | Giới từ (Liên kết phụ)
12 | C | Conjunction | Liên kết từ (Liên kết đẳng lập)
13 | I | Interjection | Thán từ
14 | T | Particle | Trợ từ, tình thái từ (tiểu từ)
15 | U | Bound morpheme | Từ tiếng nước ngoài
16 | Y | Abbreviation | Từ viết tắt
17 | X | Unknown | Các từ không phân loại
18 | Symbol | Symbol | Các ký hiệu đặc biệt khác (? / # $)

A.3 Tagset (25 tags)
Id | POS | English | Vietnamese
1 | Nc/Ng/Na | Countable noun, collective noun, abstract noun | Danh từ đơn thể, danh từ tổng thể, danh từ trừu tượng
2 | An | Quantity adjective | Tính từ hàm lượng
3 | Np | Proper noun | Danh từ riêng
4 | Aa | Quality adjective | Tính từ hàm chất
5 | Nt | Classifier noun | Danh từ loại thể
6 | Jt | Time adverb | Phụ từ thời gian
7 | Nu/Nn | Concrete noun, numeral noun | Danh từ đơn vị, danh từ số lượng
8 | Jd | Degree adverb | Phụ từ mức độ
9 | Nl | Locative noun | Danh từ vị trí
10 | Jr | Comparative adverb | Phụ từ so sánh
11 | Vt (Vt/Vtim/Vta/Vtb/Vtc/Vto/Vts/Vtm/Vtv) | Transitive verb | Động từ ngoại động
12 | Ja | Negation or acceptation adverb | Phụ từ khẳng định
13 | Vit (Vit/Vitim/Vitb/Vits/Vitc/Vitm) | Intransitive verb | Động từ nội động
14 | Ji | Imperative adverb | Phụ từ mệnh lệnh
15 | Vr (Vim/Vo/Vs/Vb/Vv/Va/Vc/Vla) | - | Động từ lại
16 | Cm | Major/minor conjunction | Liên từ phụ
17 | Cc | Combination conjunction | Liên từ liên hợp
18 | Pd | Time and space pronoun | Đại từ không gian, thời gian
19 | I | Introductory word | Trợ từ
20 | Pn | Quantity pronoun | Đại từ số lượng
21 | E | Emotivity word | Cảm từ
22 | Pi | Interrogative pronoun | Đại từ nghi vấn
23 | X | Unknown/Uncertain | Không xác định
24 | Pp | Personal pronoun | Đại từ xưng hô
25 | Pa | Quality pronoun | Đại từ hoạt động, tính chất

A4 Tagset (40 tags)
Id | POS | English | Vietnamese
1 | Nc | Countable noun | Danh từ đơn thể
2 | Vitb | Transformation intransitive verb | Động từ nội động biến hóa
3 | Np/Nt/Nn | Proper noun, classifier noun, numeral | Danh từ riêng, danh từ loại thể, danh từ số lượng
4 | Vits | State intransitive verb | Động từ nội động tồn
5 | Ng | Collective noun | Danh từ tổng thể
6 | Vitc | Comparative intransitive verb | Động từ nội động so sánh
7 | Nu | Concrete noun | Danh từ đơn vị
8 | Vitm | Moving intransitive verb | Động từ nội động chuyển động
9 | Na | Abstract noun | Danh từ trừu tượng
10 | Pd | Time and space pronoun | Đại từ không gian, thời gian
11 | Nl | Locative noun | Danh từ vị trí
12 | Pi | Interrogative pronoun | Đại từ nghi vấn
13 | Vt/Vit/Vc/Vla/Vv/Vta | Transitive verb / Intransitive verb / Comparative verb / Verb 'là' / Volitive verb / Acceptation transitive verb | Động từ ngoại động / Động từ nội động / Động từ so sánh / Động từ 'là' / Động từ ý chí / Động từ ngoại động tiếp thụ
14 | Pp | Personal pronoun | Đại từ xưng hô
15 | Vim | Impression verb | Động từ cảm nghĩ
16 | Pa | Quality pronoun | Đại từ hoạt động, tính chất
17 | Vo | Orientation verb | Động từ phương hướng
18 | An | Quantity adjective | Tính từ hàm lượng
19 | Vs | State verb | Động từ tồn
20 | Aa | Quality adjective | Tính từ hàm chất
21 | Vb | Transformation verb | Động từ biến hoá
22 | Jt | Time adverb | Phụ từ thời gian
23 | Va | Acceptation verb | Động từ tiếp thụ
24 | Jd | Degree adverb | Phụ từ mức độ
25 | Vm | Moving verb | Động từ chuyển động
26 | Jr | Comparative adverb | Phụ từ so sánh
27 | Vtim | Impression transitive verb | Động từ ngoại động cảm nghĩ
28 | Ja | Negation or acceptation adverb | Phụ từ khẳng định
29 | Vtb | Transformation transitive verb | Động từ ngoại động biến hóa
30 | Ji | Imperative adverb | Phụ từ mệnh lệnh
31 | Vtc | Comparative transitive verb | Động từ ngoại động so sánh
32 | Cm | Major/minor conjunction | Liên từ phụ
33 | Vto | Orientation transitive verb | Động từ ngoại động hướng
34 | Cc | Combination conjunction | Liên từ liên hợp
35 | Vts | State transitive verb | Động từ ngoại động tồn
36 | I | Introductory word | Trợ từ
37 | Vtm | Moving transitive verb | Động từ ngoại động chuyển động
38 | E | Emotivity word | Cảm từ
39 | Vtv | Volitive transitive verb | Động từ ngoại động ý chí
40 | Vitim | Impression intransitive verb | Động từ nội động cảm nghĩ

A5 Syntax function tags in VTB
ID | Name | Explanation
1 | H | The head element of a phrase
2 | SUB | Subject function label
3 | DOB | Direct object function label
4 | IOB | Indirect object function label
5 | TPC | Topic function label
6 | PRD | Predicate function label (not a verb phrase)
7 | LGS | Logical subject function label of a passive sentence
8 | EXT | Complement function label expressing the range or frequency of an action
9 | VOC | Vocative component function label

A6 Adverbial classification tag of verb in VTB
Id | Name | Explanation
1 | TMP | Adverbial function label expressing time
2 | LOC | Adverbial function label expressing location
3 | DIR | Adverbial function label expressing direction
4 | MNR | Adverbial function label expressing manner
5 | PRP | Adverbial function label expressing purpose or reason
6 | CND | Adverbial function label expressing condition
7 | CNC | Adverbial function label expressing concession
8 | ADV | Adverb function label (the remaining situations)

A7 Phrase tagset in VTB
ID | Name | Explanation
1 | NP | Noun phrase
2 | VP | Verb phrase
3 | AP | Adjective phrase
4 | RP | Adjunct phrase
5 | PP | Preposition phrase
6 | QP | Quantity phrase
7 | MDP | Morphological phrase
8 | UCP | Phrase consisting of two or more components of different categories joined by a coordinating conjunction
9 | LST | Phrase marking the items of a list
10 | WHNP | Question noun phrase (who, what, and so on)
11 | WHAP | Question adjective phrase (how cold, how beautiful, and so on)
12 | WHRP | Question phrase used to ask about time, place, and so on
13 | WHPP | Question phrase (how)

A8 Clause tagset in VTB
ID | Name | Explanation
1 | S | Statement (indicative or negative sentence)
2 | SQ | Question sentence
3 | S-EXC | Exclamatory sentence
4 | S-CMD | Command sentence
5 | SBAR | Subordinate clause (modifier for noun, verb and adjective)
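
As an illustration of the variation n-gram notions defined in Section 3.6.2, the following is a minimal Python sketch. It is not the implementation used in this thesis. The function names syllables_with_boundaries and variation_ngrams are hypothetical, the corpus is assumed to be a list of segmented sentences in which “_” joins the syllables of one word, and only the word boundaries inside an n-gram are compared; the handling of surrounding context and the classification of variations into errors and ambiguities are left out. The sample strings are taken from Table 13.

```python
from collections import defaultdict


def syllables_with_boundaries(sentence):
    """Split a segmented sentence into its syllables and, for every gap
    between two adjacent syllables, a flag that is True when the gap is
    word-internal (written "_") and False when it is a word boundary."""
    syllables, joined = [], []
    for word in sentence.split():
        # Drop empty pieces caused by stray leading/trailing underscores.
        parts = [p for p in word.split("_") if p]
        for i, syllable in enumerate(parts):
            if syllables:
                joined.append(i > 0)
            syllables.append(syllable)
    return syllables, joined


def variation_ngrams(sentences, n):
    """Return the syllable n-grams that occur with more than one
    segmentation, mapped to the set of boundary patterns observed."""
    seen = defaultdict(set)
    for sentence in sentences:
        syllables, joined = syllables_with_boundaries(sentence)
        for start in range(len(syllables) - n + 1):
            gram = tuple(syllables[start:start + n])
            pattern = tuple(joined[start:start + n - 1])
            seen[gram].add(pattern)
    return {gram: pats for gram, pats in seen.items() if len(pats) > 1}


# Assumed toy corpus: two pairs of inconsistent segmentations from Table 13.
corpus = [
    "Hội Chữ_thập_đỏ",
    "Hội_Chữ_thập_đỏ",
    "gật đầu đồng_ý",
    "gật_đầu đồng_ý",
]

for gram, patterns in variation_ngrams(corpus, 4).items():
    print(" ".join(gram), sorted(patterns))
```

In this sketch, each reported n-gram comes with the set of boundary patterns seen for it. The positions where the patterns differ correspond to the variation nucleus, and a human annotator (or a further heuristic) would still have to decide whether a variation is an error or a genuine ambiguity.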