Extraction of Vietnamese collocation from text corpora

Towards a framework for building an Annotated Named Entities Corpus

Hoang Huu Son
Faculty of Information Technology
University of Technology and Engineering
Vietnam National University, Hanoi

Supervised by Doctor Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

June, 2010

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at Coltech or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at Coltech lab or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed

Table of Contents

1 Introduction
1.1 Overview of Named Entity Recognition (NER)
1.2 NER approach (1.2.1 to 1.2.3)
1.3 Thesis contribution
1.4 Thesis structure
2 Related Work
2.1 Overview of our problem
2.2 Building NER corpus research
2.3 Research on the corpus building process
2.4 Overview of annotation tools
2.5 Summary
3 Corpus building process
3.1 Corpus building process (3.1.1 to 3.1.4)
3.2 Building Vietnamese NER corpus by off-l (3.2.1 to 3.2.3)
3.3 Discus about Vietnamese NER corpus bu
3.4 Conclusion
4 Online Annotation Framework
4.1 Introduction
4.2 Training section
4.3 Annotation documents (4.3.1 to 4.3.3)
4.4 Quality control (4.4.1 to 4.4.3)
4.5 Conclusion
5 Evaluation
5.1 Introduction
5.2 Corpus evaluation (5.2.1 to 5.2.3)
5.3 Time costing (5.3.1 to 5.3.3)
5.4 Named entity recognition system (5.4.1 to 5.4.4)
5.5 Summary
6 Conclusion and Future work
6.1 Conclusion
6.2 Future work (6.2.1 to 6.2.3)
A Name Entity guideline
A.1 Basic concepts (A.1.1 Entity and Entity Name, A.1.2 to A.1.4)
A.2 Entity classification (A.2.1 to A.2.3, A.2.4 Facility, A.2.5)

List of Figures

3.1 Process of building the annotation guideline
3.2 Callisto formatting
3.3 Callisto interface
3.4 Comparing two user corpora
4.1 Online annotation process
4.2 Online annotation tool interface
4.3 Annotation guideline form interface
4.4 Review tool interface
4.5 Compare two documents interface
5.1 Inter-annotation agreement results of two users
5.2 Accuracy rate for each entity kind
5.3 Online corpus accuracy rate for each entity kind
5.4 Named entity recognition system architecture
5.5 JAPE rule to recognize Person entity
5.6 Performance on the training data using strict criteria
5.7 Performance on the test data using strict criteria
5.8 Performance on the test data using lenient criteria

List of Tables

5.1 An example of par corpus annotated by two users (User A, User B)
5.2 Frequency of annotated documents
5.3 Inter-annotation agreements in online annotation
5.4 User corpus accuracy rate in the online method
5.5 Time spent on quality control of the corpus
5.6 Time spent during the annotation process
5.7 Quality control time in the online framework

Chapter 1 Introduction

1.1 Overview of Named Entity Recognition (NER)

The ability to determine the named entities in a text has been established as an important task for several natural language processing areas, including information retrieval, machine translation, information extraction and language understanding. The term “Named Entity”, widely used in Natural Language Processing (NLP), was coined for the Sixth Message Understanding Conference (MUC-6). At the time, MUC was focusing on Information Extraction (IE) tasks where structured information of computer activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining these tasks, people
noticed that it is essential to recognize information units such as names, including person, organization and location names, and numeric expressions, including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called “Named Entity Recognition and Classification”.

The computational research aiming at automatically identifying named entities in texts forms a vast and heterogeneous pool of strategies, methods and representations. One of the first research papers in the field was presented by Lisa F. Rau (1991) at the Seventh IEEE Conference on Artificial Intelligence Applications. In general, each NER study has to address four problems: language, input format, kind of entity, and learning method.

Languages: NER has been applied to several languages. English NER is well studied, and much of that work addresses language independence and multilingualism problems. German is well studied in CONLL-2003 and in earlier works. Similarly, Spanish and Dutch are strongly represented, boosted by a major devoted conference: CONLL-2002 (Collins, 2002). Chinese is studied in several works (Wang et al., 1992), (Computer et al., 1996), (Yu et al., 1998) and so are French (Petasis et al., 2001), (And, 2003), Greek (Karkaletsis et al., 1999) and Italian (Black et al., 1998), (Cucchiarelli & Velardi, ). Many other languages received some attention as well: Basque (Whitelaw & Patrick, 2003), Bulgarian (Silva et al., 2004), Catalan (Carreras et al., 2003), Hindi (Cucerzan & Yarowsky, 1999), Romanian (Cucerzan & Yarowsky, 1999), Swedish (Kokkinakis, 1998) and Turkish (Cucerzan & Yarowsky, 1999). Portuguese was examined by (Palmer et al., 1997). For Vietnamese, some NER research has been carried out, for example the VN-KIM IE system (Nguyen & Cao, 2007).

Input format: NER research has been applied to many document formats: general text, email, scientific
text, journalistic text, etc., and to many domains: sport, business, literature, etc. Each system usually targets a specific format and domain. (Maynard et al., 2001) designed a system for email, scientific texts and religious texts. (Minkov & Wang, 2005) created a system specifically designed for email documents. Nowadays, studies ...

Chapter 6 Conclusion and Future work

6.1 Conclusion

The main target of this thesis is to propose solutions for building a corpus for the named entity recognition (NER) domain, and to build an NER system that uses our NER corpus for training and testing. The thesis includes five chapters:

• Chapter one: Introduction: Overview of NER research and some approaches to building an NER system. We also state the problem.
• Chapter two: Related work: Overview of research around the world on building NLP corpora in general and NER corpora in particular, from which we position the direction of our study.
• Chapter three: Corpus building process: Describes a process for building a general corpus. We then apply it to build a Vietnamese corpus with offline tools.
• Chapter four: Online corpus framework: Based on the corpus building process, we build an online framework for annotation. It overcomes the disadvantages of offline tools.
• Chapter five: Evaluation: Presents our experiments and evaluates the results, and describes our NER system, which uses the corpus we built.

The contributions of the thesis include:

• We propose a corpus building process, which includes three steps: building the annotation guideline, annotating documents, and quality control.
• We apply the process to build an NER corpus with the offline tools method. The offline tools method is a manual approach that uses desktop programs; in the thesis we used Callisto to annotate. We built tools for quality control at two levels: document level and corpus level.
• We build an online annotation framework to overcome the disadvantages of the offline method. The online framework has the following features:
– Annotation is carried out over the Internet (annotate anytime, anywhere)
– Automate all steps in the process:
manage files, distribute them to annotators, etc.
– Enable a larger number of annotators to join
– Quality control of the corpus at many levels
• Our online annotation framework is not only applied to building an NER corpus; it can also be used to build other corpora, such as noun phrase and verb phrase annotation.
• We have built a Vietnamese NER system. The system uses the corpus as training set and test set.

However, our online annotation framework has some disadvantages:

• The process still includes some manual steps. Instead of directly annotating raw documents, we could first annotate the documents roughly so that performance is higher.
• The tool interface is not friendly; it sometimes causes trouble for annotators and supervisors.
• Because of limited time, our corpus includes seventy documents, about 1200 sentences and over 20000 words; it is not yet complete enough for an NER system.
• etc.

So in the future we have a lot of work to do to improve our study. We list this work in the next section, Future work.

6.2 Future work

Although our research has produced some good results, more work is needed to improve them. This includes: creating a bigger and higher-quality corpus; integrating existing tools and adding new ones; and building more accurate NER systems.

6.2.1 Create a bigger and higher-quality corpus

Our corpus includes 66 documents (offline corpus) and 44 documents (online corpus), so it is quite small; we need to annotate more documents. In the future, with the online tools, many people can join to annotate and supervise, so enlarging the corpus will not be difficult.

6.2.2 Improve the online annotation framework

In the future, we intend to develop the online annotation framework further. The following work would improve the framework:

• Improve the framework interfaces to make the tools friendly and convenient
• Add more levels of corpus quality control: document level, corpus level, error explanation
• Add options that allow the system to annotate not only for the named entity domain
but also for many other annotation problems, for example verb phrase and noun phrase annotation, etc.
• etc.

6.2.3 Building a statistical NER system

In our research, we only built a rule-based NER system. Although it gives good results, the system needs improvement: based on the corpus, we can build a statistical NER system to improve accuracy, and furthermore we will combine rule-based and machine learning methods for identifying named entities.

In conclusion, there is still a lot of work to do in this domain, so the work will continue. We expect the system to mature in the near future and to become a useful tool to support NER studies in particular and NLP in general.

Appendix A Name Entity guideline

A.1 Basic concepts

A.1.1 Entity and Entity Name

• Entity: an entity is an object or a set of objects in the natural world
• Mention: an instance of an entity

A.1.2 Instance of entity

• Proper name: named entity
• Noun or noun phrase
• Pronoun

A.1.3 List of Entities

• Person: a person's proper name
• Organization: proper names of entities established with a certain hierarchical structure
• Facility: proper names of constructions and architectural entities built by people, such as stadiums, museums and stations
• Location: proper names of geographical entities
• Religion: proper names of religious organizations

A.1.4 Entity recognition rules

• No nested names. A new name is recognized only when the old name has ended.
• When names are nested, only the longest name is accepted (longest matching). For example:
Thành phố Hồ Chí Minh đẹp
Ho Chi Minh city is very beautiful
Do not recognize the nested name Hồ Chí Minh separately; the longest name is Thành phố Hồ Chí Minh.

A.2 Entity classification

A.2.1 Person

Includes full names and short names (first name or last name). For example:
thủ tướng Nguyễn Tấn Dũng
Prime minister Nguyen Tan Dung
chủ tịch Hồ Chí Minh
President Ho Chi Minh

Note: the following are not Person, because they are phrases that denote a person indirectly:
chủ tịch nước Việt Nam
Vietnam president
bóng vàng Việt Nam 2008
Golden Ball 2008

Identification

• Prefix
– Vocative prefix:
Ông Nguyễn Minh Triết
Mr Nguyen Minh Triet
Cô Lý
Mrs Ly
– Special cases, for example: Bà Trương; bà Triệu
– Family relation prefix:
Dì Ninh; Chú Diệu; Anh Bắc
Auntie Ninh, Uncle Dieu, Cousin Bac
– Political and social status prefix:
Thủ tướng Nguyễn Tấn Dũng, Giám đốc Giang
Prime minister Nguyen Tan Dung, Director Giang
• Suffix: the word after a person is usually an active verb, for example play, cry, smile

A.2.2 Organization

• Political and government organizations
Văn phòng Chính phủ, Công an thành phố Hà Nội
Government Office, Hanoi police
• Economic organizations
Công ty trách nhiệm hữu hạn Tân Hòa Phát, Tập đoàn FPT
Tan Hoang Phat Co., Ltd., FPT Corporation
• Education organizations
Trường Đại học Công nghệ, Học viện Ngân hàng
University of Technology, Academy of Banking
• Medical organizations
Bệnh viện Bạch Mai
Bach Mai Hospital
• Other organizations
Hội chữ thập đỏ
Red Cross

Identification: an organization name usually follows one of these prefixes: Company, Corporation, School, Hospital

A.2.3 Location

A location is a geographical entity such as a territory, place, river or stream.

• Names of cities, counties, districts and roads, which are administrative units created by people
Thành phố Hồ Chí Minh, Quận Tây Hồ
Ho Chi Minh City, Tay Ho District
However, in cases such as: tiểu khu 8, quận (8 sub-area, district), they are locations
• Islands, oceans, seas (natural locations)
Sông Hồng, Đảo Bạch Long Vĩ, Châu Á
Hong river, Bach Long Vi Island, Asia
• Nations and nationalities
Cô hướng dẫn viên du lịch người Hoa
Chinese tour guide
Việt Nam thành viên Asean
Vietnam is a member of Asean

Identification: a location is usually preceded by a prefix such as: in, stay, out, etc.

A.2.4 Facility

Facilities are things built by people; in general they are buildings and architectural structures such as stadiums, museums and stations. For example:
Tòa nhà HITC xây
HITC building is being re-built

A.2.5
Religion

Names of religious organizations such as Buddhism, Christianity.
Tôi người theo đạo Phật
I am a Buddhist

Bibliography

Adam Przepiorkowski, Rafal L. Gorski, B. L.-T., & Lazinski, M. (2008). Towards the national corpus of Polish. Proceedings of the Sixth International Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/

And, T. P. (2003). The multilingual named entity recognition framework.

Asif Ekbal, S. B. (2008). Development of Bengali named entity tagged corpus and its use in NER systems. The 6th Workshop on Asian Language Resources, 2008.

Bermingham, A., & Smeaton, A. F. (2007). A study of inter-annotator agreement for opinion retrieval.

Black, W., Rinaldi, F., & Mowatt, D. (1998). FACILE: Description of the NE system used for MUC-7. In Proceedings of the 7th Message Understanding Conference.

Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R. (1998). NYU: Description of the MENE named entity system as used in MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7).

Carreras, X., Marquez, L., & Padro, L. (2003). Named entity recognition for Catalan using Spanish resources. In Proceedings of EACL’03.

Collins, M. (2002). Ranking algorithms for named entity extraction: Boosting and the voted perceptron. Association for Computational Linguistics.

Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 100–110).

Computer, D. O., hsi Chen, H., & chang Lee, J. (1996). Identification and classification of proper nouns in Chinese texts. Proceedings of 16th International Conference on Computational Linguistics (pp. 222–229).

Cucchiarelli, A., & Velardi, P. Unsupervised named entity recognition using syntactic and semantic contextual evidence.

Cucerzan, S., & Yarowsky, D. (1999). Language independent named entity recognition combining morphological and contextual evidence (pp. 90–99).

Disambiguation, W. S. (2008). A case study on inter-annotator agreement for word sense disambiguation.

Evi Marzelou, Maria Zourari, V. G., & Piperidis, S. (2008). Building a Greek corpus for textual entailment. Proceedings of the Sixth International Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/

Karkaletsis, V., Paliouras, G., Petasis, G., Manousopoulou, N., & Spyropoulos, C. D. (1999). Named-entity recognition from Greek and English texts. Journal of Intelligent and Robotic Systems, 26, 123–135.

Kokkinakis, D. (1998). AVENTINUS, GATE and Swedish Lingware. Proceedings of the 11th NODALIDA Conference (pp. 22–33). Copenhagen.

Kravalová, J., & Žabokrtský, Z. (2009). Czech named entity corpus and SVM-based recognizer. NEWS ’09: Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (pp. 194–201). Morristown, NJ, USA: Association for Computational Linguistics.

Maynard, D., Tablan, V., Ursu, C., Cunningham, H., & Wilks, Y. (2001). Named entity recognition from diverse text types. In Recent Advances in Natural Language Processing 2001 Conference, Tzigov Chark.

Minkov, E., & Wang, R. C. (2005). Extracting personal names from emails: Applying named entity recognition to informal text. In HLT-EMNLP.

Nelson, K. P., & Edwards, D. (2007). Population-based measures of agreement.

Nguyen, T.-V. T., & Cao, T. H. (2007). VN-KIM IE: automatic extraction of Vietnamese named-entities on the web. New Gen. Comput., 25, 277–292.

Palmer, D., Palmer, D. D., & Day, D. S. (1997). A statistical profile of the named entity task. Proc. ACL Conference for Applied Natural Language Processing (pp. 190–193).

Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., & Spyropoulos, C. D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. Proc. Conference of Association for Computational Linguistics (pp. 426–433).

Pham, D. D., Tran, G. B., & Pham, S. B. (2009). A hybrid approach to Vietnamese word segmentation using part of speech tags. Knowledge and Systems Engineering, International Conference on, 0, 154–161.

Ruifeng Xu, Yunqing Xia, K.-F. W., & Li, W. (2008). Opinion annotation in on-line Chinese product reviews. Proceedings of the Sixth International Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/

Silva, J. F. F. D., Kozareva, Z., Gabriel, J., & Lopes, P. (2004). Cluster analysis and classification of named entities. Proc. Conference on Language Resources and Evaluation.

Strassel, S. (2006). Simple named entity guidelines v6.4.

Wang, L.-J., Chang, H., Chao, & huang Chang, C. (1992). Recognizing unregistered names for Mandarin word identification. Proc. of COLING92 (pp. 1239–1243). COLING.

Whitelaw, C., & Patrick, J. (2003). Evaluating corpora for named entity recognition using character-level features. In (Whitelaw & Patrick, 2003), 910–921.

Yu, S., Bai, S., & Wu, P. (1998). Description of the Kent Ridge Digital Labs system used for MUC-7. In Proceedings of the MUC-7.
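
Illustrative sketch for guideline A.1.4

The longest-matching rule in Appendix A.1.4 (no nested names; when candidate names overlap, keep only the longest) can be made concrete with a few lines of Python. This is only a minimal sketch under stated assumptions, not the thesis's implementation, which is based on JAPE rules. The function name resolve_nested, the (start, end, type) span representation and the two candidate spans are hypothetical and introduced here purely for illustration.

```python
from typing import List, Tuple

# A candidate mention: (start offset, end offset (exclusive), entity type).
# This representation is an assumption made for the sketch.
Span = Tuple[int, int, str]


def resolve_nested(candidates: List[Span]) -> List[Span]:
    """Keep only the longest span among overlapping candidates.

    Hypothetical helper illustrating guideline A.1.4: names are never
    nested, and when candidates overlap only the longest one is kept.
    """
    # Visit the longest candidates first; ties go to the earlier start.
    ordered = sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, label in ordered:
        # Two half-open spans overlap if each starts before the other ends.
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    # Return the surviving spans in document order.
    return sorted(kept)


text = "Thành phố Hồ Chí Minh đẹp"
city = "Thành phố Hồ Chí Minh"
person = "Hồ Chí Minh"

# Offsets are computed from the text rather than hard-coded.
c0 = text.find(city)
p0 = text.find(person)
candidates = [
    (c0, c0 + len(city), "Location"),   # the full place name
    (p0, p0 + len(person), "Person"),   # the nested personal name
]

for start, end, label in resolve_nested(candidates):
    print(label, text[start:end])
```

With the example sentence from A.1.4, only the Location span covering the full place name survives, and the nested Person candidate is dropped. Visiting candidates longest-first is one simple way to realize the rule; other tie-breaking choices are possible.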