Towards a framework for building an annotated named entities corpus

Hồng Hữu Sơn
University of Technology (Trường Đại học Công nghệ)
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Assoc. Prof. Dr. Phạm Bảo Sơn
Year of defense: 2010

Keywords: Information networks; Information technology; Natural language; Artificial intelligence

Table of Contents

1 Introduction
  1.1 Overview of named entity recognition (NER)
  1.2 NER approaches
    1.2.1 Rule-based approach
    1.2.2 Machine learning approach
    1.2.3 Comparison
  1.3 Thesis contribution
  1.4 Thesis structure

2 Related Work
  2.1 Overview of our problem
  2.2 Research on building NER corpora
  2.3 Research on the corpus building process
  2.4 Overview of annotation tools
  2.5 Summary

3 Corpus building process
  3.1 Corpus building process
    3.1.1 Objective
    3.1.2 Building the annotation guideline
    3.1.3 Annotating documents
    3.1.4 Quality control
  3.2 Building a Vietnamese NER corpus with offline tools
    3.2.1 Building the annotation guideline
    3.2.2 Annotating documents
    3.2.3 Quality control
  3.3 Discussion of the Vietnamese NER corpus building process
  3.4 Conclusion

4 Online Annotation Framework
  4.1 Introduction
  4.2 Training section
  4.3 Annotating documents
    4.3.1 Online annotation interface
    4.3.2 Automated file distribution for annotators
    4.3.3 Automated saving and management of files
  4.4 Quality control
    4.4.1 Document level
    4.4.2 Corpus level
    4.4.3 Explaining unusual entities
  4.5 Conclusion

5 Evaluation
  5.1 Introduction
  5.2 Corpus evaluation
    5.2.1 Inter-annotator agreement
    5.2.2 Offline corpus evaluation
    5.2.3 Online corpus
  5.3 Time cost
    5.3.1 Overview
    5.3.2 Offline process
    5.3.3 Online framework
  5.4 Named entity recognition system
    5.4.1 Preprocessing
    5.4.2 Gazetteer
    5.4.3 Transducer
    5.4.4 Experiment
  5.5 Summary

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work
    6.2.1 Creating a larger, higher-quality corpus
    6.2.2 Improving the online annotation framework
    6.2.3 Building a statistical NER system
A Named entity guideline
  A.1 Basic concepts
    A.1.1 Entity and entity name
    A.1.2 Instance of entity
    A.1.3 List of entities
    A.1.4 Entity recognition rules
  A.2 Entity classification
    A.2.1 Person
    A.2.2 Organization
    A.2.3 Location
    A.2.4 Facility
    A.2.5 Religion