Towards a Framework for Building an Annotated Named Entities Corpus

Hoang Huu Son
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr. Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

June 2010

Table of Contents

1 Introduction
  1.1 Overview of Named Entity Recognition (NER)
  1.2 NER approaches
    1.2.1 Rule-based approach
    1.2.2 Machine learning approach
    1.2.3 Comparison
  1.3 Thesis contribution
  1.4 Thesis structure
2 Related Work
  2.1 Overview of our problem
  2.2 Research on building NER corpora
  2.3 Research on the corpus building process
  2.4 Overview of annotation tools
  2.5 Summary
3 Corpus Building Process
  3.1 Corpus building process
    3.1.1 Objective
    3.1.2 Building the annotation guideline
    3.1.3 Annotating documents
    3.1.4 Quality control
  3.2 Building a Vietnamese NER corpus with offline tools
    3.2.1 Building the annotation guideline
    3.2.2 Annotating documents
    3.2.3 Quality control
  3.3 Discussion of the Vietnamese NER corpus building process
  3.4 Conclusion
4 Online Annotation Framework
  4.1 Introduction
  4.2 Training section
  4.3 Annotating documents
    4.3.1 Online annotation interface
    4.3.2 Automatic file distribution to annotators
    4.3.3 Automatic saving and management of files
  4.4 Quality control
    4.4.1 Document level
    4.4.2 Corpus level
    4.4.3 Explaining unusual entities
  4.5 Conclusion
5 Evaluation
  5.1 Introduction
  5.2 Corpus evaluation
    5.2.1 Inter-annotator agreement
    5.2.2 Offline corpus evaluation
    5.2.3 Online corpus
  5.3 Time cost
    5.3.1 Overview
    5.3.2 Offline process
    5.3.3 Online framework
  5.4 Named entity recognition system
    5.4.1 Preprocessing
    5.4.2 Gazetteer
    5.4.3 Transducer
    5.4.4 Experiment
  5.5 Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work
    6.2.1 Building a larger, higher-quality corpus
    6.2.2 Improving the online annotation framework
    6.2.3 Building a statistical NER system
A Named Entity Guideline
  A.1 Basic concepts
    A.1.1 Entity and entity name
    A.1.2 Instance of an entity
    A.1.3 List of entities
    A.1.4 Entity recognition rules
  A.2 Entity classification
    A.2.1 Person
    A.2.2 Organization
    A.2.3 Location
    A.2.4 Facility
    A.2.5 Religion

Toward a Framework for Building a Named Entity Corpus

Hoang Huu Son
University of Engineering and Technology
Vietnam National University, Hanoi
144, Xuan Thuy, Cau Giay, Hanoi, Vietnam

Abstract

Named entity recognition (NER) is one of the most interesting problems in natural language processing. However, a major barrier to NER research is the difficulty of building an NER corpus, and no such corpus has been published. In this thesis, we therefore present a corpus building process and a framework for building NER corpora, in particular a Vietnamese named entity corpus.

Introduction

Please note the following points:
- The context of the research and its role/importance
- Related studies and their methods/solutions/approaches
- The remaining problems and the objective of this study/thesis
- Your proposal: what will be carried out?

- You can arrange one or more sections after the Introduction
- You can use subsections
- Show how the problem is formulated. You may give some foundations if necessary
- Show different aspects of the problem, for example: feature selection, learning algorithms, etc.
- Show your proposal. It is good if you can present the differences between your proposal and previous studies. It is also important to show/analyze the solution in a reasonable way
- Show how features are selected/built and which algorithms/methods you will use

Jana Kravalová and Zdeněk Žabokrtský have built the Czech Named Entity Corpus, which is presented in [?]. In this recently released corpus of Czech sentences with manually annotated named entities, a rich two-level classification scheme was used.

Experiments

You should give the following information:
- How are the models designed? You can design different models/parameters, so please describe them in detail
- How is the data prepared?
- The results should be presented in tables and graphs
- It is important to give a discussion after obtaining the experimental results

Conclusions

- With regard to the objective of this study as stated in the Introduction, what has been done?
- The contribution of your work and the meaning of the obtained results
- Present future work if needed

Publications

- Give here your publications during this master's course
- You can also give here your submissions and their status (i.e., submitted, revising, or in press)