INCREMENTAL KNOWLEDGE ACQUISITION FOR NAMED ENTITIES RECOGNITION

DOCUMENT INFORMATION

Basic information

Pages	54
Size	1.08 MB

Content


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Van Bong

INCREMENTAL KNOWLEDGE ACQUISITION FOR NAMED ENTITIES RECOGNITION

Major: Computer Science
Supervisor: Dr. Pham Bao Son
Co-Supervisor: Dr. Nguyen Phuong Thai

HA NOI - 2012

AUTHORSHIP

“I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no materials previously published or written by another person except where due reference or acknowledgement is made.”

Signature:………………………………………………

SUPERVISOR’S APPROVAL

“I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Bachelor of Computer Science degree at the University of Engineering and Technology.”

Signature:………………………………………………

ACKNOWLEDGEMENT

First and foremost, I would like to thank my supervisors, Dr. Pham Bao Son and Dr. Nguyen Phuong Thai, for their patient guidance, continuous support and encouragement through the years. They were always available and answered my questions carefully whenever I needed help. I am very grateful for their advice and teaching.

I would like to thank Nguyen Ba Dat for his constant help and his contributions to this work.

I would like to thank the following people for reviewing parts of this thesis: Nguyen Quoc Dai and Hoang Duc Tam. Their tireless help, suggestions and comments are invaluable.

I greatly appreciate the Human Machine Interaction Laboratory, UET/Coltech, for their support during my time here.

I would also like to thank my friend, Le Hong Thuan, for her kind help and encouragement.

Finally, I would like to thank my family for their love and understanding, which helped me to finish this thesis successfully.

Thank you!

ABSTRACT

Human knowledge is huge and expanding every day. However, because almost all of it is available only in unstructured natural language documents, there is a great need for computing systems that extract information automatically. In such problem domains, Named Entities Recognition (NER) plays a central role in a successful information extractor.

Approaches to NER can be divided into three groups: statistical approaches, grammar-based approaches and hybrid approaches. Statistical and hybrid approaches require a large annotated training corpus to achieve an acceptable result, and it takes a lot of time and effort to obtain such annotated corpora. Grammar-based approaches take advantage of experts’ knowledge to overcome the shortage of annotated corpora. Nonetheless, grammar-based approaches have problems of their own. Firstly, the system becomes difficult to maintain when a large number of rules are added. Secondly, because language changes day by day, grammar-based approaches become expensive when adapting to new domains or acquiring new knowledge.

In this thesis, we first introduce an incremental knowledge acquisition method for NER. Although NER differs from a traditional classification problem, with this method we have successfully applied Ripple Down Rule (RDR), which is known as a favourable solution for classification problems. As a result, the method takes advantage of RDR by acquiring knowledge incrementally without breaking the consistency of the existing system. With the RDR structure, our system can be adapted to other domains more easily and effectively. It can also keep up with changes in language. Moreover, this thesis introduces an implementation on the GATE framework that uses JAPE grammars to reduce the effort of creating a new knowledge base.

Experiments show that knowledge is acquired continuously without breaking the consistency of the existing knowledge base. Meanwhile, the current knowledge base achieves an F-measure of 82% on an existing Vietnamese corpus.

Keywords: incremental knowledge acquisition, Named Entities Recognition, Ripple Down Rule

LIST OF FIGURES

Figure 2.1: An SCRDR example
Figure 2.2: GATE’s architecture
Figure 2.3: An annotation graph
Figure 3.1: Our system’s overview
Figure 3.2: An example of Tokenizer
Figure 3.3: An example of Gazetteer
Figure 3.4: An example of NE annotations
Figure 3.5: The NER Module
Figure 3.6: An example of all received NE annotations
Figure 3.7: The structure of RDR Module
Figure 4.1: The changing of F-measure between layers
Figure 4.2: The performance of our system after every 20 rules

LIST OF ABBREVIATIONS

CRF: Conditional Random Fields
GATE: General Architecture for Text Engineering
IE: Information Extraction
JAPE: Java Annotation Patterns Engine
LHS: Left-hand-side
LR: Language Resource
NE: Named Entities
NER: Named Entities Recognition
NLP: Natural Language Processing
POS: Part of Speech
PR: Processing Resource
RDR: Ripple Down Rule
RHS: Right-hand-side
SCRDR: Single Classification Ripple Down Rule
SVM: Support Vector Machine
VR: Visual Resource

Chapter 1 INTRODUCTION

1.1. Named Entities Recognition

Human knowledge is enormous and expanding every day. With the explosion of the Internet, this kind of resource is now shared and has become easier to fetch. However, almost all of it is unstructured and stored in the form of natural language documents. Therefore, it is difficult to collect and use this kind of resource effectively.
As a result, the problem of automatically extracting information from such resources and storing it in a structured form has attracted many researchers in natural language processing.

Named Entities Recognition (NER) plays a central role in a successful information extractor. It comprises two smaller problems: locating and categorizing all mentions of named entities in a text document. Popular kinds of named entities include [4,14]:
- Names of people, locations and organizations
- Dates and times
- Currency amounts or percentages

Furthermore, depending on the problem and domain, there are some additional specific named entities.

As with many other natural language processing (NLP) problems, there are three common kinds of approaches:
- Grammar-based approaches [4,15]
- Statistical approaches [14]
- Hybrid approaches [8]

[...]

... of our work. Section 2.1 describes the named entities recognition problem and related work, and section 2.2 provides an overview of ripple down rules. In the rest of this chapter, section 2.3 presents the GATE framework and the JAPE grammars that we have been working on.

2.1 Named Entities Recognition

2.1.1 Introduction

Named Entities Recognition is a phase in an IE (Information Extraction) system. It tries to ...

... approach, the named entities recognition problem is turned into a classification problem. For example, the IOB model [12] categorizes each word in a document by labeling it “I”, “O” or “B”:
- “I” if the present word is inside the named entity under consideration
- “O” if the present word is outside the named entity under consideration
- “B” if the present word is the start of the named entity under consideration

After ...
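To make the IOB labeling scheme concrete, the following is a minimal sketch in Python. It is an illustration only and is not taken from the thesis: the function name `iob_labels`, the token list and the entity spans are all hypothetical, and entity spans are assumed to be given as token index pairs with an exclusive end.

```python
from typing import List, Tuple


def iob_labels(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label each token "B", "I" or "O" given entity spans.

    Each span is a (start, end) pair of token indices, end exclusive.
    This helper is hypothetical and only illustrates the IOB scheme.
    """
    labels = ["O"] * len(tokens)          # every token starts outside any entity
    for start, end in spans:
        labels[start] = "B"               # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I"               # remaining tokens inside the entity
    return labels


# Hypothetical example sentence with two entities (a person and a location).
tokens = ["Pham", "Bao", "Son", "works", "in", "Ha", "Noi"]
spans = [(0, 3), (5, 7)]
for token, label in zip(tokens, iob_labels(tokens, spans)):
    print(f"{token}\t{label}")
```

Once every word carries one of the three labels, recognizing named entities reduces to predicting a label per word, which is what allows a standard classifier to be applied.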
... type Organization for the matched annotations.

Chapter 3 OUR INCREMENTAL KNOWLEDGE ACQUISITION METHOD FOR NER

To create a system that acquires knowledge incrementally, we use RDR to structure the rules in the knowledge base. Although RDR has worked successfully in classification problem domains, it has not been tested on NER problems. Its applicability is still doubtful because, besides classifying named entities, NER requires ...

... collected to form a series of starting positions of all potential named entities.

3.4.2 Phase 2 – finding named entities starting from those positions

This phase is designed to co-operate with the RDR module to identify all candidate named entities located by the given starting positions. This phase receives a series of starting positions in which each starting position represents a candidate named ...

... been any research on applying RDR to the NER problem because it includes not only classifying a named entity but also locating the boundary of the named entity.

1.3 The contributions of the thesis

In this thesis, we first introduce a new method to recognize named entities by building an incremental knowledge acquisition system. The method addresses the above difficulties including:
- By turning the NER ...

... while the system is being used. Note that, in RDR approaches, cases play an integral part in the knowledge acquisition process by motivating the capture of new knowledge. Cases also provide the context for deciding whether new knowledge applies, which ensures the consistency of the system when new knowledge is added. RDR has been successful in handling single classification problems across a wide range ...
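As an illustration of how a single classification RDR (SCRDR) knowledge base is commonly organized, here is a minimal sketch in Python. It follows the usual textbook description of SCRDR as a binary tree with "except" and "if-not" branches, where the conclusion of the last satisfied rule wins. It is not the thesis's implementation (which is built on GATE and JAPE), and the class `RDRNode`, the function `classify`, the feature names and the rules themselves are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RDRNode:
    """One rule in an SCRDR tree (hypothetical structure for illustration)."""
    condition: Callable[[dict], bool]
    conclusion: str
    except_child: Optional["RDRNode"] = None  # refinement tried when the rule fires
    if_not_child: Optional["RDRNode"] = None  # alternative tried when it does not


def classify(root: RDRNode, case: dict) -> Optional[str]:
    """Return the conclusion of the last rule satisfied along the path."""
    conclusion = None
    node = root
    while node is not None:
        if node.condition(case):
            conclusion = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    return conclusion


# Hypothetical rules: a default rule, refined by two exception rules.
org_rule = RDRNode(lambda c: c["prev_word"] == "company", "Organization")
person_rule = RDRNode(lambda c: c["capitalized"], "Person", except_child=org_rule)
root = RDRNode(lambda c: True, "NotEntity", except_child=person_rule)

print(classify(root, {"capitalized": True, "prev_word": "company"}))
print(classify(root, {"capitalized": True, "prev_word": "Mr"}))
print(classify(root, {"capitalized": False, "prev_word": "the"}))
```

In this layout, new knowledge is added by attaching a node at the point where evaluation stopped for the misclassified case, so existing rules are never edited. This is one way to read the claim that knowledge is acquired without breaking the consistency of the existing system.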
... advantages of using this structure. The last part of this chapter is a review of GATE and JAPE grammars, which form the framework of our implementation. In chapter 3, we propose our incremental knowledge acquisition method for named entities recognition. This chapter also focuses on the rule language used in our system. Chapter 4 presents our experiments on Vietnamese. The experimental results and the errors are ...

... rule-based knowledge system for classification problem domains. Before the invention of RDR, scientists paid little attention to the structure of a rule-based system and to how it should be maintained. Therefore, as human knowledge developed, people wasted a lot of time and effort adding new knowledge or adapting an existing system to new domains. RDR is different: it maintains the system by incrementally ...

... Time, Money, Percent) are clear and do not have many ambiguities, they are often easier to recognize. However, the definition of NE is domain dependent. For example, in chemical named entity recognizer systems, there are additional named entities for drugs, diseases, etc. [21]. In advertisement magazines, they can be product names. Similar to many other natural language processing problems, there ...

... category of named entities, etc. is considered as the input of the machine learning system. There are three kinds of learning in statistical approaches: supervised, unsupervised and semi-supervised. However, the latter two kinds (unsupervised and semi-supervised) are rarely used in named entities recognition problem domains. Only a few studies apply these kinds of learning. For example: ...

Posted: 14/06/2014, 09:27

REFERENCES

[1] Bikel, D., Miller, S., Schwartz, R. & Weischedel, R., "A High-Performance Learning Name-finder," in Fifth Conference on Applied Natural Language Processing, 1998, pp. 194-201.

[2] Borthwick, A., Sterling, J., Agichtein, E. & Grishman, R., "Exploiting diverse knowledge sources via maximum entropy in named entity recognition," in Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998.

[3] Budi, I. & Bressan, S., "Association Rules Mining for Name Entity Recognition," in Proceedings of the Fourth International Conference on Web Information Systems Engineering, 2003.

[4] Cao, T. V., Nguyen, T. & Tru, H., "VN-KIM IE: automatic extraction of Vietnamese named-entities on the Web," in New Gen. Comput., 2007, pp. 277-292.

[5] Collins, M. & Singer, Y., "Unsupervised models for named entity classification," in Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

[6] Compton, P. & Jansen, R., "A philosophical basis for knowledge acquisition," in Knowledge Acquisition, 1988-1990, pp. 241-257.

[7] Cunningham, H., Maynard, D., Bontcheva, K. & Tablan, V., "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications," in Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002, pp. 168-175.

[9] Hirschman, L. & Thompson, S. H., "Overview of evaluation in speech and natural language processing," in Survey of the State of the Art in Human Language Technology. Cambridge, UK: Cambridge University Press, 1998.

[10] Iwanska, L., Croll, M., Yoon, T. & Adams, M., "Wayne State University: Description of the UNO processing system as used for MUC-6," in Proc. of the MUC-6, NIST, Columbia, 1995.

[11] Kim, J., Kang, I. & Choi, K., "Unsupervised Named Entity Classification Models and their Ensembles," in Proceedings of the 19th International Conference on Computational Linguistics, 2002.

[13] Linguistic Data Consortium, ACE (Automatic Content Extraction) English Annotation Guidelines for Entities, 2008.

[14] Mansouri, A., Affendey, L. & Mamat, A., "A new fuzzy support vector machine method for named entity recognition," in Proceedings of the 2008 International Conference on Computer Science and Information Technology, Washington, DC, USA, 2008, pp. 24-28.

[15] Maynard, D., Tablan, V., Ursu, C., Cunningham, H. & Wilks, Y., "Named Entity Recognition from Diverse Text Types," in Proceedings of the Recent Advances in Natural Language Processing 2001 Conference, Tzigov Chark, Bulgaria, 2001, pp. 257-274.

[17] Nguyen, P. T., Vu, X. L., Nguyen, H., Nguyen, H. & Le, P., "Building a large syntactically-annotated corpus of Vietnamese," in Proceedings of the Third Linguistic Annotation Workshop, Stroudsburg, PA, USA, 2009, pp. 182-185.

[18] Nguyen, T., Oanh, T., Hieu, P. & Thuy, H., "Named Entity Recognition in Vietnamese Free-Text and Web Documents Using Conditional Random Fields," in The 8th Conference on Some Selection Problems of Information Technology and Telecommunication, Hai Phong, Viet Nam, 2005.

[19] Nguyen, D., Hoang, S., Pham, S. & Nguyen, P. T., "Named entity recognition for Vietnamese," in Intelligent Information and Database Systems, 2010, pp. 205-214.

[20] Pham, D., Tran, G. & Pham, S., "A Hybrid approach to Vietnamese word segmentation using part of speech tags," in Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, Washington, DC, USA, 2009, pp. 154-161.

[21] Pham, T., Kawazoe, A., Dinh, D. & Collier, N., "Construction of Vietnamese corpora for named entity recognition," in Conference RIAO2007, Pittsburgh, PA, U.S.A., 2007.

[22] Richards, D., "Two decades of ripple down rules research," in Knowledge Engineering Review, 2009, pp. 159-184.

[23] Wu, Y., Fan, T., Lee, Y. & Yen, S., Extracting Named Entities Using Support Vector Machines. Berlin Heidelberg: Springer-Verlag, 2006.
