(MASTER'S THESIS) Information Extraction for Vietnamese Real-Estate Advertisements

Information Extraction for Vietnamese Real-Estate Advertisements

by Pham Vi Lien

Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

June 2012

ORIGINALITY STATEMENT

I hereby declare that this thesis is my own work and that, to the best of my knowledge, it contains no material previously published or written by another person, nor substantial proportions of material that have been accepted for the award of any other degree or diploma at the University or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at the University of Engineering and Technology or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in conception or in style, presentation and linguistic expression is acknowledged.

Signed:
Date:

Abstract

In recent years, the real-estate market in Vietnam has grown rapidly, generating a large amount of real-estate information, especially advertisements for buying and selling property. This creates a pressing need for an information extraction system that helps users cope with the increasing volume of real-estate advertisements on the Internet. We propose a rule-based approach to building an information extraction system for online real-estate advertisements in Vietnamese. At the same time, we set up a process for building an annotated corpus that can be used in machine-learning approaches at a later stage. Our system achieves promising results, with F-measures above 90%. Our approach is particularly suitable for under-resourced languages where an
annotated corpus of a decent size is not readily available.

Keywords: natural language processing, information extraction, online real-estate advertisements

Acknowledgements

I would like to express my deep gratitude to my supervisor, Dr Pham Bao Son of the Faculty of Information Technology at the University of Engineering and Technology, Vietnam National University, Hanoi, for his encouragement, support, patience, guidance and advice. Without his constant, invaluable direction and tolerance, I could not have become a better researcher.

I would also like to pay my respects to the lecturers who taught me at the University of Engineering and Technology, Vietnam National University, Hanoi. I am also grateful to my sponsor, Quang Trung University, which granted me a full scholarship to pursue my Master's degree.

I owe all my friends and colleagues a huge thank-you for their encouragement and friendship. They gave me great moral support in stressful times.

Last but not least, I thank my wife for her sympathy and love over the past years. I heartily thank my parents, parents-in-law and sisters for their encouragement and their many years of support during my studies. I owe my success in life today to my parents' unconditional love, hard work and sacrifices.

To all, I thank you.

Contents

ORIGINALITY STATEMENT
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Problem and Idea
  1.2 Scope of the thesis
  1.3 Thesis structure
2 Related Work
  2.1 Approaches
    2.1.1 Rule-based approach
    2.1.2 Machine-learning approach
    2.1.3 Hybrid approach
  2.2 GATE framework
    2.2.1 Introduction
    2.2.2 General Architecture of GATE
    2.2.3 An example: ANNIE - A Nearly-New Information Extraction System
    2.2.4 Working with GATE
    2.2.5 Gazetteers
    2.2.6 JAPE
3 Our Vietnamese Real-Estate Information
Extraction system
  3.1 Template Definition
  3.2 Corpus Development
    3.2.1 Criterion of data collection
    3.2.2 Data collection
    3.2.3 Data normalization
    3.2.4 Corpus Annotation
  3.3 System Development
    3.3.1 Tokenizer
    3.3.2 Gazetteer
    3.3.3 JAPE Transducer
      3.3.3.1 Remove incorrect Lookup annotations
      3.3.3.2 Recognizing entities
      3.3.3.3 Recognizing entities
      3.3.3.4 Recognizing entities
      3.3.3.5 Recognizing , and entities
      3.3.3.6 Recognizing entities
      3.3.3.7 Recognizing entities
      3.3.3.8 Recognizing entities
  3.4 Summary
4 Experiments and Error Analysis
  4.1 Evaluation metrics
  4.2 Experimental result
  4.3 Errors Analysis
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
A A typical code
B Relevant Publications
Bibliography

List of Figures

1.1 The result for query "cần mua nhà Hà Nội" on Google Search
1.2 The expected result of our system
2.1 A screenshot of a GUI in GATE framework
2.2 The general architecture of GATE
3.1 Template of our system
3.2 An example of an original news article before normalization
3.3 An example of a normalized news article
3.4 The process of creating an annotated corpus and system development
3.5 The main code defining a new Callisto task
3.6 A news article annotated by Callisto
3.7 Architecture of our Vietnamese Real-Estate Information Extraction system
3.8 Typical Vietnamese Real-Estate Information Extraction system components
4.1 The performance of our system in three versions
4.2 Using lenient criteria to evaluate the annotation in three versions
4.3 Using strict criteria to evaluate the annotation in three versions
5.1 The screenshot of Real-Estate Information Extraction system
A.1 Code to recognize the TypeEstate entity
A.2 Code to recognize the Telephone entity
A.3 Code to recognize the Email entity
A.4 Code to recognize the Zone entity

List of Tables

4.1 Performance on the Training3 data using lenient criteria
4.2 Performance on the Training3 data using strict criteria
4.3 Performance on the Test3 data using lenient criteria
4.4 Performance on the Test3 data using strict criteria

Chapter 1

Introduction

As data and information sources grow rapidly every day, dealing with this data has become a big and challenging problem. Popular techniques such as machine learning cannot easily be applied to many language processing tasks in Vietnamese because of the lack of annotated corpora. This is indeed the case for processing real-estate advertisements. In this thesis, we propose to build an information extraction system for real-estate advertisements in Vietnamese.

1.1 Problem and Idea

With the advent and development of the Internet, a great amount of data has been posted online. This data is not only text but also images, audio, video and so on. It appears in most areas of life, from economics, politics, society and medicine to emerging areas such as securities, finance and real estate. The volume of data keeps increasing every day, especially in the age of cloud computing. Almost all user data is stored on web platforms. This huge data source contains a lot of information. When data grows rapidly, the information it contains grows even faster. With more information, users become more confused because the useful information they need is carried away in the stream of data. In order to help people deal with this situation, many search engines have been created, such as Google¹, Bing² and Yahoo³. They have quickly become indispensable tools that assist people in finding useful information from the huge data sources on the
Internet. However, they still do not meet users' expectations, especially when the user's query is a question. Take the following example: we use the phrase "cần mua nhà Hà Nội" (buy a house in Hanoi) as a query for Google's search engine (Figure 1.1). The result is a list of links to websites that contain one of the words of the query. From Figure 1.1, we can easily see that these are not the results users expect. Users have to spend a lot of time finding an answer to their query in this list of links. Therefore, we would like users to receive a list of specific answers to the query.

Figure 1.1: The result for query "cần mua nhà Hà Nội" on Google Search

To solve the above problem, researchers have looked into areas such as information extraction, text summarization and data mining to deliver more useful and specific information to users.

Information extraction is one of the important tasks in natural language processing. The main idea of an information extraction system is to extract snippets

¹ https://www.google.com/
² https://www.bing.com/
³ https://www.yahoo.com/

Chapter 4

Experiments and Error Analysis

Columns in Tables 4.1 to 4.4:
(1) No. of entities annotated manually
(2) No. of entities recognized correctly
(3) No. of entities recognized by our system
(4) Precision
(5) Recall
(6) F-measure

Type             (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate       180    180    180    100%   100%   100%
CategoryEstate   180    176    180    98%    98%    98%
Zone             165    152    160    95%    92%    94%
Area             151    134    134    100%   89%    94%
Price            147    146    146    100%   99%    100%
Contact          463    460    465    99%    99%    99%
All              1286   1248   1265   99%    97%    98%

Table 4.1: Performance on the Training3 data using lenient criteria

Type             (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate       180    180    180    100%   100%   100%
CategoryEstate   180    176    180    98%    98%    98%
Zone             165    112    160    70%    68%    69%
Area             151    132    134    99%    87%    93%
Price            147    146    146    100%   99%    100%
Contact          463    457    465    98%    99%    98%
All              1286   1203   1265   95%    94%    94%

Table 4.2: Performance on the Training3 data using strict criteria

The overall F-measures of the system on the test data using the lenient and strict criteria are 96% and 91% respectively. However, performance varies between entities. The lowest performance is on the Zone entity, which reflects the fact that Zone entities are very ambiguous and difficult to recognize. This is partly because Zone entities in Vietnamese are often long and appear in many formats. This also explains why the performance for Zone entities improves significantly under the lenient criteria compared with the strict criteria.

Type             (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate       80     79     80     99%    99%    99%
CategoryEstate   80     76     80     95%    95%    95%
Zone             72     62     69     90%    86%    88%
Area             61     51     51     100%   84%    91%
Price            58     55     55     100%   95%    97%
Contact          173    172    173    99%    99%    99%
All              524    495    508    97%    94%    96%

Table 4.3: Performance on the Test3 data using lenient criteria

Type             (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate       80     78     80     98%    98%    98%
CategoryEstate   80     68     80     85%    85%    85%
Zone             72     43     69     62%    60%    61%
Area             61     51     51     100%   84%    91%
Price            58     55     55     100%   95%    97%
Contact          173    172    173    99%    99%    99%
All              524    467    508    92%    89%    91%

Table 4.4: Performance on the Test3 data using strict criteria

Figure 4.1 shows the system's performance using the lenient and strict criteria. In the graph, the Y-axis is the F-measure and the X-axis represents the three versions in the development process of our system.
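The F-measure plotted in Figure 4.1, like the precision and recall columns of Tables 4.1 to 4.4, can be recomputed from the three counts reported for each entity type. The sketch below is only an illustration: it assumes the balanced F-measure (the harmonic mean of precision and recall), which reproduces the rounded percentages in the tables, and the names `Counts`, `precision`, `recall`, `f_measure` and `as_percent` are hypothetical helpers rather than part of the system described in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Counts for one entity type, named after columns (1)-(3) of the tables."""
    annotated: int   # (1) entities annotated manually
    correct: int     # (2) entities recognized correctly
    recognized: int  # (3) entities recognized by the system


def precision(c: Counts) -> float:
    # Share of the system's output that is correct.
    return c.correct / c.recognized if c.recognized else 0.0


def recall(c: Counts) -> float:
    # Share of the manually annotated entities that the system found.
    return c.correct / c.annotated if c.annotated else 0.0


def f_measure(c: Counts) -> float:
    # Balanced F-measure (assumed here): harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if p + r else 0.0


def as_percent(x: float) -> int:
    # The tables report whole percentages.
    return round(x * 100)


# Zone rows of Table 4.3 (lenient) and Table 4.4 (strict) on the Test3 data.
zone_lenient = Counts(annotated=72, correct=62, recognized=69)
zone_strict = Counts(annotated=72, correct=43, recognized=69)

for label, c in (("lenient", zone_lenient), ("strict", zone_strict)):
    print(label,
          as_percent(precision(c)),
          as_percent(recall(c)),
          as_percent(f_measure(c)))
# lenient 90 86 88
# strict 62 60 61
```

Only the number of correctly recognized entities changes between the two Zone rows (62 against 43), so the gap between the lenient and strict scores comes entirely from how partially matched entities are counted.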
Observing the chart, we can see that the system's performance improves significantly with each version, from 65% and 78% for version 1.0 up to 91% and 96% for version 2.0 under the strict and lenient criteria respectively. One important point to note is that the system's performance under both criteria improves gradually. This indicates that the process of building our system is quite stable, whether we use the lenient or the strict criteria.

Figure 4.1: The performance of our system in three versions

Figure 4.2: Using lenient criteria to evaluate the annotation in three versions

The performance of our system over the three versions, broken down by annotation, is shown in Figures 4.2 and 4.3. The X-axis and Y-axis of each diagram represent the annotations and the performance respectively. In these two charts, we can easily see a significant improvement in the recognition rate of annotations in each later version compared with the previous one. Specifically, in Figure 4.3 the difference between versions 1.0 and 2.0 is quite large (about 27% on average) for the Zone, Area, Price and Contact annotations. In addition, the two charts show a marked disparity for the Zone annotation between the lenient and strict criteria.

Figure 4.3: Using strict criteria to evaluate the annotation in three versions

4.3 Errors Analysis

As mentioned above, the Zone entity is one of the entities that our system finds most difficult to recognize. The main reasons are as follows:

- Diverse writing styles.
- Some entities, especially Zone entities, are very long and do not use capitalization.

Take the following two examples:

"Tôi cần mua hộ Mỹ đình – từ liêm – Hà Nội."
"I need to buy an apartment in My Dinh - Tu Liem - Ha Noi."

"Liên hệ: anh minh - 0987214931."
"Contact: anh Minh - 0987214931."

The location name (the phrase "Mỹ đình – từ liêm – Hà Nội") in the first example and the person name (the phrase "anh minh") in the second example are not recognized correctly because the clue words are not capitalized.

Chapter 5

Conclusion and Future Works

5.1 Conclusion

In this thesis, we propose a rule-based approach to building an information extraction system for online real-estate advertisements in Vietnamese. Although our approach is not new, it addresses an important task for which no annotated corpus in Vietnamese is publicly available. The system obtains good results, with an overall F-measure of 91% under the strict criteria and 96% under the lenient criteria. Currently, our system presents these results directly to users (Figure 5.1). The results can also serve more practical purposes, such as input data for third-party applications, for example search engines, data mining, and analysis and prediction of real-estate market trends. In addition, our system can be used as a tool to build an annotated corpus for real-estate advertisements.

5.2 Future Works

In the future, we will need to improve the system's performance on the Zone entity. The Zone entity is quite difficult to identify, but we may try to incorporate other factors, such as gazetteers and context, to improve recognition performance for this entity. We will also try to use machine learning on our annotated corpus and investigate avenues that could combine machine-learning approaches with our rule-based approach. At this stage, we have a good supporting tool for the development of an annotated corpus.

Figure 5.1: The screenshot of Real-Estate Information Extraction system

Appendix A

A typical code

Figure A.1: Code to recognize the TypeEstate entity

Figure A.2: Code to
recognize the Telephone entity

Figure A.3: Code to recognize the Email entity

Figure A.4: Code to recognize the Zone entity

Appendix B

Relevant Publications

Lien Vi Pham and Son Bao Pham. Information Extraction for Vietnamese Real-Estate Advertisements. In Proceedings of the Fourth International Conference on Knowledge and Systems Engineering (KSE), 2012 (accepted).