Information Extraction for Vietnamese Real-Estate Advertisements

Information Extraction for Vietnamese Real-Estate Advertisements

by Pham Vi Lien

Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr. Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

June, 2012

ORIGINALITY STATEMENT

I hereby declare that this thesis is my own work and that, to the best of my knowledge, it contains no materials previously published or written by another person, nor substantial proportions of material which have been accepted for the award of any other degree or diploma at the University or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at the University of Engineering and Technology or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in conception or in style, presentation and linguistic expression is acknowledged.

Signed:
Date:

Abstract

In recent years, the real-estate market in Vietnam has been growing rapidly, which creates a large amount of real-estate information, especially advertisements for buying and selling property. This creates a pressing demand for an information extraction system that helps users deal with the increasing number of real-estate advertisements on the Internet. We propose a rule-based approach to building an information extraction system for online real-estate advertisements in Vietnamese. At the same time, we set up a process for building an annotated corpus which can be used in machine learning approaches at a later stage. Our system achieves promising results, with F-measures above 90%. Our approach is particularly suitable for under-resourced languages where an annotated corpus of a decent size is not readily available.

Keywords: natural language
processing, information extraction, online real-estate advertisements

Acknowledgements

I would like to express my deep gratitude to my supervisor, Dr. Pham Bao Son, of the Faculty of Information Technology at the University of Engineering and Technology, Vietnam National University, Hanoi, for his encouragement, support, patience, guidance and advice. Without his constant and invaluable direction and tolerance, I could not have become a better researcher.

I would also like to pay my respects to the lecturers who taught me at the University of Engineering and Technology, Vietnam National University, Hanoi.

I am also grateful to my sponsor, Quang Trung University, which granted me a full scholarship to pursue my Master's degree.

I owe all my friends and colleagues a huge thank-you for their encouragement and friendship. They provided great moral support when I was under stress.

Last but not least, I thank my wife for her sympathy and love over the past years. I heartily thank my parents, parents-in-law and sisters for their encouragement and their many years of support during my studies. I owe the success I have today to my parents' unconditional love, hard work and sacrifices.

To all, I thank you.

Contents

ORIGINALITY STATEMENT
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Problem and Idea
  1.2 Scope of the thesis
  1.3 Thesis' structure
2 Related Work
  2.1 Approaches
    2.1.1  2.1.2  2.1.3
  2.2 GATE framework
    2.2.1  2.2.2  2.2.3  2.2.4  2.2.5  2.2.6
3 Our Vietnamese Real-Estate Information Extraction system
  3.1 Template Definition
  3.2 Corpus Development
    3.2.1  3.2.2  3.2.3  3.2.4
  3.3 System Development
    3.3.1  3.3.2  3.3.3
  3.4 Summary
4 Experiments and Error Analysis
  4.1 Evaluation metrics
  4.2 Experimental result
  4.3 Errors Analysis
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
A A typical code
B Relevant Publications
Bibliography

List of Figures

1.1 The result for query "cần mua nhà ở Hà Nội" on Google Search
1.2 The expected result of our system
2.1 A screenshot of a GUI in GATE framework
2.2 The general architecture of GATE
3.1 Template of our system
3.2 An example of an original news article before normalization
3.3 An example of a normalized news article
3.4 The process of creating an annotated corpus and system
3.5 The main code defined to create a new Callisto task
3.6 A news article annotated by Callisto
3.7 Architecture of our Vietnamese Real-Estate Information Extraction system
3.8 Typical Vietnamese Real-Estate Information Extraction ... components
4.1 The performance of our system in three versions
4.2 Using lenient criteria to evaluate the annotation in three versions
4.3 Using strict criteria to evaluate the annotation in three versions
5.1 The screenshot of Real-Estate Information Extraction system
A.1 Code to recognize the TypeEstate entity
A.2 Code to recognize the Telephone entity
A.3 Code to recognize the Email entity
A.4 Code to recognize the Zone entity

List of Tables

4.1 Performance on the Training3
4.2 Performance on the Training3
4.3 Performance on the Test3
4.4 Performance on the Test3

Chapter 1 Introduction

As data and information sources grow rapidly every day, dealing with this data becomes a big and challenging problem. Popular techniques such as machine learning cannot easily be applied to many language processing tasks in Vietnamese due to the lack of annotated corpora. This is indeed the case for processing real-estate advertising information. In this thesis, we propose to build an information extraction system for real-estate advertisements in Vietnamese.

1.1 Problem and Idea

With the advent and development of the Internet, a great amount of data has been posted online. These data are not only text but also images, audio, video, and so on. They appear in most areas of life, from economics, politics, society and medicine to emerging areas such as securities, finance and real estate. The volume of data increases every day,
especially in the cloud computing age. Almost all user data is stored on web platforms. This huge data source contains a lot of information, and rapidly increasing data means that information grows even faster. With more information, users become more confused because the useful information they need is carried away in the data stream. To help people deal with this situation, many search engines have been created, such as Google (https://www.google.com/), Bing (https://www.bing.com/) and Yahoo (https://www.yahoo.com/). They have quickly become indispensable tools for finding useful information in the huge data sources on the Internet. However, they still do not meet users' expectations, especially when the user's query is a question. Take the following example: we use the phrase "cần mua nhà ở Hà Nội" (buy a house in Hanoi) as a query for Google's search engine (Figure 1.1). The result is a list of links to websites that contain words from the above query. From Figure 1.1, we can easily see that these are not the results users expect: users have to spend a lot of time finding an answer to their query in this list of links. Our goal is therefore that users should get a list of specific answers to the query.

Figure 1.1: The result for query "cần mua nhà ở Hà Nội" on Google Search

To solve the above problem, researchers have looked into areas such as information extraction, text summarization and data mining to deliver more useful and specific information to users. Information extraction is one of the important tasks in natural language processing. The main idea of an information extraction system is to extract snippets [...]

Chapter 4 Experiments and Error Analysis

[...] large (average about 27%) for the Zone, Area, Price and Contact annotations between version 1.0 and 2.0. In addition, in the
two charts above, we can see a marked disparity for the Zone annotation between the lenient and strict criteria.

Figure 4.3: Using strict criteria to evaluate the annotation in three versions

4.3 Errors Analysis

As mentioned above, the Zone entity is one of the most difficult entities for our system to recognize. The main reasons are as follows:
- Writing styles are diverse.
- Some entities, especially the Zone entity, are very long and do not use capitalization.

Take the following two examples:

"Tôi cần mua căn hộ tại Mỹ Đình từ liêm Hà Nội."
"I need to buy an apartment in My Dinh - Tu Liem - Ha Noi."

"Liên hệ: anh minh - 0987214931."
"Contact: anh Minh - 0987214931."

The location name (the phrase "Mỹ Đình từ liêm Hà Nội") in the first example and the person name (the phrase "anh minh") in the second example are not recognized correctly because the clue words are not capitalized.

Chapter 5 Conclusion and Future Works

5.1 Conclusion

In this thesis, we propose a rule-based approach to building an information extraction system for online real-estate advertisements in Vietnamese. Although our approach is not new, it addresses an important task for which there is no publicly available annotated corpus in Vietnamese. The system obtains good results, with an overall F-measure of 91% under the strict criteria and 96% under the lenient criteria. Currently, our system presents these results directly to users (Figure 5.1). The results can also serve more practical purposes, such as input data for third-party applications including search engines, data mining, and analysis and prediction of real-estate market trends. In addition, our system can be used as a tool to build an annotated corpus for real-estate advertisements.

5.2 Future Works

In the future, we will need to improve the system's performance on the Zone entity. The Zone entity is quite difficult to identify, but we may try to incorporate other factors such
as gazetteers and context to improve recognition performance for this entity. We will also try to use machine learning on our annotated corpus and investigate avenues that could combine machine learning approaches with our rule-based approach. At this stage, we have a good supporting tool for the development of an annotated corpus.

Figure 5.1: The screenshot of Real-Estate Information Extraction system

Appendix A A typical code

Figure A.1: Code to recognize the TypeEstate entity
Figure A.2: Code to recognize the Telephone entity
Figure A.3: Code to recognize the Email entity
Figure A.4: Code to recognize the Zone entity

Appendix B Relevant Publications

Lien Vi Pham and Son Bao Pham. Information Extraction for Vietnamese Real-Estate Advertisements. In Proceedings of the Fourth International Conference on Knowledge and Systems Engineering (KSE), 2012. (Accepted)
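
Supplementary sketch: the strict and lenient criteria reported in Chapter 4 and Section 5.1 can be illustrated with a short Python example. It assumes the common convention, used for instance by GATE's evaluation tools, that strict scoring accepts only exact span matches while lenient scoring also accepts partially overlapping spans of the same type. The text above does not spell out its exact definition, so this is an assumption, and the span offsets, entity values and function names below are hypothetical rather than taken from the system.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if two spans have the same type and share at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def f_measure(gold: List[Span], predicted: List[Span], lenient: bool = False) -> float:
    """F1 over annotations.

    Strict: a prediction is correct only if start, end and type all match.
    Lenient (assumed convention): a same-type overlap also counts as correct.
    Matching is greedy, which is enough for an illustration.
    """
    matched_gold = set()
    correct = 0
    for p in predicted:
        for i, g in enumerate(gold):
            if i in matched_gold:
                continue  # each gold annotation can be matched at most once
            hit = overlaps(p, g) if lenient else p == g
            if hit:
                matched_gold.add(i)
                correct += 1
                break
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: the Zone span is found only partially, the Price span exactly.
gold = [(0, 10, "Zone"), (20, 30, "Price")]
pred = [(0, 8, "Zone"), (20, 30, "Price")]
print(f_measure(gold, pred))                # 0.5 under the strict criteria
print(f_measure(gold, pred, lenient=True))  # 1.0 under the lenient criteria
```

A partially recognized long entity, as described for Zone in Section 4.3, lowers the strict score but not the lenient one, which is consistent with the gap between the two reported F-measures.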