Information Extraction for Vietnamese Real Estate Advertisements

Phạm Vi Liên
University of Engineering and Technology
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Dr. Phạm Bảo Sơn
Year of defense: 2012

Abstract

In recent years, the real-estate market in Vietnam has grown rapidly, generating a large amount of real-estate information, especially advertisements for buying and selling property. This creates an essential demand for an information extraction system that helps users cope with the increasing volume of real-estate advertisements on the Internet. We propose a rule-based approach to building an information extraction system for online real-estate advertisements in Vietnamese. At the same time, we set up a process for building an annotated corpus which can be used in machine-learning approaches at a later stage. Our system achieves promising results, with F-measures above 90%. Our approach is particularly suitable for under-resourced languages where an annotated corpus of a decent size is not readily available.

Keywords

Information technology; Advertising; Real estate; Information extraction

Content

ORIGINALITY STATEMENT
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Problem and Idea
1.2 Scope of the thesis
1.3 Thesis' structure
2 Related Work
2.1 Approaches
2.1.1 Rule-based approach
2.1.2 Machine-learning approach
2.1.3 Hybrid approach
2.2 GATE framework
2.2.1 Introduction
2.2.2 General Architecture of GATE
2.2.3 An example: ANNIE - A Nearly-New Information Extraction System
2.2.4 Working with GATE
2.2.5 Gazetteers
2.2.6 JAPE
3 Our Vietnamese Real-Estate Information Extraction system
3.1 Template Definition
3.2 Corpus Development
3.2.1 Criterion of data collection
3.2.2 Data collection
3.2.3 Data normalization
3.2.4 Corpus Annotation
3.3 System Development
3.3.1 Tokenizer
3.3.2 Gazetteer
3.3.3 JAPE Transducer
3.3.3.1 Remove incorrect Lookup annotations
3.3.3.2 Recognizing entities
3.3.3.3 Recognizing entities
3.3.3.4 Recognizing entities
3.3.3.5 Recognizing, and entities
3.3.3.6 Recognizing entities
3.3.3.7 Recognizing entities
3.3.3.8 Recognizing entities
3.4 Summary
4 Experiments and Error Analysis
4.1 Evaluation metrics
4.2 Experimental result
4.3 Errors Analysis
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future Works
A A typical code
B Relevant Publications
Bibliography
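
The rule-based approach and the F-measure evaluation named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's GATE/JAPE implementation: the district gazetteer, the regular expressions, the entity labels (District, Area, Price) and the sample advertisement are all hypothetical, chosen only to show how gazetteer lookup and pattern rules can produce annotations, and how precision, recall and F-measure are computed against a gold annotation set.

```python
import re

# Hypothetical mini-gazetteer of district names (an assumption for illustration;
# the thesis's actual gazetteer lists are not reproduced here).
DISTRICTS = ["Cầu Giấy", "Đống Đa", "Hai Bà Trưng"]

# Hypothetical pattern rules: an area such as "80 m2" and a price such as "2,5 tỷ".
AREA_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*m2", re.IGNORECASE)
PRICE_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(tỷ|triệu)", re.IGNORECASE)


def extract(ad):
    """Return (entity_type, start, end, text) tuples found in one advertisement."""
    found = []
    # Gazetteer lookup: exact string match against the known names.
    for name in DISTRICTS:
        for m in re.finditer(re.escape(name), ad):
            found.append(("District", m.start(), m.end(), m.group(0)))
    # Pattern rules: number followed by a unit.
    for m in AREA_RE.finditer(ad):
        found.append(("Area", m.start(), m.end(), m.group(0)))
    for m in PRICE_RE.finditer(ad):
        found.append(("Price", m.start(), m.end(), m.group(0)))
    # Sort by start offset so annotations follow text order.
    return sorted(found, key=lambda e: e[1])


def f_measure(predicted, gold):
    """Precision, recall and balanced F-measure over exact-match annotations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical sample advertisement.
ad = "Bán nhà Cầu Giấy, diện tích 80 m2, giá 2,5 tỷ"
entities = extract(ad)
```

Calling `f_measure(entities, gold)` with a hand-annotated `gold` list of tuples in the same shape gives the three scores for that advertisement. A corpus-level figure would pool the annotations of all advertisements before computing them.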