2012 Fourth International Conference on Knowledge and Systems Engineering

Information Extraction for Vietnamese Real Estate Advertisements

Lien Vi Pham
Faculty of Engineering and Technology
Quang Trung University
Email: phvilien@gmail.com

Son Bao Pham
University of Engineering and Technology and Information Technology Institute
Vietnam National University, Hanoi
Email: sonpb@vnu.edu.vn

Abstract—Advertising has appeared in almost all areas of life. The large number of advertisements, especially in the real estate domain, has raised the need for an effective way to search and find useful information. In this paper, we propose a rule-based approach to building an Information Extraction system for Vietnamese online real estate advertisements. Experimental results show that our approach is very promising, with an overall F-measure of 91% on a collection of data gathered from popular Vietnamese real estate websites.

Index Terms—information extraction, information extraction system, real-estate, real-estate advertising, online real-estate advertising

I. INTRODUCTION

With the advent of the Internet, more and more data is available and we are currently "flooded" with data. Although search engines such as Google, Bing, Yahoo, etc. have been created to help people find information, they still have not met the expectations of users. Therefore, researchers have looked into areas such as information extraction and text summarization to overcome the information overload problem and to deliver useful information to users.

Information Extraction (IE) is the name given to any process which selectively extracts and structures data from input texts [1]. The final output of the extraction process varies; in every case, however, it can be transformed so as to populate some type of database. Information Extraction has gradually appeared in many fields such as politics, society, finance, and real estate, and in many different
languages such as English, French, and Chinese. For Vietnamese, however, it is still a relatively new problem, especially for online real estate advertisements.

In this paper, we propose a rule-based approach for building an Information Extraction system for Vietnamese online real estate advertisements. At the same time, we also build an annotated corpus for the same task.

Our paper is structured as follows. In section 2, we describe some related work. We present our data collection and annotation process in section 3. Section 4 describes our system in detail. Experimental results and conclusions are presented in section 5 and section 6, respectively.

II. RELATED WORK

Research in Information Extraction can be classified into three broad categories:
• Rule-based approaches [2], [3]
• Machine learning approaches [4], [5]
• Hybrid approaches [6], [7]

Using rules is one of the traditional methods of building Information Extraction systems. These systems are often based on features such as syntactic information (e.g. part of speech), contextual information [8], morphological information (e.g. uppercase, lowercase, numbers, etc.), or gazetteers [8]. Up to now, many studies using this method [9], [10], [11] have obtained high performance, including work for Vietnamese [2], [3].

There are works utilizing machine learning methods such as Hidden Markov Models [12], Maximum Entropy [4], and Support Vector Machines [13], [5] to take advantage of annotated corpora. Among these, some have obtained high performance [14] of around 81% in F-measure. These methods have also been successfully applied to Vietnamese [15], with an F-measure of about 83%.

Hybrid approaches try to combine the above two methods in order to exploit the advantages of each and reach high performance. The systems of Srihari [7] and Fang [6] have produced very good results for Chinese, but so far not much work has been done for Vietnamese.

There are projects that extract information from real estate advertisements for English [16], [17], but these take a wrapper induction approach on HTML documents. This differs greatly from our work, as we focus on free text, which does not have HTML tags as clues for recognizing entities.

III. DATA COLLECTION AND ANNOTATION

A. Data collection criteria

The news articles selected for our system should satisfy the following criteria:
• An input data file consists of only one real-estate advertisement. If an input data file has more than one advertisement, we divide it into several files. In other words, each input data file has only one output template.
• The news article is in free text format. As the focus of our work is on free text processing, we strip all HTML tags and only retain the free text of the collected advertisements.

B. Template Definition

The goal of information extraction tasks is to identify, categorize, or normalize specific information from natural language texts. This information is filled into a form, which is the result of the information extraction process. The form has a clear structure called a pattern/template, and it is usually predetermined by human experts or system developers. Defining templates/patterns is a difficult task involving the selection of the required information elements and the definition of their relationships [1]. This task has been identified as one of the challenges in building an information extraction system.

By inspecting the collected data, we decided on the template for our system shown in Figure 1. This template captures most of the information that posters describe as well as what viewers are looking for in a real-estate advertisement. The information elements of the template are often regarded as entities. The example in Figure 2 has entities such as TypeEstate, CategoryEstate, Area, Price, Zone, Fullname, and Telephone.

+ Loại tin (TypeEstate)
+ Loại nhà (CategoryEstate)
+ Diện tích (Area)
+ Giá tiền (Price)
+ Khu vực (Zone)
+ Liên hệ (Contact)
  • Tên liên hệ (Fullname)
  • Điện thoại (Telephone)
  • Thư điện tử (Email)
  • Địa chỉ (Address)

Figure 1. Template of our system

C. Data collection

In order to develop and test our system, we built a corpus by collecting data from reputable websites that provide free online real estate advertisements, such as http://vnexpress.net/rao-vat/13/the-house-dat/ and http://raovat.thanhnien.com.vn/pages/default.aspx. These websites attract a large number of posters as well as viewers. The posters can put their advertisements in their own form and the websites do not modify them. Therefore, ads from the same poster tend to have a similar format, while ads belonging to different posters differ in style and language. This leads to plenty of ambiguity and not-well-formed ads with grammatical errors that cause trouble for automatic processing tools such as word segmentation or part-of-speech tagging. Some of the ambiguity and style diversity are: end of sentence without punctuation; person names, organization names, and place names in lower case without the first letter capitalized; frequent use of acronyms.

D. Data normalization

We perform automatic data normalization partly to remove some ambiguity and partly to assist the human annotation process. The data normalization or pre-processing step has to ensure that the content of the ads remains intact. Our normalization process consists of the following steps:
• First, we add punctuation at the end of sentences.
• Second, we merge multiple paragraphs into a single paragraph, because most of these news articles are not too long.
• Third, we normalize the punctuation, remove redundant spaces, and capitalize the first character after a full stop.
• Fourth, we normalize Telephone, Price, Area, etc. using a common pattern.
• Finally, we replace some of the abbreviated phrases with their corresponding full forms.
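To make the pre-processing concrete, the following is a minimal sketch of these normalization steps in Python, assuming simple regular-expression rules and a small abbreviation dictionary. The patterns and the abbreviation entries are illustrative only, not the exact ones used in our system.

```python
import re

# Illustrative abbreviation dictionary; the system's real list is larger
# and was compiled by hand (e.g. "CCCC" -> "chung cư cao cấp").
ABBREVIATIONS = {
    "CCCC": "chung cư cao cấp",
    "DT": "Diện tích",
}

def normalize(ad: str) -> str:
    # Step 2: merge multiple paragraphs into a single paragraph.
    text = " ".join(line.strip() for line in ad.splitlines() if line.strip())
    # Step 5: replace abbreviated phrases with their full forms.
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(r"\b%s\b" % re.escape(abbr), full, text)
    # Step 4: normalize telephone numbers written with dots, e.g. 098.985.8199.
    text = re.sub(r"\b(\d{3})\.(\d{3})\.(\d{4})\b", r"\1\2\3", text)
    # Step 3: remove redundant spaces and normalize punctuation spacing
    # (deliberately naive: the '.' is left alone so emails survive).
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([.,;:])", r"\1", text)
    text = re.sub(r"([,;:])(?=\S)", r"\1 ", text)
    # Step 3: capitalize the first character after a full stop.
    text = re.sub(r"(\. )(\w)", lambda m: m.group(1) + m.group(2).upper(), text)
    # Step 1: add punctuation at the end of the ad if it is missing.
    if not text.endswith((".", "!", "?")):
        text += "."
    return text
```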
Figure 2 and Figure 3 show examples of an ad before and after the normalization process, respectively.

Tôi cần bán CCCC Tòa CT14, Mỹ đình - từ liêm - Hà Nội DT:133m2, nguyên bản, chưa sửa chữa. Gia chủ muốn mua xin liên hệ anh hanh: 098.985.8199, giá 50-60tr/m2 (MIỄN TRUNG GIAN). Email: hanhtdb21@gmail.com

Figure 2. An example of an original news article before normalization

Tôi cần bán chung cư cao cấp tòa CT14, Mỹ đình - từ liêm - Hà Nội. Diện tích: 133 m2, nguyên bản, chưa sửa chữa. Gia chủ muốn mua xin liên hệ: anh Hanh - 0989858199, giá: từ 50 đến 60 triệu/m2 (MIỄN TRUNG GIAN). Email: hanhtdb21@gmail.com

Figure 3. An example of a normalized news article

E. Corpus Annotation

After the documents have been automatically normalized, they are manually annotated using the template defined in the previous section. We use Callisto (http://callisto.mitre.org/), an annotation tool developed for linguistic annotation, to annotate our data. Our corpus annotation process is carried out together with the process of rule creation. This bootstraps the annotation process and also gives insight for improving the rules.

IV. OUR VIETNAMESE REAL-ESTATE IE SYSTEM

Our Vietnamese Real-Estate (VRE) Information Extraction system is built as plugins in the GATE framework, with the architecture shown in Figure 4. GATE is a framework that is popularly used to build and develop natural language processing applications, especially information extraction. GATE has been used for many IE projects in many languages and problem domains, including the Message Understanding Conference (MUC) and Automatic Content Extraction (ACE) evaluations. Our system comprises five components, as follows:
• Text-Preprocessing
• Word segmentation
• POS tagging
• Gazetteer
• JAPE Transducer

Figure 4. Architecture of Vietnamese Real-Estate IE system

Looking at the architecture diagram of the system, we can visualize an overview of how the system works. First, we collect free online ads about real estate from the Internet. We then put them through the Text-Preprocessing engine for normalization. After the news articles are normalized, they are transferred to the system for further processing. The VRE system components execute sequentially, starting with Word Segmentation, followed by the POS tagging, Gazetteer and JAPE Transducer plugins. The results received at one plugin's output are the input of the next plugin. Finally, the returned news articles are annotated following our predefined template. The text preprocessing step is the data normalization described in the previous section. We describe the remaining four components in this section.

A. Tokenizer

A typical difference between Vietnamese and English is word segmentation, as Vietnamese is a monosyllabic language. A word in Vietnamese may contain one or more tokens. The quality of the system depends on how well this tokenization step is carried out. We use an existing word segmentation and part-of-speech tagger [18] and packaged them as a GATE plugin in our system.

The Tokenizer component creates two annotations over the textual data, namely the "Word" and "Split" annotations.
• Each "Word" annotation consists of the following features:
  + POS: the part of speech of the Word. For example: Np, Nn, etc.
  + string: the string of the Word. For example: "căn hộ", "Mỹ Đình", etc.
  + upper: "true" if the first character of the Word is uppercase, otherwise "false".
  + Besides these, there are also a number of features such as kind, nation, etc. to help the process of writing JAPE grammars.
• The "Split" annotation is created to capture delimiters such as ".", ";", ",", etc.
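To illustrate, the Tokenizer output for the normalized phrase "Liên hệ: anh Minh" could be represented as below. This is a hypothetical, simplified rendering of the annotations (the POS tags shown are illustrative), not GATE's actual data structures.

```python
# Hypothetical, simplified view of Tokenizer output for
# "Liên hệ: anh Minh" ("Contact: Mr. Minh").
annotations = [
    {"type": "Word",  "string": "Liên hệ", "POS": "V",  "upper": "true"},
    {"type": "Split", "string": ":"},
    {"type": "Word",  "string": "anh",     "POS": "Nn", "upper": "false"},
    {"type": "Word",  "string": "Minh",    "POS": "Np", "upper": "true"},
]
```

The JAPE rules described later match directly on these features, for example requiring upper == "true" on a candidate person-name token.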
B. Gazetteer

The Gazetteer consists of several different dictionaries that were created during the process of system development. The gazetteer captures the real estate domain knowledge. The dictionaries provide the necessary information for the entity recognition rules at later stages. Each dictionary represents a group of words with similar meaning. For our system, we use the following types of gazetteers:
• Gazetteers that contain potential named entities such as person, location (zone/address) or category
• Gazetteers containing phrases used in contextual rules, such as name prefixes or verbs that are likely to follow a person name
• A gazetteer of potentially ambiguous named entities

As our system works on free text without any HTML tag clues, the Gazetteer contributes significantly to the overall performance. The output of the Gazetteer component consists of Lookup annotations covering words with specific semantics.
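A minimal sketch of how such a gazetteer can produce Lookup annotations follows, assuming each dictionary is a flat phrase list keyed by a major type. The entries and the longest-match scan are purely illustrative and do not reproduce our actual dictionaries.

```python
# Illustrative gazetteer: flat phrase lists keyed by a major type.
# The real dictionaries are larger and were built during development.
GAZETTEER = {
    "person_prefix": ["anh", "chị", "ông", "bà"],
    "category": ["căn hộ", "chung cư cao cấp", "đất ruộng"],
    "zone": ["Mỹ Đình", "Từ Liêm", "Hà Nội", "Hà Đông"],
}

def lookup(text):
    """Return (start, end, type) Lookup spans, preferring longer matches."""
    entries = sorted(
        ((p, t) for t, phrases in GAZETTEER.items() for p in phrases),
        key=lambda e: len(e[0]),
        reverse=True,
    )
    low = text.lower()
    spans, taken = [], set()
    for phrase, major in entries:
        start = 0
        while (i := low.find(phrase.lower(), start)) != -1:
            end = i + len(phrase)
            if not taken & set(range(i, end)):  # skip overlap with a longer match
                spans.append((i, end, major))
                taken |= set(range(i, end))
            start = end
    return sorted(spans)
```

Matching case-insensitively, as in this sketch, is one way to cope with the uncapitalized person and place names that are common in the ads.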
C. JAPE Transducer

The JAPE transducer module is a cascade of JAPE grammars, or rules. A JAPE grammar allows one to specify regular expression patterns over semantic annotations. Thus, the results of the previous modules, including word segmentation, part-of-speech tagging and the gazetteer, in the form of annotations, can be used to create annotations according to the expected template. A JAPE grammar has the following format:

LHS (left-hand side) --> RHS (right-hand side)

The left-hand side (LHS) is a regular expression over annotations. The right-hand side (RHS) is the action to be executed when the left-hand side is matched. For example:

Rule: Fullname01
Priority: 70
(
  (PRE_PERSON)
  (Word.string == ":")?
  (PERSONPHRASE):fn
)
-->
:fn.Fullname = {category = "CoF", kind = "Fullname"}

Our JAPE Transducer module structures the rules in the following order:
• Removing incorrect Lookup annotations and identifying potential named entities
• Recognizing TypeEstate entities
• Recognizing CategoryEstate entities and removing superfluous CategoryEstate entities. If the news article has more than one CategoryEstate entity, we use its position relative to the position of the TypeEstate entity to determine whether to retain or remove it.
• Recognizing Zone entities
• Recognizing Area entities and removing superfluous Area entities. In addition, if the news article does not have an explicit clue to determine the Area entity, we can use the TypeEstate and CategoryEstate entities to determine whether an Area entity exists, as in the following example: "Tôi cần bán 2000 m2 đất ruộng ở Hà Đông." (I need to sell 2000 m2 of farmland in Ha Dong.)
• Recognizing Price entities and removing superfluous Price entities
• Recognizing Telephone entities and removing superfluous Telephone entities
• Recognizing Fullname entities. If the news article does not have an explicit clue to determine the Fullname entity, we can use Telephone entities.
• Recognizing Address entities using Zone entities
• Recognizing Email entities
• Aggregating Telephone, Address, Email and Fullname entities into Contact entities
• Removing superfluous Zone entities

In the first step, we remove all Lookup annotations that are only part of a Word annotation, that is, Lookup annotations whose string is contained in the string of a covering Word annotation. For example, the word "Liên" (Lien) is a person name that could be used for recognizing Fullname, so our system creates a Lookup annotation for it. However, this word can also be just part of another word with a totally different meaning, such as "Liên hệ" (Contact). If we did not remove such Lookup annotations, this one would be a potential candidate for recognizing Fullname (a sketch of this filtering step is given at the end of this section).

The Zone entity is particularly difficult to recognize due to the fact that the tokens describing zones are often not capitalized. Moreover, this entity is often quite long. Take the following example, where the zone "Mỹ đình - từ liêm - Hà Nội" is quite difficult to recognize correctly:

"Tôi cần mua căn hộ ở Mỹ đình – từ liêm – Hà Nội."
("I need to buy an apartment in My dinh - tu liem - Ha Noi.")
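The Lookup-filtering step mentioned above can be sketched as a simple span-containment check. The span representation here is hypothetical and only illustrates the idea, not our actual GATE implementation.

```python
def remove_partial_lookups(words, lookups):
    """Drop Lookup spans that are strictly contained in a longer Word span.

    words, lookups: lists of (start, end, label) character spans.
    E.g. a Lookup for the person name "Liên" inside the Word "Liên hệ"
    (Contact) is discarded so it cannot trigger a Fullname rule.
    """
    def strictly_inside(inner, outer):
        return (outer[0] <= inner[0] and inner[1] <= outer[1]
                and (inner[1] - inner[0]) < (outer[1] - outer[0]))

    return [lk for lk in lookups
            if not any(strictly_inside(lk, w) for w in words)]
```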
V. EXPERIMENTS AND ERROR ANALYSIS

For our experiments, we use a corpus consisting of 260 documents and annotate them according to the template defined above. The corpus is split into Training and Test sets consisting of 180 and 80 documents, respectively. Our system is built using the documents in the Training set and is tested using the documents from the Test set.

A. Evaluation metrics

In the experiments, we use the Precision, Recall and F-measure metrics to evaluate our system. These metrics are defined as follows:

Precision = (c / a) x 100%
Recall = (c / b) x 100%
F-measure = 2 x (Precision x Recall) / (Precision + Recall)

where:
a: number of entities recognized by our system
b: number of entities annotated manually
c: number of entities recognized correctly

Performance evaluation of our system is carried out based on the test data using two criteria:
• Strict criteria: an entity is recognized correctly when both the span and the type are the same as in the annotated corpus.
• Lenient criteria: an entity is recognized correctly when the type is correct and the span partially overlaps with the one in the annotated corpus.

B. Experimental results

Table I. PERFORMANCE ON THE Training DATA USING LENIENT CRITERIA
Type             (1)    (2)    (3)    Precision  Recall  F-measure
TypeEstate       180    180    180    100%       100%    100%
CategoryEstate   180    176    180    98%        98%     98%
Zone             165    152    160    95%        92%     94%
Area             151    134    134    100%       89%     94%
Price            147    146    146    100%       99%     100%
Contact          463    460    465    99%        99%     99%
All              1286   1248   1265   99%        97%     98%
(1) No. of entities annotated manually; (2) No. of entities recognized correctly; (3) No. of entities recognized by our system.

Table II. PERFORMANCE ON THE Training DATA USING STRICT CRITERIA (columns as in Table I)
Type             (1)    (2)    (3)    Precision  Recall  F-measure
TypeEstate       180    180    180    100%       100%    100%
CategoryEstate   180    176    180    98%        98%     98%
Zone             165    112    160    70%        68%     69%
Area             151    132    134    99%        87%     93%
Price            147    146    146    100%       99%     100%
Contact          463    457    465    98%        99%     98%
All              1286   1203   1265   95%        94%     94%

Table III. PERFORMANCE ON THE Test DATA USING LENIENT CRITERIA (columns as in Table I)
Type             (1)    (2)    (3)    Precision  Recall  F-measure
TypeEstate       80     79     80     99%        99%     99%
CategoryEstate   80     76     80     95%        95%     95%
Zone             72     62     69     90%        86%     88%
Area             61     51     51     100%       84%     91%
Price            58     55     55     100%       95%     97%
Contact          173    172    173    99%        99%     99%
All              524    495    508    97%        94%     96%

Table IV. PERFORMANCE ON THE Test DATA USING STRICT CRITERIA (columns as in Table I)
Type             (1)    (2)    (3)    Precision  Recall  F-measure
TypeEstate       80     78     80     98%        98%     98%
CategoryEstate   80     68     80     85%        85%     85%
Zone             72     43     69     62%        60%     61%
Area             61     51     51     100%       84%     91%
Price            58     55     55     100%       95%     97%
Contact          173    172    173    99%        99%     99%
All              524    467    508    92%        89%     91%

Table I and Table II show the system's performance on the training data set using the lenient and strict criteria respectively, while Table III and Table IV show the system's performance on the test data set using the lenient and strict criteria respectively. The overall F-measures of the system on the Test data using the lenient and strict criteria are 96% and 91%, respectively.

However, we can easily see that the performance varies between the entities. The lowest performance is on the Zone entity, which reflects the fact that Zone entities are very ambiguous and difficult to recognize. This is partly due to the fact that Zone entities in Vietnamese are often long and presented in many formats. This also explains why the performance for Zone entities is significantly improved when using the lenient criteria compared to the strict criteria.

C. Error Analysis

The main sources of errors for our system are:
- Diverse writing styles
- Some entities, especially the Zone entity, are very long and do not use capitalization

Take the following two examples:

"Tôi cần mua căn hộ ở Mỹ đình – từ liêm – Hà Nội."
("I need to buy an apartment in My Dinh - Tu Liem - Ha Noi.")

"Liên hệ: anh minh - 0987214931."
("Contact: anh Minh - 0987214931.")

The location name (the phrase "Mỹ đình – từ liêm – Hà Nội") in the first example and the person name (the phrase "anh minh") in the second example are not recognized correctly, as the clue words are not capitalized.

VI. CONCLUSIONS AND FUTURE WORK

We have built a system for extracting information from real estate advertisements for Vietnamese. Our approach is suitable for under-resourced languages, particularly for tasks that do not have annotated data. Our system achieves an overall F-measure of 91% using the strict criteria, which is quite respectable. In the future, we will need to improve the system's performance on the Zone entity. We will also try to use machine learning on our annotated corpus and investigate avenues that could combine machine learning approaches with our rule-based approach.

ACKNOWLEDGEMENTS

This work is partially supported by the KC.01.TN04/11-15 project "Analyzing opinion's trend based on social network and applying in tourist and technology products".
REFERENCES

[1] J. Cowie and Y. Wilks, "Information extraction," 2000.
[2] D. B. Nguyen, S. H. Hoang, S. B. Pham, and T. P. Nguyen, "Named entity recognition for Vietnamese," in Proceedings of the Second International Conference on Intelligent Information and Database Systems: Part II, ser. ACIIDS'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 205–214. [Online]. Available: http://dl.acm.org/citation.cfm?id=1894808.1894834
[3] T.-V. T. Nguyen and T. H. Cao, "VN-KIM IE: automatic extraction of Vietnamese named-entities on the web," New Gen. Comput., vol. 25, no. 3, pp. 277–292, Jan. 2007. [Online]. Available: http://dx.doi.org/10.1007/s00354-007-0018-4
[4] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman, "Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '04. New York, NY, USA: ACM, 2004, pp. 89–98. [Online]. Available: http://doi.acm.org/10.1145/1014052.1014065
[5] A. Mansouri, L. S. Affendey, and A. Mamat, "Named entity recognition using a new fuzzy support vector machine," International Journal of Computer Science and Network Security (IJCSNS), vol. 8, no. 2, pp. 320–325, February 2008.
[6] X. Fang and H. Sheng, "A hybrid approach for Chinese named entity recognition," in Proceedings of the 5th International Conference on Discovery Science, ser. DS '02. London, UK: Springer-Verlag, 2002, pp. 297–301. [Online]. Available: http://dl.acm.org/citation.cfm?id=647859.736133
[7] R. Srihari, C. Niu, and W. Li, "A hybrid approach for named entity and sub-type tagging," in Proceedings of the Sixth Conference on Applied Natural Language Processing, ser. ANLC '00. Stroudsburg, PA, USA: Association for Computational Linguistics, 2000, pp. 247–254. [Online]. Available: http://dx.doi.org/10.3115/974147.974181
[8] I. Budi and S. Bressan, "Association rules mining for name entity recognition," in Proceedings of the Fourth International Conference on Web Information Systems Engineering, ser. WISE '03. Washington, DC, USA: IEEE Computer Society, 2003, pp. 325–. [Online]. Available: http://dl.acm.org/citation.cfm?id=960322.960421
[9] D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks, "Named entity recognition from diverse text types," in Recent Advances in Natural Language Processing 2001 Conference, Tzigov Chark, 2001.
[10] K. Pastra, D. Maynard, O. Hamza, H. Cunningham, and Y. Wilks, "How feasible is the reuse of grammars for named entity recognition?" in Proceedings of the 3rd Conference on Language Resources and Evaluation (LREC), Canary Islands, 2002.
[11] D. Maynard, K. Bontcheva, and H. Cunningham, "Towards a semantic extraction of named entities," in Recent Advances in Natural Language Processing, 2003.
[12] D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel, "Nymble: a high-performance learning name-finder," in Proceedings of the Fifth Conference on Applied Natural Language Processing, ser. ANLC '97. Stroudsburg, PA, USA: Association for Computational Linguistics, 1997, pp. 194–201. [Online]. Available: http://dx.doi.org/10.3115/974557.974586
[13] Y.-C. Wu, T.-K. Fan, Y.-S. Lee, and S.-J. Yen, "Extracting named entities using support vector machines," in Proceedings of the 2006 International Conference on Knowledge Discovery in Life Science Literature, ser. KDLL '06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 91–103. [Online]. Available: http://dx.doi.org/10.1007/11683568_8
[14] T. Nguyen, O. Tran, H. Phan, and T. Ha, "Named entity recognition in Vietnamese free-text and web documents using conditional random fields," in Proceedings of the Eighth Conference on Some Selection Problems of Information Technology and Telecommunication, Hai Phong, Viet Nam, 2005.
[15] P. T. X. Thao, T. Q. Tri, A. Kawazoe, D. Dinh, and N. Collier, "Construction of Vietnamese corpora for named entity recognition," in Large Scale Semantic Access to Content (Text, Image, Video, and Sound), ser. RIAO '07. Paris, France: Le Centre de Hautes Études Internationales d'Informatique Documentaire, 2007, pp. 719–724. [Online]. Available: http://dl.acm.org/citation.cfm?id=1931390.1931459
[16] T. W. Hong and K. L. Clark, "Using grammatical inference to automate information extraction from the web," in Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, ser. PKDD '01. London, UK: Springer-Verlag, 2001, pp. 216–227. [Online]. Available: http://dl.acm.org/citation.cfm?id=645805.669995
[17] H. Seo, J. Yang, and J. Choi, "Building intelligent systems for mining information extraction rules from web pages by using domain knowledge," in Proc. IEEE Int. Symp. Industrial Electronics, Pusan, Korea, 2001, pp. 322–327.
[18] D. D. Pham, G. B. Tran, and S. B. Pham, "A hybrid approach to Vietnamese word segmentation using part of speech tags," in Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, ser. KSE '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 154–161. [Online]. Available: http://dx.doi.org/10.1109/KSE.2009.44
"Word" annotation consists of the following features: IV OUR VIETNAMESE REAL -ESTATE IE SYSTEM Our Vietnamese Real- Estate (VRE) Information Extraction system is built as plugins in GATE framework... system for extracting information from real estate advertisements for Vietnamese Our approach is suitable for under-resourced languages, particularly for tasks that not have annotated data Our... B Template Definition The goal of information extraction tasks is to identify, categorize, or normalize specific information from natural language texts This information is filled in a form, which