Information Extraction for Vietnamese Real-Estate Advertisements

by Pham Vi Lien
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
Supervised by Dr Pham Bao Son
A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology, June 2012

Contents
ORIGINALITY STATEMENT
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Problem and Idea
  1.2 Scope of the thesis
  1.3 Thesis' structure
2 Related Work
  2.1 Approaches
    2.1.1 Rule-based approach
    2.1.2 Machine-learning approach
    2.1.3 Hybrid approach
  2.2 GATE framework
    2.2.1 Introduction
    2.2.2 General Architecture of GATE
    2.2.3 An example: ANNIE - A Nearly-New Information Extraction System
    2.2.4 Working with GATE
    2.2.5 Gazetteers
    2.2.6 JAPE
3 Our Vietnamese Real-Estate Information Extraction system
  3.1 Template Definition
  3.2 Corpus Development
    3.2.1 Criterion of data collection
    3.2.2 Data collection
    3.2.3 Data normalization
    3.2.4 Corpus Annotation
  3.3 System Development
    3.3.1 Tokenizer
    3.3.2 Gazetteer
    3.3.3 JAPE Transducer
      3.3.3.1 Remove incorrect Lookup annotations
      3.3.3.2 to 3.3.3.8 Recognizing entities
  3.4 Summary
4 Experiments and Error Analysis
  4.1 Evaluation metrics
  4.2 Experimental result
  4.3 Errors Analysis
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
A A typical code
B Relevant Publications
Bibliography

Chapter 1: Introduction

1.1 Problem and Idea
With the birth and growth of the Internet, more and more data is posted online, and we are "flooded" by it. Although search engines such as Google (http://www.google.com), Bing (http://www.bing.com) and Yahoo (http://www.yahoo.com) were created to help people find information, they have not really met users' expectations. Researchers have therefore turned to fields such as information extraction and text summarization to overcome the information overload problem and to deliver useful information to users.

Information extraction is one of the important tasks in natural language processing. The main idea of an information extraction system is to extract pieces of information from unstructured or semi-structured text and use them to fill a predefined structured form called a template. Information extraction has gradually appeared in many domains, such as politics, society, finance and real estate, and in many languages, such as English, French and Chinese. For Vietnamese, however, the problem is still relatively new, especially in the domain of online real-estate advertisements.

Figure 1: Input data and output of the system

In this thesis, we propose a rule-based approach for building an information extraction system for Vietnamese online real-estate advertisements. At the same time, we build an annotated corpus for the same task.

1.2 Scope of the thesis
With the growth of the Internet, online advertising has become increasingly popular. It is an effective solution for the individuals and organizations who advertise as well as for viewers. As a result, the data available from advertisements is large and diverse. This thesis focuses on processing the free text of Vietnamese online advertisements in the real-estate domain.

1.3 Thesis' structure
The thesis is organized into the following chapters:
- Chapter 1: We introduce the problem and the idea of building an information extraction system for Vietnamese online real-estate advertisements.
- Chapter 2: We give an overview of related work on information extraction in general and in the real-estate domain in particular.
- Chapter 3: We describe in detail how we build the information extraction system for Vietnamese online real-estate advertisements.
- Chapter 4: We present our experimental results and analyze some of the causes of errors.
- Chapter 5: We summarize what the system has achieved and discuss directions for future development.

Chapter 2: Related Work

2.1 Approaches
Research on information extraction can be grouped into the following approaches:
- Rule-based approach [2], [3]
- Machine-learning approach [4], [5]
- Hybrid approach [6], [7]
Using rules is the traditional way of building information extraction systems. Such systems typically rely on syntactic features of the information (e.g. the part of speech of a word), the context of the information [8], morphological features (e.g. uppercase, lowercase, digits) and gazetteers [8]. Many studies have used this approach [9], [10], [11] and achieved high performance, including work on Vietnamese [2], [3].

Many works use machine-learning methods such as Hidden Markov Models [12], Maximum Entropy [4] and Support Vector Machines [13], [5] to take advantage of annotated corpora. For information extraction, some studies report high performance [14], with an F-measure of around 81%. These methods have also been applied successfully to Vietnamese [15], with an F-measure of about 83%.

The hybrid approach combines the two approaches above in order to exploit the advantages of each and achieve higher performance. The systems of Srihari [7] and Fang [6] give good results for Chinese. So far, however, there has been little such research for Vietnamese.

There are some works on extracting information from real-estate advertisements for English [16], [17], but they use a wrapper-induction approach on HTML documents. This is very different from our work, which focuses on unstructured text, i.e. text that has no HTML tags to serve as clues for recognizing entities.

2.2 GATE framework
GATE is an architecture, a framework and a graphical development environment for language engineering. It has been developed since 1995 by a team led by Professor Cunningham at the University of Sheffield. Today it is widely used around the world by researchers in many areas of language processing, especially information extraction, and it has been used in many information extraction projects across many languages and problem domains. A typical example is the information extraction system ANNIE (A Nearly-New Information Extraction System), which is packaged as a GATE plugin.

GATE is written in Java and is open-source software under a GNU licence. Users can get free support from the community of users and developers through the official GATE website. We use GATE to solve our problem.

Chapter 3: Our Vietnamese Real-Estate Information Extraction System

3.1 Template Definition
After inspecting the collected data, we decided on the template shown in Figure 4. This template captures most of the information that posters describe and that viewers look for in a real-estate advertisement.

+ Loại tin (TypeEstate): advertisement type
+ Loại nhà (CategoryEstate): property category
+ Diện tích (Area): area
+ Giá tiền (Price): price
+ Khu vực (Zone): zone
+ Liên hệ (Contact): contact details
    o Tên liên hệ (Fullname): contact name
    o Điện thoại (Telephone): telephone
    o Thư điện tử (Email): email
    o Địa chỉ (Address): address

Figure 4: Template of our system

3.2 Corpus Development

3.2.1 Criterion of data collection
The advertisements selected for our system must satisfy the following criteria:
- Each input data file contains exactly one real-estate advertisement. If an input file contains more than one advertisement, we split it into several files. In other words, each input data file yields exactly one output template.
- The advertisement is free text. As our work focuses on free-text processing, we strip all HTML tags and retain only the free text of the collected advertisements.

3.2.2 Data collection
In order to develop and test our system, we built a corpus by collecting data from reputable websites that provide free online real-estate advertisements, such as http://vnexpress.net/rao-vat/13/the-house-dat/ and http://raovat.thanhnien.com.vn/pages/default.aspx. These websites attract a large number of posters as well as viewers.

3.2.3 Data normalization
We normalize the data automatically, partly to remove some ambiguity and partly to assist the human annotation process. This pre-processing step must leave the content of the advertisements intact. Our normalization process consists of the following steps:
- First, we add punctuation at the end of each sentence.
- Second, we merge multiple paragraphs into a single paragraph, because most advertisements are short.
- Third, we normalize punctuation: we remove redundant spaces and capitalize the first character after a full stop.
- Fourth, we normalize Telephone, Price, Area, person names, etc. to a common pattern.
- Finally, we replace some abbreviations with their full forms.
Of these steps, the fourth is the most difficult. It makes an important contribution to improving the recognition rate of our system.

3.2.4 Corpus Annotation
After the documents have been automatically normalized, they are manually annotated according to the template defined in Section 3.1. We use Callisto, an annotation tool developed for linguistic annotation of textual data. Corpus annotation is carried out in parallel with rule creation. This bootstraps the annotation process and also gives insight into how to improve the rules.

3.3 Our Vietnamese Real-Estate system

3.3.1 Tokenizer
A typical difference between Vietnamese and English is word segmentation: Vietnamese is a monosyllabic language, and a Vietnamese word may consist of one or more tokens. The quality of the system depends on how well this tokenizing step is carried out. We use an existing word segmenter and part-of-speech tagger [18] and package them as a GATE plugin in our system. The Tokenizer component creates two annotation types, "Word" and "Split".
- Each "Word" annotation has the following features:
    o POS: the part of speech of the word, e.g. Np, Nn.
    o string: the string of the word, e.g. "căn hộ", "Mỹ Đình".
    o upper: "true" if the first character of the word is uppercase, otherwise "false".
    o Other features, such as kind and nation, that help when writing JAPE grammars.
- The "Split" annotation captures delimiters such as ".", ";" and ",".

3.3.2 Gazetteer
The Gazetteer consists of several dictionaries created during system development. These dictionaries capture real-estate domain knowledge and provide the information needed by the entity recognition rules at later stages. Each dictionary represents a group of words with similar meaning. Our system uses the following types of gazetteers:
- Gazetteers that contain potential named entities such as person, location (zone/address) or category.
- Gazetteers containing phrases used in contextual rules, such as name prefixes or verbs that are likely to follow a person name.
- A gazetteer of potentially ambiguous named entities.
As our system works on free text without any HTML tags as clues, the Gazetteer contributes significantly to the overall performance. The output of the Gazetteer component is a set of Lookup annotations covering words with specific semantics.

3.3.3 JAPE Transducer
The JAPE transducer module is a cascade of JAPE grammars, or rules. A JAPE grammar allows one to specify regular-expression patterns over semantic annotations. The results of the previous modules (word segmentation, part-of-speech tagging and gazetteer lookup), which take the form of annotations, can therefore be used to create annotations according to the expected template. A JAPE rule has the following format:

    LHS (left-hand side) --> RHS (right-hand side)

The left-hand side (LHS) is a regular expression over annotations. The right-hand side (RHS) is the action to be executed when the left-hand side matches. Our JAPE Transducer applies its rules in the following order:
- Remove incorrect Lookup annotations and identify potential named entities.
- Recognize TypeEstate entities.
- Recognize CategoryEstate entities based on TypeEstate. If the advertisement has more than one CategoryEstate entity, we use each one's position relative to the TypeEstate entity to decide whether to retain or remove it.
- Recognize Zone entities.
- Recognize Area entities using TypeEstate and CategoryEstate entities. If the advertisement has no explicit clue for the Area entity, we can use TypeEstate and CategoryEstate entities to determine whether an Area entity exists, as in the following example:
    Tôi cần bán 2000 m2 đất ruộng Hà Đông.
    (I need to sell 2000 m2 of farmland in Ha Dong.)
- Recognize Price entities and remove superfluous Price entities.
- Recognize Telephone entities and remove superfluous Telephone entities.
- Recognize Fullname entities based on Telephone entities.
- Recognize Address entities using Zone entities.
- Recognize Email entities.
- Aggregate Telephone, Address, Email and Fullname entities into Contact entities.
- Remove superfluous Zone entities.

We remove all Word annotations that are part of Lookup annotations. For example, the word "Liên" (Lien) is a person name that can be used for recognizing Fullname, but it can also be part of another word with a totally different meaning. When "Liên hệ" (contact) is recognized as a Lookup annotation, as a potential candidate for a Contact annotation, "Liên" should not remain a separate Word annotation.

Zone entities are particularly difficult to recognize because the tokens describing zones are often not capitalized. Moreover, these entities are often quite long. In the following example, the zone "My dinh - tu liem - Ha Noi" is quite difficult to recognize correctly:
    "Tôi cần mua căn hộ Mỹ đình – từ liêm – Hà Nội."
    "I need to buy an apartment in My dinh - tu liem - Ha Noi."
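The context rule for Area described above (an Area candidate is accepted when a TypeEstate or CategoryEstate entity sits next to it) can be illustrated with a small Python sketch. This is not the thesis's JAPE grammar: the gazetteer entries, the regular expression for area candidates and the function names `lookup` and `recognize_area` are all assumptions made for this example, and "directly adjacent" is a simplification of whatever context the real rules use.

```python
import re

# Hypothetical mini-gazetteers: the thesis does not list its dictionary entries.
GAZETTEER = {
    "TypeEstate": ["cần bán", "cần mua", "cho thuê"],
    "CategoryEstate": ["đất ruộng", "căn hộ"],
}

# A number followed by a square-metre unit, e.g. "2000 m2" (assumed normalized form).
AREA_CANDIDATE = re.compile(r"\d+(?:[.,]\d+)?\s*m2", re.IGNORECASE)


def lookup(text):
    """Gazetteer stage: return sorted (start, end, type) spans for known phrases."""
    spans = []
    for entity_type, phrases in GAZETTEER.items():
        for phrase in phrases:
            # Match whole phrases only, not fragments of longer words.
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            for match in re.finditer(pattern, text, re.IGNORECASE):
                spans.append((match.start(), match.end(), entity_type))
    return sorted(spans)


def is_adjacent(text, left_end, right_start):
    """True when only whitespace separates the end of one span from the start of the next."""
    return left_end <= right_start and text[left_end:right_start].strip() == ""


def recognize_area(text):
    """Rule stage: keep an area candidate only when a Lookup span is directly next to it."""
    lookups = lookup(text)
    areas = []
    for match in AREA_CANDIDATE.finditer(text):
        supported = any(
            is_adjacent(text, end, match.start())      # Lookup directly before
            or is_adjacent(text, match.end(), start)   # Lookup directly after
            for start, end, _ in lookups
        )
        if supported:
            areas.append(match.group())
    return areas


print(recognize_area("Tôi cần bán 2000 m2 đất ruộng Hà Đông."))  # ['2000 m2']
```

On the example sentence from the list above, the candidate "2000 m2" is accepted because it directly follows the TypeEstate phrase "cần bán" and directly precedes the CategoryEstate phrase "đất ruộng". A number with a unit but no such context is rejected, which mirrors the idea of running entity rules as a cascade over earlier annotations.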
3.4 Summary
In this chapter, we presented our Vietnamese Real-Estate Information Extraction system in detail. The first section introduced the template of the system. The next section described the development of the corpus. The final section presented the three main components of the system: Tokenizer, Gazetteer and JAPE Transducer. The JAPE Transducer is an important component of the system; it consists of rules, or JAPE grammars, that recognize entities.

Chapter 4: Experiments and Error Analysis

For our experiments, we use a corpus of 260 documents annotated according to the template defined above. The corpus is split into Training and Test sets of 180 and 80 documents respectively. Our system is built using the documents in the Training set and tested on the documents in the Test set.

4.1 Evaluation metrics
We use Precision, Recall and F-measure to evaluate our system. These metrics are defined as follows:

    Precision (P) = (c / a) x 100%
    Recall (R)    = (c / b) x 100%
    F-measure (F) = 2 x (P x R) / (P + R)

where:
    a: number of entities recognized by our system
    b: number of entities annotated manually
    c: number of entities recognized correctly

Since P and R are already percentages, F is a percentage as well. The system is evaluated on the test data using two criteria:
- Strict criteria: an entity is recognized correctly when both the span and the type are the same as in the annotated corpus.
- Lenient criteria: an entity is recognized correctly when the type is correct and the span partially overlaps with the one in the annotated corpus.

4.2 Experimental results
Table 5 and Table 6 show the system's performance on the training data set using lenient and strict criteria respectively, while Table 7 and Table 8 show the system's performance on the test data set using lenient and strict criteria respectively. In all four tables the columns are:
    (1) number of entities annotated manually
    (2) number of entities recognized correctly
    (3) number of entities recognized by the system
    (4) Precision
    (5) Recall
    (6) F-measure

Type              (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate        180    180    180   100%   100%   100%
CategoryEstate    180    176    180    98%    98%    98%
Zone              165    152    160    95%    92%    94%
Area              151    134    134   100%    89%    94%
Price             147    146    146   100%    99%   100%
Contact           463    460    465    99%    99%    99%
All              1286   1248   1265    99%    97%    98%
Table 5: Performance on the Training data using lenient criteria

Type              (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate        180    180    180   100%   100%   100%
CategoryEstate    180    176    180    98%    98%    98%
Zone              165    112    160    70%    68%    69%
Area              151    132    134    99%    87%    93%
Price             147    146    146   100%    99%   100%
Contact           463    457    465    98%    99%    98%
All              1286   1203   1265    95%    94%    94%
Table 6: Performance on the Training data using strict criteria

Type              (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate         80     79     80    99%    99%    99%
CategoryEstate     80     76     80    95%    95%    95%
Zone               72     62     69    90%    86%    88%
Area               61     51     51   100%    84%    91%
Price              58     55     55   100%    95%    97%
Contact           173    172    173    99%    99%    99%
All               524    495    508    97%    94%    96%
Table 7: Performance on the Test data using lenient criteria

Type              (1)    (2)    (3)    (4)    (5)    (6)
TypeEstate         80     78     80    98%    98%    98%
CategoryEstate     80     68     80    85%    85%    85%
Zone               72     43     69    62%    60%    61%
Area               61     51     51   100%    84%    91%
Price              58     55     55   100%    95%    97%
Contact           173    172    173    99%    99%    99%
All               524    467    508    92%    89%    91%
Table 8: Performance on the Test data using strict criteria

The overall F-measures of the system on the Test data using the lenient and strict criteria are 96% and 91% respectively.
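The Precision, Recall and F-measure columns in the tables above follow directly from the three counts, and the strict and lenient criteria differ only in how a predicted span is compared with an annotated one. The following Python sketch shows one way to compute them. The function names and the (start, end, type) tuple representation are assumptions made for this example, and treating an exact match as a special case of overlap under the lenient criteria is also an assumption, not something the thesis states.

```python
def overlaps(a, b):
    """True when two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def count_correct(gold, predicted, lenient=False):
    """Count predicted (start, end, type) entities that match a gold entity."""
    unmatched = list(gold)
    correct = 0
    for p_start, p_end, p_type in predicted:
        for g in unmatched:
            g_start, g_end, g_type = g
            if p_type != g_type:
                continue  # the type must agree under both criteria
            if lenient:
                span_ok = overlaps((p_start, p_end), (g_start, g_end))
            else:
                span_ok = (p_start, p_end) == (g_start, g_end)
            if span_ok:
                correct += 1
                unmatched.remove(g)  # each gold entity is matched at most once
                break
    return correct


def scores(annotated, correct, recognized):
    """Return (precision, recall, f_measure) in percent from the three counts."""
    precision = 100.0 * correct / recognized if recognized else 0.0
    recall = 100.0 * correct / annotated if annotated else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Zone row of Table 7: 72 annotated, 62 correct, 69 recognized.
p, r, f = scores(annotated=72, correct=62, recognized=69)
print(round(p), round(r), round(f))  # 90 86 88
```

The Zone row of Table 7 is reproduced by these formulas, which also confirms the column order used in the tables: Precision divides column (2) by column (3), and Recall divides column (2) by column (1).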
However, performance clearly varies between entity types. The lowest performance is on the Zone entity, which reflects the fact that Zone entities are very ambiguous and difficult to recognize. This is partly because Zone entities in Vietnamese are often long and written in many formats. It also explains why performance on Zone entities improves significantly under the lenient criteria compared to the strict criteria.

4.3 Errors Analysis
The main sources of errors for our system are:
- Diverse writing styles.
- Some entities, especially Zone entities, are very long and do not use capitalization. Take the following two examples:
    "Tôi cần mua căn hộ Mỹ đình – từ liêm – Hà Nội."
    "I need to buy an apartment in My Dinh - Tu Liem – Ha Noi."
    "Liên hệ: anh minh - 0987214931."
    "Contact: anh Minh - 0987214931."
  The location name (the phrase "Mỹ đình – từ liêm – Hà Nội") in the first example and the person name (the phrase "anh minh") in the second example are not recognized correctly because the clue words are not capitalized.

Chapter 5: Conclusion and Future Works

We have built a system for extracting information from Vietnamese real-estate advertisements. Our approach is suitable for under-resourced languages, particularly for tasks that do not have annotated data. Our system achieves an overall F-measure of 91% under the strict criteria, which is quite respectable.

In the future, we will need to improve the system's performance on the Zone entity. We will also try to apply machine learning to our annotated corpus and investigate ways of combining machine-learning approaches with our rule-based approach.

Relevant publication
Lien Vi Pham and Son Bao Pham. Information Extraction for Vietnamese Real-Estate. In Proceedings of the fourth International Conference on Knowledge and Systems Engineering (KSE), 2012 (Accepted).

Bibliography
[1] J. Cowie and Y. Wilks, "Information extraction," 2000.
[2] D. B. Nguyen, S. H. Hoang, S. B. Pham, and T. P. Nguyen, "Named entity recognition for Vietnamese," in Proceedings of the Second International Conference on Intelligent Information and Database Systems: Part II, ser. ACIIDS'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 205–214. [Online]. Available: http://dl.acm.org/citation.cfm?id=1894808.1894834
[3] T.-V. T. Nguyen and T. H. Cao, "VN-KIM IE: automatic extraction of Vietnamese named-entities on the web," New Gen. Comput., vol. 25, no. 3, pp. 277–292, Jan. 2007. [Online]. Available: http://dx.doi.org/10.1007/s00354-007-0018-4
[4] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman, "Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '04. New York, NY, USA: ACM, 2004, pp. 89–98. [Online]. Available: http://doi.acm.org/10.1145/1014052.1014065
[5] A. Mansouri, L. S. Affendey, and A. Mamat, "Named entity recognition using a new fuzzy support vector machine," International Journal of Computer Science and Network Security (IJCSNS), vol. 8, no. 2, pp. 320–325, February 2008.
[6] X. Fang and H. Sheng, "A hybrid approach for Chinese named entity recognition," in Proceedings of the 5th International Conference on Discovery Science, ser. DS '02. London, UK: Springer-Verlag, 2002, pp. 297–301. [Online]. Available: http://dl.acm.org/citation.cfm?id=647859.736133
[7] R. Srihari, C. Niu, and W. Li, "A hybrid approach for named entity and sub-type tagging," in Proceedings of the Sixth Conference on Applied Natural Language Processing, ser. ANLC '00. Stroudsburg, PA, USA: Association for Computational Linguistics, 2000, pp. 247–254. [Online]. Available: http://dx.doi.org/10.3115/974147.974181
[8] I. Budi and S. Bressan, "Association rules mining for name entity recognition," in Proceedings of the Fourth International Conference on Web Information Systems Engineering, ser. WISE '03. Washington, DC, USA: IEEE Computer Society, 2003, pp. 325–. [Online]. Available: http://dl.acm.org/citation.cfm?id=960322.960421
[9] D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks, "Named entity recognition from diverse text types," in Recent Advances in Natural Language Processing 2001 Conference, Tzigov Chark, 2001.
[10] K. Pastra, D. Maynard, O. Hamza, H. Cunningham, and Y. Wilks, "How feasible is the reuse of grammars for named entity recognition," in Proceedings of the 3rd Conference on Language Resources and Evaluation (LREC), Canary Islands, 2002.
[11] D. Maynard, K. Bontcheva, and H. Cunningham, "Towards a semantic extraction of named entities," in Recent Advances in Natural Language Processing, 2003.
[12] D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel, "Nymble: a high-performance learning name-finder," in Proceedings of the Fifth Conference on Applied Natural Language Processing, ser. ANLC '97. Stroudsburg, PA, USA: Association for Computational Linguistics, 1997, pp. 194–201. [Online]. Available: http://dx.doi.org/10.3115/974557.974586
[13] Y.-C. Wu, T.-K. Fan, Y.-S. Lee, and S.-J. Yen, "Extracting named entities using support vector machines," in Proceedings of the 2006 International Conference on Knowledge Discovery in Life Science Literature, ser. KDLL'06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 91–103. [Online]. Available: http://dx.doi.org/10.1007/11683568_8
[14] T. Nguyen, O. Tran, H. Phan, and T. Ha, "Named entity recognition in Vietnamese free-text and web documents using conditional random fields," Proceedings of the Eighth Conference on Some Selection Problems of Information Technology and Telecommunication, Hai Phong, Viet Nam, 2005.
[15] P. T. X. Thao, T. Q. Tri, A. Kawazoe, D. Dinh, and N. Collier, "Construction of Vietnamese corpora for named entity recognition," in Large Scale Semantic Access to Content (Text, Image, Video, and Sound), ser. RIAO '07. Paris, France: Le Centre de Hautes Etudes Internationales d'Informatique Documentaire, 2007, pp. 719–724. [Online]. Available: http://dl.acm.org/citation.cfm?id=1931390.1931459
[16] T. W. Hong and K. L. Clark, "Using grammatical inference to automate information extraction from the web," in Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, ser. PKDD '01. London, UK: Springer-Verlag, 2001, pp. 216–227. [Online]. Available: http://dl.acm.org/citation.cfm?id=645805.669995
[17] H. Seo, J. Yang, and J. Choi, "Building intelligent systems for mining information extraction rules from web pages by using domain knowledge," in Proc. IEEE Int. Symp. Industrial Electronics, Pusan, Korea, 2001, pp. 322–327.
[18] D. D. Pham, G. B. Tran, and S. B. Pham, "A hybrid approach to Vietnamese word segmentation using part of speech tags," in Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, ser. KSE '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 154–161. [Online]. Available: http://dx.doi.org/10.1109/KSE.2009.44