Information extraction from dynamic web sources


INFORMATION EXTRACTION FROM DYNAMIC WEB SOURCES

ROSHNI MOHAPATRA
(B.E. (Computer Science and Engineering), VTU, India)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2004

Acknowledgments

First of all, I would like to express my sincere thanks and appreciation to my supervisor, Dr. Kanagasabai Rajaraman, for his attention, guidance, insight, and support, which have led to this dissertation. Through his ideas, he is in many ways responsible for much of the direction this work took. I would also like to thank Prof. Sung Sam Yuan and Prof. Vladimir Bajic, who have been a source of inspiration to me. I am grateful to Prof. Kwanghui Lim, Department of Business Policy, NUS School of Business, for being a mentor and friend, and for listening to my frequent ramblings. I would like to acknowledge the support of my thesis examiners, A/P Tan Chew Lim and Dr. Su Jian; I greatly appreciate the comments and suggestions given by them. Ma and Papa continue to pull off the feat of helping me with my work without caring to know the least about it. I would like to thank them and the rest of the family for their love, support and encouragement. Special thanks to Arun for his patience, support, favors and all the valuable input for this thesis and otherwise. Finally, a big thank you to all my friends, wherever they are, for all the good times we have shared that have helped me come this far.

Contents

Acknowledgments
Summary
1 Introduction
  1.1 Background
  1.2 Information Extraction from the Web
    1.2.1 Wrappers
    1.2.2 Wrapper Generation
  1.3 Organization
2 Survey of Related Work
  2.1 Wrapper Verification Algorithms
    2.1.1 RAPTURE
    2.1.2 Chidlovskii's Algorithm
  2.2 Wrapper Reinduction Algorithms
    2.2.1 ROADRUNNER
    2.2.2 DataProg
    2.2.3 SG-WRAM
  2.3 Summary
3 ReInduce: Our Wrapper Reinduction Algorithm
  3.1 Motivation
  3.2 Formalism
  3.3 Generic Wrapper Reinduction Algorithm
    3.3.1 Our Approach
    3.3.2 Algorithm ReInduceW
  3.4 Incremental Wrapper Induction
    3.4.1 LR Wrapper Class
    3.4.2 LR Wrapper Class
    3.4.3 LRRE Wrapper Class
  3.5 Summary
4 Experiments
  4.1 Performance of InduceLR
    4.1.1 Sample cost
    4.1.2 Induction cost
  4.2 Performance of ReInduceLR
5 Conclusions
A Websites considered for Evaluation of InduceLR
B Regular Expression Syntax

List of Tables

3.1 Algorithm ReInduceW
3.2 Algorithm InduceLR
3.3 Trace of Algorithm InduceLR for Page PA
3.4 Algorithm InduceLR
3.5 Expressiveness of LRRE
3.6 Algorithm InduceLRRE
3.7 Algorithm ExtractLRRE
4.1 Websites considered for evaluation of InduceLR
4.2 Details of the webpages
4.3 Precision and Recall of InduceLR
4.4 Time complexity of InduceLR
4.5 White Pages Websites considered for evaluation of ReInduceLR
4.6 Performance of ReInduceLR
4.7 Average Precision and Recall for Existing Approaches

List of Figures

1.1 Froogle: A product search agent
1.2 Weather listing from the Channel NewsAsia website
1.3 Page PA and HTML Source
1.4 Page PB and HTML Source
2.1 Life Cycle of a Wrapper
2.2 Layout changes in an Online Address Book
2.3 Content changes in a Home supplies page
2.4 Changed Address Book Addressm
2.5 User defined schema for the Address Book Example
2.6 Content Features of the Address field
3.1 Incremental Content changes in Channel NewsAsia Website
3.2 ReInduce: Wrapper Reinduction System
3.3 Page PA and its corresponding LR Wrapper
3.4 Illustration of LR Constraints
3.5 HTML Source for Modified Page PAm
3.6 Page PA and corresponding LR and LRRE wrappers
3.7 HTML Source for Modified Page PA
A.1 Screenshot from Amazon.com
A.2 Screenshot from Google.com
A.3 Screenshot from uspto.gov
A.4 Screenshot from Yahoo People Search
A.5 Screenshot from ZDNet.com

Summary

To organize, analyze and integrate information from the Internet, many existing systems need to automatically extract the content of webpages. Most systems use customized wrapper procedures to perform this task of information extraction. Traditionally, wrappers are coded manually, but hand-coding is a tedious process. A technique known as wrapper induction has been proposed for automatically learning a wrapper from a given resource's example pages. In both these methods, the key problem is that, due to the dynamic nature of the web, the layout of a website may change over time, and hence the wrapper may become incorrect. The problem of reconstructing a wrapper, to ensure continuous extraction of information from dynamic web sources, is called wrapper reinduction.

In this thesis, we investigate the wrapper reinduction problem and develop a novel algorithm that can detect layout changes and reinduce wrappers automatically. We formulate wrapper reinduction as an incremental learning problem and identify wrapper induction from an incomplete label as a key problem to be solved. We observe that the page content usually changes only incrementally over small time intervals, even though the layout may change drastically and none of the syntactic features may be retained. We thus propose a novel algorithm for incrementally inducing a class of wrappers called LR wrappers, and show that this algorithm asymptotically identifies the correct wrapper as the number of tuples is increased. This property is used to propose an LR wrapper reinduction algorithm. We demonstrate that this algorithm requires examples to be provided exactly once; thereafter it can detect layout changes and reinduce wrappers automatically, so long as the changed wrapper remains in LR.

We have performed experimental studies of our reinduction algorithm using real web pages and observed that the algorithm is able to achieve near-perfect performance. In comparison, DataProg has reported a performance of 90% precision and 80% recall, and SG-WRAM 89.5% precision and 90.5% recall. However, DataProg and SG-WRAM assume that the content to be extracted follows specific patterns, which is not required by our algorithm. Furthermore, our algorithm has been observed to be efficient and capable of learning from a small number of examples.

Chapter 1

Introduction

1.1 Background

We can perceive the Web as a huge library of documents: telephone directories, weather reports, web-logs, news, virus updates, research papers, job listings, event schedules, stock market information and many more. Recently there has been an interest in developing systems that can access such resources and organize, categorize and personalize this information on behalf of the user.

Information Integration systems deal with the extraction and integration of data from various sources [4]. An application developer starts with a set of web sources and creates a unified view of these sources. Once this process is complete, an end user can issue database-like queries as if the information were stored in a single large database [30]. Many such approaches have been discussed in [4, 9, 13, 25].
[Figure 1.1: Froogle: A product search agent]

Intelligent agents, or software agents, is a term used to describe fully autonomous systems that manage, collate, filter and redistribute information from many resources [36, 8]. Broadly put, agents include information integration as a task, but they additionally analyze the information obtained from the various sources. These systems assist users by finding information or performing simple tasks on their behalf. For instance, such a system might assist in product search to aid online shopping; Froogle, shown in Figure 1.1, is one such agent. Some agents help in web browsing by retrieving documents similar to already-requested documents [47] or by presenting cross-referenced scientific papers [40]. More commercial uses have been proposed: comparative shopping agents [19], virtual travel assistants [2, 22], and mobile agents [10]. Many such agents are deployed and listed online and provide a wide range of functionality. A comprehensive survey of such agents has been given in [47].

A class of these agents helps tackle the information overload on the Internet by assisting us in finding important resources on the Web [11, 37, 26, 54], and also by tracking and analyzing their usage patterns. This process of discovery and analysis of information on the World Wide Web is called Web Mining. Web mining is a huge, interdisciplinary and very dynamic scientific area, drawing on several research communities such as databases, information retrieval, and artificial intelligence, especially machine learning and natural language processing. It includes the automatic search and analysis of information resources available online (Web Content Mining), the discovery of the link structure of the hyperlinks at the inter-document level (Web Structure Mining), and the analysis of user access patterns (Web Usage Mining) [17]. A taxonomy of Web Mining tools has been described in [17], and a detailed survey on Web Mining has been presented in [31].

Originally envisioned by the World Wide Web Consortium (W3C) to evolve, proliferate, and be used directly by people, the initial web architecture consisted simply of HTTP, URIs, and an HTML source structure. People used query mechanisms (e.g., HTML forms) and received output in the form of HTML pages. This was very well suited for manual interaction. As expected, the existence of an open and freely usable standard allows anyone in the world to experiment with extensions; the HTTP and HTML specifications have both grown rapidly in this environment. HTML pages now contain extensive, detailed formatting information which is specific to one type of browser, and many vendor-specific tags have been added, making HTML useful only as a display language without any standard structure. Efforts are now underway to standardize and incorporate more structure into the web. The advent of XML has helped tackle this lack of structure, but XML is not yet commonly used and there is very limited native browser support. Though many websites employ XML in the background, there are still many HTML-based websites that would need to be converted to XML before universal adoption. Thus, there is still a need to convert existing HTML data to XML, and technology does not provide a trivial solution for this [39].

The Web, characterized by diverse authoring styles and content variations, does not have a rigid and static structure like relational databases.
Most pages are composed of natural language text, neatly 'formatted' with a title, subtitles, paragraphs, etc., more or less like traditional text documents. We observe, however, that the web demonstrates a fair degree of structure in data representation [20]; it is highly regular in order to be human-readable, and is often described as semi-structured. A document may contain its own metadata, but the common case is for the logical structure to be implicitly defined by a combination of physical structure (e.g., HTML tags for a webpage, line and paragraph boundaries for a free-text resource) and content indicators (e.g., words in section headings, capitalization of important words, etc.) [55]. For example, a webpage listing the world weather report may list the results in the form of a tuple (city, condition, max temperature, min temperature), as shown in Figure 1.2. Many such tuples may be present on the same page, appropriately formatted, giving it the appearance of a relational database. Similarly, a movie listing may present its information in the order (movie, rating, theater, time).

[Figure 1.2: Weather listing from the Channel NewsAsia website]

While unstructured text may be difficult to analyze, semi-structured text poses a different set of challenges. It is interlaced with extraneous elements like advertisements and HTML formatting constructs [33], and hence extraction of data from the Web is a non-trivial problem.

The primary problem faced by Information Integration systems and intelligent agents is not resource discovery, since most of them look at a few trusted sources related to specific domains. Since semi-structured pages contain a lot of extraneous information, the problem lies in being able to extract the contents of a page. Kushmerick et al. [36] advocate this task of information extraction from the web as the core enabling technology for a variety of information agents.

1.2 Information Extraction from the Web

At the highest level, this thesis is concerned with Information Extraction from the Web. Information Extraction (IE) is the process of identifying the particular fragments of an information resource that constitute its core semantic content [34]. A number of IE systems have been proposed for dealing with free text (see [56, 12] for example) and semi-structured text [6, 18, 33, 55].

Information extraction algorithms can be further classified on the basis of whether they deal with semi-structured text or semi-structured data [39]. Note that in the former the data can only be inferred, while in the latter the data is implicitly formatted. The focus in this thesis is on semi-structured data extraction. A taxonomy for these Web data extraction methods has been described in the detailed survey by Laender et al. [39], where the existing methods are classified into natural language processing (NLP), HTML structure analysis, machine learning, data modeling and ontology-based methods.

1.2.1 Wrappers

To extract information from semi-structured information resources, information extraction systems usually rely on extraction rules tailored to that source, generally called wrappers. Wrappers are software modules that help capture the semi-structured data on the web into a structured format. They have three main functions [32]:

• Download: They must be able to download HTML pages from a web site.

• Search: Within a resource, they must be able to search for, recognize and extract the specified data.
• Save: They should save this data in a suitably structured format to enable further manipulation. The data can then be imported into other applications for additional processing.

According to [5], 80% of the published information on the WWW is based on databases running in the background. When this data is compiled into HTML documents, the structure of the underlying databases is completely lost. Wrappers try to reverse this process by restoring the information to a structured format. Also, it can be observed that across different web sites and web pages in HTML, the structural formatting (HTML tags or surrounding text) may differ, but the presentation remains fairly regular. Wrappers thus also help in coping with the structural heterogeneity inherent in many different sources. By using several wrappers to extract data from the various information sources of the WWW, the retrieved data can be made available in an appropriately structured format [32].

To be able to search for data in semi-structured web pages, wrappers rely on key patterns that help recognize the important information fragments within a page. The most challenging aspect of Web data extraction by wrappers is to recognize the data among uninteresting pieces of text. For example, consider an imaginary web site containing Person Name and Country Name entities, shown in Figure 1.3.

[Figure 1.3: Page PA and HTML Source]
Rendered page PA:
PEOPLE
Jack,China
John,USA
Joseph,UK

HTML source:
PEOPLE
<B>Jack</B>,<I>China</I><BR>
<B>John</B>,<I>USA</I><BR>
<B>Joseph</B>,<I>UK</I><BR>

To extract the two entities, we can propose a wrapper, say PCWrapper, using the delimiters {<B>, </B>, <I>, </I>}, where the first two define the left and right delimiters of the Person Name and the last two define the corresponding delimiters for the Country Name. This wrapper can be used to extract the contents of page PA, and of any other page where the same features are present.

1.2.2 Wrapper Generation

One approach to creating a wrapper is to hand-code it [24], but this is a tedious process. Techniques have therefore been proposed for constructing wrappers semi-automatically or automatically, using a resource's sample pages. The automatic approaches that use supervised learning need the user to provide some labeled pages indicating the examples. Many such approaches were proposed in RAPIER [12], WHISK [56], WIEN [33], SoftMealy [28], STALKER [50] and DEByE [38]. A method for automatic generation of wrappers with unsupervised learning was introduced in RoadRunner [18].

To extract the data, wrappers use either content-based features or landmark-based features. Content-based approaches [12, 56] use content/linguistic features like capitalization, presence of numeric characters, etc., and are suitable for Web pages written in free text, possibly in a telegraphic style, as in job listings or rental advertisements. Landmark-based approaches [33, 28, 50] use delimiter-based extraction rules that rely on formatting features to delineate the structure of the data found [39], and hence are more suitable for data formatted in HTML. For example, in Figure 1.3, the wrapper PCWrapper can be learnt automatically from examples of (Person Name, Country Name) tuples.

Since the extraction patterns generated in all these systems are based on content or delimiters that characterize the text, they are sensitive to changes of the Web page format.

[Figure 1.4: Page PB and HTML Source]
Rendered page PB:
INFORMATION
Jack,China
James,India
John,USA
Jonathan,UK
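Before examining how such wrappers break, here is a minimal Python sketch of PCWrapper as a single regular expression. The HTML of page PA is transcribed from Figure 1.3; the markup of page PB is not fully recoverable from the figure, so the tag-swapped source below is an assumption, chosen to be consistent with the extraction behaviour described next:

```python
import re

# PCWrapper: Person Name between <B>..</B>, Country Name between <I>..</I>.
PC_WRAPPER = re.compile(r"<B>(.+?)</B>.*?<I>(.+?)</I>", re.S)

page_a = ("PEOPLE\n"
          "<B>Jack</B>,<I>China</I><BR>\n"
          "<B>John</B>,<I>USA</I><BR>\n"
          "<B>Joseph</B>,<I>UK</I><BR>\n")

print(PC_WRAPPER.findall(page_a))
# [('Jack', 'China'), ('John', 'USA'), ('Joseph', 'UK')]

# Assumed source for page PB: the same kind of data, but with the
# roles of the <B> and <I> tags swapped after the layout change.
page_b = ("INFORMATION\n"
          "<I>Jack</I>,<B>China</B><BR>\n"
          "<I>James</I>,<B>India</B><BR>\n"
          "<I>John</I>,<B>USA</B><BR>\n"
          "<I>Jonathan</I>,<B>UK</B><BR>\n")

print(PC_WRAPPER.findall(page_b))
# [('China', 'James'), ('India', 'John'), ('USA', 'Jonathan')]
```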
They either need to be reworked or need to be rerun to discover new patterns for new or changed source pages. For example, suppose the site in Figure 1.3 changes to a new layout as in Figure 1.4. Note that P CW rapper no longer extracts correctly. It will extract the tuples as (China, James), (India, John), (USA, Jonathan) rather than (Jack, China), (James, India), (John, USA), (Jonathan, UK). Kushmerick [16, 18] investigated 27 actual sites for a period of 6 months, and found that 44 % of the sites changed its layout during that period at least once [35]. If the source modifies its formatting (for example, to “revamp” its user interface) the observed content or landmark feature will no longer hold and the wrapper will fail [36]. In such cases, the extraction of data from such web pages becomes difficult and is clearly a non-trivial problem. In this thesis, we focus on this problem of Extraction of Information from Dynamic Web sites. We deal with dynamic web pages, typically, a web page which is modified in its layout, content or both. The challenge here is to generate the Wrapper automatically when the page changes occur, such that the data is extracted continuously for the purpose of the user. In this thesis, we develop systems that are capable of extract- 9 ing the content of such dynamic webpages. We propose a novel approach for dealing with dynamic websites and present efficient algorithms that can perform continuous extraction of information. 1.3 Organization The rest of the thesis is organized as follows: Chapter 2 is dedicated to reviewing all the existing literature for information extraction from dynamic websites and evaluating their strengths and weaknesses. We summarize the key learning from these methods and present the scope of our work. Chapter 3 presents a detailed description and analysis of our approach. We formally define the problem of information extraction from dynamic websites, and our approach to tackling it. We discuss the formal framework for our algorithm, and define and analyze in detail the wrapper classes. We also present a study and analysis of algorithms to learn these wrappers, and use them to propose a novel method to learn new wrappers on the fly when layout and content changes occur in the website. Chapter 4 discusses the empirical evaluation of our work through experiments on real webpages. We study the sample and time complexity of our algorithms and compare the results to the existing approaches. Chapter 5 summarizes our work and indicates the merits as well as limitations. We propose the ways to extend the algorithms to achieve better performance and also pose the open problems for further investigation. 10 Chapter 2 Survey of Related Work As discussed in the previous chapter, Wrappers are software modules that help us capture semi structured data into structured format. We noted that these wrappers are susceptible to “breaking”, when the website layout changes happen. To rectify this problem, a new wrapper needs to be induced using examples from the modified page. This is called the Wrapper Maintenance problem and it consists of two steps [36, 42]: 1. Wrapper Verification: To determine whether a wrapper is correct. 2. Wrapper Reinduction: To learn a new wrapper if the current wrapper has become incorrect. The entire process of a Wrapper Induction, Verification and Reinduction is illustrated through Figure 2.1 [42]. 
The wrapper induction system takes a set of web pages labeled with examples of the data to be extracted. Its output is a wrapper, consisting of a set of rules to identify the data on the page. A wrapper verification system monitors the validity of the data returned by the wrapper. If the site changes, the wrapper may extract nothing at all, or data that is not correct. The verification system will detect the data inconsistency and notify the operator, or automatically launch a wrapper repair process. A wrapper reinduction system repairs the extraction rules so that the new wrapper works on the changed page.

[Figure 2.1: Life Cycle of a Wrapper; the user labels HTML pages, wrapper induction produces a wrapper, the wrapper extracts data until verification detects a change, and automatic relabeling then feeds the reinduction system]

We take a simple example to illustrate this. Consider the example given in Figure 2.2. The wrapper Addresswrap for page Addresso is the same as PCWrapper in the previous chapter: {<B>, </B>, <I>, </I>}. When page Addresso changes its layout to Addressc, wrapper Addresswrap would extract (12 Orchard Road, James), (34 Siglap Road, June), (22 Science Drive, Jenny) on page Addressc. The wrapper verification system will identify that the extracted data is incorrect, and the wrapper reinduction system will help learn the new wrapper: {<I>, </I>, <B>, </B>}.

[Figure 2.2: Layout changes in an Online Address Book]
(a) Original Address Book, Addresso:
Jack, 1234 Orchard Road
James, 3454 Siglap Road
John, 22 Alexandra Road
Jonathan, 1156 Kent Ridge

(b) Changed Address Book, Addressc:
Jack, 12 Orchard Road
James, 34 Siglap Road
June, 22 Science Drive
Jenny, 11 Sunset Blvd

Wrapper Maintenance has been investigated in the literature. Below, we review the important works and discuss the strengths and limitations of these methods.

2.1 Wrapper Verification Algorithms

Wrapper verification is the step that determines whether a wrapper is still operating correctly. When a web site changes its layout, or in the case of missing attributes, the wrapper will yield either NULL results or a wrong result. In such a case, the wrapper is considered to be broken. This can become a big bottleneck for information integration tools and also for information agents.

2.1.1 RAPTURE

Kushmerick [35] proposed a method for wrapper verification using a statistical approach, relying on heuristics like word count and mean word length. The method computes these heuristics for the new page and compares them against the heuristic data for pre-verified pages to check whether the wrapper is still correct. An outline of the steps is given below:

• Step 1: Estimate the parameters of the tuple-count distribution for the pre-verified pages. The number of tuples is assumed to follow a normal distribution; the mean tuple count and the standard deviation are computed.

• Step 2: Estimate the feature value distribution parameters for each attribute in the pre-verified pages, using simple statistical features like word count and word length. For example, the word count of 'Jonathan' is 1 and that of '1156 Kent Ridge' is 3, and the mean word length for the name field is 5.25.

• Step 3: For any new page, compute the tuple distribution and feature value distributions in the same way, and compare these values against those for the pre-verified pages. For example, feature 1 (Name) has an average word count of 1 in Addresso, but for the data extracted from Addressc the computed value is 3 (the broken wrapper now returns street addresses, e.g. '12 Orchard Road', in the name field).
• Step 4: Based on Step 3, compute the overall verification probability. This probability is compared against a fixed threshold to determine whether the wrapper is correct or incorrect. In the case of our example, it would return CHANGED.

[Figure 2.3: Content changes in a Home supplies page]
(a) Original content, Homeo:
Item | List Price | Our Price
Chopsticks | $6.95 | $4.95
Spoons | $25.00 | $10.00

(b) Modified content, Homec:
Item | List Price | Our Price
Chopsticks | $6.95 | $3.95
Spoons | $25.00 | $5.00

Strengths: For the most part, this method uses a black-box approach of measuring overall page metrics, and hence it can be applied for verification in any wrapper generation system. RAPTURE uses very simple numeric features to compute a probabilistic similarity measure between a wrapper's expected and observed output. After conducting experiments with numerous actual Internet sources, the authors claim RAPTURE performs substantially better than standard regression testing approaches.

Weaknesses: Since the information for even a single field can vary considerably, overall statistical distribution measures may not be sufficient. For example, in listings of scientific publications, the author names and the publication titles may all vary drastically, leading to ambiguity during verification. Such cases, though rare, make this approach ineffective, unless more features are used during verification, like digit density, upper-case density, letter density, HTML density, etc. For example, if the contents of the page in Figure 2.3(a) change to those in Figure 2.3(b) while the layout stays the same, then based on content patterns alone it would be very difficult to distinguish the 'List Price' field from the 'Our Price' field. Additionally, this method does not examine reinduction at all.

2.1.2 Chidlovskii's Algorithm

Another verification approach was suggested by Chidlovskii [15], who argues that pages rarely undergo any massive or sweeping change; more often than not, the change is a slight local change or concept shift. His automatic maintenance system repairs wrappers under this assumption of "small change". The method tackles verification with classifiers built using content features of the extracted information; for feature 1 (Name), for instance: average length = 5.25, number of upper-case characters = 1, number of digits = 0, etc.

The approach extends conventional forward wrappers with backward wrappers to create a multi-pass wrapper verification approach. In contrast to forward wrappers, backward wrappers scan files from the end to the beginning. A backward wrapper is similar in structure to the forward wrapper and can also run into errors when the format changes; however, because of the backward scanning, it will fail at positions different from where the forward wrapper fails. This typically works for errors caused by typos or missing close tags in HTML pages, and helps fine-tune the answers further.

[Figure 2.4: Changed Address Book Addressm]
Jack, 1234 Orchard Road
James, 3454 Siglap Road
Jenny, 22 Alexandra Road
Jules, 1 Kent Ridge

If page Addresso changed to page Addressm as in Figure 2.4, the forward wrapper Addresswrap_f would extract (Jack, 1234 Orchard Road), (James, 3454 Siglap Road), (Jenny, 22 Alexandra Road Jules, 1 Kent Ridge) on page Addressm. The backward wrapper, scanning page Addressm from the end, would extract the tuples (Jules, 1 Kent Ridge Road), (James, 3454 Siglap Road), (Jack, 1234 Orchard Road). The two passes thus disagree around the faulty entry, which signals the error.
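The forward/backward idea is easy to prototype. The following toy sketch is our own construction, not Chidlovskii's implementation: the backward pass simply matches reversed delimiters on the reversed page, so a missing close tag makes the two passes fail at different positions:

```python
import re

def forward_fields(page: str, left: str, right: str) -> list:
    """Forward pass: standard left-to-right, non-greedy matching."""
    pat = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pat, page, re.S)

def backward_fields(page: str, left: str, right: str) -> list:
    """Backward pass: match reversed delimiters on the reversed page,
    then un-reverse each hit (and the hit order)."""
    pat = re.escape(right[::-1]) + r"(.+?)" + re.escape(left[::-1])
    hits = re.findall(pat, page[::-1], re.S)
    return [h[::-1] for h in hits][::-1]

# Toy page with a missing </B> after 'Jenny':
page = "<B>James</B><B>Jenny<B>Jules</B>"
print(forward_fields(page, "<B>", "</B>"))   # ['James', 'Jenny<B>Jules']
print(backward_fields(page, "<B>", "</B>"))  # ['James', 'Jules']
```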
Strengths: The forward-backward scanning is unique and appears to be a robust approach to wrapper verification, especially for missing attributes and tags. Tested on a database of 18 websites, including the scientific literature database DBLP, this method reports an average error of only 4.7% when using the forward-backward wrappers with the context classifier.

Limitations: Though the forward-backward wrapper approach has an advantage over other verification methods when there are missing tags, the use of content features may not be very effective in many cases: since the information for even a single field can vary considerably, overall statistical distribution measures may not be sufficient.

2.2 Wrapper Reinduction Algorithms

Wrapper reinduction is the process of learning a new wrapper when the current wrapper is broken. It is a tougher problem than wrapper verification: not only does the wrapper have to be verified, a new wrapper must be constructed as well. This requires new examples to be provided for learning, which may be expensive when many sites are being wrapped. Conventional wrapper induction models cannot be used directly for reinduction, since many of them require detailed manual labeling for training, which can become a bottleneck. So the wrapper reinduction task usually consists of locating training examples on the new page, labeling them automatically, and supplying them to the wrapper induction module to learn the new wrapper.

2.2.1 ROADRUNNER

ROADRUNNER [18] is a method that uses unsupervised learning to learn wrappers. Pages from the same website are supplied, and a page comparison algorithm is used to generate wrappers based on similarities and mismatches. The algorithm performs a detailed analysis of the HTML tag structure of the pages to generate a wrapper that minimizes mismatches. The system employs wrappers based on a class of regular expressions called Union-Free Regular Expressions (UFREs), which are very expressive. The extraction process compares the tag structure of the sample pages and generates regular expressions that handle the structural mismatches found between them. In this way, the algorithm discovers structures such as tuples, lists and variations [39]. An approach similar to ROADRUNNER was used by Arasu et al. [3], who propose automatically inducing the underlying template of sample pages with the same structure from data-intensive web sites, and extracting the values encoded in them. However, this does not handle multiple values listed on one page.

Strengths: Since this method needs no examples to learn the wrappers, it has an obvious strength: it provides an alternative way to deal with the wrapper maintenance problem, especially in cases where no examples are available.

Limitations: Since ROADRUNNER searches a larger wrapper space, the algorithm is potentially inefficient. The unsupervised learning method also gives little control to the user: the user might want to make refinements and extract only a specific subset of the available tuples, and in such cases some amount of user input is clearly necessary to extract the correct set of tuples. Another problem of this approach is the need for many examples to learn the wrapper accurately [45].
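To give a flavor of ROADRUNNER's mismatch-driven matching step, here is a deliberately over-simplified toy of our own (real ROADRUNNER handles optionals, lists and unions, which this does not): two equal-structure pages are tokenized, and every position where the token streams disagree is generalized to a data slot:

```python
import re

def infer_template(page1: str, page2: str) -> list:
    """Toy template inference: align two equal-structure pages token by
    token; a string mismatch is generalized to a #PCDATA data slot."""
    tokenize = lambda p: re.findall(r"<[^>]+>|[^<]+", p)
    return [a if a == b else "#PCDATA"
            for a, b in zip(tokenize(page1), tokenize(page2))]

p1 = "<HTML><B>Jack</B>,<I>China</I></HTML>"
p2 = "<HTML><B>John</B>,<I>USA</I></HTML>"
print(infer_template(p1, p2))
# ['<HTML>', '<B>', '#PCDATA', '</B>', ',', '<I>', '#PCDATA', '</I>', '</HTML>']
```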
2.2.2 DataProg

Knoblock et al. [29] developed a method called DataPro for repairing wrappers in the case of small mark-up changes: it detects the most frequent patterns in the labeled strings, and these patterns are searched for in a page when the wrapper is broken. Lerman et al. [42] extended this content-centric approach for verification and reinduction in their DataProg system. The system takes a set of labeled example pages and attempts to induce content-based rules so that examples can be located on new pages. Wrappers can be verified by comparing the patterns of the returned data to the learned statistical distribution; when a significant difference is found, an operator can be notified or the wrapper repair process can be launched automatically.

For example, by observing the street addresses listed in our example, we can see that they are not completely random: each has a numeric token followed by a capitalized word. DataProg tries to derive a simple rule identifying this field in terms of token types (ALPHA, CAPS, etc.), and uses it to locate the examples on the new page, which are then passed to a wrapper induction algorithm (the STALKER algorithm) to reinduce the wrapper. This is similar to the approach used by content-centric wrapper tools [12, 56].

Strengths: The class of wrappers described by DataProg is very expressive, since they can handle missing and rearranged attributes. This approach applies machine learning techniques to learn the specific statistical distribution of the patterns for each field, as against the generic approach used by Kushmerick [35]. The approach assumes that the data representation is consistent, and looking at the test set, one can see that it can be used successfully for representations which have strong distinguishing features, like URLs, times, prices, phone numbers, etc.

Limitations: For many cases, like news, scientific publications, or even author names, this approach does not work well, since there are no fixed content-based rules (alphanumeric, capitalized, etc.) that can be identified to separate them from other content on the page. For example, in the case illustrated in Figures 2.3(a) and (b), this method will not detect any change, because the generic features and data patterns of 'List Price' and 'Our Price' are the same. It can also produce too many candidate data fields [45], many of which may be noise. It fails on very long descriptions, and is very sensitive to improper data coverage. Lerman et al. [42] quote a striking example of the data coverage problem that occurred for the stock quotes source: on the day the training data was collected there were many more down movements in the stock price than up, and the opposite was true on the day the test data was collected; as a result, the price-change fields for those two days were dissimilar. Moreover, the process of clustering the candidates for each data field does not consider the relationship of all data fields (the schema) [45].

2.2.3 SG-WRAM

SG-WRAM (Schema-Guided Wrapper Maintenance) [45] is a recent method that utilizes data features, such as syntactic features and annotations, for reinduction. The approach is based on the assumption that some features of the desired information in the previous document remain the same, e.g.
syntactic features (data types), hyperlink features (whether or not a hyperlink is present) and annotation features (any string that occurs before the data field). They also assume that the underlying schema is preserved in the changed HTML document. These features help the system identify the locations of the content in the modified pages through tag structure analysis. For our example, the user-defined schema would look like Figure 2.5.

[Figure 2.5: User defined schema for the Address Book Example]
<!ELEMENT Addresses (Address+)>
<!ELEMENT Address (Name, Street_Name)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Street_Name (#PCDATA)>

Internally, the system computes a mapping from each of the fields above to the HTML tree, and generates the extraction rule. For each #PCDATA string, the features are recorded: if the name were hyperlinked to another page, then the Hyperlink feature would be TRUE; similarly, if each street name were preceded by the string 'Street', the Annotation would be 'Street'. For our case, the features are shown in Figure 2.6.

[Figure 2.6: Content Features of the Address field]
Attribute | Syntactic | Hyperlink | Annotation
Name | [A-Z][a-z]{0,} | False | NULL
Street Name | [0-9]{0,}[A-Z][a-z]{0,} | False | NULL

For simple changes in pages, this method depends on the syntactic and annotation features; in case the web site has undergone a structural change, it uses the schema to locate structural groups and uses them to extract data.

Strengths: Since it relies on multiple features, this method works better in many cases. In the example illustrated in Figures 2.3(a) and (b), where syntactic differences are not strong, this work considers the annotation features ('List Price', 'Our Price'): when applying the extraction rule, the approach will find that the annotations have changed, and conclude that the page has changed.

Limitations: The basis of this approach is the assumption that data of the same topic will always be grouped together, mirroring the user-defined schema, and that this grouping will be retained even when the page changes. If the data schema or the syntactic and tag structure changes, then this method is not effective.

2.3 Summary

From our study, we observe a few key things about wrapper generation, verification and maintenance. We observe that landmark-based wrapper generation approaches are more suitable for HTML pages than content-based approaches. Conventional wrapper induction algorithms cannot be extended into reinduction algorithms, since most of them need manual labeling of data; the reinduction procedure should be automatic for continuous extraction of information from a web source.

Wrapper verification can be handled by heuristics: it can be tackled using global statistics [35] or local, attribute-specific statistics [42]. Since page structures are very complex, if a page changes its layout completely, it is very unlikely that the delimiters of one field will become interchanged with those of another, as happened between pages PA and PB; such cases may be rare. Hence it may be a common occurrence that the wrapper returns null values when a web site revamp happens. Wrapper verification can thus be treated independently of wrapper induction, and existing methods are usually adequate for the purpose. In contrast, wrapper reinduction is a far more difficult problem and has much scope for investigation.

From our survey of related work, we observe that the main issues with the current approaches are:

(i) Potential inefficiency, either because of the need for detailed HTML grammar analysis or due to searching in big wrapper spaces, which makes them inherently slow.
(ii) The requirement that most of the data in the modified pages have effective features (syntactic patterns, annotations, etc.). These can be page-specific, and hence make the reinduction approach difficult.

(iii) The need for many training examples for learning and reinduction. This additionally includes cases in which the user has to specify a detailed schema, which is not very user-friendly.

Our goal in this thesis is to address these issues effectively. We investigate wrapper reinduction algorithms that are efficient, learn from a small number of examples, and do not require strong assumptions on the data features. In the next chapter, we describe our approach and present our algorithms.

Chapter 3

ReInduce: Our Wrapper Reinduction Algorithm

In this chapter, we present a novel algorithm for wrapper reinduction. As discussed earlier, the focus is on dynamic webpages: pages whose layout or content may change over time.

3.1 Motivation

We observe that though the layout may change drastically, with none of the syntactic features retained, the page content usually changes only incrementally. For example, between the Channel NewsAsia (http://channelnewsasia.com) headlines snapshot in Figure 3.1(a) and the snapshot taken two hours later in Figure 3.1(b), we observe that new headlines were added as old headlines were deleted. In other words, the contents of the pages have a lot of commonalities over a small time interval. This content can be used to learn the new wrapper: if some of the old tuples can be detected in the page with the modified layout, we can apply wrapper induction to learn the new layout. This is the idea behind our reinduction algorithm.

[Figure 3.1: Incremental Content changes in Channel NewsAsia Website; (a) content of the page at 1200 hrs, (b) content of the page at 1400 hrs]

To motivate our approach, consider a wrapper X for this website which grabs the headlines from the page. Wrapper X extracts all the headlines present on the page and stores them in a small repository. If one day a website revamp happens and the layout is completely changed, then X might not retrieve the headlines on the page. The maintenance system then tries to locate the headlines stored in the repository and learns a new wrapper from these examples. Once the new wrapper is created, it can be used to locate all the news headlines on the same page. Instead of searching the wrapper space for a wrapper that will work, or manually constructing the training examples needed for reinduction, we try to learn the new wrapper from the few examples available to us, so that when these examples, though few, are discovered on the new page, we can induce the wrapper and deploy it into the system transparently to the user. An illustration of the process flow in our wrapper reinduction system, ReInduce, is given in Figure 3.2.

[Figure 3.2: ReInduce: Wrapper Reinduction System]

The key point is that at the induction/reinduction step of such a system, there may not be many training examples available. The key problem to be solved is therefore learning from a small number of examples, especially when not all examples on the page are available to us. In the following sections, we address this learning problem and propose our reinduction algorithm. In the next section, we describe the formal framework for the description of the wrapper classes.

3.2 Formalism

Resources, queries and responses: Consider the model where a site, when queried with a URL, returns an HTML page.
An information resource can be described formally as a function from a query Q to a response P [33]:

Query Q → [Information Resource] → Response P

Attributes and Tuples: We assume a model similar to the relational data model. Associated with every information resource is a set of K distinct attributes, each representing a column. For example, page PA in the country-name example of Figure 1.3 has K = 2. A tuple is a vector ⟨A_1, ..., A_K⟩ of K strings, where the string A_k is the value of the k-th attribute. This is similar to a row in a relational model. There are M such tuples/vectors present on a page; if more than one tuple is present, then the k-th attribute of the m-th tuple is denoted A_{m,k}.

Content and Labels: The content of a page is the set of tuples it contains. A page's label is a representation of its content. For example, the label of page PA in the country-name example of Figure 1.3 is L_A = {(Jack, China), (John, USA), (Joseph, UK)}.

Wrappers: A wrapper takes as input a page P and outputs a label L. For wrapper w and page P, we write w(P) = L to indicate that w returns label L when invoked on P, e.g. PCWrapper(PA) = L_A. Hence, a wrapper can be described as a function from a query response to a label:

Response (Page P) → [Wrapper] → Label L

A wrapper class is a template for generating such wrappers; all wrappers belonging to a class have similar execution steps.

Wrapper Induction: Let W be a wrapper class and E = {⟨P_1, L_1⟩, ..., ⟨P_N, L_N⟩} be a set of example pages and their labels. Wrapper induction is the problem of finding w ∈ W such that w(P_n) = L_n for all n = 1, ..., N.

Wrapper Verification: We say w is correct for P iff P's label is identical to w(P).

Wrapper Reinduction: For a dynamic web site, the response will also be a function of time. We assume the same model, in which the site is queried with a URL q and observed at time instants t_0, t_1, ..., t_N. Let:

• {P(t_0), P(t_1), ..., P(t_N)} be the pages returned in response to the queries, and

• {L(t_0), L(t_1), ..., L(t_N)} be the labels of the above pages.

The wrapper reinduction problem is: given the example ⟨P(t_0), L(t_0)⟩ at time t_0, find wrappers w_i ∈ W such that w_i(P(t_i)) = L(t_i).

3.3 Generic Wrapper Reinduction Algorithm

Note that the wrapper reinduction problem is trivial if both the pages and the labels remain static, i.e. P(t_i) = P(t_0) and L(t_i) = L(t_0) for i ≥ 1. Even if only the labels remain static, the problem is much simpler and reduces to inducing a wrapper w_i at time t_i using ⟨P(t_i), L(t_0)⟩ as the example. However, when both the pages and labels vary, we cannot induce a wrapper automatically, since L(t_i) is not known for i ≥ 1. This problem is, in general, not solvable without making assumptions about the variations. Lerman et al. [42] assume that the labels follow an implicit structure over all time instants; in SG-WRAM [45], the data schema is assumed to be preserved.

3.3.1 Our Approach

Our approach is designed as follows. Consider two time instants t_1 and t_2 such that t_2 > t_1. Let P(t_1) and P(t_2) be the pages returned at t_1 and t_2 for a fixed URL q. P(t_2) may differ from P(t_1) in layout, content or both. We observe that though the layout may change drastically, the page content usually changes only incrementally, provided (t_2 − t_1) is small enough. In other words, L(t_1) and L(t_2) will have a lot of commonalities for small (t_2 − t_1).
Therefore, if some of the tuples can be detected in the layout-modified page, we can apply wrapper induction to learn the new layout. This is the idea behind our reinduction algorithm. Our assumption can be formally stated as follows:

Assumption I: Let L(t) be a known label for page P(t) at time t. Then there exists s* > 0 such that n tuples of L(t) can be found in L(t + s*), for some small n.

It may be noted that the assumption only requires that the labels follow a common structure over a small interval, so that a few tuples can be identified with certainty. For longer time intervals, say 10s*, L(t) and L(t + s*) may not follow common structures. The reinduction algorithm requires that a lower bound on s* be known; it can be chosen sufficiently small by observing the modification frequency of the target web site.

Table 3.1: Algorithm ReInduceW
1. Let s be a time interval. Set t = 0.
2. Set E(t) = the example ⟨P(t), L(t)⟩, if E(t) is not already set (at t = 0, the user-provided example).
3. Call InduceW to learn a wrapper w ∈ W using E(t).
4. Output w as the wrapper at time t.
5. Set L(t) = w(P(t)), and L(t + s) = w(P(t + s)).
6. If Verify(w, ⟨P(t + s), L(t + s)⟩) = 'CHANGED'
7. Then Begin
8.     Set L′ = the tuples of L(t) found in P(t + s).
9.     Set E(t + s) = ⟨P(t + s), L′⟩.
10. End
11. Set t = t + s.
12. Go to Step 2.

3.3.2 Algorithm ReInduceW

We propose a generic wrapper reinduction algorithm for web sites that satisfy Assumption I. The algorithm, called ReInduceW, is presented in Table 3.1. ReInduceW is an iterative algorithm that uses two procedures, InduceW and Verify, to accomplish reinduction. The initialization is done in Step 1. In Step 2, the example at time t, E(t), is first set to the example page and label provided at time t = 0. Then InduceW is called to learn a wrapper in Step 3, which is output as the wrapper at time t in Step 4. Steps 5-6 correspond to wrapper verification. If a wrapper change has been detected, new example tuples are generated in Steps 8-9; if s = s*/2, then Assumption I ensures that at least n tuples will be found. In Steps 11-12, the algorithm increments to the next time step and goes on to induce a wrapper with the new examples. This process is repeated over intervals of size s to detect changes and reinduce the wrapper continually.

Algorithm ReInduceW crucially depends on the two procedures InduceW and Verify for successful reinduction. For Verify, which is used to detect wrapper changes, we employ a statistical method, e.g. [35]. However, the choice is not easy for InduceW, because the wrapper induction in Step 3 has to be performed using a small subset of the tuples in L(t). This is a problem of inducing wrappers from insufficient examples. Kushmerick [34] has partially investigated this problem under wrapper corroboration; his approach assumes that two or more example pages along with their (possibly incomplete) labels are available, and makes use of the redundancy in the data to perform induction. Our case, however, involves a single page and an incomplete label, and hence his method is not applicable. We call this problem Incremental Wrapper Induction, and propose new algorithms for incrementally inducing wrappers in the next section.

3.4 Incremental Wrapper Induction

In this section, we present wrapper classes and induction algorithms to learn these wrappers. We first consider the LR wrapper class and present an incremental induction algorithm.

3.4.1 LR Wrapper Class

Let the page P be a string over an alphabet Σ.
Consider the regular expression

    l_1 (˜∗) M_1 (˜∗) M_2 ··· (˜∗) M_{K-1} (˜∗) r_K        (3.1)

where l_1, r_K and the M_i are strings over Σ, and ˜∗ denotes the non-greedy wildcard match (.+?). [Footnote 1: Non-greedy matching attempts to match an asterisk wildcard only up to the first character that is not the same as the character immediately following the wildcard; it matches a minimum number of characters before failing. Greedy matching, in contrast, attempts to match the longest string possible. Parentheses (..) match whatever regular expression is inside them and indicate the start and end of a group; the contents of a group can be retrieved after a match has been performed.]

An LR wrapper is a procedure that applies this regular expression globally on a page and returns the pattern matches as a label. For example, an LR wrapper for page PA, shown in Figure 3.3(a), is indicated in Figure 3.3(b). The pattern matches, for example, the fragment <B>Jack</B>,<I>China</I><BR>, with the wildcards capturing 'Jack' and 'China'.

[Figure 3.3: Page PA and its corresponding LR Wrapper]
(a) Page PA:
Jack,China
John,USA
Joseph,UK

(b) Corresponding LR Wrapper: <B> (˜∗) </B>,<I> (˜∗) </I>

This class can be seen to be similar to the LR class of [34]; it is identical to LR when the intra-tuple separators are equal across all tuples, as, for example, in page PA. This makes the class rather simplistic, but it is discussed mainly for easier exposition of the ideas; later we will generalize them to the full LR class.

We first investigate the constraints under which an LR wrapper exists; these will be used to propose the wrapper induction algorithm. We follow the notation of [33]. Let the page P contain K attributes and M tuples, i.e. the label L has tuples L_m, m = 1, ..., M, and the size of each L_m is K. The A_{m,k} are the values of each attribute in each of page P's tuples; specifically, A_{m,k} is the value of the k-th attribute of the m-th tuple on page P, essentially the text fragments to be extracted from the page. The S_{m,k} are the separators between the attribute values in each of page P's tuples. There are four kinds of separators:

• Page P's head, denoted S_{0,0}: the substring of the page prior to the first attribute of the first tuple.

• Page P's tail, denoted S_{M,K}: the substring of the page following the last attribute of the last tuple.

• The intra-tuple separators S_{m,k}, which separate the attributes within a single tuple: S_{m,k} is the separator between the k-th and (k+1)-th attributes of the m-th tuple.

• The inter-tuple separators S_{m,K}, which separate consecutive tuples: S_{m,K} is the separator between the m-th and (m+1)-th tuples of page P.

We express these variables in terms of indices in label L. Let b_{m,k} and e_{m,k} respectively be the starting and ending locations of the k-th attribute in the m-th tuple. Then

    A_{m,k} = P[b_{m,k}, e_{m,k}]      (the attribute values)
    S_{m,k} = P[e_{m,k}, b_{m,k+1}]    (the intra-tuple separators)
    S_{m,K} = P[e_{m,K}, b_{m+1,1}]    (the inter-tuple separators)
    S_{0,0} = P[0, b_{1,1}]            (the head)
    S_{M,K} = P[e_{M,K}, |P|]          (the tail)

where k = 1, ..., K and m = 1, ..., M. A sample page PA and its corresponding values are shown in Figure 3.4(a), (b) and (c).

We define the constraints for an LR wrapper to exist:

Constraint C1 (r_K): (i) r_K is a prefix of S_{m,K}, and (ii) r_K is not a substring of A_{m,K}, for m = 1, ..., M.

Constraint C2 (l_1): l_1 is a proper suffix of the head S_{0,0} and of each inter-tuple separator S_{m,K}, m = 1, ..., M−1.

Constraint C3 (M_k): M_k = S_{m,k}, for 1 ≤ m ≤ M and 1 ≤ k ≤ K−1.

Constraints C1, C2 and C3 respectively define the validity constraints for r_K, l_1 and M_k. C1 specifies that r_K must be a prefix of the inter-tuple separators and the tail.
C2 specifies that l_1 must be a proper suffix of the inter-tuple separators and the head. C3 requires that M_k equal the corresponding intra-tuple separators. For easier exposition, we illustrate these constraints for our example page PA in Figure 3.4(d).

[Figure 3.4: Illustration of LR Constraints (⇓ denotes a line break)]

(a) HTML source for page PA:
<HTML><BODY>
<B>Jack</B>,<I>China</I><BR>
<B>John</B>,<I>USA</I><BR>
<B>Joseph</B>,<I>UK</I><BR>
<HR></BODY></HTML>

(b) Values for page PA:
m | A_{m,1} | S_{m,1}  | A_{m,2} | S_{m,2}
1 | Jack    | </B>,<I> | China   | </I><BR>⇓<B>
2 | John    | </B>,<I> | USA     | </I><BR>⇓<B>
3 | Joseph  | </B>,<I> | UK      | (tail, see (c))

(c) Head and tail for page PA:
S_{0,0} (head) = <HTML><BODY>⇓<B>
S_{3,2} (tail) = </I><BR>⇓<HR></BODY></HTML>

(d) LR constraints for page PA:
C1: r_K should be a prefix of </I><BR>⇓<B> and of </I><BR>⇓<HR></BODY></HTML>.
C2: l_1 should be a proper suffix of </I><BR>⇓<B> and of <HTML><BODY>⇓<B>.
C3: M_1 = </B>,<I>

Lemma 3.4.1: Given page P and label L, there exists an LR wrapper if and only if constraints C1, C2 and C3 are satisfied.

The proof involves showing that (i) if all the constraints are satisfied, then the regular expression matches will be correct, and (ii) if one of the constraints is not satisfied, then the attribute values will be extracted incorrectly.

Lemma 3.4.1 means that constraints C1, C2 and C3 are necessary and sufficient for the delimiters of an LR wrapper to remain valid. Hence, an LR wrapper induction algorithm need only consider delimiters that satisfy C1, C2 and C3. We use this idea in our induction algorithm, called InduceLR, presented in Table 3.2; a trace of the algorithm on our favorite example, page PA, is given in Table 3.3.

Table 3.2: Algorithm InduceLR
1. Set M_k = S_{m,k}, 1 ≤ k ≤ K−1.
2. Set p = the longest prefix common to the strings P[e_{m,K}, |P|].
3. Set s = the longest suffix common to the strings P[0, b_{m,1}].
4. If p = s then Set SEP = p
5. Else Set SEP = p ∗ s
6. End If
7. Set MID = (˜∗) M_1 (˜∗) M_2 ··· M_{K-1} (˜∗)
8. Apply the pattern MID.(SEP) on page P.
9. Set e_m = index of the m-th match.
10. Set e_M = index of the end of a subsequent match with (MID).
11. Apply the pattern (MID).SEP on page P.
12. Set b_m = index of the m-th match.
13. Set r_K = the longest prefix common to the strings P[e_m, |P|] satisfying C1-(i).
14. Set l_1 = the longest suffix common to the strings P[0, b_m] satisfying C2.
15. Output {l_1, r_K, M_1, ..., M_{K-1}}.

Table 3.3: Trace of Algorithm InduceLR for Page PA
Step 1. M_1 = </B>,<I>
Step 2. p = </I><BR>⇓<B>
Step 3. s = </I><BR>⇓<B>
Steps 4-6. Since p = s, SEP = </I><BR>⇓<B>
Step 7. MID = (˜∗) </B>,<I> (˜∗)
Step 8. Apply the pattern MID.(SEP) on page PA.
Step 9. e_m = indices of the m-th matches = offsets just after 'China' and 'USA'.
Step 10. e_M = index of the end of a subsequent match with (MID) = offset just after 'UK'.
Step 11. Apply the pattern (MID).SEP on page PA.
Step 12. b_m = indices of the m-th matches = offsets of 'J' in 'Jack', 'John' and 'Joseph'.
Step 13. r_K = longest common prefix of the strings P[e_m, |P|] satisfying C1 = </I><BR>⇓
Step 14. l_1 = longest common suffix of the strings P[0, b_m] satisfying C2 = ⇓<B>
Step 15. Wrapper w = ⇓<B> (˜∗) </B>,<I> (˜∗) </I><BR>⇓

The correctness of the algorithm is proved through the following lemma.

Lemma 3.4.2: Algorithm InduceLR explores all candidate delimiters that satisfy the constraints C1, C2 and C3.

Proof: Note that constraint C3 directly specifies the values of M_k; hence Step 1 of InduceLR trivially considers all candidates of M_k satisfying C3. We next prove that all candidates of r_K as specified by C1 are considered. Since L includes all tuples, by the construction of SEP in Steps 2-6, there will be M−1 matches in Steps
We trace algorithm InduceL̄R̄ on our running example, page PA, in Table 3.3.

    Step 1.     M_1 = </B>, <I>
    Step 2.     p = </I><BR>⇓<B>
    Step 3.     s = </I><BR>⇓<B>
    Steps 4-6.  Since p = s, SEP = </I><BR>⇓<B>
    Step 7.     MID = (~*) </B>, <I> (~*)
    Step 8.     Apply the pattern MID.(SEP) on page PA.
    Step 9.     e_m = offsets just after 'China' and 'USA'.
    Step 10.    e_M = index of the end of a subsequent match with (MID) = offset just after 'UK'.
    Step 11.    Apply the pattern (MID).SEP on page PA.
    Step 12.    b_m = offsets of 'J' in 'Jack', 'John' and 'Joseph'.
    Step 13.    r_K = longest common prefix of the strings P[e_m, |P|] satisfying C1 = </I><BR>⇓
    Step 14.    l_1 = longest common suffix of the strings P[0, b_m] satisfying C2 = ⇓<B>
    Step 15.    Output wrapper w = ⇓<B> (~*) </B>, <I> (~*) </I><BR>⇓

    Table 3.3: Trace of algorithm InduceL̄R̄ for page PA

The correctness of the algorithm is proved through the following lemma.

Lemma 3.4.2 Algorithm InduceL̄R̄ explores all candidate delimiters that satisfy the constraints C1, C2 and C3.

Proof: Note that constraint C3 directly specifies the values of M_k. Hence, Step 1 of InduceL̄R̄ trivially considers all candidates of M_k satisfying C3. We next prove that all candidates of r_K as specified by C1 are considered. Since L includes all tuples, by the construction of SEP in Steps 2-6, there will be M-1 matches in Steps 8-9, and the M-th match is found in Step 10. This implies that Step 13 considers all candidates of r_K that satisfy C1. Similarly, it can be proved that all candidates of l_1 as specified by C2 are considered in Step 14.

By Lemma 3.4.2, InduceL̄R̄ enumerates all valid delimiters of the wrapper satisfying constraints C1, C2 and C3. When a valid candidate is found for each delimiter, the algorithm ends by outputting the learnt wrapper. If the page is L̄R̄-wrappable, then the learnt wrapper will be correct by Lemma 3.4.1. Thus, we have

Theorem 3.4.1 Given page P and label L, if P is L̄R̄-wrappable, then algorithm InduceL̄R̄ will output an L̄R̄ wrapper consistent with L.

By Theorem 3.4.1, InduceL̄R̄ will output a correct wrapper for any ⟨P, L⟩, provided an L̄R̄ wrapper exists. Note that the LR learning algorithm learnLR [34] can also achieve this result. However, a key property of InduceL̄R̄ is that not all tuples in L are strictly required (as proved below) to establish consistency. This is a significant difference from learnLR: the latter will fail if, for example, the first tuple is not provided, because the proper suffix constraint on l_1 will no longer hold.

Consider L' ⊂ L, i.e. L' = {L_{m_1}, L_{m_2}, ..., L_{m_{M'}}} where M' < M. Let p' be the longest prefix common to the strings following the labelled tuples and s' the longest suffix common to the strings preceding them. Note that both p' and s' will be non-empty if an L̄R̄ wrapper exists. Define

Constraint CS: every inter-tuple separator S_{m,K}, m = 1, ..., M-1, matches the regular expression
• "p'", if p' = s', or
• "p' * s'", if p' ≠ s'.

Lemma 3.4.3 Given ⟨P, L'⟩ as the example, Theorem 3.4.1 holds when CS is satisfied.

Proof: It suffices to prove that all candidate delimiters are considered. Since L' has at least one tuple, Step 1 trivially considers all candidates of M_k. With ⟨P, L'⟩ as the example, if S_{m,K} is as in the lemma, then Step 10 will match all inter-tuple separators and hence result in M-1 matches. This means that Steps 11 and 12 will find e_m = e_{m,K}. Hence, Step 13 considers all candidates of r_K satisfying constraint C1. The proof is similar for l_1.

Note that, when L' = L, constraint CS is satisfied by arguments similar to those in Lemma 3.4.2. When |L'| < |L|, CS need not hold in general: a page and example tuples can be maliciously chosen such that p' or s' is too long and at least one S_{m,K} fails to match the regular expression. Then InduceL̄R̄ will fail to consider some candidate delimiters and will over-generalize: the resulting l_1 and r_K will be longer than the correct values, and only a subset of the tuples will be extracted. As the size of the subset L' increases, the candidates identified asymptotically approach the correct values; if the tuples are not maliciously chosen, we expect this to happen already for small |L'|, because of the structure of L̄R̄. Obtaining exact bounds on the sample complexity would require an analysis of InduceL̄R̄ under the PAC learning framework, as in [33].

The following subsection extends the results for L̄R̄ to LR.

3.4.2 LR Wrapper Class

Recall that L̄R̄ differs from LR in assuming that the intra-tuple separators are equal across tuples (constraint C3). Here, we relax this constraint and propose a generalized induction algorithm for learning LR. An LR wrapper is defined by the regular expression

    l_1 (~*) r_1 * l_2 (~*) r_2 * ... l_K (~*) r_K        (3.2)

where the l_k, r_k are strings over Σ, and (~*), * respectively denote the non-greedy wildcard matches .+? and .*?.
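To preview what the extra per-attribute delimiters buy, here is a small Python sketch applying an LR wrapper of the form (3.2); the page string anticipates the modified page of Figure 3.5 below and is our own stand-in, since its intra-tuple separators vary across tuples and no single M_1 exists.

```python
import re

# A page in the style of Figure 3.5: the intra-tuple separators differ
# across tuples (XXXXX / YYYYY / ZZZZZ), so no L-bar-R-bar wrapper fits.
page = ("<HTML><BODY>\n<B>Jack</B>XXXXX<I>China</I><BR>"
        "\n<B>John</B>YYYYY<I>USA</I><BR>"
        "\n<B>Joseph</B>ZZZZZ<I>UK</I><BR>\n<HR></BODY></HTML>")

# LR wrapper l_1 (~*) r_1 * l_2 (~*) r_2, with (~*) = .+? and * = .*?;
# the lazy star bridges the varying separator between r_1 and l_2.
w = (re.escape("<B>") + "(.+?)" + re.escape("</B>") + ".*?" +
     re.escape("<I>") + "(.+?)" + re.escape("</I>"))
print(re.findall(w, page))
# -> [('Jack', 'China'), ('John', 'USA'), ('Joseph', 'UK')]
```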
The constraints for LR can be defined as follows:

Constraint C1 (r_k): i) r_k is a prefix of S_{m,k}, and ii) r_k is not a substring of A_{m,k}, for m = 1, ..., M.

Constraint C2 (l_k): l_k is a proper suffix of S_{m,k-1}, for m = 1, ..., M.

The LR induction algorithm InduceLR generalizes InduceL̄R̄. Since there are now two delimiters to be learned for each intra-tuple region, Step 1 of InduceL̄R̄ needs to be modified. The generalized algorithm is presented in Table 3.4. To illustrate it, we modify our page PA by inserting random characters between the tags, giving the page PAm of Figure 3.5.

    <HTML><BODY>PEOPLE
    <B>Jack</B>XXXXX<I>China</I><BR>
    <B>John</B>YYYYY<I>USA</I><BR>
    <B>Joseph</B>ZZZZZ<I>UK</I><BR>
    <HR></BODY></HTML>

    Figure 3.5: HTML source for the modified page PAm

Note that there are now K separator patterns SEP_k to find; they are constructed in Steps 1-7. In the second pass, Steps 9-12 find the endings of the attribute values and Steps 13-16 find the beginnings. The candidate delimiters are then explored in Steps 17-20 to generate the correct wrapper.

    1.  Set p_k = the longest prefix common to the strings S_{m,k}, 1 ≤ k ≤ K-1.
    2.  Set p_K = the longest prefix common to the strings P[e_{m,K}, |P|].
    3.  Set s_1 = the longest suffix common to the strings P[0, b_{m,1}].
    4.  Set s_k = the longest suffix common to the strings S_{m,k-1}, 2 ≤ k ≤ K.
    5.  If p_k = s_k then
    6.    Set SEP_k = p_k
    7.  Else set SEP_k = p_k * s_k. End If
    8.  Set MID = (~*) SEP_1 (~*) SEP_2 ... SEP_{K-1} (~*).
    9.  Apply the pattern MID.(SEP_K) on page P.
    10. Set e_{m,k} = the index of the m-th match of SEP_k.
    11. Set e_{M,k} = the index of a subsequent match of SEP_k with (MID).
    12. Set e_{M,K} = the index of the end of this match.
    13. Apply the pattern (MID).SEP_K on page P.
    14. Set b_m = the index of the m-th match.
    15. Set b_{m,1} = the index of the start of the m-th match.
    16. Set b_{m,k} = the index of the end of the m-th match of SEP_{k-1}.
    17. Set r_k = the longest prefix common to the strings P[e_{m,k}, b_{m,k+1}] satisfying C1-(i).
    18. Set r_K = the longest prefix common to the strings P[e_{m,K}, |P|] satisfying C1-(i).
    19. Set l_1 = the longest suffix common to the strings P[0, b_{m,1}] satisfying C2.
    20. Set l_k = the longest suffix common to the strings P[e_{m,k-1}, b_{m,k}] satisfying C2.
    21. Output {l_1, r_1, ..., l_K, r_K}.

    Table 3.4: Algorithm InduceLR

The trace of this algorithm is similar to that of InduceL̄R̄. The difference lies in the determination of the intra-tuple separators, which is Step 1 of InduceL̄R̄ and corresponds to Steps 1-6 of InduceLR. For our example in Figure 3.5, Steps 1 and 4 determine p_1 = </B> and s_1 = <I>, so that SEP_1 becomes </B> * <I>. The rest of the steps remain the same as for InduceL̄R̄.
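The first pass of Table 3.4, the construction of the SEP_k patterns, can be sketched in code as follows for the page of Figure 3.5; the separator strings are transcribed from the figure and the helper routines repeat those of the earlier sketch.

```python
import re

def lcp(strs):
    # Longest common prefix (Steps 1-2 of Table 3.4).
    out = strs[0]
    for s in strs[1:]:
        while not s.startswith(out):
            out = out[:-1]
    return out

def lcs(strs):
    # Longest common suffix (Steps 3-4 of Table 3.4).
    out = strs[0]
    for s in strs[1:]:
        while not s.endswith(out):
            out = out[:-1]
    return out

# Intra-tuple separator regions S_{m,1} of page PAm in Figure 3.5.
seps = ["</B>XXXXX<I>", "</B>YYYYY<I>", "</B>ZZZZZ<I>"]

p1, s1 = lcp(seps), lcs(seps)   # p_1 = "</B>", s_1 = "<I>"
# Steps 5-7: since p_1 != s_1, bridge them with the lazy star.
sep1 = re.escape(p1) + ".*?" + re.escape(s1) if p1 != s1 else re.escape(p1)
print(sep1)                     # e.g. </B>.*?<I>  (SEP_1 of the trace)
```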
As in Section 3.4.1, we can prove:

Theorem 3.4.2 Given page P and label L, if P is LR-wrappable, then algorithm InduceLR will output an LR wrapper consistent with L.

We next analyze the capability of InduceLR to learn from a subset of tuples. Consider L' ⊂ L, i.e. L' = {L_{m_1}, L_{m_2}, ..., L_{m_{M'}}} where M' < M. Let p'_k be the longest prefix common to the S_{m_i,k} and s'_k the longest suffix common to the S_{m_i,k-1}. Note that p'_k and s'_k will be non-empty if an LR wrapper exists. Define

Constraint CS': S_{m,k}, m = 1, ..., M-1, matches the regular expression
• "p'_k", if p'_k = s'_k, or
• "p'_k * s'_k", if p'_k ≠ s'_k,
for k = 1, ..., K-1.

Lemma 3.4.4 Given ⟨P, L'⟩ as the example, Theorem 3.4.2 holds when CS' is satisfied.

Proof: We prove that all candidate delimiters for r_2 are considered; the argument extends to r_k, k ≠ K, and to l_k, k ≠ 1, while the proof for l_1 and r_K can be given as in Lemma 3.4.3. With ⟨P, L'⟩ as the example, Steps 1 and 4 find p'_2 and s'_2. If S_{m,2} is as in the lemma, then Step 10 will match all S_{m,2}, m = 1, ..., M-1. Hence, Step 17 considers all candidates of r_2 satisfying constraint C1.

By Lemma 3.4.4, InduceLR is correct asymptotically as |L'| → |L|. This property enables InduceLR to be used in Step 3 of ReInduceW.

We can thus derive a wrapper reinduction algorithm for LR, ReInduceLR. For Verify, we use the assumption that layout changes are drastic: when they occur, the wrapper fails completely and extracts nothing from the page, i.e. it returns null values. The experimental studies of ReInduceLR are provided in the next chapter.

3.4.3 LRRE Wrapper Class

The LR wrapper class can cover up to 54% of web sites [34]. To further improve on the expressiveness of the LR class, we consider a new wrapper class called LRRE, defined below. Let the page P be a string over alphabet Σ, and consider the regular expression

    l_1 (~* !~ l_1) r_1 * l_2 (~*) r_2 * ... l_K (~*) r_K

where the l_k, r_k are strings over Σ, and (~*), * respectively denote the non-greedy wildcard matches .+? and .*?. The term (~* !~ l_1) means that the first attribute matches only when its value does not contain l_1 as a substring. Though not strictly a regular expression, we follow this notation for convenience, and call this term the l_1-constraint.

Definition: An LRRE wrapper is a procedure that applies this regular expression globally on a page and returns the matched values as a label.

Note that an LRRE wrapper is defined by 2K delimiters l_1, r_1, ..., l_K, r_K. LRRE can thus be seen to be similar to the LR class above, which is also defined by 2K delimiters; the difference is the l_1-constraint.

    (a) Page PA: as in Figure 3.3(a)

    l_1 = <B>   r_1 = </B>   l_2 = <I>   r_2 = </I><BR>

    (b) LR wrapper for page PA

    <B> (~* !~ "<B>") </B>, <I> (~*) </I><BR>

    (c) LRRE wrapper for page PA

    Figure 3.6: Page PA and its corresponding LR and LRRE wrappers

Given an LR wrapper l_1, r_1, ..., l_K, r_K, we can construct an LRRE wrapper by setting l'_k = l_k and r'_k = r_k. This LRRE wrapper will be correct because l_1 satisfies the proper suffix constraint [34]; hence LRRE subsumes LR. For example, consider the page PA shown in Figure 3.6(a). An LR wrapper [34] for this page is indicated in Figure 3.6(b), and the LRRE wrapper derived from these delimiters, shown in Figure 3.6(c), is correct for PA.

In fact, LRRE can handle pages that LR cannot. Consider the modified page of Figure 3.7, where a title 'PEOPLE', formatted in bold, has been added to the page. Because of the proper suffix constraint, it can be shown that no LR wrapper exists for this page (which is actually in HLRT). However, it can be handled using the same LRRE wrapper used for PA in Figure 3.6(c): the l_1-constraint relaxes the proper suffix constraint.

    <HTML><BODY><B>PEOPLE</B><BR>
    <B>Jack</B>, <I>China</I><BR>
    <B>John</B>, <I>USA</I><BR>
    <B>Joseph</B>, <I>UK</I><BR>
    <HR></BODY></HTML>

    Figure 3.7: HTML source for the modified page PA

In addition, LRRE can be seen to cover pages not handled by HLRT and OCLR. We illustrate this using the example pages discussed in [33], listed in Table 3.5. All pages consist of two attributes (K = 2), and the values to be extracted are A11, A12, A21, A22, A31 and A32. Interestingly, all seven pages in Table 3.5 can be handled by the single simple LRRE wrapper [ (~* !~ "[") ] ( (~*) ).

    No.  Example page                              Handled by          Not handled by
    1.   -[h[A11](A12)[A21](A22)[A31](A32)t        HLRT, HOCLRT        LR, OCLR
    2.   [h[A11](A12)h-[A21](A22)h[A31](A32)t      HLRT, OCLR, HOCLRT  LR
    3.   o[ho[A11](A12)cox[A21](A22)co[A31](A32)c  HOCLRT              LR, HLRT, OCLR
    4.   ho[A11](A12)cox[A21](A22)co[A31](A32)c    LR, OCLR, HOCLRT    HLRT
    5.   [A11](A12)t[A21](A22)[A31](A32)t          LR, OCLR            HLRT, HOCLRT
    6.   x[o[A11](A12)o[A21](A22)ox[A31](A32)      OCLR, HOCLRT        LR, HLRT
    7.   [ho[A11](A12)cox[A21](A22)co[A31](A32)c   HOCLRT              LR, HLRT, OCLR

    Table 3.5: Expressiveness of LRRE
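Standard regex engines do not support the !~ operator directly, but the l_1-constraint can be emulated; the sketch below (names ours) uses a negative-lookahead loop so that the first group is built only from characters that do not begin another "[", which is exactly (~* !~ "["). It is run on example page 2 of Table 3.5.

```python
import re

# The simple LRRE wrapper [ (~* !~ "[") ] ( (~*) ) of Table 3.5, with the
# l_1-constraint emulated by a negative lookahead inside the first group.
w = re.compile(r"\[((?:(?!\[).)+?)\]\((.+?)\)")

# Example page 2 of Table 3.5: head and inter-tuple noise contain h, -, t.
page = "[h[A11](A12)h-[A21](A22)h[A31](A32)t"
print(w.findall(page))
# -> [('A11', 'A12'), ('A21', 'A22'), ('A31', 'A32')]
```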
Thus, we conclude that LRRE can handle a rich subclass of HOCLRT, a class observed to handle 57% of web sites [35]. Note that the pages in Table 3.5 have intra-tuple separators that are equal across tuples; this property ensures that the l_1-constraint correctly helps LRRE handle such rich wrappers. When this property does not hold, we cannot guarantee that pages in, e.g., HLRT can be handled.

An LRRE wrapper can be learnt in the same way as an LR wrapper; the difference lies only in the way the wrapper extracts data from a page. We first present InduceLRRE in Table 3.6. Let ⟨P, L⟩ be an example, and let L_j, j = 1, ..., n, be the content tuples of L, each of size K. Our idea is to learn w ∈ W_LRRE from the L_j such that w(P) = L.

    1. Set E = the example ⟨P, L⟩.
    2. Call InduceLR to learn a wrapper w using E.
    3. Output w.

    Table 3.6: Algorithm InduceLRRE

To illustrate how the l_1-constraint is implemented, Table 3.7 shows how an LRRE wrapper is used to extract the content of a web page. The algorithm scans the page repeatedly to find further strings matching the LRRE regular expression. Step 4 extracts the first string matched by the wrapper, which is the first tuple on the page, and Steps 6-7 extract each attribute of the tuple. Steps 8-13 implement the check for the l_1-constraint: in Step 9 we test whether the first attribute A_{m,1} of this tuple has l_1 as a substring; if so, Steps 10-12 find the shortest suffix of A_{m,1} that does not contain l_1 and set this as the value of A_{m,1}. Step 14 returns the values of all attributes for this tuple. From the end of this tuple, we resume scanning for more tuples (Steps 15-16), until no more tuples are found on the page.

    1.  Set P = the input page.
    2.  Set m = 0.
    3.  Apply the pattern w on page P.
    4.  Set matched = the first matched pattern.
    5.  Set m = m + 1.
    6.  For i = 1 to K:
    7.    A_{m,i} = the value of the i-th wildcard in matched.
    8.  Let e_{m,K} = the index of the end of A_{m,K}.
    9.  If A_{m,1} matches l_1, then
    10.   b_{l_1} = the index of the last match of l_1 in A_{m,1}.
    11.   e_{l_1} = the index of the end of this match.
    12.   A_{m,1} = substring(A_{m,1}, e_{l_1}, |A_{m,1}|).
    13. End If.
    14. Output {A_{m,1}, A_{m,2}, ..., A_{m,K}}.
    15. Set P = substring(P, e_{m,K}, |P|).
    16. Go to 3.

    Table 3.7: Algorithm ExtractLRRE

Now that we have described wrapper classes that learn from a few examples, these can be called in the ReInduce function to implement a wrapper reinduction algorithm.
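A hedged Python rendering of ExtractLRRE for K = 2 is given below; for compactness the wrapper is written with the bridged middle region r_1 * l_2 passed as a single string (mid), and all names are ours. Applied to the page of Figure 3.7, the l_1-constraint check of Steps 8-13 trims the spurious 'PEOPLE...' prefix off the first attribute.

```python
import re

def extract_lrre(page, l1, mid, rk):
    # Sketch of ExtractLRRE (Table 3.7) for K = 2; DOTALL lets the
    # wildcards span line breaks in the page.
    w = re.compile(re.escape(l1) + "(.+?)" + re.escape(mid) +
                   "(.+?)" + re.escape(rk), re.DOTALL)
    tuples = []
    while True:
        m = w.search(page)              # Steps 3-4: next match, if any
        if m is None:
            break
        a1, a2 = m.group(1), m.group(2)
        # Steps 8-13: the l_1-constraint check -- if A_{m,1} contains l_1,
        # keep only the shortest suffix free of l_1.
        i = a1.rfind(l1)
        if i != -1:
            a1 = a1[i + len(l1):]
        tuples.append((a1, a2))         # Step 14
        page = page[m.end(2):]          # Step 15: rescan from e_{m,K}
    return tuples

# The modified page of Figure 3.7, whose bold 'PEOPLE' title defeats LR.
page = ("<HTML><BODY><B>PEOPLE</B><BR>\n"
        "<B>Jack</B>, <I>China</I><BR>\n"
        "<B>John</B>, <I>USA</I><BR>\n"
        "<B>Joseph</B>, <I>UK</I><BR>\n<HR></BODY></HTML>")
print(extract_lrre(page, "<B>", "</B>, <I>", "</I><BR>"))
# -> [('Jack', 'China'), ('John', 'USA'), ('Joseph', 'UK')]
```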
3.5 Summary

In this chapter, we introduced a novel approach to wrapper reinduction from web pages whose layouts may change over time. Using the observation that, though the layout may change drastically and none of the syntactic features may be retained, the page content usually changes only incrementally provided the time interval is small enough, we presented a reinduction algorithm that can be used to implement an automatic information extraction system. The examples have to be provided only once; thereafter the system extracts data using the wrappers and, if need be, repairs them. We introduced the L̄R̄ and LR wrapper classes and the respective induction algorithms, which learn from a few examples and can be deployed in the reinduction system. We also introduced another wrapper class, LRRE, to learn wrappers more expressive than the LR class. In the next chapter we present the experimental evaluation of our algorithms.

Chapter 4

Experiments

This chapter presents the empirical evaluation of the algorithms InduceLR and ReInduceLR from the previous chapter. The experiments were conducted in two parts. In the first part we study InduceLR, evaluating it on sample cost and induction cost: sample cost estimates the number of examples needed for the algorithm to reach high accuracy, while induction cost measures the time taken by the algorithm as the number of examples is varied. In the second part we study the performance of ReInduceLR. All experiments were performed on a Sun UltraSPARC workstation.

4.1 Performance of InduceLR

We considered five categories of websites and picked a representative site from each category, as listed in Table 4.1; screenshots from these websites are given in Appendix A. For these sites, all attributes handled by LR were chosen. The details of the webpages considered are listed in Table 4.2.

    No.  Category            Website       Link
    1    Shopping            Amazon        http://www.amazon.com
    2    Search Engine       Google        http://www.google.com
    3    Publication Server  USPTO         http://patft.uspto.gov/netahtml/search-adv.htm
    4    Whitepages          Yahoo People  http://people.yahoo.com
    5    News Website        ZDNET         http://www.zdnet.com/

    Table 4.1: Websites considered for the evaluation of InduceLR

    No.  Site          Query                 No. of tuples  Attributes
    1    Amazon        Books on Web Mining   10             Title, Author, List Price
    2    Google        Query = Web Mining    10             URL, Title, Summary
    3    USPTO         Title = Text Mining   23             Patent Number, Title
    4    Yahoo People  Last Name = 'John'    10             Name, Address, Phone Number
    5    ZDNET         The news for the day  14             URL, Headline, Summary

    Table 4.2: Details of the webpages

4.1.1 Sample cost

For each website listed in Table 4.2, all tuples present within the page were extracted to create a database. We then randomly selected M tuples, for the values of M shown in Table 4.3, and passed them to InduceLR. Once the wrapper was generated, it was used to extract the contents of the same page, and the performance was measured using the metrics of precision and recall. Recall (R) is the percentage of correctly extracted data items among all the data items that should be extracted; precision (P) is the percentage of correctly extracted data items among all the data items that were extracted. For instance, if 10 tuples are present on the page and the wrapper returns 5 tuples of which only 2 are correct, then the precision is 2/5 and the recall is 2/10. For each value of M, we performed 20 runs to compute average precision and recall. The results for precision and recall are listed in Table 4.3.
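Before turning to the table, the two metrics can be pinned down with a small helper; the tuple values below are fabricated to mirror the 2-out-of-5 example above.

```python
def precision_recall(extracted, truth):
    # Precision = correct / extracted; Recall = correct / to-be-extracted.
    correct = len(set(extracted) & set(truth))
    return correct / len(extracted), correct / len(truth)

extracted = ["t1", "t2", "x1", "x2", "x3"]   # 5 tuples returned, 2 correct
truth = ["t1", "t2", "t3", "t4", "t5",
         "t6", "t7", "t8", "t9", "t10"]      # 10 tuples on the page
print(precision_recall(extracted, truth))    # -> (0.4, 0.2)
```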
                  M = 2      M = 3      M = 4      M = 5      M = 10
    Site          %P   %R    %P   %R    %P   %R    %P   %R    %P   %R
    Amazon        84   78    90   84    95   92    100  100   100  100
    Google        90   84    95   92    100  100   100  100   100  100
    USPTO         100  90    100  90    100  100   100  100   100  100
    Yahoo People  100  100   100  100   100  100   100  100   100  100
    ZDNET         100  95    100  100   100  100   100  100   100  100

    Table 4.3: Precision and recall of InduceLR

We observe that precision is above 84% with just two example tuples, though recall can be somewhat lower (as in the case of Amazon). This is because 2 randomly selected tuples may not be representative of the entire set: two tuples can share a common substring, which can bias the wrapper. For example, books on Amazon are classified as either 'Paperback' or 'Hardcover' editions; when learning with two examples there is a possibility of generating a biased wrapper that extracts only one of the two, though precision still remains high. However, both precision and recall reach over 92% with 4 tuples and 100% with 5 tuples. Hence, if four to five examples are retained on the page, this suffices to reinduce the wrapper correctly most of the time. It is worth noting that Amazon is a difficult page to tackle, since it additionally has many extraneous elements like book excerpts and advertisements within the page: DataProg claims only 70% accuracy for the learnt wrapper rules on Amazon, and ROADRUNNER was unable to extract any results from the Amazon Music Bestsellers page. This supports our argument that content-based features and searching the wrapper space may not suffice to handle real pages effectively. It is, however, highly likely that a few examples will be retained in a changed page, so that the new wrapper can be induced from them. A similar case occurs with Google: some of the Google URLs are listed with 'www' and some without, which causes a bias when training with 2 examples. Again, 4 examples are sufficient to give near-perfect precision and recall.

4.1.2 Induction cost

Since the learning algorithm has to be deployed in a practical system, it is important that it be fast. We performed a detailed study of the time needed to learn the wrapper, varying the number of examples at each step.

    Site       M = 2   M = 3   M = 4   M = 5   M = 10
    Amazon     0.318   0.296   0.322   0.328   0.377
    Google     0.104   0.104   0.100   0.112   0.124
    USPTO      0.162   0.168   0.176   0.182   0.232
    Y! People  0.542   0.556   0.570   0.592   0.690
    ZDNET      0.148   0.158   0.154   0.164   0.182

    Table 4.4: Induction time of InduceLR (seconds)

From Table 4.4, we see that the algorithm takes less than a second for induction. The time complexity is affected by two factors:

1. The number of attributes present in each tuple. Generalizing the prefix and suffix for each intra-tuple delimiter is a time-intensive routine: after each character is added, the candidate must be checked against the remaining strings until it fails. Hence, if the intra-tuple and inter-tuple delimiters are long matching strings, the generalization step, and therefore the induction, is slower.

2. The results could suggest that locating the examples on the page is an expensive step. However, this seems unlikely, since we use a simple regular expression match.

A harness of the shape sketched below could reproduce these measurements.
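This is only a sketch under our assumptions: `induce` stands for any of the induction routines sketched earlier, and the sampling scheme mirrors the 20-run averaging described in Section 4.1.1.

```python
import random
import time

def average_induction_time(induce, page, label, m, runs=20):
    # Time `induce` on m randomly drawn example tuples (kept in page
    # order, as the induction routines expect), averaged over `runs`.
    total = 0.0
    for _ in range(runs):
        sample = random.sample(label, m)
        sample.sort(key=lambda t: page.index(t[0]))
        start = time.perf_counter()
        induce(page, sample)
        total += time.perf_counter() - start
    return total / runs
```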
It can also be seen that, for Amazon, the time taken to learn from 3 examples is less than that for 2 examples. This, too, may be due to the biasing discussed earlier: Amazon gives low precision and recall when learning from 2 examples, which means the algorithm tries to generalize using falsely located examples.

4.2 Performance of ReInduceLR

For evaluating reinduction, we considered the Whitepages domain, extracting examples from the Yahoo People Search site. To observe a sufficiently large number of page changes, we simulated a dynamic people-search web site as follows. We grabbed the templates from four other popular Whitepages sites, listed in Table 4.5, and separated the top and the bottom of each page: the top is the part before the tuple listing begins, and the bottom is the part after the last tuple is listed. We also extracted the template by which each tuple is formatted. At each iteration we used these templates to modify the layout randomly. A top and bottom are always used together as a pair, but the tuple formatting can vary independently; for example, the Yahoo top and bottom could be used with the tuple formatting of WhoWhere.com. With 5 top-bottom templates and 5 tuple formats there are 25 layouts, and hence 625 possible layout transitions. We also accumulated data from the Yahoo site and used it to randomly amend the tuples in the page.

    Website                          Link
    People Search (Yahoo)            http://people.yahoo.com
    WhoWhere (Lycos)                 http://whowhere.com
    Switchboard (Infospace)          http://switchboard.com
    Whitepages.com (W3 Data, Inc.)   http://whitepages.com
    AnyWho Online Directory (AT&T)   http://anywho.com

    Table 4.5: Whitepages websites considered for the evaluation of ReInduceLR

ReInduceLR was trained at t = 0, and the performance of the wrappers generated thereafter was evaluated. The metrics chosen were precision, recall and time; precision and recall retain their previous definitions and are measured when the new wrapper is applied to the changed page. The results are given in Table 4.6.

    Layout changes  Precision %  Recall %  Time (s)
    100             100.00       98.90     1.0937
    200             100.00       99.60     1.1029
    300             98.36        99.60     1.1265
    400             99.25        97.41     1.1768
    500             99.12        96.34     1.1832

    Table 4.6: Performance of ReInduceLR

Over 100-500 layout changes, we observe that the algorithm performs at near-perfect precision and recall, taking a little over a second for each reinduction step.

    Algorithm  % Precision  % Recall
    SG-WRAM    89.5         90.5
    DataProg   90           80

    Table 4.7: Average precision and recall of existing approaches

For comparison, Table 4.7 lists the precision and recall of DataProg and SG-WRAM; the corresponding values for ROADRUNNER are not available. Though these results were obtained on a different corpus, we observe that none of these systems achieves both precision and recall as high as ours. We thus conclude that our method can be more effective for wrapper reinduction under common layout changes.
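Returning briefly to the experimental setup, the layout-change simulation of Section 4.2 could be sketched as follows. Every template string here is an invented stand-in for the tops, bottoms and tuple formats grabbed from the five sites, so only the shape of the generator is meaningful.

```python
import random

# Five invented (top, bottom) template pairs and five tuple formats;
# choosing them independently gives 25 layouts and 625 transitions.
TOPS_BOTTOMS = [(f"<HTML><BODY><H1>Results v{i}</H1>\n",
                 f"\n<HR><I>footer v{i}</I></BODY></HTML>")
                for i in range(5)]
TUPLE_FORMATS = ["<B>{name}</B>, <I>{addr}</I>, {phone}<BR>\n",
                 "<LI>{name} | {addr} | {phone}\n",
                 "<TR><TD>{name}<TD>{addr}<TD>{phone}\n",
                 "<P>{name} -- {addr} -- {phone}\n",
                 "<DIV>{name}; {addr}; {phone}</DIV>\n"]

def random_layout(tuples):
    # A top and bottom are used as a pair; the tuple format varies freely.
    top, bottom = random.choice(TOPS_BOTTOMS)
    fmt = random.choice(TUPLE_FORMATS)
    rows = "".join(fmt.format(name=n, addr=a, phone=p) for n, a, p in tuples)
    return top + rows + bottom

page = random_layout([("John Doe", "Singapore", "555-1234"),
                      ("Jane Roe", "Bedok", "555-5678")])
```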
Chapter 5

Conclusions

In this thesis we investigated wrapper induction from web sites whose layout may change over time. We formulated the reinduction problem and identified wrapper induction from an incomplete label as a key problem to be solved. We proposed a novel algorithm for incrementally inducing LR wrappers and showed that it asymptotically identifies the correct wrapper as the number of tuples is increased. This property was used to propose an LR wrapper reinduction algorithm, which requires examples to be provided exactly once; thereafter the algorithm detects layout changes and reinduces wrappers automatically, so long as the changed wrappers remain in LR. In experimental studies, we observed that the reinduction algorithm achieves near-perfect performance. We also introduced a new class of wrappers, LRRE, to learn wrappers more expressive than the LR wrapper class.

The contributions of this thesis can be summarized as follows:

1. We identified wrapper induction from insufficient examples as a key step in handling wrapper reinduction from incrementally changing web pages.

2. We proposed novel algorithms for incrementally inducing wrappers. These algorithms can learn from as few as two examples, are efficient, and are independent of schema, content-based patterns and the tag structure of the page. We also showed that the algorithms asymptotically identify the correct wrapper as the number of examples is increased.

3. Based on our induction algorithms, we developed a practical automatic reinduction system. The system needs a small number of examples to be provided initially; after that, it verifies and repairs wrappers automatically when layout changes occur. In comparison to existing approaches, our algorithm is more efficient and eliminates the need to search large wrapper spaces.

Our work is based on the motivation that the key to effective learning is to bias the learning algorithm [46]. However, this is also a limitation: the LR class is capable of covering only 53% of common layouts [34]. It would be interesting to extend our approach to handle richer classes such as HOCLRT, and nested wrapper classes like N-LR.

Appendix A

Websites considered for the Evaluation of InduceLR

    Figure A.1: Screenshot from Amazon.com
    Figure A.2: Screenshot from Google.com
    Figure A.3: Screenshot from uspto.gov
    Figure A.4: Screenshot from Yahoo People Search
    Figure A.5: Screenshot from ZDNet.com

Appendix B

Regular Expression Syntax

A regular expression (or RE) specifies a set of strings that matches it. Regular expressions can contain both special and ordinary characters: special characters either stand for classes of ordinary characters or affect how the regular expressions around them are interpreted. Some important special characters are:

"." (Dot.) In the default mode, this matches any character except a newline.

"*" Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's; a.*b will match 'ab', 'acb', 'ad3b', etc.

"+" Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

"?" Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

The "*", "+" and "?" qualifiers described above are all greedy: they match as much text as possible. Sometimes this behaviour is not desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the qualifier makes it perform the match in a non-greedy or minimal fashion, so that as few characters as possible are matched: using <.*?> in the previous expression will match only '<H1>'. "+?" and "??" are the other non-greedy (or lazy) qualifiers.

(...) matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed. (.*?) will therefore extract the value of a wildcard as a group.
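The greedy/non-greedy distinction above can be checked directly in any Perl-style engine; in Python, for instance, a quick sketch of the examples in this appendix is:

```python
import re

text = "<H1>title</H1>"
print(re.findall(r"<.*>", text))    # greedy: ['<H1>title</H1>']
print(re.findall(r"<.*?>", text))   # non-greedy: ['<H1>', '</H1>']

print(re.match(r"ab*", "abbb").group())    # 'abbb' -- * is greedy
print(re.match(r"ab+?", "abbb").group())   # 'ab'   -- +? is lazy
```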
Bibliography

[1] S. Abiteboul. Querying semi-structured data. In International Conference on Database Theory, pages 1-18, 1997.

[2] J. L. Ambite, G. Barish, C. A. Knoblock, M. Muslea, J. Oh, and S. Minton. Getting from here to there: interactive planning and agent execution for optimizing travel. In Eighteenth National Conference on Artificial Intelligence, pages 862-869, Edmonton, Alberta, Canada, 2002. American Association for Artificial Intelligence.

[3] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 337-348, San Diego, California, 2003. ACM Press.

[4] Y. Arens and C. Knoblock. SIMS: Retrieving and integrating information from multiple sources. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 562-563, Washington, DC, 1993.

[5] A. Sahuguet and F. Azavant. Web Ecology: Recycling HTML pages as XML documents using W4F. In WebDB'99, 1999.

[6] N. Ashish and C. A. Knoblock. Wrapper generation for semi-structured internet sources. SIGMOD Record, 26(4):8-15, 1997.

[7] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the web: Going back and forth. In Workshop on Management of Semistructured Data, 1997.

[8] M. Bauer, D. Dengler, and G. Paul. Instructible information agents for web mining. In Intelligent User Interfaces, pages 21-28, 2000.

[9] D. Beneventano, S. Bergamaschi, S. Castano, A. Corni, R. Guidetti, G. Malvezzi, M. Melchiori, and M. Vincini. Information integration: The MOMIS project demonstration. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 611-614, Cairo, Egypt, 2000. Morgan Kaufmann.

[10] G. Beuster, B. Thomas, and C. Wolff. MIA - a ubiquitous multi-agent web information system. In International ICSC Symposium on Multi-Agents and Mobile Agents in Virtual Organizations and E-Commerce, 2000.

[11] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. Computer Networks and ISDN Systems, 28(1-2):119-125, 1995.

[12] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. In Working Notes of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, pages 6-11, Menlo Park, CA, 1998. AAAI Press.

[13] M. J. Carey, L. M. Haas, P. M. Schwarz, M. Arya, W. F. Cody, R. Fagin, M. Flickner, A. W. Luniewski, W. Niblack, D. Petkovic, J. Thomas, J. H. Williams, and E. L. Wimmers. Towards heterogeneous multimedia information systems: the Garlic approach. In Proceedings of the 5th International Workshop on Research Issues in Data Engineering - Distributed Object Management (RIDE-DOM'95), page 124. IEEE Computer Society, 1995.

[14] C.-H. Chang and S.-C. Lui. IEPAD: information extraction based on pattern discovery. In Proceedings of the Tenth International Conference on World Wide Web, pages 681-688, Hong Kong, 2001. ACM Press.

[15] B. Chidlovskii. Automatic repairing of web wrappers. In Proceedings of the Third International Workshop on Web Information and Data Management, pages 24-30, Atlanta, Georgia, USA, 2001. ACM Press.

[16] B. Chidlovskii, U. Borghoff, and P. Chevalier. Towards sophisticated wrapping of web-based information repositories. In Proceedings of the 5th International RIAO Conference, pages 123-135, 1997.

[17] R. Cooley, J. Srivastava, and B. Mobasher. Web mining: Information and pattern discovery on the world wide web.
In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.

[18] V. Crescenzi, G. Mecca, and P. Merialdo. ROADRUNNER: Towards automatic data extraction from large web sites. In Proceedings of the International Conference on Very Large Data Bases (VLDB 01), pages 109-118. Morgan Kaufmann, 2001.

[19] R. B. Doorenbos, O. Etzioni, and D. S. Weld. A scalable comparison-shopping agent for the world-wide web. In W. L. Johnson and B. Hayes-Roth, editors, Proceedings of the First International Conference on Autonomous Agents (Agents'97), pages 39-48, Marina del Rey, CA, USA, 1997. ACM Press.

[20] O. Etzioni. The world-wide web: Quagmire or gold mine? Communications of the ACM, 39(11):65-68, 1996.

[21] O. Etzioni and D. S. Weld. Intelligent agents on the internet: Fact, fiction, and forecast. IEEE Expert, 10(3):44-49, 1995.

[22] M. Frank, M. Muslea, J. Oh, S. Minton, and C. Knoblock. An intelligent user interface for mixed-initiative multi-source travel planning. In Proceedings of the 6th International Conference on Intelligent User Interfaces, pages 85-86, Santa Fe, New Mexico, United States, 2001. ACM Press.

[23] X. Gao, M. Zhang, and P. Andreae. Learning information extraction patterns from tabular web pages without manual labelling. In Web Intelligence. IEEE Computer Society, 2003.

[24] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. D. Ullman, V. Vassalos, and J. Widom. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8(2):117-132, 1997.

[25] M. R. Genesereth, A. M. Keller, and O. M. Duschka. Infomaster: an information integration system. In Proceedings of the SIGMOD International Conference on Management of Data, pages 539-542, Tucson, Arizona, United States, 1997. ACM Press.

[26] K. Hammond, R. Burke, C. Martin, and S. Lytinen. FAQ Finder: a case-based approach to knowledge navigation. In Proceedings of the 11th Conference on Artificial Intelligence for Applications, pages 80-86. IEEE Computer Society, 1995.

[27] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

[28] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521-538, 1998.

[29] C. A. Knoblock, K. Lerman, S. Minton, and I. Muslea. Accurately and reliably extracting data from the web: a machine learning approach. In Intelligent Exploration of the Web, pages 275-287, 2003.

[30] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. G. Philpot, and S. Tejada. Modeling web sources for information integration. In Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, pages 211-218, Madison, Wisconsin, United States, 1998. American Association for Artificial Intelligence.

[31] R. Kosala and H. Blockeel. Web mining research: A survey. SIGKDD Explorations, 2(1):1-15, 2000.

[32] S. Kuhlins and R. Tredwell. Toolkits for generating wrappers - a survey of software toolkits for automated data extraction from web sites. In M. Aksit, M. Mezini, and R. Unland, editors, Objects, Components, Architectures, Services, and Applications for a Networked World, volume 2591 of Lecture Notes in Computer Science (LNCS), pages 184-198, 2003.

[33] N. Kushmerick. Wrapper induction for information extraction. PhD thesis, University of Washington, 1997.

[34] N. Kushmerick.
Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118:15-68, 2000.

[35] N. Kushmerick. Wrapper verification. World Wide Web, 3(2):79-94, 2000.

[36] N. Kushmerick and B. Thomas. Adaptive information extraction: A core technology for information agents. In Intelligent Information Agents R&D in Europe: An AgentLink Perspective, 2002.

[37] C. T. Kwok and D. S. Weld. Planning to gather information. In 13th AAAI National Conference on Artificial Intelligence, pages 32-39, Portland, Oregon, 1996. AAAI / MIT Press.

[38] A. H. F. Laender, B. Ribeiro-Neto, and A. S. da Silva. DEByE - data extraction by example. Data and Knowledge Engineering, 40(2):121-154, 2002.

[39] A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira. A brief survey of web data extraction tools. SIGMOD Record, 31(2):84-93, 2002.

[40] S. Lawrence and C. L. Giles. Searching the web: General and scientific information access. IEEE Communications, 37(1):116-122, 1999.

[41] K. Lerman and S. Minton. Learning the common structure of data. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 609-614. AAAI Press / The MIT Press, 2000.

[42] K. Lerman, S. Minton, and C. Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18:149-181, 2003.

[43] S. Luke and J. A. Hendler. Web agents that work. IEEE Multimedia, 4(3):76-80, 1997.

[44] G. Mecca, P. Atzeni, A. Masci, P. Merialdo, and G. Sindoni. The Araneus web-base management system. In SIGMOD Conference, pages 544-546, 1998.

[45] X. Meng, D. Hu, and C. Li. Schema-guided wrapper maintenance for web-data extraction. In Proceedings of the International Workshop on Web Information and Data Management (WIDM'03), pages 1-8, 2003.

[46] T. Mitchell. The need for biases in learning generalizations. In J. Shavlik and T. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann, 1990.

[47] D. Mladenic. Text-learning and related intelligent agents: A survey. IEEE Intelligent Systems, 14(4):44-54, 1999.

[48] R. Mohapatra and K. Rajaraman. Wrapper induction under web layout changes. In Proceedings of the International Conference on Internet Computing, pages 102-108, Las Vegas, Nevada, USA, 2004.

[49] R. Mohapatra, K. Rajaraman, and S. Y. Sung. Efficient wrapper reinduction from dynamic web sources. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pages 391-397, Beijing, China, 2004.

[50] I. Muslea, S. Minton, and C. Knoblock. STALKER: Learning extraction rules for semistructured text. In Proceedings of the AAAI-98 Workshop on AI and Information Integration, Technical Report WS-98-01, Menlo Park, California, 1998. AAAI Press.

[51] I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction. In O. Etzioni, J. P. Müller, and J. M. Bradshaw, editors, Proceedings of the Third International Conference on Autonomous Agents (Agents'99), pages 190-197, Seattle, WA, USA, 1999. ACM Press.

[52] I. Muslea, S. Minton, and C. Knoblock. Hierarchical wrapper induction for semistructured sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.

[53] T. Payne, R. Singh, and K. Sycara. RCal: A case study on semantic web agents. In The First International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2002.

[54] M. Perkowitz and O. Etzioni. Category translation: Learning to understand information on the internet.
In International Joint Conference on Artificial Intelligence, IJCAI-95, pages 930-938, Montreal, Canada, 1995.

[55] D. Smith and M. Lopez. Information extraction from semi-structured documents. In Proceedings of the Workshop on Management of Semi-structured Data, Tucson, 1997.

[56] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233-272, 1999.