INFORMATION EXTRACTION FROM DYNAMIC WEB SOURCES
ROSHNI MOHAPATRA
NATIONAL UNIVERSITY OF SINGAPORE
2004
INFORMATION EXTRACTION FROM DYNAMIC WEB SOURCES
ROSHNI MOHAPATRA
(B.E.(Computer Science and Engineering), VTU, India)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2004
Acknowledgments
First of all, I would like to express my sincere thanks and appreciation to my supervisor,
Dr. Kanagasabai Rajaraman for his attention, guidance, insight, and support which has
led to this dissertation. Through his ideas, he is in many ways responsible for much of the
direction this work took.
I would also like to thank Prof. Sung Sam Yuan and Prof. Vladimir Bajic who have
been a source of inspiration to me. I am grateful to Prof. Kwanghui Lim, Department of
Business Policy, NUS School of Business, for being a mentor and friend, and listening to my
frequent ramblings.
I would like to acknowledge the support of my thesis examiners: A/P Tan Chew Lim
and Dr. Su Jian. I greatly appreciate the comments and suggestions given by them.
Ma, Papa continue to pull the feat of helping me with my work without caring to know
the least about it. I would like to thank them and the rest of the family for their love,
support and encouragement. Special thanks to Arun for his patience, support, favors and
all the valuable input for this thesis and otherwise. Finally, a big thank you to all my
friends, wherever they are, for all the good times we have shared that have helped me to
come till here...
Contents

Acknowledgments
Summary
1 Introduction
  1.1 Background
  1.2 Information Extraction from the Web
      1.2.1 Wrappers
      1.2.2 Wrapper Generation
  1.3 Organization
2 Survey of Related Work
  2.1 Wrapper Verification Algorithms
      2.1.1 RAPTURE
      2.1.2 Chidlovskii's Algorithm
  2.2 Wrapper Reinduction Algorithms
      2.2.1 ROADRUNNER
      2.2.2 DataProg
      2.2.3 SG-WRAM
  2.3 Summary
3 ReInduce: Our Wrapper Reinduction Algorithm
  3.1 Motivation
  3.2 Formalism
  3.3 Generic Wrapper Reinduction Algorithm
      3.3.1 Our Approach
      3.3.2 Algorithm ReInduceW
  3.4 Incremental Wrapper Induction
      3.4.1 LR Wrapper Class
      3.4.2 LR Wrapper Class
      3.4.3 LRRE Wrapper Class
  3.5 Summary
4 Experiments
  4.1 Performance of InduceLR
      4.1.1 Sample cost
      4.1.2 Induction cost
  4.2 Performance of ReInduceLR
5 Conclusions
A Websites considered for Evaluation of InduceLR
B Regular Expression Syntax
List of Tables

3.1 Algorithm ReInduceW
3.2 Algorithm InduceLR
3.3 Trace of Algorithm InduceLR for Page PA
3.4 Algorithm InduceLR
3.5 Expressiveness of LRRE
3.6 Algorithm InduceLRRE
3.7 Algorithm ExtractLRRE
4.1 Websites considered for evaluation of InduceLR
4.2 Details of the webpages
4.3 Precision and Recall of InduceLR
4.4 Time complexity of InduceLR
4.5 White Pages Websites considered for evaluation of ReInduceLR
4.6 Performance of ReInduceLR
4.7 Average Precision and Recall for Existing Approaches
List of Figures

1.1 Froogle: A product search agent
1.2 Weather listing from the channel news asia website
1.3 Page PA and HTML Source
1.4 Page PB and HTML Source
2.1 Life Cycle of a Wrapper
2.2 Layout changes in an Online Address Book
2.3 Content changes in a Home supplies page
2.4 Changed Address Book Addressm
2.5 User defined schema for the Address Book Example
2.6 Content Features of the Address field
3.1 Incremental Content changes in Channel News Asia Website
3.2 ReInduce: Wrapper Reinduction System
3.3 Page PA and its corresponding LR Wrapper
3.4 Illustration of LR Constraints
3.5 HTML Source for Modified Page PAm
3.6 Page PA and corresponding LR and LRRE wrappers
3.7 HTML Source for Modified Page PA
A.1 Screenshot from Amazon.com
A.2 Screenshot from Google.com
A.3 Screenshot from uspto.gov
A.4 Screenshot from Yahoo People Search
A.5 Screenshot from ZDNet.com
Summary
To organize, analyze and integrate information from the Internet, many existing systems
need to automatically extract the content from the webpages. Most systems use customized
wrapper procedures to perform this task of information extraction. Traditionally wrappers
are coded manually, but hand-coding is a tedious process. A technique known as wrapper
induction has been proposed for automatically learning a wrapper from a given resource’s
example pages. In both these methods, the key problem is that, due to the dynamic nature
of the web, over a period of time, the layout of a website may change and hence the wrapper
may become incorrect. The problem of reconstructing a wrapper, to ensure continuous
extraction of information from dynamic web sources is called wrapper reinduction. In this
thesis, we investigate the wrapper reinduction problem and develop a novel algorithm that
can detect layout changes and reinduce wrappers automatically.
We formulate wrapper reinduction as an incremental learning problem and identify that
wrapper induction from an incomplete label is a key problem to be solved. We observe
that the page content usually changes only incrementally for small time intervals, though
the layout may change drastically and none of the syntactic features are retained. We
thus propose a novel algorithm for incrementally inducing a class of Wrappers called LR
wrappers and show that this algorithm asymptotically identifies the correct wrapper as the
number of tuples is increased. This property is used to propose a LR wrapper reinduction
algorithm. We demonstrate that this algorithm requires examples to be provided exactly
once and thereafter the algorithm can detect the layout changes and reinduce wrappers
automatically, so long as the wrapper changes are in LR.
We have performed experimental studies of our reinduction algorithm using real web
pages and observed that the algorithm is able to achieve near perfect performance. In
comparison, DataProg has reported performance of 90% precision and 80% recall, and SG-WRAM 89.5% precision and 90.5% recall. However, DataProg and SG-WRAM assume that
the content to be extracted follows specific patterns, which is not required by our algorithm.
Furthermore, our algorithm has been observed to be efficient and capable of learning from
a small number of examples.
Chapter 1
Introduction
1.1 Background
We can perceive the Web as a huge library of documents – telephone directories, weather
reports, web-logs, news, virus updates, research papers, job listings, event schedules, stock
market information and many more. Recently there has been an interest in developing
systems that can access such resources and organize, categorize and personalize this information on behalf of the user.
Information Integration systems deal with extraction and integration of data from
various sources [4]. An application developer starts with a set of web sources and creates a
unified view of these sources. Once this process is complete, an end user can issue database-like queries as if the information were stored in a single large database [30]. Many such
approaches have been discussed in [4, 9, 13, 25].
Figure 1.1: Froogle: A product search agent
Intelligent agents or software agents is a term used to describe fully autonomous
systems that manage, collate, filter and redistribute information from many resources [36, 8].
Broadly put, agents include information integration as one of their tasks, but they
additionally analyze the information obtained from various sources.
These systems assist users by finding information or performing some simpler tasks on
their behalf. For instance, such a system might assist in product search to aid online
shopping. Froogle, shown in Figure 1.1 is one such agent. Some agents help in web browsing
by retrieving documents similar to already-requested documents [47] and presenting cross
referenced scientific papers [40]. More commercial uses have been proposed: Comparative
Shopping Agents [19], Virtual Travel Assistants [2, 22], and Mobile Agents [10]. Many
such agents are deployed and listed online and provide a wide range of functionality. A
comprehensive survey of such agents has been done in [47].
A class of these agents helps tackle the information overload on the internet by assisting
us in finding important resources on the Web [11, 37, 26, 54], and also by tracking and analyzing their
usage patterns. This process of discovery and analysis of Information on the World Wide
Web is called Web Mining. Web mining is a huge, interdisciplinary and very dynamic
scientific area, converging from several research communities such as database, information
retrieval, and artificial intelligence especially from machine learning and natural language
processing. This includes automatic research and analysis of information resources available
online, Web Content Mining, discovery of the link structure of the hyperlinks at the inter-document level, Web Structure Mining, and the analysis of user access patterns, Web Usage
Mining[17]. A taxonomy of Web Mining tools has been described in [17]. A detailed survey
on Web Mining has been presented in [31].
Originally envisioned by the World Wide Web Consortium (W3C) to evolve, proliferate,
and to be used directly by people, the initial web architecture included a simple HTTP,
URI, and HTML source structure. People used query mechanisms (e.g., HTML forms)
and received output in the form of HTML pages. This was very well suited for manual
interaction. As expected, the existence of an open and freely usable standard allows anyone
in the world to experiment with extensions; the HTTP and HTML specifications have both
grown rapidly in this environment. HTML pages now contain extensive, detailed formatting
information which is specific to one type of browser, and many vendor-specific tags have
been added, making it useful only as a display language without any standard structure.
Now efforts are underway to standardize and incorporate more structure into the web.
The advent of XML has helped tackle this lack of structure in the web, but it is not as
commonly used and there is very limited native browser support. Though many websites
employ XML in their background, there are still many HTML-based websites that need to
be converted to XML before universal adoption. Thus, there is still a need to convert from
the existing HTML data to XML, and technology does not provide a trivial solution for
this [39].
The Web, which is characterized by diverse authoring styles and content variations,
does not have a rigid and static structure like relational databases. Most of the pages
are composed of natural language text, and are neatly ‘formatted’ with a title, subtitles,
paragraphs etc., more or less like traditional text documents. But we observe that the web
demonstrates a fair degree of structure in data representation [20] and is highly regular
in order to be human-readable and is often described as semi-structured. Normally, a
document may contain its own metadata, but the common case is for the logical structure
to be implicitly defined by a combination of physical structure (e.g. HTML tags for a webpage, line and paragraph boundaries for a free-text resource) and content indicators (e.g.
words in section headings, capitalization of important words, etc.) [55].
For example, a webpage listing the world weather report may list the results in the
form of a tuple (city, condition, max temperature, min temperature), as shown in Figure
1.2. Many such tuples may be present on the same page, appropriately formatted, giving it
the appearance of a relational database. Similarly, a movie listing may have the information
in the order (movie, rating, theater, time).
While unstructured text may be difficult to analyze, semi-structured text poses a different set of challenges. It is interlaced with extraneous elements like advertisements and
HTML formatting constructs [33] and hence, extraction of data from Web is a non-trivial
Figure 1.2: Weather listing from the channel news asia website
problem.
The primary problem faced by Information Integration systems and Intelligent agents
is not resource discovery, since most of them would look at a few trusted sources related to
specific domains. Since the semi-structured pages contain a lot of extraneous information,
the problem lies in being able to extract the contents of a page. Kushmerick et al. [36]
advocate this task of Information extraction from the web as the core enabling technology
for a variety of Information agents.
1.2 Information Extraction from the Web
At the highest level, this thesis is concerned with Information Extraction from the Web.
Information Extraction (IE) is the process of identifying the particular fragments of an
information resource that constitute its core semantic content [34]. A number of IE systems
have been proposed for dealing with free-text (see [56, 12] for example) and semi-structured
text [6, 18, 33, 55].
The information extraction algorithms can be further classified on the basis of whether
they deal with semi-structured text or semi-structured data [39]. Note that in the former
the data can only be inferred, while in the latter the data is implicitly formatted. The focus in
this thesis is on semi-structured data extraction. A taxonomy for these Web data extraction
methods has been described in the detailed survey by Laender et al. [39]. In this survey, the
existing methods are classified into Natural language processing (NLP), HTML structure
analysis, Machine Learning, data modeling and ontology-based methods.
1.2.1 Wrappers
To extract information from semi-structured information resources, information extraction
systems usually rely on extraction rules tailored to that source, generally called Wrappers.
Wrappers are software modules that help capture the semi-structured data on the web into
a structured format. They have three main functions [32]:
• Download : They must be able to download HTML pages from a web site.
• Search: Within a resource they must be able to search for, recognize and extract
specified data.
• Save: They should save this data in a suitably structured format to enable further
manipulation. The data can then be imported into other applications for additional
processing.
According to [5], 80% of the published information on the WWW is based on databases
running in the background. When compiling this data into HTML documents the structure
[Figure: the rendered page PA, listing the tuples (Jack, China), (John, USA), (Joseph, UK) under the heading PEOPLE, together with its HTML source.]
Figure 1.3: Page PA and HTML Source.
of the underlying databases is completely lost. Wrappers try to reverse this process by
restoring the information to a structured format.
Also, it can be observed that across different web sites and web pages in HTML, the
structural formatting (HTML tags or surrounding text) may differ, but the presentation
remains fairly regular. Wrappers also help in coping with structural heterogeneity inherent
in many different sources. By using several wrappers to extract data from the various information sources of the WWW, the retrieved data can be made available in an appropriately
structured format [32].
To be able to search data from semi-structured web pages, the wrappers rely on key
patterns that help recognize the important information fragments within a page. The most
challenging aspect of Web data extraction by wrappers is to be able to recognize the data
among uninteresting pieces of text.
For example, consider an imaginary web site containing Person Name and Country
Name entities, shown in Figure 1.3. To extract the two entities, we can propose a wrapper,
say PCWrapper, using the delimiters {<B>, </B>, <I>, </I>}, where the first two define the
left and right delimiters of the Person Name and the last two define the corresponding
delimiters for Country Name. This wrapper can be used to extract the contents of the page
PA , and of any other page, where the same features are present.
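As a rough illustration, the sketch below applies such a delimiter-based wrapper using Python regular expressions. The HTML fragment is a hypothetical reconstruction of page PA (the example only specifies the delimiters <B>, </B>, <I>, </I>), so the exact markup is an assumption:

    import re

    # Hypothetical reconstruction of page PA's HTML source.
    page_a = ("<HTML><BODY>\n"
              "<B>Jack</B>, <I>China</I><BR>\n"
              "<B>John</B>, <I>USA</I><BR>\n"
              "<B>Joseph</B>, <I>UK</I><BR>\n"
              "<HR></BODY></HTML>")

    # PCWrapper: <B>...</B> delimits Person Name, <I>...</I> delimits Country Name.
    pc_wrapper = re.compile(r"<B>(.+?)</B>.*?<I>(.+?)</I>", re.DOTALL)

    print(pc_wrapper.findall(page_a))
    # [('Jack', 'China'), ('John', 'USA'), ('Joseph', 'UK')]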
1.2.2 Wrapper Generation
One approach to creating a wrapper would be to hand-code it [24], but this is a tedious
process. Techniques have been proposed for constructing wrappers semi-automatically or
automatically, using a resource’s sample pages. The automatic approaches which use supervised learning need the user to provide some labeled pages indicating the examples. Many
such approaches were proposed in RAPIER [12], WHISK [56], WIEN [33], SoftMealy [28],
STALKER [50] and DEByE [38]. A method for automatic generation of Wrappers with
unsupervised learning was introduced in RoadRunner [18].
To extract the data, wrappers use either content-based features or landmark-based features. Content-based approaches [12, 56] use content/linguistic features like capitalization,
presence of numeric characters etc. and are suitable for Web pages written in free text,
possibly using a telegraphic style, like in job listings or rental advertisements. Landmark
based approaches [33, 28, 50] use delimiter based extraction rules that rely on formatting
features to delineate the structure of data found [39] and hence are more suitable for data
formatted in HTML.
For example, in Figure 1.3, the wrapper PCWrapper can be learnt automatically from
examples of (Person Name, Country Name) tuples.
Since the extraction patterns generated in all these systems are based on content or
delimiters that characterize the text, they are sensitive to changes of the Web page format.
[Figure: the rendered page PB after the layout change, listing the tuples (Jack, China), (James, India), (John, USA), (Jonathan, UK) under the heading INFORMATION, together with its HTML source.]
Figure 1.4: Page PB and HTML Source.
In this sense they are source-dependent. They either need to be reworked or need to be
rerun to discover new patterns for new or changed source pages. For example, suppose the
site in Figure 1.3 changes to a new layout as in Figure 1.4.
Note that PCWrapper no longer extracts correctly. It will extract the tuples as
(China, James), (India, John), (USA, Jonathan) rather than (Jack, China), (James,
India), (John, USA), (Jonathan, UK).
Kushmerick [16, 18] investigated 27 actual sites for a period of 6 months, and found
that 44% of the sites changed their layout at least once during that period [35]. If the source
modifies its formatting (for example, to “revamp” its user interface) the observed content
or landmark feature will no longer hold and the wrapper will fail [36]. In such cases, the
extraction of data from such web pages becomes difficult and is clearly a non-trivial problem.
In this thesis, we focus on this problem of Extraction of Information from Dynamic Web sites. We deal with dynamic web pages, typically, a web page which is
modified in its layout, content or both. The challenge here is to generate the Wrapper
automatically when the page changes occur, such that the data is extracted continuously
for the purpose of the user. In this thesis, we develop systems that are capable of extracting the content of such dynamic webpages. We propose a novel approach for dealing with
dynamic websites and present efficient algorithms that can perform continuous extraction
of information.
1.3 Organization
The rest of the thesis is organized as follows:
Chapter 2 is dedicated to reviewing all the existing literature for information extraction
from dynamic websites and evaluating their strengths and weaknesses. We summarize
the key learning from these methods and present the scope of our work.
Chapter 3 presents a detailed description and analysis of our approach. We formally define
the problem of information extraction from dynamic websites, and our approach to
tackling it. We discuss the formal framework for our algorithm, and define and analyze
in detail the wrapper classes. We also present a study and analysis of algorithms to
learn these wrappers, and use them to propose a novel method to learn new wrappers
on the fly when layout and content changes occur in the website.
Chapter 4 discusses the empirical evaluation of our work through experiments on real
webpages. We study the sample and time complexity of our algorithms and compare
the results to the existing approaches.
Chapter 5 summarizes our work and indicates the merits as well as limitations. We propose ways to extend the algorithms to achieve better performance and also pose
the open problems for further investigation.
Chapter 2
Survey of Related Work
As discussed in the previous chapter, Wrappers are software modules that help us capture
semi-structured data into a structured format. We noted that these wrappers are susceptible
to “breaking” when website layout changes happen. To rectify this problem, a new
wrapper needs to be induced using examples from the modified page. This is called the
Wrapper Maintenance problem and it consists of two steps [36, 42]:
1. Wrapper Verification: To determine whether a wrapper is correct.
2. Wrapper Reinduction: To learn a new wrapper if the current wrapper has become
incorrect.
The entire process of a Wrapper Induction, Verification and Reinduction is illustrated
through Figure 2.1 [42].
[Diagram: the user supplies labeled HTML pages to the Wrapper Induction module, which outputs a Wrapper; the Wrapper extracts data from HTML pages; Wrapper Verification monitors the extracted data, and when a change is detected, automatic relabeling feeds new labeled pages back into induction (the Reinduction System).]
Figure 2.1: Life Cycle of a Wrapper
The wrapper induction system takes a set of web pages labeled with examples of the
data to be extracted. The output of the wrapper induction system is a wrapper, consisting
of a set of rules to identify the data on the page.
A wrapper verification system monitors the validity of data returned by the wrapper.
If the site changes, the wrapper may extract nothing at all or some data that is not correct.
The verification system will detect data inconsistency and notify the operator or automatically launch a wrapper repair process. A wrapper reinduction system repairs the extraction
rules so that the new wrapper works on changed page.
We take a simple example to illustrate this. Consider the example given in Figure 2.2.
The wrapper Addresswrap for page Addresso is the same as PCWrapper in the previous
chapter: {<B>, </B>, <I>, </I>}.
When the page Addresso changes its layout to Addressc , Wrapper Addresswrap would
extract (12 Orchard Road, James), (34 Siglap Road, June), (22 Science Drive, Jenny)
Address Book
Jack,1234 Orchard Road
James,3454 Siglap Road
John,22 Alexandra Road
Jonathan,1156 Kent Ridge
(a)Original Address Book Addresso
Address Book
Jack,12 Orchard Road
James,34 Siglap Road
June,22 Science Drive
Jenny,11 Sunset Blvd
(b) Changed Address Book Addressc
Figure 2.2: Layout changes in an Online Address Book
on page Addressc . The wrapper verification system will identify that the extracted data is
incorrect. The wrapper reinduction system will then help learn the new wrapper, i.e. a new set of four delimiters, from the changed page.
Wrapper Maintenance has been investigated in the literature. Below, we review the important works and discuss the strengths and limitations of these methods.
2.1 Wrapper Verification Algorithms
Wrapper Verification is the step to determine whether a wrapper is still operating correctly.
When a Web site changes its layout or in the case of missing attributes, the wrapper will
either yield NULL results, or a wrong result. In such a case, the wrapper is considered to
be broken. This can become a big bottleneck for information integration tools and also for
information agents.
2.1.1 RAPTURE
Kushmerick [35] proposed a method for wrapper verification using a statistical approach.
He uses heuristics like word count and mean word length. The method relies on obtaining
heuristics for the new page, and comparing them against the heuristic data for pre-verified
pages to check whether it is correct. An outline of the steps is given below:
• Step 1: Estimate the tuple count distribution parameters for the pre-verified pages. This is assumed to follow a normal distribution; the mean tuple number and the standard deviation are computed.
• Step 2: Estimate the feature value distribution parameters for each attribute in the pre-verified pages. For this, simple statistical features like word count and word length are used. For example, the word count for ‘Jonathan’ is 1 and for ‘1156 Kent Ridge’ it is 3, and the mean word length for the name field is 5.25.
• Step 3: For any new page, a similar computation of the tuple distribution and feature value distributions is done, and these values are compared against the values for the pre-verified pages. For example, for feature 1 (Name), we know from our computation that the average word length in Addresso is 1, but in Addressc it is computed to be 3.
• Step 4: Based on Step 3, the overall verification probability is computed and compared against a fixed threshold to determine whether the wrapper is correct or incorrect. In the case of our example, it would return CHANGED (a rough sketch of this computation is given below).
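The sketch below illustrates the flavour of these computations (a simplification, not RAPTURE itself; the two-feature set and the 50% drift threshold are assumptions made only for this example):

    import statistics

    def field_features(values):
        # Per-field heuristics used above: mean word count and mean word length.
        counts = [len(v.split()) for v in values]
        lengths = [len(word) for v in values for word in v.split()]
        return statistics.mean(counts), statistics.mean(lengths)

    # Feature 1 (Name) on the pre-verified page Address_o ...
    trained = field_features(["Jack", "James", "John", "Jonathan"])
    # ... versus what the broken wrapper returns on the changed page Address_c.
    observed = field_features(["12 Orchard Road", "34 Siglap Road", "22 Science Drive"])

    # Crude verification: flag a change if either feature drifts by more than 50%.
    changed = any(abs(o - t) / t > 0.5 for o, t in zip(observed, trained))
    print("CHANGED" if changed else "OK")   # CHANGED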
Item         List Price   Our Price
Chopsticks   $6.95        $4.95
Spoons       $25.00       $10.00
(a) Original content, Homeo

Item         List Price   Our Price
Chopsticks   $6.95        $3.95
Spoons       $25.00       $5.00
(b) Modified content, Homec
Figure 2.3: Content changes in a Home supplies page
Strengths: For the most part, this method uses a black-box approach to measuring overall
page metrics and hence it can be applied in any wrapper generation system for verification.
RAPTURE uses very simple numeric features to compute the probabilistic similarity measure between a wrapper’s expected and observed output. After conducting experiments
with numerous actual Internet sources, the authors claim RAPTURE performs substantially better than standard regression testing approaches.
Weaknesses: Since information for even a single field can vary considerably, overall
statistical distribution measures may not be sufficient. For example, in the case of listings
of scientific publications, the author names and the publication titles may all vary so
drastically that verification becomes ambiguous. Such cases, though rare, make this approach
ineffective, unless more features, like digit density, upper-case density, letter density, HTML
density etc., are used during verification. For example, if the contents of the page in
Figure 2.3(a) change to those in Figure 2.3(b) apart from the layout, then based on content
patterns it would be very difficult to distinguish the ‘List Price’ from the ‘Our Price’.
Additionally, this method does not examine re-induction at all.
Address Book
Jack,1234 Orchard Road
James,3454 Siglap Road
Jenny,22 Alexandra Road
Jules,1 Kent Ridge
Figure 2.4: Changed Address Book Addressm
2.1.2 Chidlovskii's Algorithm
Another verification approach was suggested by Chidlovskii [15], who argues that pages
rarely undergo any massive or sweeping change; more often than not there is only a slight
local change or concept shift. The automatic maintenance system repairs wrappers under
this assumption of “small change”.
This method tackles verification with classifiers built using content features of the
extracted information; for feature 1 (Name), for example, average length = 5.25, number of
upper-case characters = 1, number of digits = 0, etc. The approach makes an effort to extend
the conventional forward wrappers with backward wrappers to create a multi-pass wrapper
verification approach. In contrast to forward wrappers, the backward wrappers scan files
from the end to the beginning. The backward wrapper is similar in structure to the forward
wrapper, and can run into errors when the format changes. However, because of the backward
scanning, it will fail at positions different from where the forward wrapper would fail. This
typically works for errors generated by typos or missing close tags in HTML pages, and helps
to fine-tune the answers further.
If page Addresso were changed to page Addressm as in Fig. 2.4, the forward wrapper
Addresswrapf would extract (Jack, 1234 Orchard Road), (James, 3454 Siglap Road),
(Jenny, 22 Alexandra Road Jules, 1 Kent Ridge) on page Addressm.
The backward wrapper, scanning page Addressm from the end, would extract the tuples
(Jules, 1 Kent Ridge), (James, 3454 Siglap Road), (Jack, 1234 Orchard Road) on
page Addressm.
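A rough sketch of the forward versus backward scanning idea is given below (an illustration using Python regular expressions and hypothetical delimiters, not Chidlovskii's actual algorithm):

    import re

    # Hypothetical delimiters of the address wrapper.
    L1, M1, R2 = "<B>", "</B>, <I>", "</I>"

    def forward_extract(page):
        # Scan from the beginning of the page.
        pat = re.escape(L1) + "(.+?)" + re.escape(M1) + "(.+?)" + re.escape(R2)
        return re.findall(pat, page, re.DOTALL)

    def backward_extract(page):
        # Scan from the end: reverse the page and the reversed delimiters,
        # then undo the reversal on the captured fields and on the tuple order.
        pat = (re.escape(R2[::-1]) + "(.+?)" + re.escape(M1[::-1])
               + "(.+?)" + re.escape(L1[::-1]))
        rev_matches = re.findall(pat, page[::-1], re.DOTALL)
        return [(name[::-1], addr[::-1]) for addr, name in rev_matches][::-1]

Because the two passes fail at different positions when tags are missing or malformed, comparing their outputs gives extra evidence for verification.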
Strengths: The forward-backward scanning is unique and seems to be a robust approach to handle wrapper verification, especially for missing attributes and tags. Tested
on the a database of 18 websites, including the Scientific Literature database DBLP, this
method reports an average error of only 4.7% when using the Forward-backward wrappers
with the context classifier.
Limitations: Though the forward-backward Wrapper approach has an advantage over
other methods in verification when there are missing tags, the use of content features may
not be very effective in many cases. Since information for even a single field can vary
considerably, overall statistical distribution measures may not be sufficient.
2.2 Wrapper Reinduction Algorithms
Wrapper Reinduction is the process of learning a new wrapper if the current wrapper is
broken. Wrapper reinduction is a tougher problem than wrapper verification: not only does
the wrapper have to be verified, a new wrapper must be constructed as well. It requires that
new examples be provided for learning, which may be expensive when there are many sites
being wrapped. The conventional wrapper induction models cannot be directly used for
reinduction since many of them require detailed manual labeling for training, which can
become a bottleneck for reinduction of wrappers. So the wrapper reinduction task usually deals
with locating training examples on the new page, automatically labeling them, and supplying
them to the wrapper induction module to learn the new wrapper.
2.2.1 ROADRUNNER
ROADRUNNER[18] is a method that uses unsupervised learning to learn the wrappers.
Pages from the same website are supplied and a page comparison algorithm is used to
generate wrappers based on similarities and mismatches.
The algorithm performs a detailed analysis of the HTML tag structure of the pages
to generate a wrapper to minimize mismatches. This system employs wrappers based on
a class of regular expressions, called Union-Free Regular Expressions (UFREs), which are
very expressive. The extraction process compares the tag structure between the sample
pages and generates regular expressions that handle structural mismatches found between
them. In this way, the algorithm discovers structures such as tuples, lists and variations
[39].
An approach similar to ROADRUNNER was used by Arasu et al. [3]. They propose
automatic data extraction by inducing the underlying template from a set of
template-generated sample pages with the same structure, taken from data-intensive
web sites, and then extracting the values encoded in
them. However, this does not handle multiple values listed on one page.
Strengths: Since this method needs no examples to learn the wrappers, it has an obvious
strength: it provides an alternative way to deal with the wrapper maintenance problem,
especially in cases where there are no examples.
Limitations: Since ROADRUNNER searches in a larger wrapper space, the algorithm
is potentially inefficient. The unsupervised learning method gives little control to the user.
The user might want to make some refinements and only extract a specific subset of the
available tuples. In such cases, some amount of user input is clearly necessary to extract
the correct set of tuples. Another problem of this approach is the need for many examples
to learn the Wrapper accurately [45].
2.2.2 DataProg
Knoblock et al. [29] developed a method called DataPro for wrapper repairing in the case
of small mark-up changes; it detects the most frequent patterns in the labeled strings, and these
patterns are searched for in a page when the wrapper is broken. Lerman et al. [42] extended
the above content-centric approach for verification and re-induction in their DataProg
system. The system takes a set of labeled example pages and attempts to induce content-based rules so that examples from new pages can be located. Wrappers can be verified by
comparing the patterns of data returned to the learned statistical distribution. When a
significant difference is found, an operator can be notified or the wrapper repair process can
be automatically launched.
For example, by observing the street addresses listed in our example, we can see that
they are not completely random: each has a numeric character followed by a capital letter.
DataProg tries to derive a simple rule to identify this field as ALPHA CAPS etc. Using
this, they locate the examples on the new page, which are passed to a wrapper induction
algorithm (STALKER algorithm) to re-induce the wrapper. This approach is similar to the
approaches used by Content-centric Wrapper tools [12, 56].
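A rough sketch of deriving such a token-level pattern from example field values is shown below (a simplification of the idea; the type names and the agreement rule are assumptions, not DataProg's actual pattern language):

    def token_type(tok):
        # Coarse token classes in the spirit of DataProg's NUMBER/CAPS/ALPHA patterns.
        if tok.isdigit():
            return "NUMBER"
        if tok[0].isupper():
            return "CAPS"
        return "ALPHA"

    def common_pattern(examples):
        # Keep a positional pattern only if every example has the same length and token types.
        patterns = [[token_type(t) for t in ex.split()] for ex in examples]
        if len({len(p) for p in patterns}) != 1:
            return None
        return [ts[0] if len(set(ts)) == 1 else "ANY" for ts in zip(*patterns)]

    streets = ["1234 Orchard Road", "3454 Siglap Road", "22 Alexandra Road"]
    print(common_pattern(streets))   # ['NUMBER', 'CAPS', 'CAPS']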
Strengths: The class of wrappers described by DataProg are very expressive since
they can handle missing and rearranged attributes. This approach applies machine learning
techniques to learn specific statistical distribution of the patterns for each field as against
the generic approach used by Kushmerick [35]. This approach assumes that the data representation is consistent, and by looking at the test set, we can see that it can be successfully
used for representations which have strong distinguishing features like URLs, time, price,
phone numbers etc.
Limitations: For many cases like news, scientific publications or even for author
names, this approach cannot be used too well since there are no fixed content-based rules
(Alphanumeric, Capitalized etc.) which can be identified to separate them from other
content on the page. For example, in case of the example illustrated in Figure 2.3 (a) and
(b), this method will not detect any change, because the generic features and data patterns
of ‘List Price’ and ‘Our Price’ are the same. It could also produce too many candidate
data fields [45], many of which could be noise. It fails on very long descriptions, and
is very sensitive to improper data coverage. Lerman et al. [42] quote a striking example of
the data coverage problem that occurred for the stock quotes source: the day the training
data was collected, there were many more down movements in the stock price than up, and
the opposite was true on the day the test data was collected. As a result, the price change
fields for those two days were dissimilar. The process of clustering the candidates for each
data field does not consider the relationship of all data fields (the schema) [45].
<!ELEMENT Addresses (Address+)>
<!ELEMENT Address (Name, Street Name)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Street Name (#PCDATA)>
Figure 2.5: User defined schema for the Address Book Example
2.2.3 SG-WRAM
SG-WRAM (Schema Guided Wrapper Maintenance) [45] is a recent method that utilizes
data features, such as syntactic features and annotations, for reinduction. It is based on the
assumption that some features of the desired information in the previous document remain
the same, e.g. syntactic features (data types), hyperlink features (whether or not a hyperlink
is present) and annotation features (any string that occurs before the data field) will be
retained. It also assumes that the underlying schemas are still preserved in the changed
HTML document. These features help the system to identify the locations of the content in
the modified pages through tag structure analysis. For our example, the user
defined schema would look like Figure 2.5.
Internally, the system computes a mapping from each of the fields above to the HTML
tree, and generates the extraction rule. For each #PCDATA string, the features are recorded.
If the name were hyperlinked to another page, then the Hyperlink feature would be TRUE.
Similarly, if each street name were preceded by the string ‘Street’, the Annotation would be
‘Street’. For our case, the features are shown in Figure 2.6.
Attribute      Syntactic                  Hyperlink   Annotation
Name           [A-Z][a-z]{0,}             False       NULL
Street Name    [0-9]{0,}[A-Z][a-z]{0,}    False       NULL
Figure 2.6: Content Features of the Address field
For simple changes in pages, this method depends on the syntactic and annotation
features, but in case the web site has undergone a structural change, it uses the
schema to locate structural groups and uses them to extract data.
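The syntactic features of Figure 2.6 are essentially regular expressions over the field values; a minimal sketch of checking a candidate string against them follows (an illustration of the feature idea only, not SG-WRAM's implementation; the Street Name pattern is loosened slightly so that multi-word values match):

    import re

    SYNTACTIC = {
        # Adapted from Figure 2.6.
        "Name": r"[A-Z][a-z]*",
        "Street Name": r"[0-9]+( [A-Z][a-z]*)+",
    }

    def has_feature(field, value):
        return re.fullmatch(SYNTACTIC[field], value) is not None

    print(has_feature("Name", "Jack"))                      # True
    print(has_feature("Street Name", "1234 Orchard Road"))  # True
    print(has_feature("Street Name", "Jack"))               # False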
Strengths: Since it relies on multiple features, this works better in many cases. In
case of the example illustrated in Figures 2.3 (a) and (b), where syntactic differences are
not strong, this work considers annotation features (List Price, Our Price). Thus, when
applying the extraction rule, the approach will find that the annotations have changed and
hence detect that the page has changed.
Limitations: The basis of this approach is the assumption that data on the same topic
will always be grouped together, similarly to the user-defined schema, and that this grouping
will be retained even when the page changes. However, if the data schema or the syntactic
and tag structure changes, then
this method is not effective.
2.3 Summary
From our study, we observe a few key things about Wrapper generation, verification and
maintenance. We observe that landmark-based Wrapper generation approaches are more
suitable for HTML pages, as compared to content-based approaches. Conventional wrapper
induction algorithms cannot be extended into reinduction algorithms since most of them
need manual labeling of data. The reinduction procedure should be automatic for continuous
extraction of information from the web source.
Wrapper verification can be handled by heuristics. It can be tackled by using global
statistics [35] or local attribute-specific statistics [42]. Since page structures are very
complex, if a page changes its layout completely, it is very unlikely that any of the previous
features will be interchanged with others, as, for example, between pages PA and PB.
Hence it may be a common occurrence that the wrapper returns null values when a
web site revamp happens. Wrapper verification can be treated independently of wrapper
induction. Hence, existing methods are usually adequate for the purpose.
In contrast, Wrapper reinduction is a far more difficult problem and has much scope
for investigation. From our survey of related work, we observe that the main issues with
the current approaches are:
(i) Potentially inefficient either because of the need for detailed HTML grammar analysis
or due to searching in big wrapper spaces, which makes them inherently slow.
(ii) The requirement that most of the data in the modified pages have effective features
(syntactic patterns, annotations, etc.). These can be page-specific, and hence make the
reinduction approach difficult.
(iii) The need for many training examples, for learning and reinduction. This additionally
includes cases in which the user has to specify a detailed schema which is not very
user-friendly.
Our goal in this thesis is to address these issues effectively. We investigate wrapper
reinduction algorithms that are efficient, learn from a small number of examples and do
not require strong assumptions on the data features. In the next chapter, we describe our
approach and present our algorithms.
Chapter 3
ReInduce: Our Wrapper Reinduction Algorithm
In this chapter, we present a novel algorithm for Wrapper Reinduction. As discussed earlier,
the focus is on dynamic webpages, pages whose layouts or content may change over time.
3.1 Motivation
We observe that though the layout may change drastically and none of the syntactic features
are retained, the page content usually changes only incrementally. For example, in the Channel
News Asia (http://channelnewsasia.com) headlines snapshot in Figure 3.1(a) and the snapshot taken
after two hours in Figure 3.1(b), we observe that new headlines were added as the old headlines were
deleted. In other words, the contents of the pages have a lot of commonalities over a small
time interval. This content can be used to learn the new wrapper. If some of the old tuples
can be detected in the page with the modified layout, we can apply wrapper induction to
learn the new layout. This is the idea behind our reinduction algorithm.
To motivate our approach, let us consider a wrapper X for this website, which grabs
the headlines from the page. Wrapper X extracts all the headlines present on the page
and stores them in a small repository. If one day a website revamp happens and the layout is
completely changed, then X might not retrieve the headlines on the page. The maintenance
system then tries to locate the headlines stored in the repository and learn a new wrapper
from these examples. Once the new wrapper is created, the wrapper can be used to locate
all the news headlines on the same page.
Instead of trying to search the wrapper space for a wrapper which will work, or
manually constructing the training examples needed for reinduction, we try to learn the new
wrapper from the few examples available to us, so that when these examples, though few,
are discovered on the new page, we can induce the wrapper and deploy it into the system
and it will be transparent to the user. An illustration of the process flow in our Wrapper
Reinduction system, ReInduce is given in Figure 3.2.
The key point here is that, at the induction/reinduction step, such a system might not
have many training examples available. The key problem to be solved is learning
from a small number of examples, especially when not all examples on the page are
available to us. In the following sections, we address this learning problem and propose
our reinduction algorithm.
In the next section, we describe the formal framework for the description of the wrapper classes.
Figure 3.1: Incremental Content changes in Channel News Asia Website. (a) Content of the page at 1200 hrs; (b) content of the page at 1400 hrs.
Figure 3.2: ReInduce: Wrapper Reinduction System
3.2 Formalism
Resources, queries and responses: Consider the model where a site, when queried with
a URL, returns an HTML page. An information resource can be described formally as a function from a query Q to a response P [33]:
Query Q → [Information Resource] → Response P
Attributes and Tuples: We assume a model similar to the relational data model.
Associated with every information resource is a set of K distinct attributes, each representing
a column. For example, Page PA in the country name example in 1.3 has K=2.
A tuple is a vector ⟨A1, · · · , AK⟩ of K strings. The string Ak is the value of the kth
attribute. This is similar to rows in a relational model. There are M such tuples/vectors
present on a page.
If there is more than one tuple present on the page, then the kth attribute of the mth
tuple will be represented as Am,k.
Content and Labels: The content of a page is the set of tuples it contains. A page’s
label is a representation of its content. For example, the label for page PA in the country
name example in Figure 1.3 is LA = {(Jack, China), (John, USA), (Joseph, UK)}.
Wrappers: A wrapper takes as input a page P and outputs a label L. For wrapper
w and page P , we write w(P ) = L to indicate that w returns label L when invoked on P ,
e.g. PCWrapper(PA ) = LA . Hence, a wrapper can be described as a function from a query
response to a label.
Query response (Page) → [Wrapper] → Label
A Wrapper class can be defined as a template for generating these wrappers. All
wrappers belonging to a class will have similar execution steps.
Wrapper Induction: Let W be a wrapper class and E = {⟨P1, L1⟩, · · · , ⟨PN, LN⟩}
be a set of example pages and their labels. Wrapper induction is the problem of finding
w ∈ W such that w(Pn ) = Ln , for all n = 1, · · · , N .
Wrapper Verification: We say w is correct for P iff P ’s label is identical to w(P ).
Wrapper Reinduction: For a dynamic web site, the response will also be a function of time.
We assume the same model in which the site is queried with a URL q and observed at time
instants t0 , t1 , · · · , tN . Let:
• {P (t0 ), P (t1 ), · · · , P (tN )} be the pages in response to the queries, and
• {L(t0 ), L(t1 ), · · · , L(tN )} be the labels of the above pages.
The wrapper reinduction problem is: given the example ⟨P(t0), L(t0)⟩ at time t0, find
wrappers wi ∈ W such that wi(P(ti)) = L(ti).
3.3 Generic Wrapper Reinduction Algorithm
Note that the wrapper reinduction problem is trivial if both the pages and labels remain
static, i.e. P (ti ) = P (t0 ) and L(ti ) = L(t0 ) for i ≥ 1. Even if only the labels remain static,
the problem is much simpler and reduces to the problem of inducing a wrapper wi at time
ti using ⟨P(ti), L(t0)⟩ as the example.
However, when both the pages and labels vary, we cannot induce a wrapper automatically since L(ti ) is not known for i ≥ 1. Note that this problem is, in general, not solvable
without making assumptions about the variations. Lerman et al.[42] assume that the labels
follow an implicit structure over all time instants. In SG-WRAM[45], the data schema is
assumed to be preserved.
3.3.1 Our Approach
Our approach is designed as follows. Consider two time instants t1 and t2 such that t2 > t1.
Let P(t1) and P(t2) be the pages returned at t1 and t2, for a fixed URL q.
P (t2 ) may differ from P (t1 ) in layout, content or both. We observe that though the
layout may change drastically, the page content usually changes only incrementally, provided
(t2 − t1 ) is small enough.
In other words, L(t1) and L(t2) will have a lot of commonalities for small (t2 − t1). Therefore, if some of the tuples can be detected in the layout-modified page, we can apply wrapper
induction to learn the new layout. This is the idea behind our reinduction algorithm.
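The detection step itself can be sketched very simply (a naive illustration; it assumes the old label is a list of attribute-value tuples and counts a tuple as found when all its values occur verbatim in the new page):

    def tuples_found(old_label, new_page):
        # Tuples of the old label whose values all still occur in the new page.
        return [t for t in old_label if all(value in new_page for value in t)]

    old_label = [("Jack", "China"), ("John", "USA"), ("Joseph", "UK")]
    new_page = "<ul><li>Jack - China</li><li>James - India</li><li>John - USA</li></ul>"
    print(tuples_found(old_label, new_page))   # [('Jack', 'China'), ('John', 'USA')]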
Our assumption can be formally stated as follows:
Assumption I: Let L(t) be a known label for page P (t) at time t. ∃s∗ > 0, such that n
tuples can be found in L(t + s∗ ), for small n.
It may be noted that the assumption only requires that the labels follow a common
structure over a small interval so that a few tuples can be identified with certainty. However,
for longer time intervals, say 10s∗, L(t) and L(t + 10s∗) may not follow common structures.
The reinduction algorithm requires that a lower bound on s∗ be known. It can be
chosen sufficiently small by observing the modification frequency of the target web site.
1.  Let s be a time interval. Set t = 0.
2.  Set E(t) = the example ⟨P(t), L(t)⟩.
3.  Call InduceW to learn a wrapper w ∈ W using E(t).
4.  Output w as the wrapper at time t.
5.  Set L(t) = w(P(t)), and L(t + s) = w(P(t + s)).
6.  If Verify(w, ⟨P(t + s), L(t + s)⟩) = 'CHANGED' Then
7.  Begin
8.      Set L = the tuples of L(t) found in P(t + s).
9.      Set E(t) = ⟨P(t), L⟩.
10. End
11. Set t = t + s.
12. Go to Step 2.
Table 3.1: Algorithm ReInduceW
3.3.2 Algorithm ReInduceW
We propose a generic wrapper reinduction algorithm for web sites that satisfy Assumption
I. The algorithm, called ReInduceW , is presented in Table 3.1.
ReInduceW is an iterative algorithm. It uses two procedures, InduceW and Verify, to
accomplish reinduction. The initialization is done in Step 1. In Step 2, the example at time
t, E(t), is first set to the example page and label provided at time t=0. Then, InduceW
is called to learn a wrapper in Step 3, which is output as the wrapper at time t in Step 4.
Steps 5-6 correspond to wrapper verification. If a wrapper change has been detected, new
example tuples are generated in Steps 8-9. If s = s∗ /2, then Assumption I ensures at least
n tuples will be found. In Step 11-12, the algorithm increments to the next time step and
goes on to induce a wrapper with the new examples. This process is repeated over intervals
of size s to detect changes and reinduce the wrapper continually.
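A high-level sketch of this loop is given below (fetch_page, induce_w, verify and find_tuples are placeholders standing in for the components of Table 3.1, so their names and signatures are assumptions):

    import time

    def reinduce_w(url, initial_label, s, fetch_page, induce_w, verify, find_tuples):
        # Generic reinduction loop in the spirit of Algorithm ReInduceW (Table 3.1).
        page = fetch_page(url)
        example = (page, initial_label)               # E(t) at t = 0
        while True:
            wrapper = induce_w(example)               # Step 3: learn w from E(t)
            yield wrapper                             # Step 4: output w at time t
            label = wrapper(page)                     # Step 5: L(t) = w(P(t))
            time.sleep(s)
            new_page = fetch_page(url)
            new_label = wrapper(new_page)             #         L(t + s) = w(P(t + s))
            if verify(wrapper, new_page, new_label) == "CHANGED":     # Step 6
                found = find_tuples(label, new_page)  # Step 8: old tuples found in P(t + s)
                example = (new_page, found)           # Step 9: new, possibly incomplete, example
            else:
                example = (new_page, new_label)
            page = new_page                           # Steps 11-12: next interval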
Algorithm ReInduceW crucially depends on the two procedures InduceW and Verify
for successful reinduction. For Verify, which is used to detect wrapper changes, we employ
a statistical method, e.g. [35]. However, the choice is not easy for InduceW because the
wrapper induction in Step 3 has to be performed using a small subset of the tuples in L(t).
This is a problem of inducing wrappers from insufficient examples. Kushmerick[34] has
partially investigated this problem under wrapper corroboration. His approach assumes
two or more example pages along with their labels (possibly incomplete) are available and
makes use of the redundancy in the data to perform induction. However, our case involves
a single page and an incomplete label, and hence his method is not applicable. We call this
problem Incremental Wrapper Induction and propose new algorithms for incrementally
inducing wrappers in the next section.
3.4 Incremental Wrapper Induction
In this section we present wrapper classes and induction algorithms to learn these wrappers.
We first consider the LR wrapper class and present an incremental induction algorithm.
3.4.1 LR Wrapper Class
Let the page P be a string over alphabet Σ. Consider the regular expression

    l1 (˜∗) M1 (˜∗) M2 · · · (˜∗) MK−1 (˜∗) rK        (3.1)
(a) Page PA:
    Jack,China
    John,USA
    Joseph,UK

(b) Corresponding LR wrapper: <B> (˜∗) </B>, <I> (˜∗) </I>
Figure 3.3: Page PA and its corresponding LR Wrapper
where l1, rK and Mi are strings over Σ, and ˜∗ denotes the non-greedy wildcard match .+? (see footnote 1).
An LR wrapper is a procedure that applies this regular expression globally on a page and
returns the pattern matches as a label. For example, a LR wrapper for page PA shown in
Figure 3.3(a) is indicated in Figure 3.3(b).
The wrapper's pattern will match, for example, the fragment <B>Jack</B>, <I>China</I><BR>, with the wildcards capturing 'Jack' and 'China'.
LR can be seen to be similar to LR class[34]. LR is identical to LR when the intra-tuple
separators are equal across all tuples, for example, as in page PA .
This makes LR rather simplistic, but this class is discussed mainly for easier exposition
of ideas. Later we will generalize the ideas to LR.
1. Non-greedy matching attempts to match an asterisk wildcard up until the first character that is not the same as the character immediately following the wildcard. It matches a minimum number of characters before failing. Greedy matching attempts to match the longest string possible. Parentheses (..) match whatever regular expression is inside them, and indicate the start and end of a group; the contents of a group can be retrieved after a match has been performed.
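To make the definition concrete, the following sketch builds the pattern of (3.1) from its delimiters and applies it globally using Python's re module. The HTML for page PA is a reconstruction based on Figure 3.4, and the delimiters chosen are just one valid LR wrapper for that page:

    import re

    def lr_wrapper(l1, mids, rk):
        # l1 (~*) M1 (~*) ... M_{K-1} (~*) rK, with (~*) the non-greedy wildcard .+?
        pattern = re.escape(l1)
        for m in mids:
            pattern += "(.+?)" + re.escape(m)
        pattern += "(.+?)" + re.escape(rk)
        return re.compile(pattern, re.DOTALL)

    # Reconstructed HTML source of page PA (cf. Figure 3.4(a)).
    page_a = ("<HTML><BODY>\n<B>Jack</B>, <I>China</I><BR>\n"
              "<B>John</B>, <I>USA</I><BR>\n<B>Joseph</B>, <I>UK</I><BR>\n"
              "<HR></BODY></HTML>")

    w = lr_wrapper(l1="<B>", mids=["</B>, <I>"], rk="</I>")
    print(w.findall(page_a))   # [('Jack', 'China'), ('John', 'USA'), ('Joseph', 'UK')]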
We first investigate the constraints for an LR wrapper to exist. It will be used to
propose the wrapper induction algorithm. We follow the notation of [33].
Let the page P contain K attributes and M tuples, i.e. the label L has tuples Lm, m =
1, . . . , M, and the size of each Lm is K.
The Am,k values are the values of each attribute in each of page P's tuples. Specifically,
Am,k is the value of the kth attribute of the mth tuple on page P; these are essentially the
text fragments to be extracted from the page.
The Sm,k are the separators between the attribute values in each of Page P ’s tuples.
The four kinds of separators are:
• Page P ’s head, denoted S0,K , the substring of the page prior to the first attribute of
the first tuple.
• Page P ’s tail, denoted SM,K , the substring of the page following the last attribute of
the last tuple.
• The intra-tuple separators that separate the attributes within a single tuple, denoted
Sm,k, the separator between the kth and (k+1)th attributes of the mth tuple.
• The inter-tuple separators that separate the consecutive tuples, denoted Sm,K , the
separator between the mth and (m+1)th tuples of the page P.
We express these variables in terms of indices into the label L. Let bm,k and em,k respectively be the starting and ending locations of the kth attribute in the mth tuple.
Am,k = P[bm,k, em,k]         (the attribute values)
Sm,k = P[em,k, bm,k+1]       (the intra-tuple separators)
Sm,K = P[em,K, bm+1,1]       (the inter-tuple separators)
S0,0 = P[0, b1,1]            (the head)
SM,K = P[eM,K, |P|]          (the tail)
where k = 1, · · · , K, and m = 1, · · · , M . A sample page PA and its corresponding values
are shown in Figure 3.4 (a), (b) and (c).
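A small sketch of how these quantities can be read off a labeled page is shown below (it assumes, purely for illustration, that the label is given as character offsets (bm,k, em,k) into P):

    def split_page(page, offsets):
        # offsets[m][k] = (b, e): start/end index of attribute k of tuple m (0-based).
        M, K = len(offsets), len(offsets[0])
        values = [[page[b:e] for (b, e) in row] for row in offsets]        # A_{m,k}
        head = page[:offsets[0][0][0]]                                     # the head
        tail = page[offsets[M - 1][K - 1][1]:]                             # the tail
        intra = [[page[row[k][1]:row[k + 1][0]] for k in range(K - 1)]
                 for row in offsets]                                       # S_{m,k}
        inter = [page[offsets[m][K - 1][1]:offsets[m + 1][0][0]]
                 for m in range(M - 1)]                                    # S_{m,K}
        return values, head, tail, intra, inter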
We define the constraints for the LR Wrapper to exist:
Constraint C1 (rK): (i) rK is a prefix of Sm,K; (ii) rK is not a substring of Am,K, for m = 1, · · · , M.
Constraint C2 (l1): l1 is a proper suffix of Sm,K, for m = 1, · · · , M.
Constraint C3 (Mk): Mk = Sm,k, for 1 ≤ m ≤ M.
Constraints C1 , C2 and C3 respectively define the validity constraints for rK , l1 and
Mk . C1 specifies that rK must be a prefix of the inter-tuple separators and the tail. C2
specifies that l1 must be a proper suffix of the inter-tuple separators and the head. C3
requires that Mk equal the corresponding intra-tuple separators.
For easier exposition, we illustrate this by translating these constraints for our
example page PA in Figure 3.4(d).
(a) HTML source for page PA (⇓ denotes a line break):
    <HTML><BODY>⇓<B>Jack</B>, <I>China</I><BR>⇓<B>John</B>, <I>USA</I><BR>⇓<B>Joseph</B>, <I>UK</I><BR>⇓<HR></BODY></HTML>

(b) Values for page PA:
          Am,1     Sm,1        Am,2    Sm,2
    m=1   Jack     </B>, <I>   China   </I><BR>⇓<B>
    m=2   John     </B>, <I>   USA     </I><BR>⇓<B>
    m=3   Joseph   </B>, <I>   UK

(c) Head and tail for page PA:
    S0,0 (head): <HTML><BODY>⇓<B>
    S3,2 (tail): </I><BR>⇓<HR></BODY></HTML>

(d) LR constraints for page PA:
    C1: rK should be a prefix of </I><BR>⇓<B> and </I><BR>⇓<HR></BODY></HTML>.
    C2: l1 should be a proper suffix of </I><BR>⇓<B> and <HTML><BODY>⇓<B>.
    C3: M1 = </B>, <I>
Figure 3.4: Illustration of LR Constraints
Lemma 3.4.1 Given page P and label L, there exists an LR wrapper if and only if constraints C1 , C2 and C3 are satisfied.
The proof involves showing that:
i) If all the constraints are satisfied, then the regular expression matches will be correct
ii) If one of the constraints is not satisfied, then the attribute values will be extracted
incorrectly.
Lemma 3.4.1 means that constraints C1 , C2 and C3 are necessary and sufficient for the
delimiters of an LR wrapper to be valid. Hence, an LR wrapper induction algorithm
need only consider delimiters that satisfy C1 , C2 and C3 . We use this idea to present our
induction algorithm, called InduceLR , in Table 3.2 below.
We trace algorithm InduceLR on our example page PA in Table 3.3. The correctness
of the algorithm is proved through the following lemma.
Lemma 3.4.2 Algorithm InduceLR explores all candidate delimiters that satisfy the constraints C1 , C2 and C3 .
Proof: Note that constraint C3 directly specifies the values of Mk . Hence, Step 1 of
InduceLR trivially considers all candidates of Mk satisfying C3 .
We next prove that all candidates of rK as specified by C1 are considered. Since L includes
all tuples, by the construction of SEP in Steps 2-6, there will be M − 1 matches in Steps
1.  Set Mk = Sm,k , 1 ≤ k ≤ K − 1
2.  Set p = longest prefix common to the prefixes of strings P [em,K , |P |].
3.  Set s = longest suffix common to the suffixes of strings P [0, bm,0 ].
4.  If p = s then Set SEP = p
5.  Else Set SEP = p ∗ s
6.  End If
7.  Set MID = (˜∗)M1 (˜∗)M2 · · · MK−1 (˜∗)
8.  Apply the pattern MID.(SEP) on page P.
9.  Set em = index of the mth match.
10. Set eM = index of the end of a subsequent match with (MID).
11. Apply the pattern (MID).SEP on page P.
12. Set bm = index of the mth match.
13. Set rK = longest prefix common to the prefixes of strings P [eM , |P |] satisfying C1 -(i).
14. Set l1 = longest suffix common to the suffixes of strings P [0, bm ] satisfying C2 .
15. Output {l1 , rK , M1 , · · · , MK−1 }.

Table 3.2: Algorithm InduceLR
Step 1.     M1 = </B>, <I>
Step 2.     p = </I><BR>⇓<B>
Step 3.     s = </I><BR>⇓<B>
Steps 4-6.  Since p = s, SEP = </I><BR>⇓<B>
Step 7.     MID = (˜∗) </B>, <I> (˜∗)
Step 8.     Apply the pattern MID.(SEP) on Page PA
Step 9.     em = index of the mth match = offsets just after ‘China’ and ‘USA’
Step 10.    eM = index of the end of the subsequent match with (MID) = offset just after ‘UK’
Step 11.    Apply the pattern (MID).SEP on Page PA
Step 12.    bm = index of the mth match = offsets of ‘J’ in ‘Jack’, ‘John’ and ‘Joseph’
Step 13.    rK = longest common prefix of strings P [em , |P |] satisfying C1 = </I><BR>⇓
Step 14.    l1 = longest common suffix of P [0, bm ] satisfying C2 = ⇓<B>
Step 15.    Wrapper w = ⇓<B> (˜∗) </B>, <I> (˜∗) </I><BR>⇓

Table 3.3: Trace of Algorithm InduceLR for Page PA
8-9 and the M th match found in Step 10. This implies that Step 13 considers all candidates
of rK that satisfy C1 . Similarly, it can be proved that all candidates of l1 as specified by
C2 are considered in Step 14.
By Lemma 3.4.2, InduceLR enumerates all valid delimiters of the wrapper satisfying
constraints C1 ,C2 and C3 . When a valid candidate is found for each delimiter, the algorithm
ends by outputting the learnt wrapper. If the page is LR wrappable, then the learnt wrapper
will be correct by Lemma 3.4.1. Thus, we have
Theorem 3.4.1 Given page P and label L, if P is LR wrappable, then Algorithm InduceLR
will output an LR wrapper consistent with L.
By Theorem 3.4.1, InduceLR will output a correct wrapper for any ⟨P, L⟩, provided an
LR wrapper exists. Note that the LR learning algorithm learnLR [34] can also achieve the
same result. However, a key property of InduceLR is that not all tuples in L are strictly
required (as proved below) to establish consistency. This is a significant difference from
learnLR . The latter will fail if, for example, the first tuple is not provided, because the proper suffix
constraint on l1 will not hold.
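The generalization at the heart of InduceLR (Steps 2-3 and 13-14) amounts to longest-common-prefix and longest-common-suffix computations over the page fragments surrounding the labelled attribute values. The sketch below, which again assumes the reconstructed markup of PA, shows these helpers applied to a label containing only two of the three tuples; the full algorithm additionally prunes the candidates using constraints C1 and C2, so this is only a sketch of the idea.

    def lcp(strings):
        """Longest common prefix of a list of strings (cf. Steps 2 and 13)."""
        prefix = strings[0]
        for s in strings[1:]:
            while not s.startswith(prefix):
                prefix = prefix[:-1]
        return prefix

    def lcs(strings):
        """Longest common suffix of a list of strings (cf. Steps 3 and 14)."""
        return lcp([s[::-1] for s in strings])[::-1]

    page_a = ("<HTML><BODY>\n"
              "<B>Jack</B>, <I>China</I><BR>\n"
              "<B>John</B>, <I>USA</I><BR>\n"
              "<B>Joseph</B>, <I>UK</I><BR>\n"
              "<HR></BODY></HTML>")

    # An incomplete label: only the first and third tuples are provided.
    examples = [("Jack", "China"), ("Joseph", "UK")]
    begins = [page_a.index(name) for name, _ in examples]
    ends = [page_a.index(country) + len(country) for _, country in examples]

    l1_candidate = lcs([page_a[:b] for b in begins])   # '>\n<B>'
    rk_candidate = lcp([page_a[e:] for e in ends])     # '</I><BR>\n<'
    print(repr(l1_candidate), repr(rk_candidate))

If the chosen tuples happened to share content (for instance two names starting with the same letters), the raw common prefix would leak into the attribute values, which is exactly what the checks against C1 and C2 in Steps 13-14 guard against.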
Consider L′ ⊂ L, i.e. L′ = {lm1 , lm2 , · · · , lmM ′ } where M ′ < M . Let p′ be the longest
prefix common to the Smi ,K and s′ be the longest suffix common to the Smi ,0 . Note that both p′
and s′ will be non-empty if an LR wrapper exists. Define
Constraint CS′ : Every inter-tuple separator Sm,K , m = 1, · · · , M − 1, matches the regular
expression
• “p′ ”, if p′ = s′ , or
• “p′ ∗ s′ ”, if p′ ≠ s′
Lemma 3.4.3 Given ⟨P, L′ ⟩ as the example, Theorem 3.4.1 holds when CS′ is satisfied.
Proof: It is sufficient to prove that all candidate delimiters are considered.
Since L′ has at least one tuple, Step 1 trivially considers all candidates of Mk .
With ⟨P, L′ ⟩ as the example, if Sm,K is as in the lemma, then Step 10 will match all inter-tuple separators and hence result in M − 1 matches. This means that Steps 11 and 12 will
find em = em,K . Hence, Step 15 considers all candidates of rK satisfying constraint C1 .
The proof is similar for l1 .
Note that, when L′ = L, constraint CS′ is satisfied, by arguments similar to those in
Lemma 3.4.2. When |L′ | < |L|, CS′ will not hold in general, because a page and example
tuples can be maliciously chosen such that p′ or s′ over-generalizes and at least one
Sm,K does not match the regular expression. Then, InduceLR will fail to consider some
candidate delimiters and will over-generalize: the resulting l1 and rK will be longer than
the correct values, and hence only a subset of the tuples will be extracted. As the size
of the subset L′ increases, the candidates identified asymptotically approach the correct
values. If the tuples are not maliciously chosen, we expect this to happen already for small |L′ |,
because of the structure of LR. Obtaining exact bounds on the sample complexity
would require an analysis of InduceLR under the PAC learning framework, as in [33].
The following subsection extends these results to the full LR class.

3.4.2 LR Wrapper Class

Recall that the wrapper class of the previous subsection assumes that the intra-tuple separators are equal across tuples (Constraint C3 ). Here, we relax this constraint and propose a generalized induction algorithm for the LR class.
An LR wrapper is defined by the regular expression

    l1 (˜∗) r1 ∗ l2 (˜∗) r2 ∗ · · · lK (˜∗) rK                    (3.2)

where lk , rk are strings over Σ, and ˜∗, ∗ respectively denote the non-greedy wildcard matches
(.+?) and (.*?).
The constraints for LR can be defined as follows:
PEOPLE
<B>Jack</B>XXXXX<I>China</I><BR>
<B>John</B>YYYYY<I>USA</I><BR>
<B>Joseph</B>ZZZZZ<I>UK</I><BR>

Figure 3.5: HTML Source for Modified Page PAm .
Constraint C1 (rk ): i) rk is a prefix of Sm,k , and ii) rk is not a substring of Am,k , for
m = 1, · · · , M .
Constraint C2 (lk ): lk is a proper suffix of Sm,k−1 , for m = 1, · · · , M .
The generalized induction algorithm extends InduceLR of Table 3.2. Since there are now two
delimiters to be learned for each intra-tuple region, Step 1 needs to be modified.
The generalized algorithm is presented in Table 3.4 below.
We illustrate the algorithm on a modified version of page PA , obtained by inserting random
characters between the tags (Figure 3.5). Note that there are now K separator patterns
(SEPk ) to find. They are constructed in Steps 1-7. In the second pass, Steps 9-12 find the
endings of the attribute values and Steps 13-16 find the beginnings. Then, the candidate
delimiters are explored in Steps 17-20 to generate the correct wrapper.
The trace of this algorithm is similar to that of InduceLR ; the difference lies in the
determination of the intra-tuple separators, which is Step 1 of the simpler algorithm but
Steps 1-6 of the generalized one. For our example in Figure 3.5, Steps 1-2 will determine
p1 = </B> and s1 = <I>. Based on this, SEP1 in Step 6 = </B> (˜∗) <I>. The rest of the
steps remain the same as in InduceLR .
1.  Set pk = longest prefix common to the prefixes of strings Sm,k .
2.  Set pK = longest prefix common to the prefixes of strings P [em,K , |P |].
3.  Set s1 = longest suffix common to the suffixes of strings P [0, bm,1 ].
4.  Set sk = longest suffix common to the suffixes of strings Sm,k−1 .
5.  If pk = sk then Set SEPk = pk
6.  Else Set SEPk = pk ∗ sk
7.  End If
8.  Set MID = (˜∗)SEP1 (˜∗)SEP2 · · · SEPK−1 (˜∗)
9.  Apply the pattern MID.(SEPK ) on page P.
10. Set em,k = index of the mth match of SEPk .
11. Set eM,k = index of a subsequent match of SEPk with (MID).
12. Set eM,K = index of the end of this match.
13. Apply the pattern (MID).SEP on page P.
14. Set bm = index of the mth match.
15. Set bm,1 = index of the start of the mth match.
16. Set bm,k = index of the end of the mth match of SEPk−1 .
17. Set rk = longest prefix common to the prefixes of strings P [em,k , bm,k+1 ] satisfying C1 -(i).
18. Set rK = longest prefix common to the prefixes of strings P [em,K , |P |] satisfying C1 -(i).
19. Set l1 = longest suffix common to the suffixes of strings P [0, bm,1 ] satisfying C2 .
20. Set lk = longest suffix common to the suffixes of strings P [em,k , bm,k+1 ] satisfying C2 .
21. Output {l1 , r1 , · · · , lK , rK }.

Table 3.4: Algorithm InduceLR
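For the modified page of Figure 3.5 the intra-tuple separators are no longer identical, so a single delimiter M1 does not exist; Steps 1-7 instead generalize each separator into a pattern SEPk. The following small sketch, under the same markup assumptions as before, shows how p1, s1 and SEP1 could be computed; the helper functions and pattern syntax are illustrative, not the thesis code.

    import re

    def lcp(strings):
        prefix = strings[0]
        for s in strings[1:]:
            while not s.startswith(prefix):
                prefix = prefix[:-1]
        return prefix

    def lcs(strings):
        return lcp([s[::-1] for s in strings])[::-1]

    # Intra-tuple separators S_{m,1} of the modified page (assumed markup):
    # random characters were inserted between </B> and <I>.
    separators = ["</B>XXXXX<I>", "</B>YYYYY<I>", "</B>ZZZZZ<I>"]

    p1, s1 = lcp(separators), lcs(separators)          # '</B>' and '<I>'
    if p1 == s1:
        sep1 = re.escape(p1)
    else:
        sep1 = re.escape(p1) + ".*?" + re.escape(s1)   # SEP1 = p1 * s1
    print(sep1)                                        # </B>.*?<I>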
As in Section 3.4.1, we can prove
Theorem 3.4.2 Given page P and label L, if P is LR wrappable, then Algorithm InduceLR
will output an LR wrapper consistent with L.
We next analyze the capability of the algorithm to learn from a subset of tuples.
Consider L′ ⊂ L, i.e. L′ = {lm1 , lm2 , · · · , lmM ′ } where M ′ < M . Let p′k be the longest
prefix common to the Smi ,k and s′k be the longest suffix common to the Smi ,k−1 . Note that p′k and
s′k will be non-empty if an LR wrapper exists. Define
Constraint CS′ : Sm,k , m = 1, · · · , M − 1, matches the regular expression:
• “p′k ”, if p′k = s′k , or
• “p′k ∗ s′k ”, if p′k ≠ s′k
for k = 1, · · · , K − 1.
Lemma 3.4.4 Given ⟨P, L′ ⟩ as the example, Theorem 3.4.2 holds when CS′ is satisfied.
Proof: We prove that all candidate delimiters for r2 are considered. The steps can be
extended to the other rk , k ≠ K, and lk , k ≠ 1. The proof for l1 and rK can be given as in Lemma
3.4.3.
With ⟨P, L′ ⟩ as the example, Steps 1 and 4 find p′2 and s′2 . If Sm,1 is as in the lemma, then
Step 12 will match all Sm,1 , m = 1, · · · , M − 1. Hence, Step 7 considers all candidates of r2
satisfying constraint C1 . By Lemma 3.4.4, we can see that InduceLR is correct asymptotically as |L′ | → |L|. This property enables InduceLR to be used in Step 3 of ReInduceW .
We can thus derive a wrapper reinduction algorithm for LR, ReInduceLR . For Verify,
we use the assumption that layout changes to the page are drastic: when a layout change
occurs, the old wrapper fails completely and does not extract anything from the page,
i.e., it returns null values.
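A schematic Python sketch of this reinduction loop is shown below. Verification is the null-result test just described, and repair re-learns the wrapper from whichever previously extracted tuples can still be located on the changed page; the function names and the induce() callback are illustrative assumptions rather than the thesis implementation.

    import re

    def extract(wrapper_pattern, page):
        """Apply an LR-style wrapper (a regex with one group per attribute)."""
        return re.findall(wrapper_pattern, page, re.DOTALL)

    def verify(wrapper_pattern, page):
        """Under the assumption above, a stale wrapper extracts nothing."""
        return len(extract(wrapper_pattern, page)) > 0

    def reinduce(wrapper_pattern, old_tuples, new_page, induce):
        """If verification fails, re-learn the wrapper from retained tuples."""
        if verify(wrapper_pattern, new_page):
            return wrapper_pattern                     # wrapper still valid
        retained = [t for t in old_tuples
                    if all(value in new_page for value in t)]
        if not retained:
            raise RuntimeError("no examples retained; relabelling is needed")
        return induce(new_page, retained)              # e.g. InduceLR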
The experimental studies for ReInduceLR are provided in the next chapter.
3.4.3 LRRE Wrapper Class

The LR wrapper class can cover up to 54% of web sites [34]. To further improve on the expressiveness of the LR class, we consider a new wrapper class called LRRE , which is defined
below.
Let the page P be a string over alphabet Σ. Consider the regular expression

    l1 (˜∗ !∼ l1 ) r1 ∗ l2 (˜∗) r2 ∗ · · · lK (˜∗) rK

where lk , rk are strings over Σ, and ˜∗, ∗ respectively denote the non-greedy wildcard matches
.+? and .*?. The term (˜∗ !∼ l1 ) means that the first attribute will match only when its
value does not contain l1 as a substring. Though not strictly a regular expression, we will
follow this notation for the sake of convenience. We call this term the l1 constraint.
Definition: An LRRE wrapper is a procedure that applies this regular expression
globally on a page and returns the matched values as a label. Note that an LRRE wrapper
is defined by 2K delimiters l1 , r1 , ..., lK , rK . LRRE is thus similar to the LR class
above, which is also defined by 2K delimiters; the difference is in the l1 constraint.
<B>Jack</B>, <I>China</I><BR>
<B>John</B>, <I>USA</I><BR>
<B>Joseph</B>, <I>UK</I><BR>

(a) Page PA

l1 = <B>,  r1 = </B>, <I>,
l2 = </B>, <I>,  r2 = </I><BR>

(b) LR Wrapper for Page PA

<B> (˜∗ !∼ "<B>") </B>, <I> (˜∗) </I><BR>

(c) LRRE Wrapper for Page PA

Figure 3.6: Page PA and corresponding LR and LRRE wrappers
Given an LR wrapper ⟨l1 , r1 , ..., lK , rK ⟩, we can construct an LRRE wrapper using the
same delimiters. This LRRE wrapper will be correct because l1 satisfies the proper suffix
constraint [34]. Hence, it can be observed that LRRE subsumes LR. For example, consider
the page PA shown in Figure 3.6(a). An LR wrapper [34] for this page is indicated in
Figure 3.6(b). The LRRE wrapper derived using these delimiters is illustrated in Figure 3.6(c),
and it is correct for PA .
In fact, LRRE can handle pages not possible with LR. For example, consider the modified
page shown in Figure 3.7, where a title ‘PEOPLE’ formatted in bold is added to the page.
Because of the proper suffix constraint, it can be shown that no LR wrapper exists for this page (which
<B>PEOPLE</B>
<B>Jack</B>, <I>China</I><BR>
<B>John</B>, <I>USA</I><BR>
<B>Joseph</B>, <I>UK</I><BR>
Figure 3.7: HTML Source for Modified Page PA
is actually in HLRT). However, it can be handled using the same LRRE wrapper used for
PA in Figure 3.6(c). This is possible because the l1 constraint relaxes the proper suffix constraint.
In addition, LRRE can be seen to cover pages not handled by HLRT and OCLR. We
illustrate this aspect by using the example pages discussed in [33]. The pages are listed in
Table 3.5. All the pages consist of 2 attributes (K = 2). The tuples to be extracted are
⟨A11 , A12 ⟩, ⟨A21 , A22 ⟩ and ⟨A31 , A32 ⟩.
Interestingly, all 7 pages in Table 3.5 can be handled by the simple LRRE wrapper
[(˜∗ !∼ “[”)]((˜∗)).
Thus, we conclude that LRRE can handle a rich subclass of HOCLRT, a class observed
Example Pages
1. −[h[A11](A12)[A21](A22)[A31](A32)t
2. [h[A11](A12)h − [A21](A22)h[A21](A22)t
Handled by
HLRT, HOCLRT
HLRT, OCLR,
HOCLRT
3. o[ho[A11](A12)cox[A21](A22)co[A31](A32)c HOCLRT
Not handled by
LR, OCLR
LR
4. ho[A11](A12)cox[A21](A22)co[A31](A32)c
LR,HLRT,
OCLR
HLRT
5. [A11](A12)t[A21](A22)[A31](A32)t
6. x[o[A11](A12)o[A21](A22)ox[A31](A32)
7. [ho[A11](A12)cox[A21](A22)co[A31](A32)c
OCLR, LR,
HOCLRT
LR, OCLR
OCLR
HLRT, HOCLRT
LR, HLRT
HOCLRT
OCLR, HOCLRT LR, HLRT
Table 3.5: Expressiveness of LRRE
1. Set E = the example ⟨P, L⟩.
2. Call InduceLR to learn a wrapper w using E(t).
3. Output w as the wrapper at time t.

Table 3.6: Algorithm InduceLRRE
to handle 57% of the web sites [35]. Note that the pages in Table 3.5 have the intra-tuple
separators equal across the tuples. This property ensures that the l1 constraint correctly
helps LRRE handle such rich layouts. However, when this property does not hold, we cannot
guarantee that pages in, e.g., HLRT can be handled.
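As a quick check of the claim above, the sketch below applies the wrapper [(˜∗ !∼ "[")]((˜∗)) to example page 1 of Table 3.5, emulating the l1 constraint by trimming the first attribute after the fact; the code is only an illustration of the idea, not the thesis implementation.

    import re

    page1 = "-[h[A11](A12)[A21](A22)[A31](A32)t"   # example page 1 of Table 3.5
    raw = re.findall(r"\[(.+?)\]\((.+?)\)", page1)
    print(raw)    # [('h[A11', 'A12'), ('A21', 'A22'), ('A31', 'A32')]

    # Enforce the l1 constraint (l1 = "[") on the first attribute of each tuple:
    fixed = [(a.rsplit("[", 1)[-1], b) for a, b in raw]
    print(fixed)  # [('A11', 'A12'), ('A21', 'A22'), ('A31', 'A32')]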
An LRRE wrapper can be learnt in the same way as an LR wrapper; the difference lies
only in the way the wrapper extracts data from a page. The induction procedure InduceLRRE
is given in Table 3.6.
Let ⟨P, L⟩ be an example. Let Lj , j = 1, · · · , n, be the content tuples of L; each Lj is of
size K. Our idea is to learn w ∈ WLRRE from the Lj such that w(P ) = L.
To illustrate how the l1 constraint is implemented, we present in Table 3.7 how the
LRRE wrapper is used to extract the content of a web page. The algorithm repeatedly scans the page
to find further strings that match the LRRE regular expression. Step 4
extracts the first string matched by the wrapper, which is the first tuple on the
page. In Steps 6-7, we extract each attribute of the tuple. Steps 8-12 of this algorithm
check the l1 constraint. In Step 9, we check whether the first attribute of this tuple,
Am,1 , contains l1 as a substring. If it does, then in Steps 10-12 we keep only the suffix
of Am,1 that follows the match of l1 and set this as the value of Am,1 . In Step 14, we return
the values of all attributes for this tuple. From the end of this tuple, we start scanning for
1.  Set P = input page
2.  Set m = 0
3.  Apply the pattern w on page P
4.  Set matched = first matched pattern.
5.  Set m = m + 1
6.  For i = 1 to K
7.      Am,i = value of wildcards in matched
8.  Let em,K = index of end of Am,K .
9.  If Am,1 matches l1 , then
10.     bl1 = index of match of l1 in Am,1
11.     el1 = index of end of match of l1 in Am,1
12.     Am,1 = substring (Am,1 , el1 , |Am,1 |)
13. End If
14. Output {· · · {Am,1 , Am,2 , Am,3 , · · · Am,K } · · · }
15. Set P = substring (P, em,K , |P |)
16. Go to 3.

Table 3.7: Algorithm ExtractLRRE
more tuples, until no more tuples are found on the page.
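A compact Python sketch of this extraction procedure is given below, applied to the modified page of Figure 3.7 (whose exact markup is assumed). The plain LR pattern would swallow the bold title into the first attribute of the first tuple; the l1-constraint check trims that attribute to the part after the last occurrence of l1, in the spirit of Steps 9-12.

    import re

    def extract_lrre(page, l_1, pattern):
        """Apply the LR regular expression globally and enforce the l1
        constraint on the first attribute of every tuple."""
        tuples = []
        for match in re.finditer(pattern, page, re.DOTALL):
            attrs = list(match.groups())
            if l_1 in attrs[0]:
                # keep the part after the last occurrence of l1, so the
                # extracted value no longer contains l1
                attrs[0] = attrs[0].rsplit(l_1, 1)[1]
            tuples.append(tuple(attrs))
        return tuples

    page = ("<HTML><BODY>\n<B>PEOPLE</B>\n"
            "<B>Jack</B>, <I>China</I><BR>\n"
            "<B>John</B>, <I>USA</I><BR>\n"
            "<B>Joseph</B>, <I>UK</I><BR>\n"
            "<HR></BODY></HTML>")
    lr_pattern = r"<B>(.+?)</B>, <I>(.+?)</I><BR>"

    print(re.findall(lr_pattern, page, re.DOTALL)[0])
    # ('PEOPLE</B>\n<B>Jack', 'China')   -- the plain LR pattern goes wrong
    print(extract_lrre(page, "<B>", lr_pattern))
    # [('Jack', 'China'), ('John', 'USA'), ('Joseph', 'UK')]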
Now that we have described wrapper classes that can be learnt from a few examples, their
induction algorithms can be invoked by the ReInduce function to implement a wrapper reinduction algorithm.
3.5 Summary
In this chapter, we introduced a novel approach for wrapper reinduction from incremental
web pages, whose layouts may change over time. Using the observation that, though the
layout may change drastically and none of the syntactic features may be retained, the page content
usually changes only incrementally provided the time interval is small enough, we presented
a reinduction algorithm which can be used to implement an automatic information extraction
system. The examples have to be provided only once; after that the system extracts
data using wrappers and, if need be, repairs them.
We introduced the LR wrapper class and the corresponding induction algorithms, which learn
from a few examples and can be deployed in the reinduction system. We also introduced
another wrapper class, LRRE , to learn wrappers more expressive than the LR wrapper
class.
In the next chapter we present the experimental evaluation of our algorithms.
Chapter 4
Experiments
This chapter presents the empirical evaluation of the algorithms InduceLR and ReInduceLR
presented in the previous chapter. The experiments for studying the performance of these
proposed algorithms were conducted in two parts. In the first part we study InduceLR .
We evaluated the algorithm in terms of sample cost and induction cost. The sample cost
estimates the number of examples needed for the algorithm to reach high accuracy. The
induction cost measures the time taken by the algorithm as the number of examples is varied.
In the second part of our empirical evaluation we study the performance of ReInduceLR .
All experiments were performed on a SUN UltraSPARC workstation.
No.  Category            Website        Link
1    Shopping            Amazon         http://www.amazon.com
2    Search Engine       Google         http://www.google.com
3    Publication Server  USPTO          http://patft.uspto.gov/netahtml/search-adv.htm
4    Whitepages          Yahoo People   http://people.yahoo.com
5    News                ZDNET          http://www.zdnet.com/

Table 4.1: Websites considered for evaluation of InduceLR
No.  Site          Query                   No. of tuples  Attributes
1    Amazon        Books on Web Mining     10             Title, Author, List Price
2    Google        Query = Web Mining      10             URL, Title, Summary
3    USPTO         Title = Text Mining     23             Patent Number, Title
4    Yahoo People  Last Name = ‘John’      10             Name, Address, Phone Number
5    ZDNET         The news for the day    14             URL, Headline, Summary

Table 4.2: Details of the webpages
4.1 Performance of InduceLR
We considered five categories of websites and picked a representative site from each category,
as listed in Table 4.1. Screenshots from these websites are given in Appendix A. For
these sites, all attributes handled by LR were chosen. The details of the webpages considered
are listed in Table 4.2.
4.1.1 Sample cost
For each website listed in Table 4.2, all tuples present within the page were extracted to
create a database. We randomly selected between 2 and 5 tuples (M ) and passed them to
InduceLR . Once the wrapper was generated, it was used to extract the contents of the
same page, and the performance was measured. To measure the performance, we used the
metrics of precision and recall. Recall (R) is the percentage of the correctly extracted data
items of all the data items that should be extracted. Precision (P) is the percentage of the
correctly extracted data items of all the data items that have been extracted. If there are
10 tuples present on the page and 5 tuples are returned by the wrapper, out of which only 2 are
correct, then the precision is 2/5 and the recall is 2/10. For each value of M , we performed 20
runs to compute average precision and recall. The results for precision and recall are listed
in Table 4.3.
                      Number of examples
                2          3          4          5          10
Site            %P   %R    %P   %R    %P   %R    %P   %R    %P   %R
Amazon          84   78    90   84    95   92    100  100   100  100
Google          90   84    95   92    100  100   100  100   100  100
USPTO           100  90    100  90    100  100   100  100   100  100
Yahoo People    100  100   100  100   100  100   100  100   100  100
ZDNET           100  95    100  100   100  100   100  100   100  100

Table 4.3: Precision and Recall of InduceLR
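For reference, both metrics can be computed directly from the extracted and expected tuple lists; the following small helper reproduces the worked example above (10 expected tuples, 5 extracted, 2 correct), with made-up tuple values.

    def precision_recall(extracted, expected):
        """Precision = correct/extracted, Recall = correct/expected."""
        correct = sum(1 for t in extracted if t in expected)
        precision = correct / len(extracted) if extracted else 0.0
        recall = correct / len(expected) if expected else 0.0
        return precision, recall

    expected = [("name%d" % i, "country%d" % i) for i in range(10)]
    extracted = expected[:2] + [("junk%d" % i, "x") for i in range(3)]
    print(precision_recall(extracted, expected))   # (0.4, 0.2)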
We observe that the precision is above 84% for just two example tuples, though the
recall is a bit lower (as in the case of Amazon). This is because 2 randomly selected tuples
may not have been representative of the entire set. Two tuples can have a common substring,
which can cause the wrapper to get biased. For example, Books on Amazon are classified as
being either ‘Paperback’ or ‘Hardcover’ editions. While learning with two examples, there
is a possibility of generation of a biased wrapper, which will extract only one of the above,
though precision still remains high. However, both precision and recall reach over 92% for 4
tuples and 100% for 5 tuples. Ideally, if four to five examples are retained on the page, then
it is sufficient to reinduce the wrapper correctly most of the time. However, it is important
to note that Amazon is a difficult page to tackle, since it additionally has many extraneous
elements like book excerpts and advertisements within the page. DataProg claims only 70%
accuracy for the learnt wrapper rules on Amazon, and ROADRUNNER was unable to extract
any results from the Amazon Music Bestsellers page, which supports the argument that
content-based features and searching the wrapper space may not be sufficient to handle
real pages effectively. However, it is highly likely that a few examples will be retained in
the page, and the new wrapper can be induced from them.
A similar case occurs with Google: some of the Google URLs are listed with ‘www’ and
some are not, and this causes a bias while training with 2 examples. However, 4 examples
are sufficient to give near-perfect precision and recall.
4.1.2 Induction cost

Since the algorithm has to be deployed in a practical system, it is important that learning
is fast. We performed a detailed study of the time needed to learn the wrapper, varying the
number of examples at each step.
                  Number of examples
Site              2       3       4       5       10
Amazon            0.318   0.296   0.322   0.328   0.377
Google            0.104   0.104   0.100   0.112   0.124
USPTO             0.162   0.168   0.176   0.182   0.232
Y! People         0.542   0.556   0.570   0.592   0.690
ZDNET             0.148   0.158   0.154   0.164   0.182

Table 4.4: Time complexity of InduceLR (induction time in seconds)
From Table 4.4, we see that the algorithm takes less than a second for induction. We
observe that the time complexity is affected by two factors:
1. The number of attributes present in each tuple. Generalizing the prefix and suffix for
each intra-tuple delimiter is time-intensive: after adding each character, the prefix and
suffix routines must check it against the rest of the elements until a mismatch occurs.
So if the intra-tuple and inter-tuple delimiters are long matching strings, the
generalization step is slower and hence the induction is slower.
2. Locating the examples on the page. The results might suggest that this is an
expensive step, but that seems unlikely since it uses a simple regular
expression match. In the case of Amazon, the time taken to learn
from 3 examples is less than that for 2 examples. This may be due
to the biasing discussed earlier: Amazon gives low precision and recall when learning
from 2 examples, which means the algorithm generalizes using the wrong example
occurrences it locates.
4.2 Performance of ReInduceLR
For evaluating reinduction, we considered the Whitepages domain. We extracted examples
from the Yahoo People Search site. To observe a sufficiently large number of page changes, we
simulated a dynamic People Search web site as follows. We grabbed the templates from
four other popular Whitepages sites listed in Table 4.5.
We separated the top and the bottom of the page. The top of the page is the part
before the tuple listing begins and bottom is the part after the last tuple is listed. We
also extracted the template by which each tuple is formatted. At each iteration we used
Website                           Link
People Search (Yahoo)             http://people.yahoo.com
WhoWhere (Lycos)                  http://whowhere.com
Switchboard (Infospace)           http://switchboard.com
Whitepages.com (W3 Data, Inc)     http://whitepages.com
Anywho Online Directory (AT&T)    http://anywho.com

Table 4.5: White Pages Websites considered for evaluation of ReInduceLR
Layout Changes   Precision %   Recall %   Time (s)
100              100.00        98.90      1.0937
200              100.00        99.60      1.1029
300               98.36        99.60      1.1265
400               99.25        97.41      1.1768
500               99.12        96.34      1.1832

Table 4.6: Performance of ReInduceLR
these templates to modify the layout randomly. The top and bottom are treated as a pair
and always used together; the attribute formatting, however, could vary. For example, the
Yahoo “top” and “bottom” could be used with the tuple formatting of WhoWhere.com.
With 5 top/bottom templates and 5 tuple formats, there are 25 possible layouts and hence
625 possible layout transitions. We also accumulated data from the Yahoo site and used it
to randomly amend the tuples in the page.
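The simulation can be sketched as follows; the template strings, field names and the two sample layouts are purely illustrative stand-ins for the harvested templates, not the actual ones used in the experiments.

    import random

    # Illustrative top/bottom templates and tuple formats (assumptions).
    TOP_BOTTOM = [("<html><body><h1>Results</h1>", "</body></html>"),
                  ("<HTML><TABLE>", "</TABLE></HTML>")]
    TUPLE_FORMATS = ["<b>{name}</b>, {address}, {phone}<br>",
                     "<tr><td>{name}</td><td>{address}</td><td>{phone}</td></tr>"]

    def random_layout(tuples):
        """Render the same tuples under a randomly chosen layout."""
        top, bottom = random.choice(TOP_BOTTOM)
        row = random.choice(TUPLE_FORMATS)
        body = "\n".join(row.format(**t) for t in tuples)
        return top + "\n" + body + "\n" + bottom

    people = [{"name": "John Smith", "address": "12 Elm St", "phone": "555-0100"}]
    print(random_layout(people))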
ReInduceLR was trained at t=0. Then the performance of the wrappers generated was
evaluated. The metrics chosen were precision, recall and time complexity. Precision and
Recall retain their previous definition, and are measured when the new Wrapper is applied
to the changed page. The results are given in Table 4.6.
Over 100-500 layout changes, we observe that the algorithm performs at near perfect
precision and recall, taking a little over a second for each reinduction step.
Algorithm    % Precision   % Recall
SG-WRAM      89.5          90.5
DataProg     90            80

Table 4.7: Average Precision and Recall for Existing Approaches
For comparison, we provide the precision and recall of DataProg and SG-WRAM in
Table 4.7. Though these results were obtained using a different corpus, we observe that
none of these systems achieve both precision and recall as high as ours. Thus, we conclude
that our method can be more effective for Wrapper Reinduction under common layout
changes.
(The precision and recall values for ROADRUNNER are not available.)
Chapter 5
Conclusions
In this thesis we investigated wrapper induction from web sites whose layout may change
over time. We formulated the reinduction problem and identified that wrapper induction
from an incomplete label is a key problem to be solved. We proposed a novel algorithm for
incrementally inducing LR wrappers and showed that this algorithm asymptotically identifies the correct wrapper as the number of tuples is increased. This property was used to
propose a LR wrapper reinduction algorithm. This algorithm requires examples to be provided exactly once and thereafter the algorithm can detect the layout changes and reinduce
wrappers automatically, so long as the wrapper changes are in LR. In experimental studies,
we observe that the reinduction algorithm is able to achieve near perfect performance. We
also introduced a new class of Wrappers called LRRE to learn wrappers more expressive
than the LR wrapper class.
The contributions of this thesis can be summarized as follows:
1. We identified the problem of wrapper induction from an incomplete set of examples as a key
step to be solved in handling wrapper reinduction from incremental web pages.
2. We proposed novel algorithms for incrementally inducing wrappers. These algorithms
can learn from as few as two examples, are efficient and are independent of schema,
content based patterns or the tag structure of the page. We also showed that the
algorithm asymptotically identifies the correct wrapper as the number of examples is
increased.
3. Based on our induction algorithms, we developed a practical automatic reinduction
system. This system needs a small number of examples to be provided initially,
and after that the system verifies and repairs wrappers automatically when the layout
changes occur. In comparison to the existing approaches, our algorithm is more efficient
and eliminates the need to search large wrapper spaces.
Our work is based on the motivation that the key to effective learning is to bias the
learning algorithm [46]. However, this is also a limitation. The LR class is capable of covering
only 53% of common layouts [34]. It would be interesting to extend our approach to handle
richer classes such as HOCLRT, and nested wrapper classes like N-LR.
Appendix A
Websites considered for Evaluation
of InduceLR
Figure A.1: Screenshot from Amazon.com
Figure A.2: Screenshot from Google.com
Figure A.3: Screenshot from uspto.gov
Figure A.4: Screenshot from Yahoo People Search
Figure A.5: Screenshot from ZDNet.com
Appendix B
Regular Expression Syntax
A regular expression (or RE) specifies a set of strings that matches it. Regular expressions
can contain both special and ordinary characters. Special characters either stand for classes
of ordinary characters, or affect how the regular expressions around them are interpreted.
Some important special characters are:
“.” (Dot.) In the default mode, this matches any character except a newline.
“*” Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many
repetitions as are possible. ab* will match ’a’, ’ab’, or ’a’ followed by any number of
’b’s. a.*b will match ‘ab’, ‘acb’, ‘ad3b’, etc.
“+” Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will
match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
“?” Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will
match either ‘a’ or ‘ab’.
The “*”, “+”, and “?” qualifiers described above are all greedy qualifiers: they match as
much text as possible. Sometimes this behaviour is not desired; if the RE <.*> is
matched against ‘<H1>title</H1>’, it will match the entire string, and not just ‘<H1>’.
Adding “?” after the qualifier makes it perform the match in a non-greedy or minimal fashion; as few characters as possible will be matched. Using <.*?> in the previous
expression will match only ‘<H1>’. “+?” and “??” are the other non-greedy (or lazy) qualifiers.
(...) matches whatever regular expression is inside the parentheses, and indicates the start
and end of a group; the contents of a group can be retrieved after a match has been
performed. (.∗?) will help extract the value of the wildcard as a group.
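The following short Python session illustrates the greedy versus non-greedy behaviour and group capture described above.

    import re

    text = "<H1>title</H1>"
    print(re.search(r"<.*>", text).group())    # '<H1>title</H1>'  (greedy)
    print(re.search(r"<.*?>", text).group())   # '<H1>'            (non-greedy)
    print(re.findall(r"<(.*?)>", text))        # ['H1', '/H1']     (group capture)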
Bibliography
[1] S. Abiteboul. Querying semi-structured data. In International Conference on Database
Theory, pages 1–18, 1997.
[2] J. L. Ambite, G. Barish, C. A. Knoblock, M. Muslea, J. Oh, and S. Minton. Getting
from here to there: interactive planning and agent execution for optimizing travel.
In Eighteenth national conference on Artificial intelligence, pages 862–869, Edmonton,
Alberta, Canada, 2002. American Association for Artificial Intelligence.
[3] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In
Proceedings of the 2003 ACM SIGMOD international conference on on Management
of data, pages 337–348, San Diego, California, 2003. ACM Press.
[4] Y. Arens and C. Knoblock. SIMS: Retrieving and integrating information from multiple sources. In Proceedings of the 1993 ACM SIGMOD International Conference on
Management of Data, pages 562–563, Washington, DC, 1993.
[5] A. Sahuguet and F. Azavant. Web Ecology: Recycling HTML pages as XML
documents using W4F. In WebDB’99, 1999.
[6] N. Ashish and C. A. Knoblock.
Wrapper generation for semi-structured internet
sources. SIGMOD Rec., 26(4):8–15, 1997.
[7] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the web:
Going back and forth. In Workshop on Management of Semistructured Data, 1997.
[8] M. Bauer, D. Dengler, and G. Paul. Instructible information agents for web mining.
In Intelligent User Interfaces, pages 21–28, 2000.
[9] D. Beneventano, S. Bergamaschi, S. Castano, A. Corni, R. Guidetti, G. Malvezzi,
M. Melchiori, and M. Vincini. Information integration: The MOMIS project demonstration. In Proceedings of 26th International Conference on Very Large Data Bases,
pages 611–614, Cairo, Egypt, 2000. Morgan Kaufmann.
[10] G. Beuster, B. Thomas, and C. Wolff. MIA - a ubiquitous multi-agent web information
system. In International ICSC Symposium on Multi-Agents and Mobile Agents in
Virtual Organizations and E-Commerce, 2000.
[11] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. Computer Networks and ISDN Systems,
28(1–2):119–125, 1995.
[12] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. In Working Notes of AAAI Spring Symposium on Applying Machine
Learning to Discourse Processing, pages 6–11, Menlo Park, CA, 1998. AAAI Press.
[13] M. J. Carey, L. M. Haas, P. M. Schwarz, M. Arya, W. F. Cody, R. Fagin, M. Flickner,
A. W. Luniewski, W. Niblack, D. Petkovic, J. Thomas, J. H. Williams, and E. L. Wimmers. Towards heterogeneous multimedia information systems: the garlic approach. In
Proceedings of the 5th International Workshop on Research Issues in Data Engineering-
Distributed Object Management (RIDE-DOM’95), page 124. IEEE Computer Society,
1995.
[14] C.-H. Chang and S.-C. Lui. IEPAD: information extraction based on pattern discovery.
In Proceedings of the tenth international conference on World Wide Web, pages 681–
688, Hong Kong, 2001. ACM Press.
[15] B. Chidlovskii. Automatic repairing of web wrappers. In Proceedings of the third international workshop on Web information and data management, pages 24–30, Atlanta,
Georgia, USA, 2001. ACM Press.
[16] B. Chidlovskii, U. Borghoff, and P. Chevalier. Towards sophisticated wrapping of web-based information repositories. In Proceedings of 5th International RIAO Conference,
pages 123–135, 1997.
[17] R. Cooley, J. Srivastava, and B. Mobasher. Web mining: Information and pattern discovery on the world wide web. In Proceedings of the 9th IEEE International Conference
on Tools with Artificial Intelligence (ICTAI’97), November 1997.
[18] V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of International Conference on Very Large
Data Bases (VLDB 01), pages 109–118. Morgan Kaufman, 2001.
[19] R. B. Doorenbos, O. Etzioni, and D. S. Weld. A scalable comparison-shopping agent
for the world-wide web. In W. L. Johnson and B. Hayes-Roth, editors, Proceedings of
the First International Conference on Autonomous Agents (Agents’97), pages 39–48,
Marina del Rey, CA, USA, 1997. ACM Press.
[20] O. Etzioni. The world-wide web: Quagmire or gold mine? Communications of the
ACM, 39(11):65–68, 1996.
[21] O. Etzioni and D. S. Weld. Intelligent agents on the internet: Fact, fiction, and forecast.
IEEE Expert, 10(3):44–49, 1995.
[22] M. Frank, M. Muslea, J. Oh, S. Minton, and C. Knoblock. An intelligent user interface
for mixed-initiative multi-source travel planning. In Proceedings of the 6th international
conference on Intelligent user interfaces, pages 85–86, Santa Fe, New Mexico, United
States, 2001. ACM Press.
[23] X. Gao, M. Zhang, and P. Andreae. Learning information extraction patterns from
tabular web pages without manual labelling. In Web Intelligence. IEEE Computer
Society, 2003.
[24] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. D.
Ullman, V. Vassalos, and J. Widom. The TSIMMIS approach to mediation: Data
models and languages. Journal of Intelligent Information Systems, 8(2):117–132, 1997.
[25] M. R. Genesereth, A. M. Keller, and O. M. Duschka. Infomaster: an information
integration system. In Proceedings of the SIGMOD international conference on Management of data, pages 539–542, Tucson, Arizona, United States, 1997. ACM Press.
[26] K. Hammond, R. Burke, C. Martin, and S. Lytinen. FAQ finder: a case-based approach to knowledge navigation. In Proceedings of the 11th Conference on Artificial
Intelligence for Applications, pages 80–86. IEEE Computer Society, 1995.
[27] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann,
2000.
[28] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured
data extraction from the web. Information Systems, 23(8):521–538, 1998.
[29] C. A. Knoblock, K. Lerman, S. Minton, and I. Muslea. Accurately and reliably extracting data from the web: a machine learning approach. Intelligent exploration of
the web, pages 275–287, 2003.
[30] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. G.
Philpot, and S. Tejada. Modeling web sources for information integration. In Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, pages 211–218, Madison, Wisconsin, United States,
1998. American Association for Artificial Intelligence.
[31] R. Kosala and H. Blockeel. Web mining research: A survey. SIGKDD Explorations,
2(1):1–15, 2000.
[32] S. Kuhlins and R. Tredwell. Toolkits for generating wrappers – a survey of software
toolkits for automated data extraction from web sites. In M. Aksit, M. Mezini, and
R. Unland, editors, Objects, Components, Architectures, Services, and Applications for
a Networked World, volume 2591 of Lecture Notes in Computer Science (LNCS), pages
184–198, 2003.
[33] N. Kushmerick. Wrapper induction for information extraction. PhD thesis, University
of Washington, 1997.
[34] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118:15–68, 2000.
[35] N. Kushmerick. Wrapper verification. World Wide Web, 3(2):79–94, 2000.
[36] N. Kushmerick and B. Thomas. Adaptive information extraction: A core technology
for information agents. Intelligent Information Agents R&D in Europe: An AgentLink
perspective, 2002.
[37] C. T. Kwok and D. S. Weld. Planning to gather information. In 13th AAAI National
Conference on Artificial Intelligence, pages 32–39, Portland, Oregon, 1996. AAAI /
MIT Press.
[38] A. H. F. Laender, B. Ribeiro-Neto, and A. S. da Silva. DEByE - data extraction by
example. Data and Knowledge Engineering, 40(2):121–154, 2002.
[39] A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira. A brief survey
of web data extraction tools. SIGMOD, 31(2):84–93, 2002.
[40] S. Lawrence and C. L. Giles. Searching the web: General and scientific information
access. IEEE Communications, 37(1):116–122, 1999.
[41] K. Lerman and S. Minton. Learning the common structure of data. In Proceedings of
the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference
on Innovative Applications of Artificial Intelligence, pages 609–614. AAAI Press / The
MIT Press, 2000.
[42] K. Lerman, S. Minton, and C. Knoblock. Wrapper maintenance: A machine learning
approach. Journal of Artificial Intelligence Research, 18:149–181, 2003.
[43] S. Luke and J. A. Hendler. Web agents that work. IEEE Multimedia, 4(3):76–80, 1997.
[44] G. Mecca, P. Atzeni, A. Masci, P. Merialdo, and G. Sindoni. The Araneus web-base
management system. In SIGMOD Conference, pages 544–546, 1998.
[45] X. Meng, D. Hu, and C. Li. Schema-guided wrapper maintenance for web-data extraction. In Proceedings of the International Workshop on Web Information and Data Management (WIDM ’03), pages 1–8, 2003.
[46] T. Mitchell. The need for biases in learning generalizations. In J. Shavlik and T. Dietterich, editors, Readings in Machine Learning. Morgan Kaufman, 1990.
[47] D. Mladenic. Text-learning and related intelligent agents: A survey. IEEE Intelligent
Systems, 14(4):44–54, 1999.
[48] R. Mohapatra and K. Rajaraman. Wrapper induction under web layout changes. In
Proceedings of International Conference on Internet Computing, pages 102–108, Las
Vegas, Nevada, USA, 2004.
[49] R. Mohapatra, K. Rajaraman, and S. Y. Sung. Efficient wrapper reinduction from
dynamic web sources. In Proceedings of IEEE/WIC/ACM International Conference
on Web Intelligence, pages 391–397, Beijing,China, 2004.
[50] I. Muslea, S. Minton, and C. Knoblock. STALKER: Learning extraction rules for
semistructured text. In Proceedings of AAAI-98 Workshop on AI and Information
Integration, Technical Report WS-98-01, Menlo Park, California, 1998. AAAI Press.
[51] I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction.
In O. Etzioni, J. P. Müller, and J. M. Bradshaw, editors, Proceedings of the Third
International Conference on Autonomous Agents (Agents’99), pages 190–197, Seattle,
WA, USA, 1999. ACM Press.
[52] I. Muslea, S. Minton, and C. Knoblock. Hierarchical wrapper induction for semistruc-
tured sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93–114,
2001.
[53] T. Payne, R. Singh, and K. Sycara. RCal: A case study on semantic web agents.
In The First International Joint Conference on Autonomous Agents and Multi-Agent
Systems, 2002.
[54] M. Perkowitz and O. Etzioni. Category translation: Learning to understand information on the internet. In International Joint Conference on Artificial Intelligence,
IJCAI-95, pages 930–938, Montreal, Canada, 1995.
[55] D. Smith and M. Lopez. Information extraction from semi-structured documents. In
Proceedings of Workshop on Management of Semi-structured Data, Tucson, 1997.
[56] S. Soderland. Learning information extraction rules for semi-structured and free text.
Machine Learning, 34(1-3):233–272, 1999.