The World-Wide Web: Quagmire or Gold Mine?

Oren Etzioni

Is information on the Web sufficiently structured to facilitate effective Web mining?
Skeptics believe the Web is too unstructured for Web mining to succeed. Indeed, data mining has been applied traditionally to databases, yet much of the information on the Web lies buried in documents designed for human consumption such as home pages or product catalogs. Furthermore, much of the information on the Web is presented in natural-language text with no machine-readable semantics; HTML annotations structure the display of Web pages, but provide little insight into their content.

Some have advocated transforming the Web into a massive layered database to facilitate data mining [12], but the Web is too dynamic and chaotic to be tamed in this manner. Others have attempted to hand code site-specific "wrappers" that facilitate the extraction of information from individual Web resources (e.g., [8]). Hand coding is convenient but cannot keep up with the explosive growth of the Web. As an alternative, this article argues for the structured Web hypothesis: Information on the Web is sufficiently structured to facilitate effective Web mining.

Examples of Web structure include linguistic and typographic conventions, HTML annotations (e.g., <title>), classes of semi-structured documents (e.g., product catalogs), Web indices and directories, and much more. To support the structured Web hypothesis, this article will survey preliminary Web mining successes and suggest directions for future work.

Web mining may be organized into the following subtasks:

• Resource discovery. Locating unfamiliar documents and services on the Web.
• Information extraction. Automatically extracting specific information from newly discovered Web resources.
• Generalization. Uncovering general patterns at individual Web sites and across multiple sites.

Resource Discovery

Web resources fall into two classes: documents and services. The bulk of the work on resource discovery focuses on the automatic creation of searchable indices of Web documents. The most popular indices have been created by Web robots such as WebCrawler and AltaVista, which scan millions of Web documents and store an index of the words in the documents. A person can then ask for all the indexed documents that contain certain keywords. There are over a dozen different indices currently in active use, each with a unique interface and a database covering a different fraction of the Web. As a result, people are forced to repeatedly try and retry their queries across different indices. Furthermore, the indices return many responses that are irrelevant, outdated, or unavailable, forcing the person to manually sift through the responses searching for useful information.

MetaCrawler (http://www.metacrawler.com) represents the next level in the information food chain by providing a single, unified interface for Web document searching [4]. MetaCrawler's expressive query language allows searching for phrases and restricting the search by geographic region or by Internet domain (e.g., .gov). MetaCrawler posts keyword queries to nine searchable indices in parallel; it then collates and prunes the responses returned, aiming to provide users with a manageable amount of high-quality information. Thus, instead of tackling the Web directly, MetaCrawler mines robot-created searchable indices.
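The fan-out-and-collate idea at the heart of a meta-searcher can be sketched in a few lines of Python. The sketch below is illustrative only: the index names, the query_index stub, the canned results, and the scoring threshold are invented stand-ins, not MetaCrawler's actual interface or code.

from concurrent.futures import ThreadPoolExecutor

# Invented stand-ins for real searchable indices; a real meta-searcher
# would issue an HTTP query to each index and parse its result page.
CANNED_RESULTS = {
    "indexA": [("http://a.example/review", "Encarta review", 0.9)],
    "indexB": [("http://a.example/review", "Encarta review", 0.7),
               ("http://b.example/old",    "Stale page",     0.2)],
    "indexC": [("http://c.example/home",   "Home page",      0.6)],
}

def query_index(index_name, keywords):
    """Post `keywords` to one index and return (url, title, score) hits."""
    return CANNED_RESULTS.get(index_name, [])

def metasearch(keywords, indices=CANNED_RESULTS.keys()):
    # Fan the query out to every index in parallel.
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda ix: query_index(ix, keywords), indices)

    # Collate: merge hits by URL, keeping the best score seen for each.
    merged = {}
    for hits in result_lists:
        for url, title, score in hits:
            if url not in merged or score > merged[url][1]:
                merged[url] = (title, score)

    # Prune: drop low-scoring hits and return the rest, best first.
    ranked = sorted(merged.items(), key=lambda item: item[1][1], reverse=True)
    return [(url, title) for url, (title, score) in ranked if score > 0.5]

print(metasearch(["Encarta", "review"]))

A real meta-searcher's pruning step would also discard duplicates, dead links, and outdated pages returned by the underlying indices, which is where much of the added value lies.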
Future resource discovery systems will make use of automatic text categorization technology to classify Web documents into categories. This technology could facilitate the automatic construction of Web directories such as Yahoo by discovering documents that fit Yahoo categories. Alternatively, the technology could be used to filter the results of queries to searchable indices. For example, in response to a query such as "Find me product reviews of Encarta," a discovery system could take documents containing the word "Encarta" found by querying searchable indices, and identify the subset that corresponds to product reviews.

Information Extraction

Once a Web resource has been discovered, the challenge is to automatically extract information from it. The bulk of today's information-extraction systems identify a fixed set of Web resources and rely on hand-coded "wrappers" to access the resource and parse its response. To scale with the growth of the Web, miners need to dynamically extract information from unfamiliar resources, thereby eliminating or reducing the need for hand coding. We now survey several such systems.

The Harvest system relies on models of semi-structured documents to improve its ability to extract information [1]. For example, it knows how to find author and title information in LaTeX documents and how to strip position information from PostScript files. In one demonstration, Harvest created a directory of toll-free numbers by extracting them from a large set of Web documents (see http://harvest.cs.colorado.edu/harvest/demobrokers.html). Harvest neither discovers new documents nor learns new models of document structure. However, Harvest easily handles new documents of a familiar type.

FAQ-Finder extracts answers to frequently asked questions (FAQs) from FAQ files available on the Web [6, 11]. Like Harvest, FAQ-Finder relies on a model of document structure. A user poses a question in natural language, and the text of the question is used to search the FAQ files for a matching question. FAQ-Finder then returns the answer associated with the matching question. Because of the semi-structured nature of the files, and because the number of files is much smaller than the number of documents on the Web, FAQ-Finder has the potential to return higher-quality information than general-purpose searchable indices.

Both Harvest and FAQ-Finder have two key limitations. First, both systems focus exclusively on Web documents and ignore services (the same holds true for Web indices as well). Second, both Harvest and FAQ-Finder rely on a pre-specified description of certain fixed classes of Web documents. In contrast, the Internet Learning Agent (ILA) and Shopbot are two Web miners that rely on a combination of test queries and domain-specific knowledge to automatically learn descriptions of Web services (e.g., searchable product catalogs, personnel directories, and more). The learned descriptions can be used to enable automatic information extraction by intelligent agents such as the Internet Softbot [5].

ILA learns to extract information from unfamiliar resources by querying them with familiar objects and matching the output returned against knowledge about the query objects [10]. For example, ILA queries the University of Washington personnel directory with the entry "Etzioni" and recognizes the third output token (685-3035) as his phone number. Based on this observation, ILA might hypothesize that the third token output by the directory is the phone number of the person mentioned in the query. This learning process has a number of subtleties. For example, the output token "oren" could be either Etzioni's userid or first name. To discriminate between these two competing hypotheses, ILA will attempt to query with someone whose userid is different from her first name. In the experiments reported in [10], ILA successfully learned to extract information such as phone numbers and email addresses from the Internet server "Whois" and from the personnel directories of a dozen universities.
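The query-and-match idea admits a compact sketch as well. The fragment below is a simplified illustration under invented data rather than ILA's actual algorithm: it assumes a small table of facts about familiar people and a directory that answers every query with a fixed-format tuple of tokens, and it retains only the (field, position) hypotheses consistent with every query made so far. The "Lee" entry and all values other than Etzioni's phone number are fabricated for the example.

# Facts about familiar objects (people).  The "Etzioni" entry follows the
# example in the text; the "Lee" entry, whose userid differs from her
# first name, is invented and resolves the ambiguity a single query leaves.
KNOWN = {
    "Etzioni": {"first_name": "oren", "userid": "oren", "phone": "685-3035"},
    "Lee":     {"first_name": "ann",  "userid": "alee", "phone": "685-9999"},
}

def query_directory(last_name):
    # Stand-in for the unfamiliar personnel directory; it happens to
    # return (first name, last name, phone, userid) for every query.
    CANNED = {
        "Etzioni": ("oren", "etzioni", "685-3035", "oren"),
        "Lee":     ("ann",  "lee",     "685-9999", "alee"),
    }
    return CANNED[last_name]

def learn_field_positions():
    hypotheses = None   # set of (field, output position) pairs still viable
    for name, facts in KNOWN.items():
        tokens = query_directory(name)
        # Hypotheses consistent with this one query.
        consistent = {(field, i)
                      for field, value in facts.items()
                      for i, token in enumerate(tokens)
                      if token == value}
        hypotheses = consistent if hypotheses is None else hypotheses & consistent
    return hypotheses

# After the Etzioni query alone, "oren" could be the userid or the first
# name; after the Lee query only one position per field survives.
print(learn_field_positions())
# e.g. {('first_name', 0), ('phone', 2), ('userid', 3)} (set order may vary)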
Shopbot learns to extract product information from Web vendors [3]. Shopbot borrows from ILA the idea of learning by querying with familiar objects. However, Shopbot tackles a more ambitious task. Shopbot takes as input the address of a store's home page as well as knowledge about a product domain (e.g., software), and learns how to shop at the store. Specifically, Shopbot searches the store's Web site to find the store's searchable product catalog, learns the format in which product descriptions are presented, and from these descriptions learns to extract product attributes such as price. Shopbot learns by querying the store for information on popular products, and analyzing the store's responses. In the software shopping domain, Shopbot was given the home pages for 12 online software vendors. Shopbot learned to extract product information from each of the stores, including the product's operating system (Mac or Windows), and more. In a preliminary user study, Shopbot users were able to shop four times faster (and find better prices) than users relying only on a Web browser [3]. Current work on Shopbot explores the problem of autonomously discovering vendor home pages.

Generalization

Once we have automated the discovery and extraction of information from Web sites, the natural next step is to attempt to generalize from our experience. Yet, virtually all machine learning systems deployed on the Web (see [7] for some examples) learn about their user's interests, instead of learning about the Web itself. A major obstacle when learning about the Web is the labeling problem: data is abundant on the Web, but it is unlabeled. Many data mining techniques require inputs labeled as positive (or negative) examples of some concept. For example, it is relatively straightforward to take a large set of Web pages labeled as positive and negative examples of the concept "home page" and derive a classifier that predicts whether any given Web page is a home page or not; unfortunately, Web pages are unlabeled.

Techniques such as uncertainty sampling [9] reduce the amount of labeled data needed, but do not eliminate the labeling problem. Clustering techniques do not require labeled inputs, and have been applied successfully to large collections of documents (e.g., [2]). Indeed, the Web offers fertile ground for document clustering research. However, because clustering techniques take weaker (unlabeled) inputs than other data mining techniques, they produce weaker (unlabeled) output.
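To make the labeling problem concrete, here is a minimal sketch of the kind of "home page" classifier mentioned above. The training pages, features, and scoring rule are all invented; the point is only that the technique presupposes pages labeled as positive or negative examples, and it is exactly those labels that the Web does not supply.

from collections import Counter

# Invented labeled training data: (page text, is it a home page?).
# Producing labels like these by hand, at Web scale, is the labeling problem.
LABELED_PAGES = [
    ("welcome to my home page research interests publications", True),
    ("personal home page of a graduate student hobbies links",  True),
    ("product catalog order form shipping prices",              False),
    ("conference call for papers submission deadline",          False),
]

def train(pages):
    """Count word occurrences separately for positive and negative pages."""
    pos, neg = Counter(), Counter()
    for text, is_home in pages:
        (pos if is_home else neg).update(text.split())
    return pos, neg

def is_home_page(text, pos, neg):
    """Crude likelihood-ratio heuristic over smoothed word counts."""
    words = text.split()
    score = sum((pos[w] + 1) / (neg[w] + 1) for w in words)
    return score / max(len(words), 1) > 1.0

pos, neg = train(LABELED_PAGES)
print(is_home_page("home page of oren etzioni research publications", pos, neg))

The approach described next sidesteps this bottleneck by obtaining labels from users as a side effect of normal use.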
We consider an approach to solving the labeling problem that relies on the observation that the Web is much more than a collection of linked documents. The Web is an interactive medium visited by millions of people each day. Ahoy! (http://www.cs.washington.edu/research/ahoy) represents an attempt to harness this source of power to solve the labeling problem. Ahoy! takes as input a person's name and affiliation and attempts to locate the person's home page. Ahoy! queries MetaCrawler and uses knowledge of institutions and home pages to filter MetaCrawler's output. Since Ahoy!'s filtering algorithm is heuristic, it asks its users to label its answers as correct or incorrect. Ahoy! relies on its initial power to draw numerous users to it and to solicit their feedback; it then uses this feedback to solve the labeling problem, make generalizations about the Web, and improve its performance. By relying on feedback from multiple users, Ahoy! rapidly collects the data it needs to learn; systems focused on learning an individual user's taste do not have this luxury. Finally, note that Ahoy!'s bootstrapping architecture is not restricted to learning about home pages; user feedback may be harnessed to provide training data in a variety of Web domains.
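A rough sketch of this bootstrapping loop, with invented names, data structures, and inputs (this is not Ahoy!'s implementation), might look as follows: heuristic filtering produces a candidate answer, each user verdict becomes a labeled example at no extra cost, and the accumulated labels are periodically mined to refine the heuristics.

labeled_examples = []   # (url, person, affiliation, user_says_correct) tuples

def heuristic_filter(candidates, person, affiliation):
    """Stand-in for knowledge of institutions and home pages: keep
    candidate URLs that mention the person's last name.  A fuller
    heuristic would also use the affiliation's institution."""
    last = person.lower().split()[-1]
    return [url for url in candidates if last in url.lower()]

def answer_query(person, affiliation, metasearch_hits):
    guesses = heuristic_filter(metasearch_hits, person, affiliation)
    return guesses[0] if guesses else None

def record_feedback(url, person, affiliation, user_says_correct):
    # A labeled example, obtained simply by asking the user who is
    # already looking at the answer.
    labeled_examples.append((url, person, affiliation, user_says_correct))

def refine_heuristics():
    """Periodically generalize from the accumulated labels, e.g., learn
    which URL patterns (such as '/~userid/') tend to mark home pages."""
    return [ex for ex in labeled_examples if ex[-1]]   # rule induction omitted

# One round of the loop, with made-up inputs.
hit = answer_query("Oren Etzioni", "University of Washington",
                   ["http://www.cs.washington.edu/homes/etzioni/",
                    "http://example.com/encarta-review"])
record_feedback(hit, "Oren Etzioni", "University of Washington", True)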
Conclusion

In theory, the potential of Web mining to help people navigate, search, and visualize the contents of the Web is enormous. This brief and selective survey explored the question of whether effective Web mining is feasible in practice. We reviewed several promising prototypes and outlined directions for future work. In essence, we have gathered preliminary evidence for the structured Web hypothesis; although the Web is less structured than we might hope, it is less random than we might fear.

Acknowledgments

I would like to thank my close collaborator, Dan Weld, for his numerous contributions to the softbots project and its vision. I would also like to thank my co-softbotists David Christianson, Bob Doorenbos, Marc Friedman, Keith Golden, Nick Kushmerick, Cody Kwok, Neal Lesh, Mark Langheinrich, Sujay Parekh, Mike Perkowitz, Erik Selberg, Richard Segal, and Jonathan Shakes. Thanks are due to Steve Hanks and other members of the UW AI group for helpful discussions and collaboration. This research was funded in part by Office of Naval Research grant 92-J-1946, by ARPA / Rome Labs grant F30602-95-1-0024, by a gift from Rockwell International Palo Alto Research, and by National Science Foundation grant IRI-9357772.

References

1. Bowman, C.M., Danzig, P.B., Hardy, D., Manber, U., and Schwartz, M.F. The Harvest information discovery and access system. In Proceedings of the Second International World Wide Web Conference, 1994, pp. 763–771. Available from ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z.
2. Cutting, D., Karger, D., Pedersen, J., and Tukey, J. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth International Conference on Research and Development in Information Retrieval (Copenhagen, Denmark), June 1992, pp. 318–329.
3. Doorenbos, R.B., Etzioni, O., and Weld, D.S. A scalable comparison-shopping agent for the World-Wide Web. Technical Report 96-01-03, University of Washington, Department of Computer Science and Engineering, January 1996. Available via ftp from pub/ai/ at ftp.cs.washington.edu.
4. Etzioni, O. Moving up the information food chain: Deploying softbots on the Web. In Proceedings of the Fourteenth National Conference on AI, 1996.
5. Etzioni, O. and Weld, D. A softbot-based interface to the Internet. Commun. ACM 37, 7 (July 1994), 72–76. See http://www.cs.washington.edu/research/softbots.
6. Hammond, K., Burke, R., Martin, C., and Lytinen, S. FAQ Finder: A case-based approach to knowledge navigation. In Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments (Stanford University), AAAI Press, 1995, pp. 69–73. To order a copy, contact sss@aaai.org.
7. Knoblock, C. and Levy, A., Eds. Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments (Stanford University), AAAI Press, 1995. To order a copy, contact sss@aaai.org.
8. Krulwich, B. The BargainFinder agent: Comparison price shopping on the Internet. In J. Williams, Ed., Bots and Other Internet Beasties. SAMS.NET, 1996. http://bf.cstar.ac.com.bf/.
9. Lewis, D. and Gale, W. Training text classifiers by uncertainty sampling. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.
10. Perkowitz, M. and Etzioni, O. Category translation: Learning to understand information on the Internet. In Proceedings of the Fifteenth International Joint Conference on AI (Montreal, Canada), Aug. 1995, pp. 930–936.
11. Whitehead, S.D. Auto-FAQ: An experiment in cyberspace leveraging. In Proceedings of the Second International WWW Conference, vol. 1 (Chicago), 1994, pp. 25–38. See also http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/whitehead/whitehead.html.
12. Zaiane, O.R. and Han, J. Resource and knowledge discovery in global information systems: A preliminary design and experiment. In Proceedings of Knowledge Discovery and Data Mining (KDD '95), 1995, pp. 331–336.

OREN ETZIONI (etzioni@cs.washington.edu) is an associate professor in the Department of Computer Science and Engineering at the University of Washington in Seattle.