Web Dragons
Inside the Myths of Search Engine Technology

Ian H. Witten
Marco Gori
Teresa Numerico

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier

Publisher: Diane D. Cerra
Publishing Services Manager: George Morrison
Project Manager: Marilyn E. Rash
Assistant Editor: Asma Palmeiro
Cover Design: Yvo Riezebos Design
Text Design: Mark Bernard, Design on Time
Composition: CEPHA Imaging Pvt. Ltd.
Copyeditor: Carol Leyba
Proofreader: Daniel Stone
Indexer: Steve Rath
Interior Printer: Sheridan Books
Cover Printer: Phoenix Color Corp.

Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2007 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: permissions@elsevier.com. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data
Witten, I. H. (Ian H.)
Web dragons: inside the myths of search engine technology / Ian H. Witten, Marco Gori, Teresa Numerico.
p. cm. — (Morgan Kaufmann series in multimedia and information systems)
Includes bibliographical references and index.
ISBN-13: 978-0-12-370609-6 (alk. paper)
ISBN-10: 0-12-370609-2 (alk. paper)
1. Search engines. 2. World Wide Web. 3. Electronic information resources literacy. I. Gori, Marco. II. Numerico, Teresa. III. Title. IV. Title: Inside the myths of search engine technology.
TK5105.884.W55 2006
025.04 dc22 2006023512

For information on all Morgan Kaufmann Publishers visit our Web site at www.books.elsevier.com

Printed in the United States of America
06 07 08 09 10   10 9 8 7 6 5 4 3 2 1

CONTENTS

List of Figures
List of Tables
Preface
About the Authors

1. SETTING THE SCENE
According to the Philosophers
Knowledge as Relations
Knowledge Communities
Knowledge as Language
Enter the Technologists
The Birth of Cybernetics
Information as Process
The Personal Library
The Human Use of Technology
The Information Revolution
Computers as Communication Tools
Time-Sharing and the Internet
Augmenting Human Intellect
The Emergence of Hypertext
And Now, the Web
The World Wide Web
A Universal Source of Answers?
What Users Know About Web Search
Searching and Serendipity
So What?
Notes and Sources

2. LITERATURE AND THE WEB
The Changing Face of Libraries
Beginnings
The Information Explosion
The Alexandrian Principle: Its Rise, Fall, and Re-Birth
The Beauty of Books
Metadata
The Library Catalog
The Dublin Core Metadata Standard
Digitizing Our Heritage
Project Gutenberg
Million Book Project
Internet Archive and the Bibliotheca Alexandrina
Amazon: A Bookstore
Google: A Search Engine
Open Content Alliance
New Models of Publishing
So What?
Notes and Sources

3. MEET THE WEB
Basic Concepts
HTTP: Hypertext Transfer Protocol
URI: Uniform Resource Identifier
Broken Links
HTML: Hypertext Markup Language
Crawling
Web Pages: Documents and Beyond
Static, Dynamic, and Active Pages
Avatars and Chatbots
Collaborative Environments
Enriching with Metatags
XML: Extensible Markup Language
Metrology and Scaling
Estimating the Web’s Size
Rate of Growth
Coverage, Freshness, and Coherence
Structure of the Web
Small Worlds
Scale-free Networks
Evolutionary Models
Bow Tie Architecture
Communities
Hierarchies
The Deep Web
So What?
Notes and Sources

4. HOW TO SEARCH
Searching Text
Full-text Indexes
Using the Index
What’s a Word?
Doing It Fast
Evaluating the Results
Searching in a Web
Determining What a Page Is About
Measuring Prestige
Hubs and Authorities
Bibliometrics
Learning to Rank
Distributing the Index
Developments in Web Search
Searching Blogs
Ajax Technology
The Semantic Web
Birth of the Dragons
The Womb Is Prepared
The Dragons Hatch
The Big Five
Inside the Dragon’s Lair
So What?
Notes and Sources

5. THE WEB WARS
Preserving the Ecosystem
Proxies
Crawlers
Parasites
Restricting Overuse
Resilience to Damage
Vulnerability to Attack
Viruses
Worms
Increasing Visibility: Tricks of the Trade
Term Boosting
Link Boosting
Content Hiding
Discussion
Business, Ethics, and Spam
The Ethics of Spam
Economic Issues
Search-Engine Advertising
Content-Targeted Advertising
The Bubble
Quality
The Anti-Spam War
The Weapons
The Dilemma of Secrecy
Tactics and Strategy
So What?
Notes and Sources

6. WHO CONTROLS INFORMATION?
The Violence of the Archive
Web Democracy
The Rich Get Richer
The Effect of Search Engines
Popularity Versus Authority
Privacy and Censorship
Privacy on the Web
Privacy and Web Dragons
Censorship on the Web
Copyright and the Public Domain
Copyright Law
The Public Domain
Relinquishing Copyright
Copyright on the Web
Web Searching and Archiving
The WIPO Treaty
The Business of Search
The Consequences of Commercialization
The Value of Diversity
Personalization and Profiling
So What?
Notes and Sources

7. THE DRAGONS EVOLVE
The Adventure of Search
Personalization in Practice
My Own Web
Analyzing Your Clickstream
Communities
Social Space or Objective Reality?
Searching within a Community Perspective
Defining Communities
Private Subnetworks
Peer-to-Peer Networks
A Reputation Society
The User as Librarian
The Act of Selection
Community Metadata
Digital Libraries
Your Computer and the Web
Personal File Spaces
From Filespace to the Web
Unification
The Global Office
So What?
Notes and Sources

REFERENCES
INDEX

[...] ways of searching through immense tracts of text is one of the most striking technical advances of the last decade. And today search engines do it for us. They weigh and measure every web page to determine whether it matches our query. And they do it all for free. We call on them whenever we want to find something that we need to know. To learn how they work, read on! We refer to search engines as web dragons ...

... registration or otherwise limit access to their contents?
Having surveyed the information landscape, Chapter 4 tackles the key ideas behind full-text searching and web search engines, the Internet’s new “killer app.” Despite the fact that search engines are intricate pieces of software, the underlying ideas are simple, and we describe them in plain English. Full-text search is an embodiment of the classical ...

... meaning of a revolution, but the rest of society, the audience who are swept along by the plot. In the information revolution sparked by the World Wide Web, we are all members of the audience. We did not ask for it. We did not direct its development. We did not participate in its conception and launch, in the design of the protocols and the construction of the search engines ...

... of the IEEE and of the ECCAI, and a former president of the Italian Association for Artificial Intelligence. Teresa Numerico teaches network theory and communication studies at the University of Rome. She is also a researcher in the philosophy of science at the University of Salerno (Italy). She earned her PhD in the history of science and was a visiting researcher at London South Bank University in the ...

... conceivable book is here, the archangels’ autobiographies, the faithful catalogue of the Library, thousands and thousands of false catalogues, the demonstration of the fallacy of those catalogues, the demonstration of the fallacy of the true catalogue ” Although the celebrated Argentine writer wrote this enigmatic little tale in 1941, it resonates with echoes of today’s World Wide Web. The impious maintain ...

... to celebrating the joy of being able to find stuff on the web, we want to make you feel uneasy about how everyone has come to rely on search engines so utterly and completely. The web is where we record our knowledge, and the dragons are how we access it. This book examines their interplay from many points of view: the philosophy of knowledge; the history of technology; the role of libraries, our traditional ...
... professor of computer science at the University of Siena, where he is the leader of the artificial intelligence research group. His research interests are machine learning with applications to pattern recognition, web mining, and game playing. He received a Laurea from the University of Florence and a PhD from the University of Bologna. He is the chairman of the Italian Chapter of the IEEE Computational Intelligence ...

... have been concealed by the benign philosophy of today’s dominant players and the exceptionally high utility of their product. Chapter 6 discusses the question of democracy (or lack of it) in cyberspace. We also review the age-old system of copyright—society’s way of controlling the flow of information to protect the rights of authors. The fact that today’s web concentrates enormous ...

... with Kant’s challenge of interpreting the revolution.

ENTER THE TECHNOLOGISTS

Norbert Wiener (1894–1964) was among the leaders of the technological revolution that took place around the time of the Second World War. He was the first American-born mathematician to win the respect of top intellects in the traditional European bastions of learning. He coined the term cybernetics ...

... from its components. Today we see the web as having a holistic identity that transcends the sum of all the individual websites. The second argument, even more germane to our topic, concerns the nature of information itself. In the late 1940s, Claude Shannon, a pioneer of information theory, likened information to thermodynamic entropy, for it obeys some of the same mathematical laws. Wiener inferred that ...