
Programming Spiders, Bots, and Aggregators in Java (2002)


DOCUMENT INFORMATION

Basic information

Format
Pages: 485
File size: 2.96 MB

Content

Programming Spiders, Bots, and Aggregators in Java
Jeff Heaton
Publisher: Sybex, February 2002
ISBN: 0782140408, 512 pages

Spiders, bots, and aggregators are all so-called intelligent agents, which execute tasks on the Web without the intervention of a human being. Spiders go out on the Web, identify multiple sites with information on a chosen topic, and retrieve that information. Bots find information within one site by cataloging and retrieving it. Aggregators gather data from multiple sites and consolidate it on one page, such as credit card, bank account, and investment account data.

This book offers a complete toolkit for the Java programmer who wants to build bots, spiders, and aggregators. It teaches the basic low-level HTTP/network programming Java programmers need to get going and then dives into how to create useful intelligent agent applications. It is aimed not just at Java programmers but at JSP programmers as well. The CD-ROM includes all the source code for the author's intelligent agent platform, which readers can use to build their own spiders, bots, and aggregators.

Associate Publisher: Richard Mills
Acquisitions and Developmental Editor: Diane Lowery
Editor: Rebecca C. Rider
Production Editor: Dennis Fitzgerald
Technical Editor: Marc Goldford
Graphic Illustrator: Tony Jonick
Electronic Publishing Specialists: Jill Niles, Judy Fung
Proofreaders: Emily Hsuan, Laurie O’Connell, Nancy Riddiough
Indexer: Ted Laux
CD Coordinator: Dan Mummert
CD Technician: Kevin Ly
Cover Designer: Carol Gorska, Gorska Design
Cover Illustrator/Photographer: Akira Kaede, PhotoDisc

Copyright © 2002 SYBEX Inc., 1151 Marina Village Parkway, Alameda, CA 94501. World rights reserved. The author(s) created reusable code in this publication expressly for reuse by readers.
Sybex grants readers limited permission to reuse the code found in this publication or its accompanying CD-ROM so long as the author(s) are attributed in any application containing the reusable code and the code itself is never distributed, posted online by electronic transmission, sold, or commercially exploited as a stand-alone product. Aside from this specific exception concerning reusable code, no part of this publication may be stored in a retrieval system, transmitted, or reproduced in any way, including but not limited to photocopy, photograph, magnetic, or other record, without the prior agreement and written permission of the publisher.

Library of Congress Card Number: 2001096980
ISBN: 0-7821-4040-8

SYBEX and the SYBEX logo are either registered trademarks or trademarks of SYBEX Inc. in the United States and/or other countries.

Screen reproductions produced with FullShot 99. FullShot 99 © 1991-1999 Inbit Incorporated. All rights reserved. FullShot is a trademark of Inbit Incorporated.

The CD interface was created using Macromedia Director, COPYRIGHT 1994, 1997-1999 Macromedia Inc. For more information on Macromedia and Macromedia Director, visit http://www.macromedia.com/.

Internet screen shot(s) using Microsoft Internet Explorer reprinted by permission from Microsoft Corporation.

TRADEMARKS: SYBEX has attempted throughout this book to distinguish proprietary trademarks from descriptive terms by following the capitalization style used by the manufacturer.

The author and publisher have made their best efforts to prepare this book, and the content is based upon final release software whenever possible. Portions of the manuscript may be based upon pre-release versions supplied by software manufacturer(s).
The author and the publisher make no representation or warranties of any kind with regard to the completeness or accuracy of the contents herein and accept no liability of any kind including but not limited to performance, merchantability, fitness for any particular purpose, or any losses or damages of any kind caused or alleged to be caused directly or indirectly from this book.

Software License Agreement: Terms and Conditions

The media and/or any online materials accompanying this book that are available now or in the future contain programs and/or text files (the “Software”) to be used in connection with the book. SYBEX hereby grants to you a license to use the Software, subject to the terms that follow. Your purchase, acceptance, or use of the Software will constitute your acceptance of such terms.

The Software compilation is the property of SYBEX unless otherwise indicated and is protected by copyright to SYBEX or other copyright owner(s) as indicated in the media files (the “Owner(s)”). You are hereby granted a single-user license to use the Software for your personal, noncommercial use only. You may not reproduce, sell, distribute, publish, circulate, or commercially exploit the Software, or any portion thereof, without the written consent of SYBEX and the specific copyright owner(s) of any component software included on this media.

In the event that the Software or components include specific license requirements or end-user agreements, statements of condition, disclaimers, limitations or warranties (“End-User License”), those End-User Licenses supersede the terms and conditions herein as to that particular Software component. Your purchase, acceptance, or use of the Software will constitute your acceptance of such End-User Licenses.

By purchase, use or acceptance of the Software you further agree to comply with all export laws and regulations of the United States as such laws and regulations may exist from time to time.
Reusable Code in This Book

The authors created reusable code in this publication expressly for reuse by readers. Sybex grants readers permission to reuse for any purpose the code found in this publication or its accompanying CD-ROM so long as all of the authors are attributed in any application containing the reusable code, and the code itself is never sold or commercially exploited as a stand-alone product.

Software Support

Components of the supplemental Software and any offers associated with them may be supported by the specific Owner(s) of that material, but they are not supported by SYBEX. Information regarding any available support may be obtained from the Owner(s) using the information provided in the appropriate read.me files or listed elsewhere on the media.

Should the manufacturer(s) or other Owner(s) cease to offer support or decline to honor any offer, SYBEX bears no responsibility. This notice concerning support for the Software is provided for your information only. SYBEX is not the agent or principal of the Owner(s), and SYBEX is in no way responsible for providing any support for the Software, nor is it liable or responsible for any support provided, or not provided, by the Owner(s).

Warranty

SYBEX warrants the enclosed media to be free of physical defects for a period of ninety (90) days after purchase. The Software is not available from SYBEX in any other form or media than that enclosed herein or posted to http://www.sybex.com/. If you discover a defect in the media during this warranty period, you may obtain a replacement of identical format at no charge by sending the defective media, postage prepaid, with proof of purchase to:

SYBEX Inc.
Product Support Department
1151 Marina Village Parkway
Alameda, CA 94501
Web: http://www.sybex.com/

After the 90-day period, you can obtain replacement media of identical format by sending us the defective disk, proof of purchase, and a check or money order for $10, payable to SYBEX.
Disclaimer

SYBEX makes no warranty or representation, either expressed or implied, with respect to the Software or its contents, quality, performance, merchantability, or fitness for a particular purpose. In no event will SYBEX, its distributors, or dealers be liable to you or any other party for direct, indirect, special, incidental, consequential, or other damages arising out of the use of or inability to use the Software or its contents even if advised of the possibility of such damage. In the event that the Software includes an online update feature, SYBEX further disclaims any obligation to provide this feature for any specific duration other than the initial posting.

The exclusion of implied warranties is not permitted by some states. Therefore, the above exclusion may not apply to you. This warranty provides you with specific legal rights; there may be other rights that you may have that vary from state to state. The pricing of the book with the Software by SYBEX reflects the allocation of risk and limitations on liability contained in this agreement of Terms and Conditions.

Shareware Distribution

This Software may contain various programs that are distributed as shareware. Copyright laws apply to both shareware and ordinary commercial software, and the copyright Owner(s) retains all rights. If you try a shareware program and continue using it, you are expected to register it. Individual programs differ on details of trial periods, registration, and payment. Please observe the requirements stated in appropriate files.

Copy Protection

The Software in whole or in part may or may not be copy-protected or encrypted. However, in all cases, reselling or redistributing these files without authorization is expressly forbidden except as specifically provided for by the Owner(s) therein.

This book is dedicated to my grandparents: Agnes Heaton and the memory of Roscoe Heaton, as well as Emil A. Stricker and the memory of Esther Stricker.
Acknowledgments

There are many people who helped to make this book a reality, both directly and indirectly. It would not be possible to thank them all, but I would like to acknowledge the primary contributors.

Working with Sybex on this project was a pleasure. Everyone involved in the production of this book was both professional and pleasant. First, I would like to acknowledge Marc Goldford, my technical editor, for his many helpful suggestions, and for testing the final versions of all examples. Rebecca Rider was my editor, and she did an excellent job of making sure that everything was clear and understandable. Diane Lowery, my acquisitions editor, was very helpful during the early stages of this project. I would also like to thank the production team: Dennis Fitzgerald, production editor; Jill Niles and Judy Fung, electronic publishing specialists; and Laurie O’Connell, Nancy Riddiough, and Emily Hsuan, proofreaders.

It has also been a pleasure to work with everyone in the Global Software division of the Reinsurance Group of America, Inc. (RGA). I work with a group of very talented IT professionals, and I continue to learn a great deal from them. In particular, I would like to thank my supervisor Kam Chan, executive director, for the very valuable help he provides as I learn to design large complex systems in addition to just programming them. Additionally, I would like to thank Rick Nolle, vice president of systems, for taking the time to find the right place for me at RGA. Finally, I would like to thank Jym Barnes, managing director, for our many discussions about the latest technologies.

In addition, I would like to thank my agent, Neil J. Salkind, Ph.D., for helping me develop and present the proposal for this book. I would also like to thank my friend Lisa Oliver for reviewing many chapters and discussing many of the ideas that went into this book.
Likewise, I would like to thank my friend Jeffrey Noedel for the many discussions of real-world applications of bot technology. I would also like to thank Bill Darte, of Washington University in St. Louis, for acting as my advisor for some of the research that went into this book.

Table of Contents

Introduction
  Overview
  What Is a Bot?
  What Is a Spider?
  What Are Agents and Intelligent Agents?
  What Are Aggregators?
  The Java Programming Language
  Wrap Up
Chapter 1: Java Socket Programming
  Overview
  The World of Sockets
  Java I/O Programming
  Proxy Issues
  Socket Programming in Java
  Client Sockets
  Server Sockets
  Summary
Chapter 2: Examining the Hypertext Transfer Protocol
  Overview
  Address Formats
  Using Sockets to Program HTTP
  Bot Package Classes for HTTP
  Under the Hood
  Summary
Chapter 3: Accessing Secure Sites with HTTPS
  Overview
  HTTP versus HTTPS
  Using HTTPS with Java
  HTTP User Authentication
  Securing Access
  Under the Hood
  Summary
Chapter 4: HTML Parsing
  Overview
  Working with HTML
  Tags a Bot Cares About
  HTML That Requires Special Handling
  Using Bot Classes for HTML Parsing
  Using Swing Classes for HTML Parsing
  Bot Package HTML Parsing Examples
  Under the Hood
  Summary
Chapter 5: Posting Forms
  Overview
  Using Forms
  Bot Classes for a Generic Post
  Under the Hood
  Summary
Chapter 6: Interpreting Data
  Overview
  The Structure of the CSV File
  The Structure of a QIF File
  The XML File Format
  Summary
Chapter 7: Exploring Cookies
  Overview
  Examining Cookies
  Bot Classes for Cookie Processing
  Under the Hood
  Summary
Chapter 8: Building a Spider
  Overview
  Structure of Websites
  Structure of a Spider
  Constructing a Spider
  Summary
Chapter 9: Building a High-Volume Spider
  Overview
  What Is Multithreading?
  Multithreading with Java
  Synchronizing Threads
  Using a Database
  The High-Performance Spider
  Under the Hood
  Summary
Chapter 10: Building a Bot
  Overview
  Constructing a Typical Bot
  Using the CatBot
  An Example CatBot
  Under the Hood
  Summary
Chapter 11: Building an Aggregator
  Overview
  Online versus Offline Aggregation
  Building the Underlying Bot
  Building the Weather Aggregator
  Summary
Chapter 12: Using Bots Conscientiously
  Overview
  Dealing with Websites
  Webmaster Actions
  A Conscientious Spider
  Under the Hood
  Summary
Chapter 13: The Future of Bots
  Overview
  Internet Information Transfer
  Understanding XML
  Transferring XML Data
  Bots and SOAP
  Summary
Appendix A: The Bot Package
  Utility Classes
  HTTP Classes
  The Parsing Classes
  Spider Classes
Appendix B: Various HTTP Related Charts
  The ASCII Chart
  HTTP Headers
  HTTP Status Codes
  HTML Character Constants
Appendix C: Troubleshooting
  WIN32 Errors
  UNIX Errors
  Cross-Platform Errors
  How to Use the NOBOT Scripts
Appendix D: Installing Tomcat
  Installing and Starting Tomcat
  A JSP Example
Appendix E: How to Compile Examples Under Windows
  Using the JDK
  Using VisualCafé
Appendix F: How to Compile Examples Under UNIX
  Using the JDK
Appendix G: Recompiling the Bot Package
Glossary

Introduction

Overview

A tremendous amount of information is available through the Internet: today’s news, the location of an expected package, the score of last night’s game, or the current stock price of your company. Open your favorite browser, and all of this information is only a mouse click away. Nearly any piece of current information can be found online; you have only to discover it. Most of the information content of the Internet is both produced and consumed by human users.
As a result, web pages are generally structured to be inviting to human visitors. But is this the only use for the Web? Are human users the only visitors a website is likely to accommodate?

Actually, a whole new class of web user is developing. These users are computer programs that have the ability to access the Web in much the same way as a human user with a browser does. There are many names for these kinds of programs, and these names reflect many of the specialized tasks assigned to them. Spiders, bots, aggregators, agents, and intelligent agents are all common terms for web-savvy computer programs. As you read through this book, we will examine how to create each of these Internet programs. We will examine the differences between them as well as see what the benefits of each are. Figure I.1 shows the hierarchy of these programs.

Figure I.1: Bots, spiders, aggregators, and agents

What Is a Bot?

[...]

... Additionally, the chapters provide an in-depth explanation of how the Bot package works.

Chapter 1: Java Socket Programming

Overview

  Exploring the world of sockets
  Learning how to program your network
  Java stream and filter programming
  Understanding client sockets
  Discovering server sockets

The Internet is built of many related protocols, and more complex protocols are...

... deals with spiders, bots, and aggregators: the bots that deal directly with web pages. Intelligent agents are programs that can make decisions based on a user’s training, and therefore they are more of an AI topic than a web programming topic. Because this book deals mainly with the types of bots directly tied to web browsing, intelligent agents will not be covered.

What Are Aggregators?

...
... streams

Creating Input Streams

The InputStream class provided by Java is abstract, and it is meant only to be extended to provide InputStream classes for such things as socket- and disk-based input. The InputStream provided by Java provides the following methods:

    public abstract int read() throws IOException
    public int read(byte[] b) throws IOException
    public int read(byte[] b, int off, int len) throws...

... examine TCP/IP socket programming. Frequently, the terms socket and TCP/IP programming are used interchangeably, both in the real world and in this chapter. Technically, socket-based programming allows for more protocols than just TCP/IP. With the proliferation of TCP/IP systems in recent years, however, TCP/IP is the only protocol that is commonly used with socket programming.

The World of Sockets

Spiders, ...

... created by the Internet Architecture Board (IAB) of the Internet Engineering Task Force (IETF; a volunteer organization that defines protocols for use on the Internet). Because of this, the definition of DHCP is recorded in an Internet RFC, and the IAB is asserting its status as to Internet Standardization. Many broadband ISPs, such as cable modem and DSL providers, use DHCP directly from their broadband modem. When...

... Java network communication you will need in order to program spiders, bots, and aggregators, we will examine Java's I/O classes as they relate to network communications. However, much of the information could also easily apply to file-based I/O under Java. If you are already familiar with file programming in Java, much of this material will be review. Conversely, if you are unfamiliar with Java file programming, ...
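The three read methods listed above can be exercised with a small sketch. The class names ReadDemo and StringBytesStream are hypothetical and are not part of the book's Bot package; the sketch only assumes the standard java.io API. It overrides the single abstract method read() and relies on the inherited read(byte[], int, int), whose default implementation calls read() repeatedly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadDemo {
    /** Hypothetical stream that serves the ASCII bytes of a fixed string. */
    static class StringBytesStream extends InputStream {
        private final byte[] data;
        private int pos = 0;

        StringBytesStream(String s) {
            this.data = s.getBytes(StandardCharsets.US_ASCII);
        }

        // The only method a subclass must supply: one byte (0..255), or -1 at end of stream.
        @Override
        public int read() throws IOException {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }
    }

    /** Drains a stream in small chunks using read(byte[], int, int). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4]; // deliberately tiny so several reads are needed
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n); // only the first n bytes of buf are valid
        }
        return out.toString("US-ASCII");
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources (Java 7 and later) closes the stream when done
        try (InputStream in = new StringBytesStream("GET / HTTP/1.0")) {
            System.out.println(readAll(in));
        }
    }
}
```

The same readAll loop works unchanged on any other InputStream, including one obtained from a socket, because callers depend only on the abstract class.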
... object-oriented programming languages. In addition, some programming languages have the ability to use Java classes. The Bot package provided in this book could easily be used with such a language. This book assumes that you are generally familiar with the Java programming language, but it doesn’t require you to have expert knowledge in the Java language. This book does not assume anything beyond basic Java programming. ...

... already seen. In many ways, this is a rewind feature for an input stream.

Closing Input Streams

Just like output streams, input streams must be closed when you are done with them. Input streams do not have the buffering issues that output streams do, however. This is because input streams are just reading data, not saving it. Since the data is already saved, the input stream...

Table 1.2: Some Java Filters

Read Filter | Write Filter | Purpose
... | ... | ... read/write primitive Java data types from an underlying input/output stream in a machine-independent way
GZIPInputStream | GZIPOutputStream | This filter implements a stream filter for reading or writing data compressed in the GZIP format.
ZipInputStream | ZipOutputStream | This filter implements input/output filter...

    ... program continues here
      }
    }

Warning: If you are connecting to the Internet through a proxy server, you must use one of the above methods to let Java know about your proxy settings. If you fail to do this, the programs in this book will not be able to connect to the Internet.

Socket Programming in Java

Java has greatly simplified socket programming, especially when compared to the requirements and constructs ...
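The filter streams described in Table 1.2 above can be chained, and the following sketch shows one way that might look. It is an illustration under stated assumptions, not the book's own listing: the class name FilterDemo and the methods pack and unpack are hypothetical, and DataOutputStream/DataInputStream are assumed here as the filters for primitive data types, since the corresponding table row is truncated in this excerpt. Data is written through a data filter and a GZIP filter into an in-memory buffer, then read back through the matching read filters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FilterDemo {
    /** Writes an int and a string through a data filter layered over a GZIP filter. */
    static byte[] pack(int status, String url) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the outermost filter flushes and closes every stream beneath it,
        // which also lets the GZIP filter write its trailer.
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(sink))) {
            out.writeInt(status);
            out.writeUTF(url);
        }
        return sink.toByteArray();
    }

    /** Reads the values back through the matching read filters, in the same order. */
    static String unpack(byte[] packed) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(packed)))) {
            int status = in.readInt();
            String url = in.readUTF();
            return status + " " + url;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = pack(200, "http://www.example.com/");
        System.out.println(unpack(packed)); // prints "200 http://www.example.com/"
    }
}
```

Because each filter wraps a plain InputStream or OutputStream, the in-memory buffers could be swapped for the streams of a socket without changing pack or unpack beyond their stream arguments.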

Date posted: 19/04/2014, 17:20
