Web Scraping with Python
Collecting More Data from the Modern Web
Second Edition

Ryan Mitchell

Beijing, Boston, Farnham, Sebastopol, Tokyo

Web Scraping with Python
by Ryan Mitchell

Copyright © 2018 Ryan Mitchell. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Allyson MacDonald
Production Editor: Justin Billing
Copyeditor: Sharon Wilkey
Proofreader: Christina Edwards
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2018: Second Edition

Revision History for the Second Edition
2018-03-20: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491985571 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-98557-1
[LSI]

Table of Contents

Preface

Part I. Building Scrapers

1. Your First Web Scraper
   Connecting
   An Introduction to BeautifulSoup
   Installing BeautifulSoup
   Running BeautifulSoup
   Connecting Reliably and Handling Exceptions

2. Advanced HTML Parsing
   You Don’t Always Need a Hammer
   Another Serving of BeautifulSoup
   find() and find_all() with BeautifulSoup
   Other BeautifulSoup Objects
   Navigating Trees
   Regular Expressions
   Regular Expressions and BeautifulSoup
   Accessing Attributes
   Lambda Expressions

3. Writing Web Crawlers
   Traversing a Single Domain
   Crawling an Entire Site
   Collecting Data Across an Entire Site
   Crawling Across the Internet

4. Web Crawling Models
   Planning and Defining Objects
   Dealing with Different Website Layouts
   Structuring Crawlers
   Crawling Sites Through Search
   Crawling Sites Through Links
   Crawling Multiple Page Types
   Thinking About Web Crawler Models

5. Scrapy
   Installing Scrapy
   Initializing a New Spider
   Writing a Simple Scraper
   Spidering with Rules
   Creating Items
   Outputting Items
   The Item Pipeline
   Logging with Scrapy
   More Resources

6. Storing Data
   Media Files
   Storing Data to CSV
   MySQL
   Installing MySQL
   Some Basic Commands
   Integrating with Python
   Database Techniques and Good Practice
   “Six Degrees” in MySQL
   Email

Part II. Advanced Scraping

7. Reading Documents
   Document Encoding
   Text
   Text Encoding and the Global Internet
   CSV
   Reading CSV Files
   PDF
   Microsoft Word and docx

8. Cleaning Your Dirty Data
   Cleaning in Code
   Data Normalization
   Cleaning After the Fact
   OpenRefine

9. Reading and Writing Natural Languages
   Summarizing Data
   Markov Models
   Six Degrees of Wikipedia: Conclusion
   Natural Language Toolkit
   Installation and Setup
   Statistical Analysis with NLTK
   Lexicographical Analysis with NLTK
   Additional Resources

10. Crawling Through Forms and Logins
   Python Requests Library
   Submitting a Basic Form
   Radio Buttons, Checkboxes, and Other Inputs
   Submitting Files and Images
   Handling Logins and Cookies
   HTTP Basic Access Authentication
   Other Form Problems

11. Scraping JavaScript
   A Brief Introduction to JavaScript
   Common JavaScript Libraries
   Ajax and Dynamic HTML
   Executing JavaScript in Python with Selenium
   Additional Selenium Webdrivers
   Handling Redirects
   A Final Note on JavaScript

12. Crawling Through APIs
   A Brief Introduction to APIs
   HTTP Methods and APIs
   More About API Responses
   Parsing JSON
   Undocumented APIs
   Finding Undocumented APIs
   Documenting Undocumented APIs
   Finding and Documenting APIs Automatically
   Combining APIs with Other Data Sources
   More About APIs

13. Image Processing and Text Recognition
   Overview of Libraries
   Pillow
   Tesseract
   NumPy
   Processing Well-Formatted Text
   Adjusting Images Automatically
   Scraping Text from Images on Websites
   Reading CAPTCHAs and Training Tesseract
   Training Tesseract
   Retrieving CAPTCHAs and Submitting Solutions

14. Avoiding Scraping Traps
   A Note on Ethics
   Looking Like a Human
   Adjust Your Headers
   Handling Cookies with JavaScript
   Timing Is Everything
   Common Form Security Features
   Hidden Input Field Values
   Avoiding Honeypots
   The Human Checklist

15. Testing Your Website with Scrapers
   An Introduction to Testing
   What Are Unit Tests?
   Python unittest
   Testing Wikipedia
   Testing with Selenium
   Interacting with the Site
   unittest or Selenium?

16. Web Crawling in Parallel
   Processes versus Threads
   Multithreaded Crawling
   Race Conditions and Queues
   The threading Module
   Multiprocess Crawling
   Multiprocess Crawling
   Communicating Between Processes
   Multiprocess Crawling—Another Approach

17. Scraping Remotely
   Why Use Remote Servers?
   Avoiding IP Address Blocking
   Portability and Extensibility
   Tor
   PySocks
   Remote Hosting
   Running from a Website-Hosting Account
   Running from the Cloud
   Additional Resources

18. The Legalities and Ethics of Web Scraping
   Trademarks, Copyrights, Patents, Oh My!
   Copyright Law
   Trespass to Chattels
   The Computer Fraud and Abuse Act
   robots.txt and Terms of Service
   Three Web Scrapers
   eBay versus Bidder’s Edge and Trespass to Chattels
   United States v. Auernheimer and The Computer Fraud and Abuse Act
   Field v. Google: Copyright and robots.txt
   Moving Forward

Index

... pages as web crawlers or refer to the web scraping programs themselves as bots. In theory, web scraping is the practice of gathering data through any means other than a program interacting with an...

What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more