Python 2.6 Text Processing Beginner's Guide

The easiest way to learn how to manipulate text with Python

Jeff McNeil

BIRMINGHAM - MUMBAI

Copyright © 2010 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2010
Production Reference: 1081210

Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK

ISBN 978-1-849512-12-1

www.packtpub.com

Cover Image by John Quick (john@johnmquick.com)

Credits

Author: Jeff McNeil
Reviewer: Maurice HT Ling
Acquisition Editor: Steven Wilding
Development Editor: Reshma Sundaresan
Technical Editor: Gauri Iyer
Indexer: Tejal Daruwale
Editorial Team Leader: Mithun Sehgal
Project Team Leader: Priya Mukherji
Project Coordinator: Shubhanjan Chatterjee
Proofreader: Jonathan Todd
Graphics: Nilesh R Mohite
Production Coordinator: Kruthika Bangera
Cover Work: Kruthika Bangera

About the Author

Jeff McNeil has been working in the Internet Services industry for over 10 years. He cut his teeth during the late 90's Internet boom and has been developing software for Unix and Unix-flavored systems ever since. Jeff has been a
full-time Python developer for the better half of that time and has professional experience with a collection of other languages, including C, Java, and Perl. He takes an interest in systems administration and server automation problems. Jeff recently joined Google and has had the pleasure of working with some very talented individuals.

I'd like to thank, above all, Julie, Savannah, Phoebe, Maya, and Trixie for allowing me to lock myself in the office every night for months. Thanks also to the Web.com gang and those in the Python community willing to share their authoring experiences. Finally, thanks to Steven Wilding, Reshma Sundaresan, Shubhanjan Chatterjee, and the rest of the Packt Publishing team for all of the hard work and guidance.

About the Reviewer

Maurice HT Ling completed his Ph.D. in Bioinformatics and B.Sc. (Hons) in Molecular and Cell Biology at the University of Melbourne, where he worked on microarray analysis and text mining for protein-protein interactions. He is currently an honorary fellow at the University of Melbourne, Australia. Maurice holds several Chief Editorships, including The Python Papers, Computational and Mathematical Biology, and Methods and Cases in Computational, Mathematical and Statistical Biology. In Singapore, he co-founded the Python User Group (Singapore) and is the co-chair of PyCon Asia-Pacific 2010. In his free time, Maurice likes to train in the gym, read, and enjoy a good cup of coffee. He is also a senior fellow of the International Fitness Association, USA.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available?
You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?

  Fully searchable across every book published by Packt
  Copy and paste, print, and bookmark content
  On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface

Chapter 1: Getting Started
  Categorizing types of text data
  Providing information through markup
  Meaning through structured formats
  Understanding freeform content
  Ensuring you have Python installed
  Providing support for Python
  Implementing a simple cipher
  Time for action – implementing a ROT13 encoder
  Processing structured markup with a filter
  Time for action – processing as a filter
  Time for action – skipping over markup tags
  State machines
  Supporting third-party modules
  Packaging in a nutshell
  Time for action – installing SetupTools
  Running a virtual environment
  Configuring virtualenv
  Time for action – configuring a virtual environment
  Where to get help?
  Summary

Chapter 2: Working with the IO System
  Parsing web server logs
  Time for action – generating transfer statistics
  Using objects interchangeably
  Time for action – introducing a new log format
  Accessing files directly
  Time for action – accessing files directly
  Context managers
  Handling other file types
  Time for action – handling compressed files
  Implementing file-like objects
  File object methods
  Enabling universal newlines
  Accessing multiple files
  Time for action – spell-checking HTML content
  Simplifying multiple file access
  Inplace filtering
  Accessing remote files
  Time for action – spell-checking live HTML pages
  Error handling
  Time for action – handling urllib errors
  Handling string IO instances
  Understanding IO in Python
  Summary

Chapter 3: Python String Services
  Understanding the basics of string object
  Defining strings
  Time for action – employee management
  Building non-literal strings
  String formatting
  Time for action – customizing log processor output
  Percent (modulo) formatting
  Mapping key
  Conversion flags
  Minimum width
  Precision
  Width
  Conversion type
  Using the format method approach
  Time for action – adding status code data
  Making use of conversion specifiers
  Creating templates
  Time for action – displaying warnings on malformed lines
  Template syntax
  Rendering a template
  Calling string object methods
  Time for action – simple manipulation with string methods
  Aligning text
  Detecting character classes
  Casing
  Searching strings
  Dealing with lists of strings
  Treating strings as sequences
  Summary

Chapter 4: Text Processing Using the Standard Library
  Reading CSV data
  Time for action – processing Excel formats
  Time for action – CSV and formulas
  Reading non-Excel data
  Time for action – processing custom CSV formats
  Writing CSV data
  Time for action – creating a spreadsheet of UNIX users
  Modifying application configuration files
  Time for action – adding basic configuration read support
  Using value interpolation
  Time for action – relying on configuration value interpolation
  Handling default options
  Time for action – configuration defaults
  Writing configuration data
  Time for action – generating a configuration file
  Reconfiguring our source
  A note on Python
  Time for action – creating an egg-based package
  Understanding the setup.py file
  Working with JSON
  Time for action – writing JSON data
  Encoding data
  Decoding data
  Summary

Chapter 5: Regular Expressions
  Simple string matching
  Time for action – testing an HTTP URL
  Understanding the match function
  Learning basic syntax
  Detecting repetition
  Specifying character sets and classes
  Applying anchors to restrict matches
  Wrapping it up

Preface

The Python Text Processing Beginner's Guide is intended to provide a gentle, hands-on introduction to processing, understanding, and generating textual data using the Python programming...