Bioinformatics Programming Using Python
Mitchell L Model

Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo

Bioinformatics Programming Using Python
by Mitchell L Model

Copyright © 2010 Mitchell L Model. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Sarah Schneider
Copyeditor: Rachel Head
Proofreader: Sada Preisch
Indexer: Lucie Haskins
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
December 2009: First Edition.

O’Reilly and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Bioinformatics Programming Using Python, the image of a brown rat, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN: 978-0-596-15450-9
[M] 1259959883

Table of Contents

Preface
1. Primitives
  Simple Values
    Booleans
    Integers
    Floats
    Strings
  Expressions
    Numeric Operators
    Logical Operations
    String Operations
    Calls
    Compound Expressions
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

2. Names, Functions, and Modules
  Assigning Names
  Defining Functions
    Function Parameters
    Comments and Documentation
    Assertions
    Default Parameter Values
  Using Modules
    Importing
    Python Files
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

3. Collections
  Sets
  Sequences
    Strings, Bytes, and Bytearrays
    Ranges
    Tuples
    Lists
  Mappings
    Dictionaries
  Streams
    Files
    Generators
  Collection-Related Expression Features
    Comprehensions
    Functional Parameters
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

4. Control Statements
  Conditionals
  Loops
    Simple Loop Examples
    Initialization of Loop Values
    Looping Forever
    Loops with Guard Conditions
  Iterations
    Iteration Statements
    Kinds of Iterations
  Exception Handlers
    Python Errors
    Exception Handling Statements
    Raising Exceptions
  Extended Examples
    Extracting Information from an HTML File
    The Grand Unified Bioinformatics File Parser
    Parsing GenBank Files
    Translating RNA Sequences
    Constructing a Table from a Text File
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

5. Classes
  Defining Classes
    Instance Attributes
    Class Attributes
  Class and Method Relationships
    Decomposition
    Inheritance
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

6. Utilities
  System Environment
    Dates and Times: datetime
    System Information
    Command-Line Utilities
    Communications
  The Filesystem
    Operating System Interface: os
    Manipulating Paths: os.path
    Filename Expansion: fnmatch and glob
    Shell Utilities: shutil
    Comparing Files and Directories
  Working with Text
    Formatting Blocks of Text: textwrap
    String Utilities: string
    Comma- and Tab-Separated Formats: csv
    String-Based Reading and Writing: io
  Persistent Storage
    Persistent Text: dbm
    Persistent Objects: pickle
    Keyed Persistent Object Storage: shelve
  Debugging Tools
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

7. Pattern Matching
  Fundamental Syntax
    Fixed-Length Matching
    Variable-Length Matching
    Greedy Versus Nongreedy Matching
    Grouping and Disjunction
  The Actions of the re Module
    Functions
    Flags
    Methods
  Results of re Functions and Methods
    Match Object Fields
    Match Object Methods
  Putting It All Together: Examples
    Some Quick Examples
    Extracting Descriptions from Sequence Files
    Extracting Entries From Sequence Files
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

8. Structured Text
  HTML
    Simple HTML Processing
    Structured HTML Processing
  XML
    The Nature of XML
    An XML File for a Complete Genome
    The ElementTree Module
    Event-Based Processing
    expat
  Tips, Traps, and Tracebacks
    Tips
    Traps
    Tracebacks

9. Web Programming
  Manipulating URLs: urllib.parse
    Disassembling URLs
    Assembling URLs
  Opening Web Pages: webbrowser
    Module Functions

[...]

... tell you which version of Python is installed as the program called python:

    % python -V

The name of the executable for Python 3 may be python3 instead of just python. You can type this:

    % python3 -V

to see if that is the case. If you are running Python in an integrated development environment (in particular IDLE, which is part of the Python installation), type the following at the prompt (>>>) of its interactive...
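The command-line checks above have a counterpart inside the interpreter. The following is a minimal sketch, not taken from the book's text, that reads the version of the running interpreter from the standard sys module. The helper names describe_python_version and is_python3 are hypothetical and exist only for this illustration.

```python
import sys


def describe_python_version(info=sys.version_info):
    """Return a 'major.minor.micro' string for a version tuple.

    By default this describes the interpreter that is running the code.
    """
    return "{}.{}.{}".format(info[0], info[1], info[2])


def is_python3(info=sys.version_info):
    """Return True when the major version number is 3 or later."""
    return info[0] >= 3


# Report on the interpreter executing this script.
print("Running Python", describe_python_version())
print("Python 3 or later:", is_python3())
```

Running the same file with both python and python3 is one way to see whether the two names refer to different installations.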
... where applicable, Python 2.x will work for most of the book’s examples. There are a few notes about Python 2 in Chapters 1, 3, and 5; they are there not just to help you if you find yourself using Python 2 for some work, but also for when you read Python 2 code. The major exception is that print was a statement in Python 2 but is now a function, allowing for more flexibility. Also, Python 3 reorganized... contents, so using Python 2.x with examples that demonstrate the use of certain modules would involve more than a few minor changes.

Determining Which Version of Python Is Installed

Some version of Python 2 is probably installed on your computer, unless you are using Windows. Typing the following into a command-line window (using % as an example of a command-line prompt) will tell you which version of Python...

... an introduction to Python programming that is more rapid and in some ways more superficial than what would be found in a text devoted solely to Python or introductory programming. At the same time, it includes some advanced features, techniques, and topics that are often omitted from entry-level Python books. These are included because of their wide applicability in bioinformatics programming, and they...

... programming, but it has evolved into a fundamentally object-oriented language. (There is no declarative programming component; of the four paradigms, declarative programming is the one least amenable to fitting together with another.) Few, if any, other languages provide a blend like this as seamlessly and elegantly as does Python.

Installing Python

This book uses Python 3, the language’s...

... IDE on your computer, or install one that uses Python 3. (The Python installation process installs the GUI-based IDLE for whatever version of Python is being installed.) The current release of Python can be downloaded from http://python.org/download/. Installers are available for OS X and Windows. With most distributions of Linux, you should be able to install Python through the usual package mechanisms...

... Python Language

Simply put, Python is a beautiful language. It is effective for everything from teaching new programmers to advanced computer science study, from simple scripts to sophisticated advanced applications. It has always had some purchase in bioinformatics, and in recent years its popularity has been increasing rapidly. One goal of this book is to help significantly expand Python’s use for bioinformatics...

... scientific content of bioinformatics, as well as dealing with data that is more amenable to representation and manipulation in software. Also, and not incidentally, it is the part of bioinformatics with which the author is most familiar.

About the Reader

This book assumes no prior programming experience. Its introduction to and use of Python are completely self-contained. Even if you do have some programming experience,...

... and uniform a view of Python programming as possible. They also were based on the assumption that most people making use of what they learn in this book will not move on to more advanced programming or large-scale software development.

Some things that will appear strange to anyone with significant programming experience are in reality true to a pure “Pythonic” approach...

... The terms “type” and “value” come from traditional procedural programming. The equivalent object-oriented terms are class and object. We’ll mostly use the terms “type” and “value” early on, then gradually shift to using “class” and “object” more frequently. Although Python’s history is tied more to object-oriented programming than to traditional programming, we’ll use the term instance with both terminologies: ...
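The interchangeable vocabulary described above (type and value, class and object, instance) can be seen directly in the interpreter. This is a small sketch under stated assumptions, not the book's own example: the sample values and the helper name type_name are made up for illustration, while type() and isinstance() are standard built-ins.

```python
# One sample value for each primitive type listed in Chapter 1 of the outline.
values = [True, 42, 3.14, "ACGT"]


def type_name(value):
    """Return the name of the type (equivalently, the class) of a value."""
    return type(value).__name__


for value in values:
    # Each value is an instance of its type, whichever terminology is used.
    print(repr(value), "is an instance of", type_name(value))

# isinstance() asks the same question the other way around.
print(isinstance("ACGT", str))
```

In Python 3 the result of type() is itself a class, which is why the two sets of terms can be used side by side.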