www.it-ebooks.info

Parallel R

Q. Ethan McCallum and Stephen Weston

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Parallel R
by Q. Ethan McCallum and Stephen Weston

Copyright © 2012 Q. Ethan McCallum and Stephen Weston. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Kristen Borg
Proofreader: O’Reilly Production Services
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Revision History for the First Edition:
    2011-10-21 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449309923 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Parallel R, the image of a rabbit, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-30992-3
[LSI]
1319202138

Table of Contents

Preface

1. Getting Started
    Why R?
    Why Not R?
    The Solution: Parallel Execution
    A Road Map for This Book
    What We’ll Cover
    Looking Forward…
    What We’ll Assume You Already Know
    In a Hurry?
    snow
    multicore
    parallel
    R+Hadoop
    RHIPE
    Segue
    Summary

2. snow
    Quick Look
    How It Works
    Setting Up
    Working with It
    Creating Clusters with makeCluster
    Parallel K-Means
    Initializing Workers
    Load Balancing with clusterApplyLB
    Task Chunking with parLapply
    Vectorizing with clusterSplit
    Load Balancing Redux
    Functions and Environments
    Random Number Generation
    snow Configuration
    Installing Rmpi
    Executing snow Programs on a Cluster with Rmpi
    Executing snow Programs with a Batch Queueing System
    Troubleshooting snow Programs
    When It Works…
    …And When It Doesn’t
    The Wrap-up

3. multicore
    Quick Look
    How It Works
    Setting Up
    Working with It
    The mclapply Function
    The mc.cores Option
    The mc.set.seed Option
    Load Balancing with mclapply
    The pvec Function
    The parallel and collect Functions
    Using collect Options
    Parallel Random Number Generation
    The Low-Level API
    When It Works…
    …And When It Doesn’t
    The Wrap-up

4. parallel
    Quick Look
    How It Works
    Setting Up
    Working with It
    Getting Started
    Creating Clusters with makeCluster
    Parallel Random Number Generation
    Summary of Differences
    When It Works…
    …And When It Doesn’t
    The Wrap-up

5. A Primer on MapReduce and Hadoop
    Hadoop at Cruising Altitude
    A MapReduce Primer
    Thinking in MapReduce: Some Pseudocode Examples
    Calculate Average Call Length for Each Date
    Number of Calls by Each User, on Each Date
    Run a Special Algorithm on Each Record
    Binary and Whole-File Data: SequenceFiles
    No Cluster? No Problem! Look to the Clouds…
    The Wrap-up

6. R+Hadoop
    Quick Look
    How It Works
    Setting Up
    Working with It
    Simple Hadoop Streaming (All Text)
    Streaming, Redux: Indirectly Working with Binary Data
    The Java API: Binary Input and Output
    Processing Related Groups (the Full Map and Reduce Phases)
    When It Works…
    …And When It Doesn’t
    The Wrap-up

7. RHIPE
    Quick Look
    How It Works
    Setting Up
    Working with It
    Phone Call Records, Redux
    Tweet Brevity
    More Complex Tweet Analysis
    When It Works…
    …And When It Doesn’t
    The Wrap-up

8. Segue
    Quick Look
    How It Works
    Setting Up
    Working with It
    Model Testing: Parameter Sweep
    When It Works…
    …And When It Doesn’t
    The Wrap-up

9. New and Upcoming
    doRedis
    RevoScale R and RevoConnectR (RHadoop)
    cloudNumbers.com

Preface

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Parallel R by Q. Ethan McCallum and Stephen Weston (O'Reilly). Copyright 2012 Q. Ethan McCallum and Stephen Weston, 978-1-449-30992-3.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
Safari® Books Online

Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.

With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.

O’Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O’Reilly and other publishers, sign up for free at http://my.safaribooksonline.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

    http://oreilly.com/catalog/0636920021421

To comment or ask technical questions about this book, send email to:

    bookquestions@oreilly.com

[...]

... arguments on each of the cluster workers, and returns the results as a list. It’s like clusterApply() without the x argument, so it executes once for each worker, like clusterEvalQ(), rather than once for each element in x.

§ All R sessions are randomly seeded when they first generate random numbers, unless they were restored from a previous R session that generated random numbers. snow workers never restore...
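The difference between once per worker and once per element, described in the clusterCall() excerpt above, can be seen in a small sketch. It assumes the parallel package that ships with R (which provides the same cluster function names as snow), a local two-worker cluster, and a toy addition function; none of these choices come from the book.

```r
library(parallel)  # ships with R; provides snow-style cluster functions

# Assumption: two local workers, chosen only to keep the example small.
cl <- makeCluster(2)

# clusterCall(): run the same function with the same arguments once per worker.
per_worker <- clusterCall(cl, function(a, b) a + b, 1, 2)

# clusterApply(): run the function once for each element of x instead.
per_element <- clusterApply(cl, 1:4, function(x, b) x + b, 2)

# clusterEvalQ(): evaluate an expression once per worker, like clusterCall().
pids <- clusterEvalQ(cl, Sys.getpid())

stopCluster(cl)

length(per_worker)   # number of workers
length(per_element)  # number of elements in x
```

The two lengths differ because clusterCall() and clusterEvalQ() are driven by the size of the cluster, while clusterApply() is driven by the length of its x argument.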
without ever leaving the R interpreter
Solves: Single-threaded, memory-bound
Pros: Closer to a native R experience than R+Hadoop; use pure R code for your MapReduce operations
Cons: Requires a Hadoop cluster; requires extra setup on the cluster; cannot process standard SequenceFiles (for binary data)

Segue
Overview: Seamlessly send R apply-like calculations to a remote Hadoop cluster
Solves: Single-threaded,...

finished product you’re reading now:

    Robert Bjornson
    Nicholas Carriero
    Jonathan Seidman
    Paul Teetor
    Ramesh Venkataramaiah
    Jed Wing

Any errors you find in this book belong to us, the authors.

Most of all we thank you, the reader, for your interest in this book. We set out to create the guidebook we wish we’d had when we first tried to give R that parallel, distributed boost. R work is research work, best...

the workers. The kmeans() function uses the sample.int() function to choose the starting cluster centers, which depend on the random number generator. In order to get different solutions, the cluster workers need to use different streams of random numbers. Since the workers are randomly seeded when they first start generating random numbers,§ this example will work, but it is good practice to use a parallel...

linear algebra subprogram (BLAS). Churning through large datasets? Use a relational database or another manual method to retrieve your data in smaller, more manageable pieces. And so on, and so forth. Some big winners involve parallelism. Spreading work across multiple CPUs overcomes R’s single-threaded nature. Offloading work to multiple machines reaps the multiprocess benefit and also addresses R’s memory...
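The k-means excerpt above says the workers need different random number streams to reach different solutions. A minimal sketch of that idea follows. It is not the book's listing: it assumes the parallel package's clusterSetRNGStream() as the parallel random number generator, the numeric columns of the built-in iris data as a stand-in dataset, and arbitrary values for the number of centers, workers, and random starts.

```r
library(parallel)

# Assumption: iris stands in for the book's dataset; 4 centers, 2 workers,
# and 25 random starts per worker are arbitrary illustrative values.
x <- as.matrix(iris[, 1:4])

cl <- makeCluster(2)

# Give each worker its own reproducible random number stream instead of
# relying on every worker seeding itself randomly.
clusterSetRNGStream(cl, iseed = 1234)

# Each task runs kmeans() with its share of the random starts.
results <- clusterApply(cl, c(25, 25), function(nstart, data) {
  kmeans(data, centers = 4, nstart = nstart)
}, x)

stopCluster(cl)

# Keep the solution with the smallest total within-cluster sum of squares.
withinss <- vapply(results, function(r) r$tot.withinss, numeric(1))
best <- results[[which.min(withinss)]]
```

Splitting the random starts across workers and then picking the best fit on the master gives the same kind of answer as one long kmeans() call with all the starts, with the work spread over the cluster.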
efficient; easy to install; no configuration needed
Cons: Can only use one machine; doesn’t support Windows; no built-in support for parallel random number generation (RNG)

parallel
Overview: A merger of snow and multicore that comes built into R as of R 2.14.0
Solves: Single-threaded, memory-bound
Pros: No installation necessary; has great support for parallel random number generation
Cons: Can only use one...

otherwise, you risk leaking cluster workers if the cluster type is changed, for example. Creating the cluster object can fail for a number of reasons, and is therefore a source of problems. See the section “Troubleshooting snow Programs” on page 33 for help in solving these problems.

Parallel K-Means

We’re finally ready to use snow to do some parallel computing, so let’s look at a real example: parallel...

are variations of the standard lapply() function, making snow fairly easy to learn. To implement these parallel operations, snow uses a master/worker architecture, where the master sends tasks to the workers, and the workers execute the tasks and return the results to the master. One important feature of snow is that it can be used with different transport mechanisms to communicate between the master...

popular parallel programming package available for R. It was written by Luke Tierney, A. J. Rossini, Na Li, and H. Sevcikova, and is actively maintained by Luke Tierney. It is a mature package, first released on the “Comprehensive R Archive Network” (CRAN) in 2003.

Quick Look

Motivation: You want to use a Linux cluster to run an R script faster. For example, you’re running a Monte Carlo simulation on your laptop,...
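The excerpts above describe the master/worker model, lapply()-style functions, and the risk of leaking cluster workers. The sketch below puts those three ideas together. The sentence before the leaking-workers warning is elided in the source, so this assumes the advice concerns shutting workers down with stopCluster(). The helper name run_on_cluster is hypothetical, and the parallel package stands in for snow because it ships with R.

```r
library(parallel)

# Hypothetical helper (not from the book): apply `fun` to each element of `xs`
# on a temporary local cluster, then shut the workers down.
run_on_cluster <- function(xs, fun, workers = 2) {
  cl <- makeCluster(workers)
  # Stop the workers however the function exits, including on an error,
  # so no worker processes are left behind.
  on.exit(stopCluster(cl), add = TRUE)
  # The master sends tasks to the workers and collects their results in a list.
  parLapply(cl, xs, fun)
}

squares_serial <- lapply(1:10, function(i) i^2)
squares_parallel <- run_on_cluster(1:10, function(i) i^2)

identical(squares_serial, squares_parallel)
```

Registering stopCluster() with on.exit() is one way to tie the lifetime of the workers to the function call. A script that creates its cluster at top level would call stopCluster() explicitly when it is done.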
results are kept in memory on the master until they are returned to the caller in a list. Of course, snow can be used with high-performance distributed file systems in order to operate on large data files, but it’s up to the user to arrange that.

Setting Up

snow is available on CRAN, so it is installed like any other CRAN package. It is pure R code and almost never has installation problems. There are binary...