Hadoop: The Definitive Guide, 3rd Edition

DOCUMENT INFORMATION

Basic information

Format
Pages	647
File size	9.07 MB

Content

www.it-ebooks.info

THIRD EDITION
Hadoop: The Definitive Guide
Tom White
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Hadoop: The Definitive Guide, Third Edition
by Tom White
Revision History for the :
2012-01-27 Early release revision 1
See http://oreilly.com/catalog/errata.csp?isbn=9781449311520 for release details.
ISBN: 978-1-449-31152-0
1327616795

For Eliane, Emilia, and Lottie

Table of Contents

Foreword xiii
Preface xv

1. Meet Hadoop 1
Data! 1
Data Storage and Analysis 3
Comparison with Other Systems 4
RDBMS 4
Grid Computing 6
Volunteer Computing 8
A Brief History of Hadoop 9
Apache Hadoop and the Hadoop Ecosystem 12
Hadoop Releases 13
What’s Covered in this Book 14
Compatibility 15

2. MapReduce 17
A Weather Dataset 17
Data Format 17
Analyzing the Data with Unix Tools 19
Analyzing the Data with Hadoop 20
Map and Reduce 20
Java MapReduce 22
Scaling Out 30
Data Flow 31
Combiner Functions 34
Running a Distributed MapReduce Job 37
Hadoop Streaming 37
Ruby 37
Python 40
Hadoop Pipes 41
Compiling and Running 42

3. The Hadoop Distributed Filesystem 45
The Design of HDFS 45
HDFS Concepts 47
Blocks 47
Namenodes and Datanodes 48
HDFS Federation 49
HDFS High-Availability 50
The Command-Line Interface 51
Basic Filesystem Operations 52
Hadoop Filesystems 54
Interfaces 55
The Java Interface 57
Reading Data from a Hadoop URL 57
Reading Data Using the FileSystem API 59
Writing Data 62
Directories 64
Querying the Filesystem 64
Deleting Data 69
Data Flow 69
Anatomy of a File Read 69
Anatomy of a File Write 72
Coherency Model 75
Parallel Copying with distcp 76
Keeping an HDFS Cluster Balanced 78
Hadoop Archives 78
Using Hadoop Archives 79
Limitations 80

4. Hadoop I/O 83
Data Integrity 83
Data Integrity in HDFS 83
LocalFileSystem 84
ChecksumFileSystem 85
Compression 85
Codecs 87
Compression and Input Splits 91
Using Compression in MapReduce 92
Serialization 94
The Writable Interface 95
Writable Classes 98
Implementing a Custom Writable 105
Serialization Frameworks 110
Avro 112
File-Based Data Structures 132
SequenceFile 132
MapFile 139

5. Developing a MapReduce Application 145
The Configuration API 146
Combining Resources 147
Variable Expansion 148
Configuring the Development Environment 148
Managing Configuration 148
GenericOptionsParser, Tool, and ToolRunner 151
Writing a Unit Test 154
Mapper 154
Reducer 156
Running Locally on Test Data 157
Running a Job in a Local Job Runner 157
Testing the Driver 161
Running on a Cluster 162
Packaging 162
Launching a Job 162
The MapReduce Web UI 164
Retrieving the Results 167
Debugging a Job 169
Hadoop Logs 173
Remote Debugging 175
Tuning a Job 176
Profiling Tasks 177
MapReduce Workflows 180
Decomposing a Problem into MapReduce Jobs 180
JobControl 182
Apache Oozie 182

6. How MapReduce Works 187
Anatomy of a MapReduce Job Run 187
Classic MapReduce (MapReduce 1) 188
YARN (MapReduce 2) 194
Failures 200
Failures in Classic MapReduce 200
Failures in YARN 202
Job Scheduling 204
The Fair Scheduler 205
The Capacity Scheduler 205
Shuffle and Sort 205
The Map Side 206
The Reduce Side 207
Configuration Tuning 209
Task Execution 212
The Task Execution Environment 212
Speculative Execution 213
Output Committers 215
Task JVM Reuse 216
Skipping Bad Records 217

7. MapReduce Types and Formats 221
MapReduce Types 221
The Default MapReduce Job 225
Input Formats 232
Input Splits and Records 232
Text Input 243
Binary Input 247
Multiple Inputs 248
Database Input (and Output) 249
Output Formats 249
Text Output 250
Binary Output 251
Multiple Outputs 251
Lazy Output 255
Database Output 256

8. MapReduce Features 257
Counters 257
Built-in Counters 257
User-Defined Java Counters 262
User-Defined Streaming Counters 266
Sorting 266
Preparation 266
Partial Sort 268
Total Sort 272
Secondary Sort 276
Joins 281
Map-Side Joins 282
Reduce-Side Joins 284
Side Data Distribution 287
Using the Job Configuration 287
Distributed Cache 288
MapReduce Library Classes 294

9. Setting Up a Hadoop Cluster 295
Cluster Specification 295
Network Topology 297
Cluster Setup and Installation 299
Installing Java 300
Creating a Hadoop User 300
Installing Hadoop 300
Testing the Installation 301
SSH Configuration 301
Hadoop Configuration 302
Configuration Management 303
Environment Settings 305
Important Hadoop Daemon Properties 309
Hadoop Daemon Addresses and Ports 314
Other Hadoop Properties 315
User Account Creation 318
YARN Configuration 318
Important YARN Daemon Properties 319
YARN Daemon Addresses and Ports 322
Security 323
Kerberos and Hadoop 324
Delegation Tokens 326
Other Security Enhancements 327
Benchmarking a Hadoop Cluster 329
Hadoop Benchmarks 329
User Jobs 331
Hadoop in the Cloud 332
Hadoop on Amazon EC2 332

10. Administering Hadoop 337
HDFS 337
Persistent Data Structures 337
Safe Mode 342
Audit Logging 344
Tools 344
Monitoring 349
Logging 349
Metrics 350
Java Management Extensions 353

[...]

... dedicated to SETI@home; they are used for other things, too)

A Brief History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The Origin of the Name Hadoop

The name Hadoop is not an acronym;...

... parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability. This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

Comparison with Other Systems...
one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally...

By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 16.

RDBMS

Why can’t we use databases...

... databases and HDFS.

Hadoop Releases

Which version of Hadoop should you use? The answer to this question changes over time, of course, and also depends on the features that you need. “Hadoop Releases” on page 13 summarizes the high-level features in recent Hadoop release series. There are a few active release series. The 1.x release series is a continuation of the 0.20 release series, and contains the most stable...

... improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee. Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the...
improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book. The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project...

The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access...

... used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.

What’s in This Book?

The rest of this book is organized as follows. Chapter 1 emphasizes the need for Hadoop and sketches the history of the project. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular...

... members of the Apache Hadoop community.

What’s New in the Second Edition?

The second edition has two new chapters on Hive and Sqoop (Chapters 12 and 15), a new section covering Avro (in Chapter 4), an introduction to the new security features in Hadoop (in Chapter 9), and a new case study on analyzing massive network graphs using Hadoop (in Chapter 16). This edition continues...
[...]

... 545
Hadoop Usage at Last.fm 545
Last.fm: The Social Music Revolution 545
Hadoop at Last.fm 545
Generating Charts with Hadoop 546
The Track Statistics Program 547
Summary 554
Hadoop and ...

Date posted: 3 May 2014, 20:55
