THIRD EDITION
Hadoop: The Definitive Guide
Tom White
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Hadoop: The Definitive Guide, Third Edition
by Tom White
Revision History for the Third Edition:
2012-01-27 Early release revision 1
See http://oreilly.com/catalog/errata.csp?isbn=9781449311520 for release details.
ISBN: 978-1-449-31152-0
For Eliane, Emilia, and Lottie
Table of Contents
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1. Meet Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Data! 1
Data Storage and Analysis 3
Comparison with Other Systems 4
RDBMS 4
Grid Computing 6
Volunteer Computing 8
A Brief History of Hadoop 9
Apache Hadoop and the Hadoop Ecosystem 12
Hadoop Releases 13
What’s Covered in this Book 14
Compatibility 15
2. MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
A Weather Dataset 17
Data Format 17
Analyzing the Data with Unix Tools 19
Analyzing the Data with Hadoop 20
Map and Reduce 20
Java MapReduce 22
Scaling Out 30
Data Flow 31
Combiner Functions 34
Running a Distributed MapReduce Job 37
Hadoop Streaming 37
Ruby 37
Python 40
Hadoop Pipes 41
Compiling and Running 42
3. The Hadoop Distributed Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
The Design of HDFS 45
HDFS Concepts 47
Blocks 47
Namenodes and Datanodes 48
HDFS Federation 49
HDFS High-Availability 50
The Command-Line Interface 51
Basic Filesystem Operations 52
Hadoop Filesystems 54
Interfaces 55
The Java Interface 57
Reading Data from a Hadoop URL 57
Reading Data Using the FileSystem API 59
Writing Data 62
Directories 64
Querying the Filesystem 64
Deleting Data 69
Data Flow 69
Anatomy of a File Read 69
Anatomy of a File Write 72
Coherency Model 75
Parallel Copying with distcp 76
Keeping an HDFS Cluster Balanced 78
Hadoop Archives 78
Using Hadoop Archives 79
Limitations 80
4. Hadoop I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Data Integrity 83
Data Integrity in HDFS 83
LocalFileSystem 84
ChecksumFileSystem 85
Compression 85
Codecs 87
Compression and Input Splits 91
Using Compression in MapReduce 92
Serialization 94
The Writable Interface 95
Writable Classes 98
Implementing a Custom Writable 105
Serialization Frameworks 110
Avro 112
File-Based Data Structures 132
SequenceFile 132
MapFile 139
5. Developing a MapReduce Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
The Configuration API 146
Combining Resources 147
Variable Expansion 148
Configuring the Development Environment 148
Managing Configuration 148
GenericOptionsParser, Tool, and ToolRunner 151
Writing a Unit Test 154
Mapper 154
Reducer 156
Running Locally on Test Data 157
Running a Job in a Local Job Runner 157
Testing the Driver 161
Running on a Cluster 162
Packaging 162
Launching a Job 162
The MapReduce Web UI 164
Retrieving the Results 167
Debugging a Job 169
Hadoop Logs 173
Remote Debugging 175
Tuning a Job 176
Profiling Tasks 177
MapReduce Workflows 180
Decomposing a Problem into MapReduce Jobs 180
JobControl 182
Apache Oozie 182
6. How MapReduce Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Anatomy of a MapReduce Job Run 187
Classic MapReduce (MapReduce 1) 188
YARN (MapReduce 2) 194
Failures 200
Failures in Classic MapReduce 200
Failures in YARN 202
Job Scheduling 204
The Fair Scheduler 205
The Capacity Scheduler 205
Shuffle and Sort 205
The Map Side 206
The Reduce Side 207
Configuration Tuning 209
Task Execution 212
The Task Execution Environment 212
Speculative Execution 213
Output Committers 215
Task JVM Reuse 216
Skipping Bad Records 217
7. MapReduce Types and Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
MapReduce Types 221
The Default MapReduce Job 225
Input Formats 232
Input Splits and Records 232
Text Input 243
Binary Input 247
Multiple Inputs 248
Database Input (and Output) 249
Output Formats 249
Text Output 250
Binary Output 251
Multiple Outputs 251
Lazy Output 255
Database Output 256
8. MapReduce Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Counters 257
Built-in Counters 257
User-Defined Java Counters 262
User-Defined Streaming Counters 266
Sorting 266
Preparation 266
Partial Sort 268
Total Sort 272
Secondary Sort 276
Joins 281
Map-Side Joins 282
Reduce-Side Joins 284
Side Data Distribution 287
Using the Job Configuration 287
Distributed Cache 288
MapReduce Library Classes 294
9. Setting Up a Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Cluster Specification 295
Network Topology 297
Cluster Setup and Installation 299
Installing Java 300
Creating a Hadoop User 300
Installing Hadoop 300
Testing the Installation 301
SSH Configuration 301
Hadoop Configuration 302
Configuration Management 303
Environment Settings 305
Important Hadoop Daemon Properties 309
Hadoop Daemon Addresses and Ports 314
Other Hadoop Properties 315
User Account Creation 318
YARN Configuration 318
Important YARN Daemon Properties 319
YARN Daemon Addresses and Ports 322
Security 323
Kerberos and Hadoop 324
Delegation Tokens 326
Other Security Enhancements 327
Benchmarking a Hadoop Cluster 329
Hadoop Benchmarks 329
User Jobs 331
Hadoop in the Cloud 332
Hadoop on Amazon EC2 332
10. Administering Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
HDFS 337
Persistent Data Structures 337
Safe Mode 342
Audit Logging 344
Tools 344
Monitoring 349
Logging 349
Metrics 350
Java Management Extensions 353
[...] one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally [...]

[...] new mail servers in as we grow. By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 16. [...]

[...] parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability. This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

Comparison with Other Systems

[...] Nonlinear / Linear [fragment of a table comparing a traditional RDBMS with MapReduce]

Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be
a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semistructured data, since it is designed to interpret the data at processing time. In other [...]

[...] The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated [...]

[...] great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs, such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit. Coordinating the processes in [...]

[...] one other. (This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map, since it has to make sure it can retrieve the necessary map outputs, and if not, regenerate them by running the relevant maps again.) So from the programmer’s
[...] emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google.11 GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl. [...]

[...] was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. Some applications are covered in the case studies in Chapter 16 and on the Hadoop wiki. In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through four terabytes of scanned archives from the paper, converting them to PDFs for the Web.14 The processing took less [...]

[...] since this was the latest stable release at the time of writing. New features from later releases are occasionally mentioned in the text, however, with reference to the version that they were introduced
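The MapReduce model described in the excerpt — functions that turn one set of key-value pairs into another, oblivious to the size of the data or the cluster — can be illustrated outside Hadoop. The sketch below is a hypothetical single-machine Python analogue, in the spirit of the book’s Hadoop Streaming examples; the `year,temperature` input format and all function names are invented for illustration and are not Hadoop’s API. A map function emits (year, temperature) pairs, a reduce function takes the maximum per year, and a sort-and-group step stands in for the shuffle.

```python
# Minimal, non-Hadoop sketch of the MapReduce programming model.
# The "year,temperature" line format is a hypothetical stand-in dataset.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Turn one input record into (key, value) pairs: (year, temperature).
    year, temp = line.split(",")
    yield year, int(temp)

def reduce_fn(year, temps):
    # Merge all values for one key: the maximum temperature for that year.
    return year, max(temps)

def run_job(lines):
    # The "shuffle": sort the mappers' output so it is grouped by key,
    # then hand each key's values to the reduce function.
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(year, [t for _, t in group])
            for year, group in groupby(pairs, key=itemgetter(0))]
```

Because `map_fn` and `reduce_fn` never reference the size of their input, the same pair of functions could in principle be handed to a framework that runs them across a cluster — which is exactly the property the excerpt describes.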
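The seek-time-versus-transfer-rate point in the excerpt lends itself to a back-of-the-envelope model. The constants below (10 ms seek, 100 MB/s transfer, 1 TB disk, 100 KB records) are illustrative assumptions rather than figures from the book; under them, reading even 10% of the disk as scattered records already takes longer than streaming the entire disk once.

```python
# Back-of-the-envelope model: seek-dominated vs. streaming disk access.
# All constants are illustrative assumptions, not measured figures.
SEEK_TIME_S = 0.010        # 10 ms per seek (latency)
TRANSFER_RATE = 100e6      # 100 MB/s sustained transfer (bandwidth)
DISK_BYTES = 1e12          # 1 TB disk

def streaming_time(nbytes):
    """Sequential scan: a single seek, then pure transfer."""
    return SEEK_TIME_S + nbytes / TRANSFER_RATE

def scattered_time(nbytes, record_bytes):
    """Random access: one seek per record, plus each record's transfer."""
    n_records = nbytes / record_bytes
    return n_records * (SEEK_TIME_S + record_bytes / TRANSFER_RATE)

full_scan = streaming_time(DISK_BYTES)                   # 10,000 s (~2.8 h)
ten_percent = scattered_time(0.10 * DISK_BYTES, 100e3)   # 11,000 s (~3.1 h)
```

Reading a tenth of the data the seek-dominated way costs more than reading all of it sequentially, which is why a batch system like MapReduce streams through whole datasets instead of seeking to individual records.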