THIRD EDITION
Hadoop: The Definitive Guide
Tom White
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Hadoop: The Definitive Guide, Third Edition
by Tom White
Revision History for the Third Edition:
2012-01-27 Early release revision 1
See http://oreilly.com/catalog/errata.csp?isbn=9781449311520 for release details.
ISBN: 978-1-449-31152-0
For Eliane, Emilia, and Lottie
Table of Contents
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1. Meet Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Data! 1
Data Storage and Analysis 3
Comparison with Other Systems 4
RDBMS 4
Grid Computing 6
Volunteer Computing 8
A Brief History of Hadoop 9
Apache Hadoop and the Hadoop Ecosystem 12
Hadoop Releases 13
What’s Covered in this Book 14
Compatibility 15
2. MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
A Weather Dataset 17
Data Format 17
Analyzing the Data with Unix Tools 19
Analyzing the Data with Hadoop 20
Map and Reduce 20
Java MapReduce 22
Scaling Out 30
Data Flow 31
Combiner Functions 34
Running a Distributed MapReduce Job 37
Hadoop Streaming 37
Ruby 37
Python 40
Hadoop Pipes 41
Compiling and Running 42
3. The Hadoop Distributed Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
The Design of HDFS 45
HDFS Concepts 47
Blocks 47
Namenodes and Datanodes 48
HDFS Federation 49
HDFS High-Availability 50
The Command-Line Interface 51
Basic Filesystem Operations 52
Hadoop Filesystems 54
Interfaces 55
The Java Interface 57
Reading Data from a Hadoop URL 57
Reading Data Using the FileSystem API 59
Writing Data 62
Directories 64
Querying the Filesystem 64
Deleting Data 69
Data Flow 69
Anatomy of a File Read 69
Anatomy of a File Write 72
Coherency Model 75
Parallel Copying with distcp 76
Keeping an HDFS Cluster Balanced 78
Hadoop Archives 78
Using Hadoop Archives 79
Limitations 80
4. Hadoop I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Data Integrity 83
Data Integrity in HDFS 83
LocalFileSystem 84
ChecksumFileSystem 85
Compression 85
Codecs 87
Compression and Input Splits 91
Using Compression in MapReduce 92
Serialization 94
The Writable Interface 95
Writable Classes 98
Implementing a Custom Writable 105
Serialization Frameworks 110
Avro 112
File-Based Data Structures 132
SequenceFile 132
MapFile 139
5. Developing a MapReduce Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
The Configuration API 146
Combining Resources 147
Variable Expansion 148
Configuring the Development Environment 148
Managing Configuration 148
GenericOptionsParser, Tool, and ToolRunner 151
Writing a Unit Test 154
Mapper 154
Reducer 156
Running Locally on Test Data 157
Running a Job in a Local Job Runner 157
Testing the Driver 161
Running on a Cluster 162
Packaging 162
Launching a Job 162
The MapReduce Web UI 164
Retrieving the Results 167
Debugging a Job 169
Hadoop Logs 173
Remote Debugging 175
Tuning a Job 176
Profiling Tasks 177
MapReduce Workflows 180
Decomposing a Problem into MapReduce Jobs 180
JobControl 182
Apache Oozie 182
6. How MapReduce Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Anatomy of a MapReduce Job Run 187
Classic MapReduce (MapReduce 1) 188
YARN (MapReduce 2) 194
Failures 200
Failures in Classic MapReduce 200
Failures in YARN 202
Job Scheduling 204
The Fair Scheduler 205
The Capacity Scheduler 205
Shuffle and Sort 205
The Map Side 206
The Reduce Side 207
Configuration Tuning 209
Task Execution 212
The Task Execution Environment 212
Speculative Execution 213
Output Committers 215
Task JVM Reuse 216
Skipping Bad Records 217
7. MapReduce Types and Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
MapReduce Types 221
The Default MapReduce Job 225
Input Formats 232
Input Splits and Records 232
Text Input 243
Binary Input 247
Multiple Inputs 248
Database Input (and Output) 249
Output Formats 249
Text Output 250
Binary Output 251
Multiple Outputs 251
Lazy Output 255
Database Output 256
8. MapReduce Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Counters 257
Built-in Counters 257
User-Defined Java Counters 262
User-Defined Streaming Counters 266
Sorting 266
Preparation 266
Partial Sort 268
Total Sort 272
Secondary Sort 276
Joins 281
Map-Side Joins 282
Reduce-Side Joins 284
Side Data Distribution 287
Using the Job Configuration 287
Distributed Cache 288
MapReduce Library Classes 294
9. Setting Up a Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Cluster Specification 295
Network Topology 297
Cluster Setup and Installation 299
Installing Java 300
Creating a Hadoop User 300
Installing Hadoop 300
Testing the Installation 301
SSH Configuration 301
Hadoop Configuration 302
Configuration Management 303
Environment Settings 305
Important Hadoop Daemon Properties 309
Hadoop Daemon Addresses and Ports 314
Other Hadoop Properties 315
User Account Creation 318
YARN Configuration 318
Important YARN Daemon Properties 319
YARN Daemon Addresses and Ports 322
Security 323
Kerberos and Hadoop 324
Delegation Tokens 326
Other Security Enhancements 327
Benchmarking a Hadoop Cluster 329
Hadoop Benchmarks 329
User Jobs 331
Hadoop in the Cloud 332
Hadoop on Amazon EC2 332
10. Administering Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
HDFS 337
Persistent Data Structures 337
Safe Mode 342
Audit Logging 344
Tools 344
Monitoring 349
Logging 349
Metrics 350
Java Management Extensions 353
[...] one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally [...]

[...] new mail servers in as we grow. By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 16. [...]

[...] parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability. This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

Comparison with Other Systems

[...] Nonlinear / Linear [fragment of a table comparing a traditional RDBMS with MapReduce]

Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be
a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semistructured data, since it is designed to interpret the data at processing time. In other [...]

[...] The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated [...]

[...] great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs, such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit. Coordinating the processes in [...]

[...] one other. (This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map, since it has to make sure it can retrieve the necessary map outputs, and if not, regenerate them by running the relevant maps again.) So from the programmer’s
[...] emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google.11 GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl. [...]

[...] was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. Some applications are covered in the case studies in Chapter 16 and on the Hadoop wiki. In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through four terabytes of scanned archives from the paper, converting them to PDFs for the Web.14 The processing took less [...]

[...] since this was the latest stable release at the time of writing. New features from later releases are occasionally mentioned in the text, however, with reference to the version that they were introduced
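The MapReduce model described in the excerpt — functions that turn one set of key-value pairs into another, oblivious to the size of the data or the cluster — can be illustrated outside Hadoop. The sketch below is a hypothetical single-machine Python analogue, in the spirit of the book’s Hadoop Streaming examples; the `year,temperature` input format and all function names are invented for illustration and are not Hadoop’s API. A map function emits (year, temperature) pairs, a reduce function takes the maximum per year, and a sort-and-group step stands in for the shuffle.

```python
# Minimal, non-Hadoop sketch of the MapReduce programming model.
# The "year,temperature" line format is a hypothetical stand-in dataset.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Turn one input record into (key, value) pairs: (year, temperature).
    year, temp = line.split(",")
    yield year, int(temp)

def reduce_fn(year, temps):
    # Merge all values for one key: the maximum temperature for that year.
    return year, max(temps)

def run_job(lines):
    # The "shuffle": sort the mappers' output so it is grouped by key,
    # then hand each key's values to the reduce function.
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(year, [t for _, t in group])
            for year, group in groupby(pairs, key=itemgetter(0))]
```

Because `map_fn` and `reduce_fn` never reference the size of their input, the same pair of functions could in principle be handed to a framework that runs them across a cluster — which is exactly the property the excerpt describes.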
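The seek-time-versus-transfer-rate point in the excerpt lends itself to a back-of-the-envelope model. The constants below (10 ms seek, 100 MB/s transfer, 1 TB disk, 100 KB records) are illustrative assumptions rather than figures from the book; under them, reading even 10% of the disk as scattered records already takes longer than streaming the entire disk once.

```python
# Back-of-the-envelope model: seek-dominated vs. streaming disk access.
# All constants are illustrative assumptions, not measured figures.
SEEK_TIME_S = 0.010        # 10 ms per seek (latency)
TRANSFER_RATE = 100e6      # 100 MB/s sustained transfer (bandwidth)
DISK_BYTES = 1e12          # 1 TB disk

def streaming_time(nbytes):
    """Sequential scan: a single seek, then pure transfer."""
    return SEEK_TIME_S + nbytes / TRANSFER_RATE

def scattered_time(nbytes, record_bytes):
    """Random access: one seek per record, plus each record's transfer."""
    n_records = nbytes / record_bytes
    return n_records * (SEEK_TIME_S + record_bytes / TRANSFER_RATE)

full_scan = streaming_time(DISK_BYTES)                   # 10,000 s (~2.8 h)
ten_percent = scattered_time(0.10 * DISK_BYTES, 100e3)   # 11,000 s (~3.1 h)
```

Reading a tenth of the data the seek-dominated way costs more than reading all of it sequentially, which is why a batch system like MapReduce streams through whole datasets instead of seeking to individual records.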