Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition

Structure

  • Hadoop: The Definitive Guide

  • Dedication

  • Foreword

  • Preface

    • Administrative Notes

    • What’s New in the Fourth Edition?

    • What’s New in the Third Edition?

    • What’s New in the Second Edition?

    • Conventions Used in This Book

    • Using Code Examples

    • Safari® Books Online

    • How to Contact Us

    • Acknowledgments

  • I. Hadoop Fundamentals

    • 1. Meet Hadoop

      • Data!

      • Data Storage and Analysis

      • Querying All Your Data

      • Beyond Batch

      • Comparison with Other Systems

        • Relational Database Management Systems

        • Grid Computing

        • Volunteer Computing

      • A Brief History of Apache Hadoop

      • What’s in This Book?

    • 2. MapReduce

      • A Weather Dataset

        • Data Format

      • Analyzing the Data with Unix Tools

      • Analyzing the Data with Hadoop

        • Map and Reduce

        • Java MapReduce

          • A test run

      • Scaling Out

        • Data Flow

        • Combiner Functions

          • Specifying a combiner function

        • Running a Distributed MapReduce Job

      • Hadoop Streaming

        • Ruby

        • Python

    • 3. The Hadoop Distributed Filesystem

      • The Design of HDFS

      • HDFS Concepts

        • Blocks

        • Namenodes and Datanodes

        • Block Caching

        • HDFS Federation

        • HDFS High Availability

          • Failover and fencing

      • The Command-Line Interface

        • Basic Filesystem Operations

      • Hadoop Filesystems

        • Interfaces

          • HTTP

          • C

          • NFS

          • FUSE

      • The Java Interface

        • Reading Data from a Hadoop URL

        • Reading Data Using the FileSystem API

          • FSDataInputStream

        • Writing Data

          • FSDataOutputStream

        • Directories

        • Querying the Filesystem

          • File metadata: FileStatus

          • Listing files

          • File patterns

          • PathFilter

        • Deleting Data

      • Data Flow

        • Anatomy of a File Read

        • Anatomy of a File Write

        • Coherency Model

          • Consequences for application design

      • Parallel Copying with distcp

        • Keeping an HDFS Cluster Balanced

    • 4. YARN

      • Anatomy of a YARN Application Run

        • Resource Requests

        • Application Lifespan

        • Building YARN Applications

      • YARN Compared to MapReduce 1

      • Scheduling in YARN

        • Scheduler Options

        • Capacity Scheduler Configuration

          • Queue placement

        • Fair Scheduler Configuration

          • Enabling the Fair Scheduler

          • Queue configuration

          • Queue placement

          • Preemption

        • Delay Scheduling

        • Dominant Resource Fairness

      • Further Reading

    • 5. Hadoop I/O

      • Data Integrity

        • Data Integrity in HDFS

        • LocalFileSystem

        • ChecksumFileSystem

      • Compression

        • Codecs

          • Compressing and decompressing streams with CompressionCodec

          • Inferring CompressionCodecs using CompressionCodecFactory

          • Native libraries

            • CodecPool

        • Compression and Input Splits

        • Using Compression in MapReduce

          • Compressing map output

      • Serialization

        • The Writable Interface

          • WritableComparable and comparators

        • Writable Classes

          • Writable wrappers for Java primitives

          • Text

            • Indexing

            • Unicode

            • Iteration

            • Mutability

            • Resorting to String

          • BytesWritable

          • NullWritable

          • ObjectWritable and GenericWritable

          • Writable collections

        • Implementing a Custom Writable

          • Implementing a RawComparator for speed

          • Custom comparators

        • Serialization Frameworks

          • Serialization IDL

      • File-Based Data Structures

        • SequenceFile

          • Writing a SequenceFile

          • Reading a SequenceFile

          • Displaying a SequenceFile with the command-line interface

          • Sorting and merging SequenceFiles

          • The SequenceFile format

        • MapFile

          • MapFile variants

        • Other File Formats and Column-Oriented Formats

  • II. MapReduce

    • 6. Developing a MapReduce Application

      • The Configuration API

        • Combining Resources

        • Variable Expansion

      • Setting Up the Development Environment

        • Managing Configuration

        • GenericOptionsParser, Tool, and ToolRunner

      • Writing a Unit Test with MRUnit

        • Mapper

        • Reducer

      • Running Locally on Test Data

        • Running a Job in a Local Job Runner

        • Testing the Driver

      • Running on a Cluster

        • Packaging a Job

          • The client classpath

          • The task classpath

          • Packaging dependencies

          • Task classpath precedence

        • Launching a Job

        • The MapReduce Web UI

          • The resource manager page

          • The MapReduce job page

        • Retrieving the Results

        • Debugging a Job

          • The tasks and task attempts pages

          • Handling malformed data

        • Hadoop Logs

        • Remote Debugging

      • Tuning a Job

        • Profiling Tasks

          • The HPROF profiler

      • MapReduce Workflows

        • Decomposing a Problem into MapReduce Jobs

        • JobControl

        • Apache Oozie

          • Defining an Oozie workflow

          • Packaging and deploying an Oozie workflow application

          • Running an Oozie workflow job

    • 7. How MapReduce Works

      • Anatomy of a MapReduce Job Run

        • Job Submission

        • Job Initialization

        • Task Assignment

        • Task Execution

          • Streaming

        • Progress and Status Updates

        • Job Completion

      • Failures

        • Task Failure

        • Application Master Failure

        • Node Manager Failure

        • Resource Manager Failure

      • Shuffle and Sort

        • The Map Side

        • The Reduce Side

        • Configuration Tuning

      • Task Execution

        • The Task Execution Environment

          • Streaming environment variables

        • Speculative Execution

        • Output Committers

          • Task side-effect files

    • 8. MapReduce Types and Formats

      • MapReduce Types

        • The Default MapReduce Job

          • The default Streaming job

          • Keys and values in Streaming

      • Input Formats

        • Input Splits and Records

          • FileInputFormat

          • FileInputFormat input paths

          • FileInputFormat input splits

          • Small files and CombineFileInputFormat

          • Preventing splitting

          • File information in the mapper

          • Processing a whole file as a record

        • Text Input

          • TextInputFormat

            • Controlling the maximum line length

          • KeyValueTextInputFormat

          • NLineInputFormat

          • XML

        • Binary Input

          • SequenceFileInputFormat

          • SequenceFileAsTextInputFormat

          • SequenceFileAsBinaryInputFormat

          • FixedLengthInputFormat

        • Multiple Inputs

        • Database Input (and Output)

      • Output Formats

        • Text Output

        • Binary Output

          • SequenceFileOutputFormat

          • SequenceFileAsBinaryOutputFormat

          • MapFileOutputFormat

        • Multiple Outputs

          • An example: Partitioning data

          • MultipleOutputs

        • Lazy Output

        • Database Output

    • 9. MapReduce Features

      • Counters

        • Built-in Counters

          • Task counters

          • Job counters

        • User-Defined Java Counters

          • Dynamic counters

          • Retrieving counters

        • User-Defined Streaming Counters

      • Sorting

        • Preparation

        • Partial Sort

        • Total Sort

        • Secondary Sort

          • Java code

          • Streaming

      • Joins

        • Map-Side Joins

        • Reduce-Side Joins

      • Side Data Distribution

        • Using the Job Configuration

        • Distributed Cache

          • Usage

          • How it works

          • The distributed cache API

      • MapReduce Library Classes

  • III. Hadoop Operations

    • 10. Setting Up a Hadoop Cluster

      • Cluster Specification

        • Cluster Sizing

          • Master node scenarios

        • Network Topology

          • Rack awareness

      • Cluster Setup and Installation

        • Installing Java

        • Creating Unix User Accounts

        • Installing Hadoop

        • Configuring SSH

        • Configuring Hadoop

        • Formatting the HDFS Filesystem

        • Starting and Stopping the Daemons

        • Creating User Directories

      • Hadoop Configuration

        • Configuration Management

        • Environment Settings

          • Java

          • Memory heap size

          • System logfiles

          • SSH settings

        • Important Hadoop Daemon Properties

          • HDFS

          • YARN

          • Memory settings in YARN and MapReduce

          • CPU settings in YARN and MapReduce

        • Hadoop Daemon Addresses and Ports

        • Other Hadoop Properties

          • Cluster membership

          • Buffer size

          • HDFS block size

          • Reserved storage space

          • Trash

          • Job scheduler

          • Reduce slow start

          • Short-circuit local reads

      • Security

        • Kerberos and Hadoop

          • An example

        • Delegation Tokens

        • Other Security Enhancements

      • Benchmarking a Hadoop Cluster

        • Hadoop Benchmarks

          • Benchmarking MapReduce with TeraSort

          • Other benchmarks

        • User Jobs

    • 11. Administering Hadoop

      • HDFS

        • Persistent Data Structures

          • Namenode directory structure

          • The filesystem image and edit log

          • Secondary namenode directory structure

          • Datanode directory structure

        • Safe Mode

          • Entering and leaving safe mode

        • Audit Logging

        • Tools

          • dfsadmin

          • Filesystem check (fsck)

            • Finding the blocks for a file

          • Datanode block scanner

          • Balancer

      • Monitoring

        • Logging

          • Setting log levels

          • Getting stack traces

        • Metrics and JMX

      • Maintenance

        • Routine Administration Procedures

          • Metadata backups

          • Data backups

          • Filesystem check (fsck)

          • Filesystem balancer

        • Commissioning and Decommissioning Nodes

          • Commissioning new nodes

          • Decommissioning old nodes

        • Upgrades

          • HDFS data and metadata upgrades

            • Start the upgrade

            • Wait until the upgrade is complete

            • Check the upgrade

            • Roll back the upgrade (optional)

            • Finalize the upgrade (optional)

  • IV. Related Projects

    • 12. Avro

      • Avro Data Types and Schemas

      • In-Memory Serialization and Deserialization

        • The Specific API

      • Avro Datafiles

      • Interoperability

        • Python API

        • Avro Tools

      • Schema Resolution

      • Sort Order

      • Avro MapReduce

      • Sorting Using Avro MapReduce

      • Avro in Other Languages

    • 13. Parquet

      • Data Model

        • Nested Encoding

      • Parquet File Format

      • Parquet Configuration

      • Writing and Reading Parquet Files

        • Avro, Protocol Buffers, and Thrift

          • Projection and read schemas

      • Parquet MapReduce

    • 14. Flume

      • Installing Flume

      • An Example

      • Transactions and Reliability

        • Batching

      • The HDFS Sink

        • Partitioning and Interceptors

        • File Formats

      • Fan Out

        • Delivery Guarantees

        • Replicating and Multiplexing Selectors

      • Distribution: Agent Tiers

        • Delivery Guarantees

      • Sink Groups

      • Integrating Flume with Applications

      • Component Catalog

      • Further Reading

    • 15. Sqoop

      • Getting Sqoop

      • Sqoop Connectors

      • A Sample Import

        • Text and Binary File Formats

      • Generated Code

        • Additional Serialization Systems

      • Imports: A Deeper Look

        • Controlling the Import

        • Imports and Consistency

        • Incremental Imports

        • Direct-Mode Imports

      • Working with Imported Data

        • Imported Data and Hive

      • Importing Large Objects

      • Performing an Export

      • Exports: A Deeper Look

        • Exports and Transactionality

        • Exports and SequenceFiles

      • Further Reading

    • 16. Pig

      • Installing and Running Pig

        • Execution Types

          • Local mode

          • MapReduce mode

        • Running Pig Programs

        • Grunt

        • Pig Latin Editors

      • An Example

        • Generating Examples

      • Comparison with Databases

      • Pig Latin

        • Structure

        • Statements

        • Expressions

        • Types

        • Schemas

          • Using Hive tables with HCatalog

          • Validation and nulls

          • Schema merging

        • Functions

          • Other libraries

        • Macros

      • User-Defined Functions

        • A Filter UDF

          • Leveraging types

        • An Eval UDF

          • Dynamic invokers

        • A Load UDF

          • Using a schema

      • Data Processing Operators

        • Loading and Storing Data

        • Filtering Data

          • FOREACH...GENERATE

          • STREAM

        • Grouping and Joining Data

          • JOIN

          • COGROUP

          • CROSS

          • GROUP

        • Sorting Data

        • Combining and Splitting Data

      • Pig in Practice

        • Parallelism

        • Anonymous Relations

        • Parameter Substitution

          • Dynamic parameters

          • Parameter substitution processing

      • Further Reading

    • 17. Hive

      • Installing Hive

        • The Hive Shell

      • An Example

      • Running Hive

        • Configuring Hive

          • Execution engines

          • Logging

        • Hive Services

          • Hive clients

        • The Metastore

      • Comparison with Traditional Databases

        • Schema on Read Versus Schema on Write

        • Updates, Transactions, and Indexes

        • SQL-on-Hadoop Alternatives

      • HiveQL

        • Data Types

          • Primitive types

          • Complex types

        • Operators and Functions

          • Conversions

      • Tables

        • Managed Tables and External Tables

        • Partitions and Buckets

          • Partitions

          • Buckets

        • Storage Formats

          • The default storage format: Delimited text

          • Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles

          • Using a custom SerDe: RegexSerDe

          • Storage handlers

        • Importing Data

          • Inserts

          • Multitable insert

          • CREATE TABLE...AS SELECT

        • Altering Tables

        • Dropping Tables

      • Querying Data

        • Sorting and Aggregating

        • MapReduce Scripts

        • Joins

          • Inner joins

          • Outer joins

          • Semi joins

          • Map joins

        • Subqueries

        • Views

      • User-Defined Functions

        • Writing a UDF

        • Writing a UDAF

          • A more complex UDAF

      • Further Reading

    • 18. Crunch

      • An Example

      • The Core Crunch API

        • Primitive Operations

          • union()

          • parallelDo()

          • groupByKey()

          • combineValues()

        • Types

          • Records and tuples

        • Sources and Targets

          • Reading from a source

          • Writing to a target

          • Existing outputs

          • Combined sources and targets

        • Functions

          • Serialization of functions

          • Object reuse

        • Materialization

          • PObject

      • Pipeline Execution

        • Running a Pipeline

          • Asynchronous execution

          • Debugging

        • Stopping a Pipeline

        • Inspecting a Crunch Plan

        • Iterative Algorithms

        • Checkpointing a Pipeline

      • Crunch Libraries

      • Further Reading

    • 19. Spark

      • Installing Spark

      • An Example

        • Spark Applications, Jobs, Stages, and Tasks

        • A Scala Standalone Application

        • A Java Example

        • A Python Example

      • Resilient Distributed Datasets

        • Creation

        • Transformations and Actions

          • Aggregation transformations

        • Persistence

          • Persistence levels

        • Serialization

          • Data

          • Functions

      • Shared Variables

        • Broadcast Variables

        • Accumulators

      • Anatomy of a Spark Job Run

        • Job Submission

        • DAG Construction

        • Task Scheduling

        • Task Execution

      • Executors and Cluster Managers

        • Spark on YARN

          • YARN client mode

          • YARN cluster mode

      • Further Reading

    • 20. HBase

      • HBasics

        • Backdrop

      • Concepts

        • Whirlwind Tour of the Data Model

          • Regions

          • Locking

        • Implementation

          • HBase in operation

      • Installation

        • Test Drive

      • Clients

        • Java

        • MapReduce

        • REST and Thrift

      • Building an Online Query Application

        • Schema Design

        • Loading Data

          • Load distribution

          • Bulk load

        • Online Queries

          • Station queries

          • Observation queries

      • HBase Versus RDBMS

        • Successful Service

        • HBase

      • Praxis

        • HDFS

        • UI

        • Metrics

        • Counters

      • Further Reading

    • 21. ZooKeeper

      • Installing and Running ZooKeeper

      • An Example

        • Group Membership in ZooKeeper

        • Creating the Group

        • Joining a Group

        • Listing Members in a Group

          • ZooKeeper command-line tools

        • Deleting a Group

      • The ZooKeeper Service

        • Data Model

          • Ephemeral znodes

          • Sequence numbers

          • Watches

        • Operations

          • Multiupdate

          • APIs

          • Watch triggers

          • ACLs

        • Implementation

        • Consistency

        • Sessions

          • Time

        • States

      • Building Applications with ZooKeeper

        • A Configuration Service

        • The Resilient ZooKeeper Application

          • InterruptedException

          • KeeperException

            • State exceptions

            • Recoverable exceptions

            • Unrecoverable exceptions

          • A reliable configuration service

        • A Lock Service

          • The herd effect

          • Recoverable exceptions

          • Unrecoverable exceptions

          • Implementation

        • More Distributed Data Structures and Protocols

          • BookKeeper and Hedwig

      • ZooKeeper in Production

        • Resilience and Performance

        • Configuration

      • Further Reading

  • V. Case Studies

    • 22. Composable Data at Cerner

      • From CPUs to Semantic Integration

      • Enter Apache Crunch

      • Building a Complete Picture

      • Integrating Healthcare Data

      • Composability over Frameworks

      • Moving Forward

    • 23. Biological Data Science: Saving Lives with Software

      • The Structure of DNA

      • The Genetic Code: Turning DNA Letters into Proteins

      • Thinking of DNA as Source Code

      • The Human Genome Project and Reference Genomes

      • Sequencing and Aligning DNA

      • ADAM, A Scalable Genome Analysis Platform

        • Literate programming with the Avro interface description language (IDL)

        • Column-oriented access with Parquet

        • A simple example: k-mer counting using Spark and ADAM

      • From Personalized Ads to Personalized Medicine

      • Join In

    • 24. Cascading

      • Fields, Tuples, and Pipes

      • Operations

      • Taps, Schemes, and Flows

      • Cascading in Practice

      • Flexibility

      • Hadoop and Cascading at ShareThis

      • Summary

  • A. Installing Apache Hadoop

    • Prerequisites

    • Installation

    • Configuration

      • Standalone Mode

      • Pseudodistributed Mode

        • Configuring SSH

        • Formatting the HDFS filesystem

        • Starting and stopping the daemons

        • Creating a user directory

      • Fully Distributed Mode

  • B. Cloudera’s Distribution Including Apache Hadoop

  • C. Preparing the NCDC Weather Data

  • D. The Old and New Java MapReduce APIs

  • Index

  • Colophon

  • Copyright

Content

www.allitebooks.com

Hadoop: The Definitive Guide

Tom White

For Eliane, Emilia, and Lottie

Foreword

Doug Cutting, April 2009
Shed in the Yard, California

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines, and moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.

Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master, not only of the technology, but also of common sense and plain talk.

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

    Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.[1]

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction: to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Since the publication of the first edition of this book, the Hadoop project has blossomed. “Big data” has become a household term.[2] In this time, the software has made great leaps in adoption, performance, reliability, scalability, and manageability. The number of things being built and run on the Hadoop platform has grown enormously. In fact, it’s difficult for one person to keep track. To gain even wider adoption, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with even more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name to reduce clutter. If you need to know which package a class is in, you can easily look it up in the Java API documentation for Hadoop (linked to from the Apache Hadoop home page), or the relevant project. Or if you’re using an integrated development environment (IDE), its auto-complete mechanism can help find what you’re looking for.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example, import org.apache.hadoop.io.*).

The sample programs in this book are available for download from the book’s website. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book and links to updates, additional resources, and my blog.

What’s New in the Fourth Edition?
The fourth edition covers Hadoop 2 exclusively. The Hadoop 2 release series is the current active release series and contains the most stable versions of Hadoop.

There are new chapters covering YARN (Chapter 4), Parquet (Chapter 13), Flume (Chapter 14), Crunch (Chapter 18), and Spark (Chapter 19). There’s also a new section to help readers navigate different pathways through the book (What’s in This Book?).

This edition includes two new case studies (Chapters 22 and 23): one on how Hadoop is used in healthcare systems, and another on using Hadoop technologies for genomics data processing. Case studies from the previous editions can now be found online.

Many corrections, updates, and improvements have been made to existing chapters to bring them up to date with the latest releases of Hadoop and its related projects.
with Software The Structure of DNA The Genetic Code: Turning DNA Letters into Proteins Thinking of DNA as Source Code The Human Genome Project and Reference Genomes Sequencing and Aligning DNA ADAM, A Scalable Genome Analysis Platform Literate programming with the Avro interface description language (IDL) Column-oriented access with Parquet A simple example: k-mer counting using Spark and ADAM From Personalized Ads to Personalized Medicine Join In 24 Cascading Fields, Tuples, and Pipes Operations Taps, Schemes, and Flows Cascading in Practice Flexibility Hadoop and Cascading at ShareThis Summary A Installing Apache Hadoop Prerequisites Installation Configuration Standalone Mode Pseudodistributed Mode Configuring SSH Formatting the HDFS filesystem Starting and stopping the daemons Creating a user directory Fully Distributed Mode B Cloudera’s Distribution Including Apache Hadoop C Preparing the NCDC Weather Data D The Old and New Java MapReduce APIs Index Colophon Copyright ... By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and furthermore, they were able to use what they had... However, the differences between relational databases and Hadoop systems are blurring Relational databases have started incorporating some of the ideas from Hadoop, and from the other direction, Hadoop systems such as Hive are becoming more interactive (by... suited to analysis with Hadoop Note that Hadoop can perform joins; it’s just that they are not used as much as in the relational world MapReduce — and the other processing models in Hadoop — scales linearly with the size

Date posted: 04/03/2019, 09:09