DOCUMENT INFORMATION

Basic information
Format:
Number of pages: 695
Size: 4.3 MB

Content
Hadoop MapReduce v2 Cookbook Second Edition

Table of Contents

Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
    Support files, eBooks, discount offers, and more
        Why Subscribe?
        Free Access for Packt account holders
Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
        Downloading the example code
        Errata
        Piracy
        Questions

Getting Started with Hadoop v2
    Introduction
        Hadoop Distributed File System – HDFS
        Hadoop YARN
        Hadoop MapReduce
        Hadoop installation modes
    Setting up Hadoop v2 on your local machine (Getting ready; How to do it…; How it works…)
    Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode (Getting ready; How to do it…; How it works…; There’s more…; See also)
    Adding a combiner step to the WordCount MapReduce program (How to do it…; How it works…; There’s more…)
    Setting up HDFS (Getting ready; How to do it…; See also)
    Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2 (Getting ready; How to do it…; How it works…; See also)
    Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution (Getting ready; How to do it…; There’s more…)
    HDFS command-line file operations (Getting ready; How to do it…; How it works…; There’s more…)
    Running the WordCount program in a distributed cluster environment (Getting ready; How to do it…; How it works…; There’s more…)
    Benchmarking HDFS using DFSIO (Getting ready; How to do it…; How it works…; There’s more…)
    Benchmarking Hadoop MapReduce using TeraSort (Getting ready; How to do it…; How it works…)

Cloud Deployments – Using Hadoop YARN on Cloud Environments
    Introduction
    Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce (Getting ready; How to do it…; See also)
    Saving money using Amazon EC2 Spot Instances to execute EMR job flows (How to do it…; There’s more…; See also)
    Executing a Pig script using EMR (How to do it…; There’s more… – Starting a Pig interactive session)
    Executing a Hive script using EMR (How to do it…; There’s more… – Starting a Hive interactive session; See also)
    Creating an Amazon EMR job flow using the AWS Command Line Interface (Getting ready; How to do it…; There’s more…; See also)
    Deploying an Apache HBase cluster on Amazon EC2 using EMR (Getting ready; How to do it…; See also)
    Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs (How to do it…; There’s more…)
    Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment (How to do it…; How it works…; See also)

Hadoop Essentials – Configurations, Unit Tests, and Other APIs
    Introduction
    Optimizing Hadoop YARN and MapReduce configurations for cluster deployments (Getting ready; How to do it…; How it works…; There’s more…)
    Shared user Hadoop clusters – using Fair and Capacity schedulers (How to do it…; How it works…; There’s more…)
    Setting classpath precedence to user-provided JARs (How to do it…; How it works…)
    Speculative execution of straggling tasks (How to do it…; There’s more…)
    Unit testing Hadoop MapReduce applications using MRUnit (Getting ready; How to do it…; See also)
    Integration testing Hadoop MapReduce applications using MiniYarnCluster (Getting ready; How to do it…; See also)
    Adding a new DataNode (Getting ready; How to do it…; There’s more… – Rebalancing HDFS; See also)
    Decommissioning DataNodes (How to do it…; How it works…; See also)
    Using multiple disks/volumes and limiting HDFS disk usage (How to do it…)
    Setting the HDFS block size (How to do it…; There’s more…; See also)
    Setting the file replication factor (How to do it…; How it works…; There’s more…; See also)
    Using the HDFS Java API (How to do it…; How it works…; There’s more… – Configuring the FileSystem object; Retrieving the list of data blocks of a file)

Developing Complex Hadoop MapReduce Applications
    Introduction
    Choosing appropriate Hadoop data types (How to do it…; There’s more…; See also)
    Implementing a custom Hadoop Writable data type (How to do it…; How it works…; There’s more…; See also)
    Implementing a custom Hadoop key type (How to do it…; How it works…; See also)
    Emitting data of different value types from a Mapper (How to do it…; How it works…; There’s more…; See also)
    Choosing a suitable Hadoop InputFormat for your input data format (How to do it…; How it works…; There’s more…; See also)
    Adding support for new input data formats – implementing a custom InputFormat (How to do it…; How it works…; There’s more…; See also)
    Formatting the results of MapReduce computations – using Hadoop OutputFormats (How to do it…; How it works…; There’s more…)
    Writing multiple outputs from a MapReduce computation (How to do it…; How it works…)
    Using multiple input data types and multiple Mapper implementations in a single MapReduce application (See also)
    Hadoop intermediate data partitioning (How to do it…; How it works…; There’s more… – TotalOrderPartitioner; KeyFieldBasedPartitioner)
    Secondary sorting – sorting Reduce input values (How to do it…; How it works…; See also)
    Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache (How to do it…; How it works…; There’s more…)

Index

L
large datasets, to Apache HBase data store
    loading, importtsv used / Loading large datasets to an Apache HBase data store – importtsv and bulkload, How to do it…, How it works…
    loading, bulkload used / Loading large datasets to an Apache HBase data store – importtsv and bulkload, How to do it…, How it works…, There’s more…
LDA
    used, for topic discovery / Topic discovery using Latent Dirichlet Allocation (LDA), How to do it…, How it works…
legacy applications
    Hadoop, using with / Using Hadoop with legacy applications – Hadoop streaming, How it works…, There’s more…
LIMIT operator
    about / How it works…
list of data blocks
    retrieving / Retrieving the list of data blocks of a file

M
machine learning algorithm / Getting started with Apache Mahout
Mahout
    about / Introduction
Mahout Naive Bayes Classifier
    used, for document classification / Document classification using Mahout Naive Bayes Classifier, How to do it…, How it works…
MapFile
    about / Outputting a random accessible indexed InvertedIndex
MapFileOutputFormat format / Outputting a random accessible indexed InvertedIndex
Map function / Simple analytics using MapReduce
    about / Parsing a complex dataset with Hadoop
MapReduce / Introduction
    used, for simple analytics / Simple analytics using MapReduce, Getting ready, How it works…
    used, for performing GROUP BY / Performing GROUP BY using MapReduce, How to do it…, How it works…
    used, for calculating frequency distributions / Calculating frequency distributions and sorting using MapReduce, How to do it…, There’s more…
    used, for calculating sorting / Calculating frequency distributions and sorting using MapReduce, How to do it…, There’s more…
    used, for calculating histograms / Calculating histograms using MapReduce, How to do it…, How it works…
    used, for calculating Scatter plots / Calculating Scatter plots using MapReduce, How to do it…, How it works…
    used, for joining two datasets / Joining two datasets using MapReduce, How to do it…, How it works…
MapReduce computation
    multiple outputs, writing from / Writing multiple outputs from a MapReduce computation, How to do it…, How it works…
MapReduce computations, results
    formatting, Hadoop OutputFormats used / Formatting the results of MapReduce computations – using Hadoop OutputFormats, How it works…
MapReduce configuration
    optimizing, for cluster deployments / Optimizing Hadoop YARN and MapReduce configurations for cluster deployments, How to do it…, There’s more…
    URL / There’s more…
MapReduce jobs
    dependencies, adding between / Adding dependencies between MapReduce jobs, How it works…, There’s more…
    running, on HBase / Running MapReduce jobs on HBase, How to do it…
MapReduce programming model
    Map function / Hadoop MapReduce
    Reduce function / Hadoop MapReduce
MRUnit
    used, for unit testing Hadoop MapReduce applications / Unit testing Hadoop MapReduce applications using MRUnit, How to do it…
    about / Unit testing Hadoop MapReduce applications using MRUnit
    URL / See also
multiple disks/volumes
    using / Using multiple disks/volumes and limiting HDFS disk usage
multiple input data types
    used, in single MapReduce application / Using multiple input data types and multiple Mapper implementations in a single MapReduce application
multiple Mapper implementations
    used, in single MapReduce application / Using multiple input data types and multiple Mapper implementations in a single MapReduce application
multiple outputs
    writing, from MapReduce computation / Writing multiple outputs from a MapReduce computation, How to do it…, How it works…

N
20 Newsgroups dataset
    URL / Introduction
N-dimensional space / Running K-means with Mahout
NameNode
    about / Hadoop Distributed File System – HDFS
NASA weblog dataset
    URL / Introduction
naïve Bayes classifier
    URL / Classification using the naïve Bayes classifier
    used, for classification / How to do it…, How it works…

O
Oracle JDK
    URL / Getting ready
ORC files
    used, for storing table data / Utilizing different storage formats in Hive – storing table data using ORC files, How to do it…
ORDER BY operator
    about / How it works…

P
partitioned Hive tables
    creating / Creating partitioned Hive tables, How to do it…
Partitioner
    about / Introduction
password-less SSH
    configuring / How to do it…
Pig
    about / Introduction
    URL / Getting started with Apache Pig
    used, for joining two datasets / Joining two datasets using Pig, How it works…
Pig interactive session
    starting / Starting a Pig interactive session
Pig Latin
    about / Getting started with Apache Pig
Pig script
    executing, EMR used / Executing a Pig script using EMR, How to do it…, Starting a Pig interactive session
PostgreSQL JDBC driver
    URL / How to do it…
predefined bootstrap actions
    configure-daemons / There’s more…
    configure-hadoop / There’s more…
    memory-intensive / There’s more…
    run-if / There’s more…
Puppet-based cluster installation
    URL / There’s more…
Python
    used, for data preprocessing / Data preprocessing using Hadoop streaming and Python, How to do it…, How it works…

Q
query file
    used, for Hive batch mode / Hive batch mode – using a query file, How to do it…, How it works…, There’s more…

R
random accessible indexed InvertedIndex
    outputting / Outputting a random accessible indexed InvertedIndex
recommendations
    about / Performing content-based recommendations
    making, ways / Performing content-based recommendations
Reduce function / Simple analytics using MapReduce
Reduce input values
    sorting / Secondary sorting – sorting Reduce input values, How to do it…, How it works…
repository files
    URL / Hadoop installation modes

S
S3 bucket
    about / How to do it…
    URL / How to do it…
sample code, GitHub
    URL / Introduction
Scatter plots
    calculating, MapReduce used / Calculating Scatter plots using MapReduce, How to do it…, How it works…
    about / Calculating Scatter plots using MapReduce
SequenceFileInputFormat
    subclasses / There’s more…
shared user Hadoop clusters
    Capacity scheduler, used for / Shared user Hadoop clusters – using Fair and Capacity schedulers, How it works…
    Fair scheduler, used for / Shared user Hadoop clusters – using Fair and Capacity schedulers, How it works…
shuffling
    about / Introduction
Simple Storage Service (S3) / Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
single MapReduce application
    multiple input data types, used in / Using multiple input data types and multiple Mapper implementations in a single MapReduce application
    multiple Mapper implementations, used in / Using multiple input data types and multiple Mapper implementations in a single MapReduce application
SolrCloud
    URL / See also
sorting
    calculating, MapReduce used / Calculating frequency distributions and sorting using MapReduce, How to do it…, There’s more…
SQL-style data
    querying, Apache Hive used / Simple SQL-style data querying using Apache Hive, How to do it…, There’s more…
Sqoop
    about / Introduction
stragglers
    about / Speculative execution of straggling tasks
straggling tasks
    executing / Speculative execution of straggling tasks

T
table data
    storing, ORC files used / Utilizing different storage formats in Hive – storing table data using ORC files, How to do it…
TaskTrackers
    about / Hadoop MapReduce
TeraSort
    used, for benchmarking Hadoop MapReduce / Benchmarking Hadoop MapReduce using TeraSort, How to do it…
term frequencies (TF) / Creating TF and TF-IDF vectors for the text data
Term frequency-inverse document frequency (TF-IDF) / Creating TF and TF-IDF vectors for the text data
text data
    TF-IDF vector, creating for / Creating TF and TF-IDF vectors for the text data, How to do it…, How it works…
    TF vector, creating for / Creating TF and TF-IDF vectors for the text data, How to do it…, How it works…
    clustering, Apache Mahout used / Clustering text data using Apache Mahout, How it works…
TF-IDF vector
    creating, for text data / Creating TF and TF-IDF vectors for the text data, How to do it…, How it works…
TF vector
    creating, for text data / Creating TF and TF-IDF vectors for the text data, How to do it…, How it works…
TotalOrderPartitioner / TotalOrderPartitioner
Twahpic
    URL / Topic discovery using Latent Dirichlet Allocation (LDA)
two datasets
    joining, MapReduce used / Joining two datasets using MapReduce, How to do it…, How it works…
    joining, Pig used / Joining two datasets using Pig, How it works…

U
User-defined Function (UDF)
    about / Writing Hive User-defined Functions (UDF)
user-provided JARs
    classpath precedence, setting to / Setting classpath precedence to user-provided JARs

V
VM, for Amazon EMR jobs
    configuring, EMR bootstrap actions used / Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs, How to do it…, There’s more…

W
web crawling
    about / Intradomain web crawling using Apache Nutch
web crawling, with Apache Nutch
    performing, Hadoop cluster used / Whole web crawling with Apache Nutch using a Hadoop/HBase cluster, How to do it…, How it works…
    performing, HBase cluster used / Whole web crawling with Apache Nutch using a Hadoop/HBase cluster, How to do it…, How it works…
web documents
    indexing, Apache Solr used / Indexing and searching web documents using Apache Solr, How to do it…, How it works…
    searching, Apache Solr used / Indexing and searching web documents using Apache Solr, How to do it…, How it works…
web searching
    about / Introduction
Whirr configuration
    URL / How it works…
WordCount MapReduce application
    writing / Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode, How to do it…, How it works…
    bundling / Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode, How to do it…, How it works…, There’s more…
    running, Hadoop local mode used / Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode, How to do it…, How it works…, There’s more…
WordCount MapReduce program
    combiner step, adding to / Adding a combiner step to the WordCount MapReduce program, How to do it…, There’s more…
WordCount program
    running, in distributed cluster environment / Running the WordCount program in a distributed cluster environment, How to do it…

Y
YARN (Yet Another Resource Negotiator)
    about / Hadoop YARN
YARN configuration
    URL / There’s more…
YARN mini cluster
    used, for integration testing Hadoop MapReduce applications / Integration testing Hadoop MapReduce applications using MiniYarnCluster, How to do it…

Z
zipf (power law) distribution
    about / How to do it…

…
…will provide you with the skills and knowledge needed to process large and complex datasets using the next-generation Hadoop ecosystem. This book presents many exciting topics, such as MapReduce patterns, using Hadoop to solve analytics, classifications, and data indexing and searching…

Table of Contents (continued)
    Adding resources to the DistributedCache from the command line
    Adding resources to the classpath using the DistributedCache
    Using Hadoop with legacy applications – Hadoop streaming (How to do it…; How it works…; There’s more…)