Hadoop: Data Processing and Modelling

DOCUMENT INFORMATION

Basic information

Format
Pages	1,836
File size	20.87 MB

Contents

Hadoop: Data Processing and Modelling Table of Contents Credits Preface What this learning path covers Hadoop Beginner's Guide Hadoop Real World Solutions Cookbook, 2nd edition Mastering Hadoop What you need for this learning path Who this learning path is for Reader feedback Customer support Downloading the example code Errata Piracy Questions Module 1 What It's All About Big data processing The value of data Historically for the few and not the many Classic data processing systems Scale-up Early approaches to scale-out Limiting factors A different approach All roads lead to scale-out Share nothing Expect failure Smart software, dumb hardware Move processing, not data Build applications, not infrastructure Hadoop Thanks, Google Thanks, Doug Thanks, Yahoo Parts of Hadoop Common building blocks HDFS MapReduce Better together Common architecture What it is and isn't good for Cloud computing with Amazon Web Services Too many clouds A third way Different types of costs AWS – infrastructure on demand from Amazon Elastic Compute Cloud (EC2) Simple Storage Service (S3) Elastic MapReduce (EMR) What this book covers A dual approach Summary Getting Hadoop Up and Running Hadoop on a local Ubuntu host Other operating systems Time for action – checking the prerequisites What just happened? Setting up Hadoop A note on versions Time for action – downloading Hadoop What just happened? Time for action – setting up SSH What just happened? Configuring and running Hadoop Time for action – using Hadoop to calculate Pi What just happened? Three modes Time for action – configuring the pseudo-distributed mode What just happened? Configuring the base directory and formatting the filesystem Time for action – changing the base HDFS directory What just happened? Time for action – formatting the NameNode What just happened? Starting and using Hadoop Time for action – starting Hadoop What just happened?
Time for action – using HDFS What just happened? Time for action – WordCount, the Hello World of MapReduce What just happened? Have a go hero – WordCount on a larger body of text Monitoring Hadoop from the browser The HDFS web UI The MapReduce web UI Using Elastic MapReduce Setting up an account in Amazon Web Services Creating an AWS account Signing up for the necessary services Time for action – WordCount on EMR using the management console What just happened? Have a go hero – other EMR sample applications Other ways of using EMR AWS credentials The EMR command-line tools The AWS ecosystem Comparison of local versus EMR Hadoop Summary Understanding MapReduce Key/value pairs What it mean Why key/value data? Some real-world examples MapReduce as a series of key/value transformations Pop quiz – key/value pairs The Hadoop Java API for MapReduce The 0.20 MapReduce Java API The Mapper class The Reducer class The Driver class Writing MapReduce programs Time for action – setting up the classpath What just happened? Time for action – implementing WordCount What just happened? Time for action – building a JAR file What just happened? Time for action – running WordCount on a local Hadoop cluster What just happened? Time for action – running WordCount on EMR What just happened? The pre-0.20 Java MapReduce API Hadoop-provided mapper and reducer implementations Time for action – WordCount the easy way What just happened? Walking through a run of WordCount Startup Splitting the input Task assignment Task startup Ongoing JobTracker monitoring Mapper input Mapper execution Mapper output and reduce input Partitioning The optional partition function Reducer input Reducer execution Reducer output Shutdown That's all there is to it! Apart from the combiner…maybe Why have a combiner? Time for action – WordCount with a combiner What just happened? When you can use the reducer as the combiner Time for action – fixing WordCount to work with a combiner What just happened? 
Reuse is your friend Pop quiz – MapReduce mechanics Hadoop-specific data types The Writable and WritableComparable interfaces Introducing the wrapper classes Primitive wrapper classes Array wrapper classes Map wrapper classes Time for action – using the Writable wrapper classes What just happened? Other wrapper classes Have a go hero – playing with Writables Making your own Input/output Files, splits, and records InputFormat and RecordReader Hadoop-provided InputFormat Hadoop-provided RecordReader OutputFormat and RecordWriter Hadoop-provided OutputFormat Don't forget Sequence files Summary Developing MapReduce Programs Using languages other than Java with Hadoop How Hadoop Streaming works Why to use Hadoop Streaming Time for action – implementing WordCount using Streaming What just happened? Differences in jobs when using Streaming Analyzing a large dataset Getting the UFO sighting dataset Getting a feel for the dataset Time for action – summarizing the UFO data What just happened? Examining UFO shapes Time for action – summarizing the shape data What just happened? Time for action – correlating of sighting duration to UFO shape What just happened? Using Streaming scripts outside Hadoop Time for action – performing the shape/time analysis from the command line What just happened? Java shape and location analysis Time for action – using ChainMapper for field validation/analysis What just happened? Have a go hero Too many abbreviations Using the Distributed Cache Time for action – using the Distributed Cache to improve location output What just happened? Counters, status, and other output Time for action – creating counters, task states, and writing log output What just happened? Too much information! Summary Advanced MapReduce Techniques Simple, advanced, and in-between Joins When this is a bad idea Map-side versus reduce-side joins Matching account and sales information Time for action – reduce-side join using MultipleInputs What just happened? 
DataJoinMapper and TaggedMapperOutput Implementing map-side joins Using the Distributed Cache Have a go hero - Implementing map-side joins Pruning data to fit in the cache Using a data representation instead of raw data Using multiple mappers To join or not to join Graph algorithms Graph 101 Graphs and MapReduce – a match made somewhere Representing a graph Time for action – representing the graph What just happened? Overview of the algorithm The mapper The reducer Iterative application Time for action – creating the source code What just happened? Time for action – the first run What just happened? Time for action – the second run What just happened? Time for action – the third run What just happened? Time for action – the fourth and last run What just happened? Running multiple jobs Final thoughts on graphs Using language-independent data structures Candidate technologies Introducing Avro Time for action – getting and installing Avro What just happened? Avro and schemas Time for action – defining the schema What just happened? Time for action – creating the source Avro data with Ruby What just happened? Time for action – consuming the Avro data with Java What just happened? Using Avro within MapReduce Time for action – generating shape summaries in MapReduce What just happened? Time for action – examining the output data with Ruby What just happened? Time for action – examining the output data with Java What just happened? Have a go hero – graphs in Avro Going forward with Avro Summary When Things Break Failure Embrace failure Or at least don't fear it Don't try this at home Types of failure Hadoop node failure The dfsadmin command Cluster setup, test files, and block sizes Fault tolerance and Elastic MapReduce Time for action – killing a DataNode process What just happened? NameNode and DataNode communication Have a go hero – NameNode log delving Time for action – the replication factor in action What just happened? 
Time for action – intentionally causing missing blocks about / Machine learning supervisor command about / Installation procedure supporting components, Hive / The supporting components of Hive syslogd about / Sources T table about / The data model table joins performing, in Hive / Performing table joins in Hive, How to do it left outer join / Left outer join right outer join / Right outer join full outer join / Full outer join left semi join / Left semi join TaggedMapperOutput class about / DataJoinMapper and TaggedMapperOutput task failures, due to data about / Task failure due to data dirty data, handling through code / Handling dirty data through code skip mode, using / Using Hadoop's skip mode dirty data, handling by skip mode / Time for action – handling dirty data by using skip mode, What just happened? task failures, due to software about / Task failure due to software slow running tasks / Failure of slow running tasks, Time for action – causing task failure HDFS programmatic access / Have a go hero – HDFS programmatic access slow-running tasks, handling / Hadoop's handling of slow-running tasks speculative execution / Speculative execution failing tasks, handling / Hadoop's handling of failing tasks term frequency / Term frequency terminate() method / UDF, UDAF, and UDTF terminatePartial() method / UDF, UDAF, and UDTF TestDFSIO benchmarking / TestDFSIO text data clustering, K-Means Mahout used / Clustering text data using K-Means, How to do it TextInputFormat about / Hadoop-provided InputFormat TextOutputFormat about / Hadoop-provided OutputFormat Tf-idf about / Document analysis using Hadoop and Mahout, Term frequency – inverse document frequency calculating, in Pig / Tf-Idf in Pig three-layer network topology versus four-layer network topology / Three-layer versus four-layer network topology Thrift / Comparison – Avro versus Protocol Buffers / Thrift about / Candidate technologies, Introducing Apache Flume URL / Candidate technologies versus Avro / Comparison
– Avro versus Protocol Buffers / Thrift throughput about / Performance Ticket Granting Server (TGS) about / The Kerberos architecture and workflow Ticket Granting Ticket (TGT) about / The Kerberos architecture and workflow timeline, Hadoop about / Hadoop's timeline timestamp() function / What just happened? TimestampInterceptor class / What just happened? timestamps used, for writing data into directory / Time for action – adding timestamps, What just happened? adding / Time for action – adding timestamps, What just happened? topologies about / Architecture of an Apache Storm cluster topology about / Computation and data modeling in Apache Storm top X finding, Map Reduce program used / Map Reduce program to find the top X, How to do it traditional relational databases about / Pruning data to fit in the cache training data / Machine learning transparent encryption enabling, for HDFS / Enabling transparent encryption for HDFS, How to do it, How it works reference / How it works TreeMap reference link / How to do it truststore configuring / Configuring the keystore and truststore about / Configuring the keystore and truststore Tuple data type, Pig / Complex data types in Pig Twitter apps URL / How to do it Twitter authorization tokens generating / How to do it Twitter data importing Twitter data, Flume used / Importing Twitter data into HDFS using Flume, How to do it Twitter sentiment analysis performing, Hive used / Twitter sentiment analysis using Hive, How to do it, How it works Twitter sentiment analytics performing, R used / Performing Twitter Sentiment Analytics using R, How to do it, How it works Twitter trending topics creating, Spark Streaming used / Creating Twitter trending topics using Spark Streaming, How to do it, How it works defining, Spark streaming used / Twitter trending topics using Spark streaming, How to do it type mapping used, for improving data import / Time for action – using a type mapping, What just happened? U Ubuntu about / What just happened?
UDAF about / UDF, UDAF, and UDTF UDAFs about / UDF, UDAF, and UDTF UDF about / UDF, UDAF, and UDTF Regular UDFs / UDF, UDAF, and UDTF UDAFs / UDF, UDAF, and UDTF UDTF / UDF, UDAF, and UDTF UDFMethodResolver interface / What just happened? UDP syslogd source / It's all about events UDTF about / UDF, UDAF, and UDTF UFO analysis running, on EMR / Time for action – running UFO analysis on EMR ufodata / What just happened? UFO dataset UFO data, summarizing / Time for action – summarizing the UFO data, What just happened? UFO shapes, examining / Examining UFO shapes shape data, summarizing / Time for action – summarizing the shape data, What just happened? sighting duration, correlating to UFO shape / Time for action – correlating of sighting duration to UFO shape, What just happened? Streaming scripts, using outside Hadoop / Using Streaming scripts outside Hadoop shape/time analysis, performing from command line / Time for action – performing the shape/time analysis from the command line, What just happened? UFO data table, Hive creating / Time for action – creating a table for the UFO data, What just happened? data, loading / Time for action – inserting the UFO data, What just happened? data, validating / Validating the data, What just happened? redefining, with correct column separator / Time for action – redefining the table with the correct column separator, What just happened? 
UFO sighting dataset getting / Getting the UFO sighting dataset UFO sighting records sighting date / Getting the UFO sighting dataset recorded date / Getting the UFO sighting dataset location date / Getting the UFO sighting dataset shape / Getting the UFO sighting dataset duration / Getting the UFO sighting dataset description / Getting the UFO sighting dataset ui command about / Installation procedure uniform data distribution balancer command, executing for / Executing the balancer command for uniform data distribution, How to do it UNION operator / The UNION operator UNIONS, complex types / Data types Unix chmod / What just happened? unsupervised learning about / Machine learning update statement versus insert statement / Inserts versus updates use cases, Apache Mahout classification / Apache Mahout clustering / Apache Mahout collaborative filtering / Apache Mahout frequent itemset mining / Apache Mahout use cases, Apache Storm about / Use cases for Apache Storm algorithmic trading in stock markets / Use cases for Apache Storm analytics from social network feeds / Use cases for Apache Storm smart advertising / Use cases for Apache Storm location-based applications / Use cases for Apache Storm sensor network-based applications / Use cases for Apache Storm useful tools, HDFS about / Useful HDFS tools rebalancer / Useful HDFS tools fsck / Useful HDFS tools import checkpoint / Useful HDFS tools User-defined Aggregate Functions (UDAFs) / The supporting components of Hive user-defined counter implementing, in Map Reduce program / Implementing a user-defined counter in a Map Reduce program, How to do it, How it works user-defined function writing, in Pig / Writing a user-defined function in Pig, How to do it user-defined functions (UDF) about / User-Defined Function adding / Time for action – adding a new User Defined Function (UDF), What just happened?
User-defined Functions (UDFs) / The supporting components of Hive user-defined functions (UDFs) about / User-defined functions evaluation functions / The evaluation functions load functions / The load functions store functions / The store functions User-Defined Functions (UDFs) about / Introduction user based recommendation engine setting up, Mahout used / Creating a user-based recommendation engine using Mahout, How to do it, How it works user commands about / YARN commands, User commands jar command / User commands application command / User commands node command / User commands logs command / User commands User Defined functions writing, in Hive / Writing a user-defined function in Hive, How to do it User Element, cluster maxRunningApps / FairScheduler user identity, Hadoop security model about / User identity super user / The super user USE statement / What just happened? V VersionedWritable wrapper class about / Other wrapper classes versioning about / A note on versioning vertical scaling about / Scalability Virtual Private Cloud (VPC) about / Provisioning a Hadoop cluster on EMR W Web log analytics defining / Web log analytics, Solution references / Getting ready problem statement / Problem statement solution / Solution web log data analyzing, Pig used / Analyzing web log data using Pig, How to do it web logs data into HDFS importing, Flume used / Importing web logs data into HDFS using Flume, How to do it, How it works web server data getting, into Hadoop / Time for action – getting web server data into Hadoop, What just happened? WHERE clause / What just happened?
Whir about / Whir URL / Whir WordCount example executing / Time for action – WordCount, the Hello World of MapReduce, What just happened?, Have a go hero – WordCount on a larger body of text mapper and reducer implementations, using / Time for action – WordCount the easy way start-up / Startup input, splitting / Splitting the input task assignment / Task assignment task start-up / Task startup JobTracker monitoring / Ongoing JobTracker monitoring mapper input / Mapper input mapper execution / Mapper execution mapper output / Mapper output and reduce input reduce input / Mapper output and reduce input partitioning / Partitioning optional partition function / The optional partition function reducer input / Reducer input reducer execution / Reducer execution reducer output / Reducer output shutdown / Shutdown combiner class, using / Apart from the combiner…maybe, Time for action – WordCount with a combiner reducer, using as combiner / When you can use the reducer as the combiner fixing, to work with combiner / Time for action – fixing WordCount to work with a combiner implementing, Streaming used / Time for action – implementing WordCount using Streaming, What just happened? 
WordCount example, on EMR AWS management console used / Time for action – WordCount on EMR using the management console Worker node, Apache Storm about / Architecture of an Apache Storm cluster World Wide Web (WWW) about / The inception of Hadoop wrapper classes about / Introducing the wrapper classes primitive wrapper classes / Primitive wrapper classes array wrapper classes / Array wrapper classes map wrapper classes / Map wrapper classes writable wrapper classes / Time for action – using the Writable wrapper classes CompressedWritable / Other wrapper classes ObjectWritable / Other wrapper classes NullWritable / Other wrapper classes VersionedWritable / Other wrapper classes WritableComparable interface using / Writable and WritableComparable Writable interface using / Writable and WritableComparable writable wrapper classes about / Time for action – using the Writable wrapper classes exercises / Have a go hero – playing with Writables X XML data processing, Hive XML SerDe used / Processing XML data in Hive using XML SerDe, How to do it, How it works XML SerDe references / Getting ready Y YARN about / Upcoming Hadoop changes, Yet Another Resource Negotiator (YARN) Spark, running on / Running Spark on YARN, How to do it architecture / Architecture overview monitoring / Monitoring YARN job scheduling / Job scheduling in YARN yarn.scheduler.capacity.<queue-path>.acl_administer_queue property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.acl_submit_applications property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.capacity property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.maximum-applications property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.maximum-capacity property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.state property / CapacityScheduler yarn.scheduler.capacity.<queue-path>.user-limit-factor property / CapacityScheduler
yarn.scheduler.capacity.maximum-am-resource-percent property / CapacityScheduler yarn.scheduler.capacity.maximum-applications property / CapacityScheduler yarn.scheduler.capacity.root.queues property / CapacityScheduler yarn.scheduler.fair.allocation.file property / FairScheduler yarn.scheduler.fair.allow-undeclared-pools property / FairScheduler yarn.scheduler.fair.locality.threshold.node property / FairScheduler yarn.scheduler.fair.locality.threshold.rack property / FairScheduler yarn.scheduler.fair.sizebasedweight property / FairScheduler yarn.scheduler.fair.use-as-default-queue property / FairScheduler YARN applications developing / Developing YARN applications YARN clients, writing / Writing YARN clients ApplicationMaster entity, writing / Writing the Application Master entity YARN architecture about / The YARN architecture Resource Manager (RM) / The YARN architecture, Resource Manager (RM) Node Manager (NM) / The YARN architecture, Node Manager (NM) Application Master (AM) / The YARN architecture, Application Master (AM) container / The YARN architecture client / The YARN architecture YARN clients / YARN clients YARN clients about / YARN clients writing / Writing YARN clients YARN commands about / YARN commands user commands / YARN commands, User commands administration commands / YARN commands yarn rmadmin command / CapacityScheduler Yet Another Resource Negotiator (YARN) about / How to do it, Yet Another Resource Negotiator (YARN)

Date posted: 02/03/2019, 10:57