
Sams Teach Yourself Big Data Analytics with Microsoft HDInsight in 24 Hours



DOCUMENT INFORMATION

Structure

  • About This E-Book

  • Title Page

  • Copyright Page

  • Contents at a Glance

  • Table of Contents

  • About the Authors

  • Dedications

  • Acknowledgments

  • We Want to Hear from You!

  • Reader Services

  • Introduction

    • Who Should Read This Book

    • How This Book Is Organized

    • Conventions Used in This Book

      • Try It Yourself

    • System Requirements

  • Part I: Understanding Big Data, Hadoop 1.0, and 2.0

    • Hour 1. Introduction of Big Data, NoSQL, and Business Value Proposition

      • Types of Analysis

      • Types of Data

        • Structured Data

        • Unstructured Data

        • Semi-Structured Data

      • Big Data

        • Volume Characteristics of Big Data

        • Variety Characteristics of Big Data

        • Velocity Characteristics of Big Data

        • What Big Data Is Not

      • Managing Big Data

        • More Data, More Accurate Models

        • More—and Cheaper—Computing Power and Storage

        • Increased Awareness of the Competition and a Means to Proactively Win Over Competitors

        • Availability of New Tools and Technologies to Process and Manage Big Data

      • NoSQL Systems

        • NoSQL Versus RDBMS

        • Major Types of NoSQL Technologies

        • Benefits of Using NoSQL Systems

        • Limitations of NoSQL Systems

      • Big Data, NoSQL Systems, and the Business Value Proposition

      • Application of Big Data and Big Data Solutions

      • Summary

      • Q&A

    • Hour 2. Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings

      • What Is Apache Hadoop?

      • Architecture of Hadoop and Hadoop Ecosystems

        • Hadoop Distributed File System

        • MapReduce

        • Hadoop Ecosystems

      • What’s New in Hadoop 2.0

        • Single Point of Failure

        • Limited to Running MapReduce Jobs on HDFS

        • Low Computing Resource Utilization

        • Horizontal Scaling Performance Issue

        • Overly Crowded JobTracker

      • Architecture of Hadoop 2.0

        • HDFS High Availability

        • HDFS Federation

        • HDFS Snapshot

      • Tools and Technologies Needed with Big Data Analytics

        • Data Acquisition

        • Data Storage

        • Data Analysis

        • Data Visualization

        • Data Management

        • Development and Monitoring Tools

      • Major Players and Vendors for Hadoop

        • Cloudera

        • Hortonworks

        • MapR

        • Amazon

        • Microsoft

      • Deployment Options for Microsoft Big Data Solutions

        • On-Premises

        • Cloud

      • Summary

      • Q&A

    • Hour 3. Hadoop Distributed File System Versions 1.0 and 2.0

      • Introduction to HDFS

      • HDFS Architecture

        • File Split in HDFS

        • Block Placement and Replication in HDFS

        • Writing to HDFS

        • Reading from HDFS

        • Handling Failures

        • Delete Files from HDFS to Decrease the Replication Factor

      • Rack Awareness

        • Making Clusters Rack Aware

      • WebHDFS

      • Accessing and Managing HDFS Data

        • HDFS Command-Line Interface

        • Using MapReduce, Hive, Pig, or Sqoop

      • What’s New in HDFS 2.0

        • HDFS High Availability

        • HDFS Federation

        • HDFS Snapshot

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 4. The MapReduce Job Framework and Job Execution Pipeline

      • Introduction to MapReduce

      • MapReduce Architecture

        • MapReduce Job Request and Response Flow

        • TaskTracker and Data Node Co-location

      • MapReduce Job Execution Flow

        • Multiple Input and Output Format

        • Mapper

        • Partitioner

        • Reducer

        • Combiner

        • Driver

        • Tool Interface

        • Context Object

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 5. MapReduce—Advanced Concepts and YARN

      • DistributedCache

      • Hadoop Streaming

      • MapReduce Joins

        • Map-Side Join

        • Reduce-Side Join

      • Bloom Filter

      • Performance Improvement

        • Use of Compression

        • Reusing Java Virtual Machine

        • MapReduce Job Scheduling

        • Fair Scheduler

        • Capacity Scheduler

      • Handling Failures

        • JobTracker Failure

        • TaskTracker Failure

        • Task Failure

        • Speculative Execution

        • Handling Bad Records

      • Counter

      • YARN

        • Different Components of YARN

        • Node Manager

        • Container

        • Job Execution Flow in YARN

      • Uber-Tasking Optimization

      • Failures in YARN

        • Task Failure

        • Application Master Failure

        • Node Manager Failure

        • Resource Manager Failure

      • Resource Manager High Availability and Automatic Failover in YARN

        • How to Reach an Active Resource Manager

      • Summary

      • Q&A

        • Quiz

        • Answers

  • Part II: Getting Started with HDInsight and Understanding Its Different Components

    • Hour 6. Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning

      • Introduction to Microsoft Azure

        • Azure Storage Service

      • Understanding HDInsight Service

        • HDInsight Cluster Deployment

      • Provisioning HDInsight on the Azure Management Portal

        • Enabling a Remote Desktop Connection via the Remote Desktop Protocol

        • Verifying HDInsight Setup

      • Automating HDInsight Provisioning with PowerShell

        • Prerequisites

        • Provisioning HDInsight Cluster

        • Verifying HDInsight Setup with PowerShell

      • Managing and Monitoring HDInsight Cluster and Job Execution

      • Summary

      • Q&A

      • Exercise

    • Hour 7. Exploring Typical Components of HDFS Cluster

      • HDFS Cluster Components

        • Understanding Name Node Functionality

        • Why the Secondary Name Node Is Not a Standby Node

        • Standby Name Node

      • HDInsight Cluster Architecture

      • High Availability in HDInsight

        • HA Based on Quorum-Based Storage

        • Failover Detection Using ZooKeeper

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 8. Storing Data in Microsoft Azure Storage Blob

      • Understanding Storage in Microsoft Azure

      • Benefits of Azure Storage Blob over HDFS

      • Azure Storage Explorer Tools

        • Azure Storage Explorer

        • AZCopy

        • Azure PowerShell

        • Hadoop Command Line

        • HDInsight Storage Architecture Details

        • Configuring the Default File System

        • Understanding the Impact of Blob Storage on Performance and Data Locality

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 9. Working with Microsoft Azure HDInsight Emulator

      • Getting Started with HDInsight Emulator

        • Setting Up Microsoft HDInsight Emulator

      • Setting Up Microsoft Azure Emulator for Storage

        • Setting Up Microsoft Storage Emulator

      • Summary

      • Q&A

        • Quiz

        • Answers

  • Part III: Programming MapReduce and HDInsight Script Action

    • Hour 10. Programming MapReduce Jobs

      • MapReduce Hello World!

        • Running a Java MapReduce Program on HDInsight Emulator

      • Analyzing Flight Delays with MapReduce

      • Serialization Frameworks for Hadoop

        • Avro

      • Hadoop Streaming

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 11. Customizing the HDInsight Cluster with Script Action

      • Identifying the Need for Cluster Customization

      • Developing Script Action

        • Using the HDInsightUtilities Module

      • Consuming Script Action

        • Using Script Action with the Azure Management Portal

        • Using Script Action with PowerShell

        • Using Script Action with HDInsight .NET SDK

      • Running a Giraph Job on a Customized HDInsight Cluster

      • Testing Script Action with HDInsight Emulator

      • Summary

      • Q&A

        • Quiz

        • Answers

  • Part IV: Querying and Processing Big Data in HDInsight

    • Hour 12. Getting Started with Apache Hive and Apache Tez in HDInsight

      • Introduction to Apache Hive

      • Getting Started with Apache Hive in HDInsight

        • Using the Hive Command-Line Interface

        • Using PowerShell Scripting

        • Using the Cluster Dashboard

      • Azure HDInsight Tools for Visual Studio

        • Connecting to HDInsight Cluster from Visual Studio

        • Viewing Existing Table Properties and Data

        • Viewing Hive Jobs on HDInsight Cluster

        • Creating New Tables in Hive

        • Writing Hive Queries

        • Creating a Hive Application

      • Programmatically Using the HDInsight .NET SDK

      • Introduction to Apache Tez

        • Using the Apache Tez Engine with Hive on HDInsight

      • Summary

      • Q&A

      • Exercise

    • Hour 13. Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog

      • Programming with Hive in HDInsight

        • Running Examples on HDInsight Emulator

        • Comparison with RDBMS Databases

        • Database or Schema

      • Using Tables in Hive

        • Internal Table

        • External Table

        • Internal and External Tables

        • Supported Data Types for Columns in Hive Tables

        • Other Clauses Used When Creating a Table in Hive

      • Serialization and Deserialization

        • CREATE TABLE AS SELECT Command

        • CREATE TABLE LIKE Command

        • Temporary Table

        • Creating Table Views

      • Data Load Processes for Hive Tables

        • Data Manipulation Language

        • Built-in Functions in Hive

      • Querying Data from Hive Tables

        • Writing Data Analysis Queries

        • Partition Switching or Swapping

        • Dynamic Partition Insert

        • Creating Datasets for Analysis

        • Data Analysis of Timely Departure Percentage, Based on Airline

        • Data Analysis of Cancelled Flights, Based on Cancellation Reason

      • Indexing in Hive

      • Apache Tez in Action

      • Apache HCatalog

      • Summary

      • Q&A

      • Exercise

    • Hour 14. Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 1

      • Introduction to Hive ODBC Driver

        • 32-Bit Versus 64-Bit Hive ODBC Driver

        • Setting Up the Hive ODBC Driver

        • Configuring the 32-Bit Driver

      • Introduction to Microsoft Power BI

      • Accessing Hive Data from Microsoft Excel

      • Summary

      • Q&A

    • Hour 15. Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 2

      • Accessing Hive Data from PowerPivot

        • Reporting and Data Visualization with PowerPivot

        • Reporting and Data Visualization with Excel

        • Reporting and Data Visualization with Power View

        • Reporting and Data Visualization with Power Map

      • Accessing Hive Data from SQL Server

        • Accessing Data from SQL Server Analysis Services

        • Accessing Data from SQL Server Reporting Services

      • Accessing HDInsight Data from Power Query

      • Summary

      • Q&A

      • Exercise

    • Hour 16. Integrating HDInsight with SQL Server Integration Services

      • The Need for Data Movement

      • Introduction to SSIS

      • Analyzing On-time Flight Departure with SSIS

        • Scenario Prerequisites

        • Package Variables

        • Setting Up Azure PowerShell for Automation

      • Provisioning HDInsight Cluster

        • Executing Hive Query

        • Loading Query Results to a SQL Azure Table

        • Executing the Package

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 17. Using Pig for Data Processing

      • Introduction to Pig Latin

      • Using Pig to Count Cancelled Flights

        • Uploading Data to an HDInsight Cluster for Processing

        • Defining Pig Relations

        • Filtering Pig Relations

        • Grouping Records by Cancellation Code

        • Summarizing Cancelled Flights by Reason

        • Retrieving the Cancellation Description by Joining Relations

        • Saving Results to the File System

      • Using HCatalog in a Pig Latin Script

        • Specifying Parallelism in Pig Latin

      • Submitting Pig Jobs with PowerShell

        • Adding Azure Subscription

        • Creating a Pig Job Definition

        • Submitting a Pig Job for Execution

        • Getting the Job Output

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 18. Using Sqoop for Data Movement Between RDBMS and HDInsight

      • What Is Sqoop?

        • Importing Data to HDInsight Clusters

        • Importing to Hive

        • Exporting Data from HDFS

        • Understanding the Export Process

      • Using Sqoop Import and Export Commands

      • Using Sqoop with PowerShell

      • Summary

      • Q&A

        • Quiz

        • Answers

  • Part V: Managing Workflow and Performing Statistical Computing

    • Hour 19. Using Oozie Workflows and Job Orchestration with HDInsight

      • Introduction to Oozie

        • Oozie Workflow

      • Determining On-time Flight Departure Percentage with Oozie

        • Scenario Prerequisites

        • Creating an Oozie Workflow

        • Executing the Workflow

        • Monitoring Job Status

        • Querying the Results

      • Submitting an Oozie Workflow with HDInsight .NET SDK

      • Coordinating Workflows with Oozie

      • Oozie Compared to SSIS

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 20. Performing Statistical Computing with R

      • Introduction to R

        • Installing R on Windows

        • Loading External Data

        • Performing Rudimentary Data Analysis

      • Integrating R with Hadoop

      • Enabling R on HDInsight

        • Installing R on HDInsight

        • Using R with HDInsight

      • Summary

      • Q&A

        • Quiz

        • Answers

  • Part VI: Performing Interactive Analytics and Machine Learning

    • Hour 21. Performing Big Data Analytics with Spark

      • Introduction to Spark

        • Installing Spark on HDInsight

      • Spark Programming Model

        • Log Mining with the Spark Shell

      • Blending SQL Querying with Functional Programs

        • Hive Compared to Spark SQL

        • Using SQL Blended with Functional Code to Analyze Crime Data

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 22. Microsoft Azure Machine Learning

      • History of Traditional Machine Learning

      • Introduction to Azure ML

        • Benefits of Azure ML

      • Azure ML Workspace

        • Azure ML Studio

      • Processes to Build Azure ML Solutions

      • Getting Started with Azure ML

        • Retrieving Data into Azure ML Modules

        • Using the Descriptive Statistics Module

      • Creating Predictive Models with Azure ML

      • Publishing Azure ML Models as Web Services

      • Summary

      • Q&A

      • Exercise

  • Part VII: Performing Real-time Analytics

    • Hour 23. Performing Stream Analytics with Storm

      • Introduction to Storm

        • Understanding the Storm Architecture

      • Using SCP.NET to Develop Storm Solutions

      • Analyzing Speed Limit Violation Incidents with Storm

        • Creating the Storm Topology

        • Creating the SQL Azure Table to Store Violation Counts

        • Submitting the Topology to the HDInsight Storm Cluster

      • Summary

      • Q&A

        • Quiz

        • Answers

    • Hour 24. Introduction to Apache HBase on HDInsight

      • Introduction to Apache HBase

        • When to Use HBase

      • HBase Architecture

        • Creating HBase Tables

        • Writing Data to HBase Tables

        • Reading Data from HBase Tables

        • Data Distribution and Storage

        • Compaction of Data

      • Creating HDInsight Cluster with HBase

        • Using the Azure Management Portal

        • Using PowerShell Scripting

        • Verifying the Created HDInsight with HBase Cluster

      • Summary

      • Q&A

  • Index

  • Code Snippets

Content

About This E-Book

EPUB is an open, industry-standard format for e-books. However, support for EPUB and its many features varies across reading devices and applications. Use your device or app settings to customize the presentation to your liking. Settings that you can customize often include font, font size, single or double column, landscape or portrait mode, and figures that you can click or tap to enlarge. For additional information about the settings and features on your reading device or app, visit the device manufacturer’s Web site.

Many titles include programming code or configuration examples. To optimize the presentation of these elements, view the e-book in single-column, landscape mode and adjust the font size to the smallest setting. In addition to presenting code and configurations in the reflowable text format, we have included images of the code that mimic the presentation found in the print book; therefore, where the reflowable format may compromise the presentation of the code listing, you will see a “Click here to view code image” link. Click the link to view the print-fidelity code image. To return to the previous page viewed, click the Back button on your device or app.

Title Page

Sams Teach Yourself: Big Data Analytics with Microsoft HDInsight® in 24 Hours

Arshad Ali
Manpreet Singh

800 East 96th Street, Indianapolis, Indiana, 46240 USA

Copyright Page

Sams Teach Yourself Big Data Analytics with Microsoft HDInsight® in 24 Hours

Copyright © 2016 by Pearson Education, Inc.

All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the publisher. No patent liability is assumed with respect to the use of the information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions. Nor is any liability assumed for damages resulting from the use of the information contained herein.

ISBN-13: 978-0-672-33727-7
ISBN-10: 0-672-33727-4
Library of Congress Control Number: 2015914167
Printed in the United States of America
First Printing November 2015

  • Editor-in-Chief: Greg Wiegand
  • Acquisitions Editor: Joan Murray
  • Development Editor: Sondra Scott
  • Managing Editor: Sandra Schroeder
  • Senior Project Editor: Tonya Simpson
  • Copy Editor: Krista Hansing Editorial Services, Inc.
  • Senior Indexer: Cheryl Lenser
  • Proofreader: Anne Goebel
  • Technical Editors: Shayne Burgess, Ron Abellera
  • Publishing Coordinator: Cindy Teeter
  • Cover Designer: Mark Shirar
  • Compositor: codeMantra

Trademarks

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Sams Publishing cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark. HDInsight is a registered trademark of Microsoft Corporation.

Warning and Disclaimer

Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information provided is on an “as is” basis. The authors and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.

Special Sales

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please contact international@pearsoned.com.

Contents at a Glance

  • Introduction
  • Part I: Understanding Big Data, Hadoop 1.0, and 2.0
    • HOUR 1 Introduction of Big Data, NoSQL, and Business Value Proposition
    • HOUR 2 Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
    • HOUR 3 Hadoop Distributed File System Versions 1.0 and 2.0
    • HOUR 4 The MapReduce Job Framework and Job Execution Pipeline
    • HOUR 5 MapReduce—Advanced Concepts and YARN
  • Part II: Getting Started with HDInsight and Understanding Its Different Components
    • HOUR 6 Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning
    • HOUR 7 Exploring Typical Components of HDFS Cluster
    • HOUR 8 Storing Data in Microsoft Azure Storage Blob
    • HOUR 9 Working with Microsoft Azure HDInsight Emulator
  • Part III: Programming MapReduce and HDInsight Script Action
    • HOUR 10 Programming MapReduce Jobs
    • HOUR 11 Customizing the HDInsight Cluster with Script Action
  • Part IV: Querying and Processing Big Data in HDInsight
    • HOUR 12 Getting Started with Apache Hive and Apache Tez in HDInsight
    • HOUR 13 Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog
    • HOUR 14 Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 1
    • HOUR 15 Consuming HDInsight Data from Microsoft BI Tools over Hive ODBC Driver: Part 2
    • HOUR 16 Integrating HDInsight with SQL Server Integration Services
    • HOUR 17 Using Pig for Data Processing
    • HOUR 18 Using Sqoop for Data Movement Between RDBMS and HDInsight
  • Part V: Managing Workflow and Performing Statistical Computing
    • HOUR 19 Using Oozie Workflows and Job Orchestration with HDInsight
    • HOUR 20 Performing Statistical Computing with R
  • Part VI: Performing Interactive Analytics and Machine Learning
    • HOUR 21 Performing Big Data Analytics with Spark
    • HOUR 22 Microsoft Azure Machine Learning
  • Part VII: Performing Real-time Analytics
    • HOUR 23 Performing Stream Analytics with Storm
    • HOUR 24 Introduction to Apache HBase on HDInsight
  • Index

Table of Contents

  • Introduction
  • Part I: Understanding Big Data, Hadoop 1.0, and 2.0
    • HOUR 1: Introduction of Big Data, NoSQL, and Business Value Proposition
      • Types of Analysis
      • Types of Data
      • Big Data
      • Managing Big Data
      • NoSQL Systems
      • Big Data, NoSQL Systems, and the Business Value Proposition
      • Application of Big Data and Big Data Solutions
      • Summary
      • Q&A
    • HOUR 2: Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings
      • What Is Apache Hadoop?
      • Architecture of Hadoop and Hadoop Ecosystems
      • What’s New in Hadoop 2.0
      • Architecture of Hadoop 2.0
      • Tools and Technologies Needed with Big Data Analytics
      • Major Players and Vendors for Hadoop
      • Deployment Options for Microsoft Big Data Solutions
      • Summary
      • Q&A
    • HOUR 3: Hadoop Distributed File System Versions 1.0 and 2.0
      • Introduction to HDFS
      • HDFS Architecture
      • Rack Awareness
      • WebHDFS
      • Accessing and Managing HDFS Data
      • What’s New in HDFS 2.0
      • Summary
      • Q&A
    • HOUR 4: The MapReduce Job Framework and Job Execution Pipeline
      • Introduction to MapReduce
      • MapReduce Architecture
      • MapReduce Job Execution Flow
      • Summary
      • Q&A
    • HOUR 5: MapReduce—Advanced Concepts and YARN
      • DistributedCache
      • Hadoop Streaming
      • MapReduce Joins
      • Bloom Filter
      • Performance Improvement
      • Handling Failures
      • Counter
      • YARN
      • Uber-Tasking Optimization
      • Failures in YARN
      • Resource Manager High Availability and Automatic Failover in YARN
      • Summary
      • Q&A
  • Part II: Getting Started with HDInsight and Understanding Its Different ...

Date posted: 02/03/2019, 10:02
