Real-Time Big Data Analytics


DOCUMENT INFORMATION

Structure

  • Real-Time Big Data Analytics

  • Credits

  • About the Authors

  • About the Reviewer

  • www.PacktPub.com

  • eBooks, discount offers, and more

  • Why subscribe?

  • Preface

  • What this book covers

  • What you need for this book

  • Who this book is for

  • Conventions

  • Reader feedback

  • Customer support

  • Downloading the example code

  • Errata

  • Piracy

  • Questions

  • 1. Introducing the Big Data Technology Landscape and Analytics Platform

  • Big Data – a phenomenon

  • The Big Data dimensional paradigm

  • The Big Data ecosystem

  • The Big Data infrastructure

  • Components of the Big Data ecosystem

  • The Big Data analytics architecture

  • Building business solutions

  • Dataset processing

  • Solution implementation

  • Presentation

  • Distributed batch processing

  • Batch processing in distributed mode

  • Push code to data

  • Distributed databases (NoSQL)

  • Advantages of NoSQL databases

  • Choosing a NoSQL database

  • Real-time processing

  • The telecoms or cellular arena

  • Transportation and logistics

  • The connected vehicle

  • The financial sector

  • Summary

  • 2. Getting Acquainted with Storm

  • An overview of Storm

  • The journey of Storm

  • Storm abstractions

  • Streams

  • Topology

  • Spouts

  • Bolts

  • Tasks

  • Workers

  • Storm architecture and its components

  • A Zookeeper cluster

  • A Storm cluster

  • How and when to use Storm

  • Storm internals

  • Storm parallelism

  • Storm internal message processing

  • Summary

  • 3. Processing Data with Storm

  • Storm input sources

  • Meet Kafka

  • Getting to know more about Kafka

  • Other sources for input to Storm

  • A file as an input source

  • A socket as an input source

  • Kafka as an input source

  • Reliability of data processing

  • The concept of anchoring and reliability

  • The Storm acking framework

  • Storm simple patterns

  • Joins

  • Batching

  • Storm persistence

  • Storm's JDBC persistence framework

  • Summary

  • 4. Introduction to Trident and Optimizing Storm Performance

  • Working with Trident

  • Transactions

  • Trident topology

  • Trident tuples

  • Trident spout

  • Trident operations

  • Merging and joining

  • Filter

  • Function

  • Aggregation

  • Grouping

  • State maintenance

  • Understanding LMAX

  • Memory and cache

  • Ring buffer – the heart of the disruptor

  • Producers

  • Consumers

  • Storm internode communication

  • ZeroMQ

  • Storm ZeroMQ configurations

  • Netty

  • Understanding the Storm UI

  • Storm UI landing page

  • Topology home page

  • Optimizing Storm performance

  • Summary

  • 5. Getting Acquainted with Kinesis

  • Architectural overview of Kinesis

  • Benefits and use cases of Amazon Kinesis

  • High-level architecture

  • Components of Kinesis

  • Creating a Kinesis streaming service

  • Access to AWS Kinesis

  • Configuring the development environment

  • Creating Kinesis streams

  • Creating Kinesis stream producers

  • Creating Kinesis stream consumers

  • Generating and consuming crime alerts

  • Summary

  • 6. Getting Acquainted with Spark

  • An overview of Spark

  • Batch data processing

  • Real-time data processing

  • Apache Spark – a one-stop solution

  • When to use Spark – practical use cases

  • The architecture of Spark

  • High-level architecture

  • Spark extensions/libraries

  • Spark packaging structure and core APIs

  • The Spark execution model – master-worker view

  • Resilient distributed datasets (RDD)

  • RDD – by definition

  • Fault tolerance

  • Storage

  • Persistence

  • Shuffling

  • Writing and executing our first Spark program

  • Hardware requirements

  • Installation of the basic software

  • Spark

  • Java

  • Scala

  • Eclipse

  • Configuring the Spark cluster

  • Coding a Spark job in Scala

  • Coding a Spark job in Java

  • Troubleshooting – tips and tricks

  • Port numbers used by Spark

  • Classpath issues – class not found exception

  • Other common exceptions

  • Summary

  • 7. Programming with RDDs

  • Understanding Spark transformations and actions

  • RDD APIs

  • RDD transformation operations

  • RDD action operations

  • Programming Spark transformations and actions

  • Handling persistence in Spark

  • Summary

  • 8. SQL Query Engine for Spark – Spark SQL

  • The architecture of Spark SQL

  • The emergence of Spark SQL

  • The components of Spark SQL

  • The DataFrame API

  • DataFrames and RDD

  • User-defined functions

  • DataFrames and SQL

  • The Catalyst optimizer

  • SQL and Hive contexts

  • Coding our first Spark SQL job

  • Coding a Spark SQL job in Scala

  • Coding a Spark SQL job in Java

  • Converting RDDs to DataFrames

  • Automated process

  • The manual process

  • Working with Parquet

  • Persisting Parquet data in HDFS

  • Partitioning and schema evolution or merging

  • Partitioning

  • Schema evolution/merging

  • Working with Hive tables

  • Performance tuning and best practices

  • Partitioning and parallelism

  • Serialization

  • Caching

  • Memory tuning

  • Summary

  • 9. Analysis of Streaming Data Using Spark Streaming

  • High-level architecture

  • The components of Spark Streaming

  • The packaging structure of Spark Streaming

  • Spark Streaming APIs

  • Spark Streaming operations

  • Coding our first Spark Streaming job

  • Creating a stream producer

  • Writing our Spark Streaming job in Scala

  • Writing our Spark Streaming job in Java

  • Executing our Spark Streaming job

  • Querying streaming data in real time

  • The high-level architecture of our job

  • Coding the crime producer

  • Coding the stream consumer and transformer

  • Executing the SQL Streaming Crime Analyzer

  • Deployment and monitoring

  • Cluster managers for Spark Streaming

  • Executing Spark Streaming applications on Yarn

  • Executing Spark Streaming applications on Apache Mesos

  • Monitoring Spark Streaming applications

  • Summary

  • 10. Introducing Lambda Architecture

  • What is Lambda Architecture

  • The need for Lambda Architecture

  • Layers/components of Lambda Architecture

  • The technology matrix for Lambda Architecture

  • Realization of Lambda Architecture

  • High-level architecture

  • Configuring Apache Cassandra and Spark

  • Coding the custom producer

  • Coding the real-time layer

  • Coding the batch layer

  • Coding the serving layer

  • Executing all the layers

  • Summary

  • Index

Content

Index

resource managers, Spark
  • Apache Mesos / The Spark execution model – master-worker view
  • Hadoop YARN / The Spark execution model – master-worker view
  • standalone mode / The Spark execution model – master-worker view
  • local mode / The Spark execution model – master-worker view

ring buffer
  • about / Ring buffer – the heart of the disruptor
  • producers / Producers
  • consumers / Consumers

rule based optimizations / The Catalyst optimizer

S

S3
  • reference / Components of Kinesis

Scala
  • reference link / Spark packaging structure and core APIs
  • installing / Scala
  • Spark job, coding in / Coding a Spark job in Scala
  • Spark Streaming job, writing in / Writing our Spark Streaming job in Scala

Scala 2.10.5 compressed tarball
  • download link / Scala

Scala APIs, by Spark Core
  • org.apache.spark / Spark packaging structure and core APIs
  • org.apache.spark.SparkContext / Spark packaging structure and core APIs
  • org.apache.spark.rdd.RDD.scala / Spark packaging structure and core APIs
  • org.apache.spark.annotation / Spark packaging structure and core APIs
  • org.apache.spark.broadcast / Spark packaging structure and core APIs
  • HttpBroadcast / Spark packaging structure and core APIs
  • TorrentBroadcast / Spark packaging structure and core APIs
  • org.apache.spark.io / Spark packaging structure and core APIs
  • org.apache.spark.scheduler / Spark packaging structure and core APIs
  • org.apache.spark.storage / Spark packaging structure and core APIs
  • org.apache.spark.util / Spark packaging structure and core APIs

scalability
  • reference link / Batch data processing, The need for Lambda Architecture

schema evolution
  • about / Schema evolution/merging

schema merging
  • about / Schema evolution/merging

SequenceFileRDDFunctions
  • about / RDD APIs
  • reference link / RDD APIs

serialization process
  • URL / Handling persistence in Spark

shards
  • about / Components of Kinesis
  • for reads / Components of Kinesis
  • for writes / Components of Kinesis

single point of failure (SPOF) / The need for Lambda Architecture

SLAs
  • about / Batch data processing

smart traversing
  • about / Ring buffer – the heart of the disruptor

software development kit (SDK) / Components of Kinesis

Spark
  • overview / An overview of Spark
  • about / Apache Spark – a one-stop solution
  • features / Apache Spark – a one-stop solution
  • practical use cases / When to use Spark – practical use cases
  • packaging structure / Spark packaging structure and core APIs
  • core APIs / Spark packaging structure and core APIs
  • hardware requisites / Hardware requirements
  • installing / Spark
  • persistence handling / Handling persistence in Spark
  • storage levels / Handling persistence in Spark

Spark-Cassandra connector
  • reference link / Configuring Apache Cassandra and Spark

Spark-Cassandra Java library
  • reference link / Configuring Apache Cassandra and Spark

Spark 1.4.0
  • download link / Configuring Apache Cassandra and Spark

Spark actions
  • about / Understanding Spark transformations and actions
  • programming / Programming Spark transformations and actions

Spark architecture
  • about / The architecture of Spark
  • high-level architecture / High-level architecture

Spark cluster
  • configuring / Configuring the Spark cluster

Spark compressed tarball
  • download link / Spark

Spark Core
  • about / Spark packaging structure and core APIs

Spark core engine
  • about / The components of Spark Streaming

Spark driver
  • about / The Spark execution model – master-worker view

Spark execution model
  • about / Spark packaging structure and core APIs

Spark extensions
  • about / Spark packaging structure and core APIs

Spark framework
  • error / Working with Parquet
  • overwrite / Working with Parquet
  • append / Working with Parquet
  • ignore / Working with Parquet

Spark job
  • coding, in Scala / Coding a Spark job in Scala
  • coding, in Java / Coding a Spark job in Java

Spark master
  • about / The Spark execution model – master-worker view

Spark packages
  • reference link / Spark extensions/libraries

SparkR
  • about / Spark extensions/libraries
  • reference link / Spark extensions/libraries

Spark SQL
  • reference link / Spark extensions/libraries
  • phases / The Catalyst optimizer
  • architecture / The architecture of Spark SQL
  • emergence / The emergence of Spark SQL
  • about / The emergence of Spark SQL
  • features / The emergence of Spark SQL
  • components / The components of Spark SQL
  • DataFrame API / The components of Spark SQL
  • catalyst optimizer / The components of Spark SQL

Spark SQL job
  • coding / Coding our first Spark SQL job
  • reference / Coding our first Spark SQL job
  • coding, in Scala / Coding a Spark SQL job in Scala
  • coding, in Java / Coding a Spark SQL job in Java

Spark Streaming
  • reference link / When to use Spark – practical use cases, Spark extensions/libraries
  • about / Spark extensions/libraries
  • high level architecture / High-level architecture
  • components / The components of Spark Streaming
  • packaging structure / The packaging structure of Spark Streaming

Spark Streaming APIs
  • about / Spark Streaming APIs
  • reference link / Spark Streaming APIs

Spark Streaming applications
  • executing, on YARN / Executing Spark Streaming applications on Yarn
  • executing, on Apache Mesos / Executing Spark Streaming applications on Apache Mesos
  • monitoring / Monitoring Spark Streaming applications
  • reference link / Monitoring Spark Streaming applications

Spark Streaming job
  • coding / Coding our first Spark Streaming job
  • writing, in Scala / Writing our Spark Streaming job in Scala
  • writing, in Java / Writing our Spark Streaming job in Java
  • executing / Executing our Spark Streaming job
  • about / The components of Spark Streaming
  • data receiver / The components of Spark Streaming
  • batches / The components of Spark Streaming
  • DStreams / The components of Spark Streaming
  • streaming contexts / The components of Spark Streaming

Spark Streaming operations
  • about / Spark Streaming operations

Spark transformation
  • about / Understanding Spark transformations and actions
  • programming / Programming Spark transformations and actions

Spark UI
  • workers / Configuring the Spark cluster
  • running applications / Configuring the Spark cluster
  • completed application / Configuring the Spark cluster

Spark worker/executors
  • about / The Spark execution model – master-worker view

speed layers
  • about / Layers/components of Lambda Architecture

splits
  • about / Understanding Spark transformations and actions

spout collector / The concept of anchoring and reliability

SQL Streaming Crime Analyzer
  • high-level architecture / The high-level architecture of our job
  • crime producer, coding / Coding the crime producer
  • stream consumer, coding / Coding the stream consumer and transformer
  • stream transformer, coding / Coding the stream consumer and transformer
  • executing / Executing the SQL Streaming Crime Analyzer

standalone resource manager
  • about / Configuring the Spark cluster

StorageLevel class
  • reference link / Persistence

storage levels, Spark
  • StorageLevel.MEMORY_ONLY / Handling persistence in Spark
  • StorageLevel.MEMORY_ONLY_SER / Handling persistence in Spark
  • StorageLevel.MEMORY_AND_DISK / Handling persistence in Spark
  • StorageLevel.MEMORY_AND_DISK_SER / Handling persistence in Spark
  • StorageLevel.DISK_ONLY / Handling persistence in Spark
  • StorageLevel.MEMORY_ONLY_2, MEMORY_AND_DISK_2 / Handling persistence in Spark
  • StorageLevel.OFF_HEAP / Handling persistence in Spark

Storm
  • about / Real-time processing
  • overview / An overview of Storm
  • journey / The journey of Storm
  • performance / The journey of Storm
  • scalability / The journey of Storm
  • fail safe / The journey of Storm
  • reliability / The journey of Storm
  • easy / The journey of Storm
  • open source / The journey of Storm
  • abstractions / Storm abstractions
  • architecture / Storm architecture and its components
  • components / Storm architecture and its components
  • local mode / Storm architecture and its components
  • distributed mode / Storm architecture and its components
  • reference / Storm architecture and its components
  • using / How and when to use Storm
  • input sources / Storm input sources
  • performance, optimizing / Optimizing Storm performance
  • reference link / Apache Spark – a one-stop solution

Storm abstractions
  • stream / Streams
  • topology / Topology
  • spout / Spouts
  • bolts / Bolts

Storm acking framework
  • about / The Storm acking framework

Storm cluster
  • about / A Storm cluster
  • Nimbus / A Storm cluster
  • Supervisors / A Storm cluster
  • UI / A Storm cluster

Storm internal message processing
  • about / Storm internal message processing
  • inter-worker communication / Storm internal message processing
  • intra-worker communication / Storm internal message processing

Storm internals
  • about / Storm internals
  • Storm parallelism / Storm parallelism
  • Storm internal message processing / Storm internal message processing

Storm internode communication
  • about / Storm internode communication
  • ZeroMQ / ZeroMQ
  • Netty / Netty

Storm parallelism
  • about / Storm parallelism
  • worker process / Storm parallelism
  • executors / Storm parallelism
  • tasks / Storm parallelism

Storm persistence
  • about / Storm persistence
  • JDBC persistence framework / Storm's JDBC persistence framework

Storm simple patterns
  • about / Storm simple patterns
  • Joins / Joins
  • batching / Batching

Storm UI
  • about / Understanding the Storm UI
  • landing page / Storm UI landing page
  • topology home page / Topology home page

StreamingContext
  • URL / The components of Spark Streaming

streaming data
  • querying / Querying streaming data in real time

stream producer
  • creating / Creating a stream producer

Supervisors
  • about / A Storm cluster
  • workers / A Storm cluster
  • executors / A Storm cluster
  • tasks / A Storm cluster / Optimizing Storm performance

T

Tachyon
  • URL / Apache Spark – a one-stop solution, Handling persistence in Spark

TextInputFormat
  • reference link / Understanding Spark transformations and actions

Thrift
  • reference / Schema evolution/merging

transformation / Dataset processing

transformation operations, on input streams
  • reference link / Spark Streaming operations

transformation operations, on streaming data
  • windowing operations / Spark Streaming operations
  • transform operations / Spark Streaming operations
  • updateStateByKey Operation / Spark Streaming operations
  • output operations / Spark Streaming operations

Trident
  • working with / Working with Trident
  • transactions / Transactions
  • topology / Trident topology
  • operations / Trident operations

Trident operations
  • about / Trident operations
  • merging / Merging and joining
  • joining / Merging and joining
  • filter / Filter, Function
  • aggregation / Aggregation
  • grouping / Grouping
  • state maintenance / State maintenance

Trident topology
  • about / Trident topology
  • Trident tuples / Trident tuples
  • Trident spout / Trident spout

troubleshooting tips
  • about / Troubleshooting – tips and tricks
  • port numbers, used by Spark / Port numbers used by Spark
  • classpath issues / Classpath issues – class not found exception
  • other common exceptions / Other common exceptions

U

use cases, for batch data processing
  • log analysis/analytics / Batch data processing
  • predictive maintenance / Batch data processing
  • faster claim processing / Batch data processing
  • pricing analytics / Batch data processing

use cases, real-time data processing
  • Internet of Things (IoT) / Real-time data processing
  • online trading systems / Real-time data processing
  • online publishing / Real-time data processing
  • assembly lines / Real-time data processing
  • online gaming systems / Real-time data processing

W

WordCountTopology
  • about / How and when to use Storm

Write Ahead Logs (WAL) / The technology matrix for Lambda Architecture

Y

YARN
  • URL / High-level architecture
  • modes / The Spark execution model – master-worker view
  • Spark Streaming applications, executing on / Executing Spark Streaming applications on Yarn
  • reference link / Executing Spark Streaming applications on Yarn

YARN client mode / The Spark execution model – master-worker view

YARN cluster mode / The Spark execution model – master-worker view

Yet Another Resource Negotiator (YARN) / Batch processing in distributed mode

Z

ZeroMQ
  • about / ZeroMQ
  • Storm ZeroMQ configurations / Storm ZeroMQ configurations

ZooKeeper / Optimizing Storm performance

Zookeeper
  • about / A Storm cluster

Zookeeper cluster
  • about / A Zookeeper cluster

Date posted: 04/03/2019, 11:46