Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark

Zubair Nabi
Lahore, Pakistan

ISBN-13 (pbk): 978-1-4842-1480-0
ISBN-13 (electronic): 978-1-4842-1479-4
DOI 10.1007/978-1-4842-1479-4
Library of Congress Control Number: 2016941350

Copyright © 2016 by Zubair Nabi. Published by Apress.
Any source code or other supplementary materials referenced by the author in this text is available to readers at www.apress.com. For detailed information about how to locate your book's source code, go to www.apress.com/source-code/.

To my father, who introduced me to the sanctity of the written word, who taught me that erudition transcends mortality, and who shaped me into the person I am today. Thank you, Baba.

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
■ Chapter 1: The Hitchhiker's Guide to Big Data
■ Chapter 2: Introduction to Spark
■ Chapter 3: DStreams: Real-Time RDDs
■ Chapter 4: High-Velocity Streams: Parallelism and Other Stories
■ Chapter 5: Real-Time Route 66: Linking External Data Sources
■ Chapter 6: The Art of Side Effects
■ Chapter 7: Getting Ready for Prime Time
■ Chapter 8: Real-Time ETL and Analytics Magic
■ Chapter 9: Machine Learning at Scale
■ Chapter 10: Of Clouds, Lambdas, and Pythons
Index

CHAPTER 10 ■ OF CLOUDS, LAMBDAS, AND PYTHONS

Figure 10-16. Blending real-time processing with batch processing to implement a data-querying system

The architecture in Figure 10-16 has three layers: the real-time or speed layer, which computes results for a specific number of time units; the batch layer, which computes results for the entire dataset (starting at T=0) periodically with newly appended data; and the serving layer, which serves queries by merging results from the online and offline views.
Every time a new record is generated by the data source (the grey box in the figure), it makes its way to both layers. For the real-time layer, it is instantly consumed, and the result is written to a data store optimized for real-time computation. On the batch layer, the data is first written to an append-only distributed filesystem. A batch-processing job is kicked off periodically to recompute the view over the entire dataset and write it to a batch data store. The real-time layer is refreshed every time the batch job completes. Every time a user query comes in (at the extreme right in the figure), the results for a key—the Yelp business ID in this case—are computed on the fly by merging values from the real-time and batch data stores. In essence, the real-time view masks away the latency of the batch layer. This design is known as the Lambda Architecture.7 It was conceived by Nathan Marz, the creator of the popular Apache Storm stream-processing system.

7. Nathan Marz, "How to Beat the CAP Theorem," Thoughts from the Red Planet, October 13, 2011, http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
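To make the serving layer's merge step concrete, here is a minimal sketch. It is an illustration rather than code from the chapter's listings: the in-memory Maps stand in for lookups against the real-time and batch stores, and the key format (business ID plus rating category) is an assumption.

// Merge a speed-layer view (recent counts) with a batch view (historical counts).
object ServingLayerMergeExample {
  def mergeViews(realTime: Map[String, Long], batch: Map[String, Long]): Map[String, Long] =
    (realTime.keySet ++ batch.keySet).map { key =>
      key -> (realTime.getOrElse(key, 0L) + batch.getOrElse(key, 0L))
    }.toMap

  def main(args: Array[String]) {
    val speedView = Map("business-1:good" -> 3L)                           // from the real-time store
    val batchView = Map("business-1:good" -> 120L, "business-2:bad" -> 14L) // from the batch store
    println(mergeViews(speedView, batchView)) // e.g. Map(business-1:good -> 123, business-2:bad -> 14)
  }
}

The same idea carries over to the concrete implementation that follows: the speed-layer counts live in a low-latency store, the batch counts in a bulk store, and a query touches both.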
Let's implement the Lambda Architecture using a combination of Spark Streaming and Google Cloud Platform. Spark is a great system with which to implement the Lambda Architecture because it provides a unified API and execution engine for both batch and real-time processing. In addition, Spark SQL simplifies the implementation of typical queries, which revolve around aggregations, rollups, and cubes. Finally, the integration of Spark with other Big Data systems, such as message queues, key-value stores, and distributed filesystems, enables end-to-end applications.

Lambda Architecture using Spark Streaming on Google Cloud Platform

Listing 10-4 provides the code for this implementation. For the real-time layer, you use Spark Streaming in concert with Cloud Bigtable. The batch layer, on the other hand, is implemented using BigQuery. The application uses the Yelp reviews dataset to determine the positive and negative ratings of a business ID at different aggregation levels (basically, a SQL rollup operation). The application is ready to be deployed to Dataproc for execution. Let's walk through the code to understand the specifics.

Listing 10-4. Lambda Architecture Using Spark Streaming, Cloud Bigtable, BigQuery, and Dataproc

1    package org.apress.prospark
2
3    import org.apache.hadoop.conf.Configuration
4    import org.apache.hadoop.hbase.HBaseConfiguration
5    import org.apache.hadoop.hbase.client.Put
6    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
7    import org.apache.hadoop.hbase.util.Bytes
8    import org.apache.spark.SparkConf
9    import org.apache.spark.SparkContext
10   import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
11   import org.apache.spark.streaming.Seconds
12   import org.apache.spark.streaming.StreamingContext
13   import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
14   import org.json4s.DefaultFormats
15   import org.json4s.jvalue2extractable
16   import org.json4s.jvalue2monadic
17   import org.json4s.native.JsonMethods.parse
18   import org.json4s.string2JsonInput
19   import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
20   import com.google.gson.JsonObject
21   import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat
22   import org.apache.hadoop.io.Text
23
24   object LambdaDataprocApp {
25
26     def main(args: Array[String]) {
27       if (args.length != 14) {
28         System.err.println(
29           "Usage: LambdaDataprocApp <appname> <batchInterval> <hostname> <port> <projectId> " +
30             "<zone> <clusterId> <tableName> <columnFamilyName> <columnName> <checkpointDir> " +
31             "<sessionLength> <bqDatasetId> <bqTableId>")
32         System.exit(1)
33       }
34
35       val Seq(appName, batchInterval, hostname, port, projectId, zone, clusterId, tableName,
36         columnFamilyName, columnName, checkpointDir, sessionLength, bqDatasetId, bqTableId) = args.toSeq
37
38       val conf = new SparkConf()
39         .setAppName(appName)
40         .setJars(SparkContext.jarOfClass(this.getClass).toSeq)
41
42       val ssc = new StreamingContext(conf, Seconds(batchInterval.toInt))
43       ssc.checkpoint(checkpointDir)
44
45       val statefulCount = (values: Seq[(Int, Long)], state: Option[(Int, Long)]) => {
46         val prevState = state.getOrElse(0, System.currentTimeMillis())
47         if ((System.currentTimeMillis() - prevState._2) > sessionLength.toLong) {
48           None
49         } else {
50           Some(values.map(v => v._1).sum + prevState._1, values.map(v => v._2).max)
51         }
52       }
53
54       val ratings = ssc.socketTextStream(hostname, port.toInt)
55         .map(r => {
56           implicit val formats = DefaultFormats
57           parse(r)
58         })
59         .map(jvalue => {
60           implicit val formats = DefaultFormats
61           ((jvalue \ "business_id").extract[String], (jvalue \ "date").extract[String], (jvalue \ "stars").extract[Int])
62         })
63         .map(rec => (rec._1, rec._2, if (rec._3 > 3) "good" else "bad"))
64
65       ratings.map(rec => (rec.productIterator.mkString(":"), (1, System.currentTimeMillis())))
66         .updateStateByKey(statefulCount)
67         .foreachRDD(rdd => {
68           val hbaseConf = HBaseConfiguration.create()
69           hbaseConf.set("hbase.client.connection.impl", "com.google.cloud.bigtable.hbase1_1.BigtableConnection")
70           hbaseConf.set("google.bigtable.project.id", projectId)
71           hbaseConf.set("google.bigtable.zone.name", zone)
72           hbaseConf.set("google.bigtable.cluster.name", clusterId)
73           hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
74           val jobConf = new Configuration(hbaseConf)
75           jobConf.set("mapreduce.job.outputformat.class",
76             classOf[TableOutputFormat[Text]].getName)
77           rdd.mapPartitions(it => {
78             it.map(rec => {
79               val put = new Put(rec._1.getBytes)
80               put.addColumn(columnFamilyName.getBytes, columnName.getBytes,
81                 Bytes.toBytes(rec._2._1))
82               (rec._1, put)
83             })
84           }).saveAsNewAPIHadoopDataset(jobConf)
85         })
86
87       ratings.foreachRDD(rdd => {
88         val bqConf = new Configuration()
89         val bqTableSchema = "[{'name': 'timestamp', 'type': 'STRING'}, {'name': 'business_id', 'type': 'STRING'}, {'name': 'rating', 'type': 'STRING'}]"
90         BigQueryConfiguration.configureBigQueryOutput(
91           bqConf, projectId, bqDatasetId, bqTableId, bqTableSchema)
92         bqConf.set("mapreduce.job.outputformat.class",
93           classOf[BigQueryOutputFormat[_, _]].getName)
94         rdd.mapPartitions(it => {
95           it.map(rec => (null, {
96             val j = new JsonObject()
97             j.addProperty("timestamp", rec._2)
98             j.addProperty("business_id", rec._1)
99             j.addProperty("rating", rec._3)
100            j
101          }))
102        }).saveAsNewAPIHadoopDataset(bqConf)
103      })
104
105      ssc.start()
106      ssc.awaitTermination()
107
108    }
109
110  }

Similar to Listing 10-2, the application relies on the SocketDriver for data production. Therefore, before running the application, feed the SocketDriver the review dataset (yelp_academic_dataset_review.json). The structure of the JSON objects in the review dataset is outlined in Listing 10-5.

Listing 10-5. JSON Blueprint of the Review Dataset

{
  "votes": {
    "funny": <int>,
    "useful": <int>,
    "cool": <int>
  },
  "user_id": "<string>",
  "review_id": "<string>",
  "stars": <int>,
  "date": "<string>",
  "text": "<string>",
  "type": "review",
  "business_id": "<string>"
}
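To see what the projection and labeling steps of Listing 10-4 (lines 55–63) produce for a single record, here is a small standalone sketch that is not part of the book's listings. The sample field values are invented; only business_id, date, and stars matter to the pipeline.

import org.json4s._
import org.json4s.native.JsonMethods.parse

object ReviewProjectionExample {
  def main(args: Array[String]) {
    implicit val formats = DefaultFormats
    // A made-up review following the blueprint in Listing 10-5.
    val sample = """{"votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "u1", "review_id": "r1",
                     "stars": 4, "date": "2012-01-01", "text": "Great food", "type": "review", "business_id": "b42"}"""
    val jvalue = parse(sample)
    // Project the three fields of interest, as in lines 59-62 of Listing 10-4.
    val projected = ((jvalue \ "business_id").extract[String],
      (jvalue \ "date").extract[String],
      (jvalue \ "stars").extract[Int])
    // Label the review "good" if it has more than three stars, "bad" otherwise (line 63).
    val labeled = (projected._1, projected._2, if (projected._3 > 3) "good" else "bad")
    println(labeled) // (b42,2012-01-01,good)
  }
}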
The application reads this data from a socket (line 54) and projects the business_id, date, and stars fields from the JSON object (lines 55–62). You then categorize records as good if the number of stars is greater than three or bad otherwise (line 63). The resulting stream is divided into two flows: one for real-time processing and the other for batch processing.

The real-time pipeline puts its data in Cloud Bigtable, which is a fully managed, cloud-based version of Bigtable from Google with an HBase-compatible API. You model the layout such that the row key is a concatenation of the business ID, timestamp, and rating category (good or bad) (line 65). With each key, you also need to maintain the count across batches, for which you use an updateStateByKey operation. The real-time pipeline needs to be flushed every time the batch view has been updated in BigQuery. Specifically, you need to remove keys from the updateStateByKey state. To do so, you associate a session-length value with each record, which is simply the time before you expire a key. To implement this, you insert the current system time into each record (line 65) and remove the key if it has exceeded the session length (lines 47–49) in the statefulCount function. Ideally, this session-length value should be equal to the latency of the batch layer. For instance, if the batch job is kicked off every hour, the session length should be one hour as well (3,600,000, because statefulCount compares timestamps in milliseconds).

CREATING A CLOUD BIGTABLE CLUSTER

Log in to the Google Cloud Platform dashboard, and, from the Storage section of the Products & Services menu, select Bigtable. Click Create Cluster on the next screen. Use Figure 10-17 as a reference to set up your cluster.

Figure 10-17. Creating a Cloud Bigtable cluster

Go to the API Manager in the Products & Services menu, and enable the API for Cloud Bigtable. You can also configure the HBase shell to talk to the Bigtable deployment.8

8. https://cloud.google.com/bigtable/docs/installing-hbase-shell

The rest of the real-time layer code writes the row keys and values to Bigtable using the HBase API. Notice that the code is almost identical to Listing 6-12 in Chapter 6. The only difference is the additional configuration parameters for Bigtable, such as the project ID, cluster ID, and cluster zone (lines 69–72).

The batch part of the application relies on BigQuery, which is another fully managed storage service from Google; it allows SQL queries against append-only tables and achieves performance and scalability by using aggregation trees and columnar storage. The unit of creation in BigQuery is a dataset, which can contain many tables. The batch layer in Listing 10-4 (lines 87–103) simply writes each record verbatim to BigQuery. BigQuery's SQL execution layer can then be used to periodically re-create precomputed views.

The BigQuery connector for Spark treats BigQuery as just another Hadoop-compatible storage system. This means you can use saveAsNewAPIHadoopDataset to write to BigQuery. The only things you need to do in the foreachRDD clause are to provide a schema for the BigQuery table (line 89) and configuration information (lines 90–93). You follow a simple schema with a column each for business ID, timestamp, and rating type. For BigQuery, each record needs to be converted to JSON, which is what you do in lines 96–99.
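On the query side, the freshest counts can be read back from Cloud Bigtable through the same HBase-compatible API. The following lookup is a hypothetical sketch rather than part of the chapter's code; the project, zone, cluster, table, row key, and column names are placeholders that mirror the parameters passed to Listing 10-4.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object SpeedLayerLookupExample {
  def main(args: Array[String]) {
    val hbaseConf = HBaseConfiguration.create()
    // Bigtable-over-HBase settings, as in Listing 10-4 (placeholder values).
    hbaseConf.set("hbase.client.connection.impl", "com.google.cloud.bigtable.hbase1_1.BigtableConnection")
    hbaseConf.set("google.bigtable.project.id", "my-project")
    hbaseConf.set("google.bigtable.zone.name", "us-central1-b")
    hbaseConf.set("google.bigtable.cluster.name", "my-cluster")

    val connection = ConnectionFactory.createConnection(hbaseConf)
    val table = connection.getTable(TableName.valueOf("ratingstable"))
    try {
      // Row-key layout from Listing 10-4: businessId:date:rating.
      val get = new Get(Bytes.toBytes("someBusinessId:2012-01-01:good"))
      val value = table.get(get).getValue(Bytes.toBytes("ratingscf"), Bytes.toBytes("ratingscol"))
      if (value != null) println("Speed-layer count: " + Bytes.toInt(value))
    } finally {
      table.close()
      connection.close()
    }
  }
}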
Voilà—your Lambda Architecture application is ready for execution. To execute the application, first add the dependencies from Listing 10-6 to your sbt build definition file. Make sure you have enabled the APIs for both BigQuery and Bigtable in the GCP console. In addition, you need to create a table with a single column family in Bigtable. You can do so by executing the following in the HBase shell:

create 'ratingstable', 'ratingscf'

Listing 10-6. Dependencies Required for the Lambda Architecture Application

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.10"
libraryDependencies += "com.google.cloud.bigtable" % "bigtable-hbase-1.1" % "0.2.3" exclude("com.google.guava", "guava")
libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.1.2"
libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.1.2"
libraryDependencies += "com.google.guava" % "guava" % "16.0"
libraryDependencies += "org.mortbay.jetty.alpn" % "alpn-boot" % "8.1.6.v20151105"
libraryDependencies += "com.google.cloud.bigdataoss" % "bigquery-connector" % "0.7.4-hadoop2"

As before, create a JAR for the application, copy it to the HDFS deployment on your Dataproc cluster, and run it from the Dataproc UI. Refer to Figure 10-18 for the command-line parameters that need to be passed to the application.

Figure 10-18. Arguments required by the Lambda Architecture application on top of Dataproc

Once the application has started executing, you can run the query from Listing 10-7 periodically to create rollup values.

Listing 10-7. BigQuery SQL Query to Calculate Rollups for the Lambda Architecture Application

SELECT timestamp, business_id, rating, COUNT(1) AS count
FROM [<bqDatasetId>.<bqTableId>]
GROUP BY ROLLUP(timestamp, business_id, rating)

And just like that, you have implemented a Lambda Architecture application to realize a highly available and eventually consistent query-serving system. Note that this application uses a common pipeline to generate real-time views as well as to route data to the batch layer. In a real-world deployment, these would need to be separated into two self-contained applications for fault tolerance and performance. For instance, instead of publishing data to a socket, the data can be published to a Kafka topic, and the two separate applications can then consume the topic via individual subscriptions.
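To sketch that separation, each self-contained application (the speed layer and the batch-ingestion path) could create its own direct Kafka stream against the same topic, using the direct consumer covered in Chapter 5. The broker address, topic name, and batch interval below are illustrative assumptions, and the spark-streaming-kafka artifact is assumed to be on the classpath.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReviewsFromKafkaExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ReviewsFromKafkaExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical broker list and topic name; the speed-layer application and the
    // batch-ingestion application would each run their own copy of this consumer.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val reviews = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("yelp-reviews"))
      .map(_._2) // keep only the JSON payload

    reviews.print()

    ssc.start()
    ssc.awaitTermination()
  }
}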
Streaming Graph Analytics

Any dataset with relationships between entities can be modeled as a graph, and its analysis can be mapped onto graph problems. The Web is a graph of machines connected via the Internet, Facebook is a graph of users connected via friends, and locations on Google Maps lend themselves to a graph with connections provided by modes of transportation. The sheer size of some of these graphs negates the use of single-machine libraries. At the other end of the spectrum, standard Big Data systems such as Hadoop or Spark are too low level to capture the expressiveness required for graph processing. GraphX was designed to fill this gap by enabling graph-parallel computation on top of Spark. Like the rest of the book, this last topic is illustrated using an example.

Listing 10-8. JSON Structure of the Yelp User Dataset

{
  "yelping_since": "<string>",
  "votes": {
    "funny": <int>,
    "useful": <int>,
    "cool": <int>
  },
  "review_count": <int>,
  "name": "<string>",
  "user_id": "<string>",
  "friends": [
    "<string>"
  ],
  "fans": <int>,
  "average_stars": <float>,
  "type": "user",
  "compliments": {
    "photos": <int>,
    "hot": <int>,
    "cool": <int>,
    "plain": <int>
  },
  "elite": [
    <int>
  ]
}

Finding influential users in social networks is an interesting topic due to its implications for online advertising. For instance, posts shared by influential users are widely disseminated in comparison to those of ordinary users. The Yelp dataset also contains friendship information for users (JSON attributes in Listing 10-8). This can be used to build a graph of relationships on Yelp. A number of algorithms such as PageRank and HITS can then be used to pinpoint influential users. The PageRank-based approach using GraphX is shown in Listing 10-9.

Listing 10-9. First Streaming GraphX Application

1   package org.apress.prospark
2
3   import org.apache.spark.SparkConf
4   import org.apache.spark.SparkContext
5   import org.apache.spark.graphx.Edge
6   import org.apache.spark.graphx.Graph
7   import org.apache.spark.graphx.Graph.graphToGraphOps
8   import org.apache.spark.streaming.Seconds
9   import org.apache.spark.streaming.StreamingContext
10  import org.json4s.DefaultFormats
11  import org.json4s.jvalue2extractable
12  import org.json4s.jvalue2monadic
13  import org.json4s.native.JsonMethods.parse
14  import org.json4s.string2JsonInput
15
16  object UserRankApp {
17
18    def main(args: Array[String]) {
19      if (args.length != 4) {
20        System.err.println(
21          "Usage: UserRankApp <appname> <batchInterval> <hostname> <port>")
22        System.exit(1)
23      }
24
25      val Seq(appName, batchInterval, hostname, port) = args.toSeq
26
27      val conf = new SparkConf()
28        .setAppName(appName)
29        .setJars(SparkContext.jarOfClass(this.getClass).toSeq)
30
31      val ssc = new StreamingContext(conf, Seconds(batchInterval.toInt))
32      ssc.socketTextStream(hostname, port.toInt)
33        .map(r => {
34          implicit val formats = DefaultFormats
35          parse(r)
36        })
37        .foreachRDD(rdd => {
38          val edges = rdd.map(jvalue => {
39            implicit val formats = DefaultFormats
40            ((jvalue \ "user_id").extract[String], (jvalue \ "friends").extract[Array[String]])
41          })
42            .flatMap(r => r._2.map(f => Edge(r._1.hashCode.toLong, f.hashCode.toLong, 1.0)))
43
44          val vertices = rdd.map(jvalue => {
45            implicit val formats = DefaultFormats
46            ((jvalue \ "user_id").extract[String])
47          })
48            .map(r => (r.hashCode.toLong, r))
49
50          val tolerance = 0.0001
51          val graph = Graph(vertices, edges, "defaultUser")
52            .subgraph(vpred = (id, idStr) => idStr != "defaultUser")
53          val pr = graph.pageRank(tolerance).cache
54
55          graph.outerJoinVertices(pr.vertices) {
56            (userId, attrs, rank) => (rank.getOrElse(0.0).asInstanceOf[Number].doubleValue, attrs)
57          }.vertices.top(10) {
58            Ordering.by(_._2._1)
59          }.foreach(rec => println("User id: %s, Rank: %f".format(rec._2._2, rec._2._1)))
60        })
61
62      ssc.start()
63      ssc.awaitTermination()
64
65    }
66
67  }
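If GraphX is new to you, the following self-contained toy sketch isolates the two calls Listing 10-9 builds on: constructing a Graph from vertex and edge RDDs and running pageRank. The three-user graph, the vertex IDs, and the local master setting are made up for illustration; the listing itself is walked through below.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Edge
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.Graph.graphToGraphOps

object ToyPageRankExample {
  def main(args: Array[String]) {
    // Local master for a quick, self-contained run; any SparkContext works.
    val sc = new SparkContext(new SparkConf().setAppName("ToyPageRankExample").setMaster("local[2]"))

    // Three made-up users and their friendship edges.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(3L, 1L, 1.0)))

    // The third argument is the default attribute for vertices that appear only in edges.
    val graph = Graph(vertices, edges, "unknown")

    // Iterate PageRank until the scores change by less than the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    ranks.collect.foreach { case (id, rank) =>
      println("Vertex %d has rank %f".format(id, rank))
    }

    sc.stop()
  }
}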
After reading the data from the socket and converting each record to JSON (lines 32–36), you implement the core logic of the application in a foreachRDD transform. GraphX does not have native support for online graph analysis, so using this approach means the state of the analysis is confined to individual batches. This per-batch implementation is useful for a large class of applications, which rely only on the current state of the network.

As with other graph libraries, you need to create separate vertex and edge objects, which must be combined to generate the graph. Specifically, you need to create edge and vertex RDDs to spawn a graph object. For the edge records, you create user_id:user_id pairs from each JSON object, where each pair represents a friendship connection between users (lines 38–42). GraphX requires vertex IDs to be in the form of long values. Therefore, the application takes the hash code of user IDs and uses it for this purpose. In addition, you assign a weight of 1.0 to each friendship edge (line 42). Similarly, for the vertex RDDs, you emit the user ID hash as a long and its value in string form (lines 44–48). Both RDDs (edge and vertex) can then be used to create a graph object (line 51) with defaultUser as the default user ID for missing values. Because these default values are meaningless from your perspective (the influence of a defaultUser is useless), the subgraph method is used to filter out such vertices and their edges (line 52).

GraphX out of the box contains an implementation of PageRank, which you use on the graph with a convergence tolerance of 0.0001 (line 53). This returns vertex ID:rank pairs. These need to be joined with the original graph to get back the original user ID (lines 55–57). You then get the top 10 most influential users ordered by rank by iterating over the vertices of the graph (lines 57–59). Finally, you print these values to standard output.

This example should have whetted your appetite for graph processing the Spark way. This was just the tip of the iceberg, though; GraphX is far richer in terms of algorithms, features, and transforms than were presented here. Refer to the official documentation for a deeper dive.

Summary

The days of on-premises data centers are numbered. Fully managed systems deployments are all the rage due to their simple cost model, elasticity, scalability, and out-of-the-box integration with the wider Big Data ecosystem. A number of such solutions also exist for Spark, and this chapter explored Dataproc. Python aficionados also got their hands dirty with the Spark Python API.

To support low-latency and highly available data querying, another topic of discussion in this chapter was the Lambda Architecture. Using a combination of Spark Streaming, Cloud Bigtable, and BigQuery, the nitty-gritty of the Lambda Architecture was laid bare. To wrap up the chapter as well as the book, graph processing using GraphX in tandem with Spark Streaming was introduced.

That's all, folks. With that, you come to the end of this "sparkling" journey!