Learning Apache Spark 2
Table of Contents
Learning Apache Spark 2
Credits
About the Author
About the Reviewers
What you need for this book
Who this book is for
1 Architecture and Installation
Apache Spark architecture overview
Installing Apache Spark
Writing your first Spark program
Scala shell examples
Python shell examples
Apache Spark cluster manager types
Building standalone applications with Apache Spark
Submitting applications
Deployment strategies
Running Spark examples
Building your own programs
Parallelizing existing collections
Referencing external data source
Static singleton functions
Passing functions to Spark (Java)
Passing functions to Spark (Python)
Set operations in Spark
Transformations on two PairRDDs
Actions available on PairRDDs
How is Spark being used?
Commonly Supported File Formats
Text Files
Working with Amazon S3
Structured Data sources and Databases
Working with NoSQL Databases
Working with Cassandra
Obtaining a Cassandra table as an RDD
Saving data to Cassandra
Working with HBase
Bulk Delete example
Map Partition Example
Working with MongoDB
Connection to MongoDB
Writing to MongoDB
Loading data from MongoDB
Working with Apache Solr
Importing the JAR File via Spark-shell
Connecting to Solr via DataFrame API
Connecting to Solr via RDD
References
Summary
4 Spark SQL
What is Spark SQL?
What is DataFrame API?
What is DataSet API?
What's new in Spark 2.0?
Under the hood - Catalyst optimizer
Solution 1
Solution 2
The SparkSession
5 Spark Streaming
Steps involved in a streaming app
Architecture of Spark Streaming
Setting up checkpointing with Scala
Setting up checkpointing with Java
Setting up checkpointing with Python
Automatic driver restart
DStream best practices
Fault tolerance
Worker failure impact on receivers
Worker failure impact on RDDs/DStreams
Worker failure impact on output operations
What is Structured Streaming?
Under the hood
Structured Spark Streaming API
Entry point
Output modes
Append mode
Complete mode
Update mode
Output sinks
Failure recovery and checkpointing
References
Summary
6 Machine Learning with Spark
What is machine learning?
Why machine learning?
Types of machine learning
Introduction to Spark MLlib
Why do we need the Pipeline API?
How does it work?
Scala syntax - building a pipeline
Building a pipeline
Predictions on test documents
Python program - predictions on test documents
Feature engineering
Feature extraction algorithms
Feature transformation algorithms
Feature selection algorithms
Classification and regression
Classification
Regression
Clustering
7 GraphX
Basic graph operators (RDD API)
List of graph operators (RDD API)
Caching and uncaching of graphs
Graph algorithms in GraphX
Loading and saving of GraphFrames
Comparison between GraphFrames and GraphX
GraphX <=> GraphFrames
Converting from GraphFrame to GraphX
Converting from GraphX to GraphFrames
References
Summary
8 Operating in Clustered Mode
Clusters, nodes and daemons
Key bits about Spark Architecture
Running Spark in standalone mode
Installing Spark standalone on a cluster
Starting a Spark cluster manually
Cluster overview
Workers overview
Running applications and drivers overview
Completed applications and drivers overview
Using the Cluster Launch Scripts to Start a Standalone Cluster
Environment Properties
Connecting Spark-Shell, PySpark, and R-Shell to the cluster
Resource scheduling
Running Spark in YARN
Spark with a Hadoop Distribution (Cloudera)
Interactive Shell
Batch Application
Important YARN Configuration Parameters
Running Spark in Mesos
Before you start
Steps to use the cluster mode
Mesos run modes
Key Spark on Mesos configuration properties
References
Summary
9 Building a Recommendation System
What is a recommendation system?
Types of recommendations
Manual recommendations
Simple aggregated recommendations based on Popularity
User-specific recommendations
Key issues with recommendation systems
Gathering known input data
Predicting unknown from known ratings
Content-based recommendations
Predicting unknown ratings
Pros and cons of content-based recommendations
Collaborative filtering
Jaccard similarity
Cosine similarity
Centered cosine (Pearson Correlation)
Latent factor methods
Evaluating prediction methods
Recommendation system in Spark
Sample dataset
How does Spark offer recommendation?
Importing relevant libraries
Defining the schema for ratings
Defining the schema for movies
Loading ratings and movies data
Data partitioning
Training an ALS model
Predicting the test dataset
Evaluating model performance
Using implicit preferences
Sanity checking
Model Deployment
References
Summary
10 Customer Churn Prediction
Overview of customer churn
Why is predicting customer churn important?
How do we predict customer churn with Spark?
Data set description
Code example
Defining schema
Loading data
Data exploration
PySpark import code
Exploring international minutes
Exploring night minutes
Exploring day minutes
Exploring eve minutes
Comparing minutes data for churners and non-churners
Comparing charge data for churners and non-churners
Exploring customer service calls
Scala code - constructing a scatter plot
Exploring the churn variable
There's More with Spark
Execution and storage
Tasks running in parallel
Operators within the same task
Memory management configuration options
Memory tuning key tips
I/O tuning
Data locality
Sizing up your executors
Calculating memory overhead
Setting aside memory/CPU for YARN application master
I/O throughput
Sample calculations
The skew problem
Security configuration in Spark
Kerberos authentication
Shared secrets
Shared secret on YARN
Shared secret on other cluster managers
Setting up Jupyter Notebook with Spark
What is a Jupyter Notebook?
Setting up a Jupyter Notebook
Securing the notebook server
Preparing a hashed password
Using Jupyter (only with version 5.0 and later)
Manually creating hashed password
Setting up PySpark on Jupyter
Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: March 2017
www.packtpub.com
Credits

Tejal Daruwale Soni

Content Development Editor
About the Author
Muhammad Asif Abbasi has worked in the industry for over 15 years in a variety of roles, from engineering solutions to selling solutions and everything in between. Asif is currently working with SAS, a market leader in analytics solutions, as a Principal Business Solutions Manager for the Global Technologies Practice. Based in London, Asif has vast experience in consulting for major organizations and industries across the globe, and running proof-of-concepts across various industries including, but not limited to, telecommunications, manufacturing, retail, finance, services, utilities, and government. Asif is an Oracle Certified Java EE 5 Enterprise Architect, Teradata Certified Master, PMP, and Hortonworks Hadoop Certified Developer and Administrator. Asif also holds a Master's degree in Computer Science and Business Administration.
About the Reviewers
Prashant Verma started his IT career in 2011 as a Java developer at Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the Big Data domain and has worked on almost all the popular big data technologies, such as Hadoop, Spark, Flume, Mongo, Cassandra, etc. He has also played with Scala. Currently, he works with QA Infotech as a Lead Data Engineer, working on solving e-Learning problems using analytics and machine learning.

Prashant has also worked on Apache Spark for Java Developers, Packt, as a technical reviewer.
I want to thank Packt Publishing for giving me the chance to review the book as well as my employer and my family for their patience while I was busy working on this book.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review at the website where you acquired this product.

If you'd like to join our team of regular reviewers, you can email us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface

This book will cover the technical aspects of Apache Spark 2.0, one of the fastest growing open-source projects. In order to understand what Apache Spark is, we will quickly recap the history of Big Data and what has made Apache Spark popular. Irrespective of your expertise level, we suggest going through this introduction, as it will help set the context of the book.
The Past
Before going into present-day Spark, it might be worthwhile understanding what problems Spark intends to solve, especially around data movement. Without knowing the background, we will not be able to predict the future.
"You have to learn the past to predict the future."
Late 1990s: The world was a much simpler place to live in, with proprietary databases being the sole choice of consumers. Data was growing at quite an amazing pace, and some of the biggest databases boasted of maintaining datasets in excess of a terabyte.
Early 2000s: The dotcom bubble happened, which meant companies started going online, with the likes of Amazon and eBay leading the revolution. Some of the dotcom start-ups failed, while others succeeded. The commonality among the business models was a razor-sharp focus on page views, and everything started getting focused on the number of users. A lot of marketing budget was spent on getting people online, which meant more customer behavior data in the form of weblogs. Since the de facto storage was an MPP database, and the value of such weblogs was unknown, more often than not these weblogs were stuffed into archive storage or deleted.
2002: In search of a better search engine, Doug Cutting and Mike Cafarella started work on an open source project called Nutch, the objective of which was to be a web-scale crawler. Web scale was defined as billions of web pages, and Doug and Mike were able to index hundreds of millions of web pages, running on a handful of nodes, though the system had a knack of falling down.
2004-2006: Google published papers on the Google File System (GFS) (2003) and MapReduce (2004), demonstrating that the backbone of its search engine was resilient to failures and almost linearly scalable. Doug Cutting took particular interest in this development, as he could see that the GFS and MapReduce papers directly addressed Nutch's shortcomings. Doug Cutting added a MapReduce implementation to Nutch, which ran on 20 nodes and was much easier to program. Of course, we are talking in comparative terms here.
2006-2008: Cutting went to work with Yahoo! in 2006, which had lost the search crown to Google and was equally impressed by the GFS and MapReduce papers. The storage and processing parts of Nutch were spun out to form a separate project named Hadoop under the ASF, whereas the Nutch web crawler remained a separate project. Hadoop became a top-level Apache project in 2008. On February 19, 2008, Yahoo! announced that its search index was run on a 10,000-node Hadoop cluster (truly an amazing feat).
We haven't forgotten about the proprietary database vendors. The majority of them didn't expect Hadoop to change anything for them, as database vendors typically focused on relational data, which was smaller in volume but higher in value. I was talking to the CTO of a major database vendor (who will remain unnamed) and discussing this new and upcoming popular elephant (Hadoop of course! Thanks to Doug Cutting's son for choosing a sane name; I mean, he could have chosen anything else, and you know how kids name things these days). The CTO was quite adamant that the real value was in the relational data, which was the bread and butter of his company, and that despite the fact that unstructured data had huge volumes, it had less business value. This was more of an 80-20 rule for data: from a size perspective, unstructured data was four times the size of structured data (80-20), whereas the same structured data had four times the value of unstructured data. I would say that the relational database vendors massively underestimated the value of unstructured data back then.
Anyways, back to Hadoop: after the announcement by Yahoo!, a lot of companies wanted to get a piece of the action. They realized something big was about to happen in the data space. Lots of interesting use cases started to appear in the Hadoop space, and the de facto compute engine on Hadoop, MapReduce, wasn't able to meet all those expectations.
The MapReduce Conundrum: The original Hadoop comprised primarily HDFS and MapReduce as a compute engine. The original use case of web-scale search meant that the architecture was primarily aimed at long-running batch jobs (typically single-pass jobs without iterations), like the original use case of indexing web pages. The core requirements of such a framework were scalability and fault tolerance, as you don't want to restart a job that has been running for 3 days having completed 95% of its work. Furthermore, the objective of MapReduce was to target acyclic data flows.
A typical MapReduce program is composed of a Map() operation and optionally a Reduce() operation, and any workload had to be converted to the MapReduce paradigm before you could get the benefit of Hadoop. Not only that, the majority of other open source projects on Hadoop also used MapReduce as the way to perform computation; for example, Hive and Pig Latin both generated MapReduce to operate on big data sets. The problem with the architecture of MapReduce was that the job output data from each step had to be stored in a distributed filesystem before the next step could begin. This meant that each iteration had to reload the data from disk, thus incurring a significant performance penalty. Furthermore, while typically designed for batch jobs, Hadoop has often been used to do exploratory analysis through SQL-like interfaces such as Pig and Hive. Each query incurs significant latency due to the initial MapReduce job setup and initial data read, which often means increased wait times for users.
Beginning of Spark: In June 2010, Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica published a paper in which they proposed a framework that could outperform Hadoop by 10 times in iterative machine learning jobs. The framework is now known as Spark. The paper aimed to solve two of the major inadequacies of the Hadoop/MapReduce framework:

Iterative jobs
Interactive analysis
The idea that you could plug the gaps of MapReduce from an iterative and interactive analysis point of view, while maintaining its scalability and resilience, meant that the platform could be used across a wide variety of use cases.
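As a flavor of what this looks like in practice, the following is a minimal Spark shell sketch of an iterative job. The input path, the parsing, and the numeric update rule are illustrative placeholders; the point is only that cache() keeps the working set in memory, so each pass avoids the per-iteration disk reload described above.

// Runs in the Spark shell, where sc (the SparkContext) is predefined.
// The path and the update rule below are illustrative placeholders.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(line => line.split(",").map(_.toDouble))
  .cache()                                  // keep the parsed data in memory

var weight = 0.0
for (_ <- 1 to 10) {
  // Each pass reads the cached RDD from memory rather than from disk.
  val gradient = points.map(p => p(0) * (p(1) - weight)).sum()
  weight += 0.01 * gradient
}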
This created huge interest in Spark, particularly from communities of users who had become frustrated with the relatively slow response of MapReduce, particularly for interactive query requests. In 2015, Spark became the most active open source project in Big Data, and gained tons of new features and improvements during the course of the year. The community grew almost 300%, with attendance at Spark Summit increasing from just 1,100 in 2014 to almost 4,000 in 2015. The number of meetup groups grew by a factor of 4, and the contributors to the project increased from just over 100 in 2013 to 600 in 2015.
Spark is today the hottest technology for big data analytics. Numerous benchmarks have confirmed that it is the fastest engine out there. If you go to any Big Data conference, be it Strata + Hadoop World or Hadoop Summit, Spark is considered to be the technology of the future.
Stack Overflow released the results of a 2016 developer survey (http://bit.ly/1MpdIlU) with responses from 56,033 engineers across 173 countries. Some of the facts related to Spark were pretty interesting: Spark was the leader in Trending Tech and the Top-Paying Tech.
Why are people so excited about Spark?

In addition to plugging MapReduce deficiencies, Spark provides three major things that make it really powerful:

A general engine with libraries for many data analysis tasks: it includes built-in libraries for streaming, SQL, machine learning, and graph processing
Access to diverse data sources: it can connect to Hadoop, Cassandra, traditional SQL databases, and cloud storage including Amazon and OpenStack
Last but not least, a simple unified API: users have to learn just one API to get the benefit of the entire framework stack, as the sketch below illustrates
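For example, the following minimal sketch shows Spark 2.0's single entry point, SparkSession, serving both data access and SQL from one API. The file name and column names are made-up placeholders.

import org.apache.spark.sql.SparkSession

// One entry point serves SQL, DataFrames, streaming, and MLlib alike.
val spark = SparkSession.builder()
  .appName("unified-api-sketch")
  .master("local[*]")                       // run locally for this sketch
  .getOrCreate()

// The file name and column names are illustrative placeholders.
val events = spark.read.json("events.json")
events.createOrReplaceTempView("events")
spark.sql("SELECT userId, COUNT(*) AS visits FROM events GROUP BY userId").show()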
We hope that this book gives you the foundation for understanding Spark as a framework, and helps you take the next step towards using it for your own implementations.
What this book covers
Chapter 1, Architecture and Installation, will help you get started on the journey of learning Spark. It will walk you through key architectural components before helping you write your first Spark application.
Chapter 2, Transformations and Actions with Spark RDDs, will help you understand the basic constructs of Spark RDDs, the difference between transformations, actions, and lazy evaluation, and how you can share data.
Chapter 3, ETL with Spark, will help you with loading data, transforming it, and saving it back to external storage systems.
Chapter 4, Spark SQL, will help you understand the intricacies of the DataFrame and Dataset APIs before a discussion of the under-the-hood power of the Catalyst optimizer and how it ensures that your client applications remain performant irrespective of your client API.
Chapter 5, Spark Streaming, will help you understand the architecture of Spark Streaming, sliding window operations, caching, persistence, checkpointing, and fault tolerance, before discussing structured streaming and how it revolutionizes stream processing.

Chapter 6, Machine Learning with Spark, is where the rubber hits the road, and where you understand the basics of machine learning before looking at the various types of machine learning and feature engineering utility functions, and finally looking at the algorithms provided by the Spark MLlib API.
Chapter 7, GraphX, will help you understand the importance of graphs in today's world, before covering terminology such as vertex, edge, and motif. We will then look at some of the graph algorithms in GraphX and also talk about GraphFrames.
Chapter 8, Operating in Clustered Mode, helps the user understand how Spark can be deployed in standalone mode, or with YARN or Mesos.
Chapter 9, Building a Recommendation System, will help the user understand the intricacies of a recommendation system before building one with an ALS model.
Chapter 10, Customer Churn Prediction, will help the user understand the importance of churn prediction before using a random forest classifier to predict churn on a telecommunications dataset.
Appendix, There's More with Spark, is where we cover topics around performance tuning, sizing your executors, and security, before walking the user through setting up PySpark with a Jupyter Notebook.
What you need for this book
You will need Spark 2.0, which you can download from the Apache Spark website. We have used a few different configurations, but you can essentially run most of these examples inside a virtual machine with 4-8 GB of RAM and 10 GB of available disk space.
Who this book is for
This book is for people who have heard of Spark and want to understand more. This is a beginner-level book for people who want to have some hands-on exercise with the fastest growing open source project. This book provides ample reading and links to exciting YouTube videos for additional exploration of the topics.
Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."
A block of code is set apart from the surrounding text and shown in a fixed-width font.
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."
Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Apache-Spark-2. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.