Learning apache spark 2

Apache Spark cluster manager typesBuilding standalone applications with Apache SparkSubmitting applications Deployment strategies Running Spark examples Building your own programs Parall

Trang 2

Learning Apache Spark 2

Trang 3

Table of Contents

Learning Apache Spark 2

Credits

About the Author

About the Reviewers

What you need for this book

Who this book is for

1 Architecture and Installation

Apache Spark architecture overview

Installing Apache Spark

Writing your first Spark program

Scala shell examples

Python shell examples

Trang 4

Apache Spark cluster manager types

Building standalone applications with Apache SparkSubmitting applications

Deployment strategies

Running Spark examples

Building your own programs

Parallelizing existing collections

Referencing external data source

Static singleton functions

Passing functions to Spark (Java)

Passing functions to Spark (Python)

Trang 5

Set operations in Spark

Transformations on two PairRDDs

Actions available on PairRDDs

How is Spark being used?

Commonly Supported File Formats

Text Files

Trang 6

Working with Amazon S3

Structured Data sources and Databases

Working with NoSQL Databases

Working with Cassandra

Obtaining a Cassandra table as an RDDSaving data to Cassandra

Working with HBase

Bulk Delete exampleMap Partition ExampleWorking with MongoDB

Connection to MongoDBWriting to MongoDBLoading data from MongoDBWorking with Apache Solr

Importing the JAR File via Spark-shellConnecting to Solr via DataFrame APIConnecting to Solr via RDD

References

Summary

4 Spark SQL

What is Spark SQL?

What is DataFrame API?

What is DataSet API?

What's new in Spark 2.0?

Under the hood - catalyst optimizer

Solution 1

Solution 2

The Sparksession

Trang 7

Steps involved in a streaming app

Architecture of Spark Streaming

Setting up checkpointing with Scala

Setting up checkpointing with Java

Setting up checkpointing with Python

Trang 8

Automatic driver restartDStream best practices

Fault tolerance

Worker failure impact on receivers

Worker failure impact on RDDs/DStreamsWorker failure impact on output operationsWhat is Structured Streaming?

Under the hood

Structured Spark Streaming API :Entry pointOutput modes

Append modeComplete modeUpdate modeOutput sinks

Failure recovery and checkpointing

References

Summary

6 Machine Learning with Spark

What is machine learning?

Why machine learning?

Types of machine learning

Introduction to Spark MLLib

Why do we need the Pipeline API?

How does it work?

Scala syntax - building a pipeline

Building a pipeline

Predictions on test documents

Python program - predictions on test documentsFeature engineering

Feature extraction algorithms

Feature transformation algorithms

Feature selection algorithms

Classification and regression

Classification

Regression

Clustering

Trang 9

Basic graph operators (RDD API)

List of graph operators (RDD API)

Caching and uncaching of graphs

Graph algorithms in GraphX

Loading and saving of GraphFrames

Comparison between GraphFrames and GraphX

GraphX <=> GraphFrames

Converting from GraphFrame to GraphX

Converting from GraphX to GraphFrames

References

Summary

8 Operating in Clustered Mode

Clusters, nodes and daemons

Key bits about Spark Architecture

Running Spark in standalone mode

Trang 10

Installing Spark standalone on a cluster

Starting a Spark cluster manually

Cluster overview

Workers overview

Running applications and drivers overview

Completed applications and drivers overview

Using the Cluster Launch Scripts to Start a Standalone ClusterEnvironment Properties

Connecting Spark-Shell, PySpark, and R-Shell to the clusterResource scheduling

Running Spark in YARN

Spark with a Hadoop Distribution (Cloudera)

Interactive Shell

Batch Application

Important YARN Configuration Parameters

Running Spark in Mesos

Before you start

Steps to use the cluster mode

Mesos run modes

Key Spark on Mesos configuration properties

References:

Summary

9 Building a Recommendation System

What is a recommendation system?

Types of recommendations

Manual recommendations

Simple aggregated recommendations based on PopularityUser-specific recommendations

User specific recommendations

Key issues with recommendation systems

Trang 11

Gathering known input data

Predicting unknown from known ratings

Content-based recommendations

Predicting unknown ratingsPros and cons of content based recommendationsCollaborative filtering

Jaccard similarityCosine similarityCentered cosine (Pearson Correlation)Latent factor methods

Evaluating prediction methodRecommendation system in Spark

Sample dataset

How does Spark offer recommendation?

Importing relevant libraries

Defining the schema for ratings

Defining the schema for movies

Loading ratings and movies data

Data partitioning

Training an ALS model

Predicting the test dataset

Evaluating model performance

Using implicit preferencesSanity checking

Model Deployment

References

Summary

10 Customer Churn Prediction

Overview of customer churn

Why is predicting customer churn important?

How do we predict customer churn with Spark?

Data set description

Code example

Defining schema

Loading data

Data exploration

Trang 12

PySpark import code

Exploring international minutes

Exploring night minutes

Exploring day minutes

Exploring eve minutes

Comparing minutes data for churners and non-churnersComparing charge data for churners and non-churners

Exploring customer service calls

Scala code - constructing a scatter plot

Exploring the churn variable

Execution and storage

Tasks running in parallel

Operators within the same task

Memory management configuration options

Memory tuning key tips

I/O tuning

Data locality

Sizing up your executors

Calculating memory overhead

Setting aside memory/CPU for YARN application masterI/O throughput

Sample calculations

The skew problem

Security configuration in Spark

Kerberos authentication

Shared secrets

Shared secret on YARN

Shared secret on other cluster managers

Trang 13

Setting up Jupyter Notebook with Spark

What is a Jupyter Notebook?

Setting up a Jupyter Notebook

Securing the notebook server

Preparing a hashed password

Using Jupyter (only with version 5.0 and later)Manually creating hashed password

Setting up PySpark on Jupyter

Trang 14

Learning Apache Spark 2

Trang 15

retrieval system, or transmitted in any form or by any means, without theprior written permission of the publisher, except in the case of brief

quotations embedded in critical articles or reviews

Every effort has been made in the preparation of this book to ensure the

accuracy of the information presented However, the information contained inthis book is sold without warranty, either express or implied Neither theauthor, nor Packt Publishing, and its dealers and distributors will be heldliable for any damages caused or alleged to be caused directly or indirectly bythis book

Packt Publishing has endeavored to provide trademark information about all

of the companies and products mentioned in this book by the appropriate use

of capitals However, Packt Publishing cannot guarantee the accuracy of thisinformation

First published: March 2017

Trang 16

www.packtpub.com

Trang 17

Tejal Daruwale Soni

Content Development Editor

Trang 18

About the Author

Muhammad Asif Abbasi has worked in the industry for over 15 years in a

variety of roles from engineering solutions to selling solutions and everything

in between Asif is currently working with SAS a market leader in AnalyticSolutions as a Principal Business Solutions Manager for the Global

Technologies Practice Based in London, Asif has vast experience in

consulting for major organizations and industries across the globe, and

running proof-of-concepts across various industries including but not limited

to telecommunications, manufacturing, retail, finance, services, utilities andgovernment Asif is an Oracle Certified Java EE 5 Enterprise architect,

Teradata Certified Master, PMP, Hortonworks Hadoop Certified developer,and administrator Asif also holds a Master's degree in Computer Science andBusiness Administration

Trang 19

About the Reviewers

Prashant Verma started his IT carrier in 2011 as a Java developer in

Ericsson working in telecom domain After couple of years of JAVA EEexperience, he moved into Big Data domain, and has worked on almost allthe popular big data technologies, such as Hadoop, Spark, Flume, Mongo,Cassandra,etc He has also played with Scala Currently, He works with QAInfotech as Lead Data Enginner, working on solving e-Learning problemsusing analytics and machine learning

Prashant has also worked on Apache Spark for Java Developers, Packt as a

Technical Reviewer

I want to thank Packt Publishing for giving me the chance to review the book as well as my employer and my family for their patience while I was busy working on this book.

Trang 20

At www.PacktPub.com, you can also read a collection of free technical

articles, sign up for a range of free newsletters and receive exclusive

discounts and offers on Packt books and eBooks

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt Mapt gives you full

access to all Packt books and video courses, as well as industry-leading tools

to help you plan your personal development and advance your career

Trang 22

Customer Feedback

Thanks for purchasing this Packt book At Packt, quality is at the heart of oureditorial process To help us improve, please leave us an honest review at thewebsite where you acquired this product

If you'd like to join our team of regular reviewers, you can email us at

customerreviews@packtpub.com We award our regular reviewers with freeeBooks and videos in exchange for their valuable feedback Help us be

relentless in improving our products!

Trang 23

This book will cover the technical aspects of Apache Spark 2.0, one of thefastest growing open-source projects In order to understand what ApacheSpark is, we will quickly recap a the history of Big Data, and what has madeApache Spark popular Irrespective of your expertise level, we suggest goingthrough this introduction as it will help set the context of the book

Trang 24

The Past

Before going into the present-day Spark, it might be worthwhile

understanding what problems Spark intend to solve, and especially the datamovement Without knowing the background we will not be able to predictthe future

"You have to learn the past to predict the future."

Late 1990s: The world was a much simpler place to live, with proprietary

databases being the sole choice of consumers Data was growing at quite anamazing pace, and some of the biggest databases boasted of maintainingdatasets in excess of a Terabyte

Early 2000s: The dotcom bubble happened, meant companies started going

online, and likes of Amazon and eBay leading the revolution Some of thedotcom start-ups failed, while others succeeded The commonality among thebusiness models started was a razor-sharp focus on page views, and

everything started getting focused on the number of users A lot of marketingbudget was spent on getting people online This meant more customer

behavior data in the form of weblogs Since the defacto storage was an MPPdatabase, and the value of such weblogs was unknown, more often than notthese weblogs were stuffed into archive storage or deleted

2002: In search for a better search engine, Doug Cutting and Mike Cafarella

started work on an open source project called Nutch, the objective of whichwas to be a web scale crawler Web-Scale was defined as billions of webpages and Doug and Mike were able to index hundreds of millions of web-pages, running on a handful of nodes and had a knack of falling down

2004-2006: Google published a paper on the Google File System (GFS)

(2003) and MapReduce (2004) demonstrating the backbone of their searchengine being resilient to failures, and almost linearly scalable Doug Cuttingtook particular interest in this development as he could see that GFS andMapReduce papers directly addressed Nutch’s shortcomings Doug Cuttingadded Map Reduce implementation to Nutch which ran on 20 nodes, and was

Trang 25

much easier to program Of course we are talking in comparative terms here.

2006-2008: Cutting went to work with Yahoo in 2006 who had lost the

search crown to Google and were equally impressed by the GFS and

MapReduce papers The storage and processing parts of Nutch were spun out

to form a separate project named Hadoop under AFS where as Nutch webcrawler remained a separate project Hadoop became a top-level Apacheproject in 2008 On February 19, 2008 Yahoo announced that its search index

is run on a 10000 node Hadoop cluster (truly an amazing feat)

We haven't forget about the proprietary database vendors the majority ofthem didn’t expect Hadoop to change anything for them, as database vendorstypically focused on relational data, which was smaller in volumes but higher

in value I was talking to a CTO of a major database vendor (will remainunnamed), and discussing this new and upcoming popular elephant (Hadoop

of course! Thanks to Doug Cutting’s son for choosing a sane name I mean hecould have chosen anything else, and you know how kids name things thesedays ) The CTO was quite adamant that the real value is in the relationaldata, which was the bread and butter of his company, and despite that factthat the relational data had huge volumes, it had less of a business value Thiswas more of a 80-20 rule for data, where from a size perspective unstructureddata was 4 times the size of structured data (80-20), whereas the same

structured data had 4 times the value of unstructured data I would say thatthe relational database vendors massively underestimated the value of

unstructured data back then

Anyways, back to Hadoop: So, after the announcement by Yahoo, a lot of

companies wanted to get a piece of the action They realised something bigwas about to happen in the dataspace Lots of interesting use cases started toappear in the Hadoop space, and the defacto compute engine on Hadoop,MapReduce wasn’t able to meet all those expectations

The MapReduce Conundrum: The original Hadoop comprised primarily

HDFS and Map-Reduce as a compute engine The original use case of webscale search meant that the architecture was primarily aimed at long-runningbatch jobs (typically single-pass jobs without iterations), like the original use

Trang 26

case of indexing web pages The core requirement of such a framework wasscalability and fault-tolerance, as you don’t want to restart a job that had beenrunning for 3 days, having completed 95% of its work Furthermore, the

objective of MapReduce was to target acyclic data flows

A typical MapReduce program is composed of a Map() operation and

optionally a Reduce() operation, and any workload had to be converted to theMapReduce paradigm before you could get the benefit of Hadoop Not onlythat majority of other open source projects on Hadoop also used MapReduce

as a way to perform computation For example: Hive and Pig Latin both

generated MapReduce to operate on Big Data sets The problem with thearchitecture of MapReduce was that the job output data from each step had to

be store in a distributed system before the next step could begin This meantthat each iteration had to reload the data from the disk thus incurring a

significant performance penalty Furthermore, while typically design, forbatch jobs, Hadoop has often been used to do exploratory analysis throughSQL-like interfaces such as Pig and Hive Each query incurs significant

latency due to initial MapReduce job setup, and initial data read which oftenmeans increased wait times for the users

Beginning of Spark: In June of 2011, Matei Zaharia, Mosharaf Chowdhury,

Michael J Franklin, Scott Shenker and Ion Stoica published a paper in which

they proposed a framework that could outperform Hadoop 10 times in

iterative machine learning jobs The framework is now known as Spark Thepaper aimed to solve two of the major inadequacies of the Hadoop/MR

framework:

Iterative jobs

Interactive analysis

The idea that you can plug the gaps of map-reduce from an iterative and

interactive analysis point of view, while maintaining its scalability and

resilience meant that the platform could be used across a wide variety of usecases

This created huge interest in Spark, particularly from communities of userswho had become frustrated with the relatively slow response from

Trang 27

MapReduce, particularly for interactive queries requests Spark in 2015

became the most active open source project in Big Data, and had tons of newfeatures of improvements during the course of the project The communitygrew almost 300%, with attendances at Spark-Summit increasing from just1,100 in 2014 to almost 4,000 in 2015 The number of meetup groups grew

by a factor of 4, and the contributors to the project increased from just over a

100 in 2013 to 600 in 2015

Spark is today the hottest technology for big data analytics Numerous

benchmarks have confirmed that it is the fastest engine out there If you go toany Big data conference be it Strata + Hadoop World or Hadoop Summit,Spark is considered to be the technology for future

Stack Overflow released the results of a 2016 developer survey

(http://bit.ly/1MpdIlU) with responses from 56,033 engineers across 173countries Some of the facts related to Spark were pretty interesting Spark

was the leader in Trending Tech and the Top-Paying Tech.

Trang 28

Why are people so excited about Spark?

In addition to plugging MapReduce deficiencies, Spark provides three majorthings that make it really powerful:

General engine with libraries for many data analysis tasks - includesbuilt-in libraries for Streaming, SQL, machine learning and graph

processing

Access to diverse data sources, means it can connect to Hadoop,

Cassandra, traditional SQL databases, and Cloud Storage includingAmazon and OpenStack

Last but not the least, Spark provides a simple unified API that meansusers have to learn just one API to get the benefit of the entire

framework stack

We hope that this book gives you the foundation of understanding Spark as aframework, and helps you take the next step towards using it for your

implementations

Trang 29

What this book covers

Chapter 1, Architecture and Installation, will help you get started on the

journey of learning Spark This will walk you through key architectural

components before helping you write your first Spark application

Chapter 2, Transformations and Actions with Spark RDDs, will help you

understand the basic constructs as Spark RDDs and help you understand thedifference between transformations, actions, and lazy evaluation, and howyou can share data

Chapter 3, ELT with Spark, will help you with data loading, transformation,

and saving it back to external storage systems

Chapter 4, Spark SQL, will help you understand the intricacies of the

DataFrame and Dataset API before a discussion of the under-the-hood power

of the Catalyst optimizer and how it ensures that your client applicationsremain performant irrespective of your client AP

Chapter 5, Spark Streaming, will help you understand the architecture of

Spark Streaming, sliding window operations, caching, persistence, pointing, fault-tolerance before discussing structured streaming and how itrevolutionizes Stream processing

check-Chapter 6, Machine Learning with Spark, is where the rubber hits the road,

and where you understand the basics of machine learning before looking atthe various types of machine learning, and feature engineering utility

functions, and finally looking at the algorithms provided by Spark MLlibAPI

Chapter 7, GraphX, will help you understand the importance of Graph in

today’s world, before understanding terminology such vertex, edge, Motifetc We will then look at some of the graph algorithms in GraphX and alsotalk about GraphFrames

Chapter 8, Operating in Clustered mode, helps the user understand how

Trang 30

Spark can be deployed as standalone, or with YARN or Mesos.

Chapter 9, Building a Recommendation system, will help the user understand

the intricacies of a recommendation system before building one with an ALSmodel

Chapter 10, Customer Churn Predicting, will help the user understand the

importance of Churn prediction before using a random forest classifier topredict churn on a telecommunication dataset

Appendix, There's More with Spark, is where we cover the topics around

performance tuning, sizing your executors, and security before walking theuser through setting up PySpark with Jupyter notebook

Trang 31

What you need for this book

You will need Spark 2.0, which you can download from Apache Sparkwebsite We have used few different configurations, but you can essentiallyrun most of these examples inside a virtual machine with 4-8GB of RAM,and 10 GB of available disk space

Trang 32

Who this book is for

This book is for people who have heard of Spark, and want to understandmore This is a beginner-level book for people who want to have some hands-

on exercise with the fastest growing open source project This book providesample reading and links to exciting YouTube videos for additional

exploration of the topics

Trang 33

In this book, you will find a number of text styles that distinguish betweendifferent kinds of information Here are some examples of these styles and anexplanation of their meaning

Code words in text, database table names, folder names, filenames, file

extensions, pathnames, dummy URLs, user input, and Twitter handles areshown as follows: "We can include other contexts through the use of theinclude directive."

A block of code is set as follows:

New terms and important words are shown in bold Words that you see on

the screen, for example, in menus or dialog boxes, appear in the text like this:

"Clicking the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this

Trang 34

Tips and tricks appear like this

Trang 35

Reader feedback

Feedback from our readers is always welcome Let us know what you thinkabout this book-what you liked or disliked Reader feedback is important for

us as it helps us develop titles that you will really get the most out of To send

us general feedback, simply e-mail feedback@packtpub.com, and mentionthe book's title in the subject of your message If there is a topic that you haveexpertise in and you are interested in either writing or contributing to a book,see our author guide at www.packtpub.com/authors

Trang 36

Customer support

Now that you are the proud owner of a Packt book, we have a number ofthings to help you to get the most from your purchase

Trang 37

Downloading the example code

You can download the example code files for this book from your account at

http://www.packtpub.com If you purchased this book elsewhere, you canvisit http://www.packtpub.com/support and register to have the files e-maileddirectly to you

You can download the code files by following these steps:

1 Log in or register to our website using your e-mail address and

password

2 Hover the mouse pointer on the SUPPORT tab at the top.

3 Click on Code Downloads & Errata.

4 Enter the name of the book in the Search box.

5 Select the book for which you're looking to download the code files

6 Choose from the drop-down menu where you purchased this book from

7 Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract thefolder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at

https://github.com/PacktPublishing/Learning-Apache-Spark-2 We also haveother code bundles from our rich catalog of books and videos available at

https://github.com/PacktPublishing/ Check them out!

Trang 38

Although we have taken every care to ensure the accuracy of our content,mistakes do happen If you find a mistake in one of our books-maybe a

mistake in the text or the code-we would be grateful if you could report this

to us By doing so, you can save other readers from frustration and help usimprove subsequent versions of this book If you find any errata, please

report them by visiting http://www.packtpub.com/submit-errata, selecting

your book, clicking on the Errata Submission Form link, and entering the

details of your errata Once your errata are verified, your submission will beaccepted and the errata will be uploaded to our website or added to any list ofexisting errata under the Errata section of that title

To view the previously submitted errata, go to

https://www.packtpub.com/books/content/support and enter the name of thebook in the search field The required information will appear under the

Errata section.

Trang 39

Piracy of copyrighted material on the Internet is an ongoing problem acrossall media At Packt, we take the protection of our copyright and licenses veryseriously If you come across any illegal copies of our works in any form onthe Internet, please provide us with the location address or website nameimmediately so that we can pursue a remedy

Please contact us at copyright@packtpub.com with a link to the suspectedpirated material

We appreciate your help in protecting our authors and our ability to bring youvaluable content

Trang 40

If you have a problem with any aspect of this book, you can contact us

at questions@packtpub.com, and we will do our best to address the problem

Định dạng
Số trang	502
Dung lượng	17,46 MB