Scala and Spark for Big Data Analytics
Tame big data with Scala and Apache Spark!
Md Rezaul Karim
Sridhar Alla
BIRMINGHAM - MUMBAI

Scala and Spark for Big Data Analytics
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2017
Production reference: 1210717
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-084-9
www.packtpub.com

Credits
Authors: Md Rezaul Karim, Sridhar Alla
Reviewers: Andrea Bessi, Sumit Pal
Commissioning Editor: Aaron Lazar
Acquisition Editor: Nitin Dasan
Content Development Editor: Vikas Tiwari
Technical Editor: Subhalaxmi Nadar
Copy Editor: Safis Editing
Project Coordinator: Ulhas Kambali
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Coordinator: Melwyn Dsa
Cover Work: Melwyn Dsa

About the Authors
Md Rezaul Karim is a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Aachen, Germany. He holds a BSc and an MSc in computer science. Before joining Fraunhofer FIT, he had been working as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead engineer with Samsung Electronics' distributed R&D centers in Korea, India, Vietnam, Turkey, and Bangladesh. Earlier, he worked as a research assistant in the Database Lab at Kyung Hee University, Korea, and as an R&D engineer with BMTech21 Worldwide, Korea. Even before that, he worked as a software engineer with i2SoftTechnology, Dhaka, Bangladesh.
He has more than years of experience in the area of research and development, with a solid knowledge of algorithms and data structures in C/C++, Java, Scala, R, and Python, focused on big data technologies (Spark, Kafka, DC/OS, Docker, Mesos, Zeppelin, Hadoop, and MapReduce) and deep learning technologies (TensorFlow, DeepLearning4j, and H2O-Sparkling Water). His research interests include machine learning, deep learning, semantic web, linked data, big data, and bioinformatics. He is the author of the following book titles with Packt: Large-Scale Machine Learning with Spark and Deep Learning with TensorFlow.
I am very grateful to my parents, who have always encouraged me to pursue knowledge. I also want to thank my wife Saroar, son Shadman, elder brother Mamtaz, elder sister Josna, and friends, who have endured my long monologues about the subjects in this book, and have always been encouraging and listening to me. Writing this book was made easier by the amazing efforts of the open source community and the great documentation of many projects out there related to Apache Spark and Scala. Furthermore, I would like to thank the acquisition, content development, and technical editors of Packt
(and others who were involved in this book title) for their sincere cooperation and coordination. Additionally, without the work of numerous researchers and data analytics practitioners who shared their expertise in publications, lectures, and source code, this book might not exist at all!

Sridhar Alla is a big data expert helping small and big companies solve complex problems, such as data warehousing, governance, security, real-time processing, high-frequency trading, and establishing large-scale data science practices. He is an agile practitioner as well as a certified agile DevOps practitioner and implementer. He started his career as a storage software engineer at Network Appliance, Sunnyvale, and then worked as the chief technology officer at a cyber security firm, eIQNetworks, Boston. His job profile includes the role of director of data science and engineering at Comcast, Philadelphia. He is an avid presenter at numerous Strata, Hadoop World, Spark Summit, and other conferences. He also provides onsite/online training on several technologies. He has several patents filed in the US PTO on large-scale computing and distributed systems. He holds a bachelor's degree in computer science from JNTU, Hyderabad, India, and lives with his wife in New Jersey.
Sridhar has over 18 years of experience writing code in Scala, Java, C, C++, Python, R, and Go. He also has extensive hands-on knowledge of Spark, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing, and high performance computing.
I would like to thank my wonderful wife, Rosie Sarkaria, for all the love and patience during the many months I spent writing this book, as well as for reviewing the countless edits I made. I would also like to thank my parents, Ravi and Lakshmi Alla, for all the support and encouragement they continue to bestow upon me. I am very grateful to my many friends, especially Abrar Hashmi and Christian Ludwig, who helped me bounce ideas and gain clarity on the various topics. Writing this book would not have been possible without the fantastic larger Apache community and the Databricks folks who are making Spark so powerful and elegant. Further, I would like to thank the acquisition, content development, and technical editors of Packt Publishing (and others who were involved in this book title) for their sincere cooperation and coordination.

About the Reviewers
Andre Baianov is an economist-turned-software developer, with a keen interest in data science. After a bachelor's thesis on data mining and a master's thesis on business intelligence, he started working with Scala and Apache Spark in 2015. He is currently working as a consultant for national and international clients, helping them build reactive architectures, machine learning frameworks, and functional programming backends.
To my wife: beneath our superficial differences, we share the same soul.
Sumit Pal is a published author with Apress for SQL on Big Data - Technology, Architecture and Innovations. He has more than 22 years of experience in the software industry in various roles, spanning companies from start-ups to enterprises. Sumit is an independent consultant working with big data, data visualization, and data science, and a software architect building end-to-end, data-driven analytic systems. He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team), and Verizon (big data analytics team) in a career spanning
22 years. Currently, he works for multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java, and Python. Sumit has spoken at the following big data conferences: Data Summit NY, May 2017; Big Data Symposium, Boston, May 2017; Apache Linux Foundation, May 2016, in Vancouver, Canada; and Data Center World, March 2016, in Las Vegas.

www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785280848. If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Figure 5: A sample of the bank dataset
Now, let's first load the data in the Zeppelin notebook:

val bankText = sc.textFile("/home/asif/bank/bank-full.csv")

Upon executing this line of code, create a new paragraph and name it the data ingestion paragraph:
Figure 6: Data ingestion paragraph
If you look at the preceding image carefully, you will see that the code worked and we did not need to define the Spark context. The reason is that it is already defined there as sc. You don't even need to import the Scala implicits explicitly; we will see an example of this later.

Data processing and visualization
Now, let's create a case class that will tell us how to pick the selected fields from the dataset:

case class Bank(age: Int, job: String, marital: String, education: String, balance: Integer)

Now, split each line, filter out the header (which starts with age), and map it into the Bank case class, as follows:

val bank = bankText.map(s => s.split(";")).filter(s => s.size > 5).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt))

Finally, convert it to a DataFrame and create a temporary table:

bank.toDF().createOrReplaceTempView("bank")

The following screenshot shows that all the code snippets were executed successfully without any errors:
Figure 7: Data processing paragraph
To make it more transparent, look at the status marked in green (in the top-right corner of the image) after the code has been executed for each paragraph:
Figure 8: A successful execution of Spark code in each paragraph
Now, let's query the data with the following SQL command:

%sql select age, count(1) from bank where age >= 45 group by age order by age

Note that the preceding line is a pure SQL statement that selects all the customers whose age is greater than or equal to 45 and counts the number of customers in each age group (that is, the age distribution). Now let's see how the preceding SQL statement works on the temp view (that is, bank):
Figure 9: SQL query that selects the age distribution of all the customers [Tabular]
Now you can select graph options, such as histogram, pie chart, bar chart, and so on, from the tabs near the table icon (in the result section). For example, using the histogram, you can see the corresponding count for the age group >= 45:
Figure 10: SQL query that selects the age distribution of all the customers [Histogram]
This is how it looks using a pie chart:
Figure 11: SQL query that selects the age distribution of all the customers [Pie chart]
Fantastic!
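Putting the preceding snippets together, the following is a minimal, self-contained sketch of the same bank example written as a standalone Spark application rather than a Zeppelin paragraph. In Zeppelin, sc and spark are already provided; here we create them ourselves. The application name, local master, and file path are placeholder assumptions, and the column positions follow the bank-full.csv layout shown in Figure 5.

import org.apache.spark.sql.SparkSession

object BankIngestion {
  // Case class describing the fields we keep from bank-full.csv
  case class Bank(age: Int, job: String, marital: String, education: String, balance: Int)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode for experimentation; in Zeppelin this is not needed
    val spark = SparkSession.builder()
      .appName("BankIngestion")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder path: adjust to wherever bank-full.csv lives on your machine
    val bankText = spark.sparkContext.textFile("/home/asif/bank/bank-full.csv")

    // Split on ';', drop the header row, strip quotes, and map into the case class
    val bank = bankText.map(_.split(";"))
      .filter(_.length > 5)
      .filter(_(0) != "\"age\"")
      .map(s => Bank(s(0).toInt,
        s(1).replaceAll("\"", ""),
        s(2).replaceAll("\"", ""),
        s(3).replaceAll("\"", ""),
        s(5).replaceAll("\"", "").toInt))

    // Register a temporary view so that SQL queries (or %sql paragraphs) can use it
    bank.toDF().createOrReplaceTempView("bank")

    // The same age-distribution query as in the %sql paragraph above
    spark.sql("select age, count(1) as cnt from bank where age >= 45 group by age order by age").show()

    spark.stop()
  }
}

The only real difference from the Zeppelin version is the explicit SparkSession and the implicits import; the transformation logic is identical.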
We are now almost ready to tackle more complex data analytics problems using Zeppelin.

Complex data analytics with Zeppelin
In this section, we will see how to perform more complex analytics using Zeppelin. At first, we will formalize the problem, and then we will explore the dataset that will be used. Finally, we will apply some visual analytics and machine learning techniques.

The problem definition
In this section, we will build a spam classifier for classifying raw text as spam or ham. We will also show how to evaluate such a model. We will focus on using and working with the DataFrame API. In the end, the spam classifier model will help you distinguish between spam and ham messages. The following image shows a conceptual view of two messages (spam and ham, respectively):
Figure 12: Spam and ham example
We will apply some basic machine learning techniques to build and evaluate such a classifier for this kind of problem. In particular, the logistic regression algorithm will be used.

Dataset description and exploration
The spam dataset that we downloaded from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection consists of 5,564 SMS messages, which have been classified by hand as either ham or spam. Only 13.4% of these SMSes are spam. This means that the dataset is skewed and provides only a few examples of spam. This is something to keep in mind, as it can introduce bias while training models:
Figure 13: A snapshot of the SMS dataset
So, what does this data look like? As you might have seen, social media text can really get dirty, containing slang, misspelled words, missing whitespace, abbreviations such as u, urs, yrs, and so on, and, often, violations of grammar rules. It sometimes even contains trivial words in the messages. Thus, we need to take care of these issues as well. In the following steps, we will handle these issues for a better interpretation of the analytics.
Step 1. Load the required packages and APIs on Zeppelin - Let's load the required packages and APIs and create the first paragraph, before we ingest the dataset on Zeppelin:
Figure 14: Package/API load paragraph
Step 2. Load and parse the dataset - We'll use the CSV parsing library by Databricks (that is, com.databricks.spark.csv) to read the data into a DataFrame:
Figure 15: Data ingestion/load paragraph
Step 3. Using StringIndexer to create numeric labels - Since the labels in the original DataFrame are categorical, we will have to convert them to numeric values so that we can feed them into the machine learning models:
Figure 16: The StringIndexer paragraph; the output shows the raw labels, original texts, and corresponding labels
Step 4. Using RegexTokenizer to create a bag of words - We'll use RegexTokenizer to remove unwanted words and create a bag of words:
Figure 17: The RegexTokenizer paragraph; the output shows the raw labels, original texts, corresponding labels, and tokens
Step 5. Removing stop words and creating a filtered DataFrame - We'll remove stop words and create a filtered DataFrame for visual analytics. Finally, we show the DataFrame:
Figure 18: The StopWordsRemover paragraph; the output shows the raw labels, original texts, corresponding labels, tokens, and filtered tokens without the stop words
Step 6. Finding spam messages/words and their frequency - Let's try to create a DataFrame containing only the spam words, along with their respective frequencies, to understand the context of the messages in the dataset. We can create a paragraph on Zeppelin (a consolidated code sketch of Steps 1 to 6 follows the figure):
Figure 19: Spam tokens with a frequency paragraph
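Since the code for Steps 1 to 6 appears only as screenshots (Figures 14 to 19), here is a minimal sketch of what those steps might look like in a Zeppelin Scala paragraph. The file path and the label and text column names are assumptions, the sketch uses Spark's built-in CSV reader rather than the com.databricks.spark.csv package mentioned above, and it assumes the UCI file is tab-separated; in Zeppelin, spark and sc are already provided.

// Steps 1-2: imports and loading the dataset (path and layout are assumptions)
import org.apache.spark.ml.feature.{StringIndexer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.functions._

val smsDF = spark.read
  .option("header", "false")
  .option("delimiter", "\t")           // assumed layout: label<TAB>text
  .csv("/home/asif/data/SMSSpamCollection")
  .toDF("label", "text")

// Step 3: StringIndexer turns the categorical labels (ham/spam) into numeric ones
val indexed = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_index")
  .fit(smsDF)
  .transform(smsDF)

// Step 4: RegexTokenizer splits each message into lowercase word tokens
val tokenized = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\W+")                  // split on non-word characters
  .transform(indexed)

// Step 5: StopWordsRemover drops common words such as "the" and "and"
val filtered = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")
  .transform(tokenized)

// Step 6: keep only the spam rows, explode the tokens, and count their frequencies
val spamTokens = filtered
  .filter(col("label") === "spam")
  .select(explode(col("filtered")).as("token"))
  .groupBy("token")
  .count()

spamTokens.createOrReplaceTempView("spam_tokens")   // so that %sql paragraphs can query it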
Now, let's see them in a graph using SQL queries. The following query selects all the tokens with a frequency of more than 100. Then, we sort the tokens in descending order of frequency. Finally, we use dynamic forms to limit the number of records. The first one is just a raw tabular format:
Figure 20: Spam tokens with a frequency visualization paragraph [Tabular]
Then, we'll use a bar chart, which provides more visual insight. We can now see that the most frequent words in the spam messages are call and free, with frequencies of 355 and 224, respectively:
Figure 21: Spam tokens with a frequency visualization paragraph [Histogram]
Finally, using a pie chart provides much better and wider visibility, especially if you specify the column range:
Figure 22: Spam tokens with a frequency visualization paragraph [Pie chart]
Step 7. Using HashingTF for term frequency - Use HashingTF to generate the term frequency of each filtered token, as follows:
Figure 23: The HashingTF paragraph; the output shows the raw labels, original texts, corresponding labels, tokens, filtered tokens, and the corresponding term frequency for each row
Step 8. Using IDF for term frequency-inverse document frequency (TF-IDF) - TF-IDF is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus:
Figure 24: The IDF paragraph; the output shows the raw labels, original texts, corresponding labels, tokens, filtered tokens, term frequency, and the corresponding IDFs for each row
Bag of words: The bag of words assigns a value of 1 for every occurrence of a word in a sentence. This is probably not ideal, as each category of sentence most likely has the same frequency of the, and, and other words, whereas words such as viagra and sale probably should have increased importance in figuring out whether or not the text is spam.
TF-IDF: This is the acronym for Term Frequency-Inverse Document Frequency. This term is essentially the product of the term frequency and the inverse document frequency for each word. It is commonly used in the bag of words methodology in NLP or text analytics.
Using TF-IDF: Let's take a look at word frequency. Here, we consider the frequency of a word in an individual entry, that is, a term. The purpose of calculating term frequency (TF) is to find terms that appear to be important in each entry. However, words such as the and and may appear very frequently in every entry. We want to downweight the importance of such words, so we can imagine that multiplying the TF by the inverse of the whole document frequency might help find important words. However, since a collection of texts (a corpus) may be quite large, it is common to take the logarithm of the inverse document frequency. In short, we can imagine that high values of TF-IDF might indicate words that are very important to determining what a document is about. Creating the TF-IDF vectors requires us to load all the text into memory and count the occurrences of each word before we can start training our model.
Step 9. Using VectorAssembler to generate raw features for the Spark ML pipeline - As you saw in the previous step, we have only the filtered tokens, labels, TF, and IDF. However, there are no associated features that can be fed into any ML model. Thus, we need to use the Spark VectorAssembler API to create features based on the properties in the previous DataFrame, as follows (a code sketch for Steps 7 to 9 follows the figure):
Figure 25: The VectorAssembler paragraph, which shows using VectorAssembler for feature creation
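Continuing the earlier sketch, Steps 7 to 9 might look as follows, assuming the filtered DataFrame and the label_index column produced above. The output column names and the number of hash buckets are illustrative choices, not the exact values used in the screenshots.

import org.apache.spark.ml.feature.{HashingTF, IDF, VectorAssembler}

// Step 7: HashingTF maps each bag of filtered tokens to a fixed-length term-frequency vector
val hashingTF = new HashingTF()
  .setInputCol("filtered")
  .setOutputCol("tf")
  .setNumFeatures(1024)                 // assumption: 1024 hash buckets

val tfDF = hashingTF.transform(filtered)

// Step 8: IDF rescales the raw term frequencies by how rare each term is across all messages
val idfModel = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")
  .fit(tfDF)

val tfidfDF = idfModel.transform(tfDF)

// Step 9: VectorAssembler packs the TF-IDF column into the single "features" column
// that Spark ML estimators expect
val assembler = new VectorAssembler()
  .setInputCols(Array("tfidf"))
  .setOutputCol("features")

val featurized = assembler.transform(tfidfDF)
  .select("label_index", "features")    // keep only what the classifier needs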
Step 10. Preparing the training and test set - Now we need to prepare the training and test set. The training set will be used to train the logistic regression model in Step 11, and the test set will be used to evaluate the model in Step 12. Here, I use 75% of the data for training and 25% for testing. You can adjust these ratios accordingly:
Figure 26: Preparing training/test set paragraph
Step 11. Training the binary logistic regression model - Since the problem itself is a binary classification problem, we can use a binary logistic regression classifier, as follows:
Figure 27: The LogisticRegression paragraph, which shows how to train the logistic regression classifier with the necessary labels, features, regression parameter, elastic net parameter, and maximum number of iterations
Note that, here, for better results, we have iterated the training 200 times. We have set the regression parameter and the elastic net parameter to a very low value, that is, 0.0001, to make the training more intensive.
Step 12. Model evaluation - Let's compute the raw predictions for the test set. Then, we evaluate the raw predictions using the binary classification evaluator, as follows (a code sketch for Steps 10 to 12 follows Figure 30):
Figure 28: Model evaluator paragraph
Now let's compute the accuracy of the model on the test set, as follows:
Figure 29: Accuracy calculation paragraph
This is pretty impressive. However, if you were to go for model tuning using cross-validation, for example, you could gain even higher accuracy. Finally, we will compute the confusion matrix to get more insight:
Figure 30: The confusion matrix paragraph, which shows the number of correct and incorrect predictions summarized with count values and broken down by each class
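As a final piece of the sketch, Steps 10 to 12 might look as follows, assuming the featurized DataFrame from Step 9. The split seed is arbitrary, the 200 iterations and the 0.0001 regularization and elastic net values mirror the settings described above, and MulticlassClassificationEvaluator is used for accuracy as one reasonable choice rather than the exact code shown in the screenshots.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}

// Step 10: 75%/25% random split of the featurized DataFrame
val Array(training, test) = featurized.randomSplit(Array(0.75, 0.25), seed = 12345L)

// Step 11: binary logistic regression with the settings described above
val lr = new LogisticRegression()
  .setLabelCol("label_index")
  .setFeaturesCol("features")
  .setMaxIter(200)            // iterate the training 200 times
  .setRegParam(0.0001)        // very low regularization parameter
  .setElasticNetParam(0.0001)

val lrModel = lr.fit(training)

// Step 12: raw predictions on the test set and evaluation
val predictions = lrModel.transform(test)

val binaryEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label_index")
  .setRawPredictionCol("rawPrediction")   // area under the ROC curve by default

println(s"Area under ROC = ${binaryEvaluator.evaluate(predictions)}")

val accuracyEvaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label_index")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

println(s"Accuracy = ${accuracyEvaluator.evaluate(predictions)}")

// A simple confusion matrix: predicted versus actual label counts
predictions.groupBy("label_index", "prediction").count().show()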
Data and result collaboration
Furthermore, Apache Zeppelin provides a feature for publishing your notebook paragraph results. Using this feature, you can show Zeppelin notebook paragraph results on your own website. It's very straightforward; you just embed the paragraph link in your page. If you want to share the link of your Zeppelin notebook, the first step to publishing your paragraph result is to copy the paragraph link. After running a paragraph in your Zeppelin notebook, click the gear button located on the right-hand side. Then, click Link this paragraph in the menu, as shown in the following image:
Figure 31: Linking the paragraph
Then, just copy the provided link, as shown here:
Figure 32: Getting the link for paragraph sharing with collaborators
Now, if you want to publish the copied paragraph, you can embed the link on your website, for example, inside an iframe. Then you can show off your beautiful visualization results on your website. This is more or less the end of our data analytics journey with Apache Zeppelin. For more information and related updates, you should visit the official website of Apache Zeppelin at https://zeppelin.apache.org/; you can even subscribe to the Zeppelin users mailing list at users-subscribe@zeppelin.apache.org.

Summary
Apache Zeppelin is a web-based notebook that enables you to do data analytics in an interactive way. Using Zeppelin, you can make beautiful data-driven, interactive, and collaborative documents with SQL, Scala, and more. It is gaining popularity by the day, since more features are being added in recent releases. However, due to page limitations, and to keep you focused on using Spark, we have shown examples that use only Spark with Scala. However, you can write your Spark code in Python and test your notebook with similar ease.
In this chapter, we discussed how to use Apache Zeppelin for large-scale data analytics using Spark as the backend interpreter. We saw how to install and get started with Zeppelin. We then saw how to ingest data, and how to parse and analyze it for better visibility. Then, we saw how to visualize it for better insights. Finally, we saw how to share the Zeppelin notebook with collaborators.