Practical Data Science with Hadoop and Spark®

Mendelevitch_Book.indb i 11/16/16 6:39 PM

The Addison-Wesley Data and Analytics Series

Visit informit.com/awdataseries for a complete list of available publications.

The Addison-Wesley Data and Analytics Series provides readers with practical knowledge for solving problems and answering questions with data. Titles in this series primarily focus on three areas:

Infrastructure: how to store, move, and manage data
Algorithms: how to mine intelligence or make predictions based on data
Visualizations: how to represent data and insights in a meaningful and compelling way

The series aims to tie all three of these areas together to help the reader build end-to-end systems for fighting spam; making recommendations; building personalization; detecting trends, patterns, or problems; and gaining insight from the data exhaust of systems and user interactions.

Make sure to connect with us! informit.com/socialconnect

Practical Data Science with Hadoop and Spark®
Designing and Building Effective Analytics at Scale

Ofer Mendelevitch
Casey Stella
Douglas Eadline

Boston • Columbus • Indianapolis • New York • San Francisco • Amsterdam • Cape Town • Dubai • London • Madrid • Milan • Munich • Paris • Montreal • Toronto • Delhi • Mexico City • São Paulo • Sydney • Hong Kong • Seoul • Singapore • Taipei • Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or
arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please contact intlcs@pearson.com.

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2016955465

Copyright © 2017 Pearson Education, Inc.

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

ISBN-13: 978-0-13-402414-1
ISBN-10: 0-13-402414-1

Contents

Foreword
Preface
Acknowledgments
About the Authors

I Data Science with Hadoop—An Overview
  Introduction to Data Science
    What Is Data Science?
      Example: Search Advertising
    A Bit of Data Science History
      Statistics and Machine Learning
      Innovation from Internet Giants
      Data Science in the Modern Enterprise
    Becoming a Data Scientist
      The Data Engineer
      The Applied Scientist
      Transitioning to a Data Scientist Role
      Soft Skills of a Data Scientist
    Building a Data Science Team
    The Data Science Project Life Cycle
      Ask the Right Question
      Data Acquisition
      Data Cleaning: Taking Care of Data Quality
      Explore the Data and Design Model Features
      Building and Tuning the Model
      Deploy to Production
    Managing a Data Science Project
    Summary
  Use Cases for Data Science
    Big Data—A Driver of Change
      Volume: More Data Is Now Available
      Variety: More Data Types
      Velocity: Fast Data Ingest
    Business Use Cases
      Product Recommendation
      Customer Churn Analysis
      Customer Segmentation
      Sales Leads Prioritization
      Sentiment Analysis
      Fraud Detection
      Predictive Maintenance
      Market Basket Analysis
      Predictive Medical Diagnosis
      Predicting Patient Re-admission
      Detecting Anomalous Record Access
      Insurance Risk Analysis
      Predicting Oil and Gas Well Production Levels
    Summary
  Hadoop and Data Science
    What Is Hadoop?
      Distributed File System
      Resource Manager and Scheduler
      Distributed Data Processing Frameworks
    Hadoop’s Evolution
    Hadoop Tools for Data Science
      Apache Sqoop
      Apache Flume
      Apache Hive
      Apache Pig
      Apache Spark
      R
      Python
      Java Machine Learning Packages
    Why Hadoop Is Useful to Data Scientists
      Cost Effective Storage
      Schema on Read
      Unstructured and Semi-Structured Data
      Multi-Language Tooling
      Robust Scheduling and Resource Management
      Levels of Distributed Systems Abstractions
      Scalable Creation of Models
      Scalable Application of Models
    Summary

II Preparing and Visualizing Data with Hadoop
  Getting Data into Hadoop
    Hadoop as a Data Lake
    The Hadoop Distributed File System (HDFS)
    Direct File Transfer to Hadoop HDFS
    Importing Data from Files into Hive Tables
      Import CSV Files into Hive Tables
    Importing Data into Hive Tables Using Spark
      Import CSV Files into HIVE Using Spark
      Import a JSON File into HIVE Using Spark
    Using Apache Sqoop to Acquire Relational Data
      Data Import and Export with Sqoop
      Apache Sqoop Version Changes
      Using Sqoop V2: A Basic Example
    Using Apache Flume to Acquire Data Streams
      Using Flume: A Web Log Example Overview
    Manage Hadoop Work and Data Flows with Apache Oozie
    Apache Falcon
    What’s Next in Data Ingestion?
    Summary
  Data Munging with Hadoop
    Why Hadoop for Data Munging?
    Data Quality
      What Is Data Quality?
      Dealing with Data Quality Issues
      Using Hadoop for Data Quality
    The Feature Matrix
      Choosing the “Right” Features
      Sampling: Choosing Instances
      Generating Features
      Text Features
      Time-Series Features
      Features from Complex Data Types
      Feature Manipulation
      Dimensionality Reduction
    Summary
  Exploring and Visualizing Data
    Why Visualize Data?
      Motivating Example: Visualizing Network Throughput
      Visualizing the Breakthrough That Never Happened
    Creating Visualizations
      Comparison Charts
      Composition Charts
      Distribution Charts
      Relationship Charts
    Using Visualization for Data Science
    Popular Visualization Tools
      R
      Python: Matplotlib, Seaborn, and Others
      SAS
      Matlab
      Julia
      Other Visualization Tools
    Visualizing Big Data with Hadoop
    Summary

III Applying Data Modeling with Hadoop
  Machine Learning with Hadoop
    Overview of Machine Learning
    Terminology
    Task Types in Machine Learning
    Big Data and Machine Learning
    Tools for Machine Learning
    The Future of Machine Learning and Artificial Intelligence
    Summary
  Predictive Modeling
    Overview of Predictive Modeling
    Classification Versus Regression
    Evaluating Predictive Models
      Evaluating Classifiers
      Evaluating Regression Models
      Cross Validation
    Supervised Learning Algorithms
    Building Big Data Predictive Model Solutions
      Model Training
      Batch Prediction
      Real-Time Prediction
    Example: Sentiment Analysis
      Tweets Dataset
      Data Preparation
      Feature Generation
      Building a Classifier
    Summary
  Clustering
    Overview of Clustering
    Uses of Clustering
    Designing a Similarity Measure
      Distance Functions
      Similarity Functions
    Clustering Algorithms
    Example: Clustering Algorithms
      k-means Clustering
      Latent Dirichlet Allocation
    Evaluating the Clusters and Choosing the Number of Clusters
    Building Big Data Clustering Solutions
    Example: Topic Modeling with Latent Dirichlet Allocation
      Feature Generation
      Running Latent Dirichlet Allocation
    Summary

... (Administrators)
C Additional Background on Data Science and Apache Hadoop and Spark
  General Hadoop/Spark Information
  Hadoop/Spark Installation Recipes
  HDFS
  MapReduce
  Spark
...

What data science is and the history of its evolution
The journey to becoming a data scientist
Building a data science team
The data science project life cycle
Managing data science projects
Data