Apache Spark Machine Learning Blueprints


Apache Spark Machine Learning Blueprints
Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide

Alex Liu

BIRMINGHAM - MUMBAI

Copyright © 2016 Packt Publishing
First published: May 2016 (production reference: 1250516)
Published by Packt Publishing Ltd, Livery Place, 35 Livery Street, Birmingham B3 2PB, UK
ISBN 978-1-78588-039-1
www.packtpub.com

Contents

Preface (ix)
Chapter 1: Spark for Machine Learning (1)
    Spark overview and Spark advantages; Spark computing for machine learning; machine learning algorithms; MLlib and other ML libraries; Spark RDD and dataframes; the dataframes API for R; ML frameworks, RM4Es, and the Spark computing framework; ML workflows and Spark pipelines; Spark notebooks; Summary
Chapter 2: Data Preparation for Spark ML (25)
    Accessing and loading datasets; exploring and visualizing datasets; data cleaning; identity matching and entity resolution; crowdsourced deduplication; dataset reorganizing; dataset joining; feature extraction; repeatability and automation; Summary
Chapter 3: A Holistic View on Spark (53)
    The use case; methods for a holistic view (regression modeling, the SEM approach, decision trees); feature preparation (PCA, grouping by category, feature selection); model estimation; model evaluation (RMSE, ROC curves); results explanation; deployment (dashboard, rules); Summary
Chapter 4: Fraud Detection on Spark (73)
    The use case; distributed computing; methods for fraud detection (random forest, decision trees); feature preparation (feature extraction from LogFile, data merging); model estimation; model evaluation (confusion matrix, false positive ratios); results explanation; deployment (rules, scoring); Summary
Chapter 5: Risk Scoring on Spark (87)
    The use case; Apache Spark notebooks; methods of risk scoring (logistic regression, random forest, decision trees); data and feature preparation (OpenRefine); model estimation (Data Scientist Workbench, R notebooks); model evaluation (confusion matrix, ROC, Kolmogorov-Smirnov); results explanation; deployment (scoring); Summary
Chapter 6: Churn Prediction on Spark (103)
    The use case; Spark computing; methods for churn prediction (regression models, decision trees and random forest); feature preparation; model estimation with MLlib; model evaluation; results explanation (impact of interventions); deployment (scoring, intervention recommendations); Summary
Chapter 7: Recommendations on Spark (117)
    The use case; SPSS on Spark; methods for recommendation (collaborative filtering); data treatment with SPSS; model estimation (SPSS Analytics Server); model evaluation; recommendation deployment; Summary
Chapter 8: Learning Analytics on Spark (133)
    The use case; methods of attrition prediction (regression models, decision trees); feature preparation (PCA, ML feature selection); model estimation with the Zeppelin notebook; model evaluation (confusion matrix, error ratios); results explanation (impact of interventions and main causes); deployment (rules, scoring); Summary
Chapter 9: City Analytics on Spark (153)
    The use case; methods of service forecasting (regression models, time series modeling); data and feature preparation; model estimation (Zeppelin and R notebooks); model evaluation (RMSE with MLlib and R); results explanation (biggest influencers, visualizing trends, rules for alerts, scores to rank city zones); Summary
Chapter 10: Learning Telco Data on Spark (177)
    The use case; methods for learning from Telco Data (descriptive statistics and visualization, linear and logistic regression, decision tree and random forest); data and feature development; model estimation (SPSS Analytics Server); model evaluation (RMSE, confusion matrix and error ratios); results explanation (descriptive statistics, biggest influencers, trends); model deployment (alert rules, churn and purchase propensity scores); Summary
Chapter 11: Modeling Open Data on Spark (201)
    The use case; methods for scoring and ranking (cluster analysis, principal component analysis, regression models, score resembling); data and feature preparation; model estimation (SPSS Analytics Server); model evaluation (RMSE with MLlib and R); results explanation (comparing ranks, biggest influencers); deployment (alert rules, scores for ranking school districts); Summary
Index (225)

Modeling Open Data on Spark

As an example, with R, we have obtained a PCA plot as follows (plot not reproduced in this extract).

Model evaluation

In the previous section, we completed our model estimation as well as some exploratory work. Now it is time for us to evaluate these estimated models to
see if they fit our criteria, so that we can either move on to the next stage, results explanation, or go back to a previous stage to refine our predictive models.

In this section, we have conducted evaluations for cluster analysis and also for PCA. However, our focus is still on assessing the predictive models: the regression models with rankings as our target variables. For this task, we will mainly use the Root Mean Square Error (RMSE) to assess our models, as it is well suited to assessing regression models.

Just as we did for model estimation, to calculate RMSEs we use MLlib for regression modeling on Spark. At the same time, we will also use R notebooks implemented in the Databricks environment for Spark. Of course, we also used the SPSS Analytics Server, as we have adopted a dynamic approach here.

RMSE calculations with MLlib

As has worked well in the past, for MLlib we can use the following code to calculate the RMSE:

    val valuesAndPreds = test.map { point =>
      val prediction = new_model.predict(point.features)
      (point.label, prediction)
    }
    val residuals = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }
    val MSE = residuals.mean()
    val RMSE = math.pow(MSE, 0.5)

Besides the preceding code, MLlib also has functions in the RegressionMetrics and RankingMetrics classes that we can use for the RMSE calculation.

RMSE calculations with R

In R, the forecast package has an accuracy function that can be used to calculate forecasting accuracy, including RMSEs:

    accuracy(f, x, test=NULL, d=NULL, D=NULL)

The measures calculated include the following:

• ME (Mean Error)
• RMSE (Root Mean Squared Error)
• MAE (Mean Absolute Error)
• MPE (Mean Percentage Error)
• MAPE (Mean Absolute Percentage Error)
• MASE (Mean Absolute Scaled Error)
• ACF1 (Autocorrelation of errors at lag 1)

To perform a complete evaluation, we calculated RMSEs for all the models we estimated. Then, we compared them and picked up the
ones with smaller RMSEs.

Results explanation

Per the RM4Es framework used in this book, after we have passed the model evaluation stage and selected the estimated and evaluated models as our final models, our next task for this project is to interpret the results to our clients.

In terms of explaining the machine learning results, the users of our project are particularly interested in understanding what influences the known rankings that are widely used. They are also interested in how the new rankings differ from others and how the new rankings can be used. We will work on their requests, but will not cover all of them, as the purpose here is mainly to exhibit technologies. Also, for confidentiality reasons and for space limitations, we will not go into too much detail, but will focus on utilizing our technologies for better explanations.

Overall, the interpretation here is straightforward and includes the following three tasks:

• Present a list of top-ranked schools and school districts
• Compare various lists
• Explain the impact of factors such as parent involvement and the economy on the rankings

One of the main achievements of this project is that we obtained a better and more accurate ranking with our ensemble methods and good analytics, but it is very challenging to explain this to the users, and doing so is beyond the scope of this book. Another big improvement achieved here is the capability to quickly produce rankings per various requirements, such as ranking by academic performance, by future employment, or by graduation rate, which is interesting to users but seems likely to take time for adoption. However, users understand the benefits of fast-produced rankings, as made possible by Apache Spark.

As a result, we delivered a few lists, and reported on ranking comparisons and on the factors influencing rankings.

Comparing ranks

R has some packages that help us analyze and
compare rankings, such as pmr and RMallow. However, for this project, the users preferred a simple comparison, such as a direct comparison of the top 10 schools and the top 10 school districts, which made our explanation a little easier.

Another part of the explanatory work is to compare our list to others, such as the one at http://www.school-ratings.com/schoolRatings.php?zipOrCity=91030, the one provided by the LA Times at http://schools.latimes.com/, or the one by SchoolIE at http://www.schoolie.com/, which claims to use Big Data to evaluate schools from many perspectives rather than from a single angle. As a result, we found ours to be closest to the one created by SchoolIE.

R also has algorithms to compute similarity or distance between rankings, which we explored but did not use to serve the clients. This is because we adopted the simple-comparison approach that our clients preferred, and it is still very effective.

Biggest influencers

As people are interested in why some schools are on top and other schools are not, our results about the biggest predictors are of great interest. For this part, we use results from our estimated predictive regression models, for which we used our own rankings as the target variable, as well as some well-known rankings such as those provided by the US News and World Report and those by some state organizations. For this task, we simply used the coefficients in our linear regression models to tell us which factor has a bigger impact.

We also used random forests to rank features by their impact on moving schools into the top 100. In other words, we split the list into "top 100" and "the rest."
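The binarize-then-rank idea just described can be sketched in a few lines. The sketch below uses entirely invented feature names and synthetic data, and substitutes a crude group-separation score for the random forest feature importance the project actually used (R's randomForest / MLlib), purely to make the mechanics concrete:

```python
# Illustrative sketch of the "top 100 vs. the rest" feature-ranking idea.
# All names and data are hypothetical; the separation score below is a
# simple stand-in for random forest feature importance.
import random

random.seed(0)

FEATURES = ["econ_status", "parent_involvement", "college_connections", "tech_use"]

def make_school(rank):
    # Fake data: better-ranked schools score higher on the first three
    # features, while tech_use is unrelated to rank (as in the findings).
    top = rank <= 100
    return {
        "rank": rank,
        "econ_status": random.gauss(1.0 if top else 0.0, 0.5),
        "parent_involvement": random.gauss(0.8 if top else 0.0, 0.5),
        "college_connections": random.gauss(0.5 if top else 0.0, 0.5),
        "tech_use": random.gauss(0.0, 0.5),
    }

schools = [make_school(r) for r in range(1, 501)]

# Step 1: binarize the target into "top 100" vs. "the rest".
labels = [1 if s["rank"] <= 100 else 0 for s in schools]

def mean(xs):
    return sum(xs) / len(xs)

# Step 2: score each feature by how far apart its two group means are,
# normalized by the feature's overall range (a crude importance proxy).
def separation(feature):
    top_vals = [s[feature] for s, y in zip(schools, labels) if y == 1]
    rest_vals = [s[feature] for s, y in zip(schools, labels) if y == 0]
    all_vals = [s[feature] for s in schools]
    value_range = (max(all_vals) - min(all_vals)) or 1.0
    return abs(mean(top_vals) - mean(rest_vals)) / value_range

ranked = sorted(FEATURES, key=separation, reverse=True)
print(ranked)  # rank-related features first; tech_use last
```

A real pipeline would replace the separation score with a fitted random forest's feature importances, but the surrounding steps, binarizing the ranking and sorting features by an importance measure, are the same.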
We then ran decision tree and random forest models on it, and used the random forest's feature importance to obtain a list of features ordered by their impact on the target variable of whether a school is in the top 100. In R, this means using the importance function in the randomForest package.

Per our results, the economic status of the community, parents' involvement, and college connections are among the factors with the biggest impact for some coastal schools. However, technology use has not had as much impact as expected.

Deployment

In the past, rankings were mostly used as a reference by users. With this project, we found we are also in a position to assist our users in integrating our results with their decision-making tools, to help them utilize rankings better and make their lives easier. For this, producing rules from rankings and making the scores behind the rankings easily accessible became very important.

For these reasons, our deployment goal is to develop a set of rules and to make all the scores available to decision makers, which include schools and some parents. Specifically, the main task is to send out an alert when a ranking changes dramatically, especially when a ranking drops sharply. Users of this project also need all the scores and rankings for their performance management. Another purpose of this project is to produce good predictive models with which users can forecast possible changes of school rankings as populations change, using our developed regression models.

All three needs, for rankings, scores, and forecasting, are of value to various kinds of users who rely on various software systems for decision making. So, we need a bridge such as the Predictive Model Markup Language (PMML), which has been adopted as a standard by many systems. As discussed before, MLlib
supports model export to PMML. Therefore, we exported some developed models to PMML for this project.

In practice, the users of this project are more interested in rule-based decision making, to apply some of our insights, and in score-based decision making, to evaluate their regional units' performance. Specifically, for this project, the client is interested in applying our results to (1) decide when an alert should be sent out if rankings have changed or ranking changes are likely to occur in the future, for which rules should be established, and (2) develop scores and use these scores to measure performance as well as to plan for the future.

Besides this, clients are also interested in forecasting attendance and other requests as rankings change, for which R has a package called forecast that is ready to be used for this purpose:

    forecast(fit)
    plot(forecast(fit))

To sum up, for our special tasks, we need to turn some of our results into rules and also produce some performance scores for the client.

Rules for sending out alerts

As discussed earlier, for R results, there are several tools to help extract rules from developed predictive models. For a decision tree model developed to predict whether a service request level exceeds a certain threshold, we can use the rpart.utils R package, which can extract rules and export them in various formats, such as RODBC. For example, rpart.rules.table(model1) returns an unpivoted table of variable values (factor levels) associated with each branch.

However, for this project, partially due to the data incompleteness issue, we need to use some of our insights to derive rules directly. That is, we need to use the insights discussed in the last section. For example, we can do the following:

• If big mobility occurred and parents' involvement also dropped, our prediction shows that rankings will go down dramatically, so an alert will be sent out

From an analytical perspective, we face the
same issue here: to minimize false alerts while ensuring adequate warning. Therefore, taking advantage of Spark's fast computing, we carefully produced rules, and for each rule we supplied false positive ratios that helped the client utilize the rules.

Scores for ranking school districts

With our regression modeling in place, we have two ways to forecast the ranking change at a specific time. One is to use the estimated regression equations to forecast directly. Alternatively, we can use the following code:

    forecast(fit, newdata=data.frame(City=30))

Once we have obtained the scores, we can classify all the districts or schools into several categories and illustrate them on a map to identify special zones for attention, such as in the maps produced by R (graphics not reproduced in this extract).

Summary

The work presented in this chapter is a further extension of Chapter 10, Learning Telco Data on Spark, as well as Chapter 9, City Analytics on Spark. It is a very special extension of Chapter 9, City Analytics on Spark, as both chapters use open datasets. It is also an extension of Chapter 10, Learning Telco Data on Spark, as both chapters take a dynamic approach so that readers can take advantage of all the learned techniques to achieve better machine learning results and develop the best analytical solutions. Therefore, this chapter may be used as a review chapter, as well as a special chapter for you to synthesize all the knowledge learned.

In this chapter, with a real-life project of learning from open data, we repeated the same step-by-step RM4Es process used in the previous chapters, with which we processed open data on Apache Spark and then selected models (equations). For each model, we estimated its coefficients and then evaluated the models against model performance metrics. Finally, with the models estimated, we explained our results in detail. With this real-life project of learning from open data, we further demonstrated the
benefit of utilizing our RM4Es framework to organize our machine learning processes.

A little different from the previous chapters, we first selected a dynamic machine learning approach with cluster analysis and PCA plus regression. Then, we prepared Spark computing and loaded in the preprocessed data. Second, we worked on data and feature preparation using cleaned open datasets, by reorganizing a few datasets together and by selecting a core set of features. Especially when dealing with open datasets, a lot more work is needed to clean and reorganize the data, as demonstrated here, which is a key lesson for anyone using open data. Third, we developed measurements and estimated predictive model coefficients using MLlib, R, and SPSS on Spark. Fourth, we evaluated these estimated models, mainly using RMSEs. Then, we interpreted our machine learning results with lists and ranking comparisons, as well as the biggest predictors for top rankings. Finally, we deployed our machine learning results with a focus on scoring and rules.

The preceding process is similar to the process described in the previous chapters. However, in this chapter, we focused our effort on a dynamic approach, which gives you opportunities to combine what you have learned so far for the best analytical results. Especially for this project, we explored the datasets and built several measurements and rankings of districts, with which we then developed rules for alerts and scores for performance, to help schools and parents with their decision making and performance management.

After reading this chapter, you should be ready to utilize Apache Spark for dynamic machine learning so that you can quickly develop actionable insights from a large amount of open data. By now, you will have mastered our process, our framework, and various approaches. Rather than being limited by any of them, you will be able to take full advantage of all of them or any combination
of them for optimal machine learning results.

Posted: 04/03/2019, 16:41
