Practical Machine Learning with H2O
Powerful, Scalable Techniques for Deep Learning and AI
Darren Cook

Practical Machine Learning with H2O
by Darren Cook

Copyright © 2017 Darren Cook. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Kim Cofer
Proofreader: Charles Roumeliotis
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

December 2016: First Edition

Revision History for the First Edition
2016-12-01: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491964606 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Practical Machine Learning with H2O, the cover image of a crayfish, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96460-6
[LSI]

Preface

It feels like machine learning has finally come of age. It has been a long childhood, stretching back to the 1950s and the first program to learn from experience (playing checkers), as well as the first neural networks. We've been told so many times by AI researchers that the breakthrough is "just around the corner" that we long ago stopped listening. But maybe they were on the right track all along; maybe an idea just needs one more order of magnitude of processing power, or a slight algorithmic tweak, to go from being pathetic and pointless to productive and profitable.

In the early '90s, neural nets were being hailed as the new AI breakthrough. I did some experiments applying them to computer go, but they were truly awful when compared to the (still quite mediocre) results I could get using a mix of domain-specific knowledge engineering and heavily pruned tree searches. And the ability to scale looked poor, too. When, 20 years later, I heard talk of this new and shiny deep learning thing that was giving impressive results in computer go, I was confused how this was different from the neural nets I'd rejected all those years earlier. "Not that much" was the answer; sometimes you just need more processing power (five or six orders of magnitude, in this case) for an algorithm to bear fruit.

H2O is software for machine learning and data analysis. Wanting to see what other magic deep learning could perform was what personally led me to H2O (though it does more than that: trees, linear models, unsupervised learning, etc.), and I was immediately impressed. It ticks all the boxes:
- Open source (the liberal Apache license)
- Easy to use
- Scalable to big data
- Well-documented and commercially supported
- On its third version (i.e., a mature architecture)
- Wide range of OS/language support

With the high-quality team that H2O.ai (the company behind H2O) has put together, it is only going to get better. An attitude of not just "How do we get this to work?" but "How do we get this to work efficiently at big data scale?" permeates the whole development. If machine learning has come of age, H2O looks to be not just an economical family car for it, but simultaneously the large-load delivery truck for it.

Stretching my vehicle analogy a bit further, this book will show you not just what the dashboard controls do, but also the best way to use them to get from A to B. It will be as practical as possible, with only the bare minimum of explanation of the maths or theory behind the learning algorithms.

Of course H2O is not perfect; here are a few issues I've noticed people mutter about. There is no GPU support (which could make deep learning, in particular, quicker).[1] The cluster support is all 'bout that bass (big data), no treble (complex but relatively small data), so for the latter you may be limited to a single, fast machine with lots of cores. There is also no high availability (HA) for clusters. H2O compiles to Java; it is well optimized, and the H2O algorithms are known for their speed, but, theoretically at least, carefully optimized C++ could be quicker. There is no SVM algorithm. Finally, it tries to support numerous platforms, so each has some rough edges, and development is sometimes slowed by trying to keep them all in sync. In other words, and wringing the last bit of life out of my car analogy: a Formula 1 car might beat it on the straights, and it isn't yet available in yellow.

Who Uses It and Why?
A number of well-known companies are using H2O for their big data processing, and the website claims that over 5,000 organizations currently use it. The company behind it, H2O.ai, has over 80 staff, more than half of whom are developers. But those are stats to impress your boss, not a no-nonsense developer. For R and Python developers, who already feel they have all the machine learning libraries they need, the primary things H2O brings are ease of use and efficient scalability to data sets too large to fit in the memory of your largest machine. For SparkML users, who feel they already have that, H2O's algorithms are fewer in number but apparently significantly quicker. As a bonus, the intelligent defaults mean your code is very compact and clear to read: you can literally get a well-tuned, state-of-the-art deep learning model as a one-liner (see the Python sketch at the end of this preface). One of the goals of this book was to show you how to tune the models, but as we will see, sometimes I've just had to give up and say I can't beat the defaults.

About You

To bring this book in at under a thousand pages, I've taken some liberties. I am assuming you know either R or Python. Advanced language features are not used, so competence in any programming language should be enough to follow along, but the examples throughout the book are only in one of those two languages. Python users would benefit from being familiar with pandas, not least because it will make all your data science easier. I'm also assuming a bit of mental flexibility: to save repeating every example twice, I'm hoping R users can grasp what is going on in a Python example, and Python users can grasp an R example. These slides on Python for R users are a good start (for R users too).

Some experience with manipulating data is assumed, even if just using spreadsheet software or SQL tables. And I assume you have a fair idea of what machine learning and AI are, and how they are being used more and more in the infrastructure that runs our society. Maybe you are reading this book because you want to be part of that, and to make sure the transformations to come are done ethically and for the good of everyone, whatever their race, sex, nationality, or beliefs. If so, I salute you.

I am also assuming you know a bit of statistics. Nothing too scary—this book takes the "Practical" in the title seriously, and the theory behind the machine-learning algorithms is kept to the minimum needed to know how to tune them (as opposed to being able to implement them from scratch). Use Wikipedia or a search engine when you crave more. But you should know your mean from your median from your mode, and know what a standard deviation and the normal distribution are. More than that, I am hoping you know that statistics can mislead and machine learning can overfit; that you appreciate that when someone says an experiment is significant to p = 0.05, it means that out of every 20 such experiments you read about, probably one of them is wrong. A good moment to enjoy Significant, on xkcd.

This might also be a good time to mention "my machine," which I sometimes reference for timings. It is a mid-level notebook, a couple of years old, with 8GB of memory, four real cores, and eight hyper-threads. This is capable of running everything in the book; in fact, 4GB of system memory should be enough. However, for some of the grid searches (described in Chapter 5) I "cheated" and started up a cluster in the cloud (covered, albeit briefly, in "Clusters" in Chapter 10). I did this just out of practicality: not wanting to wait 24 hours for an experiment to finish before I could write about it.
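Here is the promised Python sketch of that "one-liner" claim. Treat it as illustrative only: the public Iris CSV URL and the column choices are my assumptions, not something this preface prescribes.

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()  # start, or connect to, a local H2O instance

    # Illustrative data: the Iris set, from H2O's public test-data bucket
    iris = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
    train, test = iris.split_frame([0.8])

    # The "one-liner": every setting left at H2O's intelligent defaults
    m = H2ODeepLearningEstimator()
    m.train(x=iris.names[:-1], y="class", training_frame=train)

    print(m.model_performance(test))

In R, once the data is loaded, the model itself really is one line: something like m <- h2o.deeplearning(x, y, train).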
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
    This element signifies a tip or suggestion.

NOTE
    This element signifies a general note.

WARNING
    This element indicates a warning or caution.
About the Author

Darren Cook has over 20 years of experience as a software developer, data analyst, and technical director, working on everything from financial trading systems to NLP, data visualization tools, and PR websites for some of the world's largest brands. He is skilled in a wide range of computer languages, including R, C++, PHP, JavaScript, and Python. He works at QQ Trend, a financial data analysis and data products company.

Colophon

The animal on the cover of Practical Machine Learning with H2O is a crayfish, a small lobster-like crustacean found in freshwater habitats throughout the world. Alternate names include crawfish, crawdad, and mudbug, depending on the region. There are over 500 species of crayfish, more than half of which occur in North America. There is great variation in size, shape, and color across species. Crayfish are typically 3 to 4 inches long in North America, while certain species in Australia grow to a staggering 15 inches and can weigh as much as 8 pounds.

Like crabs and other crustaceans, crayfish shed their hard outer shells periodically, eating them to recoup calcium. They are nocturnal creatures, possessing keen eyesight as well as the ability to move their eyes in different directions at once. Crayfish have eight pairs of legs, four of which are used for walking. The other legs are used for swimming backward, a maneuver that allows the crayfish to dart quickly through the water. Lost limbs can be regenerated, a capability that comes in handy during the competitive (and often aggressive) mating season. Crayfish are opportunistic omnivores that consume almost anything, including plants, clams, snails, insects, and dead organic matter. Their own predators include fish (they are widely regarded as a tackle box staple), otters, birds, and humans. More than 100 million pounds of crawfish are produced each year in Louisiana, where the crawfish was adopted as the state's official crustacean in 1983.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from Treasury of Animal Illustrations by Dover. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.

...

Example 1-1. Deep learning on the Iris data set, in Python

    train, test = data.split_frame([0.8])
    m = h2o.estimators.deeplearning.H2ODeepLearningEstimator()
    m.train(x, y, train)
    p = m.predict(test)

Example 1-2. Deep learning on the Iris data set, in R

    library(h2o)
    h2o.init(nthreads = -1)
    ...

...give it exactly 4GB of your memory, but only two of your eight cores? First shut down H2O with h2o.shutdown(), then type h2o.init(nthreads = 2, max_mem_size = 4). The following excerpt from the information table confirms that it worked:

    H2O cluster total free memory: 3.56 GB
    H2O cluster total cores:
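For readers following along in Python rather than R, here is a rough sketch of that shutdown-and-restart dance. It is an assumption-laden sketch: the prompt=False flag and the "4G" string form of max_mem_size reflect the Python API as I understand it, and are not quoted from the excerpt above.

    import h2o

    h2o.init()                  # connect to the running local instance
    h2o.shutdown(prompt=False)  # stop it, freeing its cores and memory

    # Restart with exactly two threads and a 4GB heap.
    # (You may need a brief pause before re-initializing.)
    h2o.init(nthreads=2, max_mem_size="4G")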