Packt, Practical Machine Learning, January 2016, ISBN 178439968X

Practical Machine Learning Table of Contents Practical Machine Learning Credits Foreword About the Author Acknowledgments About the Reviewers www.PacktPub.com Support files, eBooks, discount offers, and more Why subscribe? Free access for Packt account holders Preface What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support Downloading the example code Downloading the color images of this book Errata Piracy Questions Introduction to Machine learning Machine learning Definition Core Concepts and Terminology What is learning? Data Labeled and unlabeled data Tasks Algorithms Models Logical models Geometric models Probabilistic models Data and inconsistencies in Machine learning Under-fitting Over-fitting Data instability Unpredictable data formats Practical Machine learning examples Types of learning problems Classification Clustering Forecasting, prediction or regression Simulation Optimization Supervised learning Unsupervised learning Semi-supervised learning Reinforcement learning Deep learning Performance measures Is the solution good? 
Mean squared error (MSE) Mean absolute error (MAE) Normalized MSE and MAE (NMSE and NMAE) Solving the errors: bias and variance Some complementing fields of Machine learning Data mining Artificial intelligence (AI) Statistical learning Data science Machine learning process lifecycle and solution architecture Machine learning algorithms Decision tree based algorithms Bayesian method based algorithms Kernel method based algorithms Clustering methods Artificial neural networks (ANN) Dimensionality reduction Ensemble methods Instance based learning algorithms Regression analysis based algorithms Association rule based learning algorithms Machine learning tools and frameworks Summary Machine learning and Large-scale datasets Big data and the context of large-scale Machine learning Functional versus Structural – A methodological mismatch Commoditizing information Theoretical limitations of RDBMS Scaling-up versus Scaling-out storage Distributed and parallel computing strategies Machine learning: Scalability and Performance Too many data points or instances Too many attributes or features Shrinking response time windows – need for real-time responses Highly complex algorithm Feed forward, iterative prediction cycles Model selection process Potential issues in large-scale Machine learning Algorithms and Concurrency Developing concurrent algorithms Technology and implementation options for scaling-up Machine learning MapReduce programming paradigm High Performance Computing (HPC) with Message Passing Interface (MPI) Language Integrated Queries (LINQ) framework Manipulating datasets with LINQ Graphics Processing Unit (GPU) Field Programmable Gate Array (FPGA) Multicore or multiprocessor systems Summary An Introduction to Hadoop’s Architecture and Ecosystem Introduction to Apache Hadoop Evolution of Hadoop (the platform of choice) Hadoop and its core elements Machine learning solution architecture for big data (employing Hadoop) The Data Source layer The Ingestion layer The 
Hadoop Storage layer The Hadoop (Physical) Infrastructure layer – supporting appliance Hadoop platform / Processing layer The Analytics layer The Consumption layer Explaining and exploring data with Visualizations Security and Monitoring layer Hadoop core components framework Hadoop Distributed File System (HDFS) Secondary Namenode and Checkpoint process Splitting large data files Block loading to the cluster and replication Writing to and reading from HDFS Handling failures HDFS command line RESTFul HDFS MapReduce MapReduce architecture What makes MapReduce cater to the needs of large datasets? MapReduce execution flow and components Developing MapReduce components InputFormat OutputFormat Mapper implementation Hadoop 2.x Hadoop ecosystem components Hadoop installation and setup Installing Jdk 1.7 Creating a system user for Hadoop (dedicated) Disable IPv6 Steps for installing Hadoop 2.6.0 Starting Hadoop Hadoop distributions and vendors Summary Machine Learning Tools, Libraries, and Frameworks Machine learning tools – A landscape Apache Mahout How does Mahout work? 
Installing and setting up Apache Mahout Setting up Maven Setting-up Apache Mahout using Eclipse IDE Setting up Apache Mahout without Eclipse Mahout Packages Implementing vectors in Mahout R Installing and setting up R Integrating R with Apache Hadoop Approach 1 – Using R and Streaming APIs in Hadoop Approach 2 – Using the Rhipe package of R Approach 3 – Using RHadoop Summary of R/Hadoop integration approaches Implementing in R (using examples) R Expressions Assignments Functions R Vectors Assigning, accessing, and manipulating vectors R Matrices R Factors R Data Frames R Statistical frameworks Julia Installing and setting up Julia Downloading and using the command line version of Julia Using Juno IDE for running Julia Using Julia via the browser Running the Julia code from the command line Implementing in Julia (with examples) Using variables and assignments Numeric primitives Data structures Working with Strings and String manipulations Packages Interoperability Integrating with C Integrating with Python Integrating with MATLAB Graphics and plotting Benefits of adopting Julia Integrating Julia and Hadoop Python Toolkit options in Python Implementation of Python (using examples) Installing Python and setting up scikit-learn Loading data Apache Spark Scala Programming with Resilient Distributed Datasets (RDD) Spring XD Summary Decision Tree based learning Decision trees Terminology Purpose and uses Constructing a Decision tree Handling missing values Considerations for constructing Decision trees Choosing the appropriate attribute(s) Information gain and Entropy Gini index Gain ratio Termination Criteria / Pruning Decision trees Decision trees in a graphical representation Inducing Decision trees – Decision tree algorithms CART simulation / Simulation optimization / Optimization supervised learning / Supervised learning unsupervised learning / Unsupervised learning semi-supervised learning / Semi-supervised learning reinforcement learning / Reinforcement learning 
deep learning / Deep learning process lifecycle, Machine learning / Machine learning process lifecycle and solution architecture Producer/Consumer Model about / Distributed and parallel computing strategies Protocol Buffer URL / Approach 2 – Using the Rhipe package of R PyBrain / Implementation of Python (using examples) Pydoop about / Implementation of Python (using examples) PyML / Implementation of Python (using examples) Python about / Python toolkit options / Toolkit options in Python implementing / Implementation of Python (using examples) installing / Installing Python and setting up scikit-learn Python (scikit-learn) used, for implementing decision trees / Using Python (scikit-learn) used, for implementing KNN / Using Python (scikit-learn) used, for implementing Support Vector Machines (SVM) / Using Python (Scikitlearn) used, for implementing Apriori and FP-growth / Using Python (Scikit-learn) used, for implementing k-means clustering / Using Python (scikit-learn) used, for implementing Deep learning methods / Using Python (Scikit-learn) used, for implementing ANNs / Using Python (Scikit-learn) used, for implementing ensemble methods / Using Python (Scikit-learn) Q Q-Learning technique about / Q-Learning – off-Policy TD Quadratic Discriminant Analysis (QDA) / C4.5 quartiles about / Revisiting statistics QUEST / C4.5 R R about / R capabilities / R installing / Installing and setting up R setting up / Installing and setting up R used, for implementing decision trees / Using R used, for implementing KNN / Using R used, for implementing Support Vector Machines (SVM) / Using R used, for implementing Apriori and FP-growth / Using R used, for implementing k-means clustering / Using R used, for implementing Naïve Bayes algorithm / Using R used, for implementing logistic regression / Using R used, for implementing linear regression / Using R used, for implementing Deep learning methods / Using R used, for implementing ANNs / Using R used, for implementing ensemble 
methods / Using R R, integrating with Apache Hadoop about / Integrating R with Apache Hadoop R and Streaming APIs, using in Hadoop / Approach 1 – Using R and Streaming APIs in Hadoop Rhipe package, using of R / Approach 2 – Using the Rhipe package of R RHadoop, using / Approach 3 – Using RHadoop R / Hadoop integration approaches pros / Summary of R/Hadoop integration approaches cons / Summary of R/Hadoop integration approaches Radial Bias Function (RBF) networks / Radial Bias Function (RBF) networks random access sparse vectors / Implementing vectors in Mahout random forests about / Random forests, Random forests randomness about / Important terms and definitions range about / Revisiting statistics R Data Frames about / R Data Frames RDBMS theoretical limitations / Theoretical limitations of RDBMS rdfs package / Approach 3 – Using RHadoop recommendation systems / Recommendation systems rectified linear neurons / Rectified linear neurons / linear threshold neurons Recurrent Neural Networks (RNNs) / Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs) Reducer job / MapReduce architecture reference reward about / Reinforcement comparison methods regression analysis about / Regression analysis statistics, revisiting / Revisiting statistics regression analysis based algorithms about / Regression analysis based algorithms regression methods about / Regression methods key assumptions / Regression methods simple regression / Simple regression or simple linear regression simple linear regression / Simple regression or simple linear regression multiple regression / Multiple regression polynomial (non-linear) regression / Polynomial (non-linear) regression Generalized Linear Models (GLM) / Generalized Linear Models (GLM) logistic regression (logit link) / Logistic regression (logit link) Poisson regression / Poisson regression Reinforcement Comparison methods about / Reinforcement comparison methods Reinforcement Learning (RL) about / Reinforcement Learning 
(RL) context / The context of Reinforcement Learning terms / The context of Reinforcement Learning examples / Examples of Reinforcement Learning evaluative feedback / Evaluative Feedback Markov Decision Process (MDP) / Markov Decision Process (MDP) Delayed Rewards / Delayed rewards optimal policy / The policy key features / Reinforcement Learning – key features solution methods / Reinforcement learning solution methods Reinforcement Learning (RL) problem world grid example / The Reinforcement Learning problem – the world grid example Remote Procedure Calls (RPC) / Hadoop ecosystem components Resilient Distributed Dataset (RDD) / Apache Spark Resilient Distributed Datasets (RDD) programming with / Programming with Resilient Distributed Datasets (RDD) RESTFul HDFS / RESTFul HDFS reward about / The context of Reinforcement Learning R Expressions about / R Expressions assignments / Assignments functions / Functions R Factors about / R Factors rhbase package / Approach 3 – Using RHadoop R Learning (Off-policy) about / R Learning (Off-policy) R Matrices about / R Matrices rmr package / Approach 3 – Using RHadoop root mean square error (RMSE) / Mean squared error (MSE) Rote Learner / Instance-based learning (IBL) R Statistical frameworks about / R Statistical frameworks rule extraction / Forecasting, prediction or regression R Vectors about / R Vectors assigning / Assigning, accessing, and manipulating vectors accessing / Assigning, accessing, and manipulating vectors manipulating / Assigning, accessing, and manipulating vectors S 4Store / Vendors sample about / Important terms and definitions stratified sampling / Important terms and definitions sample size about / Important terms and definitions sample space probability about / Probability Sampling Bias about / Important terms and definitions Sarsa about / Sarsa - on-Policy TD Scala about / Scala examples / Scala scaling-out storage versus scaling-up storage / Scaling-up versus Scaling-out storage scikit-learn about / 
Implementation of Python (using examples) setting up / Installing Python and setting up scikit-learn used, for implementing Naïve Bayes algorithm / Using scikit-learn used, for implementing logistic regression / Using scikit-learn used, for implementing linear regression / Using scikit-learn SciPy about / Implementation of Python (using examples) semantic data architecture about / Semantic data architecture business data lake / The business data lake central data integration / Semantic Web technologies peer-to-peer / Semantic Web technologies features / Ontology and data integration vendors / Vendors Semantic Web technologies about / Semantic Web technologies ontology and data integration / Ontology and data integration semi-supervised learning about / Reinforcement Learning (RL) sequence files about / Implementing vectors in Mahout sequential access sparse vectors / Implementing vectors in Mahout Sesame / Vendors shallow learning algorithm / Background Shared Nothing Architecture (SNA) / The Hadoop (Physical) Infrastructure layer – supporting appliance Sigmoid neurons / Sigmoid neurons simple linear regression about / Simple regression or simple linear regression simple regression about / Simple regression or simple linear regression Single Instruction Multiple Data (SIMD) / Distributed and parallel computing strategies Single Instruction Single Data (SISD) / Distributed and parallel computing strategies singularity / Regression methods skewed data about / Revisiting statistics smart data / The Ingestion layer Softmax regression technique about / Softmax regression technique solution architecture, Machine learning / Machine learning process lifecycle and solution architecture solution methods, Reinforcement Learning (RL) about / Reinforcement learning solution methods Dynamic Programming (DP) / Dynamic Programming (DP) Monte Carlo methods / Monte Carlo methods temporal difference (TD) learning / Temporal difference (TD) learning Q-Learning technique / Q-Learning – 
off-Policy TD actor-critic methods (on-policy) / Actor-critic methods (on-policy) R Learning (Off-policy) / R Learning (Off-policy) Spark used, for implementing decision trees / Using Spark used, for implementing KNN / Using Spark used, for implementing Support Vector Machines (SVM) / Using Spark used, for implementing Apriori and FP-growth / Using Spark used, for implementing k-means clustering / Using Spark used, for implementing Naïve Bayes algorithm / Using Spark used, for implementing logistic regression / Using Spark used, for implementing linear regression / Using Spark used, for implementing Deep learning methods / Using Spark used, for implementing ANNs / Using Spark used, for implementing ensemble methods / Using Spark Spark SQL / Apache Spark Spark Streaming / Apache Spark sparse vectors about / Implementing vectors in Mahout random access sparse vectors / Implementing vectors in Mahout sequential access sparse vectors / Implementing vectors in Mahout specialized trees about / Specialized trees oblique trees / Oblique trees random forests / Random forests evolutionary trees / Evolutionary trees Hellinger trees / Hellinger trees Spring XD / The Analytics layer, Vendors about / Spring XD features / Spring XD Spring XD architecture, layers about / Spring XD Speed Layer / Spring XD Batch Layer / Spring XD Serving Layer / Spring XD Sqoop / Hadoop platform / Processing layer about / Hadoop ecosystem components URL / Hadoop ecosystem components SSE (Sum Squared Error) / Simple regression or simple linear regression SSL (Secure Socket Layer) / Security and Monitoring layer standard deviation about / Important terms and definitions Stardog / Vendors state about / The context of Reinforcement Learning statistical learning versus Machine learning / Statistical learning statisticians objective / Statistician’s thinking stochastic binary neurons / Stochastic binary neurons stratified sampling about / Important terms and definitions stream mining / Stream mining or 
classification String manipulations, Julia working with / Working with Strings and String manipulations Strings, Julia working with / Working with Strings and String manipulations sum of squared error of prediction (SSE) / Convergence or stopping criteria for the kmeans clustering supervised ensemble methods about / Supervised ensemble methods boosting / Boosting bagging / Bagging wagging / Wagging supervised learning about / Reinforcement Learning (RL) Support Vector Machine (SVM) / Implementation of Python (using examples) Support Vector Machines (SVM) about / Support Vector Machines (SVM) Inseparable Data / Inseparable Data implementing / Implementing SVM implementing, Mahout used / Using Mahout implementing, R used / Using R implementing, Spark used / Using Spark implementing, Python (scikit-learn) used / Using Python (Scikit-learn) implementing, Julia used / Using Julia support vector machines (SVM) / Kernel method based algorithms symmetric distribution about / Revisiting statistics synapses / Synapses T Tableau / Apache Spark Tajo about / Hadoop ecosystem components URL / Hadoop ecosystem components task dependency graph / Developing concurrent algorithms task parallelization about / Distributed and parallel computing strategies TaskTracker / MapReduce architecture Temporal Credit Assignment about / Delayed rewards temporal difference (TD) learning about / Temporal difference (TD) learning Sarsa / Sarsa - on-Policy TD terms, Reinforcement Learning (RL) agent / The context of Reinforcement Learning environment / The context of Reinforcement Learning state / The context of Reinforcement Learning action / The context of Reinforcement Learning policy / The context of Reinforcement Learning reward / The context of Reinforcement Learning value / The context of Reinforcement Learning top-K recommendation / Instance-based learning (IBL) Total Cost of Ownership (TCO) / Commoditizing information Total Lifetime Value (TLV) / Classification Total overall cost of 
ownership (TCO) / Emerging perspectives traditional ETL architecture limitations / Evolution of data architectures transfer learning / Transfer learning tree Induction method ID3 / C4.5 CHAID / C4.5 QUEST / C4.5 CAL5 / C4.5 FACT / C4.5 LMDT / C4.5 MARS / C4.5 U Ubuntu-based Hadoop Installation prerequisites / Hadoop installation and setup Jdk 1.7, installing / Installing Jdk 1.7 system user, creating for Hadoop / Creating a system user for Hadoop (dedicated) IPv6, disabling / Disable IPv6 uncertainty sources / Probability Unique Transaction Identifier (UTI) / Association rule – a definition unlabelled data set about / Reinforcement Learning (RL) unsupervised ensemble methods about / Unsupervised ensemble methods V value about / The context of Reinforcement Learning variable about / Important terms and definitions variables, Julia / Using variables and assignments variance / Solving the errors: bias and variance about / Revisiting statistics properties / Properties of variance vectors implementing, in Mahout / Implementing vectors in Mahout Visualizations about / The Consumption layer data, exploring with / Explaining and exploring data with Visualizations Voronoi cell / Nearest Neighbors W wagging about / Wagging WebHDFS REST API URL / RESTFul HDFS Wisdom of Crowds about / The wisdom of the crowd aggregation / The wisdom of the crowd independence / The wisdom of the crowd decentralization / The wisdom of the crowd diversity of opinion / The wisdom of the crowd usage of combiner / The wisdom of the crowd dependency between classifiers / The wisdom of the crowd diversity, generating / The wisdom of the crowd size of ensemble / The wisdom of the crowd cross inducers / The wisdom of the crowd Y YARN about / Hadoop ecosystem components Z ZooKeeper / Hadoop platform / Processing layer about / Hadoop ecosystem components URL / Hadoop ecosystem components ... 
for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. https://www2.packtpub.com/books/subscription/packtlib Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library... companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: January 2016. Production reference: 2270116. Published by Packt Publishing Ltd. Livery Place... Multi-model database architecture / polyglot persistence Vendors Lambda Architecture (LA) Vendors Summary Index Practical Machine Learning Practical Machine Learning Copyright © 2016 Packt Publishing. All rights reserved. No part of this book may be reproduced, stored in a retrieval system,

Format
Pages	653
File size	16.56 MB