Mastering Machine Learning with scikit-learn, Second Edition

Mastering Machine Learning with scikit-learn, Second Edition

Learn to implement and evaluate machine learning solutions with scikit-learn

Gavin Hackeling

BIRMINGHAM - MUMBAI

Mastering Machine Learning with scikit-learn, Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2014
Second published: July 2017

Production reference: 1200717

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK

ISBN 978-1-78829-987-9

www.packtpub.com

Credits

Author: Gavin Hackeling
Copy Editors: Safis Editing, Vikrant Phadkay
Reviewer: Oleg Okun
Project Coordinator: Nidhi Joshi
Commissioning Editor: Amey Varangaonkar
Proofreader: Safis Editing
Acquisition Editor: Aman Singh
Indexer: Tejal Daruwale Soni
Content Development Editor: Aishwarya Pandere
Graphics: Tania Dutta
Technical Editor: Suwarna Patil
Production Coordinator: Arvindkumar Gupta

About the Author

Gavin Hackeling is a data scientist and author. He has worked on a variety of machine learning problems, including automatic speech recognition, document classification, object recognition, and semantic segmentation. An alumnus of the University of North Carolina and New York University, he lives in Brooklyn with his wife and cat.

I would like to thank my wife, Hallie, and the scikit-learn community.

About the Reviewer

Oleg Okun is a machine learning expert and an author/editor of four books, numerous journal articles, and conference papers. His career spans more than a quarter of a century. He was employed in both academia and industry in his motherland, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/offline marketing analytics, credit scoring analytics, and text analytics. He is interested in all aspects of distributed machine learning and the Internet of Things. Oleg currently lives and works in Hamburg, Germany.

I would like to express my deepest gratitude to my parents for everything that they have done for me.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available?
You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page. If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface
Chapter 1: The Fundamentals of Machine Learning
  Defining machine learning
  Learning from experience
  Machine learning tasks
  Training data, testing data, and validation data
  Bias and variance
  An introduction to scikit-learn
  Installing scikit-learn
    Installing using pip
    Installing on Windows
    Installing on Ubuntu 16.04
    Installing on Mac OS
    Installing Anaconda
    Verifying the installation
  Installing pandas, Pillow, NLTK, and matplotlib
  Summary
Chapter 2: Simple Linear Regression
  Simple linear regression
  Evaluating the fitness of the model with a cost function
  Solving OLS for simple linear regression
  Evaluating the model
  Summary
Chapter 3: Classification and Regression with k-Nearest Neighbors
  K-Nearest Neighbors
  Lazy learning and non-parametric models
  Classification with KNN
  Regression with KNN
  Scaling features
  Summary
Chapter 4: Feature Extraction
  Extracting features from categorical variables
  Standardizing features
  Extracting features from text
    The bag-of-words model
    Stop word filtering
    Stemming and lemmatization
    Extending bag-of-words with tf-idf weights
    Space-efficient feature vectorizing with the hashing trick
    Word embeddings
  Extracting features from images
    Extracting features from pixel intensities
    Using convolutional neural network activations as features
  Summary
Chapter 5: From Simple Linear Regression to Multiple Linear Regression
  Multiple linear regression
  Polynomial regression
  Regularization
  Applying linear regression
    Exploring the data
    Fitting and evaluating the model
  Gradient descent
  Summary
Chapter 6: From Linear Regression to Logistic Regression
  Binary classification with logistic regression
  Spam filtering
    Binary classification performance metrics
    Accuracy
    Precision and recall
    Calculating the F1 measure
    ROC AUC
  Tuning models with grid search
  Multi-class classification
    Multi-class classification performance metrics
  Multi-label classification and problem transformation
    Multi-label classification performance metrics
  Summary
Chapter 7: Naive Bayes
  Bayes' theorem
  Generative and discriminative models
First, we load the images, convert them to grayscale, and extract SURF descriptors. SURF descriptors can be extracted more quickly than many similar features, but extracting descriptors from 2,000 images is still computationally expensive. Unlike previous examples, this script requires several minutes to execute on most computers:

    import os
    import glob

    import numpy as np
    import mahotas as mh
    from mahotas.features import surf
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    all_instance_filenames = []
    all_instance_targets = []
    # Directory containing the cat and dog images.
    for f in glob.glob('cats-and-dogs-img/*.jpg'):
        target = 1 if 'cat' in os.path.split(f)[1] else 0
        all_instance_filenames.append(f)
        all_instance_targets.append(target)

    surf_features = []
    for f in all_instance_filenames:
        image = mh.imread(f, as_grey=True)
        # The first elements of each descriptor describe its position and orientation;
        # we require only the descriptor itself.
        surf_features.append(surf.surf(image)[:, 5:])

    train_len = int(len(all_instance_filenames) * .60)  # train/test split fraction (value assumed)
    X_train_surf_features = np.concatenate(surf_features[:train_len])
    X_test_surf_features = np.concatenate(surf_features[train_len:])
    y_train = all_instance_targets[:train_len]
    y_test = all_instance_targets[train_len:]

We then group the extracted descriptors into 300 clusters. We use MiniBatchKMeans, a variation of K-means that uses a random sample of the instances in each iteration. Because it computes the distances to the centroids for only a sample of the instances in each iteration, MiniBatchKMeans converges more quickly, but its clusters' distortions may be greater. In practice, the results are similar, and this compromise is acceptable:

    n_clusters = 300
    estimator = MiniBatchKMeans(n_clusters=n_clusters)
    estimator.fit_transform(X_train_surf_features)

Next, we construct feature vectors for the training and testing data. We find the cluster associated with each of the extracted SURF descriptors and count them using NumPy's bincount function. This results in a 300-dimensional feature vector for each instance:

    X_train = []
    for instance in surf_features[:train_len]:
        clusters = estimator.predict(instance)
        features = np.bincount(clusters)
        if len(features) < n_clusters:
            features = np.append(features, np.zeros((1, n_clusters - len(features))))
        X_train.append(features)

    X_test = []
    for instance in surf_features[train_len:]:
        clusters = estimator.predict(instance)
        features = np.bincount(clusters)
        if len(features) < n_clusters:
            features = np.append(features, np.zeros((1, n_clusters - len(features))))
        X_test.append(features)

Finally, we train a logistic regression classifier on the feature vectors and targets, and assess its precision, recall, and accuracy:

    clf = LogisticRegression(C=0.001, penalty='l2')  # regularization settings assumed
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print(classification_report(y_test, predictions))

Summary

In this chapter, we discussed our first unsupervised learning task, clustering. Clustering is used to discover structures in unlabeled data. We learned about the K-means clustering algorithm, which iteratively assigns instances to clusters and refines the positions of the cluster centroids. While K-means learns from experience without supervision, its performance is still measurable; we learned to use distortion and the silhouette coefficient to evaluate clusters, and a short sketch of this evaluation follows this summary. We applied K-means to two different problems. First, we used K-means for image quantization, a compression technique that represents a range of colors with a single color. We also used K-means to learn features in a semi-supervised image classification problem.

In the next chapter, we will discuss another unsupervised learning task called dimensionality reduction. Like the semi-supervised feature representations we created to classify images of cats and dogs, dimensionality reduction can be used to reduce the dimensions of a feature representation while retaining as much information as possible.
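The following is a minimal sketch of that cluster evaluation, assuming synthetic blobs and candidate cluster counts chosen purely for illustration rather than the chapter's data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Generate synthetic data with three well-separated groups (illustrative only).
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

    # Fit K-means for several candidate numbers of clusters.
    # inertia_ is the within-cluster sum of squared distances (the distortion);
    # the silhouette coefficient balances cohesion and separation.
    for k in [2, 3, 4, 5]:
        km = KMeans(n_clusters=k, random_state=0).fit(X)
        print(k, km.inertia_, silhouette_score(X, km.labels_))

Higher silhouette scores indicate more compact, better-separated clusters, so comparing the scores across candidate values of k is a simple way to choose the number of clusters.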
Chapter 14: Dimensionality Reduction with Principal Component Analysis

In this chapter, we will discuss a technique for reducing the dimensions of data called principal component analysis (PCA). Dimensionality reduction is motivated by several problems. Firstly, it can be used to mitigate problems caused by the curse of dimensionality. Secondly, dimensionality reduction can be used to compress data while minimizing the amount of information that is lost. Thirdly, understanding the structure of data with hundreds of dimensions can be difficult; data with only two or three dimensions can be visualized easily. We will use PCA to visualize a high-dimensional dataset in two dimensions and to build a face recognition system.

Principal component analysis

Recall from previous chapters that problems involving high-dimensional data can be affected by the curse of dimensionality. As the number of dimensions of a dataset increases, the number of samples required for an estimator to generalize increases exponentially. Acquiring such large data may be infeasible in some applications, and learning from large datasets requires more memory and processing power. Furthermore, the sparseness of data often increases with its dimensions. It can become more difficult to detect similar instances in high-dimensional space as all instances are similarly sparse.

PCA, also known as the Karhunen-Loeve Transform (KLT), is a technique for finding patterns in high-dimensional data. PCA is commonly used to explore and visualize high-dimensional datasets. It can also be used to compress data and to process data before it is used by another estimator. PCA reduces a set of possibly correlated high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called principal components. The lower-dimensional data preserves as much of the variance of the original data as possible.

PCA reduces the dimensions of a dataset by projecting the data onto a lower-dimensional subspace. For example, a two-dimensional dataset could be reduced by projecting the points onto a line; each instance in the dataset would then be represented by a single value rather than by a pair of values. A three-dimensional dataset could be reduced to two dimensions by projecting the variables onto a plane. In general, an m-dimensional dataset can be reduced by projecting onto an n-dimensional subspace, where n is less than m. More formally, PCA can be used to find a set of vectors that span a subspace that minimizes the sum of the squared errors of the projected data. This projection will retain the greatest proportion of the original dataset's variance.
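The following is a minimal sketch of such a projection using scikit-learn's PCA transformer, assuming a small synthetic dataset chosen purely for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    # A small synthetic three-dimensional dataset (illustrative values).
    X = np.array([
        [0.9, 1.0, 0.1],
        [2.1, 2.0, 0.2],
        [2.9, 3.1, 0.1],
        [4.2, 3.9, 0.3],
        [5.0, 5.1, 0.2]
    ])

    # Project the data onto the two directions of greatest variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                 # (5, 2): each instance is now described by two values
    print(pca.explained_variance_ratio_)   # proportion of the variance retained by each component

The explained_variance_ratio_ attribute reports the proportion of the original variance captured by each component, which is a convenient check on how much information the projection preserves.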
Imagine that you are a photographer for a gardening supply catalog, and you are tasked with photographing a watering can. The watering can is three-dimensional, but the photograph is two-dimensional; you must create a two-dimensional representation that describes as much of the watering can as possible. The following are four pictures that you could use.

In the first photograph the back of the watering can is visible, but the front cannot be seen. The second picture is angled to look directly down the spout of the watering can; this picture provides information about the front of the can that was not visible in the first photograph, but now the handle cannot be seen. The height of the watering can cannot be discerned from the bird's-eye view of the third picture. The fourth picture is the obvious choice for the catalog; the watering can's height, top, spout, and handle are all discernible in this image.

The motivation of PCA is similar; it can project data in a high-dimensional space onto a lower-dimensional space such that it retains as much of the variance as possible. PCA rotates the dataset to align with its principal components to maximize the variance contained within the first several principal components. Assume that we have the dataset that is plotted in the following screenshot.

The instances approximately form a long, thin ellipse stretching from the origin to the top-right corner of the plot. To reduce the dimensions of this dataset, we must project the points onto a line. The following are two lines that the data can be projected onto. Along which line do the instances vary the most? The instances vary more along the dashed line than the dotted line. In fact, the dashed line is the first principal component. The second principal component must be orthogonal to the first principal component; that is, it must be statistically independent of the first principal component. In a two-dimensional space, the first and second principal components will appear to be perpendicular, as shown in the following screenshot.

Each subsequent principal component preserves the maximum amount of the remaining variance; the only constraint is that it must be orthogonal to the other principal components. Now assume that the dataset is three-dimensional. The scatter plot of the previous points looks like a flat disc that has been rotated slightly about one of the axes. The points can be rotated and translated such that the tilted disk lies almost exactly in two dimensions. The points now form an ellipse; the third dimension contains almost no variance and can be discarded.

PCA is most useful when the variance in a dataset is distributed unevenly across the dimensions. Consider a three-dimensional dataset with a spherical convex hull. PCA cannot be used effectively with this dataset because there is equal variance in each dimension; none of the dimensions can be discarded without losing a significant amount of information. It is easy to visually identify the principal components of datasets with only two or three dimensions. In the next section, we will discuss how to calculate the principal components of high-dimensional data.

Variance, covariance, and covariance matrices

There are several terms that we must define before discussing how PCA works. Recall that variance is a measure of how spread out a set of values is. Variance is calculated as the average of the squared differences of the values and the mean of the values, per the following equation:

$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Covariance is a measure of how much two variables change together; it is a measure of the strength of the correlation between two sets of variables. If the covariance of two variables is zero, the variables are uncorrelated. Note that uncorrelated variables are not necessarily independent, as correlation is only a measure of linear dependence. The covariance of two variables is calculated using the following equation:

$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

If the covariance is non-zero, the sign indicates whether the variables are positively or negatively correlated. When two variables are positively correlated, one increases as the other increases. When two variables are negatively correlated, one variable decreases relative to its mean as the other variable increases relative to its mean.
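The following is a small sketch of this sign interpretation using NumPy's cov function, assuming made-up values chosen purely for illustration:

    import numpy as np

    # Two made-up variables that tend to increase together, and one that moves opposite to them.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # increases roughly in step with x
    z = np.array([9.9, 8.2, 5.8, 4.1, 2.0])   # decreases as x increases

    print(np.cov(x, y)[0, 1])  # positive covariance: x and y are positively correlated
    print(np.cov(x, z)[0, 1])  # negative covariance: x and z are negatively correlated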
A covariance matrix describes the covariances between each pair of dimensions in a dataset. The element (i, j) indicates the covariance of the ith and jth dimensions of the data. For example, a covariance matrix for a set of three-dimensional data is given by the following matrix:

$C = \begin{pmatrix} \mathrm{cov}(x_1, x_1) & \mathrm{cov}(x_1, x_2) & \mathrm{cov}(x_1, x_3) \\ \mathrm{cov}(x_2, x_1) & \mathrm{cov}(x_2, x_2) & \mathrm{cov}(x_2, x_3) \\ \mathrm{cov}(x_3, x_1) & \mathrm{cov}(x_3, x_2) & \mathrm{cov}(x_3, x_3) \end{pmatrix}$

Let's calculate the covariance matrix for the following dataset:

    v1     v2     v3
    2      0      -1.4
    2.2    0.2    -1.5
    2.4    0.1    -1
    1.9    0      -1.2

The means of the variables are 2.125, 0.075, and -1.275. We can then calculate the covariances of each pair of variables to produce the following covariance matrix:

$C = \begin{pmatrix} 0.0492 & 0.0142 & 0.0192 \\ 0.0142 & 0.0092 & -0.0058 \\ 0.0192 & -0.0058 & 0.0492 \end{pmatrix}$

We can verify our calculations using NumPy:

    import numpy as np

    X = np.array([
        [2, 0, -1.4],
        [2.2, 0.2, -1.5],
        [2.4, 0.1, -1],
        [1.9, 0, -1.2]
    ])
    print(np.cov(X.T))

Eigenvectors and eigenvalues

Recall that a vector is described by a direction and a magnitude, or length. An eigenvector of a matrix is a non-zero vector that satisfies the following equation:

$A\vec{v} = \lambda\vec{v}$

Here, $\vec{v}$ is an eigenvector, A is a square matrix, and λ is a scalar called an eigenvalue. The direction of an eigenvector remains the same after it has been transformed by A; only its magnitude changes, as indicated by the eigenvalue. That is, multiplying a matrix by one of its eigenvectors is equal to scaling the eigenvector. The prefix eigen is the German word for "belonging to" or "peculiar to"; the eigenvectors of a matrix are the vectors that "belong to" and characterize the structure of the data.

Eigenvectors and eigenvalues can only be derived from square matrices, and not all square matrices have eigenvectors or eigenvalues. If a matrix does have eigenvectors and eigenvalues, it will have a pair for each of its dimensions. The principal components of a matrix are the eigenvectors of its covariance matrix, ordered by their corresponding eigenvalues. The eigenvector with the greatest eigenvalue is the first principal component; the second principal component is the eigenvector with the second greatest eigenvalue, and so on.

Let's calculate the eigenvectors and eigenvalues of the following matrix:

Recall that the product of A and any eigenvector of A must be equal to the eigenvector multiplied by its eigenvalue. We will begin by finding the eigenvalues, which we can find using the characteristic equation:

$\det(A - \lambda I) = 0$

The characteristic equation states that the determinant of the matrix that is the difference between the data matrix and the product of the identity matrix and an eigenvalue is zero. Substituting our values for A produces the following:

We can then substitute in our first eigenvalue to solve for the eigenvectors. The preceding steps can be rewritten as a system of equations:

Any non-zero vector that satisfies the preceding equations, such as the following, can be used as the eigenvector:

PCA requires unit eigenvectors, or eigenvectors that have a length equal to 1. We can normalize an eigenvector by dividing it by its norm, which is given by the following equation:

$\|\vec{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$

The norm of our vector is equal to:
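The eigenvalues and eigenvectors of a covariance matrix can also be computed numerically. The following is a minimal, illustrative sketch using NumPy's linalg.eig on the covariance matrix of the small dataset above; sorting the eigenvectors by their eigenvalues orders the principal components:

    import numpy as np

    X = np.array([
        [2, 0, -1.4],
        [2.2, 0.2, -1.5],
        [2.4, 0.1, -1],
        [1.9, 0, -1.2]
    ])

    C = np.cov(X.T)                              # 3x3 covariance matrix of the three variables
    eigenvalues, eigenvectors = np.linalg.eig(C)

    # Sort by decreasing eigenvalue; the columns of `components` are then the
    # first, second, and third principal components.
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order]

    print(eigenvalues[order])
    print(components)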
...by machine learning systems. Finally, we will discuss performance measures that can be used to assess machine learning systems.

An introduction to scikit-learn

Since its release in 2007, scikit-learn has become one of the most popular machine learning libraries. scikit-learn provides algorithms for machine learning tasks including...
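As a minimal illustration of this workflow (the built-in dataset and estimator below are illustrative choices rather than this book's running examples), the following sketch fits a scikit-learn classifier and assesses it with a simple performance measure:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Load a small built-in dataset and hold out a test set.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit an estimator on the training data and evaluate it on the held-out data.
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))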
