Advanced Machine Learning with Python

Table of Contents

Advanced Machine Learning with Python
Credits
About the Author
About the Reviewers
www.PacktPub.com
    eBooks, discount offers, and more
    Why subscribe?
Preface
    What is advanced machine learning?
    What should you expect from this book?
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
    Downloading the example code
    Downloading the color images of this book
    Errata
    Piracy
    Questions
Unsupervised Machine Learning
    Principal component analysis
    PCA – a primer
    Employing PCA
    Introducing k-means clustering
    Clustering – a primer
    Kick-starting clustering analysis
    Tuning your clustering configurations
    Self-organizing maps
    SOM – a primer
    Employing SOM
    Further reading
    Summary
Deep Belief Networks
    Neural networks – a primer
    The composition of a neural network
    Network topologies
    Restricted Boltzmann Machine
    Introducing the RBM
    Topology
    Training
    Applications of the RBM
    Further applications of the RBM
    Deep belief networks
    Training a DBN
    Applying the DBN
    Validating the DBN
    Further reading
    Summary
Stacked Denoising Autoencoders
    Autoencoders
    Introducing the autoencoder
    Topology
    Training
    Denoising autoencoders
    Applying a dA
    Stacked Denoising Autoencoders
    Applying the SdA
    Assessing SdA performance
    Further reading
    Summary
Convolutional Neural Networks
    Introducing the CNN
    Understanding the convnet topology
    Understanding convolution layers
    Understanding pooling layers
    Training a convnet
    Putting it all together
    Applying a CNN
    Further Reading
    Summary
Semi-Supervised Learning
    Introduction
    Understanding semi-supervised learning
    Semi-supervised algorithms in action
    Self-training
    Implementing self-training
    Finessing your self-training implementation
    Improving the selection process
    Contrastive Pessimistic Likelihood Estimation
    Further reading
    Summary
Text Feature Engineering
    Introduction
    Text feature engineering
    Cleaning text data
    Text cleaning with BeautifulSoup
    Managing punctuation and tokenizing
    Tagging and categorising words
    Tagging with NLTK
    Sequential tagging
    Backoff tagging
    Creating features from text data
    Stemming
    Bagging and random forests
    Testing our prepared data
    Further reading
    Summary
Feature Engineering Part II
    Introduction
    Creating a feature set
    Engineering features for ML applications
    Using rescaling techniques to improve the learnability of features
    Creating effective derived variables
    Reinterpreting non-numeric features
    Using feature selection techniques
    Performing feature selection
    Correlation
    LASSO
    Recursive Feature Elimination
    Genetic models
    Feature engineering in practice
    Acquiring data via RESTful APIs
    Testing the performance of our model
    Twitter
    Translink Twitter
    Consumer comments
    The Bing Traffic API
    Deriving and selecting variables using feature engineering techniques
    The weather API
    Further reading
    Summary
Ensemble Methods
    Introducing ensembles
    Understanding averaging ensembles
    Using bagging algorithms
    Using random forests
    Applying boosting methods
    Using XGBoost
    Using stacking ensembles
    Applying ensembles in practice
    Using models in dynamic applications
    Understanding model robustness
    Identifying modeling risk factors
    Strategies to managing model robustness
    Further reading
    Summary
Additional Python Machine Learning Tools
    Alternative development tools
    Introduction to Lasagne
    Getting to know Lasagne
    Introduction to TensorFlow
    Getting to know TensorFlow
    Using TensorFlow to iteratively improve our models
    Knowing when to use these libraries
    Further reading
    Summary
A Chapter Code Requirements
Index

Advanced Machine Learning with Python

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this
book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016
Production reference: 1220716

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78439-863-7
www.packtpub.com

Credits

Author
    John Hearty
Reviewers
    Jared Huffman
    Ashwin Pajankar
Commissioning Editor
    Akram Hussain
Acquisition Editor
    Sonali Vernekar
Content Development Editor
    Mayur Pawanikar
Technical Editor
    Suwarna Patil
Copy Editor
    Tasneem Fatehi
Project Coordinator
    Nidhi Joshi
Proofreader
    Safis Editing
Indexer
    Mariammal Chettiyar
Graphics
    Disha Haria
Production Coordinator
    Arvindkumar Gupta
Cover Work
    Arvindkumar Gupta

K
k-means clustering
    about / Introducing k-means clustering
    clustering / Clustering – a primer
    clustering analysis / Kick-starting clustering analysis
    configuration, tuning / Tuning your clustering configurations
K-Nearest Neighbors (KNN) / Using bagging algorithms
Keras / Knowing when to use these libraries

L
Lasagne
    about / Introduction to Lasagne, Getting to know Lasagne
LASSO
    about / LASSO
LeNet
    about / Putting it all together
libraries
    usage, deciding / Knowing when to use these libraries

M
Markov Chain Monte Carlo (MCMC)
    about / Training
max-pooling
    about / Understanding pooling layers
mean-pooling
    about / Understanding pooling layers
modeling risk factors
    longitudinally variant / Identifying modeling risk factors
    slow change / Identifying modeling risk factors
    Key parameter / Identifying modeling risk factors
models
    using, in dynamic applications / Using models in dynamic applications
    robustness / Understanding model robustness
    modeling risk factors, identifying / Identifying modeling risk factors
    robustness, managing / Strategies to managing model robustness
Motor Vehicle Accident (MVA) / Translink Twitter
Multi-Layer Perceptron (MLP)
    about / The composition of a neural network
multicollinearity / Correlation

N
n-dimensional input
    about / Denoising autoencoders
n-gram tagger
    about / Sequential tagging
Natural Language Toolkit (NLTK)
    about / Tagging and categorising words
    used, for tagging / Tagging with NLTK
Network In Network (NIN)
    about / Putting it all together
network topologies
    about / Network topologies
neural networks
    about / Neural networks – a primer
    composition / The composition of a neural network
    learning process / The composition of a neural network
    neurons / The composition of a neural network
    connectivity functions / The composition of a neural network
    network topologies / Network topologies

O
OpinRank Review dataset
    about / Applying the SdA
    URL / Applying the SdA
orthogonalization
    about / PCA – a primer
orthonormalization
    about / PCA – a primer
overcomplete
    about / Denoising autoencoders

P
Persistent Contrastive Divergence (PCD)
    about / Training
Platt calibration
    about / Implementing self-training
pooling layers
    about / Understanding pooling layers
porter stemmer
    about / Stemming
Pragmatic Chaos model / Using stacking ensembles
price-earnings (P/E) ratio / Creating effective derived variables
principal component analysis (PCA)
    about / Principal component analysis
    features / PCA – a primer
    employing / Employing PCA
Pylearn2 / Knowing when to use these libraries

R
random forests
    about / Bagging and random forests, Testing our prepared data
    using / Using random forests
Random Patches
    about / Bagging and random forests
random patches
    about / Using bagging algorithms
random subspaces
    about / Using bagging algorithms
Rectified Linear Units (ReLU)
    about / Putting it all together
Recursive Feature Elimination (RFE) / Performing feature selection
    about / Recursive Feature Elimination
RESTful APIs
    data, acquiring / Acquiring data via RESTful APIs
    model performance, testing / Testing the performance of our model
Restricted Boltzmann Machine (RBM)
    about / Restricted Boltzmann Machine, Introducing the RBM
    topology / Topology
    training / Training
    applications / Applications of the RBM, Further applications of the RBM
Root Mean Squared Error (RMSE)
    about / Genetic models

S
scikit-learn
    about / Employing PCA
Self-Organizing Map (SOM)
    about / The composition of a neural network
self-organizing maps (SOM)
    about / Self-organizing maps, SOM – a primer
    employing / Employing SOM
self-training
    about / Self-training
    implementing / Implementing self-training
    improving / Finessing your self-training implementation
    selection process, improving / Improving the selection process
    Contrastive Pessimistic Likelihood Estimation (CPLE) / Contrastive Pessimistic Likelihood Estimation
semi-supervised algorithms
    using / Semi-supervised algorithms in action
semi-supervised learning
    about / Introduction, Understanding semi-supervised learning
    self-training / Self-training
sequential tagging
    about / Sequential tagging
Silhouette Coefficient
    about / Kick-starting clustering analysis
stacked denoising autoencoders (SdA)
    about / Stacked Denoising Autoencoders
    applying / Applying the SdA
    performance, assessing / Assessing SdA performance
stacking ensembles
    using / Using stacking ensembles
stemming
    about / Stemming
Stochastic Gradient Descent (SGD)
    about / Implementing self-training
stride
    about / Understanding convolution layers
subtaggers
    about / Backoff tagging
sum-pooling
    about / Understanding pooling layers
Support Vector Classification (SVC)
    about / Recursive Feature Elimination

T
tagging
    with, Natural Language Toolkit (NLTK) / Tagging with NLTK
    sequential tagging / Sequential tagging
    backoff tagging / Backoff tagging
TB-scale datasets
    about / Clustering – a primer
tensor / Understanding convolution layers
TensorFlow
    about / Introduction to TensorFlow, Getting to know TensorFlow
    using / Using TensorFlow to iteratively improve our models
TensorFlow library
    about / Understanding convolution layers
text data
    cleaning / Cleaning text data
    cleaning, with BeautifulSoup / Text cleaning with BeautifulSoup
    punctuation, managing / Managing punctuation and tokenizing
    tokenisation, managing / Managing punctuation and tokenizing
    words, categorizing / Tagging and categorising words
    words, tagging / Tagging and categorising words
    features, creating / Creating features from text data
text feature engineering
    about / Text feature engineering
    text data, cleaning / Cleaning text data
    stemming / Stemming
    bagging / Bagging and random forests
    random forests / Bagging and random forests
    prepared data, testing / Testing our prepared data
Theano
    about / Denoising autoencoders
tokenisation
    about / Managing punctuation and tokenizing
transforming autoencoder
    about / Understanding pooling layers
translation-invariance
    about / Understanding pooling layers
Translink Twitter
    about / Translink Twitter
trigram tagger
    about / Sequential tagging
Twitter
    using / Twitter
    Translink Twitter, using / Translink Twitter
    consumer comments, analyzing / Consumer comments
    Bing Traffic API / The Bing Traffic API

U
U-Matrix
    about / Employing SOM
unigram tagger
    about / Sequential tagging

V
v-fold cross-validation
    about / Tuning your clustering configurations
validity measure (v-measure)
    about / Kick-starting clustering analysis

W
weather API
    creating / The weather API

Y
Yahoo Weather API
    about / Acquiring data via RESTful APIs

Z
Zipf distribution
    about / Reinterpreting non-numeric features