
Advanced Machine Learning using Python


DOCUMENT INFORMATION

Basic information

Format
Pages: 278
Size: 2.14 MB
Attachment: Advanced Machine Learning using Python.rar (2 MB)

Contents

Advanced Machine Learning with Python

Solve challenging data science problems by mastering cutting-edge machine learning techniques in Python

John Hearty

BIRMINGHAM - MUMBAI

Advanced Machine Learning with Python

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016

Production reference: 1220716

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78439-863-7

www.packtpub.com

Credits

Author: John Hearty
Reviewers: Jared Huffman, Ashwin Pajankar
Commissioning Editor: Akram Hussain
Acquisition Editor: Sonali Vernekar
Content Development Editor: Mayur Pawanikar
Technical Editor: Suwarna Patil
Copy Editor: Tasneem Fatehi
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Disha Haria
Production Coordinator: Arvindkumar Gupta
Cover Work: Arvindkumar Gupta

About the Author

John Hearty is a consultant in digital industries with substantial expertise in data science and infrastructure engineering. Having started out in mobile gaming, he was drawn to the challenge of AAA console analytics. Keen to start putting
advanced machine learning techniques into practice, he signed on with Microsoft to develop player modelling capabilities and big data infrastructure at an Xbox studio. His team made significant strides in engineering and data science that were replicated across Microsoft Studios. Some of the more rewarding initiatives he led included player skill modelling in asymmetrical games, and the creation of player segmentation models for individualized game experiences.

Eventually John struck out on his own as a consultant offering comprehensive infrastructure and analytics solutions for international client teams seeking new insights or data-driven capabilities. His favourite current engagement involves creating predictive models and quantifying the importance of user connections for a popular social network.

After years spent working with data, John is largely unable to stop asking questions. In his own time, he routinely builds ML solutions in Python to fulfil a broad set of personal interests. These include a novel variant on the StyleNet computational creativity algorithm and solutions for algo-trading and geolocation-based recommendation. He currently lives in the UK.

About the Reviewers

Jared Huffman is a lifelong gamer and extreme data geek. After completing his bachelor's degree in computer science, he started his career in his hometown of Melbourne, Florida. While there, he honed his software development skills, including work on a credit card-processing system and a variety of web tools. He finished it off with a fun contract working at NASA's Kennedy Space Center before migrating to his current home in the Seattle area. Diving head first into the world of data, he took up a role working on Microsoft's internal finance tools and reporting systems. Feeling that he could no longer resist his love for video games, he joined the Xbox division to build their Business. To date, Jared has helped ship and support 12 games and presented at several events on various machine learning and
other data topics. His latest endeavor has him applying both his software skills and analytics expertise in leading the data science efforts for Minecraft. There he gets to apply machine learning techniques, trying out fun and impactful projects, such as customer segmentation models, churn prediction, and recommendation systems. Outside of work, Jared spends much of his free time playing board games and video games with his family and friends, as well as dabbling in occasional game development.

First I'd like to give a big thanks to John for giving me the honor of reviewing this book; it's been a great learning experience. Second, thanks to my amazing wife, Kalen, for allowing me to repeatedly skip chores to work on it. Last, and certainly not least, I'd like to thank God for providing me the opportunities to work on things I love and still make a living doing it. Being able to wake up every day and create games that bring joy to millions of players is truly a pleasure.

Ashwin Pajankar is a software professional and IoT enthusiast with more than years of experience in software design, development, testing, and automation. He graduated from IIIT Hyderabad, earning an M.Tech in computer science and engineering. He holds multiple professional certifications from Oracle, IBM, Teradata, and ISTQB in development, databases, and testing. He has won several awards in college through outreach initiatives, at work for technical achievements, and community service through corporate social responsibility programs.

He was introduced to Raspberry Pi while organizing a hackathon at his workplace, and has been hooked on Pi ever since. He writes plenty of code in C, Bash, Python, and Java on his cluster of Pis. He's already authored two books on Raspberry Pi and reviewed three other titles related to Python for Packt Publishing. His LinkedIn Profile is https://in.linkedin.com/in/ashwinpajankar.

I would like to thank my wife, Kavitha, for the motivation.

www.PacktPub.com

eBooks, discount offers, and
more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Of the many people I feel gratitude towards, I particularly want to thank my parents… mostly for their patience. I'd like to extend thanks to Tyler Lowe for his invaluable friendship, to Mark Huntley for his bothersome emphasis on accuracy, and to the former team at Lionhead Studios. I also greatly value the excellent work done by Jared Huffman and the industrious editorial team at Packt Publishing, who were hugely positive and supportive throughout the creation of this book.

Finally, I'd like to dedicate the work and words herein to you, the reader. There has never been a better time to get to grips with the subjects of this book; the world is stuffed with new opportunities that can be seized using creativity and an appropriate model. I hope for your every success in the pursuit of those solutions.

Using TensorFlow to iteratively improve our models

Even from the single example in the preceding section, we should be able to recognise what TensorFlow brings to the table. It offers a simple interface for the task of developing complex architectures and training methods, giving us easier
access to the algorithms we've learnt about earlier in this book. As we know, however, developing an initial model is only a small part of the model development process. We usually need to test and dissect our models repeatedly to improve their performance. However, this tends to be an area where our tools are less unified in a single library or technique, and the tests and monitoring solutions less consistent across models.

TensorFlow looks to solve the problem of how to get good insight into our models during iteration, in what it calls the execution phase of model development. During the execution phase, we can make use of tools provided by the TensorFlow team to explore and improve our models.

Perhaps the most important of these tools is TensorBoard, which provides an explorable, visual representation of the model we've built. TensorBoard provides several capabilities, including dashboards that show basic model information (including performance measurements during each iteration for test and/or training). In addition, TensorBoard dashboards provide lower-level information including plots of the range of values for weights, biases and activation values at every model layer; tremendously useful diagnostic information during iteration. The process of accessing this data is hassle-free and it is immediately useful.

Further to this, TensorBoard provides a detailed graph of the tensor flow for a given model. The tensor is an n-dimensional array of data (in this case, of n-many features); it's what we tend to think of when we use the term the input dataset. The series of operations that is applied to a tensor is described as the tensor flow, and in TensorFlow it's a fundamental concept, for a simple and compelling reason. When refining and debugging a machine learning model, what matters is having information about the model and its operations at even a low level.

TensorBoard graphs show the structure of a model
in variable detail. From this initial view, it is possible to dig into each component of the model and into successive sub-elements. In this case, we are able to view the specific operations that take place within the dropout function of our second network layer.

We can see what happens and identify what to tweak for our next iteration. This level of transparency is unusual and can be very helpful when we want to tweak model components, especially when a model element or layer is underperforming (as we might see, for instance, from TensorBoard graphs showing layer metaparameter values or from network performance as a whole).

TensorBoards can be created from event logs generated when TensorFlow is run. This makes the benefits of TensorBoards easily obtained during the course of everyday development using TensorFlow.

As of late April 2016, the DeepMind team joined the Google Brain team and a broad set of other researchers and developers in using TensorFlow. By making TensorFlow open source and freely available, Google is committing to continue supporting TensorFlow as a powerful tool for model development and refinement.

Knowing when to use these libraries

At one or two points in this chapter, we probably ran into the question of "Okay, so, why didn't you just teach us about this library to begin with?"
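Before weighing the libraries against each other, it is worth making the two ideas just described concrete: a tensor passing through a recorded series of operations, and the per-iteration scalar logs that a dashboard can chart. The sketch below is a plain-Python stand-in for those concepts only; the class and function names are invented for illustration and are not TensorFlow's actual API:

```python
# Toy illustration of a "tensor flow": a graph object records each named
# operation applied to an input tensor, and also collects per-step scalar
# records of the kind a TensorBoard-style dashboard would chart.

class Graph:
    """Records applied operations and scalar metrics for later inspection."""
    def __init__(self):
        self.ops = []          # operation names, in application order
        self.scalar_log = []   # (step, name, value) records, like event logs

    def apply(self, name, fn, tensor):
        # Record the operation's name, then apply it to the tensor.
        self.ops.append(name)
        return fn(tensor)

    def log_scalar(self, step, name, value):
        self.scalar_log.append((step, name, value))

def relu(vec):
    # Element-wise rectified linear unit on a rank-1 tensor.
    return [max(0.0, v) for v in vec]

def scale(factor):
    # Returns an op that multiplies every element by a constant factor.
    return lambda vec: [factor * v for v in vec]

graph = Graph()
x = [-2.0, -1.0, 0.0, 1.0, 2.0]   # a rank-1 tensor of five features

# The "flow": a series of operations applied to the tensor.
h = graph.apply("scale_by_0.5", scale(0.5), x)
y = graph.apply("relu", relu, h)

# Simulated training loop: log a shrinking loss at each iteration.
loss = 4.0
for step in range(5):
    loss *= 0.5
    graph.log_scalar(step, "loss", loss)

print(graph.ops)             # ['scale_by_0.5', 'relu']
print(y)                     # [0.0, 0.0, 0.0, 0.5, 1.0]
print(graph.scalar_log[-1])  # (4, 'loss', 0.125)
```

In real TensorFlow, the graph is built from operations on tensors and scalar summaries are written to event logs that TensorBoard reads; this toy merely shows why recording both the operation structure and the per-step metrics makes a model easier to diagnose during iteration.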
It's fair to ask why we spent time digging around in Theano functions and other low-level information when this chapter presents perfectly good interfaces that make life easier. Naturally, I advocate using the best tools available, especially for prototyping tasks where the value of the work is more in understanding the general ballpark you're in, or in identifying specific problem classes. It's worth recognising the three reasons for not presenting content earlier in this book using either of these libraries.

The first reason is that these tools will only get you so far. They can do a lot, agreed, so depending on the domain and the nature of that domain's problems, some data scientists may be able to rely on them for the majority of deep learning needs. Beyond a certain level of performance and problem complexity, of course, you need to understand what is needed to construct a model in Theano, create your own scoring function from scratch or leverage the other techniques described in this book.

Another part of the decision to focus on teaching lower-level implementation is about the developing maturity of the technologies involved. At this point, Lasagne and TensorFlow are definitely worth discussing and recommending to you. Prior to this, when the majority of the book was written, the risk around discussing the libraries in this chapter was greater. There are many projects based on Theano (some of the more prominent frameworks which weren't discussed in this chapter are Keras, Blocks and Pylearn2). Even now, it's entirely possible that different libraries and tools will be the subject of discussion or the default working environment in a year or two years' time. This field moves extremely fast, largely due to the influence of key companies and research groups who have to keep building new tools as the old ones reach their useful limits… or it just becomes clear how to do things better.

The other reason to dig in at a lower level, honestly, is that this is an involved book. It sets
theory alongside code and uses the code to teach the theory. Abstracting away how the algorithms work and simply discussing how to apply them to crack a particular example can be tempting. The tools discussed in this chapter enable practitioners to get very good scores on some problems without ever understanding the functions that are being called. My opinion is that this is not a very good way to train a data scientist.

If you're going to operate on subtle and difficult data problems, you need to be able to modify and define your own algorithm. You need to understand how to choose an appropriate solution. To do these things, you need the details provided in this book and even more very specific information that I haven't provided, due to the limitations of (page) space and time. At that point, you can apply deep learning algorithms flexibly and knowledgeably.

Similarly, it's important to recognise what these tools do well, or less well. At present, Lasagne fits very well within that use-case where a new model is being developed for benchmarking or early passes, where the priority should be on iteration speed and getting results. TensorFlow, meanwhile, fits later into the development lifespan of a model. When the easy gains disappear and it's necessary to spend a lot of time debugging and improving a model, the relatively quick iterations of TensorFlow are a definite plus, but it's the diagnostic tools provided by TensorBoard that present an overwhelming value-add.

There is, therefore, a place for both libraries in your toolset. Depending on the nature of the problem at hand, these libraries and more will prove to be valuable assets.

Further reading

The Lasagne User Guide is thorough and worth reading. Find it at http://lasagne.readthedocs.io/en/latest/index.html.

Similarly, find the TensorFlow tutorials at https://www.tensorflow.org/versions/r0.9/get_started/index.html.

Summary

In this final chapter, we moved some distance from our previous discussions of algorithms,
configuration and diagnosis to consider tools that improve our experience when implementing deep learning algorithms. We discovered the advantages to using Lasagne, an interface to Theano designed to accelerate and simplify early prototyping of our models. Meanwhile, we examined TensorFlow, the library developed by Google to aid Deep Learning model adjustment and optimization. TensorFlow offers us a remarkable amount of visibility of model performance, at minimal effort, and makes the task of diagnosing and debugging a complex, deep model structure much less challenging.

Both tools have their own place in our processes, with each being appropriate for a particular set of problems.

Over the course of this book as a whole, we have walked through and reviewed a broad set of advanced machine learning techniques. We went from a position where we understood some fundamental algorithms and concepts, to having confident use of a very current, powerful and sought-after toolset. Beyond the techniques, though, this book attempts to teach one further concept, one that's much harder to teach and to learn, but which underpins the best performance in machine learning.

The field of machine learning is moving very fast. This pace is visible in new and improved scores that are posted almost every week in academic journals or industry white papers. It's visible in how training examples like MNIST have moved quickly from being seen as meaningful challenges to being toy problems, the deep learning version of the Iris dataset. Meanwhile, the field moves on to the next big challenge: CIFAR-10, CIFAR-100.

At the same time, the field moves cyclically. Concepts introduced by academics like Yann LeCun in the 1980s are in resurgence as computing architectures and resource growth make their use more viable over real data at scale. To use many of the most current techniques at their best limits, it's necessary to understand concepts that were defined decades
ago, themselves defined on the back of other concepts defined still longer ago. This book tries to balance these concerns. Understanding the cutting edge and the techniques that exist there is critical; understanding the concepts that'll define the new techniques or adjustments made in two or three years' time is equally important.

Most important of all, however, is that this book gives you an appreciation of how malleable these architectures and approaches can be. A concept consistently seen at the top end of data science practice is that the best solution to a specific problem is a problem-specific solution. This is why top Kaggle contest winners perform extensive feature preparation and tweak their architectures. It's why TensorFlow was written to allow clear vision of granular properties of one's architectures. Having the knowledge and the skills to tweak implementations or combine algorithms fluently is what it takes to have true mastery of machine learning techniques.

Through the many techniques and examples reviewed within this book, it is my hope that the ways of thinking about data problems and a confidence in manipulating and configuring these algorithms has been passed on to you as a practicing data scientist. The many recommended Further reading examples in this book are largely intended to further extend that knowledge and help you develop the skills taught in this book. Beyond that, I wish you all the best of luck in your model building and configuration. I hope that you learn for yourself just how enjoyable and rewarding this field can be!
Code Requirements

This book's content leverages openly available data and code, including open source Python libraries and frameworks. While each chapter's example code is accompanied by a README file documenting all the libraries required to run the code provided in that chapter's accompanying scripts, the content of these files is collated here for your convenience.

It is recommended that you already have some libraries that are required for the earlier chapters when working with code from any later chapter. These requirements are identified using keywords. It is particularly important to set up the libraries mentioned in Chapter 1, Unsupervised Machine Learning, for any content provided later in the book.

The requirements for every chapter are given in the following table:

Chapter Number       Requirements
1                    • Python (3.4 recommended)
                     • sklearn (NumPy, SciPy)
                     • matplotlib
2-4                  • theano
Subsequent chapters  • Semisup-learn
                     • Natural Language Toolkit (NLTK)
                     • BeautifulSoup
                     • Twitter API account
                     • XGBoost
                     • Lasagne
                     • TensorFlow

Index

A
AdaBoost 209
Adjusted Rand Index (ARI) 10
area under the curve (AUC) 146, 179
autoencoders
  about 57, 58
  denoising 60, 61
  topology 58, 59
  training 59, 60
averaging ensembles
  about 203
  bagging algorithms, using 203-205
  random forests, using 205-208

B
backoff taggers 139
backoff tagging 139, 140
bagging 143-146
bagging algorithms
  using 203-205
Batch Normalization 99
BeautifulSoup
  text data, cleaning 131, 132
Best Matching Unit (BMU) 19
Bing Traffic API 176, 185-187
blend-of-blends 215
Blocks 244
boosting methods
  applying 209-211
  Extreme Gradient Boosting (XGBoost), using 212-214
Borda count 231
Brill taggers 139

C
carp 138
Champion/Challenger 230
CIFAR-10 dataset 85
clustering
completeness score 10
composable layer 81
Contrastive Pessimistic Likelihood Estimation (CPLE) 102, 114, 115
convnet topology
  about 79-81
  backward pass 88
  forward pass 88
  implementing 88-92
  pooling layers 85-87
  training 88
convolutional neural networks (CNN)
  about 77, 78, 239
  applying 92-99
  convnet topology 79-81
  convolution layers 81-84
correlation 167, 168
covariance

D
data
  acquiring, via Twitter 180
deep belief network (DBN)
  about 27, 49
  applying 50-53
  training 50
  validating 54
DeepFace 78
denoising autoencoders (dA)
  about 57, 60, 61
  applying 62-66
DepthConcat element 91
development tools
  about 236
  Lasagne 236
  libraries usage, deciding 244, 245
  TensorFlow 236
Diabolo network 57
dynamic applications
  models, using 221, 222

E
eigenvalue
eigenvector
elbow method 14, 211
ensembles
  about 202, 203
  applying 218-221
  averaging ensembles 203
  boosting methods, applying 209-211
  stacking ensembles, using 215-218
Extreme Gradient Boosting (XGBoost)
  using 212-214
extremely randomized trees (ExtraTrees) 206

F
Fast Fourier Transform 88
feature engineering
  about 129, 130, 175, 176
  data, acquiring via RESTful APIs 176, 177
  variables, deriving 187-191
  variables, selecting 187-191
  weather API, creating 191-199
feature engineering, for ML applications
  about 157
  effective derived variables, creating 160, 161
  non-numeric features, reinterpreting 162-165
  rescaling techniques, using 157-160
feature selection
  correlation 167, 168
  genetic models 173, 174
  LASSO 169, 170
  performing 167
  Recursive Feature Elimination (RFE) 170-173
  techniques, using 165, 166
feature set
  creating 156
  feature engineering, for ML applications 157
  feature selection techniques, using 165, 166
Fisher's discriminant ratio 113
Fully Connected layer 89

G
genetic models 173, 174
Gibbs sampling 35
Gini Impurity (gini) 217
Go 78
GoogLeNet 78, 90
gradient descent algorithms
  URL 157

H
h-dimensional representation 58
heart dataset
  URL 108
hierarchical grouping 67
homogeneity score 10

I
i-dimensional input 58
ImageNet 78
Inception network 90

K
Keras 244
k-means clustering
  about 1
  clustering analysis 8-13
  configuration, tuning 13-18
K-Nearest Neighbors (KNN) 205
applying 209-211 Extreme Gradient Boosting (XGBoost), using 212-214 Borda count 231 Brill taggers 139 D data acquiring, via Twitter 180 deep belief network (DBN) about 27, 49 applying 50-53 training 50 validating 54 DeepFace 78 [ 251 ] denoising autoencoders (dA) about 57, 60, 61 applying 62-66 DepthConcat element 91 development tools about 236 Lasagne 236 libraries usage, deciding 244, 245 TensorFlow 236 Diabolo network 57 dynamic applications models, using 221, 222 E eigenvalue eigenvector elbow method 14, 211 ensembles about 202, 203 applying 218-221 averaging ensembles 203 boosting methods, applying 209-211 stacking ensembles, using 215-218 Extreme Gradient Boosting (XGBoost) using 212-214 extremely randomized trees (ExtraTrees) 206 F Fast Fourier Transform 88 feature engineering about 129, 130, 175, 176 data, acquiring via RESTful APIs 176, 177 variables, deriving 187-191 variables, selecting 187-191 weather API, creating 191-199 feature engineering, for ML applications about 157 effective derived variables, creating 160, 161 non-numeric features, reinterpreting 162-165 rescaling techniques, using 157-160 feature selection correlation 167, 168 genetic models 173, 174 LASSO 169, 170 performing 167 Recursive Feature Elimination (RFE) 170-173 techniques, using 165, 166 feature set creating 156 feature engineering, for ML applications 157 feature selection techniques, using 165, 166 Fisher's discriminant ratio 113 Fully Connected layer 89 G genetic models 173, 174 Gibbs sampling 35 Gini Impurity (gini) 217 Go 78 GoogLeNet 78, 90 gradient descent algorithms URL 157 H h-dimensional representation 58 heart dataset URL 108 hierarchical grouping 67 homogeneity score 10 I i-dimensional input 58 ImageNet 78 Inception network 90 K Keras 244 k-means clustering about 1, clustering clustering analysis 8-13 configuration, tuning 13-18 K-Nearest Neighbors (KNN) 205 [ 252 ] L Lasagne 236-238 LASSO 169, 170 LeNet 89 libraries usage, deciding 244, 245 M Markov Chain Monte Carlo 
(MCMC) 36 max-pooling 85 mean-pooling 85 modeling risk factors key parameter 229 longitudinally variant 228 slow change 229 models modeling risk factors, identifying 228, 229 robustness 222-228 robustness, managing 230-233 using, in dynamic applications 221, 222 Motor Vehicle Accident (MVA) 183 multicollinearity 167 Multi-Layer Perceptron (MLP) 29 N Natural Language Toolkit (NLTK) about 137 used, for tagging 137 n-dimensional input 60 Network In Network (NIN) 91 network topologies 29-32 neural networks about 28 composition 28, 29 connectivity functions 29 learning process 28 neurons 29 n-gram tagger 138 O OpinRank Review dataset about 67 URL 68 orthogonalization orthonormalization overcomplete 60 P Permanent Contrastive Divergence (PCD) 35 Platt calibration 107 pooling layers 85-87 porter stemmer 141 Pragmatic Chaos model 216 price-earnings (P/E) ratio 161 principal component analysis (PCA) about 1, employing 4-7 features 2-4 Pylearn2 244 R random forests about 143-146 using 205-208 random patches 143, 204 random subspaces 203 Rectified Linear Units (ReLU) 91 Recursive Feature Elimination (RFE) 167-173 RESTful APIs data, acquiring 176, 177 model performance, testing 177-179 Restricted Boltzmann Machine (RBM) about 27, 33, 34 applications 37-48 topology 34, 35 training 35-37 Root Mean Squared Error (RMSE) 173 S scikit-learn self-organizing maps (SOM) about 1, 18, 19, 29 employing 20-23 self-training about 103-105 Contrastive Pessimistic Likelihood Estimation (CPLE) 114, 115 [ 253 ] implementing 105-110 improving 110-113 selection process, improving 114 semi-supervised learning about 101-103 self-training 103-105 using 103 sequential tagging 138, 139 Silhouette Coefficient 11 software requisites 249 stacked denoising autoencoders (SdA) about 57, 66, 67 applying 67-74 performance, assessing 74 stacking ensembles using 215-218 stemming 141, 142 Stochastic Gradient Descent (SGD) 108 stride 82 subtaggers 139 sum-pooling 85 Support Vector Classification (SVC) 171 T 
tagging backoff tagging 139, 140 sequential tagging 138, 139 with, Natural Language Toolkit (NTLK) 137 TB-scale datasets tensor 84 TensorFlow about 239, 240 using 241-244 text data cleaning 131 cleaning, with BeautifulSoup 131, 132 features, creating 141 punctuation, managing 132-136 tokenization, managing 132-136 words, categorizing 136, 137 words, tagging 136, 137 text feature engineering about 130, 131 bagging 143-146 prepared data, testing 146-153 random forests 143-146 stemming 141, 142 text data, cleaning 131 Theano 61 tokenization 132 transforming autoencoder 87 translation-invariance 85 Translink Twitter 180-183 trigram tagger 138 Twitter Bing Traffic API 185-187 consumer comments, analyzing 184 Translink Twitter, using 180-183 using 180 U UCI Handwritten Digits dataset using U-Matrix 22 unigram tagger 138 V validity measure (v-measure) 10 v-fold cross-validation 16 W weather API creating 191-199 Y Yahoo Weather API 177 Z Zipf distribution 164 [ 254 ] ... Learning with Python Solve challenging data science problems by mastering cutting-edge machine learning techniques in Python John Hearty BIRMINGHAM - MUMBAI Advanced Machine Learning with Python Copyright... writes plenty of code in C, Bash, Python, and Java on his cluster of Pis He's already authored two books on Raspberry Pi and reviewed three other titles related to Python for Packt Publishing His... is largely unable to stop asking questions In his own time, he routinely builds ML solutions in Python to fulfil a broad set of personal interests These include a novel variant on the StyleNet
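Collated as in the Code Requirements appendix, the book's dependencies translate naturally into a single pinned requirements file. The package names below are assumptions about the PyPI spellings of the libraries the appendix lists, so treat this as a sketch rather than a verified manifest:

```text
# requirements.txt sketch for the book's examples (package names assumed)
numpy
scipy
scikit-learn
matplotlib
theano
nltk
beautifulsoup4
xgboost
lasagne
tensorflow
# Semisup-learn: typically installed from its source repository, not PyPI
# Twitter API access requires account credentials, not a pip package
```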

Posted: 24/08/2021, 15:25
