Python Data Science Essentials
Second Edition

Become an efficient data science practitioner by understanding Python's key concepts

Alberto Boschetti
Luca Massaron

BIRMINGHAM - MUMBAI

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2015
Second edition: October 2016
Production reference: 1211016

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78646-213-8

www.packtpub.com

Credits

Authors: Alberto Boschetti, Luca Massaron
Copy Editor: Vikrant Phadke
Reviewer: Zacharias Voulgaris
Project Coordinator: Nidhi Joshi
Commissioning Editor: Veena Pagare
Proofreader: Safis Editing
Acquisition Editor: Namrata Patil
Indexer: Aishwarya Gangawane
Content Development Editor: Mayur Pawanikar
Graphics: Disha Haria
Technical Editor: Vivek Arora
Production Coordinator: Arvindkumar Gupta

About the Authors

Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a PhD in telecommunication engineering and currently lives and works in London. In his work projects, he faces
challenges ranging from natural language processing (NLP), behavioral analysis, and machine learning to distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.

I would like to thank my family, my friends, and my colleagues. Also, a big thanks to the open source community.

Luca Massaron is a data scientist and marketing research director specializing in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience of solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top ten Kaggler, he has always been very passionate about every aspect of data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, Luca believes that a lot can be achieved in data science just by doing the essentials.

To Yukiko and Amelia, for their loving patience.

"Roads go ever ever on, under cloud and under star, yet feet that wandering have gone turn at last to home afar"

About the Reviewer

Zacharias Voulgaris is a data scientist and technical author specializing in data science books. He has an engineering and management background, with post-graduate studies in information systems and machine learning. Zacharias has worked as a research fellow at Georgia Tech, investigating and applying machine learning technologies to real-world problems, as an SEO manager in an e-marketing company in Europe, as a program manager in Microsoft, and as a data scientist at US Bank and at G2 Web Services.

Dr. Voulgaris has also authored technical books, the most notable of which is Data Scientist: The Definitive Guide to Becoming a Data Scientist
(Technics Publications), and his newest book, Julia for Data Science (Technics Publications), was released during the summer of 2016. He has also written a number of data-science-related articles on blogs and participates in various data science/machine learning meetup groups. Finally, he has provided technical editorial aid in the book Python Data Science Essentials (Packt), by the same authors as this book.

I would very much like to express my gratitude to the authors of the book for giving me the opportunity to contribute to this project. Also, I'd like to thank Bastiaan Sjardin for introducing me to them and to the world of technical editing. It's been a privilege working with all of you.

Table of Contents

Preface

Chapter 1: First Steps
  Introducing data science and Python
  Installing Python
  Python 2 or Python 3?
  Step-by-step installation
  The installation of packages
  Package upgrades
  Scientific distributions
  Anaconda
  Leveraging conda to install packages
  Enthought Canopy
  PythonXY
  WinPython
  Explaining virtual environments
  conda for managing environments
  A glance at the essential packages
  NumPy
  SciPy
  pandas
  Scikit-learn
  Jupyter
  Matplotlib
  Statsmodels
  Beautiful Soup
  NetworkX
  NLTK
  Gensim
  PyPy
  XGBoost
  Theano
  Keras
  Introducing Jupyter
  Fast installation and first test usage
  Jupyter magic commands
  How Jupyter Notebooks can help data scientists
  Alternatives to Jupyter
  Datasets and code used in the book
  Scikit-learn toy datasets
  The MLdata.org public repository
  LIBSVM data examples
  Loading data directly from CSV or text files
  Scikit-learn sample generators
  Summary

Chapter 2: Data Munging
  The data science process
  Data loading and preprocessing with pandas
  Fast and easy data loading
  Dealing with problematic data
  Dealing with big datasets
  Accessing other data formats
  Data preprocessing
  Data selection
  Working with categorical and text data
  A special type of data – text
  Scraping the Web with Beautiful Soup
  Data processing with NumPy
  NumPy's n-dimensional array
  The basics of NumPy ndarray objects
  Creating NumPy arrays
  From lists to unidimensional arrays
  Controlling the memory size
  Heterogeneous lists
  From lists to multidimensional arrays
  Resizing arrays
  Arrays derived from NumPy functions
  Getting an array directly from a file
  Extracting data from pandas
  NumPy's fast operations and computations
  Matrix operations
  Slicing and indexing with NumPy arrays
  Stacking NumPy arrays
  Summary

Chapter 3: The Data Pipeline
  Introducing EDA
  Building new features
  Dimensionality reduction
  The covariance matrix
  Principal Component Analysis (PCA)
  PCA for big data – RandomizedPCA
  Latent Factor Analysis (LFA)
  Linear Discriminant Analysis (LDA)
  Latent Semantical Analysis (LSA)
  Independent Component Analysis (ICA)
  Kernel PCA
  T-SNE
  Restricted Boltzmann Machine (RBM)
  The detection and treatment of outliers
  Univariate outlier detection
  EllipticEnvelope
  OneClassSVM
  Validation metrics
  Multilabel classification
  Binary classification
  Regression
  Testing and validating
  Cross-validation
  Using cross-validation iterators
  Sampling and bootstrapping
  Hyperparameter optimization
  Building custom scoring functions
  Reducing the grid search runtime
  Feature selection
  Selection based on feature variance
  Univariate selection
  Recursive elimination
  Stability and L1-based selection
  Wrapping everything in a pipeline
  Combining features together and chaining transformations
  Building custom transformation functions
  Summary

Chapter 4: Machine Learning
  Preparing tools and datasets
  Linear and logistic regression
  Naive Bayes
  K-Nearest Neighbors

Strengthen Your Python Foundations

Classes, objects, and OOP

Classes are collections of methods and attributes. Briefly, attributes are variables of the object (for example, each instance of the Employee class has its own name, age, salary, and benefits; all of them are attributes). Methods are simply functions that act on the object, typically by reading or modifying its attributes (for example, to set the employee's name, to set his/her age, or to read this information from a database or from a CSV list). To create a class, use the class keyword. In the following example, we will create a class for an incrementer. The purpose of this object is to keep track of the value of an integer and, on request, increase it by 1:

    class Incrementer(object):
        def __init__(self):
            print("Hello world, I'm the constructor")
            self._i = 0

Every def indented inside the class body defines a method of the class. In this case, the method named __init__ sets the _i
internal variable to zero (it looks exactly like a function described in the previous chapter). Look carefully at the method's definition. Its argument is self (this is the object itself), and every internal variable access is made through self. Moreover, __init__ is not just a method; it's the constructor (it's called when the object is created). In fact, when we build an Incrementer object, this method is automatically called, as follows:

    i = Incrementer()
    # prints "Hello world, I'm the constructor"

Now, let's create the increment() method, which increments the _i internal counter and returns its new value. Within the class definition, include the method:

        def increment(self):
            self._i += 1
            return self._i

Then, run the following code:

    i = Incrementer()
    print(i.increment())
    print(i.increment())
    print(i.increment())

The preceding code results in the following output:

    Hello world, I'm the constructor
    1
    2
    3

Finally, let's see how to create methods that accept parameters. We will now create the set_counter method, which sets the _i internal variable. Within the class definition, add the following code:

        def set_counter(self, counter):
            self._i = counter

Then, run the following code:

    i = Incrementer()
    i.set_counter(10)
    print(i.increment())
    print(i._i)

The preceding code gives this output:

    Hello world, I'm the constructor
    11
    11

Note the last line of the preceding code, where you access the internal variable directly. Remember that in Python, all the internal attributes of objects are public by default, and they can be read, written, and changed externally; the leading underscore in _i is only a naming convention that signals internal use.

Exceptions

Exceptions and errors are strongly correlated, but they are different things. An exception, for example, can be gracefully handled. Here are some examples of exceptions:

    0/0
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ZeroDivisionError: integer division or modulo by zero

    len(1, 2)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: len() takes exactly one argument (2 given)

    pi * 2
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'pi' is not defined

In this example, three different exceptions have been raised (see the last line of each block; the exact wording of the messages varies between Python versions). To handle exceptions, you can use a try/except block in the following way:

    try:
        a = 10/0
    except ZeroDivisionError:
        a = 0

You can use more than one except clause to handle more than one exception. Optionally, you can end with a catch-all except clause that handles every other exception. In this case, the structure is as follows:

    try:
        ...  # code that may raise more than one kind of exception
    except KeyError:
        print("There is a KeyError error in the code")
    except (TypeError, ZeroDivisionError):
        print("There is a TypeError or a ZeroDivisionError error in the code")
    except:
        print("There is another error in the code")

Finally, it is important to mention the finally clause, which is executed in all circumstances. It's very handy for cleanup tasks (closing files, de-allocating resources, and so on), that is, for anything that must be done regardless of whether an error has occurred or not. In this case, the code assumes the following shape:

    try:
        ...  # code that may raise an exception
    except:
        ...  # exception handling
    finally:
        ...  # cleanup code, always executed

Iterators and generators

Looping through a list or a dictionary is very simple. Note that with dictionaries, the iteration is key-based, which is demonstrated in the following example:

    for entry in ['alpha', 'bravo', 'charlie', 'delta']:
        print(entry)
    # prints the content of the list, one entry per line

    a_dict = {1: 'alpha', 2: 'bravo', 3: 'charlie', 4: 'delta'}
    for key in a_dict:
        print(key, a_dict[key])
    # Prints:
    # 1 alpha
    # 2 bravo
    # 3 charlie
    # 4 delta

On the other hand, if you need to iterate through a sequence and generate objects on the fly, you can use a generator. A great advantage of doing this is that you don't have to create and store the complete sequence at the beginning. Instead, each object is built only when the generator is asked for its next value. As a simple example,
let's create a generator for a number sequence without storing the complete list in advance:

    def incrementer():
        i =
        while i