Python: End-to-end Data Analysis
Leverage the power of Python to clean, scrape, analyze, and visualize your data
A course in three modules
BIRMINGHAM - MUMBAI

Python: End-to-end Data Analysis
Copyright © 2016 Packt Publishing
All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Published on: May 2017
Production reference: 1050517
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK
ISBN 978-1-78839-469-7
www.packtpub.com

Credits
Authors: Phuong Vo.T.H, Martin Czygan, Ivan Idris, Magnus Vilhelm Persson, Luiz Felipe Martins
Content Development Editor: Aishwarya Pandere
Reviewers: Dong Chao, Hai Minh Nguyen, Kenneth Emeka Odoh, Bill Chambers, Alexey Grigorev, Dr. Vahid Mirjalili, Michele Usuelli, Hang (Harvey) Yu, Laurie Lugrin, Chris Morgan, Michele Pratusevich
Graphics: Jason Monteiro
Production Coordinator: Deepika Naik

Preface
The use of Python for data analysis and visualization has only increased in popularity in the last few years. The aim of this book is to develop skills to effectively approach almost any data analysis problem and extract all of the available information. This is done by introducing a range of
varying techniques and methods such as uni- and multivariate linear regression, cluster finding, Bayesian analysis, machine learning, and time series analysis. Exploratory data analysis is a key aspect of getting a sense of what can be done and of maximizing the insights gained from the data. Additionally, emphasis is put on presentation-ready figures that are clear and easy to interpret.

What this learning path covers
Module 1, Getting Started with Python Data Analysis, shows how to work with time-oriented data in Pandas. How you clean, inspect, reshape, merge, or group data – these are the concerns of this module. The library of choice in the course will again be Pandas.
Module 2, Python Data Analysis Cookbook, demonstrates how to visualize data and mentions frequently encountered pitfalls. It also discusses statistical probability distributions and correlation between two variables.
Module 3, Mastering Python Data Analysis, introduces linear, multiple, and logistic regression, with in-depth examples of using the SciPy and statsmodels packages to test various hypotheses about relationships between variables.

What you need for this learning path
Module 1: There are not too many requirements to get started. You will need a Python programming environment installed on your system. Under Linux and Mac OS X, Python is usually installed by default. Installation on Windows is supported by an excellent installer provided and maintained by the community. This book uses a recent Python 2, but many examples will work with Python 3 as well. The versions of the libraries used in this book are the following: NumPy 1.9.2, Pandas 0.16.2, matplotlib 1.4.3, tables 3.2.2, pymongo 3.0.3, redis 2.10.3, and scikit-learn 0.16.1. As these packages are all hosted on PyPI, the Python Package Index, they can be easily installed with pip. To install NumPy, you would write:

$ pip install numpy

If you are not using them already, we suggest you take a look at virtual environments for managing an isolated
Python environment on your computer. For Python 2, there are two packages of interest: virtualenv and virtualenvwrapper. Since Python 3.3, there is a tool in the standard library called pyvenv (https://docs.python.org/3/library/venv.html), which serves the same purpose.
Most libraries will have an attribute for the version, so if you already have a library installed, you can quickly check its version:

>>> import redis
>>> redis.__version__
'2.10.3'

This works well for most libraries. A few, such as pymongo, use a different attribute (pymongo uses just version, without the underscores). While all the examples can be run interactively in a Python shell, we recommend using IPython. IPython started as a more versatile Python shell, but has since evolved into a powerful tool for exploration and sharing. We used IPython 4.0.0 with Python 2.7.10. IPython is a great way to work interactively with Python, be it in the terminal or in the browser.

Module 2: First, you need a Python distribution. I recommend the full Anaconda distribution, as it comes with the majority of the software we need. I tested the code with Python 3.4 and the following packages:
• joblib 0.8.4
• IPython 3.2.1
• NetworkX 1.9.1
• NLTK 3.0.2
• Numexpr 2.3.1
• pandas 0.16.2
• SciPy 0.16.0
• seaborn 0.6.0
• sqlalchemy 0.9.9
• statsmodels 0.6.1
• matplotlib 1.5.0
• NumPy 1.10.1
• scikit-learn 0.17
• dautil 0.0.1a29
For some recipes, you need to install extra software, but this is explained whenever the software is required.

Module 3: All you need to follow the examples in this book is a computer running any recent version of Python. While the examples use Python 3, they can easily be adapted to work with Python 2 with only minor changes. The packages used in the examples are NumPy, SciPy, matplotlib, Pandas, statsmodels, PyMC, and scikit-learn. Optionally, the packages basemap and cartopy are used to plot coordinate points on maps. The easiest way to obtain and maintain a Python environment that meets
all the requirements of this book is to download a prepackaged Python distribution. In this book, we have checked all the code against Continuum Analytics' Anaconda Python distribution and Ubuntu Xenial Xerus (16.04) running Python. To download the example data and code, an Internet connection is needed.

Who this learning path is for
This learning path is for developers, analysts, and data scientists who want to learn data analysis from scratch. This course will provide you with a solid foundation from which to analyze data of varying complexity. A working knowledge of Python (and a strong interest in playing with your data) is recommended.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the course's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code
You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps: log in or register to our website using your e-mail address and password. Hover the mouse pointer on the SUPPORT tab at the top. Click on Code Downloads & Errata. Enter the name of the course in the Search box. Select the course for which you're looking to download the code files. Choose from the drop-down menu where you purchased this
course from. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux
The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Python-End-to-end-Data-Analysis. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please
provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at copyright@packtpub.com with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions
If you have a problem with any aspect of this course, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

Index
Iris-Setosa 151 Iris-Versicolour 151 Iris-Virginica 151 J jackknife resampling reference link 249 jackknifing 247 569 Java Runtime Environment (JRE) 325 JavaScript Object Notation (JSON) 569 joblib installation link 227 used, for reusing models 458, 459 jq tool 129 Just in time compiling Numba, using 533, 534 K kernel density estimating 244-246 estimating, reference link 246 kernel density plots and box plots, combing with violin plots 220, 221 kernels 564 K-fold cross-validation 569 K-means clustering reference link 331 kmeans() function reference link 509 kurtosis calculating 551-555 L LDA reference link 438 learning curves 432, 569 least recently used (LRU) cache used, for caching 556-558 leaves 442 legends 75-77 lemmatization about 406-409 reference link 410 less equal (le) function 43 less than (lt) function 43 leverage 229 libraries, for data processing Mirador Modular toolkit for data processing (MDP) Natural language processing toolkit (NLTK) Orange RapidMiner Statsmodels Theano libraries, implemented in C++ Caffe MLpack MultiBoost Vowpal Wabbit libraries, in data analysis Mahout Mallet overview Spark Weka linear algebra about 26 arbitrary precision, using for 297, 298 with NumPy 26, 27 linear discriminant analysis (LDA) about 569 applying, for dimension reduction 437, 438 linkage() function reference link 461 liquidity stocks, ranking with 370, 371 LiveStats reference link 556 lmplot() function reference link 205 logarithmic plots 570 logarithms used, for transforming data 282, 283 logging for robust error
checking 182, 184 reference link 185 [ 598 ] logistic function 570 logit() function applying, for transforming proportions 286, 287 lombscargle() function reference link 351 Lomb-Scargle periodogram about 349, 570 reference link 351 using 349, 350 lu_solve() function reference link 299 lxml documentation URL 310 M machine learning (ML) machine learning models defining 147, 148 supervised learning 148 unsupervised learning 148 Mahout about reference main sequence 270 Mallet about reference mathematics and statistics reference links 582 Matplotlib about 11, 574 configuring 190-194 references 194 Matplotlib API Primer about 62-64 figures 67 line properties 65, 66 subplots 67-69 matplotlib color maps reference link 208, 209 selecting 208, 209 matrix of scatterplots viewing 213-215 matthews_corrcoef() function reference link 497 Matthews correlation coefficient (MCC) about 470-570 reference link 497 maximum clique 425 maximum drawdown 372 maximum likelihood estimation (MLE) method 239 MayaVi 81 mean calculating 551-555 mean absolute deviation (MAD) 236 mean_absolute_error() function reference link 492 mean absolute error (MeanAE) calculating 490, 491 reference link 492 Mean Absolute Percentage Error (MAPE) determining 485, 486 reference link 487 Mean Percentage Error (MPE) determining 485, 486 reference link 487 mean silhouette coefficient used, for evaluating clusters 479-481 mean_squared_error() function reference link 479 mean squared error (MSE) about 343 computing 476-478 reference link 479 medfilt() documentation reference link 362 median_absolute_error() function reference link 479 median absolute error (MedAE) computing 476-478 mel frequency spectrum about 354 reference link 354 mel scale about 351 reference link 354, 356 Memory class reference link 459 memory leaks 550 570 [ 599 ] memory_profiler module reference link 551 used, for profiling memory usage 550, 551 memory usage profiling, with memory_profiler module 550, 551 metadata extracting, from images 521, 
523 methods for manipulating documents 119 Miniconda 168 Mirador about reference missing data working with 51-53 MLpack about reference models reusing, with joblib 458, 459 Modern Portfolio Theory (MPT) about 397 reference link 399 Modular toolkit for data processing (MDP) about reference Monte Carlo method about 249 reference link 252 Moore's law 570 moving block bootstrapping time series data about 359-361 reference link 362 mpld3 d3.js, used for visualization via 215-217 mpmath 270 MultiBoost multiple models majority voting 439-441 stacking 438-441 multiple tasks launching, with concurrent.futures module 540-542 multiple threads running, with threading module 536-539 N named entities recognizing 410-412 named-entity recognition (NER) about 410, 570 reference link 410, 412 Natural language processing toolkit (NLTK) nested cross-validation 455 network graphs visualizing, with hive plots 221-223 NetworkX reference link 222 news articles tokenizing, in sentences 405 tokenizing, in words 405, 406 n-grams 406 noisy data central tendency, measuring 275-277 non-ASCII text dealing with 308, 310 non-negative matrix factorization (NMF) documentation link 414 reference link 414 used, for extraction of topics 412, 414 non-parametric runs test used, for examining market 382-384 not equal (ne) function 43 Numba used, for Just in time compiling 533-535 numerical expressions speeding up, with Numexpr 535, 536 Numexpr reference link 536 used, for speeding up numerical expressions 535, 536 NumPy about 10, 13, 575, 576 linear algebra with 26 random numbers 27-30 URL 182 NumPy arrays about 14 array creation 16 data type 14, 16 fancy indexing 19 indexing 18 numerical operations on arrays 20 slicing 18 NumPy print options seeding 194, 195 URL 196 O object detection 514 object-relational mapping (ORM) 302, 570 octave 503 Open Computing Language (OpenCL) about 570 used, for harnessing GPU 564-566 Open Source Computer Vision (OpenCV) about 570 reference link 500 setting up 500-503 Orange 
about reference Ostu's thresholding method about 512 reference link 514 outliers about 270 clipping 270, 271 filtering 270, 271 reference link 270 overfitting 163 P Pandas about 10, 576 data structure 34 configuring 188, 189 package overview 33 parsing functions 111 URL 188 pandas library about 188 URL 188 Pandas objects parameters 110 PCA class reference link 436 pdist() function reference link 461 peaks analyzing 338-340 Pearson's correlation reference link 260 used, for correlating variables 257-260 PEP8 about 14 URL 14 pep8 analyzer URL 197 using 196 periodogram() function reference link 336 periodograms used, for performing spectral analysis 334, 335, 336 pesentations reference links 582, 583 phase synchronization measuring 340-342 phi coefficient 570 plot types bar plot 71 contour plot 72 exploring 70 histogram plot 74 scatter plot 70 point biserial correlation reference link 265 used, for correlating binary variable 263, 264 used, for correlating continuous variable 263, 264 Poisson distribution about 571 aggregated counts, fitting 238, 240 reference link 241 posterior distribution 241 power ladder used, for transforming data 280, 281 [ 601 ] power spectral density estimating, with Welch's method 336-338 precision computing 469, 470, 471 reference link 472 precision_score() function reference link 472 prediction performance measuring 162-164 principal component analysis (PCA) about 160, 570 applying, for dimension reduction 435 reference link 436 principal component regression (PCR) about 435 reference link 436 principal components 435, 570 prior distribution 241 probability weights used, for sampling 249-252 probplot() function reference link 476 Proj.4 reference link 224 proportions transforming, by applying logit() function 286-288 PyMongo 11 PyOpenCL reference link 566 PyOpenCL 2015.2.3 reference link 564 Python applications sandboxing, with Docker images 174, 175 Python data visualization tools about 80 Bokeh 81 MayaVi 81, 82 Python libraries, in data 
analysis about Matplotlib 11 NumPy 10 Pandas 10 PyMongo 11 scikit-learn library 11 Python threading reference link 539 R R library homepage link 227 RandomForestClassifier class reference link 445 random forests about 442 reference link 445, 459 used, for learning 442, 443, 444, 445 random number generators seeding 194, 195 random walk hypothesis (RWH) 385 random walks reference link 386 testing for 385, 386 RANSAC algorithm reference link 448 used, for fitting noisy data 445-447 RapidMiner about reference recall computing 469-471 reference link 472 recall_score() function reference link 472 receiver operating characteristic (ROC) examining 472, 473 reference link 474 regressor visualizing 475, 476 reports standardizing 196-199 reproducible data analysis 168 reproducible sessions 587 requests-cache website reference link 560 rescaled range 363 reference link 366 residual sum of squares (RSS) calculating 490, 491 reference link 492 [ 602 ] Resilient Distributed Datasets (RDDs) 326 resources accessing asynchronously, with asyncio module 543-545 returns statistics analyzing 374-376 RFE class reference link 434 RGB (red, green and blue) 508 risk and return exploring 380 risk-free rate 380 robust error checking with logs 182, 184 robust linear model fitting 288-290 robust regression 288 571 roc_auc_score() function reference link 474 S savgol_filter() function reference link 348 Savitzky-Golay filter about 346 reference link 348 Scale-invariant Feature Transform (SIFT) about 503 applying 503, 504 documentation, reference link 505 reference link 505 scatter plot 70, 571 scikit-learn 226 about 577 references 194, 196 scikit-learn library 11 scikit-learn modules data representation, defining 150-152 defining, for different models 148, 149 SciPy about 578 for exponential distribution, reference link 236 for Poisson distribution 241 seaborn 578 seaborn color palettes about 205-207 reference link 208 selecting 205 search engine indexing reference link 418 security market line 
(SML) 380 Selenium URL 305 using 302 Series 34, 35 shapefile format 224 shared nothing architecture about 547 reference link 549 shared-nothing architecture 571 Sharpe ratio about 370 reference link 372 stocks, ranking with 370, 371 short-time Fourier transform (STFT) 351, 571 signal processing 571 signals analyzing, with discrete cosine transform (DCT) 354-356 silhouette coefficients about 479 reference link 481 silhouette_score() function reference link 481 simple and log returns computing 368, 369 reference link 369 Single Instruction Multiple Data (SIMD) 13 skewness calculating 551-555 smoothing evaluating 346-348 social network closeness centrality calculating 420, 421 social network density computing 418-420 software aspects 532 software performance improving 532 [ 603 ] Sortino ratio about 372, 373 reference link 374 stocks, ranking with 372, 373 Spark about data, clustering 327-331 references setting up 326, 327 Spearman rank correlation about 571 reference link 263 used, for correlating variables 260-262 spectral analysis performing, with periodograms 334-336 reference link 336 spectral clustering about 571 reference link 529 used, for segmenting images 527-529 spectral_clustering() function reference link 529 Speeded Up Robust Features (SURF) detecting 505-507 reference link 507 split() function reference link 520 SQLAlchemy reference link 319 square root of the MSE (RMSE) 477 stacking about 439 reference link 441 Stanford Network Analysis Project (SNAP) about 221 reference link 221 star schema about 319, 571 implementing, with dimension tables 319-323 implementing, with fact 319-323 URL 324 statistics functions 43-45 Statsmodels about 9, 579 references 246 stemming 406-409 STFT reference link 354 stock prices database populating 391-395 tables, creating for 389, 391 stop words about 571 reference link 410 streaming algorithms 561 Structured Query Language (SQL) 571 supervised learning about 152-157 classification 152-157 classification problems 148 
regression 152-157 regression problems 148 Support Vector Machine (SVM) 154 SURF documentation, reference link 507 T tab separated values (TSV) 224 table column adding, to existing table 314, 315 tables creating, for stock prices database 389, 390 tabulate PyPi URL 199 term frequency 406 term frequency-inverse document frequency (tf-idf) 571 test web server setting up 317, 319 text method 77 texture features extracting, from images 523-526 TF-IDF about 406-409 reference link 410 TfidfVectorizer class reference link 410 Theano about 9, 462, 463 documentation link 464 installing 462 [ 604 ] threading module used, for running multiple threads 536-539 Timedeltas 100 time series about 572 plotting 101, 103, 104 reference, Pandas documentation 88 resampling 94 time series data block bootstrapping 357-359 block bootstrapping, reference link 359 downsampling 94, 96 unsampling 97, 98 time series primer 85 time slicing 536 time zone handling 99 tmean() reference link 277 topic models reference link 414 topic models;about 412 trend smoothing factor 343 trima() reference link 277 trimean 275 truncated mean 275 two-way ANOVA reference link 267 U unigrams 406 unit testing about 185 performing 185, 187 unittest.mock library URL 187 unsupervised learning clustering 158-162 defining 158-162 dimensionality reduction 158-162 V Vagrant about 170 reference link 172 URL 171 validation 455 validation curves 432 variables correlating, with Pearson's correlation 257-259 correlating, with Spearman rank correlation 260, 262 relations evaluating, with ANOVA 265-267 variance calculating 551-555 reference link 556 Viola-Jones object detection framework about 514 reference link 517 violin plots about 220, 572 box plots and kernel density plots, combing with 220, 221 reference link 221 VirtualBox about 170, 172 URL 171 virtualenv virtual environment, creating with 172-174 virtual environment creating, with virtualenv 172-174 creating, with virtualenvwrapper 172-174 URL 174 virtualenvwrapper URL 
174 virtual environment, creating with 172-174 visualization toolkit (VTK) 81 VotingClassifier class reference link 441 Vowpal Wabbit about reference W Wald-Wolfowitz runs test about 382 reference link 384 watermark extension using 177 [ 605 ] weak learners 452 web scraping 305-307 web browsing simulating 302-305 weighted least squares about 291 used, for taking variance into account 291, 292 Weka about reference welch() function reference link 338 Welch's method reference link 338 used, for estimating power spectral density 336-338 winsorizing technique 273, 274, 572 Within Cluster Sum of Squares (WCSS) 328 Within Set Sum Squared Error (WSSSE) 328 WordNetLemmatizer class reference link 410 X xmlstarlet tool 129 XPath URL 305 Y YAML about 170 URL 170 [ 606 ] A adjusted R-square value 88 age-adjustment 73 age-standardization 73 agglomerative clustering 122 Akaike Information Criterion (AIC) 88, 231 Anaconda Scientific Python URL autocorrelation function (ACF) 234 automatic function 234 autoregressive (AR) model 230, 231 autoregressive integrated moving average (ARIMA) model 235, 236 Aviation Accident Database 141 Aviation Safety Network URL 141 B Bayes formula 139 Bayes Information Criterion (BIC) 88 Bayesian analysis of data 151, 152, 153, 155, 156 Bayesian inference 138 Bayesian Information Criterion (BIC) 231 Bayesian method 138, 139 Binomial distribution 64, 67 boxplot 35, 36 C cartopy 160 Centers for Disease Control and Prevention (CDC) 265 classifier selecting 202 climate change 163 cluster finding 109, 110 clustering 108, 109, 183, 184, 187 components about 220 decomposing 221, 223, 224, 225 confidence intervals versus credible 139 continuous data 51 Continuum Analytics 261 coordinates plotting 160 cumulative distribution function 42, 43, 44, 45, 46, 48, 49 D data repositories 264 data visualization 265 data Bayesian analysis 151, 152, 153, 155 binning 148 Degrees From Equator (DFE) 116 dependent variable 88 Df Model 88 Df Residuals 88 differencing 227, 228 
discrete data 51 distance from equator (DFE) 83 distributions working with 51, 53, 55, 56, 57, 60, 61 divisive clustering 122 E Enthought URL 7, 262 Eurostat 264 expected value 62 experiment 42 exponential distribution 57 J F John Snow on cholera 110, 111, 112, 113, 114 Jupyter library URL Jupyter Notebook about 9, 238 command mode shortcuts 239 edit mode shortcuts 240 keyboard shortcuts 239 markdown cells 240 URL 261 Jupyter URL feature-rich emcee package URL 140 G General Social Survey (GSS) about 20 data, downloading 20 data, obtaining 20 data, reading 21, 22 URL 264 GitHub URL 262 gross domestic product (GDP) about 91 versus absolute latitude 116, 118, 119, 120 K H Heliocentric distance 123 hierarchical cluster algorithm 132, 134, 135, 137 hierarchical clustering analysis about 122 agglomerative clustering 122 data, reading in 122, 131 data, reducing 122, 127 divisive clustering 122 Hierarchical Data Format (HDF) 99 histogram 23, 26, 27 I imports 10 indexing and slicing 209, 211 intercept 72 International Organization for Standardization (ISO) 82 IPython library URL IPython notebook about K-means clustering 116 K-Nearest Neighbor 200 Kernel Density Estimation (KDE) 29 keyboard shortcuts 239 L Law Dome URL 164 Least Absolute Shrinkage and Selection Operator (LASSO) 176 libraries linear analysis Bayesian analysis and OLS, checking with 181, 182, 183 linear regression about 71, 72, 73, 176 climate data 176, 178, 179, 181 dataset, getting 73, 74, 75, 76, 78, 80 testing with 81, 82, 83, 84, 86, 87, 89, 91 logistic regression 100, 103, 104 M machine learning supervised 174, 175 unsupervised 174, 175 Markov Chain Monte Carlo (MCMC) 140 matplotlib library URL 8, 28 [ 268 ] matplotlib styles 256, 257, 259 mean 62 models about 41 creating 166, 167 forms 41 origin 63 sampling 166, 167 month binning by 158 MovieTweetings 50K movie ratings dataset URL 11 moving average (MA) 232, 233 Mpl toolkits 162 multivariate distribution 68 multivariate regression about 91 economic 
indicators, adding 91, 94, 97 taking step back 98 N National Center for Supercomputing Applications (NCSA) 99 National Health Statistics Reports URL 52 National Opinion Research Center (NORC) URL 20 National Transportation Safety Board (NTSB) about 141 URL 141 Normal Distribution Plot 31 notebook interface Notebook Python extensions about 241 codefolding extension 243 collapsible headings 245, 246 help panel 247 initialization cells 247 installing 241 NbExtensions menu item 249 ruler 249 skip-traceback 250 table of contents 252 Notebooks URL NTSB database about 141, 142, 143, 144, 146, 147 URL 141 NumPy URL O OpenData by Socrata 264 URL 141 Ordinary Least Squares (OLS) 176 Ordinary Least Squares (OLS) method 87, 96 P p parameter 233 Pandas data type 22 Pandas library example 10, 11, 12, 13, 16, 17 URL Pandas and time series data 206, 207, 209 partial autocorrelation function (PACF) 234 patterns 220 Pearson correlation coefficient 79 percent point function 56 point estimates 34 predefined distributions URL 51 probability density function (pdf) 61, 62, 63 probability mass function (pmf) 63 pseudorandom 43 pymc library URL PyPI URL 262 Python packages 140 Python weekly newsletter URL 261 Python and IPython 261 URL Q q parameter 233 [ 269 ] R Random Forest 201 random module URL 43 random variables 42 random variates (vs) 44, 58 resampling 212 resources about 261 general 261 S scatterplots 37, 38, 39 scikit-learn library URL Scikit-learn about 175 URL 175 scipy Scipy-toolkits URL 262 seeds classification about 188, 189 data visualization 190, 192 data, classifying 196, 197 feature, selecting 194, 196 sigmoid function 102 smoothing 212, 213, 214, 215, 217 Stack Overflow URL 262 STATA 21 stationarity 218, 219 statistical interface 32 Statistics Handbook URL 52 suicide rate versus GDP 116 Support Vector Machine (SVM) 188 surface sites URL 164 SVC linear kernal 198 SVC polynomial 200 SVC Radial Basis Function (RBF) 199 T time series analysis 204 time series models 229 U 
univariate distributions 68 unvariate data about 23 boxplot 35 boxplots 33 characterization 29, 30 histograms 23, 24, 25, 26, 27, 28 numeric summaries 33 numerical data 33, 34 statistical interface 32 V variable relationships 37, 38 W weighted least squares (WLS) 88 World Coordinate System (WCS) 123 World Health Organization 9WHO) 73 Thank you for buying Python: End-to-end Data Analysis About Packt Publishing Packt, pronounced 'packed', published its first book, Mastering phpMyAdmin for Effective MySQL Management, in April 2004, and subsequently continued to specialize in publishing highly focused books on specific technologies and solutions Our books and publications share the experiences of your fellow IT professionals in adapting and customizing today's systems, applications, and frameworks Our solution-based books give you the knowledge and power to customize the software and technologies you're using to get the job done Packt books are more specific and less general than the IT books you have seen in the past Our unique business model allows us to bring you more focused information, giving you more of what you need to know, and less of what you don't Packt is a modern yet unique publishing company that focuses on producing quality, cutting-edge books for communities of developers, administrators, and newbies alike For more information, please visit our website at www.packtpub.com Writing for Packt We welcome all inquiries from people who are interested in authoring Book proposals should be sent to author@packtpub.com If your book idea is still at an early stage and you would like to discuss it first before writing a formal book proposal, then please contact us; one of our commissioning editors will get in touch with you We're not just looking for published authors; if you have strong technical skills but no writing experience, our experienced editors can help you develop a writing career, or simply get some additional reward for your expertise Please check 
www.PacktPub.com for information on our titles.
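As a footnote to the preface's setup notes: the library-version check described there (most packages expose `__version__`, while pymongo exposes a plain `version` attribute) can be wrapped in a small helper that tries both conventions. This is a minimal sketch; the `library_version` name and the fallback order are illustrative choices, not something taken from the book itself.

```python
import types

def library_version(module):
    """Return a module's version string.

    Tries the common ``__version__`` attribute first, then falls
    back to plain ``version`` (the attribute pymongo uses), and
    returns "unknown" if neither is present as a string.
    """
    for attr in ("__version__", "version"):
        value = getattr(module, attr, None)
        if isinstance(value, str):
            return value
    return "unknown"

# Stand-in modules illustrating both conventions:
fake_redis = types.SimpleNamespace(__version__="2.10.3")
fake_pymongo = types.SimpleNamespace(version="3.0.3")
print(library_version(fake_redis))    # 2.10.3
print(library_version(fake_pymongo))  # 3.0.3
```

In a real session you would pass the imported module itself, for example `library_version(redis)` after `import redis`.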