Python: Data Analytics and Visualization Table of Contents Python: Data Analytics and Visualization Credits Preface What this learning path covers What you need for this learning path Who this learning path is for Reader feedback Customer support Downloading the example code Errata Piracy Questions Module 1 Introducing Data Analysis and Libraries Data analysis and processing An overview of the libraries in data analysis Python libraries in data analysis NumPy Pandas Matplotlib PyMongo The scikit-learn library Summary NumPy Arrays and Vectorized Computation NumPy arrays Data types Array creation Indexing and slicing Fancy indexing Numerical operations on arrays Array functions Data processing using arrays Loading and saving data Saving an array Loading an array Linear algebra with NumPy NumPy random numbers Summary Data Analysis with Pandas An overview of the Pandas package The Pandas data structure Series The DataFrame The essential basic functionality Reindexing and altering labels Head and tail Binary operations Functional statistics Function application Sorting Indexing and selecting data Computational tools Working with missing data Advanced uses of Pandas for data analysis Hierarchical indexing The Panel data Summary Data Visualization The matplotlib API primer Line properties Figures and subplots Exploring plot types Scatter plots Bar plots Contour plots Histogram plots Legends and annotations Plotting functions with Pandas Additional Python data visualization tools Bokeh MayaVi Summary Time Series Time series primer Working with date and time objects Resampling time series Downsampling time series data Upsampling time series data Time zone handling Timedeltas Time series plotting Summary Interacting with Databases Interacting with data in text format Reading data from text format Writing data to text format Interacting with data in binary format HDF5 Interacting with data in MongoDB Interacting with data in Redis The simple value List Set Ordered set Summary 
Data Analysis Application Examples Data munging Cleaning data Filtering Merging data Reshaping data Data aggregation Grouping data Summary Machine Learning Models with scikit-learn An overview of machine learning models The scikit-learn modules for different models Data representation in scikit-learn Supervised learning – classification and regression Unsupervised learning – clustering and dimensionality reduction Measuring prediction performance Summary Module Getting Started with Predictive Modelling Introducing predictive modelling Scope of predictive modelling Ensemble of statistical algorithms Statistical tools Historical data Mathematical function Business context Knowledge matrix for predictive modelling Task matrix for predictive modelling Applications and examples of predictive modelling LinkedIn's "People also viewed" feature What it does? How is it done? Correct targeting of online ads How is it done? Santa Cruz predictive policing How is it done? Determining the activity of a smartphone user using accelerometer data How is it done? Sport and fantasy leagues How was it done? 
Python and its packages – download and installation Anaconda Standalone Python Installing a Python package Installing pip Installing Python packages with pip Python and its packages for predictive modelling IDEs for Python Summary Data Cleaning Reading the data – variations and examples Data frames Delimiters Various methods of importing data in Python Case – reading a dataset using the read_csv method The read_csv method Use cases of the read_csv method Passing the directory address and filename as variables Reading a txt dataset with a comma delimiter Specifying the column names of a dataset from a list Case – reading a dataset using the open method of Python Reading a dataset line by line Changing the delimiter of a dataset Case – reading data from a URL Case – miscellaneous cases Reading from an xls or xlsx file Writing to a CSV or Excel file Basics – summary, dimensions, and structure Handling missing values Checking for missing values What constitutes missing data? How missing values are generated and propagated Treating missing values Deletion Imputation Creating dummy variables Visualizing a dataset by basic plotting Scatter plots Histograms Boxplots Summary Data Wrangling Subsetting a dataset Selecting columns Selecting rows Selecting a combination of rows and columns Creating new columns Generating random numbers and their usage Various methods for generating random numbers Seeding a random number Generating random numbers following probability distributions Probability density function Cumulative density function Uniform distribution Normal distribution Using the Monte-Carlo simulation to find the value of pi Geometry and mathematics behind the calculation of pi Generating a dummy data frame Grouping the data – aggregation, filtering, and transformation Aggregation Filtering Transformation Miscellaneous operations Random sampling – splitting a dataset in training and testing datasets Method – using the Customer Churn Model Method – using sklearn Method – 
using the shuffle function Concatenating and appending data Merging/joining datasets Inner Join Left Join Right Join An example of the Inner Join An example of the Left Join An example of the Right Join Summary of Joins in terms of their length Summary Statistical Concepts for Predictive Modelling Random sampling and the central limit theorem Hypothesis testing Null versus alternate hypothesis Z-statistic and t-statistic Confidence intervals, significance levels, and p-values Different kinds of hypothesis test A step-by-step guide to a hypothesis test An example of a hypothesis test Chi-square tests Correlation Summary Linear Regression with Python Understanding the maths behind linear regression Linear regression using simulated data Fitting a linear regression model and checking its efficacy Finding the optimum value of variable coefficients Making sense of result parameters p-values F-statistics Residual Standard Error Implementing linear regression with Python Linear regression using the statsmodel library Multiple linear regression Multi-collinearity Variance Inflation Factor Model validation Training and testing data split Summary of models Linear regression with scikit-learn Feature selection with scikit-learn Handling other issues in linear regression Handling categorical variables Transforming a variable to fit non-linear relations Handling outliers Other considerations and assumptions for linear regression Summary Logistic Regression with Python Linear regression versus logistic regression Understanding the math behind logistic regression Contingency tables Conditional probability Odds ratio Moving on to logistic regression from linear regression Estimation using the Maximum Likelihood Method Likelihood function: Log likelihood function: Building the logistic regression model from scratch Making sense of logistic regression parameters Wald test Likelihood Ratio Test statistic Chi-square test Implementing logistic regression with Python Processing the data 
Data exploration Data visualization Creating dummy variables for categorical variables Feature selection Implementing the model Model validation and evaluation Cross validation Model validation The ROC curve Confusion matrix Summary Clustering with Python Introduction to clustering – what, why, and how? What is clustering? How is clustering used? Why do we do clustering? Mathematics behind clustering R radial layout / Radial layout random forest implementing, using Python / Implementing a random forest using Python features / Why random forests work? parameters / Important parameters for random forests random forest algorithm about / The random forest algorithm random forests about / Understanding and implementing random forests random numbers about / Generating random numbers and their usage generating / Generating random numbers and their usage usage / Generating random numbers and their usage methods, for generating / Various methods for generating random numbers seeding / Seeding a random number generating, following probability distributions / Generating random numbers following probability distributions random sampling about / Random sampling – splitting a dataset in training and testing datasets dataset, testing / Random sampling – splitting a dataset in training and testing datasets dataset, splitting / Random sampling – splitting a dataset in training and testing datasets Customer Churn Model, using / Method – using the Customer Churn Model sklearn, using / Method – using sklearn shuffle function, using / Method – using the shuffle function and central limit theorem / Random sampling and the central limit theorem RapidMiner about / An overview of the libraries in data analysis reference / An overview of the libraries in data analysis reader-driven narratives about / Reader-driven narratives Gapminder / Gapminder union address, state / The State of the Union address USA, mortality rate / Mortality rate in the USA example narratives / A few other example 
narratives read_csv method about / Case – reading a dataset using the read_csv method, The read_csv method filepath / The read_csv method sep / The read_csv method dtype / The read_csv method header / The read_csv method names / The read_csv method skiprows / The read_csv method index_col / The read_csv method skip_blank_lines / The read_csv method na_filter / The read_csv method use cases / Use cases of the read_csv method Receiver Operating Characteristic (ROC) curve about / Model validation Recursive Feature Elimination (RFE) / Feature selection with scikit-learn regression tree algorithm about / Regression tree algorithm regression trees about / Understanding and implementing regression trees advantages / Regression tree algorithm implementing, with Python / Implementing a regression tree using Python Relative Strength Indicator (RSI) URL / Obtaining data Residual Standard Error (RSE) about / Residual Standard Error result parameters about / Making sense of result parameters p-values / p-values F-statistics / F-statistics Residual Standard Error (RSE) / Residual Standard Error retrospective analytics about / Introducing predictive modelling right-tailed test about / Different kinds of hypothesis test Right Join about / Right Join characteristics / Right Join example / An example of the Right Join ROC curve about / The ROC curve confusion matrix / Confusion matrix S Scalar selection about / Scalar selection scatter plot about / Scatter plots, Scatter plots plotting / Scatter plots scatter plots about / Scatter plots and bubble charts, Scatter plots URL / Scatter plots Schelling Segregation Model (SSM) about / Schelling's Segregation Model Scientific PYthon Development EnviRonment (Spyder) / Anaconda from Continuum Analytics scientific visualization / Perception and presentation methods Scikit / Packages websites scikit-learn about / Python and its packages for predictive modelling features / Python and its packages for predictive modelling URL / Python and its 
packages for predictive modelling installing / Installing scikit-learn scikit-learn library about / The scikit-learn library scikit-learn modules defining, for different models / The scikit-learn modules for different models data representation, defining / Data representation in scikit-learn scikit-learn package URL / Linear regression SciPy about / NumPy, SciPy, and MKL functions, SciPy packages / SciPy linear equations, example / An example of linear equations vectorized numerical derivative / The vectorized numerical derivative / Packages websites Seaborn / Packages websites Sensitivity (True Positive Rate) / The ROC curve Series about / Series sets about / Sets shuffle function using / Method – using the shuffle function signal processing / Signal processing Silhouette Coefficient / Silhouette Coefficient sklearn using / Method – using sklearn slicing about / Slicing flat used / Slice using flat social networks analysis / Analysis of social networks Spark about / An overview of the libraries in data analysis reference / An overview of the libraries in data analysis sparse matrices visualize sparseness / Visualizing sparseness Specificity (True Negative Rate) / The ROC curve sports example about / A sports example URL / A sports example results, visually representing / Visually representing the results Spyder about / IDEs for Python, An overview of Spyder features / IDEs for Python components / An overview of Spyder square map plot / The square map plot SSA module URL / Stochastic block models stacks about / Stacks Standalone Python about / Standalone Python statistical algorithms, predictive modelling about / Ensemble of statistical algorithms supervised algorithms / Ensemble of statistical algorithms un-supervised algorithms / Ensemble of statistical algorithms statistical learning about / An overview of statistical and machine learning statistics best practices / Best practices for statistics statistics functions / Functional statistics Statsmodels about / An 
overview of the libraries in data analysis Stochastic block models about / Stochastic block models Stochastic Differential Equation (SDE) / Simulation examples stochastic model about / The stochastic model Monte Carlo simulation / Monte Carlo simulation portfolio valuation / The portfolio valuation simulation model / The simulation model geometric Brownian simulation / Geometric Brownian simulation diffusion-based simulation / The diffusion-based simulation stock price URL / Obtaining data stories creating, with data / Creating interesting stories with data reader-driven narratives / Why are stories so important?, Reader-driven narratives author-driven narratives / Why are stories so important?, Author-driven narratives supervised learning classification problems / An overview of machine learning models regression problems / An overview of machine learning models about / Supervised learning – classification and regression classification / Supervised learning – classification and regression regression / Supervised learning – classification and regression Support Vector Machine (SVM) about / Supervised learning – classification and regression Support vector machines (SVM) about / Support vector machines surface-3D plot / The surface-3D plot spyder-app / Anaconda from Continuum Analytics T t-statistic about / Z-statistic and t-statistic t-test / Best practices for statistics t-test (Student-t distribution) about / Z-statistic and t-statistic tab completion URL / Anaconda from Continuum Analytics task matrix, predictive modelling about / Task matrix for predictive modelling TextBlob URL / The Twitter text, The Naïve Bayes classifier text method about / Legends and annotations Theano about / An overview of the libraries in data analysis threshold model about / The threshold model Timedeltas about / Timedeltas time series reference, Pandas documentation / Working with date and time objects resampling / Resampling time series plotting / Time series plotting time series data 
downsampling / Downsampling time series data upsampling / Upsampling time series data time series primer about / Time series primer time zone handling about / Time zone handling tries about / Tries tuples about / Tuples Twitter text about / The Twitter text two-tailed test about / Different kinds of hypothesis test U uniform distribution about / Uniform distribution unsupervised learning defining / Unsupervised learning – clustering and dimensionality reduction clustering / Unsupervised learning – clustering and dimensionality reduction dimensionality reduction / Unsupervised learning – clustering and dimensionality reduction use cases, read_csv method about / Use cases of the read_csv method directory address and filename, passing as variables / Passing the directory address and filename as variables txt dataset, reading with comma delimiter / Reading a txt dataset with a comma delimiter dataset column names, specifying from list / Specifying the column names of a dataset from a list V value of pi calculating / Geometry and mathematics behind the calculation of pi Variance Inflation Factor (VIF) about / Variance Inflation Factor Veusz / Visualization plots with Anaconda VisPy / Interactive visualization packages about / VisPy URL / VisPy visualization benefits / How does visualization help decision-making? URL / How does visualization help decision-making?, Visualization plots about / Where does visualization fit in? plots / Visualization plots planning, need for / Why does visualization require planning? 
scientific visualization / Perception and presentation methods information visualization / Perception and presentation methods matplotlib used / Visualization using matplotlib visualization, best practices about / Some best practices for visualization comparison and ranking / Comparison and ranking correlation / Correlation distribution / Distribution location-specific or geodata / Location-specific or geodata part to whole / Part-to-whole relationships trends over time / Trends over time visualization, interactive about / Interactive visualization event listeners / Event listeners layouts / Layouts visualization example in sports / The visualization example in sports visualization plots, with Anaconda about / Visualization plots with Anaconda surface-3D plot / The surface-3D plot square map plot / The square map plot visualization toolkit (VTK) / MayaVi visualization tools, in Python about / Visualization tools in Python development tools / Development tools Canopy, from Enthought / Canopy from Enthought Anaconda, from continuum analytics / Anaconda from Continuum Analytics Vowpal Wabbit about / An overview of the libraries in data analysis reference / An overview of the libraries in data analysis VSTOXX data URL / The volatility plot, Implied volatilities W Wakari / Interactive visualization packages Wald test / Wald test web feeds about / Web feeds Weka about / An overview of the libraries in data analysis reference / An overview of the libraries in data analysis word clouds about / Word clouds installing / Installing word clouds input for / Input for word clouds web feeds / Web feeds Twitter text / The Twitter text stock price chart, plotting / Plotting the stock price chart data, obtaining / Obtaining data used, for viewing positive sentiments / Viewing positive sentiments using word clouds World Health Organization (WHO) / The Ebola example X xmlstarlet tool / Data munging Z Z-statistic about / Z-statistic and t-statistic Z-test / Best practices for 
statistics Z-test (normal distribution) about / Z-statistic and t-statistic
Framework for Data Visualization Data, information, knowledge, and insight Data Information Knowledge Data analysis and insight The transformation of data Transforming data into information Data collection... Anaconda Packages websites About matplotlib Bibliography Index
Python: Data Analytics and Visualization Copyright © 2017 Packt Publishing All rights reserved