Statistical Application Development with R and Python - Second Edition
Credits
About the Author
Acknowledgment
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Data Characteristics
Questionnaire and its components
Understanding the data characteristics in an R environment
Experiments with uncertainty in computer science
Installing and setting up R
Using R packages
RSADBE – the book's R package
Python installation and setup
Using pip for packages
IDEs for R and Python
The companion code bundle
Discrete distributions
Discrete uniform distribution
Binomial distribution
Hypergeometric distribution
Negative binomial distribution
Poisson distribution
Continuous distributions
Uniform distribution
Exponential distribution
Normal distribution
Summary
2. Import/Export Data
Packages and settings – R and Python
Understanding data.frame and other formats
Constants, vectors, and matrices
Time for action – understanding constants, vectors, and basic arithmetic
What just happened?
Doing it in Python
Time for action – matrix computations
What just happened?
Doing it in Python
The list object
Time for action – creating a list object
What just happened?
The data.frame object
Time for action – creating a data.frame object
What just happened?
Have a go hero
The table object
Time for action – creating the Titanic dataset as a table object
What just happened?
Have a go hero
Using utils and the foreign packages
Time for action – importing data from external files
What just happened?
Doing it in Python
Importing data from MySQL
Doing it in Python
Exporting data/graphs
Exporting R objects
Exporting graphs
Time for action – exporting a graph
What just happened?
Managing R sessions
Time for action – session management
What just happened?
Doing it in Python
Pop quiz
Summary
3. Data Visualization
Packages and settings – R and Python
Visualization techniques for categorical data
Bar chart
Going through the built-in examples of R
Time for action – bar charts in R
What just happened?
Doing it in Python
Have a go hero
Dot chart
Time for action – dot charts in R
What just happened?
Doing it in Python
Spine and mosaic plots
Time for action – spine plot for the shift and operator data
What just happened?
Time for action – mosaic plot for the Titanic dataset
What just happened?
Pie chart and the fourfold plot
Visualization techniques for continuous variable data
Boxplot
Time for action – using the boxplot
What just happened?
Doing it in Python
Histogram
Time for action – understanding the effectiveness of histograms
What just happened?
Doing it in Python
Have a go hero
Scatter plot
Time for action – plot and pairs R functions
What just happened?
Doing it in Python
Have a go hero
Pareto chart
A brief peek at ggplot2
Time for action – qplot
What just happened?
Time for action – ggplot
What just happened?
Pop quiz
Summary
4. Exploratory Analysis
Packages and settings – R and Python
Essential summary statistics
Percentiles, quantiles, and median
Hinges
Interquartile range
Time for action – the essential summary statistics for The Wall dataset
What just happened?
Techniques for exploratory analysis
The stem-and-leaf plot
Time for action – the stem function in play
What just happened?
Letter values
Data re-expression
Have a go hero
Bagplot – a bivariate boxplot
Time for action – the bagplot display for multivariate datasets
What just happened?
Resistant line
Time for action – resistant line as a first regression model
What just happened?
Smoothing data
Time for action – smoothing the cow temperature data
What just happened?
Median polish
Time for action – the median polish algorithm
What just happened?
Have a go hero
Summary
5. Statistical Inference
Packages and settings – R and Python
Maximum likelihood estimator
Visualizing the likelihood function
Time for action – visualizing the likelihood function
What just happened?
Doing it in Python
Finding the maximum likelihood estimator
Using the fitdistr function
Time for action – finding the MLE using mle and fitdistr functions
What just happened?
Confidence intervals
Time for action – confidence intervals
What just happened?
Doing it in Python
Hypothesis testing
Binomial test
Time for action – testing probability of success
What just happened?
Tests of proportions and the chi-square test
Time for action – testing proportions
What just happened?
Tests based on normal distribution – one sample
Time for action – testing one-sample hypotheses
What just happened?
Have a go hero
Tests based on normal distribution – two sample
Time for action – testing two-sample hypotheses
What just happened?
Have a go hero
Doing it in Python
Summary
6. Linear Regression Analysis
Packages and settings – R and Python
The essence of regression
The simple linear regression model
What happens to the arbitrary choice of parameters?
Time for action – the arbitrary choice of parameters
What just happened?
Building a simple linear regression model
Time for action – building a simple linear regression model
What just happened?
Have a go hero
ANOVA and the confidence intervals
Time for action – ANOVA and the confidence intervals
What just happened?
Model validation
Time for action – residual plots for model validation
What just happened?
Doing it in Python
Have a go hero
Multiple linear regression model
Averaging k simple linear regression models or a multiple linear regression model
Time for action – averaging k simple linear regression models
What just happened?
Building a multiple linear regression model
Time for action – building a multiple linear regression model
What just happened?
The ANOVA and confidence intervals for the multiple linear regression model
Time for action – the ANOVA and confidence intervals for the multiple linear regression model
What just happened?
Have a go hero
Useful residual plots
Time for action – residual plots for the multiple linear regression model
What just happened?
Regression diagnostics
Leverage points
Influential points
DFFITS and DFBETAS
The multicollinearity problem
Time for action – addressing the multicollinearity problem for the gasoline data
What just happened?
Doing it in Python
Model selection
Stepwise procedures
The backward elimination
The forward selection
The stepwise regression
Criterion-based procedures
Time for action – model selection using the backward, forward, and AIC criteria
What just happened?
Have a go hero
Summary
7. Logistic Regression Model
Packages and settings – R and Python
The binary regression problem
Time for action – limitation of linear regression model
What just happened?
Probit regression model
Time for action – understanding the constants
What just happened?
Doing it in Python
Logistic regression model
Time for action – fitting the logistic regression model
What just happened?
Doing it in Python
Hosmer-Lemeshow goodness-of-fit test statistic
Time for action – Hosmer-Lemeshow goodness-of-fit statistic
What just happened?
Model validation and diagnostics
Residual plots for the GLM
Time for action – residual plots for logistic regression model
What just happened?
Doing it in Python
Have a go hero
Influence and leverage for the GLM
Time for action – diagnostics for the logistic regression
What just happened?
Have a go hero
Receiver operating characteristic curves
Time for action – ROC construction
What just happened?
Doing it in Python
Logistic regression for the German credit screening dataset
Time for action – logistic regression for the German credit dataset
What just happened?
Doing it in Python
Have a go hero
Summary
8. Regression Models with Regularization
Packages and settings – R and Python
The overfitting problem
Time for action – understanding overfitting
What just happened?
Doing it in Python
Have a go hero
Regression spline
Basis functions
Piecewise linear regression model
Time for action – fitting piecewise linear regression models
What just happened?
Natural cubic splines and the general B-splines
Time for action – fitting the spline regression models
What just happened?
Ridge regression for linear models
Protecting against overfitting
Time for action – ridge regression for the linear regression model
What just happened?
Doing it in Python
Ridge regression for logistic regression models
Time for action – ridge regression for the logistic regression model
What just happened?
Another look at model assessment
Time for action – selecting iteratively and other topics
What just happened?
Pop quiz
Summary
9. Classification and Regression Trees
Packages and settings – R and Python
Understanding recursive partitions
Time for action – partitioning the display plot
What just happened?
Splitting the data
The first tree
Time for action – building our first tree
What just happened?
Constructing a regression tree
Time for action – the construction of a regression tree
What just happened?
Constructing a classification tree
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Classification tree for the German credit data
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Have a go hero
Pruning and other finer aspects of a tree
Time for action – pruning a classification tree
What just happened?
Pop quiz
Summary
10. CART and Beyond
Packages and settings – R and Python
Improving the CART
Time for action – cross-validation predictions
What just happened?
Understanding bagging
The bootstrap
Time for action – understanding the bootstrap technique
What just happened?
How the bagging algorithm works
Time for action – the bagging algorithm
What just happened?
Doing it in Python
Random forests
Time for action – random forests for the German credit data
What just happened?
Doing it in Python
The consolidation
Time for action – random forests for the low birth weight data
What just happened?
Summary
Index