258 pages
DOCUMENT INFORMATION
Structure
Cover
Copyright
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Table of Contents
Preface
Chapter 1: First Steps
Introducing data science and Python
Installing Python
Python 2 or Python 3?
Step-by-step installation
A glance at the essential Python packages
NumPy
SciPy
pandas
Scikit-learn
IPython
Matplotlib
Statsmodels
Beautiful Soup
NetworkX
NLTK
Gensim
PyPy
The installation of packages
Package upgrades
Scientific distributions
Anaconda
Enthought Canopy
PythonXY
WinPython
Introducing IPython
The IPython Notebook
Datasets and code used in the book
Scikit-learn toy datasets
MLdata.org public repository
LIBSVM data examples
Loading data directly from CSV or text files
Scikit-learn sample generators
Summary
Chapter 2: Data Munging
The data science process
Data loading and preprocessing with pandas
Fast and easy data loading
Dealing with problematic data
Dealing with big datasets
Accessing other data formats
Data preprocessing
Data selection
Working with categorical and textual data
A special type of data – text
Data processing with NumPy
NumPy's n-dimensional array
The basics of NumPy ndarray objects
Creating NumPy arrays
From lists to unidimensional arrays
Controlling the memory size
Heterogeneous lists
From lists to multidimensional arrays
Resizing arrays
Arrays derived from NumPy functions
Getting an array directly from a file
Extracting data from pandas
NumPy fast operation and computations
Matrix operations
Slicing and indexing with NumPy arrays
Stacking NumPy arrays
Summary
Chapter 3: The Data Science Pipeline
Introducing EDA
Feature creation
Dimensionality reduction
Covariance matrix
Principal Component Analysis (PCA)
A variation of PCA for big data – randomized PCA
Latent Factor Analysis (LFA)
Linear Discriminant Analysis (LDA)
Latent Semantical Analysis (LSA)
Independent Component Analysis (ICA)
Kernel PCA
Restricted Boltzmann Machine (RBM)
Detection and treatment of outliers
Univariate outlier detection
EllipticEnvelope
OneClassSVM
Scoring functions
Multilabel classification
Binary classification
Regression
Testing and validating
Cross-validation
Using cross-validation iterators
Sampling and bootstrapping
Hyper-parameters optimization
Building custom scoring functions
Reducing grid search runtime
Feature selection
Univariate selection
Recursive elimination
Stability and L1-based selection
Summary
Chapter 4: Machine Learning
Linear and logistic regression
Naive Bayes
The k-Nearest Neighbors
Advanced nonlinear algorithms
SVM for classification
SVM for regression
Tuning SVM
Ensemble strategies
Pasting by random samples
Bagging with weak ensembles
Random Subspaces and Random Patches
Sequences of models – AdaBoost
Gradient tree boosting (GTB)
Dealing with big data
Creating some big datasets as examples
Scalability with volume
Keeping up with velocity
Dealing with variety
A quick overview of Stochastic Gradient Descent (SGD)
A peek into Natural Language Processing (NLP)
Word tokenization
Stemming
Word Tagging
Named Entity Recognition (NER)
Stopwords
A complete data science example – text classification
An overview of unsupervised learning
Summary
Chapter 5: Social Network Analysis
Introduction to graph theory
Graph algorithms
Graph loading, dumping, and sampling
Summary
Chapter 6: Visualization
Introducing the basics of matplotlib
Curve plotting
Using panels
Scatterplots
Histograms
Bar Graphs
Image visualization
Selected graphical examples with pandas
Boxplots and histograms
Scatterplots
Parallel coordinates
Advanced data learning representation
Learning curves
Validation curves
Feature importance
GBT partial dependence plot
Summary
Index
Content
... swing and go through the following topics:
• How to set up a Python Data Science Toolbox
• Using IPython
• An overview of the data that we are going to study in this book

Introducing data science and Python

Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. These components include linear algebra, ...

... and aspiring data scientists with limited Python experience and a working knowledge of data analysis, but no specific expertise in data science algorithms
• Data analysts who are proficient in statistical modeling using R or MATLAB tools and who would like to exploit Python to perform data science operations
• Developers and programmers who intend to expand their knowledge and learn about data manipulation ...

... of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, writing general-purpose computer programs in Python, or some other data analysis-specific language, such as MATLAB or R. The book will delve directly into Python ...
... for interactive computing, libraries, and datasets) necessary to immediately start on data science using Python.

Chapter 2, Data Munging, explains how to upload the data to be analyzed by applying alternative techniques when the data is too big for the computer to handle. It introduces all the key data manipulation and transformation techniques.

Chapter 3, The Data Science Pipeline, offers advanced explorative ...

... will delve directly into Python for data science, providing you with a straight and fast route to solve various data science problems using Python and its powerful data analysis and machine learning packages. The code examples that are provided in this book don't require you to master Python. However, they will assume that you at least know the basics of Python scripting, data structures such as lists and ...

... first start the interactive console with the ipython command that is used to run IPython, as shown here:

    $> ipython
    Python 2.7.6 (default, Sep 9 2014, 15:04:36)
    Type "copyright", "credits" or "license" for more information.
    IPython 2.3.1 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details ...
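The Chapter 2 summary in the excerpts above mentions alternative techniques for data that is too big for the computer to handle. As a rough, non-authoritative sketch of that general idea (and not the book's own method, which works with pandas), the snippet below streams a CSV file in small chunks using only the Python standard library and keeps a running mean. The helper names iter_chunks and streaming_mean, the column names, and the tiny in-memory sample are all hypothetical and exist only for illustration.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(csv_file, column, chunk_size=2):
    """Compute the mean of one numeric column without loading the whole file."""
    reader = csv.DictReader(csv_file)
    total, count = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        # Only one chunk of rows is held in memory at a time.
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical in-memory stand-in for a large CSV file on disk.
sample = io.StringIO(
    "sepal_length,species\n"
    "5.0,setosa\n"
    "7.0,versicolor\n"
    "6.0,virginica\n"
)
mean_length = streaming_mean(sample, "sepal_length")
print(mean_length)
```

With a real file, the io.StringIO object would be replaced by an open file handle, and chunk_size would be chosen to fit the available memory.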
... machine learning, business intelligence, and data storage and retrieval. The Python programming language, having conquered the scientific community during the last decade, is now an indispensable tool for the data science practitioner and a must-have tool for every aspiring data scientist. Python will offer you a fast, reliable, cross-platform, mature environment for data analysis, machine learning, and algorithmic ...

... your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a fast start. Also, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, only Python completes your data scientist skill ...

Python Data Science Essentials
Become an efficient data science practitioner by thoroughly understanding the key concepts of Python
Alberto Boschetti
Luca Massaron
BIRMINGHAM - MUMBAI