Building Machine Learning Systems with Python Second Edition

Table of Contents

Building Machine Learning Systems with Python Second Edition
Credits
About the Authors
About the Reviewers
www.PacktPub.com
    Support files, eBooks, discount offers, and more
    Why subscribe?
    Free access for Packt account holders
Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
    Downloading the example code
    Errata
    Piracy
    Questions

1 Getting Started with Python Machine Learning
    Machine learning and Python – a dream team
    What the book will teach you (and what it will not)
    What to do when you are stuck
    Getting started
    Introduction to NumPy, SciPy, and matplotlib
    Installing Python
    Chewing data efficiently with NumPy and intelligently with SciPy
    Learning NumPy
    Indexing
    Handling nonexisting values
    Comparing the runtime
    Learning SciPy
    Our first (tiny) application of machine learning
    Reading in the data
    Preprocessing and cleaning the data
    Choosing the right model and learning algorithm
    Before building our first model…
    Starting with a simple straight line
    Towards some advanced stuff
    Stepping back to go forward – another look at our data
    Training and testing
    Answering our initial question
    Summary

2 Classifying with Real-world Examples
    The Iris dataset
    Visualization is a good first step
    Building our first classification model
    Evaluation – holding out data and cross-validation
    Building more complex classifiers
    A more complex dataset and a more complex classifier
    Learning about the Seeds dataset
    Features and feature engineering
    Nearest neighbor classification
    Classifying with scikit-learn
    Looking at the decision boundaries
    Binary and multiclass classification
    Summary

3 Clustering – Finding Related Posts
    Measuring the relatedness of posts
    How not to do it
    How to do it
    Preprocessing – similarity measured as a similar number of common words
    Converting raw text into a bag of words
    Counting words
    Normalizing word count vectors
    Removing less important words
    Stemming
    Installing and using NLTK
    Extending the vectorizer with NLTK’s stemmer
    Stop words on steroids
    Our achievements and goals
    Clustering
    K-means
    Getting test data to evaluate our ideas on
    Clustering posts
    Solving our initial challenge
    Another look at noise
    Tweaking the parameters
    Summary

4 Topic Modeling
    Latent Dirichlet allocation
    Building a topic model
    Comparing documents by topics
    Modeling the whole of Wikipedia
    Choosing the number of topics
    Summary

5 Classification – Detecting Poor Answers
    Sketching our roadmap
    Learning to classify classy answers
    Tuning the instance
    Tuning the classifier
    Fetching the data
    Slimming the data down to chewable chunks
    Preselection and processing of attributes
    Defining what is a good answer
    Creating our first classifier
    Starting with kNN
    Engineering the features
    Training the classifier
    Measuring the classifier’s performance
    Designing more features
    Deciding how to improve
    Bias-variance and their tradeoff
    Fixing high bias
    Fixing high variance
    High bias or low bias
    Using logistic regression
    A bit of math with a small example
    Applying logistic regression to our post classification problem
    Looking behind accuracy – precision and recall
    Slimming the classifier
    Ship it!
    Summary

6 Classification II – Sentiment Analysis
    Sketching our roadmap
    Fetching the Twitter data
    Introducing the Naïve Bayes classifier
    Getting to know the Bayes’ theorem
    Being naïve
    Using Naïve Bayes to classify
    Accounting for unseen words and other oddities
    Accounting for arithmetic underflows
    Creating our first classifier and tuning it
    Solving an easy problem first
    Using all classes
    Tuning the classifier’s parameters
    Cleaning tweets
    Taking the word types into account
    Determining the word types
    Successfully cheating using SentiWordNet
    Our first estimator
    Putting everything together
    Summary

7 Regression
    Predicting house prices with regression
    Multidimensional regression
    Cross-validation for regression
    Penalized or regularized regression
    L1 and L2 penalties
    Using Lasso or ElasticNet in scikit-learn
    Visualizing the Lasso path
    P-greater-than-N scenarios
    An example based on text documents
    Setting hyperparameters in a principled way
    Summary

8 Recommendations
    Rating predictions and recommendations
    Splitting into training and testing
    Normalizing the training data
    A neighborhood approach to recommendations
    A regression approach to recommendations
    Combining multiple methods
    Basket analysis
    Obtaining useful predictions
    Analyzing supermarket shopping baskets
    Association rule mining
    More advanced basket analysis
    Summary

9 Classification – Music Genre Classification
    Sketching our roadmap
    Fetching the music data
    Converting into a WAV format
    Looking at music
    Decomposing music into sine wave components
    Using FFT to build our first classifier
    Increasing experimentation agility
    Training the classifier
    Using a confusion matrix to measure accuracy in multiclass problems
    An alternative way to measure classifier performance using receiver-operator characteristics
    Improving classification performance with Mel Frequency Cepstral Coefficients
    Summary

10 Computer Vision
    Introducing image processing
    Loading and displaying images
    Thresholding
    Gaussian blurring
    Putting the center in focus
    Basic image classification
    Computing features from images
    Writing your own features
    Using features to find similar images
    Classifying a harder dataset
    Local feature representations
    Summary

11 Dimensionality Reduction
    Sketching our roadmap
    Selecting features
    Detecting redundant features using filters
    Correlation
    Mutual information
    Asking the model about the features using wrappers
    Other feature selection methods
    Feature extraction
    About principal component analysis
    Sketching PCA
    Applying PCA
    Limitations of PCA and how LDA can help
    Multidimensional scaling
    Summary

12 Bigger Data
    Learning about big data
    Using jug to break up your pipeline into tasks
    An introduction to tasks in jug
    Looking under the hood
    Using jug for data analysis
    Reusing partial results
    Using Amazon Web Services
    Creating your first virtual machines
    Installing Python packages on Amazon Linux
    Running jug on our cloud machine
    Automating the generation of clusters with StarCluster
    Summary

A Where to Learn More Machine Learning
    Online courses
    Books
    Question and answer sites
    Blogs
    Data sources
    Getting competitive
    All that was left out
    Summary

Index

L
labels / Learning to classify classy answers
Laplace smoothing / Accounting for unseen words and other oddities
Lasso / L1 and L2 penalties
latent Dirichlet allocation (LDA)
    about / Latent Dirichlet allocation
    Wikipedia URL / Latent Dirichlet allocation
    topic model, building / Building a topic model
lift
    about / Association rule mining
linear discriminant analysis (LDA) / Sketching our roadmap
    about / Limitations of PCA and how LDA can help
local feature representations
    about / Local feature representations
logistic regression
    about / Using logistic regression
    using / Using logistic regression
    example / A bit of math with a small example
    applying, to post classification problem / Applying logistic regression to our post classification problem
LSF (Load Sharing Facility) / Using jug to break up your pipeline into tasks

M
machine learning
    about / Machine learning and Python – a dream team
    first tiny application / Our first (tiny) application of machine learning
machine learning algorithm
    about / What the book will teach you (and what it will not)
Machine Learning Toolkit (Milk)
    URL / All that was left out
matplotlib
    URL / Introduction to NumPy, SciPy, and matplotlib
    about / Introduction to NumPy, SciPy, and matplotlib
matshow() function / Using a confusion matrix to measure accuracy in multiclass problems
MDP toolkit
    URL / All that was left out
Mel Frequency Cepstrum (MFC)
    used, for improving classification performance / Improving classification performance with Mel Frequency Cepstral Coefficients
MetaOptimize
    URL / What to do when you are stuck, Question and answer sites
    about / Question and answer sites
MLComp
    URL / Getting test data to evaluate our ideas on
model, first tiny application
    selecting / Before building our first model…
    straight line model / Starting with a simple straight line
    complex model / Towards some advanced stuff
    data, viewing / Stepping back to go forward – another look at our data
    training / Training and testing
    testing / Training and testing
    model function, calculating / Answering our initial question
mpmath
    URL / Accounting for arithmetic underflows
multiclass classification
    about / Binary and multiclass classification
multidimensional regression
    about / Multidimensional regression
    using / Multidimensional regression
multidimensional scaling (MDS) / Sketching our roadmap
    about / Multidimensional scaling
MultinomialNB
    about / Creating our first classifier and tuning it
MultinomialNB classifier / Tuning the classifier’s parameters
music
    analyzing / Looking at music
    decomposing, into sine wave components / Decomposing music into sine wave components
music data
    fetching / Fetching the music data
    wave format, converting into / Converting into a WAV format

N
Natural Language Toolkit (NLTK) / Stemming
    installing / Installing and using NLTK
    URL / Installing and using NLTK
    vectorizer, extending with / Extending the vectorizer with NLTK’s stemmer
Naïve Bayes
    about / Sketching our roadmap
Naïve Bayes classifier
    about / Introducing the Naïve Bayes classifier
    Naïve Bayes theorem / Getting to know the Bayes’ theorem
    working / Being naïve
    using, to classify / Using Naïve Bayes to classify
    unseen words, accounting for / Accounting for unseen words and other oddities
    arithmetic underflows, accounting for / Accounting for arithmetic underflows
    GaussianNB / Creating our first classifier and tuning it
    MultinomialNB / Creating our first classifier and tuning it
    BernoulliNB / Creating our first classifier and tuning it
    problem, solving / Solving an easy problem first
    classes, using / Using all classes
    parameters, tuning / Tuning the classifier’s parameters
nearest neighbor classifier
    about / Nearest neighbor classification
neighborhood approach, recommendations
    about / A neighborhood approach to recommendations
NumAllCaps
    about / Designing more features
NumExclams
    about / Designing more features
NumPy
    about / Introduction to NumPy, SciPy, and matplotlib
    examples / Chewing data efficiently with NumPy and intelligently with SciPy
    reference link, for examples / Chewing data efficiently with NumPy and intelligently with SciPy
    learning / Learning NumPy
    indexing / Indexing
    nonexisting values, handling / Handling nonexisting values
    runtime, comparing / Comparing the runtime

O
one-dimensional regression
    about / Predicting house prices with regression
online course, machine learning
    URL / Online courses
Otsu / Thresholding
overfitting
    about / Towards some advanced stuff
OwnerUserId / Preselection and processing of attributes

P
parameters, clustering
    tweaking / Tweaking the parameters
Part Of Speech (POS) / Sketching our roadmap
Pattern
    URL / All that was left out
PBS (Portable Batch System) / Using jug to break up your pipeline into tasks
penalized regression
    about / Penalized or regularized regression
    L1 penalties / L1 and L2 penalties
    L2 penalties / L1 and L2 penalties
    Lasso, using in scikit-learn / Using Lasso or ElasticNet in scikit-learn
    ElasticNet, using in scikit-learn / Using Lasso or ElasticNet in scikit-learn
    Lasso path, visualizing / Visualizing the Lasso path
    P greater than N scenarios / P-greater-than-N scenarios
    example, text documents / An example based on text documents
    hyperparameters, setting in principled way / Setting hyperparameters in a principled way
Penn Treebank Project
    URL / Determining the word types
POS column
    about / Successfully cheating using SentiWordNet
POS tag abbreviations / Determining the word types
PostTypeId attribute / Preselection and processing of attributes
pre-processing phase
    achievements / Our achievements and goals
    goals / Our achievements and goals
precision-recall (P/R) / An alternative way to measure classifier performance using receiver-operator characteristics
precision_recall_curve() function / Looking behind accuracy – precision and recall
predictions, rating with regression
    about / Rating predictions and recommendations
    dataset, splitting into training and testing / Splitting into training and testing
    training data, normalizing / Normalizing the training data
preprocessing
    about / Preprocessing – similarity measured as a similar number of common words
principal component analysis (PCA) / Sketching our roadmap
    about / About principal component analysis
    properties / About principal component analysis
    sketching / Sketching PCA
    applying / Applying PCA
    limitations / Limitations of PCA and how LDA can help
PyBrain
    URL / All that was left out
Python
    installing / Installing Python
    reference link / Installing Python
Python packages
    installing, on Amazon Linux / Installing Python packages on Amazon Linux

Q
Q&A sites
    MetaOptimize / What to do when you are stuck
    Cross Validated / What to do when you are stuck
    Stack Overflow / What to do when you are stuck
    TwoToReal / What to do when you are stuck
    Kaggle / What to do when you are stuck

R
Receiver-Operator Characteristic (ROC)
    used, for measuring classifier performance / An alternative way to measure classifier performance using receiver-operator characteristics
    about / An alternative way to measure classifier performance using receiver-operator characteristics
recommendations
    neighborhood approach / A neighborhood approach to recommendations
    regression approach / A regression approach to recommendations
    multiple methods, combining / Combining multiple methods
regression
    cross-validation / Cross-validation for regression
    about / L1 and L2 penalties
regression approach, recommendations
    about / A regression approach to recommendations
    issues / A regression approach to recommendations
resources, machine learning
    online courses / Online courses
    books / Books
    question and answer sites / Question and answer sites
    blogs / Blogs
    data sources / Data sources
    competition / Getting competitive
Ridge Regression / L1 and L2 penalties
roadmap
    sketching / Sketching our roadmap
root mean square error (RMSE)
    about / Predicting house prices with regression
    advantage / Predicting house prices with regression
roundness / Features and feature engineering
running status / Creating your first virtual machines

S
save() function / Increasing experimentation agility
scikit-learn classification
    about / Classifying with scikit-learn
    decision boundaries, examining / Looking at the decision boundaries
scikit-learn module
    about / Classifying with scikit-learn
SciPy
    about / Introduction to NumPy, SciPy, and matplotlib
    URL / Introduction to NumPy, SciPy, and matplotlib
    learning / Learning SciPy
    toolboxes / Learning SciPy
secret key
    about / Using Amazon Web Services
Securities and Exchange Commission (SEC) / An example based on text documents
Seeds dataset
    about / Learning about the Seeds dataset
    features / Learning about the Seeds dataset
sentiment analysis
    roadmap, sketching / Sketching our roadmap
    Twitter data, fetching / Fetching the Twitter data
    Naïve Bayes classifier / Introducing the Naïve Bayes classifier
    first classifier, creating / Creating our first classifier and tuning it
    tweets, cleaning / Cleaning tweets
SentiWordNet
    URL / Successfully cheating using SentiWordNet
similarity measuring
    about / Measuring the relatedness of posts
    bag of word approach / How to do it
SoX
    URL / Converting into a WAV format
sparse
    about / L1 and L2 penalties
sparsity / Building a topic model
specgram function / Looking at music
Speeded Up Robust Features (SURF)
    about / Local feature representations
stacked learning / Combining multiple methods
Stack Overflow
    URL / What to do when you are stuck
StarCluster
    used, for automating cluster generation / Automating the generation of clusters with StarCluster
    about / Automating the generation of clusters with StarCluster
    URL / Automating the generation of clusters with StarCluster
stemming
    about / Stemming

T
Talkbox SciKit
    URL / Improving classification performance with Mel Frequency Cepstral Coefficients
task
    about / An introduction to tasks in jug
testing accuracy / Evaluation – holding out data and cross-validation
TfidfVectorizer parameter / Tuning the classifier’s parameters
thresholding
    about / Thresholding
TimeToAnswer / Engineering the features
Title attribute / Preselection and processing of attributes
toolboxes, SciPy
    cluster / Learning SciPy
    constants / Learning SciPy
    fftpack / Learning SciPy
    integrate / Learning SciPy
    interpolate / Learning SciPy
    io / Learning SciPy
    linalg / Learning SciPy
    ndimage / Learning SciPy
    odr / Learning SciPy
    optimize / Learning SciPy
    signal / Learning SciPy
    sparse / Learning SciPy
    spatial / Learning SciPy
    special / Learning SciPy
    stats / Learning SciPy
topics
    documents comparing by / Comparing documents by topics
    number of topics, selecting / Choosing the number of topics
training accuracy / Evaluation – holding out data and cross-validation
train_model() function
    about / Solving an easy problem first
transform(documents) method
    about / Our first estimator
tweets
    cleaning / Cleaning tweets
Twitter data
    fetching / Fetching the Twitter data
two-levels of cross-validation / Setting hyperparameters in a principled way
TwoToReal
    URL / What to do when you are stuck, Question and answer sites

U
underfitting
    about / Stepping back to go forward – another look at our data

V
ViewCount / Preselection and processing of attributes
virtual machines, Amazon Web Services (AWS)
    creating / Creating your first virtual machines
    Python packages, installing on Amazon Linux / Installing Python packages on Amazon Linux
    jug, running on cloud machine / Running jug on our cloud machine
visual words / Local feature representations

W
Wikipedia dump
    URL / Modeling the whole of Wikipedia
word types
    about / Taking the word types into account
    determining / Determining the word types
    estimator / Our first estimator
    implementing / Putting everything together

...

Chapter 1, Getting Started with Python Machine Learning, introduces the basic idea of machine learning with a very simple example. Despite its simplicity, it will challenge us with the risk of overfitting.

Chapter 2, Classifying with Real-world Examples, uses real data to learn about