data analysis with pandas (2019)



Document information


Hands-On Data Analysis with Pandas

Efficiently perform data collection, wrangling, analysis, and visualization using Python

Stefanie Molin

BIRMINGHAM - MUMBAI

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Devika Battike
Content Development Editor: Athikho Sapuni Rishana
Senior Editor: Martin Whittemore
Technical Editor: Vibhuti Gawde
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Arvindkumar Gupta

First published: July 2019
Production reference: 2160919

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78961-532-6

www.packtpub.com

Dedication

When I think back on all I have accomplished, I know that I couldn't have done it without the support and love of my parents. This book is dedicated to both of you: to Mom, for always believing in me and teaching me to believe in myself. I know I can do anything I set my mind to because of you. And to Dad, for never letting me skip school and sharing a countdown with me.

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

  • Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

  • Improve your learning with Skill Plans built especially for you

  • Get a free eBook or video every month

  • Fully searchable for easy access to vital information

  • Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Foreword

Recent advancements in computing and artificial intelligence have completely changed the way we understand the world. Our current ability to record and analyze data has already transformed industries and inspired big changes in society.

Stefanie Molin's Hands-On Data Analysis with Pandas is much more than an introduction to the subject of data analysis or the pandas Python library; it's a guide to help you become part of this transformation. Not only will this book teach you the fundamentals of using Python to collect, analyze, and understand data, but it will also expose you to important software engineering, statistical, and machine learning concepts that you will need to be successful.

Using examples based on real data, you will be able to see firsthand how to apply these techniques to extract value from data. In the process, you will learn important software development skills, including writing simulations, creating your own Python packages, and collecting data from APIs.

Stefanie possesses a rare combination of skills that makes her uniquely qualified to guide you through this process. Being both an expert data scientist and a strong software engineer, she can not only talk authoritatively about the intricacies of the data analysis workflow, but also about how to implement it correctly and efficiently in Python.

Whether you are a Python programmer interested in learning more about data analysis, or a data scientist learning how to work in Python, this book will get you up to speed fast, so you can begin to tackle your own data analysis projects right away.

Felipe Moreno
New York, June 10, 2019

Felipe Moreno has been working in information security for the last two decades. He currently works for Bloomberg LP, where he leads the Security Data Science team within the Chief Information Security Office, and focuses on applying statistics and machine learning to security problems.

Contributors

About the author

Stefanie Molin is a data scientist and software engineer at Bloomberg LP in NYC, tackling tough problems in information security, particularly revolving around anomaly detection, building tools for gathering data, and knowledge sharing. She has extensive experience in data science, designing anomaly detection solutions, and utilizing machine learning in both R and Python in the AdTech and FinTech industries. She holds a B.S. in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, with minors in economics, and entrepreneurship and innovation. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.

Writing this book was a tremendous amount of work, but I have grown a lot through the experience: as a writer, as a technologist, and as a person. This wouldn't have been possible without the help of my friends, family, and colleagues. I'm very grateful to you all. In particular, I want to thank Aliki Mavromoustaki, Felipe Moreno, Suphannee Sivakorn, Lucy Hao, Javon Thompson, Alexander Comerford, and Ryan Molin. (The full version of my acknowledgments can be found on my GitHub; see the preface for the link.)

About the reviewer

Aliki Mavromoustaki is the lead data scientist at Tasman Analytics. She works with direct-to-consumer companies to deliver scalable infrastructure and implement event-driven analytics. Previously, she worked at Criteo, an AdTech company that employs machine learning to help digital commerce companies target valuable customers. Aliki worked on optimizing marketing campaigns and designed statistical experiments comparing Criteo products. Aliki holds a PhD in fluid dynamics from Imperial College London, and was an assistant adjunct professor in applied mathematics at UCLA.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Appendix

Choosing the appropriate visualization

When creating a data visualization, it is paramount that we select an appropriate plot type; the following diagram can be used to help select the proper visualization:

[Diagram: choosing the appropriate visualization]

Machine learning workflow

The following diagram summarizes the workflow for building machine learning models from data collection and data analysis through training and evaluating the model:

[Diagram: machine learning workflow]

Other Books You May Enjoy

If you enjoyed this book, you may be interested in these other books by Packt:

Data Analysis with Python
David Taieb
ISBN: 9781789950069

  • A new toolset that has been carefully crafted to meet your data analysis challenges

  • Full and detailed case studies of the toolset across several of today's key industry contexts

  • Become super productive with a new toolset across Python and Jupyter Notebook

  • Look into the future of data science and which directions to develop your skills next

Hands-On Data Analysis with NumPy and Pandas
Curtis Miller
ISBN: 9781789530797

  • Understand how to install and manage Anaconda

  • Read, sort, and map data using NumPy and pandas

  • Find out how to create and slice data arrays using NumPy

  • Discover how to subset your DataFrames using pandas

  • Handle missing data in a pandas DataFrame

  • Explore hierarchical indexing and plotting with pandas

Leave a review - let other readers know what you think

Please share your thoughts on this book with others by leaving a review on the site that you bought it from. If you purchased the book from Amazon, please leave us an honest review on this book's Amazon page. This is vital so that other potential readers can see and use your unbiased opinion to make purchasing decisions, we can
understand what our customers think about our products, and our authors can see your feedback on the title that they have worked with Packt to create. It will only take a few minutes of your time, but is valuable to other potential customers, our authors, and Packt. Thank you!

Date posted: 06/08/2022, 21:08

Table of contents

  • Cover

  • Title Page

  • Copyright and Credits

  • Dedication

  • About Packt

  • Foreword

  • Contributors

  • Table of Contents

  • Preface

  • Section 1: Getting Started with Pandas

  • Chapter 1: Introduction to Data Analysis

    • Chapter materials

    • Fundamentals of data analysis

      • Data collection

      • Data wrangling

      • Exploratory data analysis

      • Drawing conclusions

    • Statistical foundations

      • Sampling

      • Descriptive statistics

        • Measures of central tendency

          • Mean

          • Median

          • Mode

        • Measures of spread

          • Range

          • Variance

          • Standard deviation

          • Coefficient of variation

          • Interquartile range

          • Quartile coefficient of dispersion

        • Summarizing data

        • Common distributions

        • Scaling data

        • Quantifying relationships between variables

        • Pitfalls of summary statistics

      • Prediction and forecasting

      • Inferential statistics

    • Setting up a virtual environment

      • Virtual environments

        • venv

          • Windows

          • Linux/macOS

        • Anaconda

      • Installing the required Python packages

      • Why pandas?

      • Jupyter Notebooks

        • Launching JupyterLab

        • Validating the virtual environment

        • Closing JupyterLab

    • Summary

    • Exercises

    • Further reading

  • Chapter 2: Working with Pandas DataFrames

    • Chapter materials

    • Pandas data structures

      • Series

      • Index

      • DataFrame

    • Bringing data into a pandas DataFrame

      • From a Python object

      • From a file

      • From a database

      • From an API

    • Inspecting a DataFrame object

      • Examining the data

      • Describing and summarizing the data

    • Grabbing subsets of the data

      • Selection

      • Slicing

      • Indexing

      • Filtering

    • Adding and removing data

      • Creating new data

      • Deleting unwanted data

    • Summary

    • Exercises

    • Further reading

  • Section 2: Using Pandas for Data Analysis

  • Chapter 3: Data Wrangling with Pandas

    • Chapter materials

    • What is data wrangling?

      • Data cleaning

      • Data transformation

        • The wide data format

        • The long data format

      • Data enrichment

    • Collecting temperature data

    • Cleaning up the data

      • Renaming columns

      • Type conversion

      • Reordering, reindexing, and sorting data

    • Restructuring the data

      • Pivoting DataFrames

      • Melting DataFrames

    • Handling duplicate, missing, or invalid data

      • Finding the problematic data

      • Mitigating the issues

    • Summary

    • Exercises

    • Further reading

  • Chapter 4: Aggregating Pandas DataFrames

    • Chapter materials

    • Database-style operations on DataFrames

      • Querying DataFrames

      • Merging DataFrames

    • DataFrame operations

      • Arithmetic and statistics

      • Binning and thresholds

      • Applying functions

      • Window calculations

      • Pipes

    • Aggregations with pandas and numpy

      • Summarizing DataFrames

      • Using groupby

      • Pivot tables and crosstabs

    • Time series

      • Time-based selection and filtering

      • Shifting for lagged data

      • Differenced data

      • Resampling

      • Merging

    • Summary

    • Exercises

    • Further reading

  • Chapter 5: Visualizing Data with Pandas and Matplotlib

    • Chapter materials

    • An introduction to matplotlib

      • The basics

      • Plot components

      • Additional options

    • Plotting with pandas

      • Evolution over time

      • Relationships between variables

      • Distributions

      • Counts and frequencies

    • The pandas.plotting subpackage

      • Scatter matrices

      • Lag plots

      • Autocorrelation plots

      • Bootstrap plots

    • Summary

    • Exercises

    • Further reading

  • Chapter 6: Plotting with Seaborn and Customization Techniques

    • Chapter materials

    • Utilizing seaborn for advanced plotting

      • Categorical data

      • Correlations and heatmaps

      • Regression plots

      • Distributions

      • Faceting

    • Formatting

      • Titles and labels

      • Legends

      • Formatting axes

    • Customizing visualizations

      • Adding reference lines

      • Shading regions

      • Annotations

      • Colors

    • Summary

    • Exercises

    • Further reading

  • Section 3: Applications - Real-World Analyses Using Pandas

  • Chapter 7: Financial Analysis - Bitcoin and the Stock Market

    • Chapter materials

    • Building a Python package

      • Package structure

      • Overview of the stock_analysis package

    • Data extraction with pandas

      • The StockReader class

      • Bitcoin historical data from HTML

      • S&P 500 historical data from Yahoo! Finance

      • FAANG historical data from IEX

    • Exploratory data analysis

      • The Visualizer class family

      • Visualizing a stock

      • Visualizing multiple assets

    • Technical analysis of financial instruments

      • The StockAnalyzer class

      • The AssetGroupAnalyzer class

      • Comparing assets

    • Modeling performance

      • The StockModeler class

      • Time series decomposition

      • ARIMA

      • Linear regression with statsmodels

      • Comparing models

    • Summary

    • Exercises

    • Further reading

  • Chapter 8: Rule-Based Anomaly Detection

    • Chapter materials

    • Simulating login attempts

      • Assumptions

      • The login_attempt_simulator package

        • Helper functions

        • The LoginAttemptSimulator class

      • Simulating from the command line

    • Exploratory data analysis

    • Rule-based anomaly detection

      • Percent difference

      • Tukey fence

      • Z-score

      • Evaluating performance

    • Summary

    • Exercises

    • Further reading

  • Section 4: Introduction to Machine Learning with Scikit-Learn

  • Chapter 9: Getting Started with Machine Learning in Python

    • Chapter materials

    • Learning the lingo

    • Exploratory data analysis

      • Red wine quality data

      • White and red wine chemical properties data

      • Planets and exoplanets data

    • Preprocessing data

      • Training and testing sets

      • Scaling and centering data

      • Encoding data

      • Imputing

      • Additional transformers

      • Pipelines

    • Clustering

      • k-means

        • Grouping planets by orbit characteristics

        • Elbow point method for determining k

        • Interpreting centroids and visualizing the cluster space

      • Evaluating clustering results

    • Regression

      • Linear regression

        • Predicting the length of a year on a planet

        • Interpreting the linear regression equation

        • Making predictions

      • Evaluating regression results

        • Analyzing residuals

        • Metrics

    • Classification

      • Logistic regression

        • Predicting red wine quality

        • Determining wine type by chemical properties

      • Evaluating classification results

        • Confusion matrix

        • Classification metrics

          • Accuracy and error rate

          • Precision and recall

          • F score

          • Sensitivity and specificity

        • ROC curve

        • Precision-recall curve

    • Summary

    • Exercises

    • Further reading

  • Chapter 10: Making Better Predictions - Optimizing Models

    • Chapter materials

    • Hyperparameter tuning with grid search

    • Feature engineering

      • Interaction terms and polynomial features

      • Dimensionality reduction

      • Feature unions

      • Feature importances

    • Ensemble methods

      • Random forest

      • Gradient boosting

      • Voting

    • Inspecting classification prediction confidence

    • Addressing class imbalance

      • Under-sampling

      • Over-sampling

    • Regularization

    • Summary

    • Exercises

    • Further reading

  • Chapter 11: Machine Learning Anomaly Detection

    • Chapter materials

    • Exploring the data

    • Unsupervised methods

      • Isolation forest

      • Local outlier factor

      • Comparing models

    • Supervised methods

      • Baselining

        • Dummy classifier

        • Naive Bayes

      • Logistic regression

    • Online learning

      • Creating the PartialFitPipeline subclass

      • Stochastic gradient descent classifier

        • Building our initial model

        • Evaluating the model

        • Updating the model

        • Presenting our results

        • Further improvements

    • Summary

    • Exercises

    • Further reading

  • Section 5: Additional Resources

  • Chapter 12: The Road Ahead

    • Data resources

      • Python packages

        • Seaborn

        • Scikit-learn

      • Searching for data

      • APIs

      • Websites

        • Finance

        • Government data

        • Health and economy

        • Social networks

        • Sports

        • Miscellaneous

    • Practicing working with data

    • Python practice

    • Summary

    • Exercises

    • Further reading

  • Solutions

  • Appendix

    • Data analysis workflow

    • Choosing the appropriate visualization

    • Machine learning workflow

  • Other Books You May Enjoy

  • Index
