www.it-ebooks.info
O’Reilly-5980006 janert5980006˙fm October 28, 2010 22:1
Data Analysis with Open Source Tools
Data Analysis with
Open Source Tools
Philipp K. Janert
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Data Analysis with Open Source Tools
by Philipp K. Janert
Copyright © 2011 Philipp K. Janert. All rights reserved. Printed in the United States of America.
Published by O’Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://my.safaribooksonline.com). For more information,
contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.
Editor: Mike Loukides
Production Editor: Sumita Mukherji
Copyeditor: Matt Darnell
Production Services: MPS Limited, a Macmillan Company, and Newgen North America, Inc.
Indexer: Fred Brown
Cover Designer: Karen Montgomery
Interior Designer: Edie Freedman and Ron Bilodeau
Illustrator: Philipp K. Janert
Printing History:
November 2010: First Edition.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analysis with Open Source
Tools, the image of a common kite, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc.
was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author
assume no responsibility for errors or omissions, or for damages resulting from the use of the
information contained herein.
ISBN: 978-0-596-80235-6
[M]
[2011-05-27]
Furious activity is no substitute for understanding.
—H. H. Williams
CONTENTS
PREFACE
1 INTRODUCTION
Data Analysis
What’s in This Book
What’s with the Workshops?
What’s with the Math?
What You’ll Need
What’s Missing
PART I Graphics: Looking at Data
2 A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
Dot and Jitter Plots
Histograms and Kernel Density Estimates
The Cumulative Distribution Function
Rank-Order Plots and Lift Charts
Only When Appropriate: Summary Statistics and Box Plots
Workshop: NumPy
Further Reading
3 TWO VARIABLES: ESTABLISHING RELATIONSHIPS
Scatter Plots
Conquering Noise: Smoothing
Logarithmic Plots
Banking
Linear Regression and All That
Showing What’s Important
Graphical Analysis and Presentation Graphics
Workshop: matplotlib
Further Reading
4 TIME AS A VARIABLE: TIME-SERIES ANALYSIS
Examples
The Task
Smoothing
Don’t Overlook the Obvious!
The Correlation Function
... analytically, you will need to develop some familiarity with a few mathematical concepts. There is simply no way around it. (You can work with data without any math skills: look at what any data modeler or database administrator does. But if you want to do any sort of analysis, then a little math becomes a necessity.) I have tried to make the text accessible to readers with a minimum of previous knowledge. Some college...

... Intelligence
Corporate Metrics and Dashboards
Data Quality Issues
Workshop: Berkeley DB and SQLite
Further Reading

Notation and Basic Math
Where to Go from Here
Further Reading
C WORKING WITH DATA
Sources for Data
Cleaning and Conditioning
Sampling
Data File Formats
The Care and Feeding of Your Data Zoo
Skills
Terminology
Further Reading

... not, make up a large part of actual data analysis and also introduces some data-related terminology.

What’s with the Workshops?

Every full chapter (after this one) includes a section titled “Workshop” that contains some programming examples related to the chapter’s material. I use these Workshops for two purposes. On the one hand, I’d like to introduce a number of open source tools and libraries that may be...

... VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
False-Color Plots
A Lot at a Glance: Multiplots
Composition Problems
Novel Plot Types
Interactive Explorations
Workshop: Tools for Multivariate Graphics
Further Reading
INTERMEZZO: A DATA ANALYSIS SESSION
A Data Analysis Session
Workshop: gnuplot
Further Reading
PART II Analytics: Modeling Data
...
... carefully selected sample may lead to better results than a large, messy data set. Big Data makes it easy to forget the basics. It is a little early to say anything definitive about Big Data, but the current trend strikes me as being something quite different: it is not just classical data analysis on a larger scale. The approach of classical data analysis and statistics is inductive. Given a part, make statements...

... contrast, Big Data (at least as it is currently being used) seems primarily concerned with individual data points. Given that this specific user liked this specific movie, what other specific movie might he like? This is a very different question than asking which movies are most liked by what people in general! Big Data will not replace general, inductive data analysis. It is not yet clear just where Big Data will...

Big data

Arguably the most painful omission concerns everything having to do with Big Data. Big Data is a pretty new concept: I tend to think of it as relating to data sets that not merely don’t fit into main memory, but that no longer fit comfortably on a single disk, requiring compute clusters and the respective software and algorithms (in practice, map/reduce running on Hadoop). The rise of Big Data...

... (early 2009), Big Data was certainly on the horizon but was not necessarily considered mainstream yet. As this book goes to print (late 2010), it seems that for many people in the tech field, data has become nearly synonymous with “Big Data.” That kind of development usually indicates a fad. The reality is that, in practice, many data sets are “small,” and in particular many relevant data sets are small...
... into your product’s documentation does require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp K. Janert. Copyright 2011 Philipp K. Janert, 978-0-596-80235-6.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to...

... probably means formal training) in the field that you are working in. A book such as this one on general data analysis cannot replace this.

Formal statistical analysis

A different form of data analysis exists in some particularly well-established fields. In these situations, the environment from which the data arises is fully understood (or at least believed to be understood), and the methods and models to...