
Bad data handbook (Big Data)


DOCUMENT INFORMATION

Basic information

Number of pages: 264
File size: 10.72 MB

Contents

www.it-ebooks.info

Bad Data Handbook
by Q. Ethan McCallum

Copyright © 2013 Q. McCallum. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Melanie Yarbrough
Copyeditor: Gillian McGarvey
Proofreader: Melanie Yarbrough
Indexer: Angela Howard
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

November 2012: First Edition

Revision History for the First Edition:
2012-11-05 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449321888 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Bad Data Handbook, the cover image of a short-legged goose, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-32188-8
[LSI]

Table of Contents

About the Authors
Preface

1. Setting the Pace: What Is Bad Data?

2. Is It Just Me, or Does This Data Smell Funny?
  Understand the Data Structure
  Field Validation
  Value Validation
  Physical Interpretation of Simple Statistics
  Visualization
  Keyword PPC Example
  Search Referral Example
  Recommendation Analysis
  Time Series Data
  Conclusion

3. Data Intended for Human Consumption, Not Machine Consumption
  The Data
  The Problem: Data Formatted for Human Consumption
  The Arrangement of Data
  Data Spread Across Multiple Files
  The Solution: Writing Code
  Reading Data from an Awkward Format
  Reading Data Spread Across Several Files
  Postscript
  Other Formats
  Summary

4. Bad Data Lurking in Plain Text
  Which Plain Text Encoding?
  Guessing Text Encoding
  Normalizing Text
  Problem: Application-Specific Characters Leaking into Plain Text
  Text Processing with Python
  Exercises

5. (Re)Organizing the Web’s Data
  Can You Get That?
  General Workflow Example
  robots.txt
  Identifying the Data Organization Pattern
  Store Offline Version for Parsing
  Scrape the Information Off the Page
  The Real Difficulties
  Download the Raw Content If Possible
  Forms, Dialog Boxes, and New Windows
  Flash
  The Dark Side
  Conclusion

6. Detecting Liars and the Confused in Contradictory Online Reviews
  Weotta
  Getting Reviews
  Sentiment Classification
  Polarized Language
  Corpus Creation
  Training a Classifier
  Validating the Classifier
  Designing with Data
  Lessons Learned
  Summary
  Resources

7. Will the Bad Data Please Stand Up?
  Example 1: Defect Reduction in Manufacturing
  Example 2: Who’s Calling?
  Example 3: When “Typical” Does Not Mean “Average”
  Lessons Learned
  Will This Be on the Test?

8. Blood, Sweat, and Urine
  A Very Nerdy Body Swap Comedy
  How Chemists Make Up Numbers
  All Your Database Are Belong to Us
  Check, Please
  Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository
  Rehab for Chemists (and Other Spreadsheet Abusers)
  tl;dr

9. When Data and Reality Don’t Match
  Whose Ticker Is It Anyway?
  Splits, Dividends, and Rescaling
  Bad Reality
  Conclusion

10. Subtle Sources of Bias and Error
  Imputation Bias: General Issues
  Reporting Errors: General Issues
  Other Sources of Bias
  Topcoding/Bottomcoding
  Seam Bias
  Proxy Reporting
  Sample Selection
  Conclusions
  References

11. Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
  But First, Let’s Reflect on Graduate School …
  Moving On to the Professional World
  Moving into Government Work
  Government Data Is Very Real
  Service Call Data as an Applied Example
  Moving Forward
  Lessons Learned and Looking Ahead

12. When Databases Attack: A Guide for When to Stick to Files
  History
  Building My Toolset
  The Roadblock: My Datastore
  Consider Files as Your Datastore
  Files Are Simple!
  Files Work with Everything
  Files Can Contain Any Data Type
  Data Corruption Is Local
  They Have Great Tooling
  There’s No Install Tax
  File Concepts
  Encoding
  Text Files
  Binary Data
  Memory-Mapped Files
  File Formats
  Delimiters
  A Web Framework Backed by Files
  Motivation
  Implementation
  Reflections

13. Crouching Table, Hidden Network
  A Relational Cost Allocations Model
  The Delicate Sound of a Combinatorial Explosion…
  The Hidden Network Emerges
  Storing the Graph
  Navigating the Graph with Gremlin
  Finding Value in Network Properties
  Think in Terms of Multiple Data Models and Use the Right Tool for the Job
  Acknowledgments

14. Myths of Cloud Computing
  Introduction to the Cloud
  What Is “The Cloud”?
  The Cloud and Big Data
  Introducing Fred
  At First Everything Is Great
  They Put 100% of Their Infrastructure in the Cloud
  As Things Grow, They Scale Easily at First
  Then Things Start Having Trouble
  They Need to Improve Performance
  Higher IO Becomes Critical
  A Major Regional Outage Causes Massive Downtime
  Higher IO Comes with a Cost
  Data Sizes Increase
  Geo Redundancy Becomes a Priority
  Horizontal Scale Isn’t as Easy as They Hoped
  Costs Increase Dramatically
  Fred’s Follies
  Myth 1: Cloud Is a Great Solution for All Infrastructure Components
  How This Myth Relates to Fred’s Story
  Myth 2: Cloud Will Save Us Money
  How This Myth Relates to Fred’s Story
  Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
  How This Myth Relates to Fred’s Story
  Myth 4: Cloud Computing Makes Horizontal Scaling Easy
  How This Myth Relates to Fred’s Story
  Conclusion and Recommendations

15. The Dark Side of Data Science
  Avoid These Pitfalls
  Know Nothing About Thy Data
  Be Inconsistent in Cleaning and Organizing the Data
  Assume Data Is Correct and Complete
  Spillover of Time-Bound Data
  Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks
  Using a Production Environment for Ad-Hoc Analysis
  The Ideal Data Science Environment
  Thou Shalt Analyze for Analysis’ Sake Only
  Thou Shalt Compartmentalize Learnings
  Thou Shalt Expect Omnipotence from Data Scientists
  Where Do Data Scientists Live Within the Organization?
  Final Thoughts

16. How to Feed and Care for Your Machine-Learning Experts
  Define the Problem
  Fake It Before You Make It
  Create a Training Set
  Pick the Features
  Encode the Data
  Split Into Training, Test, and Solution Sets
  Describe the Problem
  Respond to Questions
  Integrate the Solutions
  Conclusion

17. Data Traceability
  Why?
  Personal Experience
  Snapshotting
  Saving the Source
  Weighting Sources
  Backing Out Data
  Separating Phases (and Keeping them Pure)
  Identifying the Root Cause
  Finding Areas for Improvement
  Immutability: Borrowing an Idea from Functional Programming
  An Example
  Crawlers
  Change
  Clustering
  Popularity
  Conclusion

18. Social Media: Erasable Ink?
  Social Media: Whose Data Is This Anyway?
  Control
  Commercial Resyndication
  Expectations Around Communication and Expression
  Technical Implications of New End User Expectations
  What Does the Industry Do?
  Validation API
  Update Notification API
  What Should End Users Do?
  How Do We Work Together?

19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
  Framework Introduction: The Four Cs of Data Quality Analysis
  Complete
  Coherent
  Correct
  aCcountable
  Conclusion

Index

Once you have validated your data, you have to decide what to do with the problems you’ve found. The decisions to fix, omit, or flag the offending records are similar to those for Completeness and will depend as always on your requirements, though the balance may be different. Fixing errors involving referential or value integrity tends to be a mixture of finding orphaned records, deleting duplicates, and so forth.

Correct

Having confirmed that your data is both complete and coherent, you’re still not quite ready to crunch numbers. You now have to ask yourself whether your data is correct enough for what you’re trying to do. It may seem strange to consider this a precursor to analysis, as analysis often serves to somehow validate the dataset; but keep in mind, there may also be “sub-dimensions” of correctness that bear validation before you move on to the main event. Similar to testing for coherence, correctness requires some degree of domain knowledge.

One thing to remember is that correctness itself can be relative. Imagine you’ve gathered data from a distributed system, composed of hundreds of servers, and you wish to measure latency between the component services as messages flow through the system. Can you just assume that clocks on all the machines are synchronized, or that the timestamps on your log records are in sync? Maybe. But even if you configure this system yourself, things change (and break). One simple check would be to confirm that the timestamps are moving in the proper direction. Say that messages flow through systems s1, s2, and s3, in that order. You could check that the timestamps are related as follows: message_timestamp(s1) [...]
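The comparison that follows message_timestamp(s1) is cut off in this excerpt, so the following is only a minimal sketch of the "proper direction" check the passage describes, under the assumption that a later hop should never log an earlier timestamp than the hop before it. The function name out_of_order_hops, the HOP_ORDER list, and the sample records are all invented for illustration and do not come from the book.

```python
from datetime import datetime

# Assumed hop order, taken from the passage: messages flow s1 -> s2 -> s3.
HOP_ORDER = ["s1", "s2", "s3"]


def out_of_order_hops(timestamps, hop_order=HOP_ORDER):
    """Return (earlier_hop, later_hop) pairs whose timestamps run backward.

    `timestamps` maps a hop name to the datetime logged at that hop.
    Hops missing from the mapping are skipped rather than guessed.
    """
    seen = [(hop, timestamps[hop]) for hop in hop_order if hop in timestamps]
    return [
        (prev_hop, next_hop)
        for (prev_hop, prev_ts), (next_hop, next_ts) in zip(seen, seen[1:])
        if next_ts < prev_ts
    ]


# Made-up log records for two messages; msg-2 goes backward between s1 and s2.
messages = {
    "msg-1": {
        "s1": datetime(2012, 11, 5, 12, 0, 0),
        "s2": datetime(2012, 11, 5, 12, 0, 1),
        "s3": datetime(2012, 11, 5, 12, 0, 2),
    },
    "msg-2": {
        "s1": datetime(2012, 11, 5, 12, 0, 5),
        "s2": datetime(2012, 11, 5, 12, 0, 3),
        "s3": datetime(2012, 11, 5, 12, 0, 6),
    },
}

# Collect only the messages that have at least one backward step.
suspect = {}
for msg_id, hops in messages.items():
    bad = out_of_order_hops(hops)
    if bad:
        suspect[msg_id] = bad

print(suspect)
```

A message that passes this check is not proof that the clocks agree; the check only catches timestamps that move backward, which is why it works as a cheap first filter before any latency figures are computed.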
...ky file formats. Sure, that’s part of the picture, but Bad Data is so much more. It includes data that eats up your time, causes you to stay late at the office, drives you to tear out your hair in frustration. It’s data that you can’t access, data that you had and then lost, data that’s not the same today as it was yesterday… In short, Bad Data is data that gets in the way. There are so many ways to get...

whether there is such a thing as truly bad data, in Will the Bad Data Please Stand Up? (Chapter 7). Your data may have problems, and you wouldn’t even know it. As Jonathan A. Schwabish explains in Subtle Sources of Bias and Error (Chapter 10), how you collect that data determines what will hurt you. In Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad? (Chapter 11), Brett J. Goldstein’s...

stick with this data science bit long enough, you’ll certainly encounter your fair share. To that end, we decided to compile Bad Data Handbook, a rogues gallery of data troublemakers. We found 19 people from all reaches of the data arena to talk about how data issues have bitten them, and how they’ve healed. In particular:

Guidance for Grubby, Hands-on Work

You can’t assume that a new dataset is clean...

CHAPTER 1
Setting the Pace: What Is Bad Data?

We all say we like data, but we don’t. We like getting insight out of data. That’s not quite the same as liking the data itself. In fact, I dare say that I don’t quite care for data. It sounds like I’m not alone. It’s tough to nail down a precise definition of “Bad Data.” Some people consider it a purely hands-on, technical phenomenon:...

on Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough (Chapter 19). In this complement to Kevin Fink’s article, we explain how to assess your data’s quality, and how to build a structure around a data quality effort.

CHAPTER 2
Is It Just Me, or Does This Data Smell Funny?
Kevin Fink

You are given a dataset...

all. Marck Vaisman uses The Dark Side of Data Science (Chapter 15) to document several worst practices that you should avoid.

Data Policy

Sure, you know the methods you used, but do you truly understand how those final figures came to be? Reid Draper’s Data Traceability (Chapter 17) is food for thought for your data processing pipelines. Data is particularly bad when it’s in the wrong place: it’s supposed...

If you’re working with text data, sooner or later a character encoding bug will bite you. Bad Data Lurking in Plain Text (Chapter 4), by Josh Levy, explains what sort of problems await and how to handle them. To wrap up, Adam Laiacano’s (Re)Organizing the Web’s Data (Chapter 5) walks you through everything that can go wrong in a web-scraping effort.

Data That Does the Unexpected

Sure, people...

restaurants, or enjoying good beer. Pete Warden is an ex-Apple software engineer, wrote the Big Data Glossary and the Data Source Handbook for O’Reilly, created the open-source projects Data Science Toolkit and OpenHeatMap, and broke the story about Apple’s iPhone location tracking file. He’s the CTO and founder of Jetpac, a data-driven social photo iPad app, with over a billion pictures analyzed from 3 million...

(Chapter 11), Brett J. Goldstein’s career retrospective explains how dirty data will give your classical statistics training a harsh reality check.

Data Storage and Infrastructure

How you store your data weighs heavily in how you can analyze it. Bobby Norton explains how to spot a graph data structure that’s trapped in a relational database in Crouching Table, Hidden Network (Chapter 13). Cloud computing’s... large-scale data analysis, but it’s not without its faults. In Myths of Cloud Computing (Chapter 14), Steve Francia dissects some of those assumptions so you don’t have to find out the hard way.

We debate using relational databases over NoSQL products, Mongo over Couch, or one Hadoop-based storage over another. Tim McNamara’s When Databases ...

Date posted: 19/04/2016, 18:53
