
Statistics For Big Data For Dummies (Alan Anderson)


DOCUMENT INFORMATION

Basic information

Format
Pages: 412
File size: 7.29 MB

Contents

Statistics For Big Data For Dummies®

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com

Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc., and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHOR HAVE USED THEIR BEST EFFORTS IN PREPARING THIS BOOK, THEY MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS BOOK AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES REPRESENTATIVES OR WRITTEN SALES MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR YOUR SITUATION. YOU SHOULD CONSULT WITH A PROFESSIONAL WHERE APPROPRIATE. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2015943222

ISBN 978-1-118-94001-3 (pbk); ISBN 978-1-118-94002-0 (ePub); ISBN 978-1-118-94003-7 (ePDF)

Statistics For Big Data For Dummies

Visit http://www.dummies.com/cheatsheet/statisticsforbigdata to view this book’s cheat sheet.

Table of Contents

Cover
Introduction
    About This Book
    Foolish Assumptions
    Icons Used in This Book
    Beyond the Book
    Where to Go From Here
Part I: Introducing Big Data Statistics
    Chapter 1: What Is Big Data and What Do You Do with It?
        Characteristics of Big Data
        Exploratory Data Analysis (EDA)
        Statistical Analysis of Big Data
    Chapter 2: Characteristics of Big Data: The Three Vs
        Characteristics of Big Data
        Traditional Database Management Systems (DBMS)
    Chapter 3: Using Big Data: The Hot Applications
        Big Data and Weather Forecasting
        Big Data and Healthcare Services
        Big Data and Insurance
        Big Data and Finance
        Big Data and Electric Utilities
        Big Data and Higher Education
        Big Data and Retailers
        Big Data and Search Engines
        Big Data and Social Media
    Chapter 4: Understanding Probabilities
        The Core Structure: Probability Spaces
        Discrete Probability Distributions
        Continuous Probability Distributions
        Introducing Multivariate Probability Distributions
    Chapter 5: Basic Statistical Ideas
        Some Preliminaries Regarding Data
        Summary Statistical Measures
        Overview of Hypothesis Testing
        Higher-Order Measures
Part II: Preparing and Cleaning Data
    Chapter 6: Dirty Work: Preparing Your Data for Analysis
        Passing the Eye Test: Does Your Data Look Correct?
        Being Careful with Dates
        Does the Data Make Sense?
        Frequently Encountered Data Headaches
        Other Common Data Transformations
    Chapter 7: Figuring the Format: Important Computer File Formats
        Spreadsheet Formats
        Database Formats
    Chapter 8: Checking Assumptions: Testing for Normality
        Goodness of fit test
        Jarque-Bera test
    Chapter 9: Dealing with Missing or Incomplete Data
        Missing Data: What’s the Problem?
        Techniques for Dealing with Missing Data
    Chapter 10: Sending Out a Posse: Searching for Outliers
        Testing for Outliers
        Robust Statistics
        Dealing with Outliers
Part III: Exploratory Data Analysis (EDA)
    Chapter 11: An Overview of Exploratory Data Analysis (EDA)
        Graphical EDA Techniques
        EDA Techniques for Testing Assumptions
        Quantitative EDA Techniques
    Chapter 12: A Plot to Get Graphical: Graphical Techniques
        Stem-and-Leaf Plots
        Scatter Plots
        Box Plots
        Histograms
        Quantile-Quantile (QQ) Plots
        Autocorrelation Plots
    Chapter 13: You’re the Only Variable for Me: Univariate Statistical Techniques
        Counting Events Over a Time Interval: The Poisson Distribution
        Continuous Probability Distributions
    Chapter 14: To All the Variables We’ve Encountered: Multivariate Statistical Techniques
        Testing Hypotheses about Two Population Means
        Using Analysis of Variance (ANOVA) to Test Hypotheses about Population Means
        The F-Distribution
        F-Test for the Equality of Two Population Variances
        Correlation
    Chapter 15: Regression Analysis
        The Fundamental Assumption: Variables Have a Linear Relationship
        Defining the Population Regression Equation
        Estimating the Population Regression Equation
        Testing the Estimated Regression Equation
        Using Statistical Software
        Assumptions of Simple Linear Regression
        Multiple Regression Analysis
        Multicollinearity
    Chapter 16: When You’ve Got the Time: Time Series Analysis
        Key Properties of a Time Series
        Forecasting with Decomposition Methods
        Smoothing Techniques
        Seasonal Components
        Modeling a Time Series with Regression Analysis
        Comparing Different Models: MAD and MSE
Part IV: Big Data Applications
    Chapter 17: Using Your Crystal Ball: Forecasting with Big Data
        ARIMA Modeling
        Simulation Techniques
    Chapter 18: Crunching Numbers: Performing Statistical Analysis on Your Computer
        Excelling at Excel
        Programming with Visual Basic for Applications (VBA)
        R, Matey!
    Chapter 19: Seeking Free Sources of Financial Data
        Yahoo! Finance
        Federal Reserve Economic Data (FRED)
        Board of Governors of the Federal Reserve System
        U.S. Department of the Treasury
        Other Useful Financial Websites
Part V: The Part of Tens
    Chapter 20: Ten (or So) Best Practices in Data Preparation
        Check Data Formats
        Verify Data Types
        Graph Your Data
        Verify Data Accuracy
        Identify Outliers
        Deal with Missing Values
        Check Your Assumptions about How the Data Is Distributed
        Back Up and Document Everything You Do
    Chapter 21: Ten (or So) Questions Answered by Exploratory Data Analysis (EDA)
        What Are the Key Properties of a Dataset?
        What’s the Center of the Data?
        How Much Spread Is There in the Data?
        Is the Data Skewed?
        What Distribution Does the Data Follow?
        Are the Elements in the Dataset Uncorrelated?
        Does the Center of the Dataset Change Over Time?
        Does the Spread of the Dataset Change Over Time?
        Are There Outliers in the Data?
        Does the Data Conform to Our Assumptions?
About the Authors
Cheat Sheet
Advertisement Page
Connect with Dummies
End User License Agreement

Introduction

Welcome to Statistics For Big Data For Dummies! Every day, what has come to be known as big data is making its influence felt in our lives. Some of the most useful innovations of the past 20 years have been made possible by the advent of massive data-gathering capabilities combined with rapidly improving computer technology.

For example, we have become accustomed to finding almost any information we need through the Internet. You can locate nearly anything under the sun immediately by using a search engine such as Google or DuckDuckGo. Finding information this way has become so commonplace that Google has slowly become a verb, as in “I don’t know where to find that restaurant; I’ll just Google it.” Just think how much more efficient our lives have become as a result of search engines.

But how does Google work?
Google couldn’t exist without the ability to process massive quantities of information at an extremely rapid speed, and its software has to be extremely efficient.

Another area that has changed our lives forever is e-commerce, of which the classic example is Amazon.com. People can buy virtually every product they use in their daily lives online (and have it delivered promptly, too). Often online prices are lower than in traditional “brick-and-mortar” stores, and the range of choices is wider. Online shopping also lets people find the best available items at the lowest possible prices.

Another huge advantage to online shopping is the ability of the sellers to provide reviews of products and recommendations for future purchases. Reviews from other shoppers can give extremely important information that isn’t available from a simple product description provided by manufacturers. And recommendations for future purchases are a great way for consumers to find new products that they might not otherwise have known about. Recommendations are enabled by one application of big data: the use of highly sophisticated programs that analyze shopping data and identify items that tend to be purchased by the same consumers.

Although online shopping is now second nature for many consumers, the reality is that e-commerce has only come into its own in the last 15–20 years, largely thanks to the rise of big data. A website such as Amazon.com must process quantities of information that would have been unthinkably gigantic just a few years ago, and that processing must be done quickly and efficiently. Thanks to rapidly improving technology, many traditional retailers now also offer the option of making purchases online; failure to do so would put a retailer at a huge competitive disadvantage.

In addition to search engines and e-commerce, big data is making a major impact in a surprising number of other areas that affect our daily lives:

    Social media
    Online auction sites
    Insurance
    Healthcare
    Energy
    Political polling
    Weather forecasting
    Education
    Travel
    Finance

Are There Outliers in the Data?

An outlier is a member of a dataset that is significantly larger or smaller than the other values in the dataset. (See Chapter 10 for a discussion of outliers.) Outliers can greatly affect some statistical tests, so it’s important to determine whether outliers are present, and if so, whether they should be removed from the dataset.

An outlier may be defined in terms of quantiles, as follows:

    If an observation is less than $Q_1 - 1.5 \times \mathrm{IQR}$, it’s considered to be an outlier.
    If an observation is greater than $Q_3 + 1.5 \times \mathrm{IQR}$, it’s considered to be an outlier.

Here $Q_1$ and $Q_3$ are the first and third quartiles, and $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range.

A box plot shows outliers as individual points at the top and bottom of the plot. Figure 21-2 shows that in the ExxonMobil data, there are three outliers greater than $Q_3 + 1.5 \times \mathrm{IQR}$ and two outliers smaller than $Q_1 - 1.5 \times \mathrm{IQR}$. These could have been caused by several factors, such as unexpectedly good or bad data released by the company, surprisingly large changes in oil prices, and so forth.

Does the Data Conform to Our Assumptions?
Many types of statistical analysis depend on key assumptions about the data. Chapter 8 discusses this notion at length. Some of the most commonly used assumptions include the following:

    Normally distributed data
    Independent observations in the data
    Constant parameters (mean, variance, and standard deviation) for the observations in the data
    No outliers in the data

EDA techniques enable you to test these assumptions before proceeding with formal statistical tests. The results for the ExxonMobil data given in this chapter show the following:

    The data is close to being normally distributed.
    The members of the data are very nearly independent of each other.
    The mean appears to be constant over time.
    The variance appears to be increasing over time.
    There are several outliers in the data.

About the Authors

Alan Anderson, PhD is a professor of economics and finance at numerous schools, including Fordham University and New York University. Outside of the academic environment he has many years of experience working as an economist, risk manager, and fixed income analyst. Alan received his PhD in economics from Fordham University, and an M.S. in financial engineering from Polytechnic University. He is the author of Business Statistics For Dummies (Wiley, 2013).

David Semmelroth has spent the last two decades working with data and training people to work with data. He has helped develop increasingly sophisticated techniques to collect, manage, and use data to drive business results. His industry experience includes working with a variety of financial services companies, from insurance to retail banking. More recently he’s been doing consulting work related to customer databases and database marketing. He also spent several years in the travel and entertainment industry. David earned both his B.A. and M.S. degrees in mathematics from the University of Michigan and has taught a variety of courses in statistics and mathematics at the college level. He is the author of Data Driven Marketing For Dummies (Wiley, 2014).

Authors’ Acknowledgments

The authors would like to thank their editors and everybody else at Wiley who helped make this book happen.

Publisher’s Acknowledgments

    Acquisitions Editor: Lindsay Lefevere
    Editor: Corbin Collins
    Technical Editor: Karen R. Hum, PhD
    Special Help: Barry Schoenborn
    Production Editor: Kinson Raja
    Cover Image: ©iStock.com/loops7

To access the cheat sheet specifically for this book, go to http://www.dummies.com/cheatsheet/statisticsforbigdata.

WILEY END USER LICENSE AGREEMENT

Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.

...

These include the characteristics of big data, applications of big data, key statistical tools for analyzing big data, and forecasting techniques.

Characteristics of Big Data

The three factors that distinguish big data from other types of data are volume, velocity, ...
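
The quantile-based outlier rule discussed under "Are There Outliers in the Data?" can be illustrated with a short Python sketch. It is not taken from the book. It assumes the usual box-plot convention of fences placed 1.5 interquartile ranges beyond the first and third quartiles. The function names, the `k` parameter and the sample values are all hypothetical, and different quartile conventions can give slightly different fences.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    Assumption: quartiles use the 'inclusive' method of
    statistics.quantiles; other conventions shift the fences slightly.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Split out the observations that fall below or above the fences."""
    lower, upper = tukey_fences(values, k)
    low = [v for v in values if v < lower]
    high = [v for v in values if v > upper]
    return low, high


# Hypothetical daily returns in percent, made up for illustration only.
returns = [-4.0, -0.5, -0.3, -0.2, 0.0, 0.1, 0.2, 0.3, 0.5, 0.6, 3.5]
lower_fence, upper_fence = tukey_fences(returns)
low_outliers, high_outliers = find_outliers(returns)
print("fences:", lower_fence, upper_fence)
print("low outliers:", low_outliers, "high outliers:", high_outliers)
```

Whether a flagged value should actually be removed remains a judgment call, as the text notes. The sketch only identifies candidates.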

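
The assumption checks listed under "Does the Data Conform to Our Assumptions?" (normality, independence, constant mean and variance) can be roughed out numerically as well. The sketch below is one possible approach and not the book's procedure. It computes the Jarque-Bera statistic (the test named in the Chapter 8 outline), the lag-1 autocorrelation as a crude independence check, and the mean and variance of each half of the series. All function names and the sample series are hypothetical.

```python
from statistics import fmean, pvariance


def jarque_bera(xs):
    """Return (skewness, excess kurtosis, JB statistic).

    JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K the
    excess kurtosis. Under normality JB is approximately chi-square with
    2 degrees of freedom, so large values argue against normality.
    """
    n = len(xs)
    m = fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    return skew, excess_kurt, jb


def lag1_autocorrelation(xs):
    """Correlation between each observation and the one before it."""
    m = fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((x - m) ** 2 for x in xs)
    return num / den


def half_split(xs):
    """(mean, variance) of the first and second halves of the series."""
    mid = len(xs) // 2
    first, second = xs[:mid], xs[mid:]
    return (fmean(first), pvariance(first)), (fmean(second), pvariance(second))


# Hypothetical series, made up for illustration: signs alternate and the
# swings grow in the second half, so the spread is not constant.
series = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.5, -0.6, 0.7, -0.5, 0.6, -0.7]
skew, excess_kurt, jb = jarque_bera(series)
r1 = lag1_autocorrelation(series)
(mean1, var1), (mean2, var2) = half_split(series)
print("skew:", skew, "excess kurtosis:", excess_kurt, "JB:", jb)
print("lag-1 autocorrelation:", r1)
print("first half:", mean1, var1, "second half:", mean2, var2)
```

Comparing halves is a blunt tool. A run-sequence plot or a rolling variance would show the same pattern in more detail, and a formal test would still be needed before drawing conclusions.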
Date posted: 04/03/2019, 11:49
