Practical data science cookbook

Document information

www.it-ebooks.info

Practical Data Science Cookbook

89 hands-on recipes to help you complete real-world data science projects in R and Python

Tony Ojeda
Sean Patrick Murphy
Benjamin Bengfort
Abhijit Dasgupta

BIRMINGHAM - MUMBAI

Practical Data Science Cookbook

Copyright © 2014 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2014

Production reference: 1180914

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78398-024-6

www.packtpub.com

Cover image by Pratyush Mohanta (tysoncinematics@gmail.com)

Credits

Authors: Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta
Reviewers: Richard Heimann, Sarah Kelley, Liang Shi, Will Voorhees
Commissioning Editor: James Jones
Acquisition Editor: James Jones
Content Development Editor: Arvind Koul
Technical Editors: Pankaj Kadam, Sebastian Rodrigues
Copy Editors: Insiya Morbiwala, Sayanee Mukherjee, Stuti Srivastava
Project Coordinator: Priyanka Goel
Proofreaders: Simran Bhogal, Maria Gould, Ameesha Green, Paul Hindle, Kevin McGowan, Lucy Rowland
Indexers: Rekha Nair, Priya Sane
Graphics: Abhinash Sahu
Production Coordinator: Adonia Jones
Cover Work: Adonia Jones

About the Authors

Tony Ojeda is an accomplished data scientist and entrepreneur, with expertise in business process optimization and over a decade of experience creating and implementing innovative data products and solutions. He has a Master's degree in Finance from Florida International University and an MBA with concentrations in Strategy and Entrepreneurship from DePaul University. He is the founder of District Data Labs, a cofounder of Data Community DC, and is actively involved in promoting data science education through both organizations.

First and foremost, I'd like to thank my coauthors for the tireless work they put in to make this book something we can all be proud to say we wrote together. I hope to work on many more projects and achieve many great things with you in the future.

I'd like to thank our reviewers, specifically Will Voorhees and Sarah Kelley, for reading every single chapter of the book and providing excellent feedback on each one. This book owes much of its quality to their great advice and suggestions.

I'd also like to thank my family and friends for their support and encouragement in just about everything I do.

Last, but certainly not least, I'd like to thank my fiancée and partner in life, Nikki, for her patience, understanding, and willingness to stick with me throughout all my ambitious undertakings, this book being just one of them. I wouldn't dare take risks and experiment with nearly as many things professionally if my personal life was not the stable, loving, supportive environment she provides.

Sean Patrick Murphy spent 15 years as a senior scientist at The Johns Hopkins University Applied Physics Laboratory, where he focused on machine learning, modeling and simulation, signal processing, and high performance computing in the Cloud. Now, he acts as an advisor and data consultant for companies in SF, NY, and DC. He graduated from The Johns Hopkins University and received his MBA from the University of Oxford. He currently co-organizes the Data Innovation DC meetup and cofounded the Data Science MD meetup. He is also a board member and cofounder of Data Community DC.

Benjamin Bengfort is an experienced data scientist and Python developer who has worked in military, industry, and academia for the past years. He is currently pursuing his PhD in Computer Science at the University of Maryland, College Park, doing research in Metacognition and Natural Language Processing. He holds a Master's degree in Computer Science from North Dakota State University, where he taught undergraduate Computer Science courses. He is also an adjunct faculty member at Georgetown University, where he teaches Data Science and Analytics. Benjamin has been involved in two data science start-ups in the DC region, leveraging large-scale machine learning and Big Data techniques across a variety of applications. He has a deep appreciation for the combination of models and data for entrepreneurial effect, and he is currently building one of these start-ups into a more mature organization.

I'd like to thank Will Voorhees for his tireless support in everything I've been doing, even agreeing to review my technical writing. He made my chapters understandable, and I'm thankful that he reads what I write. It's been essential to my career and sanity to have a classmate, a colleague, and a friend like him.

I'd also like to thank my coauthors, Tony and Sean, for working their butts off to make this book happen; it was a spectacular effort on their part.

I'd also like to thank Sarah Kelley for her input and fresh take on the material; so far, she's gone on many adventures with us, and I'm looking forward to the time when I get to review her books!

Finally, I'd especially like to thank my wife, Jaci, who puts up with a lot, especially when I bite off more than I can chew and end up working late into the night. Without her, I wouldn't be writing anything at all. She is an inspiration and, of the writers in my family, she is the one students will be reading even a hundred years from now.

Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years of experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting. He has a PhD in Biostatistics from the University of Washington and over 40 collaborative peer-reviewed manuscripts, with strong interests in bridging the statistics/machine-learning divide. He is always on the lookout for interesting and challenging projects, and is an enthusiastic speaker and discussant on new and better ways to look at and analyze data. He is a member of Data Community DC and a founding member and co-organizer of Statistical Programming DC (formerly, R Users DC).

About the Reviewers

Richard Heimann is a technical fellow and Chief Data Scientist at L-3 National Security Solutions (NSS) (NYSE:LLL), and is also an EMC-certified data scientist with concentrations in spatial statistics, data mining, and Big Data. Richard also leads the data science team at the L-3 Data Tactics Business Unit. L-3 NSS and L-3 Data Tactics are both premier Big Data and analytics service providers based in Washington DC and serve customers globally. Richard is an adjunct professor at the University of Maryland, Baltimore County, where he teaches Spatial Analysis and Statistical Reasoning. Additionally, he is an instructor at George Mason University, teaching Human Terrain Analysis; he is also a selection committee member for the 2014-2015 AAAS Big Data and Analytics Fellowship Program and a member of the WashingtonExec Big Data Council. Richard has recently published a book titled Social Media Mining with R, Packt Publishing. He recently provided analytical support to DARPA, DHS, the US Army, and the Pentagon.

Sarah Kelley is a junior Python developer and aspiring data scientist. She currently works at a start-up in Bethesda, Maryland, where she spends most of her time on data ingestion and wrangling. Sarah holds a Master's degree in Education from Seattle University. She is a self-taught programmer who became interested in the field through her desire to inspire her students to pursue careers in Mathematics, Science, and technology.

Liang Shi received his PhD in Computer Science and a Master's degree in Statistics from the University of Georgia in 2008 and 2006, respectively. His PhD study is on Machine Learning and AI, mainly solving surrogate model-assisted optimization problems. After graduation, he joined the Data Mining Research team at McAfee; his job was to detect network threats through machine-learning approaches based on Big Data and cloud computing platforms. He later joined Microsoft as a software engineer, and continued his security research and development leveraged by machine-learning algorithms, basically for online advertisement fraud detection on very large, real-time data scales. In 2012, he rejoined McAfee (Intel) as a senior researcher, conducting network threat research, again with the help of machine-learning and cloud computing techniques. Early this year, he joined Pivotal as a senior data scientist; his work is mainly on data scientist projects with clients of popular companies, mainly for IT and security data analytics. He is very familiar with statistical and machine-learning modeling and theories, and he is proficient with many programming languages and analytical tools. He has several journal- and conference-proceeding publications, and he also published a book chapter.

Will Voorhees is a software developer with experience in all sorts of interesting things, from mobile app development and natural language processing to infrastructure security. After teaching English in Austria and bootstrapping an education technology start-up, he moved to the West Coast, joined a big tech company, and is now happily working on infrastructure security software used by thousands of developers. In his free time, Will enjoys reviewing technical books, watching movies, and convincing his dog that she's a good girl, yes she is.

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?

  • Fully searchable across every book published by Packt

  • Copy and paste, print and bookmark content

  • On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

  • Preface

  • Chapter 1: Preparing Your Data Science Environment

    • Introduction

    • Understanding the data science pipeline

    • Installing R on Windows, Mac OS X, and Linux

    • Installing libraries in R and RStudio

    • Installing Python on Linux and Mac OS X

    • Installing Python on Windows

    • Installing the Python data stack on Mac OS X and Linux

    • Installing extra Python packages

    • Installing and using virtualenv

  • Chapter 2: Driving Visual Analysis with Automobile Data (R)

    • Introduction

    • Acquiring automobile fuel efficiency data

    • Preparing R for your first project

    • Importing automobile fuel efficiency data into R

    • Exploring and describing fuel efficiency data

    • Analyzing automobile fuel efficiency over time

    • Investigating the makes and models of automobiles

  • Chapter 3: Simulating American Football Data (R)

    • Introduction

    • Acquiring and cleaning football data

    • Analyzing and understanding football data

    • Constructing indexes to measure offensive and defensive strength

    • Simulating a single game with outcomes decided by calculations

    • Simulating multiple games with outcomes decided by calculations

What this book covers

Chapter 1, Preparing Your Data Science Environment, introduces you to the data science pipeline and helps you get your data science environment properly set up with instructions...

... and understanding of data science and apply good data science to your domains. Practicing data scientists require a great number and diversity of tools to get the job done. Data practitioners scrape,

Date posted: 12/09/2017, 01:35

Contents

  • Chapter 1: Preparing Your Data Science Environment

    • Introduction

    • Understanding the data science pipeline

    • Installing R on Windows, Mac OS X, and Linux

    • Installing libraries in R and RStudio

    • Installing Python on Linux and Mac OS X

    • Installing Python on Windows

    • Installing the Python data stack on Mac OS X and Linux

    • Installing extra Python packages

    • Installing and using virtualenv

  • Chapter 2: Driving Visual Analysis with Automobile Data (R)

    • Introduction

    • Acquiring automobile fuel efficiency data

    • Preparing R for your first project

    • Importing automobile fuel efficiency data into R

    • Exploring and describing fuel efficiency data

    • Analyzing automobile fuel efficiency over time

    • Investigating the makes and models of automobiles

  • Chapter 3: Simulating American Football Data (R)

    • Introduction

    • Acquiring and cleaning football data

    • Analyzing and understanding football data

    • Constructing indexes to measure offensive and defensive strength

    • Simulating a single game with outcomes decided by calculations

    • Simulating multiple games with outcomes decided by calculations
