Introduction to Data Science
Lecture
CS 194, Fall 2014
John Canny
Including notes from Michael Franklin, Dan Bruckner, Evan Sparks, Shivaram Venkataraman

Outline
• Data Science
  – Why all the excitement?
  – examples
• Where does data come from
• So what is Data Science
• Doing Data Science
• About the course
  – what we’ll cover
  – data science first, big data later
  – requirements, workload etc.

Data Analysis Has Been Around for a While
R.A. Fisher, W.E. Deming, Peter Luhn, Howard Dresner
Abridged version of Jeff Hammerbacher’s timeline for CS 194, 2012

Data makes everything clearer
• Seven Countries Study (Ancel Keys, UCB 1925, 28)
• 13,000 subjects total, 5-40 years follow-up

Data Science: Why all the Excitement?
e.g., Google Flu Trends: detecting outbreaks two weeks ahead of CDC data
New models are estimating which cities are most at risk for spread of the Ebola virus

Why all the Excitement? Data and Election 2012 (cont.)
• …that was just one of several ways that Mr. Obama’s campaign operations, some unnoticed by Mr. Romney’s aides in Boston, helped save the president’s candidacy. In Chicago, the campaign recruited a team of behavioral scientists to build an extraordinarily sophisticated database …that allowed the Obama campaign not only to alter the very nature of the electorate, making it younger and less white, but also to create a portrait of shifting voter allegiances. The power of this operation stunned Mr. Romney’s aides on election night, as they saw voters they never even knew existed turn out in places like Osceola County, Fla.
New York Times, Wed Nov 7, 2012

A history of the (Business) Internet: 1997
PageRank: the web as a behavioral dataset
DB size = 50 billion sites
Google server farms million machines (est.)

Other Berkeley Projects (used in the course)
IPython: Created by Fernando Perez (Brain Science). Probably the most widely used Data Science environment.
BIDMach: A hardware-optimized (rooflined) toolkit. The fastest tool for general machine learning.
Caffe: Most popular toolkit for DNN (Deep Neural Network) modeling.

DOING DATA SCIENCE

Ben Fry’s Model
Acquire, Parse, Filter, Mine, Represent, Refine, Interact

Jeff Hammerbacher’s Model
Identify problem
Instrument data sources
Collect data
Prepare data (integrate, transform, clean, filter, aggregate)
Build model
Evaluate model
Communicate results

From the Trenches
Yahoo [KDD 2009, best app paper]
eBay [SIGIR 2011, hon. mention]
Quantcast [2012]
Microsoft [CIKM 2014]

Data Scientist’s Practice
Clean, prep
Digging Around in Data
Hypothesize
Model
Evaluate
Interpret
Large Scale Exploitation

What’s Hard about Data Science
• Overcoming assumptions
• Making ad-hoc explanations of data patterns
• Overgeneralizing
• Communication
• Not checking enough (validate models, data pipeline integrity, etc.)
• Using statistical tests correctly
• Prototype-to-production transitions
• Data pipeline complexity (who do you ask?)

About the Course

Grading
• Class Participation (M) and in-class labs (W): 20%
• Midterm: 20%
• Final Project (in groups): 25%
• Homeworks: 30%
• Bunnies: 5%
Lab model – hands-on weekly labs here (145 Moffitt)
A bunny is a cuter, cuddlier species of quiz… Normally due Monday before class (Friday this week)

Projects
Project teams should form by week
Project proposals will be due 10/2
You can choose a project topic, but we will also provide a list of suggested projects from around campus (from BIDS researchers)
You need:
• A clear problem statement
• An accessible dataset
• Modeling plan + appropriate tools

About the Course: Staff
Contact:
Instructor: John Canny, lastname@berkeley.edu
Office hours: MW 2-3pm
GSIs:
Charles Reiss: Tu/Th 5-6 at 283E Soda, firstname.lastname@berkeley.edu
Biye Jiang: Wed 10-11 at 283E Soda, firstletteroffirstnamelastname@berkeley.edu
Use Piazza for questions…

Course Site, Readings
The main course site is in bCourses, titled “Introduction to Data Science”.
Most work will be submitted there; some homeworks will be submitted on instructional machines with glookup.
Readings will be linked from the course site.
Some are campus only; configure proxy.lib.berkeley.edu in your browser to read at home.

If time permits…
A data analysis exercise: English Premier League Soccer: Everton vs W Bromwich Albion
Predict the outcome:
[Grid: Everton Goals vs W Bromwich Goals, with goal counts bucketed as 0, 1, 2+]

Analysis
What kinds of data will you use?
• Almost anything is OK, except other predictions
• History: individual or pair-wise?
• Team or players?
• Numerical or text?
• What kind of model will you build?
• What assumptions are safe to make?

Predictions
[Two grids, PL Followers and Non-Followers: Everton Goals vs W Bromwich Goals, bucketed as 0, 1, 2+]
Answers on Monday!

Readings
This week’s reading bunny is due by Friday.
Read next week’s readings, and complete the bunny before class on Monday.

... Washington, UCB,
• New degree programs, courses, boot-camps:
  – e.g., at Berkeley: Stats, I-School, CS, Astronomy…
  – One proposal (elsewhere) for an MS in “Big Data Science”
DATA SCIENCE – WHAT IS...

... purpose classifier Supernova Not Nugent group / C3 LBL
Scientific Modeling / Data-Driven Approach
Physics-based models, Problem-Structured, Mostly deterministic, precise, Run on Supercomputer or High-end
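
A possible starting point for the Everton vs W Bromwich exercise: the slides leave the choice of model open, so the sketch below is only one baseline, not the course's answer. It assumes each team's goal count is an independent Poisson variable and fills in the 0 / 1 / 2+ grid from two scoring rates. The rates (1.5 and 1.0 goals per match) are made-up placeholders, not estimates from any dataset, and the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals when goals follow Poisson(rate)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def bucket_probs(rate: float) -> dict[str, float]:
    """Collapse the goal distribution into the slide's buckets: 0, 1, 2+."""
    p0 = poisson_pmf(0, rate)
    p1 = poisson_pmf(1, rate)
    return {"0": p0, "1": p1, "2+": 1.0 - p0 - p1}


def outcome_grid(rate_a: float, rate_b: float) -> dict[tuple[str, str], float]:
    """Joint probability of every (team A bucket, team B bucket) cell.

    Assumes the two teams' goal counts are independent, which is a
    strong simplification.
    """
    a = bucket_probs(rate_a)
    b = bucket_probs(rate_b)
    return {(i, j): a[i] * b[j] for i in a for j in b}


# Placeholder scoring rates (goals per match); NOT taken from real data.
EVERTON_RATE = 1.5
W_BROMWICH_RATE = 1.0

grid = outcome_grid(EVERTON_RATE, W_BROMWICH_RATE)
for (everton, w_bromwich), p in sorted(grid.items()):
    print(f"Everton {everton:>2}, W Bromwich {w_bromwich:>2}: {p:.3f}")
```

In practice the two rates would be estimated from whatever history you decide to use (team-level or pair-wise results), and the independence assumption is exactly the kind of thing the "What assumptions are safe to make?" question asks you to examine.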