Should we all be teaching
Intro to Data Science instead
of Intro to Databases?
7/11/14
Bill Howe, UW
Mike Franklin, UC Berkeley
Juliana Freire, NYU
Jim Frew, UC Santa Barbara
Bill Howe, University of Washington
Tim Kraska, Brown
Raghu Ramakrishnan, Microsoft
couldn't make it
Plan
context (8 min)
panelists (5 x (5min + 2min))
discussion
What is Data Science?
The next sexy job
The ability to take data, to be able to understand it, to process
it, to extract value from it, to visualize it, to communicate it,
that's going to be a hugely important skill.
-- Hal Varian, Google
Data science, as it's practiced, is a blend of Red-Bull-fueled
hacking and espresso-inspired statistics.
Data science is the civil engineering of data. Its acolytes
possess a practical knowledge of tools & materials, coupled
with a theoretical understanding of what's possible.
-- Mike Driscoll, CEO of Metamarkets
Drew Conway's Data Science Venn Diagram
Data Scientist (n.): Person who is better at
statistics than any software engineer and better
at software engineering than any statistician.
-- Josh Wills, Cloudera
A data scientist is a computer scientist
that understands error bars
-- unknown
A data scientist is a statistician that
lives in Silicon Valley
-- paraphrase of Stephen Probst, Teradata
What do data scientists do?
They need to find nuggets of truth in data and then explain it to the
business leaders
-- Richard Snee, EMC
Data scientists tend to be hard scientists, particularly physicists, rather
than computer science majors. Physicists have a strong mathematical
background, computing skills, and come from a discipline in which survival
depends on getting the most from the data. They have to think about the
big picture, the big problem.
-- DJ Patil, Chief Scientist at LinkedIn
A data scientist is someone who can obtain, scrub, explore, model
and interpret data, blending hacking, statistics and machine
learning. Data scientists not only are adept at working with data, but
appreciate data itself as a first-class product.
-- Hilary Mason, chief scientist at bit.ly
I worry that the Data Scientist role is like
the mythical webmaster of the 90s:
master of all trades.
-- Aaron Kimball, CTO Wibidata
Mike Driscoll's three sexy skills of data geeks
Statistics
traditional analysis
Data Munging
parsing, scraping, and formatting data
Visualization
graphs, tools, etc.
Three types of tasks:
1) Preparing to run a model
80% of the work
-- Aaron Kimball
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
2) Running the model
3) Interpreting the results
The other 80% of the work
What are the abstractions of
data science?
Data Jujitsu
Data Wrangling
Data Munging
Translation: We have no idea what
this is all about
What are the abstractions of
data science?
matrices and linear algebra?
relations and relational algebra?
objects and methods?
files and scripts?
data frames and functions?
Claim: Relational Algebra is at least as important as Linear Algebra
Huge number of
relevant courses,
new and existing.
Figure: four axes. Tools (tools vs. abstr.), Math (structs vs. stats), Scale (desk vs. cloud), Audience (hackers vs. analysts)
William W. Cohen
Machine Learning
CSE 344 Introduction to Data Management
Dan Suciu, Magda Balazinska
Jeff Hammerbacher, Mike Franklin
Introduction to Data Science
Rachel Schutt
Bill Howe, Richard Sharp, Roger Barga
UW Big Data Education Efforts
Students
CS/Informatics: undergrads, grads
Non-Major: undergrads, grads
Non-Students
professionals, researchers
Programs
UWEO Data Science Certificate
IGERT: Big Data PhD Track
CS Courses
Bootcamps and workshops
Intro to Data Programming
Data Science Masters (planned)
MOOC: Intro to Data Science
Incubator: hands-on training
Bill Howe
Session 1, Spring 2013
Session 2 (starts Monday!)
Participation numbers
Registered: 119,517 (totally irrelevant)
Clicked play in first 2 weeks: 78,589
Turned in 1st homework: 10,663
Completed all assignments: ~9,000 (typical attrition for a MOOC)
Passed: 7,022
Forum threads: 4,661
Forum posts: 22,900
Fairly consistent with Coursera data across hard courses
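The attrition in these numbers is easier to read as ratios. A minimal sketch, assuming the counts pair with the labels in the order listed on the slide; the `retention` helper and the shortened stage names are hypothetical, and the "~9000" figure is treated as exactly 9000:

```python
# Counts copied from the participation slide; "~9000" is approximate.
funnel = [
    ("Registered", 119517),
    ("Clicked play in first 2 weeks", 78589),
    ("Turned in 1st homework", 10663),
    ("Completed all assignments", 9000),
    ("Passed", 7022),
]


def retention(stages):
    """Return (label, count, share of first stage, share of previous stage)."""
    first = stages[0][1]
    prev = first
    rows = []
    for label, count in stages:
        rows.append((label, count, count / first, count / prev))
        prev = count
    return rows


rows = retention(funnel)
for label, count, of_registered, of_previous in rows:
    print(f"{label:32s} {count:>8,d} {of_registered:7.1%} {of_previous:7.1%}")
```

Read this way, about 6% of registrants passed, but roughly two thirds of those who turned in the first homework did, which is one way to interpret the "totally irrelevant" note next to the registration count.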
Who took the course?
Syllabus
Data Science Landscape (~1 week)
Data Manipulation at Scale
Relational Databases (~1 week)
MapReduce (~1 week)
NoSQL (~1 week)
Analytics
Statistics Pearls (~1 week)
multiple hypothesis testing, effect size, Bayesian, bootstrap
Machine Learning Pearls (~1 week)
evaluation / overfitting, boosting / bagging, trees / forests, gradient descent
Visualization (~1 week)
Graph Analytics (~1 week)
Guest Lectures
Relational Algebra is the Calculus of Big Data
RA-flavored Hadoop-spawn: Pig, Hive, blah
Hadoop contemporaries: Cascalog, Flume, blah
Post-Hadoop: Spark/Shark, Dremel, blah
It's all RA
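One way to see the claim is that a relational join falls out of the map, shuffle, and reduce steps directly. A minimal pure-Python sketch of a reduce-side equi-join, not tied to any real Hadoop API; `map_phase`, `reduce_phase`, and the toy relations are all hypothetical names invented for illustration:

```python
from collections import defaultdict
from itertools import product


def map_phase(relation_name, rows, key):
    """Emit (join key, (relation tag, row)) pairs, as a mapper would."""
    for row in rows:
        yield row[key], (relation_name, row)


def reduce_phase(pairs, left_name, right_name):
    """Group pairs by key (the shuffle), then pair rows from each side."""
    groups = defaultdict(lambda: {left_name: [], right_name: []})
    for key, (tag, row) in pairs:
        groups[key][tag].append(row)
    for key in sorted(groups):
        sides = groups[key]
        # Keys present on only one side produce no output (inner join).
        for left, right in product(sides[left_name], sides[right_name]):
            yield {**left, **right}


# Toy relations, invented for illustration.
courses = [
    {"course_id": 1, "title": "Intro to Databases"},
    {"course_id": 2, "title": "Intro to Data Science"},
]
enrollments = [
    {"course_id": 2, "student": "ana"},
    {"course_id": 2, "student": "raj"},
    {"course_id": 3, "student": "li"},
]

pairs = list(map_phase("courses", courses, "course_id"))
pairs += list(map_phase("enrollments", enrollments, "course_id"))
joined = list(reduce_phase(pairs, "courses", "enrollments"))
print(joined)
```

The mapper tags each row with its relation, the shuffle groups rows by join key, and the reducer takes the per-key cross product. Systems in the lists above generally let you state the join declaratively rather than writing these steps by hand.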
Relational Algebra is the Calculus of Small Data
Galaxy bioinformatics workflows
Operate on Genomics Intervals -> Join
Pandas (Python)
merge(left, right, on=key)
dplyr (R)
filter(x), select(x), arrange(x), group_by(x),
inner_join(x, y), left_join(x, y), ...
Manimal, Pyxis/StatusQuo, others
Extract RA operators implemented manually in Java code
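The same operators show up at desktop scale. A minimal sketch using pandas, assuming two small tables invented for illustration (the column names `key`, `gene`, and `score` are hypothetical):

```python
import pandas as pd

# Toy tables, invented for illustration.
left = pd.DataFrame({"key": [1, 2, 3], "gene": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 3, 4], "score": [0.5, 0.7, 0.9, 0.1]})

# Join: merge performs an inner join on the named column by default.
joined = pd.merge(left, right, on="key")

# Selection (filter rows) followed by projection (pick columns).
selected = joined[joined["score"] > 0.6][["gene", "score"]]

# Grouping with aggregation.
per_gene = joined.groupby("gene", as_index=False)["score"].max()

print(joined)
print(selected)
print(per_gene)
```

Each step corresponds to a relational algebra operator: join, select, project, and group-by with aggregation. The dplyr verbs listed above cover the same set.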
The next hour
~5 minute talks
Discussion
Possible Responses
Data science is just a buzzword; there's
no substance to it.
I'm already teaching all this stuff;
there's nothing new here.
This is a job for statistics departments /
B-schools / I-schools / applied math /
anyone else.