4-2 Big Data PPTs
Introduction
Theme of this Course
Introduction to Big Data
Big Data Definition
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes to 35 ZB
• Data volume is increasing exponentially
[Chart: exponential increase in collected/generated data]
Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missing opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
Big Data: 3V’s
Some Make it 4V’s
Harnessing Big Data
Who’s Generating Big Data
• Mobile devices (tracking all objects all the time)
• Progress and innovation are no longer hindered by the ability to collect data
• Old Model: few companies generate data, all others consume data
• New Model: all of us generate data, and all of us consume data
What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time processing
Value of Big Data Analytics
• Big data is more real-time in nature than traditional data warehouse (DW) applications
Challenges in Handling Big Data
What Technology Do We Have for Big Data?
Big Data Technology
What You Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and how it works)
• How big data are managed in a scalable, efficient way
• Learn to write Hadoop jobs in different languages
• Programming Languages: Java, C, Python
• High-Level Languages: Apache Pig, Hive
• Learn advanced analytics tools on top of Hadoop
• RHadoop: Statistical tools for managing big data
• Mahout: Data mining and machine learning tools over big data
• Learn state-of-the-art technology from recent research papers
• Optimizations, indexing techniques, and other extensions to Hadoop
Course Logistics
Course Logistics
• Lectures
• Tuesday, Thursday: (4:00pm - 5:20pm)
Textbook & Reading List
• No specific textbook
• Big Data is a relatively new topic (so no fixed syllabus)
• Reading List
• We will cover state-of-the-art technology from research papers in big conferences
• Many Hadoop-related papers are available on the course website
• Related books:
• Hadoop: The Definitive Guide [pdf]
Requirements & Grading
• Seminar-Type Course
• Students will read research papers and present them (Reading List); done in teams of two
• Hands-on Course
• No written homework or exams
• Several coding projects covering the entire semester
Requirements & Grading (Cont’d)
• Reviews
• When a team (not the instructor) is presenting, the other students should prepare a review of the presented paper
• The course website gives guidelines on how to write good reviews
Late Submission Policy
• For Projects
• One day late: 10% off the max grade
• Two days late: 20% off the max grade
• Three days late: 30% off the max grade
• Beyond that, no late submission is accepted
• Submissions:
• Submitted via the Blackboard system by the due date
• Demonstrated to the instructor within the following week
• For Reviews
• No late submissions
• A student may skip at most 4 reviews
• Submissions:
• Given to the instructor at the beginning of class
More about Projects
• A virtual machine is provided, including the needed platform for the projects:
• Ubuntu OS (Version 12.10)
• Hadoop platform (Version 1.1.0)
• Apache Pig (Version 0.10.0)
• Mahout library (Version 0.7)
• RHadoop
• In addition to other software packages
Next Step from You…
1. Form teams of two
2. Visit the course website (Reading List); each team selects its first paper to present (1st come, 1st served)
• Send me your top 2-3 choices
Open Source World’s Solution
[Diagram: Spider Runtime; batch processing systems on top of Hadoop; a business intelligence database; MapReduce sender/receiver data flow with user buffers, merge-to-disk, checkpointing, and combine/reduce functions]

MapReduce and Hadoop Distributed File System
⚫ Introduction to MapReduce
⚫ From CS Foundation to MapReduce
⚫ MapReduce programming model
⚫ Hadoop Distributed File System
⚫ Relevance to Undergraduate Curriculum
⚫ Demo (Internet access needed)
⚫ Our experience with the framework
⚫ Summary
⚫ References
[Diagram: a sequential WordCounter, where a Main routine runs parse( ) and count( ) over a DataCollection to produce a ResultTable of <KEY, VALUE> word counts such as web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1; a multithreaded variant runs 1..* Parser and Counter threads over the same collection]
CCSCNE 2009, Plattsburgh, April 24, 2009, B. Ramamurthy & K. Madurai
Peta-scale Data
[Diagram: the same multithreaded WordCounter (Main thread, 1..* Parsers and Counters) producing the <KEY, VALUE> word-count table]
Addressing the Scale Issue
⚫ A single machine cannot serve all the data: you need a special distributed (file) system
⚫ Large number of commodity hardware disks: say, 1000 disks of 1 TB each
⚫ Issue: with a failure rate of 1/1000 (from the mean time between failures, MTBF), at least one of the above 1000 disks is expected to be down at any given time.
⚫ Thus failure is the norm, not an exception.
⚫ The file system has to be fault-tolerant: replication, checksums
⚫ Data transfer bandwidth is critical (location of data)
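The failure arithmetic behind this slide can be checked directly; a short sketch, assuming (as the slide does) that each disk is independently down with probability 1/1000 at any instant:

```python
# Expected number of failed disks, and the chance that at least one of
# n disks is down, when each disk is down independently with probability p.
def expected_down(n: int, p: float) -> float:
    return n * p

def prob_any_down(n: int, p: float) -> float:
    return 1.0 - (1.0 - p) ** n

n, p = 1000, 1.0 / 1000
print(expected_down(n, p))            # 1.0 disk down on average
print(round(prob_any_down(n, p), 3))  # 0.632: failure is the norm
```

With 1000 disks you should plan for a failure essentially all the time, which is why replication and checksums are built into the file system.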
Peta-Scale Data is Commonly Distributed
Write Once Read Many (WORM) Data
WORM Data is Amenable to Parallelism
1. Data with WORM characteristics yields to parallel processing
2. Data without dependencies yields to out-of-order processing
[Diagram: 1..* Parser and Counter threads working over distributed data collections]
Our parse( ) is a mapping operation:
MAP: input → <key, value> pairs
Our count( ) is a reduce operation:
REDUCE: <key, value> pairs reduced
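The parse-as-map, count-as-reduce observation can be sketched in plain Python (illustrative only, not the Hadoop API):

```python
from collections import defaultdict

# MAP: turn raw input into <key, value> pairs, one ("word", 1) per word.
def map_phase(text):
    for word in text.split():
        yield (word, 1)

# REDUCE: collapse all pairs sharing a key into a single <key, count>.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_phase(map_phase("web weed green web green"))
print(result)  # {'web': 2, 'weed': 1, 'green': 2}
```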
[Diagram: the data collection (size: TBytes; splits such as Cat, Bat, Dog, other words) is split into split 1 … split n to supply multiple processors; each split flows through map, Parse-hash/Count, and combine steps, and Reduce tasks write output partitions part0 (P-0000, count1), part1 (P-0001, count2), part2 (P-0002, count3)]
MapReduce Programming Model
⚫ Highly fault-tolerant
⚫ High throughput
⚫ Suitable for applications with large data sets
⚫ Streaming access to file system data
⚫ Can be built out of commodity hardware
HDFS Client
[Diagram: an application uses the HDFS client alongside the local file system (block size: 2K); HDFS uses a 128M block size, with Name Nodes tracking replicated blocks]
More details: we discuss this in great detail in my Operating Systems course
Hadoop Distributed File System
Relevance and Impact on Undergraduate courses
torial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera Videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html
Hive - SQL on top of Hadoop
Map-Reduce and SQL
• Map-Reduce is scalable
– SQL has a huge user base
– SQL is easy to code
• Solution: Combine SQL and Map-Reduce
– Hive on top of Hadoop (open source)
– Aster Data (proprietary)
– Green Plum (proprietary)
Hive
• A database/data warehouse on top of Hadoop
– Rich data types (structs, lists and maps)
– Efficient implementations of SQL filters, joins and group-bys on top of map-reduce
• Allows users to access Hive data without using Hive
• Link:
– http://svn.apache.org/repos/asf/hadoop/hive/trunk/
Hive Architecture
[Diagram: the query is compiled into chained Map Reduce jobs over HDFS, with SerDe handling row (de)serialization; table a <key, av> joins b to give AB <key, av, bv>, which joins c to give ABC <key, av, bv, cv>]
• SQL:
– FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …
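A query like this compiles into reduce-side equi-joins; a minimal sketch of one join step in plain Python (the function names are ours, not Hive's):

```python
from collections import defaultdict

# Reduce-side equi-join: the map phase tags each row with its source
# table, the reduce phase pairs up rows that share a join key.
def map_tag(a_rows, b_rows):
    for key, av in a_rows:
        yield key, ("a", av)
    for key, bv in b_rows:
        yield key, ("b", bv)

def join_reduce(tagged):
    groups = defaultdict(lambda: {"a": [], "b": []})
    for key, (tag, val) in tagged:
        groups[key][tag].append(val)
    rows = []
    for key in sorted(groups):
        for av in groups[key]["a"]:
            for bv in groups[key]["b"]:
                rows.append((key, av, bv))
    return rows

a = [(1, "a1"), (2, "a2")]
b = [(1, "b1"), (2, "b2")]
print(join_reduce(map_tag(a, b)))  # [(1, 'a1', 'b1'), (2, 'a2', 'b2')]
```

Chaining a second step of the same shape over the AB output gives the three-way join ABC from the diagram.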
Share Common Read Operations
[Diagram: two aggregation queries over the same pv_users table (pageid, age) share one scan; Map-Reduce groups rows such as (1, 25) and (2, 32) and emits per-group partial sums and counts]
Map-side Aggregation / Combiner
[Diagram: on Machine 1, map outputs <male, 343>, <male, 123>, <female, 128>, <female, 244> are pre-aggregated by the combiner into <male, 466> and <female, 372> before the shuffle]
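The combiner in this picture is just a local pre-aggregation run on each machine before the shuffle; a minimal sketch:

```python
from collections import defaultdict

# Combiner: sum values per key locally, shrinking what crosses the network.
def combine(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

machine1 = [("male", 343), ("female", 128), ("male", 123), ("female", 244)]
print(combine(machine1))  # [('female', 372), ('male', 466)]
```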
Query Rewrite
• Predicate Push-down
– select * from (select * from t) where col1 = ‘2008’;
• Column Pruning
– select col1, col3 from (select * from t);
TODO: Column-based Storage and Map-side Join
• MetaStore UI:
– Browse and navigate all tables in the system
– Comment on each table and each column
– Also captures data dependencies
• HiPal:
– Interactively construct SQL queries by mouse clicks
– Support projection, filtering, group by and joining
– Also support
Hive Query Language
• Philosophy
– SQL
– Map-Reduce with custom scripts (hadoop streaming)
• Query Operators
– Projections
– Equi-joins
– Group by
– Sampling
– Order By
Hive QL – Custom Map/Reduce Scripts
• Extended SQL:
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
  REDUCE map.dt, map.uid
  USING 'reduce_script' AS (date, count);
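Here 'map_script' and 'reduce_script' are user-supplied executables that Hive streams rows through. A sketch of what such a pair could do (the tab-separated row format and the counting logic are assumptions for illustration, not Hive's definition):

```python
from collections import defaultdict

# map_script: reads "userid<TAB>date" lines, emits "dt<TAB>uid" lines.
def map_script(lines):
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

# reduce_script: reads "dt<TAB>uid" lines clustered by dt,
# emits one "date<TAB>count" line per distinct dt.
def reduce_script(lines):
    counts = defaultdict(int)
    for line in lines:
        dt, _uid = line.rstrip("\n").split("\t")
        counts[dt] += 1
    for dt, n in sorted(counts.items()):
        yield f"{dt}\t{n}"

mapped = map_script(["u1\t2008-03-01", "u2\t2008-03-01", "u3\t2008-03-02"])
print(list(reduce_script(mapped)))  # ['2008-03-01\t2', '2008-03-02\t1']
```

In the real pipeline each script reads stdin and writes stdout; CLUSTER BY dt guarantees all rows with the same dt reach the same reduce_script instance.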