Lez.d-01-Hadoop (A) Intro
Contents
Motivation
Scale of Cloud Computing
Hadoop
Hadoop Distributed File System (HDFS)
MapReduce
Motivation - Traditional Distributed Systems
Processor-bound workloads force the use of multiple machines
The developer is then burdened with managing too many things:
Synchronization
Failures
And the burden grows with scale: a couple of CPUs? 10s of CPUs? 100s of CPUs?
Hadoop @ Yahoo!
What we need
Handling failure
One computer fails roughly once in 1000 days, so a 1000-node cluster sees about one failure per day
Easy scalability
Performance should increase/decrease in proportion to the number of nodes added/removed
What we’ve got : Hadoop!
HDFS
Hadoop Distributed File System
How HDFS works
NameNode - the master
Secondary NameNode
DataNodes - the slaves
Storing a file on HDFS
Motivation: reliability, availability, network bandwidth
The input file (say 1 TB) is split into smaller chunks/blocks of 64 MB (or multiples of 64 MB)
The chunks are stored as independent files on multiple slave nodes
To ensure that data is not lost, replicas are stored in the following way:
One on the local node
One on a remote rack (in case the local rack fails)
One on the local rack (in case the local node fails)
Others randomly placed
Default replication factor is 3
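Both the block size and the replication factor are administrator-configurable. A minimal hdfs-site.xml sketch, assuming the classic Hadoop 1.x property name dfs.block.size (later renamed dfs.blocksize); the values shown are illustrative defaults, not recommendations:

```xml
<!-- hdfs-site.xml (illustrative values) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64 MB, in bytes -->
  </property>
</configuration>
```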
[Diagram: a file is split into blocks B1, B2, B3, ..., Bn; the master node places the blocks on data nodes spread across two racks (Hub 1, Hub 2), with replicas of B1 and B2 on different nodes; rack-level links are 8 gigabit, links to individual data nodes 1 gigabit]
The master node: NameNode
Functions:
Manages the file system - mapping files to blocks and blocks to DataNodes
Maintains the status of DataNodes
Heartbeat
Each DataNode sends a heartbeat at regular intervals
If no heartbeat is received, the DataNode is declared dead
Blockreport
Each DataNode sends the list of blocks it holds
Used to check the health of HDFS
NameNode Functions
Replication, triggered:
On DataNode failure
On disk failure
On block corruption
Data integrity
A checksum is kept for each block
Decommissioning
HDFS Robustness
Safemode
At startup no replication is possible
The NameNode receives Heartbeats and Blockreports from the DataNodes
It leaves Safemode only after a configurable percentage of blocks are verified to meet the defined replication factor
HDFS Summary
Fault tolerant
Scalable
Reliable
Files are split into large blocks for
Efficient reads
Parallel access
Questions?
MapReduce
What is MapReduce?
Origin: Functional Programming
Sum of squares example
Input: [1,2,3,4]
MAPPER: mappers M1..M4 compute Sq(1) Sq(2) Sq(3) Sq(4)
Intermediate output: 1 4 9 16
REDUCER: R1 sums them to produce the output 30
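The map and reduce steps above can be sketched in plain Java (a local simulation of the idea, not the Hadoop API; class and method names are ours):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SumOfSquares {
    // "map" phase: square each input record independently
    static List<Integer> map(List<Integer> input) {
        return input.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // "reduce" phase: fold all intermediate values into one sum
    static int reduce(List<Integer> intermediate) {
        return intermediate.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4);
        System.out.println(reduce(map(input))); // 1 + 4 + 9 + 16 = 30
    }
}
```

Each map call touches only one record, which is what makes the phase trivially parallel.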
Sum of squares of even and odd
Input: [1,2,3,4]
MAPPER: mappers M1..M4 compute Sq(1) Sq(2) Sq(3) Sq(4)
Intermediate output: (odd, 1) (even, 4) (odd, 9) (even, 16)
REDUCER: R1 and R2 produce the output (even, 20) (odd, 10)
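The keyed version can be sketched the same way: the mapper chooses a key per record, the shuffle groups by key, and one reducer sums each group (again a plain-Java simulation with names of our choosing, not the Hadoop API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EvenOddSquares {
    // map: emit (parity key, x*x); groupingBy plays the role of the shuffle;
    // summingInt is the per-key reduce
    static Map<String, Integer> mapReduce(List<Integer> input) {
        return input.stream().collect(Collectors.groupingBy(
                x -> x % 2 == 0 ? "even" : "odd", // key chosen by the mapper
                Collectors.summingInt(x -> x * x))); // per-key reduce
    }

    public static void main(String[] args) {
        System.out.println(mapReduce(List.of(1, 2, 3, 4)));
        // (odd, 1+9=10), (even, 4+16=20)
    }
}
```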
Programming model - (key, value) pairs
Sum of squares of even, odd and prime
Input: [1,2,3,4]
MAPPER: Sq(1) Sq(2) Sq(3) Sq(4)
Output: R1 -> (even, 20), R2 -> (odd, 10), R3 -> (prime, 13)
Many keys, many values
Fibonacci sequence
Input: a 1 TB text file containing color names - Blue, Green, Yellow, Purple, Pink, Red, Maroon, Grey,
[Diagram: counting Blue and Green in a distributed file. Node N1 holds chunk f.001, node Nn holds chunk f.00n; each chunk is a list of color names (Blue, Purple, Green, Red, Maroon, Yellow, ...)]
MAPPER - run on each node's chunk: grep 'Blue\|Green'
COMBINER - aggregate locally on each node:
sort | uniq -c
or: awk '{arr[$1]++;} END{print arr["Blue"], arr["Green"]}'
producing per-node partial counts such as Blue=420, Green=200 (N1) and Blue=500, Green=200 (Nn)
REDUCER - aggregate the partial counts globally:
awk '{arr[$1]+=$2;} END{print arr["Blue"], arr["Green"]}'
producing the final result Blue=3000, Green=5500
MapReduce Overview
INPUT -> MAP -> SHUFFLE -> REDUCE -> OUTPUT
Map works on a record
Reduce works on the output of Map
MapReduce Overview (with Combiner)
INPUT -> MAP -> COMBINE -> REDUCE -> OUTPUT
MapReduce Overview
MapReduce Summary
What Hadoop is not..
Hadoop: Take Home
Questions?
Coming up next …
First Hadoop program
Your first program in Hadoop
Task:
Given a text file, generate a list of words with the number of times each appears in the file
Input: plain text file, e.g.:
hadoop is a framework written in java
and is a simple framework
hadoop supports parallel processing
Expected output: a list of <word, count> pairs, e.g. <hadoop, 2>
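Before writing it against the Hadoop API, the logic can be sketched in plain Java: the "map" step emits (word, 1) for each word, and the "reduce" step sums the 1s per word (here the two steps are folded together via merge; class and method names are ours):

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // map: split each line into words and emit (word, 1);
    // reduce: sum the 1s per word (merge does both at once here)
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "hadoop is a framework written in java\n"
                     + "hadoop supports parallel processing";
        System.out.println(countWords(input));
    }
}
```

In real MapReduce the word is the key, so the shuffle guarantees all counts for one word reach the same reducer.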
Your second program in Hadoop
Task:
Given a text file containing numbers, one per line, compute the sum of squares of the odd, even and prime numbers
Input: file containing integers, one per line, e.g.:
1, 2, 5, 3, 5, 6, 3, 7, 9, 4
Expected output: <type, sum of squares> for odd, even, prime:
<odd, 302>
<even, 278>
<prime, 323>
Your second program in Hadoop
Input value -> Map output (key, value):
3 -> <odd,9> <prime,9>
9 -> <odd,81>
6 -> <even,36>
2 -> <even,4> <prime,4>
3 -> <odd,9> <prime,9>
7 -> <odd,49> <prime,49>
8 -> <even,64>
Shuffle output (key, list of values) -> Reduce output (key, value):
odd: <9,81,9,49> -> <odd,148>
prime: <9,4,9,49> -> <prime,71>
even: <36,4,64> -> <even,104>
Your second program in Hadoop
Map (invoked on a record):
void map(int x){
    int sq = x * x;
    if (odd(x))   print("odd", sq);
    if (even(x))  print("even", sq);
    if (prime(x)) print("prime", sq);
}
Reduce (invoked on a key):
void reduce(Key key, List l){
    int sum = 0;
    for (y in l){
        sum += y;
    }
    print(key, sum);
}
Library functions:
boolean odd(int x){ ... }
boolean even(int x){ ... }
boolean prime(int x){ ... }
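The pseudocode above can be turned into runnable plain Java as a local simulation of the map -> shuffle -> reduce flow (not the Hadoop API; class and method names are ours):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OddEvenPrimeSim {
    // library function: trial-division primality test
    static boolean prime(int x) {
        if (x < 2) return false;
        for (int d = 2; d * d <= x; d++) if (x % d == 0) return false;
        return true;
    }

    // map: invoked on one record, emits (category, x*x) pairs
    static List<Map.Entry<String, Integer>> map(int x) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        int sq = x * x;
        out.add(Map.entry(x % 2 != 0 ? "odd" : "even", sq));
        if (prime(x)) out.add(Map.entry("prime", sq));
        return out;
    }

    // shuffle + reduce: group emitted values by key, sum each group
    static Map<String, Integer> run(int[] input) {
        Map<String, Integer> result = new TreeMap<>();
        for (int x : input)
            for (Map.Entry<String, Integer> kv : map(x))
                result.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return result;
    }

    public static void main(String[] args) {
        // sample records: odd = 9+81+9+49 = 148, even = 36+4+64 = 104,
        // prime = 9+4+9+49 = 71
        System.out.println(run(new int[]{3, 9, 6, 2, 3, 7, 8}));
    }
}
```

Note that one record may emit pairs for several keys (3 is both odd and prime), which is why the model is "many keys, many values".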
Your second program in hadoop
Map (invoked on a record)
Your second program in Hadoop: Reduce
Reduce (invoked on a key):
void reduce(Key key, List l){
    int sum = 0;
    for (y in l){
        sum += y;
    }
    print(key, sum);
}
Your second program in Hadoop
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "OddEvenPrime");
    job.setJarByClass(OddEvenPrime.class);
    job.setMapperClass(OddEvenPrimeMapper.class);
    job.setCombinerClass(OddEvenPrimeReducer.class);
    job.setReducerClass(OddEvenPrimeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Questions?
Coming up next …
More examples
Hadoop Ecosystem
Example: Counting Fans
Example: Counting Fans
Traditional way: central processing
Processing time is determined by the speed at which people (the data) can move
Example: Counting Fans
Hadoop way:
Appoint a processor for each block of seats (MAPPER)
Send them to their blocks and have each send a signal for every person counted
A central processor aggregates the results (REDUCER)
The per-block processors can be made smarter by aggregating locally and sending only the aggregated value (COMBINER)
Homework: Exit Polls 2014
Hadoop Ecosystem: Basics
Hadoop Distributions
Who is using Hadoop?
http://wiki.apache.org/hadoop/PoweredBy
References
For understanding Hadoop
Official Hadoop website - http://hadoop.apache.org/
http://wiki.apache.org/hadoop/
http://www.cloudera.com/hadoop-training/
http://developer.yahoo.com/hadoop/tutorial/module2.html#basics
References
Setup and Installation
Installing on Ubuntu
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
Installing on Debian
http://archive.cloudera.com/docs/_apt.html
Further Reading
http://developer.yahoo.com/hadoop/tutorial/
http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html
Questions?