Lez.d-01-Hadoop (A) Intro

This is a lesson on Hadoop given at my University during my Master's in Data Science degree under the course of Data Management and Computer Networks.

Uploaded by

Alvi Rownok

Introduction to Hadoop

1
Contents

 Motivation
 Scale of Cloud Computing
 Hadoop
 Hadoop Distributed File System (HDFS)
 MapReduce
 Sample Code Walkthrough
 Hadoop EcoSystem
2
Motivation - Traditional Distributed Systems

 Processor bound
 Use multiple machines
 Developer is burdened with managing too many things
   Synchronization
   Failures
 Data moves from shared disk to compute node
 Cost of maintaining clusters
 Scalability as and when required is not present
3
What is the scale we are talking about?

Couple of CPUs?

10s of CPUs?

100s of CPUs?

4
5
Hadoop @ Yahoo!

[Figure: Hadoop cluster deployment at Yahoo!, slides 5-7]

6
7
What we need

 Handling failure
   One computer = fails once in 1000 days
   1000 computers = 1 failure per day
 Petabytes of data to be processed in parallel
   1 HDD = 100 MB/sec
   1000 HDDs = 100 GB/sec
 Easy scalability
   Relative increase/decrease of performance depending on increase/decrease of nodes
8
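The arithmetic behind these bullets can be checked with a quick sketch. The MTBF and per-disk throughput figures are the slide's illustrative assumptions, not measurements:

```python
# Back-of-envelope cluster arithmetic (illustrative assumptions from the slide).

MTBF_DAYS = 1000          # assume one machine fails once every 1000 days
machines = 1000
failures_per_day = machines / MTBF_DAYS          # ~1 failure per day across the cluster

hdd_mb_per_sec = 100      # assume ~100 MB/s sequential read per disk
disks = 1000
aggregate_gb_per_sec = disks * hdd_mb_per_sec / 1000   # aggregate read bandwidth in GB/s

print(failures_per_day)       # 1.0
print(aggregate_gb_per_sec)   # 100.0
```

The point of the exercise: at this scale, failure is the normal case and single-disk bandwidth is the bottleneck, so the system must replicate data and read it in parallel.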
What we’ve got: Hadoop!

 Created by Doug Cutting
 Started as a module in Nutch and then matured into an Apache project
 Named after his son's stuffed elephant
9
What we’ve got: Hadoop!

 Fault-tolerant file system
   Hadoop Distributed File System (HDFS)
   Modeled on the Google File System
 Takes computation to data
   Data Locality
 Scalability:
   Program remains the same for 10, 100, 1000, ... nodes
   Corresponding performance improvement
 Parallel computation using MapReduce
 Other components - Pig, HBase, Hive, ZooKeeper
10
HDFS
Hadoop Distributed File System

11
How HDFS works

[Diagram: NameNode (Master), Secondary NameNode, and DataNodes (Slaves)]
12
Storing a file on HDFS

Motivation
 Reliability
 Availability
 Network Bandwidth

 The input file (say 1 TB) is split into smaller chunks/blocks of 64 MB (or multiples of 64 MB)
 The chunks are stored on multiple slave nodes as independent files
 To ensure that data is not lost, replicas are stored in the following way:
   One on the local node
   One on a remote rack (in case the local rack fails)
   One on the local rack (in case the local node fails)
   Others randomly placed
 Default replication factor is 3

13
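A quick worked example of what these numbers imply, assuming the 64 MB block size and replication factor 3 from the slide:

```python
import math

# How many 64 MB blocks does a 1 TB file occupy, and how much raw disk
# does it consume once every block is replicated 3 times?
TB = 1024 ** 4
MB = 1024 ** 2

file_size = 1 * TB
block_size = 64 * MB
replication = 3

num_blocks = math.ceil(file_size / block_size)    # blocks the NameNode must track
raw_storage_tb = file_size * replication / TB     # raw disk consumed across the cluster

print(num_blocks)        # 16384
print(raw_storage_tb)    # 3.0
```

Note that each of those 16384 blocks is an independent unit of placement and recovery, which is what makes parallel reads and re-replication after failure cheap.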
[Diagram: a file is split into blocks B1, B2, B3, ..., Bn by the Master Node and
distributed to data nodes grouped under Hub 1 and Hub 2, with 8-gigabit and
1-gigabit network links; replicas of B1 and B2 are placed on different hubs]
14
The master node: NameNode

Functions:
 Manages the file system - mapping files to blocks and blocks to data nodes
 Maintains status of data nodes
   Heartbeat
     DataNode sends a heartbeat at regular intervals
     If a heartbeat is not received, the DataNode is declared dead
   Blockreport
     DataNode sends the list of blocks it holds
     Used to check the health of HDFS

15
NameNode Functions

 Replication
   On DataNode failure
   On disk failure
   On block corruption
 Data integrity
   Checksum for each block
   Stored in a hidden file
 Rebalancing - balancer tool
   Addition of new nodes
   Decommissioning
   Deletion of some files

16
HDFS Robustness

 Safemode
   At startup: no replication possible
   Receives Heartbeats and Blockreports from DataNodes
   Only a percentage of blocks are checked for the defined replication factor
   If all is well → exit Safemode
 Replicate blocks wherever necessary

17
HDFS Summary

 Fault tolerant
 Scalable
 Reliable
 Files are distributed in large blocks for
   Efficient reads
   Parallel access

18
Questions?

19
MapReduce

20
What is MapReduce?

 A powerful paradigm for parallel computation
 Hadoop uses MapReduce to execute jobs on files in HDFS
 Hadoop intelligently distributes computation over the cluster
   Takes computation to data

21
Origin: Functional Programming

map f [a, b, c] = [f(a), f(b), f(c)]

map sq [1, 2, 3] = [sq(1), sq(2), sq(3)]
                 = [1, 4, 9]

 Returns a list constructed by applying a function (the first argument) to all items in the list passed as the second argument

22
Origin: Functional Programming

reduce f [a, b, c] = f(a, b, c)

reduce sum [1, 4, 9] = sum(1, sum(4, sum(9, sum(NULL))))
                     = 14

 Returns a value constructed by applying a function (the first argument) to the list passed as the second argument
 Can be identity (do nothing)

23
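The two primitives above map directly onto Python's built-ins; a minimal sketch of the same sum-of-squares pipeline:

```python
from functools import reduce

# map applies a function to every element; reduce folds the results together.
sq = lambda x: x * x
nums = [1, 2, 3]

squared = list(map(sq, nums))                 # [1, 4, 9]
total = reduce(lambda a, b: a + b, squared)   # 1 + 4 + 9 = 14

print(squared)  # [1, 4, 9]
print(total)    # 14
```

MapReduce borrows exactly this structure, with the extra twist that the framework runs the map calls in parallel on many machines.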
Sum of squares example

Input: [1, 2, 3, 4]

MAPPER (M1-M4):      Sq(1)  Sq(2)  Sq(3)  Sq(4)

Intermediate output:   1      4      9     16

REDUCER (R1)

Output: 30

24
Sum of squares of even and odd

Input: [1, 2, 3, 4]

MAPPER (M1-M4):      Sq(1)     Sq(2)      Sq(3)     Sq(4)

Intermediate output: (odd, 1)  (even, 4)  (odd, 9)  (even, 16)

REDUCER (R1, R2)

Output: (even, 20) (odd, 10)

25
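The same flow can be simulated in plain Python. This is a toy single-process sketch of the map, shuffle, and reduce phases, not actual Hadoop code:

```python
from collections import defaultdict

def mapper(x):
    # Emit a (key, value) pair: key is the parity, value is the square.
    yield ("even" if x % 2 == 0 else "odd", x * x)

def reducer(key, values):
    # Sum all values that arrived under the same key.
    return key, sum(values)

# Shuffle phase: group intermediate pairs by key before reducing.
groups = defaultdict(list)
for x in [1, 2, 3, 4]:
    for k, v in mapper(x):
        groups[k].append(v)

output = dict(reducer(k, vs) for k, vs in groups.items())
print(output)   # {'odd': 10, 'even': 20}
```

The shuffle step is what the framework does for you between map and reduce: it routes every pair with the same key to the same reducer.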
Programming model - key, value pairs

Format of input/output: (key, value)

Map: (k1, v1) → list (k2, v2)

Reduce: (k2, list v2) → list (k3, v3)

26
Sum of squares of even, odd and prime

Input: [1, 2, 3, 4]

MAPPER:              Sq(1)     Sq(2)      Sq(3)     Sq(4)

Intermediate output: (odd, 1)  (even, 4)  (odd, 9)  (even, 16)
                               (prime, 4) (prime, 9)

REDUCER (R1, R2, R3)

Output: (even, 20) (odd, 10) (prime, 13)

27
Many keys, many values

Format of input/output: (key, value)

Map: (k1, v1) → list (k2, v2)

Reduce: (k2, list v2) → list (k3, v3)

28
Fibonacci sequence

 f(n) = f(n-1) + f(n-2)
   i.e. f(5) = f(4) + f(3)
 0, 1, 1, 2, 3, 5, 8, 13, ...

• MapReduce will not work on this kind of calculation
• No inter-process communication
• No data sharing

[Diagram: recursion tree expanding f(5) into f(4) and f(3), down to f(1) and f(0) - each step depends on the results of earlier steps]

29
Input: 1 TB text file containing color names - Blue, Green, Yellow, Purple, Pink, Red, Maroon, Grey, ...

Desired output: occurrence of the colors Blue and Green

30
An analogy with Unix tools - each node Ni holds a chunk f.00i of the color file:

 MAPPER (per node):   grep 'Blue\|Green' f.00i
 COMBINER (per node): sort | uniq -c
   - emits local counts, e.g. Blue=420, Green=200 on one node; Blue=500, Green=200 on another
 REDUCER (global):    awk '{arr[$1]+=$2;} END{print arr["Blue"], arr["Green"]}'
   - emits global counts, e.g. Blue=3000, Green=5500

Without the combiner, the reducer counts raw lines instead:
 awk '{arr[$1]++;} END{print arr["Blue"], arr["Green"]}'
31
MapReduce Overview

INPUT → MAP → SHUFFLE → REDUCE → OUTPUT

 Map works on a record
 Reduce works on the output of Map

32
MapReduce Overview

INPUT → MAP → COMBINE → REDUCE → OUTPUT

 Combine works on the output of Map
 Reduce works on the output of the Combiner

33
MapReduce Overview

[Figure: end-to-end MapReduce data flow]
34
MapReduce Summary

 Mapper, reducer and combiner act on <key, value> pairs
 Mapper gets one record at a time as input
 Combiner (if present) works on the output of map
 Reducer works on the output of map (or combiner, if present)
 Combiner can be thought of as a local reducer
   Reduces output of maps that are executed on the same node

35
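The "local reducer" idea can be sketched in a few lines of Python. The node data and word counts here are made up for illustration; this is not Hadoop code:

```python
from collections import defaultdict

def combine(pairs):
    # A combiner pre-aggregates map output on the same node, so far less
    # data has to cross the network during the shuffle.
    local = defaultdict(int)
    for k, v in pairs:
        local[k] += v
    return list(local.items())

# Two map tasks on two different nodes, each emitting (word, 1) pairs.
node1 = [("blue", 1), ("blue", 1), ("green", 1)]
node2 = [("blue", 1), ("green", 1), ("green", 1)]

combined1 = combine(node1)   # e.g. [('blue', 2), ('green', 1)]
combined2 = combine(node2)   # e.g. [('blue', 1), ('green', 2)]

# The reducer then sums the pre-aggregated counts.
final = defaultdict(int)
for k, v in combined1 + combined2:
    final[k] += v

print(dict(final))   # {'blue': 3, 'green': 3}
```

Reusing the reducer as a combiner (as the sample driver later does) is only safe because summation is associative and commutative; an operation like averaging cannot be combined this way without extra bookkeeping.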
What Hadoop is not...

 It is not a POSIX file system
 It is not a SAN file system
 It is not for interactive file access
 It is not meant for a large number of small files - it is for a small number of large files
 MapReduce cannot be used for any and all applications

36
Hadoop: Take Home

 Takes computation to data
 Suitable for large data-centric operations
 Scalable on demand
 Fault tolerant and highly transparent

37
Questions?

 Coming up next …
   First Hadoop program
   Second Hadoop program

38
Your first program in hadoop

Open up any tutorial on Hadoop and the first program you see will be wordcount 

Task:
 Given a text file, generate a list of words with the number of times each of them appears in the file

Input: plain text file
  hadoop is a framework written in java
  hadoop supports parallel processing
  and is a simple framework

Expected Output:
 <word, frequency> pairs for all words in the file

<hadoop, 2>   <framework, 2>   <supports, 1>
<is, 2>       <written, 1>     <parallel, 1>
<a, 2>        <in, 1>          <processing, 1>
<java, 1>     <and, 1>         <simple, 1>

39
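A minimal single-process Python simulation of the word-count job above (not Hadoop code; it collapses map, shuffle, and reduce into one loop for illustration):

```python
from collections import defaultdict

text = ("hadoop is a framework written in java "
        "hadoop supports parallel processing "
        "and is a simple framework")

# Map: emit a (word, 1) pair for every word in the record.
pairs = [(word, 1) for word in text.split()]

# Shuffle + Reduce: group pairs by word and sum the counts.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(counts["hadoop"], counts["framework"], counts["is"], counts["a"])
# 2 2 2 2
```

In real Hadoop the map step runs on every block of the input file in parallel, and the framework performs the grouping across the network.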
Your second program in hadoop

Task:
 Given a text file containing numbers, one per line, compute the sum of squares of the odd, even and prime numbers

Input:
 File containing integers, one per line:
 1 2 5 3 5 6 3 7 9 4

Expected Output:
 <type, sum of squares> for odd, even, prime
 <odd, 302>
 <even, 278>
 <prime, 323>

40
Your second program in hadoop

File on HDFS    Map: square             Reduce input               Reduce: sum
(input value)   (key, value)            (key, list of values)      (key, value)

3               (odd,9) (prime,9)
9               (odd,81)                odd: <9,81,9,49>           <odd,148>
6               (even,36)
2               (even,4) (prime,4)      prime: <9,4,9,49>          <prime,71>
3               (odd,9) (prime,9)
7               (odd,49) (prime,49)     even: <36,4,64>            <even,104>
8               (even,64)

41
Your second program in hadoop

Map (invoked on a record):

void map(int x){
    int sq = x * x;
    if(x is odd)
        print("odd", sq);
    if(x is even)
        print("even", sq);
    if(x is prime)
        print("prime", sq);
}

Reduce (invoked on a key):

void reduce(List l){
    int sum = 0;
    for(y in l){
        sum += y;
    }
    print(Key, sum);
}

Library functions:
boolean odd(int x){ ... }
boolean even(int x){ ... }
boolean prime(int x){ ... }

42
Your second program in hadoop

Map (invoked on a record):

void map(int x){
    int sq = x * x;
    if(x is odd)
        print("odd", sq);
    if(x is even)
        print("even", sq);
    if(x is prime)
        print("prime", sq);
}
43
Your second program in hadoop: Reduce

Reduce (invoked on a key):

void reduce(List l){
    int sum = 0;
    for(y in l){
        sum += y;
    }
    print(Key, sum);
}

44
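A runnable Python sketch of the whole job, using the input from the walkthrough above. Note that 7 is prime as well as odd, so the prime bucket also receives 49; one record may feed several keys:

```python
from collections import defaultdict

def is_prime(n):
    # Trial division; fine for the small illustrative inputs here.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def mapper(x):
    # A number can land in more than one category (e.g. 3 is odd AND prime).
    sq = x * x
    yield ("odd" if x % 2 else "even", sq)
    if is_prime(x):
        yield ("prime", sq)

def reducer(key, values):
    return key, sum(values)

nums = [3, 9, 6, 2, 3, 7, 8]

# Shuffle: group intermediate (key, value) pairs by key.
groups = defaultdict(list)
for x in nums:
    for k, v in mapper(x):
        groups[k].append(v)

result = dict(reducer(k, vs) for k, vs in groups.items())
print(result)   # {'odd': 148, 'even': 104, 'prime': 71}
```

Because the reducer is a plain sum, it could also be registered as a combiner, which is exactly what the Java driver on the next slide does.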
Your second program in hadoop

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "OddEvenPrime");

    job.setJarByClass(OddEvenPrime.class);
    job.setMapperClass(OddEvenPrimeMapper.class);
    job.setCombinerClass(OddEvenPrimeReducer.class);
    job.setReducerClass(OddEvenPrimeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

45
Questions?

 Coming up next …
 More examples
 Hadoop Ecosystem

46
Example: Counting Fans

47
Example: Counting Fans

Problem: Give crowd statistics
 Count fans supporting India and Pakistan
48
Example: Counting Fans

 Traditional Way: Central Processing
   Every fan comes to the centre and presses the India/Pak button
 Issues
   Slow/Bottlenecks
   Only one processor
   Processing time determined by the speed at which people (data) can move

[Figure: central counter showing totals 45882 and 67917]
49
Example: Counting Fans

Hadoop Way
 Appoint processors per block (MAPPER)
 Send them to each block and ask them to send a signal for each person
 Central processor will aggregate the results (REDUCER)
 Can make the processor smart by asking him/her to aggregate locally and only send the aggregated value (COMBINER)
50
Homework: Exit Polls 2014

51
Hadoop EcoSystem: Basic

52
Hadoop Distributions

53
Who is using Hadoop

http://wiki.apache.org/hadoop/PoweredBy

54
References

For understanding Hadoop
 Official Hadoop website - http://hadoop.apache.org/
 Hadoop presentation wiki - http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile
 http://developer.yahoo.com/hadoop/
 http://wiki.apache.org/hadoop/
 http://www.cloudera.com/hadoop-training/
 http://developer.yahoo.com/hadoop/tutorial/module2.html#basics

55
References

Setup and Installation
 Installing on Ubuntu
   http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
   http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
 Installing on Debian
   http://archive.cloudera.com/docs/_apt.html

56
Further Reading

 Hadoop: The Definitive Guide, Tom White
 http://developer.yahoo.com/hadoop/tutorial/
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html

57
Questions?

58
