Introduction To MapReduce
Jeffrey D. Ullman
What This Course Is About
Data mining = extraction of actionable
information from (usually) very large datasets.
It is the subject of extreme hype, fear, and interest.
It’s not all about machine learning.
But a lot of it is.
Emphasis on algorithms that scale.
Parallelization often essential.
Modeling
Often, especially for ML-type algorithms, the
result is a model = a simple representation of
the data, typically used for prediction.
Example: PageRank is a number Google assigns
to each Web page, representing the
“importance” of the page.
Calculated from the link structure of the Web.
Summarizes, in one number, all the links leading to
a page.
Used to help decide which pages Google shows you.
Example: College Football Ranks
Problem: Most teams don’t play each other, and
there are weaker and stronger leagues.
So which teams are the best? Who would win a
game between teams that never played each other?
Model (old style): a list of the teams from best
to worst.
Algorithm (old style):
1. Cost of a list = # of teams out of order.
2. Start with some list (e.g., order by % won).
3. Make incremental cost improvements until
none possible (e.g., swap adjacent teams).
Example: Swap Adjacent Only
[Figure: a list of teams A-E improved by successive swaps of
adjacent teams; arrows show game results, with tail = winner
and head = loser.]
Distributed File Systems
Chunking
Replication
Distribution on Racks
Commodity Clusters
Standard architecture:
1. Cluster of commodity Linux nodes (compute
nodes).
= processor + main memory + disk.
2. Gigabit Ethernet interconnect.
Cluster Architecture
[Figure: racks of compute nodes, each rack with its own switch;
1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone
among racks.]
File Chunks
3-way replication of
files, with copies on
different racks.
Alternative: Erasure Coding
More recent approach to stable file storage.
Allows reconstruction of a lost chunk.
Advantage: less redundancy for a given
probability of loss.
Disadvantage: no choice regarding where to
obtain a given chunk.
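As a toy illustration of the idea, a single XOR parity chunk is the simplest erasure code: any one lost data chunk can be reconstructed from the survivors. (Production systems typically use Reed-Solomon codes over more chunks; the data below is invented for the sketch.)

```python
# Minimal erasure-coding sketch: one XOR parity chunk over three data chunks.
# Redundancy is 4/3, versus 3x for triple replication.
def xor_chunks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data_chunks = [b"chunk-A!", b"chunk-B!", b"chunk-C!"]

# Parity = XOR of all data chunks.
parity = data_chunks[0]
for c in data_chunks[1:]:
    parity = xor_chunks(parity, c)

# Suppose chunk 1 is lost: XOR the parity with all surviving chunks.
recovered = parity
for i, c in enumerate(data_chunks):
    if i != 1:
        recovered = xor_chunks(recovered, c)

assert recovered == data_chunks[1]
```

Note the disadvantage from the slide: to rebuild the lost chunk we must read all surviving chunks, whereas replication lets us fetch any intact copy directly.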
MapReduce
A Quick Introduction
Word-Count Example
Fault-Tolerance
What Does MapReduce Give You?
1. Easy parallel programming.
2. Invisible management of hardware and
software failures.
3. Easy management of very-large-scale data.
MapReduce in a Nutshell
A MapReduce job starts with a collection of input
elements of a single type.
Technically, all types are key-value pairs.
Apply a user-written Map function to each
input element, in parallel.
Mapper applies the Map function to a single element.
Many mappers grouped in a Map task (the unit of
parallelism).
The output of the Map function is a set of 0, 1, or
more key-value pairs.
The system sorts all the key-value pairs by key,
grouping the values for each key into a list.
In a Nutshell – (2)
Another user-written function, the Reduce
function, is applied to each key-(list of
values) pair.
Application of the Reduce function to one key and its
list of values is a reducer.
Often, many reducers are grouped into a Reduce task.
Each reducer produces some output, and the
output of the entire job is the union of what is
produced by each reducer.
MapReduce Pattern
[Figure: input elements flow to mappers, which emit key-value
pairs; the pairs are grouped by key and passed to reducers,
which produce the output.]
Example: Word Count
We have a large file of documents (the input
elements).
Documents are sequences of words.
Count the number of times each distinct word
appears in the file.
Word Count Using MapReduce

map(key, value):
// key: document ID; value: text of document
FOR (each word w IN value)
emit(w, 1);

reduce(key, value-list):
// key: a word; value-list: a list of integers
result = 0;
FOR (each integer v on value-list)
result += v;
emit(key, result);

The value-list is expected to be all 1's, but
"combiners" allow local summing of integers with
the same key before passing to reducers.
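The pseudocode above can be simulated sequentially in a few lines of Python; this sketches the semantics (map, group-by-key, reduce), not the distributed execution, and the sample documents are invented:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit (word, 1) for each word in the document.
    for w in text.split():
        yield (w, 1)

def reduce_fn(word, counts):
    # Sum the list of integers for one word.
    yield (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all mapped key-value pairs by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Apply Reduce to each key-(list of values) pair.
    out = {}
    for k, vs in sorted(groups.items()):
        for key, result in reduce_fn(k, vs):
            out[key] = result
    return out

docs = [(1, "the cat sat"), (2, "the cat ran")]
print(mapreduce(docs, map_fn, reduce_fn))
# {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

A combiner would simply be `reduce_fn` applied to each mapper's local output before the shuffle; for summing integers that gives the same final answer.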
Coping With Failures
MapReduce is designed to deal with compute
nodes failing to execute a Map task or Reduce
task.
Re-execute failed tasks, not whole jobs.
Key point: MapReduce tasks have the
blocking property: no output is used
until task is complete.
Thus, we can restart a Map task that failed
without fear that a Reduce task has already
used some output of the failed Map task.
Some MapReduce Algorithms
Relational Join
Matrix Multiplication in One or Two Rounds
Relational Join
Stick the tuples of two relations together when
they agree on common attributes (column
names).
Example: R(A,B) JOIN S(B,C) = {abc | ab is in R
and bc is in S}.
R(A,B)        S(B,C)        R JOIN S = (A,B,C)
1   2         2   6         1   2   6
3   2         5   7         3   2   6
4   5         5   8         4   5   7
9  10                       4   5   8
The Map Function
Each tuple (a,b) in R is mapped to key = b, value
= (R,a).
Note: “R” in the value is just a bit that means “this
value represents a tuple in R, not S.”
Each tuple (b,c) in S is mapped to key = b, value
= (S,c).
After grouping by keys, each reducer gets a key-
list that looks like
(b, [(R,a1), (R,a2),…, (S,c1), (S,c2),…]).
The Reduce Function
For each pair (R,a) and (S,c) on the list for key b,
emit (a,b,c).
Note this process can produce a quadratic number of
outputs as a function of the list length.
If you took CS245, you may recognize this algorithm
as essentially a “parallel hash join.”
It’s a really efficient way to join relations, as long as
you don’t have too many tuples with a common
shared value.
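Using the R and S from the example table, the whole join can be sketched sequentially in Python; the explicit grouping step stands in for the system's sort/shuffle:

```python
from collections import defaultdict

R = [(1, 2), (3, 2), (4, 5), (9, 10)]   # tuples (a, b) of R(A,B)
S = [(2, 6), (5, 7), (5, 8)]            # tuples (b, c) of S(B,C)

# Map: key = b; tag each value with the relation it came from.
pairs = [(b, ("R", a)) for (a, b) in R] + [(b, ("S", c)) for (b, c) in S]

# Group by key (the system's sort/shuffle step).
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: pair every (R,a) with every (S,c) on the list for key b.
result = []
for b, values in groups.items():
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    result.extend((a, b, c) for a in r_side for c in s_side)

print(sorted(result))
# [(1, 2, 6), (3, 2, 6), (4, 5, 7), (4, 5, 8)]
```

The nested loop in the reducer is where the quadratic blowup mentioned on the slide occurs: a key with many R-tuples and many S-tuples produces their cross product.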
Two-Pass Matrix Multiplication
Multiply matrix M = [mij] by N = [njk].
Want: P = [pik], where pik = Σj mij*njk.
First pass is similar to relational join; second
pass is a group+aggregate operation.
Typically, large matrices are sparse (mostly
0's), so we can represent the nonzero element mij
as the tuple (i, j, mij) of a relation.
Similarly for njk.
0 elements are not represented at all.
The Map and Reduce Functions
The Map function: mij -> key = j, value = (M,i,mij);
njk -> key = j, value = (N,k,njk).
As for join, M and N here are bits indicating which
relation the value comes from.
The Reduce function: for key j, pair each
(M,i,mij) on its list with each (N,k,njk) and
produce key = (i,k), value = mij * njk.
The Second Pass
The Map function: The identity function.
Result is that each key (i,k) is paired with the list
of products mij * njk for all j.
The Reduce function: sum all the elements on
the list, and produce key = (i,k), value = that
sum.
I.e., each output element ((i,k),s) says that the
element pik of the product matrix P is s.
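Both passes can be sketched sequentially in Python, on small assumed matrices (M and N below are invented 2x2 examples in the sparse tuple representation):

```python
from collections import defaultdict

# Sparse matrices as relations of nonzero entries: (i, j, m_ij) and (j, k, n_jk).
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]

# Pass 1, Map: key = j, with tagged values; here grouped directly.
by_j = defaultdict(lambda: ([], []))
for i, j, m in M:
    by_j[j][0].append((i, m))
for j, k, n in N:
    by_j[j][1].append((k, n))

# Pass 1, Reduce: join on j, emitting key = (i,k), value = m_ij * n_jk.
products = []
for j, (ms, ns) in by_j.items():
    for i, m in ms:
        for k, n in ns:
            products.append(((i, k), m * n))

# Pass 2: Map is the identity; group the products by (i,k) and sum.
P = defaultdict(float)
for (i, k), p in products:
    P[(i, k)] += p

print(dict(sorted(P.items())))
# {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```

Here M = [[1,2],[0,3]] and N = [[4,0],[5,6]], so P = MN = [[14,12],[15,18]], matching the output.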
Single-Pass Matrix Multiplication
We can use a single pass if:
1. Keys (reducers) correspond to output elements (i,k).
2. We send input elements to more than one reducer.
The Map function: mij -> for all k: key = (i,k), value
= (M,j,mij); njk -> for all i: key = (i,k), value =
(N,j,njk).
The Reduce function: for each (M,j,mij) on the list
for key (i,k) find the (N,j,njk) with the same j.
Multiply mij by njk and then sum the products.
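A sequential sketch of the single-pass algorithm, on the same kind of small assumed matrices; note how each element of M is replicated to every column index k of N, and each element of N to every row index i of M:

```python
from collections import defaultdict

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # nonzero entries (i, j, m_ij)
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # nonzero entries (j, k, n_jk)
I = {i for i, _, _ in M}                       # row indices of M
K = {k for _, k, _ in N}                       # column indices of N

# Map: send each input element to every reducer (i,k) that needs it.
groups = defaultdict(list)
for i, j, m in M:
    for k in K:
        groups[(i, k)].append(("M", j, m))
for j, k, n in N:
    for i in I:
        groups[(i, k)].append(("N", j, n))

# Reduce for key (i,k): match M- and N-entries with the same j, sum the products.
P = {}
for (i, k), values in groups.items():
    ms = {j: m for tag, j, m in values if tag == "M"}
    ns = {j: n for tag, j, n in values if tag == "N"}
    s = sum(ms[j] * ns[j] for j in ms if j in ns)
    if s:
        P[(i, k)] = s

print(dict(sorted(P.items())))
# {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```

The trade-off against the two-pass version is visible in the Map step: one pass of communication, but each element is replicated many times.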
Extensions to MapReduce
Data-Flow Systems
Bulk-Synchronous Systems
Tyranny of Communication
Data-Flow Systems
MapReduce uses two ranks of tasks: one for
Map, the second for Reduce.
Data flows from the first rank to the second.
Generalize in two ways:
1. Allow any number of ranks.
2. Allow functions other than Map and Reduce.
As long as data flow is in one direction only, we
can have the blocking property and allow
recovery of tasks rather than whole jobs.
Spark and Flink
Two recent data-flow systems from Apache.
Incorporate group+aggregate (GA) as an
explicit operator.
Allow any acyclic data flow involving these
primitives.
Are tuned to operate in main memory.
Big performance improvement over Hadoop in many
cases.
Example: 2-Pass MatMult
The 2-pass MapReduce algorithm had a second
Map function that didn’t really do anything.
We could think of it as a five-rank data-flow
algorithm of the form Map-GA-Reduce-GA-Reduce,
where the output forms are:
1. (j, (M,i,mij)) and (j, (N,k,njk)).
2. j with list of (M,i,mij)’s and (N,k,njk)’s.
3. ((i,k), mij * njk).
4. (i,k) with list of mij * njk’s.
5. ((i,k), pik).
Bulk-Synchronous Systems
Graph Model of Data
Some Systems Using This Model
The Graph Model
Views all computation as a recursion on some
graph.
Graph nodes send messages to one another.
Messages bunched into supersteps, where each
graph node processes all data received.
Sending individual messages would result in far too
much overhead.
Checkpoint all compute nodes after some fixed
number of supersteps.
On failure, all tasks are rolled back to the
previous checkpoint.
Example: Shortest Paths
[Figure: a node N keeps a table of the shortest paths to N it
knows about. On receiving "I found a path from node M to you
of length L," N asks: is this the shortest path from M I know
about? If so, it updates its table and sends "I found a path
from node M to you of length L+w" to each neighbor across an
edge of weight w (here 5, 3, and 6).]
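The superstep model can be simulated sequentially for single-source shortest paths; the graph and weights below are invented for illustration, and a real system like Pregel would run the per-node logic in parallel with a barrier between supersteps:

```python
import math

# Toy weighted graph (assumed example): adjacency lists with edge weights.
graph = {
    "A": {"B": 5, "C": 3},
    "B": {"D": 6},
    "C": {"B": 1, "D": 7},
    "D": {},
}

# Each node keeps its best-known distance from the source "A".
dist = {v: math.inf for v in graph}
dist["A"] = 0
active = {"A"}

while active:
    # All active nodes send "path of length dist + w" messages to neighbors.
    inbox = {v: [] for v in graph}
    for v in active:
        for u, w in graph[v].items():
            inbox[u].append(dist[v] + w)
    # Superstep barrier: every node now processes all messages it received.
    active = set()
    for v, received in inbox.items():
        best = min(received, default=math.inf)
        if best < dist[v]:        # "Is this the shortest path I know about?"
            dist[v] = best
            active.add(v)         # changed, so it sends messages next superstep

print(dist)
# {'A': 0, 'B': 4, 'C': 3, 'D': 10}
```

Note that B first learns distance 5 directly from A, then improves it to 4 via C in a later superstep, which in turn improves D; the computation halts when no node's table changes.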
Some Systems
Pregel: the original, from Google.
Giraph: open-source (Apache) Pregel.
Built on Hadoop.
GraphX: a similar front end for Spark.
GraphLab: similar system that deals more
effectively with nodes of high degree.
Will split the work for such a graph node among
several compute nodes.