Introduction To MapReduce
Jeffrey D. Ullman
What This Course Is About
Data mining = extraction of actionable
information from (usually) very large datasets.
It is the subject of extreme hype, fear, and interest.
It’s not all about machine learning.
But a lot of it is.
Emphasis on algorithms that scale.
Parallelization often essential.
Modeling
Often, especially for ML-type algorithms, the
result is a model = a simple representation of
the data, typically used for prediction.
Example: PageRank is a number Google assigns
to each Web page, representing the
“importance” of the page.
Calculated from the link structure of the Web.
Summarizes, in one number, all the links leading to
a page.
Used to help decide which pages Google shows you.
Example: College Football Ranks
Problem: Most teams don’t play each other, and
there are weaker and stronger leagues.
So which teams are the best? Who would win a
game between teams that never played each other?
Model (old style): a list of the teams from best
to worst.
Algorithm (old style):
1. Cost of a list = # of teams out of order.
2. Start with some list (e.g., order by % won).
3. Make incremental cost improvements until
none possible (e.g., swap adjacent teams).
Example: Swap Adjacent Only
[Figure: a list of teams A-E improved by successive swaps of
adjacent teams; arrows show game results, with tail = winner
and head = loser.]
Distributed File Systems
Chunking
Replication
Distribution on Racks
Commodity Clusters
Standard architecture:
1. Cluster of commodity Linux nodes (compute
nodes).
= processor + main memory + disk.
2. Gigabit Ethernet interconnect.
Cluster Architecture
[Figure: racks of compute nodes, each rack with its own switch;
1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone
among racks.]
File Chunks
3-way replication of
files, with copies on
different racks.
Alternative: Erasure Coding
More recent approach to stable file storage.
Allows reconstruction of a lost chunk.
Advantage: less redundancy for a given
probability of loss.
Disadvantage: no choice regarding where to
obtain a given chunk.
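As a toy illustration of the idea, a single XOR parity chunk is the simplest erasure code: any one lost data chunk can be reconstructed from the survivors. (Production systems typically use Reed-Solomon codes over more chunks; the data below is invented for the sketch.)

```python
# Minimal erasure-coding sketch: one XOR parity chunk over three data chunks.
# Redundancy is 4/3, versus 3x for triple replication.
def xor_chunks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data_chunks = [b"chunk-A!", b"chunk-B!", b"chunk-C!"]

# Parity = XOR of all data chunks.
parity = data_chunks[0]
for c in data_chunks[1:]:
    parity = xor_chunks(parity, c)

# Suppose chunk 1 is lost: XOR the parity with all surviving chunks.
recovered = parity
for i, c in enumerate(data_chunks):
    if i != 1:
        recovered = xor_chunks(recovered, c)

assert recovered == data_chunks[1]
```

Note the disadvantage from the slide: to rebuild the lost chunk we must read all surviving chunks, whereas replication lets us fetch any intact copy directly.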
MapReduce
A Quick Introduction
Word-Count Example
Fault-Tolerance
What Does MapReduce Give You?
1. Easy parallel programming.
2. Invisible management of hardware and
software failures.
3. Easy management of very-large-scale data.
MapReduce in a Nutshell
A MapReduce job starts with a collection of input
elements of a single type.
Technically, all types are key-value pairs.
Apply a user-written Map function to each
input element, in parallel.
Mapper applies the Map function to a single element.
Many mappers grouped in a Map task (the unit of
parallelism).
The output of the Map function is a set of 0, 1, or
more key-value pairs.
The system sorts all the key-value pairs by key,
grouping the values for each key into a list.
In a Nutshell – (2)
Another user-written function, the Reduce
function, is applied to each key-(list of
values) pair.
Application of the Reduce function to one key and its
list of values is a reducer.
Often, many reducers are grouped into a Reduce task.
Each reducer produces some output, and the
output of the entire job is the union of what is
produced by each reducer.
MapReduce Pattern
[Figure: input elements flow to mappers, which emit key-value
pairs; the pairs are grouped by key and passed to reducers,
which produce the output.]
Example: Word Count
We have a large file of documents (the input
elements).
Documents are sequences of words.
Count the number of times each distinct word
appears in the file.
Word Count Using MapReduce

map(key, value):
// key: document ID; value: text of document
FOR (each word w IN value)
emit(w, 1);

reduce(key, value-list):
// key: a word; value-list: a list of integers
result = 0;
FOR (each integer v on value-list)
result += v;
emit(key, result);

The value-list is expected to be all 1's, but
"combiners" allow local summing of integers with
the same key before passing to reducers.
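The pseudocode above can be simulated sequentially in a few lines of Python; this sketches the semantics (map, group-by-key, reduce), not the distributed execution, and the sample documents are invented:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit (word, 1) for each word in the document.
    for w in text.split():
        yield (w, 1)

def reduce_fn(word, counts):
    # Sum the list of integers for one word.
    yield (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all mapped key-value pairs by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Apply Reduce to each key-(list of values) pair.
    out = {}
    for k, vs in sorted(groups.items()):
        for key, result in reduce_fn(k, vs):
            out[key] = result
    return out

docs = [(1, "the cat sat"), (2, "the cat ran")]
print(mapreduce(docs, map_fn, reduce_fn))
# {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

A combiner would simply be `reduce_fn` applied to each mapper's local output before the shuffle; for summing integers that gives the same final answer.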
Coping With Failures
MapReduce is designed to deal with compute
nodes failing to execute a Map task or Reduce
task.
Re-execute failed tasks, not whole jobs.
Key point: MapReduce tasks have the
blocking property: no output is used
until task is complete.
Thus, we can restart a Map task that failed
without fear that a Reduce task has already
used some output of the failed Map task.
Some MapReduce Algorithms
Relational Join
Matrix Multiplication in One or Two Rounds
Relational Join
Stick the tuples of two relations together when
they agree on common attributes (column
names).
Example: R(A,B) JOIN S(B,C) = {abc | ab is in R
and bc is in S}.
R(A,B)        S(B,C)        R JOIN S = (A,B,C)
1   2         2   6         1   2   6
3   2         5   7         3   2   6
4   5         5   8         4   5   7
9  10                       4   5   8
The Map Function
Each tuple (a,b) in R is mapped to key = b, value
= (R,a).
Note: “R” in the value is just a bit that means “this
value represents a tuple in R, not S.”
Each tuple (b,c) in S is mapped to key = b, value
= (S,c).
After grouping by keys, each reducer gets a key-
list that looks like
(b, [(R,a1), (R,a2),…, (S,c1), (S,c2),…]).
The Reduce Function
For each pair (R,a) and (S,c) on the list for key b,
emit (a,b,c).
Note this process can produce a quadratic number of
outputs as a function of the list length.
If you took CS245, you may recognize this algorithm
as essentially a “parallel hash join.”
It’s a really efficient way to join relations, as long as
you don’t have too many tuples with a common
shared value.
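Using the R and S from the example table, the whole join can be sketched sequentially in Python; the explicit grouping step stands in for the system's sort/shuffle:

```python
from collections import defaultdict

R = [(1, 2), (3, 2), (4, 5), (9, 10)]   # tuples (a, b) of R(A,B)
S = [(2, 6), (5, 7), (5, 8)]            # tuples (b, c) of S(B,C)

# Map: key = b; tag each value with the relation it came from.
pairs = [(b, ("R", a)) for (a, b) in R] + [(b, ("S", c)) for (b, c) in S]

# Group by key (the system's sort/shuffle step).
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: pair every (R,a) with every (S,c) on the list for key b.
result = []
for b, values in groups.items():
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    result.extend((a, b, c) for a in r_side for c in s_side)

print(sorted(result))
# [(1, 2, 6), (3, 2, 6), (4, 5, 7), (4, 5, 8)]
```

The nested loop in the reducer is where the quadratic blowup mentioned on the slide occurs: a key with many R-tuples and many S-tuples produces their cross product.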
Two-Pass Matrix Multiplication
Multiply matrix M = [mij] by N = [njk].
Want: P = [pik], where pik = Σj mij*njk.
First pass is similar to relational join; second
pass is a group+aggregate operation.
Typically, large matrices are sparse (mostly
0's), so we can represent the nonzero element mij
as the tuple (i, j, mij) of a relation.
Similarly for njk.
0 elements are not represented at all.
The Map and Reduce Functions
The Map function: mij -> key = j, value = (M,i,mij);
njk -> key = j, value = (N,k,njk).
As for join, M and N here are bits indicating which
relation the value comes from.
The Reduce function: for key j, pair each
(M,i,mij) on its list with each (N,k,njk) and
produce key = (i,k), value = mij * njk.
The Second Pass
The Map function: The identity function.
Result is that each key (i,k) is paired with the list
of products mij * njk for all j.
The Reduce function: sum all the elements on
the list, and produce key = (i,k), value = that
sum.
I.e., each output element ((i,k),s) says that the
element pik of the product matrix P is s.
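Both passes can be sketched sequentially in Python, on small assumed matrices (M and N below are invented 2x2 examples in the sparse tuple representation):

```python
from collections import defaultdict

# Sparse matrices as relations of nonzero entries: (i, j, m_ij) and (j, k, n_jk).
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]

# Pass 1, Map: key = j, with tagged values; here grouped directly.
by_j = defaultdict(lambda: ([], []))
for i, j, m in M:
    by_j[j][0].append((i, m))
for j, k, n in N:
    by_j[j][1].append((k, n))

# Pass 1, Reduce: join on j, emitting key = (i,k), value = m_ij * n_jk.
products = []
for j, (ms, ns) in by_j.items():
    for i, m in ms:
        for k, n in ns:
            products.append(((i, k), m * n))

# Pass 2: Map is the identity; group the products by (i,k) and sum.
P = defaultdict(float)
for (i, k), p in products:
    P[(i, k)] += p

print(dict(sorted(P.items())))
# {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```

Here M = [[1,2],[0,3]] and N = [[4,0],[5,6]], so P = MN = [[14,12],[15,18]], matching the output.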
Single-Pass Matrix Multiplication
We can use a single pass if:
1. Keys (reducers) correspond to output elements (i,k).
2. We send input elements to more than one reducer.
The Map function: mij -> for all k: key = (i,k), value
= (M,j,mij); njk -> for all i: key = (i,k), value =
(N,j,njk).
The Reduce function: for each (M,j,mij) on the list
for key (i,k) find the (N,j,njk) with the same j.
Multiply mij by njk and then sum the products.
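A sequential sketch of the single-pass algorithm, on the same kind of small assumed matrices; note how each element of M is replicated to every column index k of N, and each element of N to every row index i of M:

```python
from collections import defaultdict

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # nonzero entries (i, j, m_ij)
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # nonzero entries (j, k, n_jk)
I = {i for i, _, _ in M}                       # row indices of M
K = {k for _, k, _ in N}                       # column indices of N

# Map: send each input element to every reducer (i,k) that needs it.
groups = defaultdict(list)
for i, j, m in M:
    for k in K:
        groups[(i, k)].append(("M", j, m))
for j, k, n in N:
    for i in I:
        groups[(i, k)].append(("N", j, n))

# Reduce for key (i,k): match M- and N-entries with the same j, sum the products.
P = {}
for (i, k), values in groups.items():
    ms = {j: m for tag, j, m in values if tag == "M"}
    ns = {j: n for tag, j, n in values if tag == "N"}
    s = sum(ms[j] * ns[j] for j in ms if j in ns)
    if s:
        P[(i, k)] = s

print(dict(sorted(P.items())))
# {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```

The trade-off against the two-pass version is visible in the Map step: one pass of communication, but each element is replicated many times.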
Extensions to MapReduce
Data-Flow Systems
Bulk-Synchronous Systems
Tyranny of Communication
Data-Flow Systems
MapReduce uses two ranks of tasks: one for
Map, the second for Reduce.
Data flows from the first rank to the second.
Generalize in two ways:
1. Allow any number of ranks.
2. Allow functions other than Map and Reduce.
As long as data flow is in one direction only, we
can have the blocking property and allow
recovery of tasks rather than whole jobs.
Spark and Flink
Two recent data-flow systems from Apache.
Incorporate group+aggregate (GA) as an
explicit operator.
Allow any acyclic data flow involving these
primitives.
Are tuned to operate in main memory.
Big performance improvement over Hadoop in many
cases.
Example: 2-Pass MatMult
The 2-pass MapReduce algorithm had a second
Map function that didn’t really do anything.
We could think of it as a five-rank data-flow
algorithm of the form Map-GA-Reduce-GA-Reduce,
where the output forms are:
1. (j, (M,i,mij)) and (j, (N,k,njk)).
2. j with list of (M,i,mij)’s and (N,k,njk)’s.
3. ((i,k), mij * njk).
4. (i,k) with list of mij * njk’s.
5. ((i,k), pik).
Bulk-Synchronous Systems
Graph Model of Data
Some Systems Using This Model
The Graph Model
Views all computation as a recursion on some
graph.
Graph nodes send messages to one another.
Messages bunched into supersteps, where each
graph node processes all data received.
Sending individual messages would result in far too
much overhead.
Checkpoint all compute nodes after some fixed
number of supersteps.
On failure, all tasks are rolled back to the
previous checkpoint.
Example: Shortest Paths
[Figure: a node N keeps a table of the shortest paths to N it
knows about. On receiving "I found a path from node M to you
of length L," N asks: is this the shortest path from M I know
about? If so, it updates its table and sends "I found a path
from node M to you of length L+w" to each neighbor across an
edge of weight w (here 5, 3, and 6).]
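The superstep model can be simulated sequentially for single-source shortest paths; the graph and weights below are invented for illustration, and a real system like Pregel would run the per-node logic in parallel with a barrier between supersteps:

```python
import math

# Toy weighted graph (assumed example): adjacency lists with edge weights.
graph = {
    "A": {"B": 5, "C": 3},
    "B": {"D": 6},
    "C": {"B": 1, "D": 7},
    "D": {},
}

# Each node keeps its best-known distance from the source "A".
dist = {v: math.inf for v in graph}
dist["A"] = 0
active = {"A"}

while active:
    # All active nodes send "path of length dist + w" messages to neighbors.
    inbox = {v: [] for v in graph}
    for v in active:
        for u, w in graph[v].items():
            inbox[u].append(dist[v] + w)
    # Superstep barrier: every node now processes all messages it received.
    active = set()
    for v, received in inbox.items():
        best = min(received, default=math.inf)
        if best < dist[v]:        # "Is this the shortest path I know about?"
            dist[v] = best
            active.add(v)         # changed, so it sends messages next superstep

print(dist)
# {'A': 0, 'B': 4, 'C': 3, 'D': 10}
```

Note that B first learns distance 5 directly from A, then improves it to 4 via C in a later superstep, which in turn improves D; the computation halts when no node's table changes.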
Some Systems
Pregel: the original, from Google.
Giraph: open-source (Apache) Pregel.
Built on Hadoop.
GraphX: a similar front end for Spark.
GraphLab: similar system that deals more
effectively with nodes of high degree.
Will split the work for such a graph node among
several compute nodes.