Big Data Unit-2 PPT part2
Map Reduce
[Diagram slide: data flowing through alternating Map and Reduce phases]
Example of MapReduce for finding word count in a big document
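As a sketch of how this example looks in code: a minimal Java word-count mapper and reducer using the standard Hadoop 2.x API. The class names are illustrative, not from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the 1s emitted for each word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}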
How MapReduce works?
Input: This is the input data / file to be processed.
Split: Hadoop splits the incoming data into smaller pieces
called "splits".
Map: In this step, MapReduce processes each split
according to the logic defined in the map() function. Each
mapper works on one split at a time. Each mapper is
treated as a task, and multiple tasks are executed across
different TaskTrackers and coordinated by the JobTracker.
Combine: This is an optional step, used to
improve performance by reducing the amount of data
transferred across the network. The combiner applies the same
logic as the reduce step and aggregates the output
of the map() function before it is passed to the subsequent
steps, as the fragment below shows.
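Since word-count summation is associative and commutative, the reducer can double as the combiner. A one-line fragment, assuming a configured org.apache.hadoop.mapreduce.Job named job and the WordCountReducer sketched earlier:

// Reuse the reducer as a combiner: partial sums are computed on the
// map side, shrinking the data shuffled across the network.
job.setCombinerClass(WordCountReducer.class);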
Shuffle & Sort: In this step, outputs from all the mappers
is shuffl ed, sorted to put them in order, and grouped before
sending them to the next step.
Reduce: This step aggregates the outputs of the
mappers using the reduce() function. The output of the reducer is
sent to the next and final step. Each reducer is treated as a task.
Mapper Task
⚫ It is the first phase of MapReduce
programming and contains the coding
logic of the mapper function.
⚫ The mapper function accepts key-value pairs
(k, v) as input, where the key represents the
offset address of each record and the value
represents the entire record content.
Shuffle and Sort
• The output of the various mappers then goes into
the Shuffle and Sort phase.
• Values are grouped together based on their keys, so that
all values belonging to the same key are brought together.
• The output of the Shuffle and Sort phase will be
the input of the Reducer phase.
Reducer Task
⚫ The output of the Shuffle and Sort phase
will be the input of the Reducer phase.
• Reducer consolidates the outputs of the various
mappers and computes the final job output.
• The final output is then written into a single
file in an output directory of HDFS.
Anatomy of a Map-Reduce Job Run
• A typical Hadoop MapReduce job is divided
into a set of Map and Reduce tasks that
execute on a Hadoop cluster.
• The execution flow occurs as follows:
• Input data is split into small subsets of
data.
• Map tasks work on these data splits.
• The intermediate output data from the Map
tasks is then submitted to the Reduce task(s)
after an intermediate process called
‘shuffle’.
• The Reduce task(s) work on this
intermediate data.
Anatomy of a Map-Reduce Job Run
[Table slide: each component involved in a job run and its function]
Developing A Map-Reduce Application
MRUnit
⚫ Testing library for MapReduce
⚫ Developed by Cloudera
⚫ Easy Integration with MapReduce and
standard testing tools
⚫ MRUnit is a Java-based testing framework
(built on JUnit) designed specifically for unit
testing Hadoop MapReduce programs.
⚫ It allows developers to test their Mapper,
Reducer, and Driver classes without
deploying them on a Hadoop cluster.
⚫ This makes it easy to develop as
well as to maintain Hadoop MapReduce
code bases
⚫ MRUnit allows you to do TDD (Test-Driven
Development) and write lightweight unit tests
which accommodate Hadoop’s specific
architecture and constructs.
How does an MRUnit test work?
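A minimal sketch of an MRUnit test for the illustrative word-count mapper shown earlier. MapDriver is MRUnit's in-memory test driver; the input and expected output here are made-up test data.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Drives the mapper entirely in memory; no cluster or HDFS needed.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest(); // fails if actual output differs
    }
}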
Benefits of MRUnit
• Fast and lightweight
• Catches basic errors quickly
• Easy to get up and running
Test Data and local tests
Before running jobs on a real Hadoop
cluster, it is essential to test the logic locally
to ensure correctness and efficiency.
What is Test Data in Hadoop?
Test data refers to sample input data used to verify the
correctness of Hadoop programs before processing large
datasets.
Methods for Local Testing
A. Local (Standalone) Mode
• Runs Hadoop programs on a single machine without
HDFS.
• Used for debugging and small-scale testing.
• Command to run a job in local mode:
hadoop jar your-hadoop-job.jar input.txt output/
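Local mode can also be forced from the driver. A fragment using the standard Hadoop 2.x property names:

import org.apache.hadoop.conf.Configuration;

// Force local (standalone) execution: the whole job runs in a single
// JVM against the local filesystem, which makes debugging easy.
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local");
conf.set("fs.defaultFS", "file:///");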
Failures in Hadoop MapReduce
3. Node Failure
A node running tasks may crash or become
unreachable. The ResourceManager reassigns its tasks
to another node.
4. JobTracker / ResourceManager Failure (In
Hadoop 1.x and 2.x respectively)
In Hadoop 1.x, a JobTracker failure would cause all
running jobs to fail. In Hadoop 2.x (YARN), a
ResourceManager failure triggers recovery if High
Availability (HA) is enabled.
5. Disk or Network Failure
HDFS uses replication to prevent data loss from disk
failures.
Failures in Hadoop MapReduce
How Hadoop Handles Failures
Speculative Execution: If a task is running slower
than expected, Hadoop runs another copy and takes
the fastest result.
Retries: If a task fails, Hadoop retries it up to 4 times
(configurable).
Task Reassignment: Failed tasks are assigned to
other nodes.
Checkpointing: Intermediate results are
periodically saved to prevent reprocessing from
scratch.
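The retry and speculative-execution behaviour described above is tunable per job. A fragment, assuming a Configuration named conf; the property names are the standard Hadoop 2.x ones:

conf.setInt("mapreduce.map.maxattempts", 4);        // retries per map task
conf.setInt("mapreduce.reduce.maxattempts", 4);     // retries per reduce task
conf.setBoolean("mapreduce.map.speculative", true); // run backup copies of slow maps
conf.setBoolean("mapreduce.reduce.speculative", true);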
Job Scheduling
Job scheduling is the process whereby different
tasks get executed at a pre-determined time or
when the right event happens.
A job scheduler is a system that can be
integrated with other software systems for the
purpose of executing or notifying other
software components when a pre-
determined, scheduled time arrives.
Job Scheduling
FIFO
⚫ As the name suggests FIFO i.e. First In First
Out, so the tasks or application that comes
first will be served first.
⚫ This is the default Scheduler we use in
Hadoop.
⚫ The tasks are placed in a queue and the
tasks are performed in their submission
order.
⚫ In this method, once the job is
scheduled, no intervention is
allowed.
⚫ So sometimes a high-priority process has to
wait for a long time, since the priority of the task
does not matter in this method.
Capacity Scheduler
Example
Imagine a large bank that has a Hadoop cluster
used by multiple departments:
1. Fraud Detection Team (high priority, requires quick processing)
2. Customer Analytics Team (moderate priority, requires significant resources)
3. Log Processing Team (low priority, runs in the background)
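A job can be pointed at one of these queues from the job configuration. A fragment, assuming a Configuration named conf; the queue name "fraud" is from the example above, and the queues themselves are defined by the cluster administrator in capacity-scheduler.xml:

// Submit this job to the Capacity Scheduler queue reserved for the
// fraud-detection team.
conf.set("mapreduce.job.queuename", "fraud");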
Capacity Scheduler
How it Works in Real-Time?
1. If all teams are running jobs, each gets its minimum
guaranteed share.
Fair Scheduling
The Fair Scheduler assigns resources so that, over time, all
running jobs get a roughly equal share of the cluster.
[Figure slides: a worked example of how resources are shared with the Fair Scheduler]
Task Execution:
1. Job Submission:
• The user submits a job to the JobTracker (in Hadoop
1.x) or ResourceManager (in Hadoop 2.x/YARN).
• The job is split into multiple tasks: Map tasks and
Reduce tasks.
• The input data is split into chunks called input splits,
which are processed by map tasks.
2. Job Initialization:
• The JobTracker/ResourceManager communicates with
the NameNode to get the data's location.
• It assigns the job to an available NodeManager (in
YARN) or TaskTracker (in Hadoop 1.x).
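A minimal driver sketch for the submission steps above, using the standard Hadoop 2.x Job API. WordCountMapper and WordCountReducer are the illustrative classes from the earlier sketch; input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional combine step
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits come from here
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
    }
}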
Task Execution:
3. Map Phase (Mapper Execution):
Task Execution:
6. Output Phase:
• The final output is written to HDFS.
• The output is typically stored in a distributed manner.
MapReduce Types
1. Input Formats
(i) TextInputFormat (Default)
(ii) KeyValueTextInputFormat
(iii) SequenceFileInputFormat
(iv) SequenceFileAsTextInputFormat
(v) NLineInputFormat
(vi) MultipleInputs
MapReduce Types
2. Output Formats
(i) TextOutputFormat (Default)
(ii) KeyValueTextOutputFormat
(iii) SequenceFileOutputFormat
(iv) MapFileOutputFormat
(v) MultipleOutputs
(vi) LazyOutputFormat
(vii) DBOutputFormat
MapReduce Types
1. Input Formats
(i) TextInputFormat (Default):
• Reads line-by-line from a text file.
• Each line is treated as a key-value pair:
Key → Byte offset of the line.
Value → Line content.
Use Case: Log files, CSV, structured text data
Example:
0 Hadoop is fast
20 MapReduce is powerful
ii) KeyValueTextInputFormat:
Reads key-value pairs from a text file (key and value separated by
a delimiter, usually a tab).
Key → Text before the first tab.
Value → Text after the first tab.
Use Case: Processing key-value formatted logs, configurations.
Example:
apple 5
banana 7
(iii) SequenceFileInputFormat:
Binary format that stores key-value pairs in a compressed
format.
More efficient than text-based formats.
Use Case: Large datasets where compression improves performance.
Example:
Input is stored as Writable objects:
(101, "Data Processing")
(102, "Big Data Analytics")
(iv) SequenceFileAsTextInputFormat:
• Reads a SequenceFile, but outputs the values as text.
• Useful for debugging compressed sequence files
(v) NLineInputFormat:
• Splits input so that each Mapper gets a fixed number (N) of
lines.
• Use Case: When data processing requires control over how
many lines a single Mapper processes.
(vi) MultipleInputs:
Allows multiple input files with different input formats, as the fragment below shows.
Example:
Processing CSV logs with TextInputFormat and binary data with
SequenceFileInputFormat together.
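A fragment, assuming a Job named job; CsvLogMapper and EventMapper are hypothetical mapper classes, and the paths are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One job, two inputs: each path gets its own format and mapper.
MultipleInputs.addInputPath(job, new Path("/logs/csv"),
        TextInputFormat.class, CsvLogMapper.class);
MultipleInputs.addInputPath(job, new Path("/logs/binary"),
        SequenceFileInputFormat.class, EventMapper.class);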
Output Formats
● The OutputFormat decides the way the output key-
value pairs are written to the output files by the
RecordWriter.
● The output format defines how data is written by the
Reducer.
(i) TextOutputFormat (Default):
Writes plain text files, with key-value pairs separated by a
tab.
Use Case: Simple outputs like word count results.
Example Output:
apple 10
banana 15
(ii) KeyValueTextOutputFormat:
Similar to TextOutputFormat, but allows custom separators
instead of a tab (see the fragment below).
Use Case: Custom key-value storage formats.
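A fragment, assuming a Configuration named conf; the property name is the standard Hadoop 2.x one:

// Change the separator written between key and value
// from the default tab to a comma.
conf.set("mapreduce.output.textoutputformat.separator", ",");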
3. SequenceFileAsBinaryOutputFormat: Another variant of
SequenceFileOutputFormat; it also writes keys and values to a
sequence file, in binary format.
4. MapFileOutputFormat: Another form of FileOutputFormat.
• It writes output as map files.
• The framework adds keys to a MapFile in order.
• So we need to ensure that the reducer emits keys in sorted order.
5. MultipleOutputs: This format allows writing data to files
whose names are derived from the output keys and
values.
6. LazyOutputFormat: It prevents empty files from being
created when a reducer or mapper does not produce
any output.
By default, Hadoop creates an output file (part-r-00000, part-r-00001,
etc.) for each reducer, even if the reducer doesn't write any data.
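A fragment, assuming a Job named job:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wrap the real output format so part files are only created
// when a task actually writes output.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);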
7. DBOutputFormat: Writes the job’s output key-value pairs to a
relational database table over JDBC.
Counters in Hadoop
● Counters in Hadoop MapReduce are used for tracking
statistics and metrics during job execution.
● They help monitor the progress, detect issues, and collect
performance-related data.
Counters in Hadoop
Types:
1. Built-in Counters
• File System Counters (e.g., number of bytes read/written)
• Job Counters (e.g., number of launched map tasks)
• Task Counters (e.g., CPU time taken by a task)
2. User-Defined Counters
• Custom counters created by developers to track specific events
Counters in Hadoop
2. User-Defined Counters (Custom Counters)
Hadoop allows developers to define their own counters for
application-specific metrics (see the sketch after this list).
3. Distributed Counters
• Used for global aggregation across nodes.
• Updated during map or reduce tasks and sent to the JobTracker
for consolidation.
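A minimal sketch of a user-defined counter; the mapper, enum, and record layout are illustrative, while context.getCounter() is the standard Hadoop API:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    // User-defined counter; the framework aggregates increments
    // from all tasks into one job-wide total.
    public enum QualityCounters { MALFORMED_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            context.getCounter(QualityCounters.MALFORMED_RECORDS).increment(1);
            return; // skip the bad record
        }
        context.write(new Text(fields[0]), new Text(fields[1]));
    }
}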
MapReduce Features
⚫ Scalability
⚫ Flexibility
⚫ Availability
⚫ Fast
⚫ Security and Authentication
⚫ Cost-effective solution
⚫ Parallel Programming
⚫ Simplified Programming Model
⚫ High Throughput
⚫ Supports Multiple Programming
Languages
Real world MapReduce
⚫ Log analysis
⚫ Data analysis
⚫ Recommendation mechanisms
⚫ Fraud detection
⚫ User behavior analysis
⚫ Genetic algorithms
⚫ Scheduling problems and resource planning
Real world MapReduce
1. Search Engines (Google, Bing, Yahoo)
Use case: web indexing and ranking pages.
• How it works:
• Map: parses and processes web pages to extract keywords.
• Reduce: aggregates keyword counts and ranks web pages based
on relevance.
• Example: Google’s original web-indexing pipeline was built on
MapReduce.