Big Data Unit-2 PPT part2
Map Reduce
[Diagram slide: data flowing through alternating Map and Reduce phases]
Example of MapReduce for finding word count in a big document
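As a sketch of how this example looks in code: a minimal Java word-count mapper and reducer using the standard Hadoop 2.x API. The class names are illustrative, not from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the 1s emitted for each word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}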
How MapReduce works?
Input: This is the input data / file to be processed.
Split: Hadoop splits the incoming data into smaller pieces
called "splits".
Map: In this step, MapReduce processes each split
according to the logic defined in the map() function. Each
mapper works on one split at a time. Each mapper is
treated as a task, and multiple tasks are executed across
different TaskTrackers and coordinated by the JobTracker.
Combine: This is an optional step, used to
improve performance by reducing the amount of data
transferred across the network. The combiner applies the same
logic as the reduce step and aggregates the output
of the map() function before it is passed to the subsequent
steps, as the fragment below shows.
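Since word-count summation is associative and commutative, the reducer can double as the combiner. A one-line fragment, assuming a configured org.apache.hadoop.mapreduce.Job named job and the WordCountReducer sketched earlier:

// Reuse the reducer as a combiner: partial sums are computed on the
// map side, shrinking the data shuffled across the network.
job.setCombinerClass(WordCountReducer.class);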
Shuffle & Sort: In this step, outputs from all the mappers
is shuffl ed, sorted to put them in order, and grouped before
sending them to the next step.
Reduce: This step aggregates the outputs of the
mappers using the reduce() function. The output of the reducer is
sent to the next and final step. Each reducer is treated as a task.
Mapper Task
⚫ It is the first phase of MapReduce
programming and contains the coding
logic of the mapper function.
⚫ The mapper function accepts key-value pairs
(k, v) as input, where the key represents the
offset address of each record and the value
represents the entire record content.
Shuffle and Sort
• The output of the various mappers then goes into
the Shuffle and Sort phase.
• Values are grouped together based on their keys, so that
all values belonging to the same key are brought together.
• The output of the Shuffle and Sort phase will be
the input of the Reducer phase.
Reducer Task
⚫ The output of the Shuffle and Sort phase
will be the input of the Reducer phase.
• Reducer consolidates the outputs of the various
mappers and computes the final job output.
• The final output is then written into a single
file in an output directory of HDFS.
Anatomy of a Map-Reduce Job Run
• A typical Hadoop MapReduce job is divided
into a set of Map and Reduce tasks that
execute on a Hadoop cluster.
• The execution flow occurs as follows:
• Input data is split into small subsets of
data.
• Map tasks work on these data splits.
• The intermediate output data from the Map
tasks is then submitted to the Reduce task(s)
after an intermediate process called
‘shuffle’.
• The Reduce task(s) work on this
intermediate data.
Anatomy of a Map-Reduce Job Run
[Table slide: each component involved in a job run and its function]
Developing A Map-Reduce Application
MRUnit
⚫ Testing library for MapReduce
⚫ Developed by Cloudera
⚫ Easy Integration with MapReduce and
standard testing tools
⚫ MRUnit is a Java-based testing framework
(built on JUnit) designed specifically for unit
testing Hadoop MapReduce programs.
⚫ It allows developers to test their Mapper,
Reducer, and Driver classes without
deploying them on a Hadoop cluster.
⚫ This makes it easy to develop as
well as to maintain Hadoop MapReduce
code bases
⚫ MRUnit allows you to do TDD (Test-Driven
Development) and write lightweight unit tests
which accommodate Hadoop’s specific
architecture and constructs.
How does an MRUnit test work?
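A minimal sketch of an MRUnit test for the illustrative word-count mapper shown earlier. MapDriver is MRUnit's in-memory test driver; the input and expected output here are made-up test data.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Drives the mapper entirely in memory; no cluster or HDFS needed.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest(); // fails if actual output differs
    }
}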
Benefits of MRUnit
• Fast and lightweight
• Catches basic errors quickly
• Easy to get up and running
Test Data and local tests
Before running jobs on a real Hadoop
cluster, it is essential to test the logic locally
to ensure correctness and efficiency.
What is Test Data in Hadoop?
Test data refers to sample input data used to verify the
correctness of Hadoop programs before processing large
datasets.
Methods for Local Testing
A. Local (Standalone) Mode
• Runs Hadoop programs on a single machine without
HDFS.
• Used for debugging and small-scale testing.
• Command to run a job in local mode:
hadoop jar your-hadoop-job.jar input.txt output/
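Local mode can also be forced from the driver. A fragment using the standard Hadoop 2.x property names:

import org.apache.hadoop.conf.Configuration;

// Force local (standalone) execution: the whole job runs in a single
// JVM against the local filesystem, which makes debugging easy.
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local");
conf.set("fs.defaultFS", "file:///");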
Failures in Hadoop MapReduce
3. Node Failure
A node running tasks may crash or become
unreachable. The ResourceManager reassigns its tasks
to another node.
4. JobTracker / ResourceManager Failure (In
Hadoop 1.x and 2.x respectively)
In Hadoop 1.x, a JobTracker failure would cause all
running jobs to fail. In Hadoop 2.x (YARN), a
ResourceManager failure triggers recovery if High
Availability (HA) is enabled.
5. Disk or Network Failure
HDFS uses replication to prevent data loss from disk
failures.
Failures in Hadoop MapReduce
How Hadoop Handles Failures
Speculative Execution: If a task is running slower
than expected, Hadoop runs another copy and takes
the fastest result.
Retries: If a task fails, Hadoop retries it up to 4 times
(configurable).
Task Reassignment: Failed tasks are assigned to
other nodes.
Checkpointing: Intermediate results are
periodically saved to prevent reprocessing from
scratch.
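The retry and speculative-execution behaviour described above is tunable per job. A fragment, assuming a Configuration named conf; the property names are the standard Hadoop 2.x ones:

conf.setInt("mapreduce.map.maxattempts", 4);        // retries per map task
conf.setInt("mapreduce.reduce.maxattempts", 4);     // retries per reduce task
conf.setBoolean("mapreduce.map.speculative", true); // run backup copies of slow maps
conf.setBoolean("mapreduce.reduce.speculative", true);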
Job Scheduling
Job scheduling is the process whereby different
tasks get executed at a pre-determined time or
when the right event happens.
A job scheduler is a system that can be
integrated with other software systems for the
purpose of executing or notifying other
software components when a pre-
determined, scheduled time arrives.
Job Scheduling
FIFO
⚫ As the name suggests FIFO i.e. First In First
Out, so the tasks or application that comes
first will be served first.
⚫ This is the default Scheduler we use in
Hadoop.
⚫ The tasks are placed in a queue and the
tasks are performed in their submission
order.
⚫ In this method, once the job is
scheduled, no intervention is
allowed.
⚫ So sometimes a high-priority process has to
wait for a long time, since the priority of the task
does not matter in this method.
Capacity Scheduler
Example
Imagine a large bank that has a Hadoop cluster
used by multiple departments:
1. Fraud Detection Team (high priority, requires quick processing)
2. Customer Analytics Team (moderate priority, requires significant resources)
3. Log Processing Team (low priority, runs in the background)
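A job can be pointed at one of these queues from the job configuration. A fragment, assuming a Configuration named conf; the queue name "fraud" is from the example above, and the queues themselves are defined by the cluster administrator in capacity-scheduler.xml:

// Submit this job to the Capacity Scheduler queue reserved for the
// fraud-detection team.
conf.set("mapreduce.job.queuename", "fraud");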
Capacity Scheduler
How it Works in Real-Time?
1. If all teams are running jobs, each gets its minimum
guaranteed share.
Fair Scheduling
The Fair Scheduler assigns resources so that, over time, all
running jobs get a roughly equal share of the cluster.
[Figure slides: a worked example of how resources are shared with the Fair Scheduler]
Task Execution:
1. Job Submission:
• The user submits a job to the JobTracker (in Hadoop
1.x) or ResourceManager (in Hadoop 2.x/YARN).
• The job is split into multiple tasks: Map tasks and
Reduce tasks.
• The input data is split into chunks called input splits,
which are processed by map tasks.
2. Job Initialization:
• The JobTracker/ResourceManager communicates with
the NameNode to get the data's location.
• It assigns the job to an available NodeManager (in
YARN) or TaskTracker (in Hadoop 1.x).
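A minimal driver sketch for the submission steps above, using the standard Hadoop 2.x Job API. WordCountMapper and WordCountReducer are the illustrative classes from the earlier sketch; input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional combine step
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits come from here
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
    }
}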
Task Execution:
3. Map Phase (Mapper Execution):
Task Execution:
6. Output Phase:
• The final output is written to HDFS.
• The output is typically stored in a distributed manner.
MapReduce Types
1. Input Formats
(i) TextInputFormat (Default)
(ii) KeyValueTextInputFormat
(iii) SequenceFileInputFormat
(iv) SequenceFileAsTextInputFormat
(v) NLineInputFormat
(vi) MultipleInputs
MapReduce Types
2. Output Formats
(i) TextOutputFormat (Default)
(ii) KeyValueTextOutputFormat
(iii) SequenceFileOutputFormat
(iv) MapFileOutputFormat
(v) MultipleOutputs
(vi) LazyOutputFormat
(vii) DBOutputFormat
MapReduce Types
1. Input Formats
(i) TextInputFormat (Default):
• Reads line-by-line from a text file.
• Each line is treated as a key-value pair:
Key → Byte offset of the line.
Value → Line content.
Use Case: Log files, CSV, structured text data
Example:
0 Hadoop is fast
20 MapReduce is powerful
ii) KeyValueTextInputFormat:
Reads key-value pairs from a text file (key and value separated by
a delimiter, usually a tab).
Key → Text before the first tab.
Value → Text after the first tab.
Use Case: Processing key-value formatted logs, configurations.
Example:
apple 5
banana 7
(iii) SequenceFileInputFormat:
Binary format that stores key-value pairs in a compressed
format.
More efficient than text-based formats.
Use Case: Large datasets where compression improves performance.
Example:
Input is stored as Writable objects:
(101, "Data Processing")
(102, "Big Data Analytics")
(iv) SequenceFileAsTextInputFormat:
• Reads a SequenceFile, but outputs the values as text.
• Useful for debugging compressed sequence files
(v) NLineInputFormat:
• Splits input so that each Mapper gets a fixed number (N) of
lines.
• Use Case: When data processing requires control over how
many lines a single Mapper processes.
(vi) MultipleInputs:
Allows multiple input files with different input formats, as the fragment below shows.
Example:
Processing CSV logs with TextInputFormat and binary data with
SequenceFileInputFormat together.
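A fragment, assuming a Job named job; CsvLogMapper and EventMapper are hypothetical mapper classes, and the paths are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One job, two inputs: each path gets its own format and mapper.
MultipleInputs.addInputPath(job, new Path("/logs/csv"),
        TextInputFormat.class, CsvLogMapper.class);
MultipleInputs.addInputPath(job, new Path("/logs/binary"),
        SequenceFileInputFormat.class, EventMapper.class);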
Output Formats
● The OutputFormat decides the way the output key-
value pairs are written to the output files by the
RecordWriter.
● The output format defines how data is written by the
Reducer.
(i) TextOutputFormat (Default):
Writes plain text files, with key-value pairs separated by a
tab.
Use Case: Simple outputs like word count results.
Example Output:
apple 10
banana 15
(ii) KeyValueTextOutputFormat:
Similar to TextOutputFormat, but allows custom separators
instead of a tab (see the fragment below).
Use Case: Custom key-value storage formats.
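A fragment, assuming a Configuration named conf; the property name is the standard Hadoop 2.x one:

// Change the separator written between key and value
// from the default tab to a comma.
conf.set("mapreduce.output.textoutputformat.separator", ",");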
3. SequenceFileAsBinaryOutputFormat: Another variant of
SequenceFileOutputFormat; it also writes keys and values to a
sequence file, in binary format.
4. MapFileOutputFormat: Another form of FileOutputFormat.
• It writes output as map files.
• The framework adds keys to a MapFile in order.
• So we need to ensure that the reducer emits keys in sorted order.
5. MultipleOutputs: This format allows writing data to files
whose names are derived from the output keys and
values.
6. LazyOutputFormat: It prevents empty files from being
created when a reducer or mapper does not produce
any output.
By default, Hadoop creates an output file (part-r-00000, part-r-00001,
etc.) for each reducer, even if the reducer doesn't write any data.
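A fragment, assuming a Job named job:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wrap the real output format so part files are only created
// when a task actually writes output.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);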
7. DBOutputFormat: Writes the job’s output key-value pairs to a
relational database table over JDBC.
Counters in Hadoop
● Counters in Hadoop MapReduce are used for tracking
statistics and metrics during job execution.
● They help monitor the progress, detect issues, and collect
performance-related data.
Counters in Hadoop
Types:
1. Built-in Counters
• File System Counters (e.g., number of bytes read/written)
• Job Counters (e.g., number of launched map tasks)
• Task Counters (e.g., CPU time taken by a task)
2. User-Defined Counters
• Custom counters created by developers to track specific events
Counters in Hadoop
2. User-Defined Counters (Custom Counters)
Hadoop allows developers to define their own counters for
application-specific metrics (see the sketch after this list).
3. Distributed Counters
• Used for global aggregation across nodes.
• Updated during map or reduce tasks and sent to the JobTracker
for consolidation.
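A minimal sketch of a user-defined counter; the mapper, enum, and record layout are illustrative, while context.getCounter() is the standard Hadoop API:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    // User-defined counter; the framework aggregates increments
    // from all tasks into one job-wide total.
    public enum QualityCounters { MALFORMED_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            context.getCounter(QualityCounters.MALFORMED_RECORDS).increment(1);
            return; // skip the bad record
        }
        context.write(new Text(fields[0]), new Text(fields[1]));
    }
}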
MapReduce Features
⚫ Scalability
⚫ Flexibility
⚫ Availability
⚫ Fast
⚫ Security and Authentication
⚫ Cost-effective solution
⚫ Parallel Programming
⚫ Simplified Programming Model
⚫ High Throughput
⚫ Supports Multiple Programming
Languages
Real world MapReduce
⚫ Log analysis
⚫ Data analysis
⚫ Recommendation mechanisms
⚫ Fraud detection
⚫ User behavior analysis
⚫ Genetic algorithms
⚫ Scheduling problems and resource planning
Real world MapReduce
1. Search Engines (Google, Bing, Yahoo)
Use case: web indexing and ranking pages.
• How it works:
• Map: parses and processes web pages to extract keywords.
• Reduce: aggregates keyword counts and ranks web pages based
on relevance.
• Example: Google’s original web-indexing pipeline was built on
MapReduce.