Big Data Unit-2 PPT part2

MapReduce is a programming model developed by Google for processing large-scale data in a distributed manner, widely adopted by companies like Yahoo and Netflix. It consists of two main steps: Map, which filters and sorts data, and Reduce, which aggregates the results. The framework is integral to Hadoop, enabling efficient data processing across large clusters while providing fault tolerance and resource management through Job Tracker and Task Tracker daemons.


Map Reduce

⚫ MapReduce was introduced by Google to meet the demands of its large user base for applications such as search.
⚫ Over time, MapReduce has been adopted in various forms by many companies, such as Yahoo, Netflix, and Facebook.
⚫ MapReduce is a programming model and framework for processing large-scale data in a distributed and parallel manner.
Map Reduce Framework
⚫ MapReduce is the processing unit of Hadoop.
⚫ Hadoop MapReduce is a software framework that makes it simple to write programs that process enormous volumes of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant way.
⚫ It splits the input data into smaller chunks and processes them in parallel on Hadoop commodity servers.
⚫ In the end, it collects the results from all the servers and returns a consolidated output to the application.
Map Reduce
⚫ The MapReduce algorithm is made up of two steps: 1. Map, 2. Reduce.
⚫ The first step is Map and the second step is Reduce.
 Map procedure (Transform): performs a filtering and sorting operation.
 Reduce procedure (Aggregate): performs a summary operation.
Two essential daemons of MapReduce:
 Job Tracker
 Task Tracker
⚫ The MapReduce framework has a single master JobTracker and one slave TaskTracker per cluster node. The master is in charge of scheduling the slaves' component tasks, monitoring them, and re-running any failed tasks. The slaves carry out the master's instructions.
⚫ Functions of the master Job Tracker
 Managing resources
 Scheduling tasks
 Monitoring tasks
⚫ Functions of the slave Task Tracker
 Executes the tasks
 Provides task status
Map Reduce
⚫ It performs the processing of large datasets in a distributed and parallel manner.
⚫ The MapReduce task works on <Key, Value> pairs.
⚫ Two main features of MapReduce are its parallel programming model and its large-scale distributed model.
⚫ MapReduce allows for the distributed processing of the map and reduction operations.
 Map procedure (Transform): performs a filtering and sorting operation.
 Reduce procedure (Aggregate): performs a summary operation.
Two essential daemons of MapReduce:
 Job Tracker
 Task Tracker
8
Daemons working in MapReduce

9
Example of Map Reduce for finding word count in
a Big Document

10
How MapReduce works?

11
How MapReduce works?

12
Input: This is the input data / file to be processed.
Split: Hadoop splits the incoming data into smaller pieces called "splits".
Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different Task Trackers and coordinated by the Job Tracker.
Combine: This is an optional step used to improve performance by reducing the amount of data transferred across the network. The combiner is essentially the same as the reduce step and is used for aggregating the output of the map() function before it is passed to the subsequent steps.
Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.
Reduce: This step aggregates the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task.
Mapper Task
⚫ It is the first phase of MapReduce programming and contains the coding logic of the mapper function.
⚫ The mapper function accepts key-value pairs (k, v) as input, where the key represents the offset address of each record and the value represents the entire record content.
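
For illustration, a minimal word-count Mapper sketch in Java (the class name TokenizerMapper is our own; it assumes the standard org.apache.hadoop.mapreduce API):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key  : byte offset of the line (LongWritable)
// Input value: the line itself (Text)
// Output     : (word, 1) pairs
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the record into words and emit (word, 1) for each one
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}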
Shuffle and Sort
• The output of the various mappers then goes into the Shuffle and Sort phase.
• The intermediate values are sorted and grouped together based on their keys.
• The output of the Shuffle and Sort phase becomes the input of the Reducer phase.
Reducer Task
⚫ The output of the Shuffle and Sort phase is the input of the Reducer phase.
⚫ The Reducer consolidates the outputs of the various mappers and computes the final job output.
⚫ The final output is then written to a file in an output directory of HDFS (one file per reducer).

15
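
A matching word-count Reducer sketch (again illustrative, not the deck's own code); it sums the 1s emitted by the mapper for each word:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input : (word, [1, 1, ...]) grouped by the Shuffle & Sort phase
// Output: (word, total count)
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // aggregate all counts for this word
        }
        result.set(sum);
        context.write(key, result);
    }
}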
Anatomy of a Map-Reduce Job Run
• A typical Hadoop MapReduce job is divided into a set of Map and Reduce tasks that execute on a Hadoop cluster.
• The execution flow occurs as follows:
 • Input data is split into small subsets of data.
 • Map tasks work on these data splits.
 • The intermediate data produced by the Map tasks is then submitted to the Reduce task(s) after an intermediate process called 'shuffle'.
 • The Reduce task(s) works on this intermediate data.
Anatomy of a Map-Reduce Job Run
Component - Function
 Client - Submits the job
 ResourceManager - Manages job execution (Hadoop 2.x)
 NodeManager - Manages task execution on worker nodes
 Mapper - Processes input and produces intermediate key-value pairs
 Shuffle & Sort - Organizes intermediate data for the reducers
 Reducer - Processes grouped data and produces the final output
 HDFS - Stores input and output data


19
Developing A Map-Reduce Application
Steps to Develop a MapReduce Application
 Step 1: Set Up the Environment: Install Hadoop and configure it. Set up HDFS (Hadoop Distributed File System) for input/output storage.
 Step 2: Define the Problem: Identify the input data format (e.g., text, logs, structured data). Decide what transformation needs to be performed (e.g., word count, log analysis).
 Step 3: Implement the Mapper Function: Read the input data line by line. Process each record and emit intermediate key-value pairs.

20
Developing A Map-Reduce Application
 Step 4: Implement the Reducer Function: Aggregate and process the grouped key-value pairs. Generate the final output.
 Step 5: Configure the Driver (Main Program): Set the job configuration (Mapper, Reducer, Combiner, Input & Output formats). Specify the input and output paths (see the driver sketch after this slide).
 Step 6: Compile and Package the Program: If using Java, compile the code and create a JAR file. If using Python, ensure the script is executable for Hadoop Streaming.
21
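
A possible driver sketch for Step 5, wiring together the illustrative TokenizerMapper and IntSumReducer from the earlier slides; input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional combiner (same logic as the reducer)
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR (Step 6), this is the class that the hadoop jar command in Step 8 executes.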
Developing A Map-Reduce Application
 Step 7: Upload Data to HDFS:
   hdfs dfs -mkdir /input
   hdfs dfs -put localfile.txt /input
 Step 8: Run the MapReduce Job: Execute the job using the Hadoop command-line tools.
 Step 9: Retrieve and Analyze the Output: The output is stored in HDFS.
 Step 10: Optimize & Debug (Optional): Use a combiner to optimize performance. Debug using the Hadoop logs if issues arise.
Unit Test with MRUnit
⚫ A unit test is a type of software testing that focuses on verifying the smallest testable parts of an application, called units (typically individual functions, methods, or classes), to ensure they work as expected.
⚫ MRUnit is a Java-based testing framework specifically designed for unit testing Hadoop MapReduce programs.

23
MRUnit
⚫ Testing library for MapReduce.
⚫ Developed by Cloudera.
⚫ Easy integration with MapReduce and standard testing tools.
⚫ MRUnit is a Java-based (JUnit-style) testing framework specifically designed for unit testing Hadoop MapReduce programs.
⚫ It allows developers to test their Mapper, Reducer, and Driver classes without deploying them on a Hadoop cluster.
⚫ This makes it easy to develop as well as maintain Hadoop MapReduce code bases.
24
⚫ MRUnit allows you to do TDD (Test Driven Development) and write lightweight unit tests which accommodate Hadoop's specific architecture and constructs.
⚫ Test Driven Development (TDD) is a software development approach in which test cases are developed to specify and validate what the code will do.

25
How does an MRUnit test work?
 Add the MRUnit Dependency
  Via Maven (a build automation and project management tool) or by adding the MRUnit JAR (MRUnit library) to the classpath.
 Write Your Mapper and Reducer
  Create a Mapper that processes input key-value pairs and emits intermediate key-value pairs.
  Create a Reducer that aggregates the values for each key.
 Set Up MRUnit Test Cases
  Use MapDriver to test the Mapper. Use ReduceDriver to test the Reducer.
 Run Unit Tests
26
How does an MRUnit test work?
 Run Unit Tests
  Write JUnit test cases for the Mapper, the Reducer, and the full MapReduce job.
  Use mvn test or run the test cases in your IDE.

27
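
A minimal MRUnit test sketch (JUnit 4 style), assuming the illustrative word-count Mapper and Reducer shown earlier; MapDriver and ReduceDriver come from org.apache.hadoop.mrunit.mapreduce:

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new TokenizerMapper());
        reduceDriver = ReduceDriver.newReduceDriver(new IntSumReducer());
    }

    @Test
    public void testMapper() throws Exception {
        // One input line should produce one (word, 1) pair per word
        mapDriver.withInput(new LongWritable(0), new Text("hello hello world"))
                 .withOutput(new Text("hello"), new IntWritable(1))
                 .withOutput(new Text("hello"), new IntWritable(1))
                 .withOutput(new Text("world"), new IntWritable(1))
                 .runTest();
    }

    @Test
    public void testReducer() throws Exception {
        // The reducer should sum the counts for a key
        reduceDriver.withInput(new Text("hello"),
                               Arrays.asList(new IntWritable(1), new IntWritable(1)))
                    .withOutput(new Text("hello"), new IntWritable(2))
                    .runTest();
    }
}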
Benefits of MRUnit
• Fast and lightweight
• Catches basic errors quickly
• Easy to get up and running quickly

28
Test Data and local tests
Before running jobs on a real Hadoop cluster, it is essential to test the logic locally to ensure correctness and efficiency.
Example of test data:
⚫ Suppose you want to test mobile software applications.
⚫ A mobile phone has many different applications, so to test them you need different kinds of input data, such as photos in different formats, music files in supported and unsupported formats, video files, contact files, different emails, etc.
Test Data and local tests
Local test:
⚫ A local test runs directly on your own workstation, rather than on an Android device.
⚫ Local testing is the ability to test private or internal servers and local folders on the BrowserStack Cloud.
⚫ As such, it uses your local Java Virtual Machine (JVM), rather than an Android device, to run tests. Local tests enable you to evaluate your app's logic more quickly.

30
Test Data and local tests
What is Test Data in Hadoop?
Test data refers to sample input data used to verify the correctness of Hadoop programs before processing large datasets.

Example test data (Word Count program):
Suppose we have a simple text file for a word count job:
  hello world hello
  hadoop hadoop
  mapreduce
Expected output:
  hello 2
  world 1
  hadoop 2
  mapreduce 1
31
Test Data and local tests
Local Testing in Hadoop
Advantages of testing locally:
• Faster debugging – no need to run jobs on a full Hadoop cluster.
• Saves resources – avoids unnecessary use of distributed systems.
• Ensures accuracy – helps catch errors before deploying at scale.
Methods for local testing:
A. Local (Standalone) Mode
B. MRUnit for Unit Testing
C. MiniCluster (Pseudo-Distributed Mode)

32
Test Data and local tests
Methods for local testing:
A. Local (Standalone) Mode
• Runs Hadoop programs on a single machine without HDFS.
• Used for debugging and small-scale testing.
• Command to run a job in local mode:
  hadoop jar your-hadoop-job.jar input.txt output/

B. MRUnit for Unit Testing
• MRUnit is a Java testing framework for Hadoop MapReduce.
• It tests Mappers and Reducers separately before full execution.
Test Data and local tests
C. MiniCluster (Pseudo-Distributed Mode)
• Runs a small Hadoop cluster locally for near-real testing.
• Simulates a real Hadoop environment without full cluster deployment.

Running a Local Test Step-by-Step
1. Prepare input data: create a sample text file, input.txt.
2. Write and compile the Hadoop program: create the Mapper, Reducer, and Driver classes.
3. Run the Hadoop job locally:
   hadoop jar your-hadoop-job.jar input.txt output/
4. Verify the output:
   cat output/part-r-00000
Failures in Hadoop MapReduce
Types
1. Task Failure
A Mapper or Reducer task might fail due to software bugs, data corruption, or node crashes. Hadoop retries the task on another node.
2. ApplicationMaster Failure
The ApplicationMaster can fail due to node failure or resource issues. YARN restarts the ApplicationMaster on another node.

35
Failures in Hadoop MapReduce
3. Node Failure
A node running tasks may crash or become unreachable. The ResourceManager reassigns the tasks to another node.
4. JobTracker / ResourceManager Failure (in Hadoop 1.x and 2.x respectively)
In Hadoop 1.x, a JobTracker failure would cause all running jobs to fail. In Hadoop 2.x (YARN), a ResourceManager failure triggers recovery if High Availability (HA) is enabled.
5. Disk or Network Failure
HDFS uses replication to prevent data loss from disk failures.
Failures in Hadoop MapReduce
How Hadoop Handles Failures
Speculative Execution: If a task is running slower
than expected, Hadoop runs another copy and takes
the fastest result.
Retries: If a task fails, Hadoop retries it up to 4 times (configurable; see the configuration sketch after this slide).
Task Reassignment: Failed tasks are assigned to
other nodes.
Checkpointing: Intermediate results are
periodically saved to prevent reprocessing from
scratch.

37
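
As a rough illustration of the retry and speculative-execution settings mentioned above, the limits can be set per job; the property names below are the usual Hadoop 2.x ones, and the snippet is a sketch rather than a complete driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Maximum attempts per task before the job is marked as failed (default 4)
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Enable/disable speculative execution of slow tasks
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "failure-tuning example");
        // ... set mapper/reducer/paths as usual ...
    }
}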
Job Scheduling
Job scheduling is the process by which different tasks get executed at a pre-determined time or when the right event happens.
A job scheduler is a system that can be integrated with other software systems for the purpose of executing or notifying other software components when a pre-determined, scheduled time arrives.
Schedulers are the components in YARN that are responsible for allocating resources to the various running applications.

38
How does Job Scheduling work?
• Different clients send their jobs to be performed.
• The jobs are managed by the JobTracker or ResourceManager.
• There are three scheduling schemas:
 FIFO (queue-based) scheduler
 The Fair Scheduler
 The Capacity Scheduler

39
Job Scheduling
• The JobTracker comes with these three scheduling techniques, and the default is FIFO.
• The ResourceManager offers the Capacity Scheduler and the Fair Scheduler, with the Capacity Scheduler as the default.

40
FIFO
⚫ As the name suggests, FIFO means First In First Out: the task or application that comes first will be served first.
⚫ This is the default scheduler used in Hadoop (JobTracker).
⚫ The tasks are placed in a queue and performed in their submission order.
⚫ In this method, once a job is scheduled, no intervention is allowed.
⚫ So sometimes a high-priority process has to wait for a long time, since the priority of a task does not matter in this method.
How does Job Scheduling work?

42
Capacity Scheduler
⚫ Resources are divided into queues with defined capacities to ensure multi-tenancy.
⚫ (Tenancy refers to how resources are shared among multiple users or groups in a system.)
⚫ Each queue gets a guaranteed share of resources, and multiple jobs can run concurrently.
⚫ Capacity schedulers are used when it is required to allocate the required number of resources to all the jobs.
⚫ The Capacity Scheduler allows different users to share a large Hadoop cluster in a secure way.
⚫ It gives capacity guarantees and safeguards to the organizations sharing the cluster.
Capacity Scheduler
⚫ The Capacity Scheduler in Hadoop is designed to be flexible.
⚫ While each queue has a minimum guaranteed capacity, it can borrow unused resources from other queues if they are idle.
⚫ However, when another queue submits jobs, the extra borrowed resources are preempted (taken back) to ensure fairness.
⚫ It gives capacity guarantees and safeguards to the organizations sharing the cluster.

44
Capacity Scheduler
Example
Imagine a large bank that has a Hadoop cluster used by multiple departments:
1. Fraud Detection Team (high priority, requires quick processing)
2. Customer Analytics Team (moderate priority, requires significant resources)
3. Log Processing Team (low priority, runs in the background)

45
Capacity Scheduler

46
Capacity Scheduler
How does it work in real time?
1. If all teams are running jobs, each gets its minimum guaranteed share (jobs are submitted to a specific queue, as sketched after this slide).
2. If the Fraud Detection Team needs more resources, it can borrow unused capacity from other queues (up to 80%).
3. If other teams submit jobs later, the scheduler preempts low-priority tasks to free up resources for higher-priority jobs.
4. The Log Processing Team's jobs run only when the other teams don't fully utilize the cluster.
47
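
To make the example concrete, a job is directed into a particular Capacity Scheduler queue through its configuration; the queue name fraud-detection below is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmissionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Submit this job to the (hypothetical) high-priority queue
        // configured for the Fraud Detection team.
        conf.set("mapreduce.job.queuename", "fraud-detection");

        Job job = Job.getInstance(conf, "fraud detection job");
        // ... set mapper/reducer/paths as usual ...
    }
}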
Fair Scheduling
⚫ The Fair Scheduler is a resource management tool designed to allocate cluster resources equitably among multiple jobs and users.
⚫ Its primary goal is to ensure that all jobs receive, on average, an equal share of resources over time.
⚫ The priority of each job is taken into consideration.
⚫ When a single job is running, it gets the entire cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.

48
Fair Scheduling
⚫ A short job belonging to one user will complete in a reasonable time even while another user's long job is running, and the long job will still make progress. Jobs are placed in pools, and by default, each user gets their own pool.
⚫ The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, the scheduler will kill tasks in pools running over capacity in order to give the slots to the pools running under capacity.

49
Fair Scheduling
Example
Consider a university's shared research cluster where multiple research teams run data processing jobs. Different teams run different workloads, such as:
1. Team A (Data Science Group) – running large machine learning training jobs.
2. Team B (Bioinformatics Group) – processing genomic sequences.
3. Team C (IoT Group) – running real-time sensor data analytics.

50
Fair Scheduling
Without Fair Scheduling (the FIFO scheduling issue):
• If Team A submits a long-running job first, it monopolizes the cluster.
• Team B and Team C have to wait, causing resource starvation for smaller jobs.

51
Fair Scheduling
With the Fair Scheduler:
• When Team A submits its job, it initially gets the entire cluster because no other jobs are running.
• When Team B submits a job, the scheduler divides resources equally, so both jobs get 50% of the available resources.
• When Team C submits a job, the scheduler further distributes the resources, ensuring all three jobs get an equal share (about 33% each).

52
Task Execution:
1. Job Submission:
• The user submits a job to the JobTracker (in Hadoop
1.x) or ResourceManager (in Hadoop 2.x/YARN).
• The job is split into multiple tasks: Map tasks and
Reduce tasks.
• The input data is split into chunks called input splits,
which are processed by map tasks.

2. Job Initialization:
• The JobTracker/ResourceManager communicates with
the NameNode to get the data's location.
• It assigns the job to an available NodeManager (in
YARN) or TaskTracker (in Hadoop 1.x).

54
Task Execution:
3. Map Phase (Mapper Execution):
• Each Mapper task processes one input split at a time.
• The Mapper reads data from HDFS, processes it, and outputs intermediate key-value pairs.

4. Shuffle and Sort Phase:
• The output from the Mappers is shuffled and sorted by key.
• Data with the same key from multiple Mappers is combined and sent to the appropriate Reducer.

55
Task Execution:
5. Reduce Phase (Reducer Execution):
• Each Reducer processes the sorted data.
• The Reducer aggregates the values associated with each key.

6. Output Phase:
• The final output is written to HDFS.
• The output is typically stored in a distributed manner.

56
MapReduce Types
Input and Output Formats
• Hadoop MapReduce supports various input and output formats to process different types of data.
• These formats determine how data is read by the Mappers and written by the Reducers.
57
MapReduce Types
1. Input Formats
(i) TextInputFormat (default)
(ii) KeyValueTextInputFormat
(iii) SequenceFileInputFormat
(iv) SequenceFileAsTextInputFormat
(v) NLineInputFormat
(vi) MultipleInputs

58
MapReduce Types
2. Output Formats
(i) TextOutputFormat (default)
(ii) KeyValueTextOutputFormat
(iii) SequenceFileOutputFormat
(iv) MapFileOutputFormat
(v) MultipleOutputs
(vi) LazyOutputFormat
(vii) DBOutputFormat
59
MapReduce Types
1. Input Formats
• In MapReduce job execution, the InputFormat is the first step.
• The InputFormat describes how to split and read the input files.
• The InputFormat is responsible for splitting the input data file into records, which are used for the map-reduce operation.


60
(i) TextInputFormat (default):
• Reads line by line from a text file.
• Each line is treated as a key-value pair:
  Key → byte offset of the line.
  Value → line content.
Use case: log files, CSV, structured text data.
Example:
  0   Hadoop is fast
  20  MapReduce is powerful

61
(ii) KeyValueTextInputFormat:
• Reads key-value pairs from a text file (key and value separated by a delimiter, usually a tab).
  Key → text before the first tab.
  Value → text after the first tab.
Use case: processing key-value formatted logs, configurations.
Example:
  apple   5
  banana  7

(iii) SequenceFileInputFormat:
• A binary format that stores key-value pairs in a compressed form.
• More efficient than text-based formats.
Use case: large datasets where compression improves performance.
Example: input is stored as Writable objects:
  (101, "Data Processing")
  (102, "Big Data Analytics")
62
(iv) SequenceFileAsTextInputFormat:
• Reads a SequenceFile, but presents the values as text.
• Useful for debugging compressed sequence files.

(v) NLineInputFormat:
• Splits the input so that each Mapper gets a fixed number (N) of lines.
• Use case: when data processing requires control over how many lines a single Mapper processes.

(vi) MultipleInputs:
• Allows multiple input files with different input formats.
• Example: processing CSV logs with TextInputFormat and binary data with SequenceFileInputFormat together (see the driver sketch below).
63
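
A hedged driver-side sketch of how these input formats might be selected; the paths are illustrative, and the identity Mapper is used only to keep the snippet self-contained:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExamples {

    // (ii) KeyValueTextInputFormat: key = text before the first tab, value = the rest
    static void useKeyValueText(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }

    // (v) NLineInputFormat: each mapper receives exactly N lines (here 10)
    static void useNLine(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10);
    }

    // (vi) MultipleInputs: different paths with different input formats (and mappers).
    // The identity Mapper keeps this sketch compilable; in practice you would
    // pass your own per-format mapper classes.
    static void useMultipleInputs(Job job) {
        MultipleInputs.addInputPath(job, new Path("/logs/csv"),
                TextInputFormat.class, Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/seq"),
                SequenceFileInputFormat.class, Mapper.class);
    }
}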
Output Formats
● The OutputFormat decides the way the output key-value pairs are written to the output files by the RecordWriter.
● The output format defines how data is written by the Reducer.

64
MapReduce Types
2. Output Formats
(i) TextOutputFormat (default)
(ii) KeyValueTextOutputFormat
(iii) SequenceFileOutputFormat
(iv) MapFileOutputFormat
(v) MultipleOutputs
(vi) LazyOutputFormat
(vii) DBOutputFormat

65
(i) TextOutputFormat (default):
• Writes plain text files, with key-value pairs separated by a tab.
Use case: simple outputs like word count results.
Example output:
  apple   10
  banana  15

(ii) KeyValueTextOutputFormat:
• Similar to TextOutputFormat, but allows custom separators instead of a tab.
Use case: custom key-value storage formats.
Example output (using '=' as the separator):
  apple=10
  banana=15
66
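
As an aside, with the default TextOutputFormat the tab separator itself can be changed through a configuration property; treat the property name below (the usual Hadoop 2.x key) as an assumption for this sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SeparatorExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Write "key=value" lines instead of the default tab-separated "key<TAB>value"
        conf.set("mapreduce.output.textoutputformat.separator", "=");

        Job job = Job.getInstance(conf, "custom separator example");
        // ... set mapper/reducer/paths as usual ...
    }
}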
3. SequenceFileAsBinaryOutputFormat:
• It is another variant of SequenceFileOutputFormat. It also writes keys and values to a sequence file in binary format.

4. MapFileOutputFormat:
• It is another form of FileOutputFormat. It also writes the output as map files.
• The framework adds a key to a MapFile in order, so we need to ensure that the reducer emits keys in sorted order.

5. MultipleOutputs:
• This format allows writing data to files whose names are derived from the output keys and values.

6. LazyOutputFormat:
• It prevents empty files from being created when a reducer or mapper does not produce any output.

67
• By default, Hadoop creates an output file (part-r-00000, part-r-00001, etc.) for each reducer, even if the reducer doesn't write any data.
• This can lead to unnecessary empty files.
• To avoid this, LazyOutputFormat ensures that an output file is only created if the reducer writes at least one record.

7. DBOutputFormat:
• It is the OutputFormat for writing to relational databases and HBase.
• This format sends the reduce output to a SQL table.

69
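
A brief driver-side sketch of how MultipleOutputs and LazyOutputFormat are typically wired up; the named output "stats" is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatExamples {

    static void configure(Job job) {
        // LazyOutputFormat wraps the real output format so that part files
        // are only created when a reducer actually writes a record.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // MultipleOutputs: register an extra named output called "stats";
        // a reducer can then write to it through a MultipleOutputs instance.
        MultipleOutputs.addNamedOutput(job, "stats",
                TextOutputFormat.class, Text.class, IntWritable.class);
    }
}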
Counters in Hadoop
● Counters in Hadoop MapReduce are used for tracking statistics and metrics during job execution.
● They help monitor progress, detect issues, and collect performance-related data.

Advantages of using counters:
• Debugging: detect issues like missing records.
• Performance monitoring: track data processing efficiency.
• Data validation: ensure correct processing of records.

70
Counters in Hadoop
 Hadoop counters validate that:
  the correct number of bytes was read and written;
  the correct number of tasks was launched and ran successfully;
  the amount of CPU and memory consumed is appropriate for our job and cluster nodes.
 Counters also measure the progress or the number of operations that occur within a MapReduce job.

71
Counters in Hadoop
Types:
1. Built-in Counters
 • File System Counters (e.g., number of bytes read/written)
 • Job Counters (e.g., number of launched map tasks)
 • Task Counters (e.g., CPU time taken by a task)
2. User-Defined Counters
 • Custom counters created by developers to track specific events

72
Counters in Hadoop
2. User-Defined Counters (Custom Counters)
• Hadoop allows developers to define their own counters for application-specific metrics (see the sketch after this slide).

3. Distributed Counters
• Used for global aggregation across nodes.
• Updated during map or reduce tasks and sent to the JobTracker for consolidation.

73
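
A small sketch of a user-defined counter inside a mapper; the enum and class names are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordValidationMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    // User-defined counters are usually declared as an enum
    public enum RecordQuality { VALID, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.trim().isEmpty()) {
            // Track bad records without failing the job; the totals appear
            // in the job counters after the run.
            context.getCounter(RecordQuality.MALFORMED).increment(1);
            return;
        }
        context.getCounter(RecordQuality.VALID).increment(1);
        context.write(value, NullWritable.get());
    }
}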
MapReduce Features
⚫ Scalability
⚫ Flexibility
⚫ Availability
⚫ Fast
⚫ Security and Authentication
⚫ Cost-effective solution
⚫ Parallel Programming
⚫ Simplified Programming Model
⚫ High Throughput
⚫ Supports Multiple Programming Languages
74
Real world MapReduce
⚫ Analysis of logs
⚫ Data analysis
⚫ Recommendation mechanisms
⚫ Fraud detection
⚫ User behavior analysis
⚫ Genetic algorithms
⚫ Scheduling problems, resource planning

75
Real world MapReduce
1. Search Engines (Google, Bing, Yahoo)
Web indexing and ranking pages.
• How it Works:
• Map: Parses and processes web pages to extract keywords.
• Reduce: Aggregates keyword counts and ranks web pages based
on relevance.
• Example: Google’s original PageRank algorithm was built using
MapReduce.

2. Social Media (Facebook, Twitter, LinkedIn)


Analyzing user interactions, trending topics, and recommendations.
• How it Works:
• Map: Processes user activity (likes, shares, tweets).
• Reduce: Aggregates and generates insights (e.g., most shared post,
trending hashtag).
• Example: Facebook uses MapReduce for log analysis and friend
recommendation algorithms.

76
Real world MapReduce

3. Banking & Finance (JPMorgan, PayPal, Visa)


Fraud detection and risk assessment.
• How it Works:
• Map: Processes millions of financial
transactions.
• Reduce: Detects anomalies and patterns of
fraudulent activities.
• Example: PayPal uses MapReduce to analyze
transaction logs for fraud detection.

77