
INTRODUCTION TO

HADOOP

Giovanna Roda
PRACE Autumn School ’21, 27–28 September 2021
Outline

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
Schedule

Timetable

Monday, Sept. 27th


13:00–13:30 What is Big Data?
13:30–14:30 The Hadoop distributed computing architecture
14:30–15:00 Coffee break
15:00–17:00 Hands-on exercises: HDFS and MapReduce
Tuesday, Sept. 28th
9:00–9:30 The YARN resource manager
9:30–10:30 MRjob library
10:30–11:00 Benchmarking I/O with testDFSio

Introduction to Hadoop 2/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
What is Big Data?
What is Big Data? What is Big Data

What is Big Data?

"Big Data" is the catch-all term for massive amounts of data as well as
for frameworks and R&D initiatives aimed at working with them efficiently.

Image source: erpinnews.com

Introduction to Hadoop 4/101


What is Big Data? What is Big Data

A short definition of Big Data

A nice definition from this year’s PRACE Summer of HPC presentation


"Convergence of HPC and Big Data".

Introduction to Hadoop 5/101


What is Big Data? The three V’s of Big Data

The three V’s of Big Data

It is customary to define Big Data in terms of three V’s:

Volume (the sheer volume of data)


Velocity (rate of flow of the data and processing speed needs)
Variety (different sources and formats)

Introduction to Hadoop 6/101


What is Big Data? The three V’s of Big Data

The three V’s of Big Data

Data arise from disparate sources and come in many sizes and formats.
Velocity refers to the speed of data generation as well as to processing
speed requirements.

Volume Velocity Variety


MB batch table
GB periodic database
TB near-real time multimedia
PB real time unstructured
... ... ...

Introduction to Hadoop 7/101


What is Big Data? The three V’s of Big Data

Reference: metric prefixes

1 000 000 000 000 000 000 000 000   10^24   yotta   Y   septillion
1 000 000 000 000 000 000 000       10^21   zetta   Z   sextillion
1 000 000 000 000 000 000           10^18   exa     E   quintillion
1 000 000 000 000 000               10^15   peta    P   quadrillion
1 000 000 000 000                   10^12   tera    T   trillion
1 000 000 000                       10^9    giga    G   billion
1 000 000                           10^6    mega    M   million
1 000                               10^3    kilo    k   thousand

Note: 1 gigabyte (GB) is 10^9 bytes. Sometimes GB is also used to denote
1024^3 = 2^30 bytes, which is actually one gibibyte (GiB).

Introduction to Hadoop 8/101


What is Big Data? The three V’s of Big Data

Structured vs. unstructured data

By structured data one refers to highly organized data that are usually
stored in relational databases or data warehouses. Structured data are easy
to search but inflexible in terms of the three "V"s.

Unstructured data come in mixed formats, usually require pre-processing,
and are difficult to search. Unstructured data are usually stored in NoSQL
databases or in data lakes (scalable storage spaces for raw data of
mixed formats).

Introduction to Hadoop 9/101


What is Big Data? The three V’s of Big Data

Examples of structured/unstructured data

Industry      Structured data           Unstructured data

e-commerce    products & prices         reviews
              customer data             phone transcripts
              transactions              social media mentions

banking       financial transactions    customer communication
              customer data             regulations & compliance
                                        financial news

healthcare    patient data              clinical reports
              medical billing data      radiology imagery

Introduction to Hadoop 10/101


What is Big Data? The three V’s of Big Data

Big Data in 2025

This table1 shows the projected annual storage and computing needs in
four domains (astronomy, Twitter, YouTube, and genomics).

1
Stephens ZD et al. “Big Data: Astronomical or Genomical?” In: PLoS Biol (2015).
Introduction to Hadoop 11/101
What is Big Data? The three V’s of Big Data

The three V’s of Big Data: additional dimensions

Three more "V"s to be pondered:

Veracity (quality or trustworthiness of data)


Value (economic value of the data)
Variability (general variability in any of the aforementioned
characteristics)

Introduction to Hadoop 12/101


What is Big Data? Addressing the challenges of Big Data

The challenges of Big Data

Anyone working with large amounts of data will sooner or later be


confronted with one or more of these challenges:

disk and memory space


processing speed
hardware faults
network capacity and speed
the need to optimize resource usage

Introduction to Hadoop 13/101


What is Big Data? Addressing the challenges of Big Data

Distributed computing for Big Data

Traditional technologies are inadequate for processing large amounts of


data efficiently.
Distributed computation makes it possible to work with Big Data using
reasonable amounts of time and resources.
Image: VSC-4 ©Matthias Heisler
Introduction to Hadoop 14/101
What is Big Data? Distributed computing

What is distributed computing?

A distributed computer system


consists of several interconnected
nodes. Nodes can be physical as well
as virtual machines or containers.

When a group of nodes provides


services and applications to the client
as if it were a single machine, then it
is also called a cluster.

Introduction to Hadoop 15/101


What is Big Data? Distributed computing

Main benefits of distributed computing

Performance: supports intensive workloads by spreading tasks across
nodes
Scalability: new nodes can be added to increase capacity
Fault tolerance: resilience in case of hardware failures

Introduction to Hadoop 16/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
The Hadoop distributed computing
architecture
The Hadoop distributed computing architecture Hadoop

Hadoop for distributed data processing

Hadoop is a framework for running jobs on clusters of computers that


provides a good abstraction of the underlying hardware and software.
“Stripped to its core, the tools that Hadoop provides for building distributed
systems—for data storage, data analysis, and coordination—are simple. If
there’s a common theme, it is about raising the level of abstraction—to
create building blocks for programmers who just happen to have lots of
data to store, or lots of data to analyze, or lots of machines to coordinate,
and who don’t have the time, the skill, or the inclination to become
distributed systems experts to build the infrastructure to handle it.2 ”

2
White T. Hadoop: The Definitive Guide. 4th ed. O’Reilly, 2015.
Introduction to Hadoop 18/101
The Hadoop distributed computing architecture Hadoop

Hadoop: some facts

Hadoop3 is an open-source project of the Apache Software Foundation.


The project was created to facilitate computations involving massive
amounts of data.

its core components are implemented in Java
initially released in 2006; the latest stable version is 3.3.1 from June 2021
originally inspired by Google’s MapReduce4 and the proprietary GFS
(Google File System)

3
Apache Software Foundation. Hadoop. url: https://hadoop.apache.org.
4
J. Dean and S. Ghemawat. “MapReduce: Simplified data processing on large
clusters.” In: Proceedings of Operating Systems Design and Implementation (OSDI).
2004. url: https://www.usenix.org/legacy/publications/library/proceedings/
osdi04/tech/full_papers/dean/dean.pdf.
Introduction to Hadoop 19/101
The Hadoop distributed computing architecture Hadoop

Hadoop’s features

Hadoop’s features addressing the challenges of Big Data:

scalability
fault tolerance
high availability
distributed cache/data locality
cost-effectiveness, as it does not need high-end hardware
a good abstraction of the underlying hardware
easy to learn
data can be queried through SQL-like endpoints (Hive, Cassandra)

Introduction to Hadoop 20/101


The Hadoop distributed computing architecture Hadoop

Mini-glossary of Hadoop’s distinguishing features

fault tolerance: the ability to withstand hardware or network failures
(also: resilience)
high availability: the system minimizes downtimes by
eliminating single points of failure
data locality: tasks are run on the node where the data are located, in
order to reduce the cost of moving data around

Introduction to Hadoop 21/101


The Hadoop distributed computing architecture The Hadoop core

The Hadoop core

The core of Hadoop consists of:

Hadoop common, the core libraries


HDFS, the Hadoop Distributed File System
MapReduce
the YARN (Yet Another Resource Negotiator) resource manager

Introduction to Hadoop 22/101


The Hadoop distributed computing architecture The Hadoop core

The Hadoop ecosystem

There’s a whole constellation of open source components for collecting,


storing, and processing big data that integrate with Hadoop.
Image source: Cloudera
Introduction to Hadoop 23/101
The Hadoop distributed computing architecture The Hadoop core

The Hadoop Distributed File System (HDFS)

HDFS stands for Hadoop Distributed File System and it takes care of
partitioning data across a cluster.
In order to prevent data loss and/or task termination due to hardware
failures HDFS uses either

replication (creating multiple copies —usually 3— of the data)


erasure coding

Data redundancy (obtained through replication or erasure coding) is the


basis of Hadoop’s fault tolerance.

Introduction to Hadoop 24/101


The Hadoop distributed computing architecture The Hadoop core

Replication vs. Erasure Coding

In order to provide protection against failures one introduces:

data redundancy
a method to recover the lost data using the redundant data

Replication is the simplest method for protecting data: one simply makes n
copies of the data. n-fold replication guarantees the availability of data for
at most n − 1 failures; the default 3-fold replication has a storage overhead
of 200% (equivalent to a storage efficiency of 33%).
Erasure coding provides a better storage efficiency (up to 71%) but it
can be more costly than replication in terms of performance.
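As a back-of-the-envelope illustration (a minimal sketch, not part of the original slides; RS(6,3) and RS(10,4) are assumed here as typical Reed-Solomon erasure-coding schemes):

# Storage overhead vs. efficiency for n-fold replication and Reed-Solomon RS(k, m)
# erasure coding (k data blocks + m parity blocks, tolerating up to m lost blocks).

def replication(n):
    overhead = (n - 1) * 100        # extra copies stored, in percent of the raw data
    efficiency = 100 / n            # useful fraction of the stored bytes, in percent
    return overhead, efficiency

def erasure_coding(k, m):
    overhead = m / k * 100
    efficiency = k / (k + m) * 100
    return overhead, efficiency

print("3-fold replication:", replication(3))   # (200, 33.3...) -> 200% overhead, 33% efficiency
print("RS(6,3):", erasure_coding(6, 3))        # (50.0, 66.6...) -> 50% overhead
print("RS(10,4):", erasure_coding(10, 4))      # (40.0, 71.4...) -> the "up to 71%" efficiency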

Introduction to Hadoop 25/101


The Hadoop distributed computing architecture HDFS architecture

HDFS architecture

A typical Hadoop cluster installation


consists of:
a NameNode
a secondary NameNode
multiple DataNodes

Introduction to Hadoop 26/101


The Hadoop distributed computing architecture HDFS architecture

HDFS architecture: NameNode

NameNode
The NameNode is the main point of
access of a Hadoop cluster. It is
responsible for the bookkeeping of
the data partitioned across the
DataNodes, manages the filesystem
metadata, and performs load
balancing.

Introduction to Hadoop 27/101


The Hadoop distributed computing architecture HDFS architecture

HDFS architecture: Secondary NameNode

Secondary NameNode
Keeps track of changes in the
NameNode by performing regular
snapshots, thus allowing a quick
startup.
An additional standby node is needed
to guarantee high availability (since
the NameNode is a single point of
failure).

Introduction to Hadoop 28/101


The Hadoop distributed computing architecture HDFS architecture

HDFS architecture: DataNode

DataNode
Here is where the data is saved and
the computations take place (data
nodes should actually be called "data
and worker nodes")

Introduction to Hadoop 29/101


The Hadoop distributed computing architecture HDFS architecture

HDFS architecture: internal data representation

HDFS supports working with very large files.


Internally, data are split into blocks. One of the reasons for splitting data
into blocks is that block objects then all have the same size.
The block size in HDFS can be configured at installation time and is by
default 128 MiB (approximately 134 MB).

Note: Hadoop sees data as a bunch of records and it processes multiple


files the same way it does with a single file. So, if the input is a directory
instead of a single file, it will process all files in that directory.
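For intuition, a small sketch (not from the original slides) of how many fixed-size blocks a file occupies, assuming the default 128 MiB block size:

import math

BLOCK_SIZE = 128 * 1024**2          # default HDFS block size: 134217728 bytes

def num_blocks(file_size_bytes):
    # a file occupies ceil(size / block size) blocks; the last block may be only partially filled
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

print(num_blocks(20_250_760))       # the ~19.3 MB wiki_1k_lines file -> 1 block
print(num_blocks(10 * 1024**3))     # a 10 GiB file -> 80 blocks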

Introduction to Hadoop 30/101


The Hadoop distributed computing architecture HDFS architecture

HDFS architecture

Introduction to Hadoop 31/101


The Hadoop distributed computing architecture HDFS architecture

DataNode failures

Each DataNode sends a Heartbeat


message to the NameNode
periodically. Whenever a DataNode
becomes unavailable (due to network
or hardware failure), the NameNode
stops sending requests to that node
and creates new replicas of the blocks
stored on that node.

Introduction to Hadoop 32/101


The Hadoop distributed computing architecture WORM: Write Once Read Many

The WORM principle of HDFS

The Hadoop Distributed File System relies on a simple design principle for
data known as Write Once Read Many (WORM).
“A file once created, written, and closed need not be changed except for
appends and truncates. Appending the content to the end of the files is
supported but cannot be updated at arbitrary point. This assumption
simplifies data coherency issues and enables high throughput data access.5 ”
The data immutability paradigm is also discussed in Chapter 2 of "Big
Data".6

5
Apache Software Foundation. Hadoop. url:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
6
Marz N. and Warren J. Big Data. Manning Publications, 2015.
Introduction to Hadoop 33/101
Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
MapReduce
MapReduce MapReduce

The origins of MapReduce

The 2004 paper “MapReduce: Simplified Data Processing on Large


Clusters” by two members of Google’s R&D team, Jeffrey Dean and Sanjay
Ghemawat, is the seminal article on MapReduce.
The article describes the methods used to split, process, and aggregate the
large amount of data for the Google search engine.

The open-source version of MapReduce was later released within the


Apache Hadoop project.

Introduction to Hadoop 35/101


MapReduce MapReduce

MapReduce explained

Image source: Stack Overflow


Introduction to Hadoop 36/101
MapReduce MapReduce

MapReduce explained

The MapReduce paradigm is inspired by the computing model commonly


used in functional programming.

Applying the same function independently to items in a dataset, either to
transform (map) or collate (reduce) them into new values, works well in a
distributed environment.

Introduction to Hadoop 37/101


MapReduce MapReduce

The phases of MapReduce

The phases of a MapReduce job:

split: data is partitioned across several computer nodes


map: apply a map function to each chunk of data
sort & shuffle: the output of the mappers is sorted and distributed to
the reducers
reduce: finally, a reduce function is applied to the data and an output
is produced
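To make the data flow concrete, here is a minimal, purely illustrative simulation of these phases in plain Python (no Hadoop involved; all names are invented for the example):

from collections import defaultdict

# toy input, already "split" into two chunks that could live on different nodes
chunks = [["the quick brown fox", "the lazy dog"],
          ["the fox jumps over the lazy dog"]]

def map_fn(line):                       # map: emit (word, 1) for every word of a line
    return [(word, 1) for word in line.split()]

# map phase: applied independently to every line of every chunk (parallelizable)
mapped = [kv for chunk in chunks for line in chunk for kv in map_fn(line)]

# sort & shuffle phase: group all emitted values by key
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

def reduce_fn(key, values):             # reduce: sum the counts for each word
    return key, sum(values)

print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'brown': 1, 'dog': 2, 'fox': 2, 'jumps': 1, 'lazy': 2, 'over': 1, 'quick': 1, 'the': 4}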

Introduction to Hadoop 38/101


MapReduce MapReduce

The phases of MapReduce

Introduction to Hadoop 39/101


MapReduce MapReduce

The phases of MapReduce

We have seen that a MapReduce job consists of four phases:

split
map
sort & shuffle
reduce

While splitting, sorting and shuffling are done by the framework, the map
and reduce functions are defined by the user.

It is also possible for the user to interact with the splitting, sorting and
shuffling phases and change their default behavior, for instance by
managing the amount of splitting or defining the sorting comparator. This
will be illustrated in the hands-on exercises.

Introduction to Hadoop 40/101


MapReduce MapReduce

MapReduce: some notes

Notes

the same map (and reduce) function is applied to all the chunks in the
data
the map and reduce computations can be carried out in parallel
because they are completely independent of one another.
the split is not the same as the internal partitioning into blocks

Introduction to Hadoop 41/101


MapReduce MapReduce

MapReduce: shuffling and sorting

The shuffling and sorting phase is often the most costly in a
MapReduce job.
The mapper takes as input unsorted data and emits key-value pairs. The
purpose of sorting is to provide data that is already grouped by key to the
reducer. This way reducers can start working as soon as a group (identified
by a key) is filled.
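A small sketch (not from the original slides) of why key-sorted input helps: a reducer can stream over the groups one at a time instead of buffering all pairs.

from itertools import groupby

# mapper output after sort & shuffle: key-value pairs already sorted by key
pairs = [("and", 1), ("and", 1), ("in", 1), ("the", 1), ("the", 1), ("the", 1)]

for key, group in groupby(pairs, key=lambda kv: kv[0]):
    # each group is complete as soon as the key changes, so it can be reduced immediately
    print(key, sum(v for _, v in group))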

Introduction to Hadoop 42/101


MapReduce shuffling and sorting

MapReduce: shuffling and sorting

Introduction to Hadoop 43/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
HDFS hands-on exercises
HDFS hands-on exercises HDFS basic commands

Where to find commands listing

For this part of the training you will need to activate the Hadoop module
using the command:

module load Hadoop/2.6.0-cdh5.8.0-native

All commands in this section can be found in the file:

HDFS_commands.txt

Introduction to Hadoop 45/101


HDFS hands-on exercises HDFS basic commands

Basic HDFS filesystem commands

One can regard HDFS as a regular file system; in fact, many HDFS shell
commands are inherited from the corresponding bash commands.
To run a command on a Hadoop filesystem use the prefix hdfs dfs, for
instance use:
hdfs dfs -mkdir myDir

to create a new directory myDir on HDFS.

Note: One can use hadoop fs and hdfs dfs interchangeably when working on
an HDFS file system. The command hadoop fs is more generic because it can
be used not only on HDFS but also on other file systems that Hadoop
supports (such as Local FS, WebHDFS, S3 FS, and others).

Introduction to Hadoop 46/101


HDFS hands-on exercises HDFS basic commands

Basic HDFS filesystem commands

Basic HDFS filesystem commands that also exist in bash

hdfs dfs -mkdir create a directory


hdfs dfs -ls list files
hdfs dfs -cp copy files
hdfs dfs -cat print files
hdfs dfs -tail output last part of a file
hdfs dfs -rm remove files

Introduction to Hadoop 47/101


HDFS hands-on exercises HDFS basic commands

Basic HDFS filesystem commands

Here are three basic commands that are specific to HDFS.

hdfs dfs -put     copy a single src, or multiple srcs, from the local file system to the destination file system
hdfs dfs -get     copy files to the local file system
hdfs dfs -usage   get help on a hadoop fs command

Introduction to Hadoop 48/101


HDFS hands-on exercises HDFS basic commands

Basic HDFS filesystem commands

To get more help on a specific hdfs command use: hdfs dfs -help <command>

$ hdfs dfs -help tail
# -tail [-f] <file> :
#   Show the last 1 KB of the file.
#
#   -f   Shows appended data as the file grows.
Introduction to Hadoop 49/101


HDFS hands-on exercises HDFS basic commands

Some things to try

# create a new directory called "input" on HDFS
hdfs dfs -mkdir input
# copy local file wiki_1k_lines to input on HDFS
hdfs dfs -put wiki_1k_lines input/
# list contents of directory ("-h" = human-readable)
hdfs dfs -ls -h input
# disk usage
hdfs dfs -du -h input
# get help on "du" command
hdfs dfs -help du
# remove directory
hdfs dfs -rm -r input
Introduction to Hadoop 50/101


HDFS hands-on exercises HDFS disk usage

Some things to try

What is the size of the file wiki_1k_lines? What is its disk usage?
# show the size of wiki_1k_lines on the regular filesystem
ls -lh wiki_1k_lines
# show the size of wiki_1k_lines on HDFS
hdfs dfs -put wiki_1k_lines
hdfs dfs -ls -h wiki_1k_lines

# disk usage of wiki_1k_lines on the regular filesystem
du -h wiki_1k_lines
# disk usage of wiki_1k_lines on HDFS
hdfs dfs -du -h wiki_1k_lines

Introduction to Hadoop 51/101


HDFS hands-on exercises HDFS disk usage

Disk usage on HDFS

The command hdfs dfs -help du will tell you that the output is of the
form:
size  disk space consumed  filename

You’ll notice that the space on disk is larger than the file size (38.6 MB
versus 19.3 MB):
hdfs dfs -du -h wiki_1k_lines
# 19.3 M  38.6 M  wiki_1k_lines

This is due to replication. You can check the replication factor using:
hdfs dfs -stat 'Block size: %o Blocks: %b Replication: %r' input/wiki_1k_lines
# Block size: 134217728 Blocks: 20250760 Replication: 2

Introduction to Hadoop 52/101


HDFS hands-on exercises HDFS disk usage

Disk usage on HDFS

From the previous output:

Block size: 134217728  Blocks: 20250760  Replication: 2

we can see that the HDFS filesystem currently supports a replication factor
of 2.

Note that the Hadoop block size is defined in terms of mebibytes: 134217728
bytes corresponds to 128 MiB, i.e. approximately 134 MB. One MiB is larger
than one MB, since one MiB is 1024^2 = 2^20 bytes, while one MB is 10^6 bytes.
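A quick check of these conversions (a trivial sketch):

print(128 * 1024**2)        # 134217728 bytes = 128 MiB, the default HDFS block size
print(134217728 / 10**6)    # 134.217728 -> roughly 134 MB
print(1024**2 == 2**20)     # True: one MiB is 2^20 = 1048576 bytes
print(10**6)                # 1000000: one MB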

Introduction to Hadoop 53/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
MapReduce hands-on
MapReduce hands-on

Where to find commands listing

For this part of the training you will need to activate the Hadoop module
using the command:

module load Hadoop/2.6.0-cdh5.8.0-native

All commands in this section can be found in the file:

MapReduce_commands.txt

Introduction to Hadoop 55/101


MapReduce hands-on The Mapreduce streaming library

MapReduce streaming

The MapReduce streaming library allows you to use any executable as mapper
and reducer. Mappers and reducers:

read the input from stdin (line by line)
emit the output to stdout

The documentation for streaming can be found at:
https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html

Introduction to Hadoop 56/101


MapReduce hands-on The Mapreduce streaming library

Locate the streaming library

First of all, we need to locate the streaming library on the system.

# find out where Hadoop is installed (variable $HADOOP_HOME)
echo $HADOOP_HOME
# /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/mapreduce

# find the streaming library
find /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native -name "hadoop-streaming*jar"
# . . .
# /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.8.0.jar

# save the library path in the variable $STREAMING
export STREAMING=/opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.8.0.jar

Introduction to Hadoop 57/101


MapReduce hands-on How to run a Mapreduce job

Check input and output

We’re going to use the file wiki_1k_lines (later you can experiment with
a larger file).

# check that the output directory does not exist
hdfs dfs -rm -r output

# copy the file to HDFS
hdfs dfs -put wiki_1k_lines

Note: If you use a directory or file name that doesn’t start with a slash
('/') then the directory or file is meant to be in your home directory (both
in bash and on HDFS). A path that starts with a slash is called an absolute
path name.

Introduction to Hadoop 58/101


MapReduce hands-on How to run a Mapreduce job

Run a simple MapReduce job

Using the streaming library, we can run the simplest possible MapReduce job.
# launch MapReduce job
hadoop jar $STREAMING \
    -input wiki_1k_lines \
    -output output \
    -mapper /bin/cat \
    -reducer '/bin/wc -l'

This job uses as mapper the cat command, which does nothing else than
echoing its input. The reducer wc -l counts the lines in the given input.
Note how we didn’t need to write any code for the mapper and reducer,
because the executables (cat and wc) are already there as part of any
standard Linux distribution.

Introduction to Hadoop 59/101


MapReduce hands-on How to run a Mapreduce job

Run a simple MapReduce job

# launch MapReduce job
hadoop jar $STREAMING \
    -input wiki_1k_lines \
    -output output \
    -mapper /bin/cat \
    -reducer '/bin/wc -l'

If the job was successful, the output directory on HDFS (we called it
output) should contain an empty file called _SUCCESS.
The file part-* contains the output of our job.
# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000

Introduction to Hadoop 60/101


MapReduce hands-on How to run a Mapreduce job

Run a simple MapReduce job

Launch a MapReduce job with 4 mappers
hdfs dfs -rm -r output

# launch MapReduce job
hadoop jar $STREAMING \
    -D mapreduce.job.maps=4 \
    -input wiki_1k_lines \
    -output output \
    -mapper /bin/cat \
    -reducer '/bin/wc -l'

# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000

Introduction to Hadoop 61/101


MapReduce hands-on How to run a Mapreduce job

Run a simple MapReduce job

Note how it is necessary to delete the output directory on HDFS (hdfs
dfs -rm -r output): according to the WORM principle, Hadoop
will not delete or overwrite existing data!

The -D option right after the jar directive (in
this example -D mapreduce.job.maps=4) allows MapReduce
properties to be changed at runtime.

The list of all MapReduce options can be found in: mapred-default.xml
Note: this is the link to the latest stable version; there might be some slight
changes with respect to the version that is currently installed on the cluster.

Introduction to Hadoop 62/101


MapReduce hands-on Wordcount with Mapreduce

Wordcount

We are now going to run a wordcount job using Python executables as
mapper and reducer.
The mapper will be called mapper.py and the reducer reducer.py. Since
these executables are not known to Hadoop, it is necessary to add them
with the options

-files mapper.py -files reducer.py

Note: it is possible to have several mappers and reducers in one MapReduce
job; the output of each function is sent as input to the next one.

Introduction to Hadoop 63/101


MapReduce hands-on Wordcount with Mapreduce

Define the mapper

#!/bin/python3
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("{}\t{}".format(word, 1))
Listing 1: mapper.py

Introduction to Hadoop 64/101


MapReduce hands-on Wordcount with Mapreduce

Define the reducer


#!/bin/python3
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("{}\t{}".format(current_word, current_count))
        current_count = count
        current_word = word
if current_word == word:
    print("{}\t{}".format(current_word, current_count))
Listing 2: reducer.py
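Before submitting to Hadoop, the two scripts can be sanity-checked locally; a minimal sketch, assuming wiki_1k_lines is present in the working directory (the Unix sort stands in for the shuffle & sort phase):

# Pipe the data through mapper.py, sort, and reducer.py locally, without Hadoop.
import subprocess

pipeline = "cat wiki_1k_lines | python3 mapper.py | sort | python3 reducer.py | head"
result = subprocess.run(pipeline, shell=True, capture_output=True, text=True)
print(result.stdout)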
Introduction to Hadoop 65/101
MapReduce hands-on Wordcount with Mapreduce

Run the job


# upload file to HDFS
hdfs dfs -put data/wiki_1k_lines
# remove output directory
hdfs dfs -rm -r output

hadoop jar $STREAMING \
    -files mapper.py \
    -files reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input wiki_1k_lines \
    -output output

Check results.
# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000 | head

Introduction to Hadoop 66/101


MapReduce hands-on Wordcount with Mapreduce

Sorting the output after the job

The reducer just writes the list of words and their frequency in the order
given by the mapper.

The output of the reducer is sorted by key (the word) because that’s the
ordering that the reducer receives from the mapper. If we’re interested in
sorting the data by frequency, we can use the Unix sort command with the
options k2, n, r meaning respectively "by field 2", "numeric", "reverse".

hdfs dfs -cat output/part-00000|sort -k2nr|head

The output should be something like:


the 193778
of 117170
and 89966
in 69186
. . .

Introduction to Hadoop 67/101


MapReduce hands-on Wordcount with Mapreduce

Sorting with MapReduce

To sort by frequency using the mapreduce framework, we can employ a


simple trick: create a mapper that interchanges words with their frequency
values. Since by construction mappers sort their output by key, we get the
desired sorting as a side-effect.

Create a script swap_keyval.py

#!/bin/python3
import sys

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if int(count) > 100:
        print("{}\t{}".format(count, word))
Listing 3: swap_keyval.py

Introduction to Hadoop 68/101


MapReduce hands-on Wordcount with Mapreduce

Sorting with MapReduce

Run the new MapReduce job using output as input and writing results to
a new directory output2.
# write the output to the directory output2
hdfs dfs -rm -r output2

hadoop jar $STREAMING \
    -files swap_keyval.py \
    -input output \
    -output output2 \
    -mapper swap_keyval.py

Looking at the output, one can see that it is sorted by frequency, but
alphabetically rather than numerically.
hdfs dfs -cat output2/part-00000 | head
# 10021 his
# 1005 per
# 101 merely
# . . .
Introduction to Hadoop 69/101
MapReduce hands-on Wordcount with Mapreduce

Using comparator classes for sorting

In general, we can determine how mappers are going to sort their output by
configuring the comparator directive to use the special class
KeyFieldBasedComparator:
-D mapreduce.job.output.key.comparator.class=\
    org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

This class has some options similar to the Unix sort (-n to sort numerically,
-r for reverse sorting, -k pos1[,pos2] for specifying fields to sort by).
See documentation: KeyFieldBasedComparator.html

Introduction to Hadoop 70/101


MapReduce hands-on Wordcount with Mapreduce

Using comparator classes for sorting

hdfs dfs -rm -r output2

comparator_class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

hadoop jar $STREAMING \
    -D mapreduce.job.output.key.comparator.class=$comparator_class \
    -D mapreduce.partition.keycomparator.options=-nr \
    -files swap_keyval.py \
    -input output \
    -output output2 \
    -mapper swap_keyval.py

Introduction to Hadoop 71/101


MapReduce hands-on Wordcount with Mapreduce

Using comparator classes for sorting

Now MapReduce has performed the desired sorting on the data.

hdfs dfs -cat output2/part-00000 | head
193778 the
117170 of
89966 and
69186 in
. . .

Introduction to Hadoop 72/101


MapReduce hands-on Some things to try

Modify the Wordcount example

Try to modify the wordcount example:

using executables in other programming languages


adding a mapper that filters certain words
using larger files

Introduction to Hadoop 73/101


MapReduce hands-on Some things to try

Run the MapReduce examples

The MapReduce distribution comes with some standard examples including


source code.

To get a list of all available examples use:

hadoop jar \
    $HADOOP_HOME/hadoop-mapreduce-examples-2.6.0-cdh5.8.0.jar

Run the Wordcount example:

hadoop jar \
    $HADOOP_HOME/hadoop-mapreduce-examples-2.6.0-cdh5.8.0.jar \
    wordcount wiki_1k_lines output3

Introduction to Hadoop 74/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
The YARN resource manager
The YARN resource manager YARN

YARN: Yet Another Resource Negotiator

Hadoop jobs are usually managed by YARN (acronym for Yet Another
Resource Negotiator), which is responsible for allocating resources and
managing job scheduling. Basic resource types are:

memory (memory-mb)
virtual cores (vcores)

YARN supports an extensible resource model that allows any countable
resource to be defined. A countable resource is a resource that is consumed
while a container is running, but is released afterwards. Such a resource
can be for instance:

GPU (gpu)

Introduction to Hadoop 76/101


The YARN resource manager YARN

YARN architecture

Image source: Apache Software Foundation

Introduction to Hadoop 77/101


The YARN resource manager YARN

YARN architecture

Each job submitted to YARN is assigned:

a container : this is an abstract entity which incorporates resources


such as memory, cpu, disk, network etc. Container resources are
allocated by YARN’s Scheduler.
an ApplicationMaster service assigned by the Application Manager
for monitoring the progress of the job, restarting tasks if needed

Introduction to Hadoop 78/101


The YARN resource manager YARN

YARN architecture

The main idea of Yarn is to have two distinct daemons for job monitoring
and scheduling, one global and one local for each application:

the Resource Manager is the global job manager, consisting of:


Scheduler: allocates resources across all applications
Applications Manager: accepts job submissions, restarts
ApplicationMasters on failure
an Application Master is the local application manager, responsible
for negotiating resources, monitoring status of the job, restarting failed
tasks

Introduction to Hadoop 79/101


The YARN resource manager YARN

Dynamic resource pools

Sharing computing resources fairly can be a big issue in multi-user


environments.

YARN supports dynamic resource pools for scheduling applications.


A resource pool is a given configuration of resources to which a group of
users is granted access. Whenever a group is not active, the resources are
preempted and granted to other groups.
Groups are assigned a priority and resources are shared among groups
according to these priority values.

Additionally, resource configurations can be scheduled for specific intervals


of time.

Introduction to Hadoop 80/101


The YARN resource manager YARN

YARN on SLURM

When running YARN on top of SLURM, it is not clear how to take
advantage of the flexibility of YARN’s dynamic resource pools to optimize
resource utilization.

How to leverage the best characteristics of job schedulers from both Big
Data and HPC architectures in order to decrease latency is a subject of
active study.

Introduction to Hadoop 81/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
MRjob
MRjob The MRjob Python library

MRjob

What is MRjob? It’s a wrapper for MapReduce that allows you to write
MapReduce jobs in pure Python.

The library can be used for testing MapReduce as well as Spark jobs
without the need for a Hadoop cluster.

Here’s a quick-start tutorial:


https://mrjob.readthedocs.io/en/latest/index.html

Introduction to Hadoop 83/101


MRjob The MRjob Python library

A MRjob wordcount
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):
    """
    A class to represent a Word Frequency Count mapreduce job
    """
    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
Listing 4: word_count.py

Introduction to Hadoop 84/101


MRjob The MRjob Python library

A MRjob wordcount

Install mrjob in a virtual environment:

# install mrjob
python3 -m venv mypython
mypython/bin/pip install ipython mrjob

Run a basic mrjob wordcount

mypython/bin/python3 word_count.py data/wiki_1k_lines
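The MRWordFrequencyCount class above counts characters, words, and lines per input line. As a further sketch (not part of the original slides), a per-word count similar to the earlier streaming wordcount might look like this; the file name mr_word_count.py is an assumption:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Count the occurrences of each word, mirroring the streaming wordcount."""

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # local pre-aggregation on each mapper, reducing the shuffle volume
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
Listing 5: mr_word_count.py (hypothetical)

It can be run locally the same way: mypython/bin/python3 mr_word_count.py data/wiki_1k_lines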

Introduction to Hadoop 85/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
Benchmarking I/O with testDFSio
Benchmarking I/O with testDFSio Running TestDFSio

TestDFSio

TestDFSio is a tool for measuring the performance of read and write


operations on HDFS and can be used to measure performance, benchmark,
or stress-test a Hadoop cluster.

TestDFSio uses MapReduce to write files to the HDFS filesystem, spawning
one mapper per file; the reducer is used to collect and summarize test data.

Introduction to Hadoop 87/101


Benchmarking I/O with testDFSio Running TestDFSio

Find library location

Find the location of the testDFSio library and save it in the variable
$testDFSiojar:
find /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native \
    -name "hadoop-mapreduce-client-jobclient*tests.jar"
# /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.8.0-tests.jar

export testDFSiojar=/opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.8.0-tests.jar

Introduction to Hadoop 88/101


Benchmarking I/O with testDFSio Running TestDFSio

Options

Main options:

-write to run write tests


-read to run read tests
-nrFiles the number of files (set to be equal to the number of
mappers)
-fileSize size of files (followed by B|KB|MB|GB|TB)

TestDFSIO generates exactly 1 map task per file, so it is a 1:1 mapping


from files to map tasks.

Introduction to Hadoop 89/101


Benchmarking I/O with testDFSio Running TestDFSio

Specify custom i/o directory

By default TestDFSio uses the HDFS directory /benchmarks to read and
write, therefore it is recommended to run the tests as the hdfs user (you
need read/write access to /benchmarks/TestDFSIO on HDFS).

In case you want to run the tests as a user who has no write permissions on
the HDFS root folder /, you can specify an alternative directory with the
option -D, assigning a new value to test.build.data.

Introduction to Hadoop 90/101


Benchmarking I/O with testDFSio Running TestDFSio

Running a write test

We are going to run a test with nrFiles files, each of size fileSize,
using a custom output directory.

export myDir=/user/${USER}/benchmarks
export nrFiles=2
export fileSize=10MB
cd ~
hadoop jar $testDFSiojar TestDFSIO -D test.build.data=$myDir \
    -write -nrFiles $nrFiles -fileSize $fileSize

Introduction to Hadoop 91/101


Benchmarking I/O with testDFSio Running TestDFSio

Running a read test

We are going to run a test with nrFiles files, each of size fileSize,
using a custom output directory.

export myDir=/user/${USER}/benchmarks
export nrFiles=2
export fileSize=10MB
cd ~
hadoop jar $testDFSiojar TestDFSIO -D test.build.data=$myDir \
    -read -nrFiles $nrFiles -fileSize $fileSize

Introduction to Hadoop 92/101


Benchmarking I/O with testDFSio Running TestDFSio

A sample test output

19/07/25 15:44:26 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
19/07/25 15:44:26 INFO fs.TestDFSIO:            Date & time: Thu Jul 25 15:44:26 CEST 2019
19/07/25 15:44:26 INFO fs.TestDFSIO:        Number of files: 20
19/07/25 15:44:26 INFO fs.TestDFSIO: Total MBytes processed: 204800.0
19/07/25 15:44:26 INFO fs.TestDFSIO:      Throughput mb/sec: 39.38465325447428
19/07/25 15:44:26 INFO fs.TestDFSIO: Average IO rate mb/sec: 39.59946060180664
19/07/25 15:44:26 INFO fs.TestDFSIO:  IO rate std deviation: 3.0182194679812717
19/07/25 15:44:26 INFO fs.TestDFSIO:     Test exec time sec: 292.66

Introduction to Hadoop 93/101


Benchmarking I/O with testDFSio Running TestDFSio

How to interpret the results

The main measurements produced by the HDFSio test are:

throughput in MB/sec
average IO rate in MB/sec
standard deviation of the IO rate
test execution time

All test results are logged by default to the file TestDFSIO_results.log.


The log file can be changed with the option -resFile resultFileName.

Introduction to Hadoop 94/101


Benchmarking I/O with testDFSio Running TestDFSio

Advanced test configuration

In addition to the default sequential file access, the mapper class for
reading data can be configured to perform various types of random reads:

random read always chooses a random position to read from


(skipSize = 0)
backward read reads file in reverse order (skipSize < 0)
skip-read skips skipSize bytes after every read (skipSize > 0)

The -compression option allows specifying a codec for the input and
output of data.

Introduction to Hadoop 95/101


Benchmarking I/O with testDFSio Running TestDFSio

What is a codec

Codec is a portmanteau of coder and decoder and it designates any


hardware or software device that is used to encode—most commonly also
reducing the original size—and decode information. Hadoop provides
classes that encapsulate compression and decompression algorithms.

These are all currently available Hadoop compression codecs:


Compression format Hadoop CompressionCodec
DEFLATE org.apache.hadoop.io.compress.DefaultCodec
gzip org.apache.hadoop.io.compress.GzipCodec
bzip2 org.apache.hadoop.io.compress.BZip2Codec
LZO com.hadoop.compression.lzo.LzopCodec
LZ4 org.apache.hadoop.io.compress.Lz4Codec
Snappy org.apache.hadoop.io.compress.SnappyCodec

Introduction to Hadoop 96/101


Benchmarking I/O with testDFSio Running TestDFSio

Concurrent versus overall throughput

Throughput or data transfer rate measures the amount of data read or


written (expressed in Megabytes per second—MB/s) to the filesystem.

Throughput is one of the main performance measures used by disk


manufacturers, as knowing how fast data can be moved around on a disk is
an important factor to look at.

The listed throughput shows the average throughput among all the map
tasks. To get an approximate overall throughput on the cluster you can
divide the total MBytes by the test execution time in seconds.
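Using the sample output above, a trivial worked example:

total_mbytes = 204800.0               # "Total MBytes processed" from the sample run
exec_time_s = 292.66                  # "Test exec time sec"
print(total_mbytes / exec_time_s)     # ~699.8 MB/s approximate overall throughput,
                                      # compared with the ~39.4 MB/s average per map task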

Introduction to Hadoop 97/101


Benchmarking I/O with testDFSio Running TestDFSio

Clean up

When done with the tests, clean up the temporary files generated by
testDFSio.

export myDir=/user/${USER}/benchmarks

hadoop jar $testDFSiojar TestDFSIO \
    -D test.build.data=$myDir -clean

Introduction to Hadoop 98/101


Outline/next

Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
Concluding remarks
Concluding remarks

Big Data on VSC course

As part of the Vienna Scientific cluster training program, we offer a course


"Big Data on VSC".
The first two editions ran in January and March of this year, and the next
edition will take place next spring.

Our Hadoop expertise comes from managing the Big Data cluster LBD
(Little Big Data*) at the Vienna University of Technology. The cluster has
been running since 2017 and is used for teaching and research.

(*) https://lbd.zserv.tuwien.ac.at/

Introduction to Hadoop 100/101


Concluding remarks

Thanks

Thanks to:

Janez Povh and Leon Kos for inviting me once again to hold this
training on Hadoop.
Dieter Kvasnicka, my colleague and co-trainer for Big Data on VSC, for
his constant support.

Introduction to Hadoop 101/101
