Introduction To Hadoop Slides
HADOOP
Giovanna Roda
PRACE Autumn School ’21, 27–28 September 2021
Outline
Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
Schedule
Timetable
What is Big Data?
What is Big Data?
"Big Data" is the catch-all term for massive amounts of data as well as
for frameworks and R&D initiatives aimed at working with them efficiently.
Data arise from disparate sources and come in many sizes and formats.
Velocity refers to the speed of data generation as well as to processing
speed requirements.
Structured data refers to highly organized data that are usually stored in relational databases or data warehouses. Structured data are easy to search but inflexible in terms of the three "V"s.
[Figure: examples of data sources in banking: customer communication, financial transactions, regulations & compliance, customer data, financial news]
This table¹ shows the projected annual storage and computing needs in four domains (astronomy, Twitter, YouTube, and genomics).

1 Stephens ZD et al. "Big Data: Astronomical or Genomical?" In: PLoS Biol (2015).
The three V's of Big Data
The Hadoop distributed computing architecture
Hadoop

2 White T. Hadoop: The Definitive Guide. O'Reilly, 2015.
3 Apache Software Foundation. Hadoop. url: https://hadoop.apache.org.
4 J. Dean and S. Ghemawat. "MapReduce: Simplified data processing on large clusters." In: Proceedings of Operating Systems Design and Implementation (OSDI). 2004. url: https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf.
Hadoop’s features
scalability
fault tolerance
high availability
distributed cache/data locality
cost-effectiveness, as it does not need high-end hardware
provides a good abstraction of the underlying hardware
easy to learn
data can be queried through SQL-like endpoints (Hive, Cassandra)
HDFS stands for Hadoop Distributed File System and it takes care of partitioning data across a cluster.
In order to prevent data loss and/or task termination due to hardware failures, HDFS uses either
data redundancy (replication of blocks across several nodes), or
a method to recover the lost data from redundant, encoded data (erasure coding).
HDFS architecture
NameNode
The NameNode is the main point of access of a Hadoop cluster. It is responsible for the bookkeeping of the data partitioned across the DataNodes, manages the whole filesystem metadata, and performs load balancing.
Secondary NameNode
Keeps track of changes in the NameNode by performing regular snapshots of its metadata, thus allowing quick startup. An additional standby node is needed to guarantee high availability (since the NameNode is a single point of failure).
DataNode
This is where the data are saved and the computations take place (data nodes should actually be called "data and worker nodes").
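As a quick check of this architecture on a running cluster, the NameNode can be asked for a report of the filesystem and of the DataNodes registered with it. This is a minimal sketch; it assumes you have access to a configured Hadoop installation and sufficient permissions:

# print overall capacity/usage and the list of live DataNodes known to the NameNode
hdfs dfsadmin -report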
HDFS architecture
DataNode failures
The Hadoop Distributed File System relies on a simple design principle for
data known as Write Once Read Many (WORM).
“A file once created, written, and closed need not be changed except for
appends and truncates. Appending the content to the end of the files is
supported but cannot be updated at arbitrary point. This assumption
simplifies data coherency issues and enables high throughput data access.5 ”
The data immutability paradigm is also discussed in Chapter 2 of "Big
Data".6
5 Apache Software Foundation. Hadoop. url: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
6 Warren J. and Marz N. Big Data. Manning Publications, 2015.
MapReduce
MapReduce explained
The phases of a MapReduce job: split, map, sort & shuffle, reduce.
While splitting, sorting and shuffling are done by the framework, the map
and reduce functions are defined by the user.
It is also possible for the user to interact with the splitting, sorting and
shuffling phases and change their default behavior, for instance by
managing the amount of splitting or defining the sorting comparator. This
will be illustrated in the hands-on exercises.
Notes
The same map (and reduce) function is applied to all the chunks of the data.
The map and reduce computations can be carried out in parallel because they are completely independent of one another.
The split is not the same as the internal partitioning of files into HDFS blocks.
The shuffling and sorting phase is often the most costly part of a MapReduce job.
The mapper takes as input unsorted data and emits key-value pairs. The purpose of sorting is to provide data that is already grouped by key to the reducer. This way reducers can start working as soon as a group (identified by a key) is complete.
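To make the data flow concrete, here is a small, purely local Python sketch of the map, sort & shuffle, and reduce steps for a word count. It is not Hadoop code; the function names are made up for illustration:

from itertools import groupby
from operator import itemgetter

# map: emit a (word, 1) pair for every word in every input line
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# sort & shuffle: sort the pairs by key and group the values belonging to each key
def shuffle_phase(pairs):
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

# reduce: sum the counts of each key
def reduce_phase(grouped):
    for key, values in grouped:
        yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
print(list(reduce_phase(shuffle_phase(map_phase(lines)))))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]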
HDFS hands-on exercises
HDFS basic commands
For this part of the training you will need to activate the Hadoop module; the exact command is given in the file HDFS_commands.txt.
One can regard HDFS as a regular file system; in fact, many HDFS shell commands are inherited from the corresponding bash commands.
To run a command on a Hadoop filesystem use the prefix hdfs dfs, for instance:

hdfs dfs -mkdir myDir
Note: One can use hadoop fs and hdfs dfs interchangeably when working on an HDFS file system. The command hadoop fs is more generic because it can be used not only on HDFS but also on other file systems that Hadoop supports (such as Local FS, WebHDFS, S3 FS, and others).
What is the size of the file wiki_1k_lines? What is its disk usage?
# show the size of wiki_1k_lines on the regular filesystem
ls -lh wiki_1k_lines
# show the size of wiki_1k_lines on HDFS
hdfs dfs -put wiki_1k_lines
hdfs dfs -ls -h wiki_1k_lines
The command hdfs dfs -help du will tell you that the output is of the
form:
size disk space consumed filename.
You’ll notice that the space on disk is larger than the file size (38.6MB
versus 19.3MB):
hdfs dfs -du -h wiki_1k_lines
# 19.3 M  38.6 M  wiki_1k_lines
This is due to replication. You can check the replication factor using:
hdfs dfs -stat 'Block size: %o Blocks: %b Replication: %r' input/wiki_1k_lines
# Block size: 134217728 Blocks: 20250760 Replication: 2
We can see that the HDFS filesystem is currently using a replication factor of 2.
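The replication factor of a single file can also be changed explicitly. A small sketch (whether you are allowed to raise it depends on the cluster configuration; the path is assumed to be the file uploaded above):

# set the replication factor of wiki_1k_lines to 3 and wait until re-replication is done
hdfs dfs -setrep -w 3 wiki_1k_lines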
Note that the Hadoop block size is defined in terms of mebibytes: 134217728 bytes corresponds to 128 MiB, which is about 134 MB. One MiB is larger than one MB, since one MiB is 1024^2 = 2^20 bytes while one MB is 10^6 bytes.
MapReduce hands-on
For this part of the training you will need to activate the Hadoop module; the exact command is given in the file MapReduce_commands.txt.
MapReduce streaming
We're going to use the file wiki_1k_lines (later you can experiment with a larger file).
Note: If you use a directory or file name that doesn't start with a slash ('/'), then on HDFS the path is taken to be relative to your HDFS home directory (in bash it is relative to the current working directory). A path that starts with a slash is called an absolute path name.
Using the streaming library, we can run the simplest MapReduce job.
# launch MapReduce job
hadoop jar $STREAMING \
  -input wiki_1k_lines \
  -output output \
  -mapper /bin/cat \
  -reducer '/bin/wc -l'
This job uses as a mapper the cat command, which does nothing else than echoing its input. The reducer wc -l counts the lines in the given input.
Note how we didn't need to write any code for the mapper and reducer, because the executables (cat and wc) are already there as part of any standard Linux distribution.
If the job was successful, the output directory on HDFS (we called it
output) should contain an empty file called _SUCCESS.
The file part-* contains the output of our job.
# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000
Wordcount
#!/bin/python3
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("{}\t{}".format(word, 1))

Listing 1: mapper.py
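The exact launch command for this word count is not reproduced in this extract. A sketch of how the streaming job could be submitted, assuming mapper.py and a matching reducer.py (shown further below) are in the current directory and both are marked executable with chmod +x:

# ship both scripts to the cluster and use them as mapper and reducer
hadoop jar $STREAMING \
  -files mapper.py,reducer.py \
  -input wiki_1k_lines \
  -output output \
  -mapper mapper.py \
  -reducer reducer.py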
Check results.
# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000 | head
The reducer just writes each word and its total frequency, in the order in which it receives the keys.
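The reducer listing itself is not included in this extract; a minimal sketch of a summing reducer for the streaming word count (a hypothetical reducer.py) could look like this:

#!/bin/python3
import sys

# the input arrives sorted by key, so all lines belonging to one word are adjacent
current_word = None
count = 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print("{}\t{}".format(current_word, count))
        current_word = word
        count = int(value)
if current_word is not None:
    print("{}\t{}".format(current_word, count))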
The output of the reducer is sorted by key (the word) because that's the ordering that the reducer receives from the sort and shuffle phase. If we're interested in sorting the data by frequency, we can use the Unix sort command with the options -k2, -n, -r, meaning respectively "by field 2", "numeric", "reverse".
Run the new MapReduce job using output as input and writing results to
a new directory output2.
# make sure the target directory output2 does not exist yet
hdfs dfs -rm -r output2
Looking at the output, one can see that it is sorted by frequency, but alphabetically (as text) rather than numerically.

hdfs dfs -cat output2/part-00000 | head
# 10021 his
# 1005 per
# 101 merely
# ...
Wordcount with MapReduce
In general, we can determine how mappers are going to sort their output by
configuring the comparator directive to use the special class
KeyFieldBasedComparator:
-D mapreduce.job.output.key.comparator.class=\
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
This class has some options similar to the Unix sort (-n to sort numerically,
-r for reverse sorting, -k pos1[,pos2] for specifying fields to sort by).
See the Hadoop documentation for KeyFieldBasedComparator.
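As an illustration, here is a hedged sketch of a streaming job that sorts the word counts numerically by the second field. The property names stream.num.map.output.key.fields and mapreduce.partition.keycomparator.options follow the Hadoop streaming documentation; adapt the input and output directories to your own run:

hadoop jar $STREAMING \
  -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D stream.num.map.output.key.fields=2 \
  -D mapreduce.partition.keycomparator.options=-k2,2nr \
  -input output \
  -output output3 \
  -mapper /bin/cat \
  -reducer /bin/cat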
The YARN resource manager
YARN
Hadoop jobs are usually managed by YARN (an acronym for Yet Another Resource Negotiator), which is responsible for allocating resources and managing job scheduling. Basic resource types are:
memory (memory-mb)
virtual cores (vcores)
GPU (gpu)
YARN architecture
The main idea of YARN is to have two distinct daemons for job monitoring and scheduling: a global one (the ResourceManager) and a local one for each application (the ApplicationMaster).
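On a running cluster, YARN's view of the nodes and of the submitted applications can be inspected from the command line; a small sketch using standard YARN commands:

# list the NodeManagers registered with the ResourceManager
yarn node -list
# list the currently running applications
yarn application -list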
YARN on SLURM
How to leverage the best characteristics of job schedulers from both Big Data and HPC architectures in order to decrease latency is a subject of active study.
MRjob
The MRjob Python library
The library can be used for testing MapReduce as well as Spark jobs without the need for a Hadoop cluster.
A MRjob wordcount
from mrjob.job import MRJob

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Listing 4: word_count.py
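The class definition itself is missing from the listing above. A minimal sketch of a complete mrjob word count; the per-word counting logic is an assumption of this sketch, while the class name follows the listing above:

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    # mapper: emit a (word, 1) pair for each word in the input line
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # reducer: sum the counts of each word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Assuming mrjob is installed, this can be run locally with python word_count.py <input file>, without any Hadoop cluster.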
Benchmarking I/O with testDFSio
Running TestDFSIO
Options
Main options are -write (run the write benchmark), -read (run the read benchmark), -clean (remove the benchmark data), -nrFiles (number of files), and -fileSize (size of each file).
In case you want to run the tests as a user who has no write permissions on the HDFS root folder /, you can specify an alternative directory with the -D option, assigning a new value to the property test.build.data.
We are going to run a test with nrFiles files, each of size fileSize, using a custom output directory.
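A sketch of such a run; the location of the test jar, the directory, and the sizes are assumptions to be adapted to your installation:

# the jobclient test jar shipped with Hadoop contains TestDFSIO
JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar

# write test: 4 files of 1 GB each, under a user-writable directory
hadoop jar $JAR TestDFSIO -D test.build.data=/user/$USER/benchmark \
  -write -nrFiles 4 -fileSize 1GB

# read test on the same files
hadoop jar $JAR TestDFSIO -D test.build.data=/user/$USER/benchmark \
  -read -nrFiles 4 -fileSize 1GB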
At the end of the run, TestDFSIO reports:
throughput in MB/sec
average IO rate in MB/sec
standard deviation of the IO rate
test execution time
In addition to the default sequential file access, the mapper class for reading data can be configured to perform various types of random reads. The -compression option allows you to specify a codec for the input and output of data.
What is a codec
The listed throughput shows the average throughput among all the map
tasks. To get an approximate overall throughput on the cluster you can
divide the total MBytes by the test execution time in seconds.
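As a worked example with hypothetical numbers: if 4 files of 1000 MB each were written and the test execution time was 100 seconds, the approximate aggregate throughput is 4000 MB / 100 s = 40 MB/s.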
Clean up
When done with the tests, clean up the temporary files generated by
testDFSio.
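Assuming the same custom directory as in the sketch above, the cleanup could look like this:

# remove the TestDFSIO benchmark data
hadoop jar $JAR TestDFSIO -D test.build.data=/user/$USER/benchmark -clean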
Concluding remarks
Our Hadoop expertise comes from managing a Big Data cluster, LBD (Little Big Data*), at the Vienna University of Technology. The cluster has been running since 2017 and is used for teaching and research.
(*) https://lbd.zserv.tuwien.ac.at/
Thanks
Thanks to:
Janez Povh and Leon Kos, for inviting me once again to hold this training on Hadoop.
Dieter Kvasnicka, my colleague and co-trainer for Big Data on VSC, for his constant support.