The Hadoop Ecosystem

Overview
Big Data Challenges
Distributed system and challenges
Hadoop Introduction
History
Who uses Hadoop
The Hadoop Ecosystem
 Hadoop core components
 HDFS
 MapReduce
 Other Hadoop ecosystem components
 HBase
 Hive
 Pig
 Impala
 Sqoop
 Flume
 Hue
 ZooKeeper
Demo
Big Data Challenges
Solution: Distributed system
Distributed System Challenges

Programming Complexity
Finite bandwidth
Partial failure
The data bottleneck
New Approach to distributed computing

Hadoop:
A scalable, fault-tolerant distributed system for data storage and processing
Distribute data when the data is stored
Process data where the data is
Data is replicated
Hadoop Introduction

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
Some of its characteristics:
 Open source
 Distributed processing
 Distributed storage
 Scalable
 Reliable
 Fault-tolerant
 Economical
 Flexible
History

Originally built as infrastructure for the “Nutch” project.
Based on Google’s MapReduce and Google File System papers.
Created by Doug Cutting in 2005 at Yahoo!
Named after his son’s toy yellow elephant.
Who uses Hadoop

http://wiki.apache.org/hadoop/PoweredBy
http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
The Hadoop Ecosystem

http://hadoopecosystemtable.github.io/
Hadoop Core Components

HDFS – Hadoop Distributed File System (Storage)


MapReduce (Processing)
Hadoop Core Components
A multi-node Hadoop cluster
Nodes
NameNode:
Master of the system
Maintains and manages the blocks which are present on the DataNodes

DataNodes:
Slaves which are deployed on each machine and provide the actual storage
Responsible for serving read and write requests for the clients

JobTracker:
Takes care of job scheduling and assigns tasks to TaskTrackers.

TaskTracker:
A node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from the JobTracker.
HDFS

Hadoop Distributed File System (HDFS) is designed to reliably store very large
files across machines in a large cluster. It is inspired by the Google File System.
Distributes large data files into blocks
Blocks are managed by different nodes in the cluster
Each block is replicated on multiple nodes
The NameNode stores metadata about files and blocks
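
For illustration, a minimal sketch using the HDFS Java FileSystem API (the local and HDFS paths are hypothetical, and the cluster configuration is assumed to be on the classpath); it copies a local file into HDFS and prints the replication factor reported for each file in the target directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath (assumption)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks behind the scenes
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

        // Each file reports the replication factor applied to its blocks
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}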
MapReduce
The Mapper:
 Each block is processed in isolation by a map task called a mapper
 The map task runs on the node where the block is stored

The Reducer:
 Consolidates results from the different mappers
 Produces the final output
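
To make the split concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (input and output paths come from the command line; class and variable names are illustrative): the mapper emits (word, 1) pairs for the block it processes, and the reducer consolidates those counts into the final output.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the node holding the block and emits (word, 1) for every token
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: consolidates the partial counts produced by all mappers
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}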
What makes Hadoop unique

Moving computation to data, instead of moving data to computation.


Simplified programming model: allows users to quickly write and test distributed programs
Automatic distribution of data and work across machines
Other Hadoop Components in the Ecosystem

HBase – Hadoop database for random read/write access
Hive – SQL-like queries and tables on large datasets
Pig – Data flow language and compiler
Oozie – Workflow for interdependent Hadoop jobs
Sqoop – Integration of databases and data warehouses with Hadoop
Flume – Configurable streaming data collection
ZooKeeper – Coordination service for distributed applications

HBase

HBase is an open-source, non-relational, distributed database modeled after Google’s BigTable.
It runs on top of Hadoop and HDFS, providing BigTable-like capabilities for Hadoop.
Features of HBase

A type of NoSQL database
Strongly consistent reads and writes
Automatic sharding
Automatic RegionServer failover
Hadoop/HDFS integration
Massively parallelized processing via MapReduce, with HBase as both source and sink
An easy-to-use Java API for programmatic access
Thrift and REST gateways for non-Java front-ends
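
A minimal sketch of that Java API, assuming a hypothetical "orders" table with a "details" column family already exists (exact client classes vary slightly across HBase versions); it performs one random write and one random read by row key:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Assumes an 'orders' table with a 'details' column family already exists
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("orders"))) {

            // Random write: put one cell keyed by row key
            Put put = new Put(Bytes.toBytes("order-1001"));
            put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("total"), Bytes.toBytes("49.99"));
            table.put(put);

            // Random read: fetch the same row back by key
            Result result = table.get(new Get(Bytes.toBytes("order-1001")));
            byte[] total = result.getValue(Bytes.toBytes("details"), Bytes.toBytes("total"));
            System.out.println("total = " + Bytes.toString(total));
        }
    }
}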
HBase in the CAP theorem

In terms of Eric Brewer’s CAP theorem, HBase is a CP-type system: it favors consistency and partition tolerance over availability.

When to use HBase

When the data is really big: millions or billions of rows, i.e. more than can be stored on a single node
When random read/write access to big data is needed
When thousands of operations must be performed on big data
When the extra features of an RDBMS (typed columns, secondary indexes, transactions, advanced query languages, etc.) are not needed
When there is enough hardware
Difference between HDFS and HBase

HDFS: good for storing large files. HBase: built on top of HDFS; good for hosting very large tables (billions of rows x millions of columns).
HDFS: write once (appends are supported in some recent versions, but not commonly used). HBase: read/write many times.
HDFS: no random read/write. HBase: random read/write.
HDFS: no individual record lookup; data is read in bulk. HBase: fast record lookup and update.
Hive

An SQL-like interface to Hadoop
Data warehouse infrastructure built on top of Hadoop
Provides data summarization, querying, and analysis
Query execution via MapReduce: the Hive interpreter converts queries into MapReduce jobs
Open-source project
Developed by Facebook
Also used by Netflix, CNET, Digg, eHarmony, etc.
Hive

HiveQL example:

SELECT customerId, max(total_cost)
FROM hive_purchases
GROUP BY customerId
HAVING count(*) > 3;
Pig

A scripting platform for processing and analyzing large data sets
Apache Pig allows users to write complex MapReduce programs using a simple scripting language
High-level language: Pig Latin
Pig Latin is a data flow language
Pig translates Pig Latin scripts into MapReduce jobs that execute within Hadoop
Open-source project
Developed by Yahoo
Pig

Pig Latin example:

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;
DUMP X;
Pig and Hive

Both require a compiler to generate MapReduce jobs
Hence queries have high latency when used for real-time responses to ad-hoc queries
Both are good for batch processing and ETL jobs
Fault tolerant
Impala

Cloudera Impala is a query engine that runs on Apache Hadoop.
Its query language is similar to HiveQL.
Does not use MapReduce
Optimized for low-latency queries
Open-source Apache project
Developed by Cloudera
Much faster than Hive or Pig
Comparing Pig, Hive and Impala

Feature                              Pig        Hive       Impala
SQL-based query language             No         Yes        Yes
Schema                               Optional   Required   Required
Process data with external scripts   Yes        Yes        No
Extensible file format support       Yes        Yes        No
Query speed                          Slow       Slow       Fast
Accessible via ODBC/JDBC             No         Yes        Yes
Sqoop

Command-line interface for transferring data between relational databases and Hadoop
Supports incremental imports
Imports are used to populate tables in Hadoop
Exports are used to put data from Hadoop into a relational database such as SQL Server
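
For illustration, hedged sqoop import and export invocations; the JDBC connection string, credentials, table names, and HDFS directories below are hypothetical (-P prompts for the database password):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table customers \
  --target-dir /data/customers \
  --num-mappers 4

sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table daily_summary \
  --export-dir /data/summary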

(Diagram: Sqoop moves data between Hadoop and an RDBMS.)


How Sqoop works

The dataset being transferred is broken into small blocks.


A map-only job is launched.
Each mapper is responsible for transferring one block of the dataset.
How Sqoop works
Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
How Flume works

Data flows as follows:

Agent tier -> Collector tier -> Storage tier

Agent nodes are typically installed on the machines that generate the logs and are the data’s initial point of contact with Flume. They forward data to the next tier of collector nodes, which aggregate the separate data flows and forward them to the final storage tier.
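
As a rough illustration, a single-agent configuration sketch in the style of newer Flume (NG) releases, where chained agents with sources, channels, and sinks play the roles of the tiers described above; the agent name, log path, and HDFS directory are hypothetical:

# One source, one memory channel, one HDFS sink for agent 'a1'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a local log file as it is written (hypothetical path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver the collected events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1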
Hue

Graphical front-end to the cluster
Open-source web interface
Makes the Hadoop platform (HDFS, MapReduce, Oozie, Hive, etc.) easy to use
Hue
ZooKeeper

Because coordinating distributed systems is a Zoo.


ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services.
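
A small hedged sketch using the ZooKeeper Java client (the ensemble address and znode name are hypothetical): one process stores a piece of shared configuration under a znode, and any other process connected to the same ensemble reads the same value back.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Block until the (asynchronous) connection to the ensemble is established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode holding a small piece of shared configuration
        String path = "/app-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=100".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process connected to the same ensemble sees the same value
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}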
DEMO
Hadoop Installation (CDH) for Windows
Download and install VMware Player:
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0
Hadoop Installation (CDH) for Windows

Make sure you have enabled virtualization in the BIOS.
Hadoop Installation (CDH) for Windows
Download the “QuickStart VM with CDH” (Download for VMware):
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-4-7-x.html
Hadoop Installation (CDH) for Windows
Unzip “cloudera-quickstart-vm-4.7.0-0-vmware”
Open CDH using VMware Player:
 Open VMware Player
 Click “Open a Virtual Machine”
 Select the file “cloudera-quickstart-vm-4.7.0-0-vmware” in the extracted directory of “cloudera-quickstart-vm-4.7.0-0-vmware”. The virtual machine will be added to your VMware Player.
 Select this virtual machine and click “Play virtual machine”.
References

http://training.cloudera.com/essentials.pdf
http://en.wikipedia.org/wiki/Apache_Hadoop
http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
https://developer.yahoo.com/hadoop/tutorial/module1.html
http://hadoop.apache.org/
http://wiki.apache.org/hadoop/FrontPage
Questions?
Thanks
