Hadoop Ecosystem
Overview
Big Data Challenges
Distributed system and challenges
Hadoop Introduction
History
Who uses Hadoop
The Hadoop Ecosystem
Hadoop core components
HDFS
MapReduce
Other Hadoop ecosystem components
HBase
Hive
Pig
Impala
Sqoop
Flume
Hue
ZooKeeper
Demo
Big Data Challenges
Solution: Distributed system
Distributed System Challenges
Programming complexity
Finite bandwidth
Partial failure
The data bottleneck
New Approach to Distributed Computing
Hadoop:
A scalable, fault-tolerant distributed system for data storage and processing
Distribute the data as it is stored
Process the data where it is stored
Data is replicated for fault tolerance
Hadoop Introduction
Who uses Hadoop: http://wiki.apache.org/hadoop/PoweredBy
Distributions and commercial support: http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
The Hadoop Ecosystem
http://hadoopecosystemtable.github.io/
Hadoop Core Components
NameNode:
Master that keeps the HDFS metadata: the filesystem namespace and the mapping of blocks to DataNodes
DataNodes:
Slaves deployed on each machine that provide the actual storage
Responsible for serving read and write requests from clients
JobTracker:
Takes care of job scheduling and assigns tasks to TaskTrackers
TaskTracker:
A node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from the JobTracker
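A minimal sketch of how a client sees these components: the code below asks the NameNode, via the standard Java FileSystem API, which DataNodes hold the blocks of a file. The path /data/sample.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // talks to the NameNode
    Path file = new Path("/data/sample.txt");      // hypothetical file

    // The NameNode answers from its metadata; no DataNode is contacted here.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each block reports the DataNodes holding one of its replicas.
      System.out.println("offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + String.join(",", block.getHosts()));
    }
  }
}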
HDFS
The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
Splits large data files into blocks
Blocks are managed by different nodes in the cluster
Each block is replicated on multiple nodes
The NameNode stores metadata about files and blocks
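A sketch of basic HDFS I/O with the Java FileSystem API; the file name and contents are made up for illustration. Note the write-once model: a file is written, closed, and then read many times.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/greeting.txt");   // hypothetical path

    // Write once: the file is split into blocks, and each block is
    // replicated to multiple DataNodes as it is written.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("hello hadoop\n");
    }

    // Read many: the client asks the NameNode for block locations,
    // then streams the data directly from the DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      byte[] buffer = new byte[1024];
      int read = in.read(buffer);
      System.out.println(new String(buffer, 0, read, "UTF-8"));
    }

    System.out.println("replication factor: " + fs.getFileStatus(file).getReplication());
  }
}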
MapReduce
The Mapper:
Each block is processed in isolation by a map task called the mapper
The map task runs on the node where the block is stored
The Reducer:
Consolidates the results from the different mappers
Produces the final output
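The canonical word-count job as a sketch of this model, using the org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) pair for every word in its local block, and the reducer sums the counts for each word. Input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Runs where the block lives: emits (word, 1) for every word in its split.
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Consolidates the mappers' output: sums the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}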
HBase
When to use HBase:
When there is really big data: millions or billions of rows; in other words, data that cannot be stored on a single node
When random read/write access to big data is needed (see the client sketch below)
When thousands of read/write operations on big data are required
When the extra features of an RDBMS (typed columns, secondary indexes, transactions, advanced query languages, etc.) are not needed
When there is enough hardware
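A minimal sketch of the random read/write access HBase adds on top of HDFS, using the standard Java client API (HBase 1.0+). The table name "users", the column family "info", and the values are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
  public static void main(String[] args) throws Exception {
    try (Connection connection =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table

      // Random write: update a single cell, addressed by row key.
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Berlin"));
      table.put(put);

      // Random read: fast lookup of a single row by key,
      // without scanning the whole dataset.
      Get get = new Get(Bytes.toBytes("user42"));
      Result result = table.get(get);
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
    }
  }
}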
Difference between HBase and HDFS
HDFS: Good for storing large files.
HBase: Built on top of HDFS; good for hosting very large tables (billions of rows x millions of columns).
HDFS: Write once; some recent versions support appending to files, but it is not commonly used.
HBase: Read/write many times.
HDFS: No random read/write.
HBase: Random read/write.
HDFS: No individual record lookup; all the data is read.
HBase: Fast record lookup (and update).
Hive
HiveQL example:
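Hive provides a SQL-like language, HiveQL, whose queries are compiled into MapReduce jobs. A hypothetical example (table name, columns, and path are made up for illustration):

-- Define a table over tab-separated text files.
CREATE TABLE pageviews (userid STRING, url STRING, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Point the table at data already sitting in HDFS.
LOAD DATA INPATH '/data/pageviews.tsv' INTO TABLE pageviews;

-- This query is compiled into MapReduce jobs behind the scenes.
SELECT url, COUNT(*) AS hits
FROM pageviews
GROUP BY url
ORDER BY hits DESC
LIMIT 10;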
References
http://training.cloudera.com/essentials.pdf
http://en.wikipedia.org/wiki/Apache_Hadoop
http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
https://developer.yahoo.com/hadoop/tutorial/module1.html
http://hadoop.apache.org/
http://wiki.apache.org/hadoop/FrontPage
Questions?
Thanks