This lecture covers the design and architecture of HBase, an open-source NoSQL database that operates on top of HDFS and is modeled after Google's Bigtable. Key topics include HBase's components, data model, storage hierarchy, cross-data center replication, and features like auto sharding and bloom filters. HBase is characterized by its preference for consistency over availability, making it suitable for applications where strong consistency is required.


Lecture - 18

Design of HBase

Refer Slide Time: (0:17)

Preface: content of this lecture


In this lecture we will discuss:

1. What is HBase?
2. HBase architecture
3. HBase components
4. Data model
5. HBase storage hierarchy
6. Cross-data center replication
7. Auto sharding and distribution
8. Bloom filter and fold, store and shift.

Refer Slide Time: (0:40)


So, HBase is an open-source NoSQL database. It is a distributed, column-oriented data store that can scale out horizontally to thousands of commodity servers and petabytes of indexed storage; a cluster of commodity machines of this size is required precisely to store such large amounts of data. HBase is designed to operate on top of HDFS, for scalability, fault tolerance and high availability. HBase is an implementation of Bigtable, the distributed storage architecture developed at Google. So, HBase works with structured, unstructured and semi-structured data.

Refer Slide Time: (1:59)


So, HBase basically follows the design of Google's Bigtable, which was the first "block-based" storage system; Yahoo! open-sourced this concept, and it is now known as HBase. HBase is a major Apache project today, and Facebook also uses HBase internally. It provides various API functions, for example: Get and Put by row (the key-value operations), and Scan over a row range with filters, which together support range queries. It also has a MultiPut, so that multiple key-value pairs can be stored and handled at the same time. Unlike Cassandra, HBase prefers consistency over availability: Cassandra prefers availability, whereas HBase prefers consistency. So, wherever consistency is the preferred option for the application, HBase is used, and wherever availability is more important, Cassandra is used. That is why both of them exist as NoSQL solutions.
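
To make these API functions concrete, here is a minimal sketch using the HBase Java client (the 2.x API). It is an illustration rather than part of the lecture: the table name users is an assumption, while the personaldata:name column follows the example used later in this lecture.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiDemo {
    public static void main(String[] args) throws IOException {
        // Connect using the cluster settings found on the classpath (hbase-site.xml).
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Put by row: write one cell addressed by (rowkey, family, qualifier).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personaldata"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // MultiPut: several key-value writes handled in one call.
            table.put(Arrays.asList(
                new Put(Bytes.toBytes("row2")).addColumn(Bytes.toBytes("personaldata"),
                    Bytes.toBytes("name"), Bytes.toBytes("Bob")),
                new Put(Bytes.toBytes("row3")).addColumn(Bytes.toBytes("personaldata"),
                    Bytes.toBytes("name"), Bytes.toBytes("Carol"))));

            // Get by row: read a single row back by its key.
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("personaldata"), Bytes.toBytes("name"))));

            // Scan a row range [row1, row3) -- the range query the lecture mentions.
            Scan scan = new Scan().withStartRow(Bytes.toBytes("row1"))
                                  .withStopRow(Bytes.toBytes("row3"));
            try (ResultScanner rs = table.getScanner(scan)) {
                for (Result row : rs) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```
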
Refer Slide Time: (3:51)

Let us see some of the important aspects of the HBase architecture. HBase has region servers, and these region servers handle the regions; there is one HBase master, and ZooKeeper interacts with the HBase master and the other components. HBase also deals with the data nodes, so the HBase master has to communicate with the region servers and with ZooKeeper. We will see this in more detail. In the HBase architecture, a table is split into regions and served by the region servers. Regions are vertically divided by column families into stores, which we will discuss later on, and stores are saved as files on HDFS. HBase utilizes ZooKeeper as its distributed coordination service.

Refer Slide Time: (5:11)


The components in more detail: the client finds the region servers that are serving the particular row range of interest. The HMaster monitors all region server instances in the cluster. Regions are the basic element of availability and distribution for tables. Region servers serve and manage the regions, and in a distributed cluster a region server runs on a data node.

Refer Slide Time: (5:42)


So, this is the typical layout of the data model, which is organized around 'column families'. Data stored in HBase is located by its "rowkey"; the rowkey plays the role of a primary key from the notion of a relational database management system. Records in HBase are stored in sorted order according to their rowkeys, and the data in a row is grouped together into column families. Each column family has one or more columns, and the columns in a family are stored together in a low-level storage file called an 'HFile'.

Refer Slide Time: (6:28)

So, tables are divided into sequences of rows, partitioned by key range into 'regions'. Here we can see that one key range, beginning at row 1, is stored together in a region. These regions are then assigned to data nodes in the cluster called 'region servers'. So, for example, the rowkey range R2 stores another sequence of rows, and it is stored on another region server. These regions are managed by the data nodes, which are therefore called region servers: the region servers sit on the data nodes.

Refer Slide Time: (7:42)


Now, for column families: a column is identified by a column qualifier, which consists of the column family name concatenated with the column name using a colon, e.g. personaldata:name. So, here the column family name together with the column name identifies a particular column: personaldata is the column family name and name is the column name. Column families are mapped to the storage files and are stored in separate files, which can also be accessed separately.
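
As a sketch of this mapping, the snippet below creates a table in which each declared column family becomes its own store, and hence its own files, per region. It assumes an already-open Connection; the table name users is an assumption, and the families personaldata and demographic follow this lecture's examples.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class CreateUsersTable {
    // Each column family declared here is stored separately (its own store
    // and HFiles per region), so a family can be read without touching others.
    public static void create(Connection conn) throws IOException {
        try (Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personaldata"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("demographic"))
                    .build());
        }
    }
}
```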

Refer Slide Time: (8:53)


Now, cells in HBase: table data is stored in HBase table cells. A cell is the combination of row, column family and column qualifier, and it contains a value and a timestamp. So, the key consists of the row key, column name and a timestamp, as shown here, and the entire cell, with the added structural information, is called a 'key-value pair'.
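
A small sketch of how these coordinates surface in the Java client: every Cell returned by a read carries the full (row, column family, column qualifier, timestamp) key alongside its value. The users table is again an illustrative assumption.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinates {
    // Print the full coordinates of every cell in one row:
    // (row, family:qualifier, timestamp) -> value.
    public static void dumpRow(Connection conn, String row) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("users"))) {
            Result result = table.get(new Get(Bytes.toBytes(row)));
            for (Cell cell : result.rawCells()) {
                System.out.printf("(%s, %s:%s, %d) -> %s%n",
                    Bytes.toString(CellUtil.cloneRow(cell)),
                    Bytes.toString(CellUtil.cloneFamily(cell)),
                    Bytes.toString(CellUtil.cloneQualifier(cell)),
                    cell.getTimestamp(),
                    Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```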

Refer Slide Time: (9:21)


So, the HBase data model consists of tables: HBase organizes data into tables, and table names are strings composed of characters that are safe for use in a file system path. Within a table, data is stored according to its row. Rows are identified by their row key, and row keys do not have a data type and are always treated as a byte[] (byte array). This aspect we have already covered. Next, column families: the data within a row is grouped by column family. Every row in a table has the same column families, although a row need not store data in all of its families. Column family names are strings composed of characters that are safe for use in a file system path.

Refer Slide Time: (10:08)


Now, column qualifiers: the data within a column family is addressed via its column qualifier, or simply column. Column qualifiers need not be specified in advance, and they need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[]. The combination of row key, column family and column qualifier uniquely identifies a cell, and the data stored in a cell is referred to as the cell value.

Refer Slide Time: (10:41)


And timestamps: the values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp. If the timestamp is not specified for a read, the latest version is returned. The number of cell value versions retained by HBase is configured for each family; the default number of cell versions is three.
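
A minimal sketch of configuring and reading versions with the Java client, assuming the users table from the earlier examples. Note that readVersions is the HBase 2.x call for asking for more than just the latest version.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsDemo {
    public static void demo(Connection conn) throws IOException {
        byte[] family = Bytes.toBytes("personaldata");
        try (Admin admin = conn.getAdmin()) {
            // Retain up to three versions per cell for this family.
            admin.modifyColumnFamily(TableName.valueOf("users"),
                ColumnFamilyDescriptorBuilder.newBuilder(family).setMaxVersions(3).build());
        }
        try (Table table = conn.getTable(TableName.valueOf("users"))) {
            // Ask for all retained versions; by default only the latest is returned.
            Get get = new Get(Bytes.toBytes("row1")).readVersions(3);
            for (Cell cell : table.get(get).rawCells()) {
                System.out.println(cell.getTimestamp()); // one line per version
            }
        }
    }
}
```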

Refer Slide Time: (11:03)

Let us see the HBase architecture once more in detail. HBase has a client, and the client accesses the HRegionServers, of which there are many; one such HRegionServer is shown here, and it has an HLog. Each HRegionServer is further divided into different HRegions; one such HRegion is shown here. HRegions contain stores, and each store also has a MemStore. Within the store there are StoreFiles, and a StoreFile contains the basic storage unit called an 'HFile', which is stored in HDFS. Then there is one HMaster, and the HMaster communicates with ZooKeeper, with the HRegionServers and with HDFS. We have seen the HMaster; and what is ZooKeeper? It is a small group of servers that runs a consensus protocol such as Paxos. ZooKeeper is the coordination service for HBase and assigns the different nodes and servers to this service; if ZooKeeper is not there, then HBase will stop functioning.

Refer Slide Time: (12:38)


Now, let us see the HBase storage hierarchy. An HBase table is split into multiple regions, which are replicated across the servers. Then there are column families, a column family being a subset of the columns with similar query patterns, and there is one store per combination of column family and region. Each store has a MemStore: in-memory updates to the store, flushed to disk when full, just as we saw in Cassandra. Then there are the StoreFiles for each store, for each region, where the data lives, and within a StoreFile is the basic HFile, which is stored in HDFS. Each HFile is an SSTable, from Google's Bigtable.

Refer Slide Time: (13:41)


So, this is about the HBase HFile. An HFile comprises the data, metadata, indices and a trailer, where the data portion consists of a magic value and (key, value) pairs. Each key-value pair carries the key length, value length, row length, row, column family length, column family, column qualifier, timestamp, key type and value. In the example shown, the rows hold SSN numbers, the column family holds demographic information and the column qualifier holds ethnicity information; together these form the HBase key.
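
As an illustration, the KeyValue class in the HBase Java API packs exactly these fields into one HFile entry. The SSN-style row and the demographic:ethnicity column mirror the slide's example; the concrete values are placeholders.

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueLayout {
    public static void main(String[] args) {
        // One HFile entry: the key part packs row, column family, column
        // qualifier, timestamp and key type ahead of the value bytes.
        KeyValue kv = new KeyValue(
            Bytes.toBytes("123-45-6789"),   // row (an SSN, as in the slide; placeholder)
            Bytes.toBytes("demographic"),   // column family
            Bytes.toBytes("ethnicity"),     // column qualifier
            System.currentTimeMillis(),     // timestamp
            Bytes.toBytes("placeholder"));  // value
        System.out.println(kv); // prints row/family:qualifier/timestamp/type
    }
}
```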

Refer Slide Time: (14:30)

Now, HBase prefers strong consistency over availability. HBase uses a write-ahead log: whenever a client comes with keys (K1, K2, K3, K4) and gives them to the HRegionServer, then, say, (K1, K2) fall on one HRegion and (K3, K4) on another HRegion, and each write goes into the corresponding store, which has a MemStore and StoreFiles, internally stored as HFiles. The write goes to the HLog before being written to the MemStore, to ensure fault tolerance and recovery from failures; this helps recover from a failure by replaying the HLog.
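
A sketch of the client-side view of this write path: the durability setting on a Put controls how the HLog (WAL) append is handled before the write is acknowledged. SYNC_WAL, shown here, is the usual table default in recent HBase releases; it is set explicitly only for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDemo {
    // Every Put is appended to the region server's write-ahead log before it
    // is applied to the MemStore, so a crash can be recovered by log replay.
    public static void durablePut(Connection conn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("row1"))
                .addColumn(Bytes.toBytes("personaldata"), Bytes.toBytes("name"),
                           Bytes.toBytes("Alice"));
            // Sync the WAL append to the filesystem before acknowledging the write.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }
}
```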

Refer Slide Time: (15:35)


So, log replay: after recovery from a failure, or upon boot-up, the HRegionServer and HMaster do the replay. They replay any stale logs, using the timestamps to find out where the database is with respect to the logs, and the replayed edits are added to the MemStore.

Refer Slide Time: (15:53)


Now, cross-datacenter replication: there is a single "master" cluster, and the other "slave" clusters replicate the same table. The master cluster sends its HLogs over to the slave clusters, and coordination among the clusters is done by ZooKeeper. ZooKeeper can be used like a file system to store control information, and it uses different paths for recording the replication state, the peer cluster number and each log.
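
A hedged sketch of registering a slave cluster with the HBase 2.x Admin API; the peer id and the slave cluster's ZooKeeper quorum string are placeholder assumptions, not values from the lecture.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class ReplicationSetup {
    // Register a slave cluster as a replication peer; the master cluster then
    // ships its HLog edits for replicated tables to this peer.
    public static void addPeer(Connection conn) throws IOException {
        try (Admin admin = conn.getAdmin()) {
            admin.addReplicationPeer("1", // placeholder peer id
                ReplicationPeerConfig.newBuilder()
                    .setClusterKey("slave-zk1,slave-zk2,slave-zk3:2181:/hbase")
                    .build());
        }
    }
}
```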

Refer Slide Time: (16:29)

Now, let us see how auto sharding is done. Auto sharding means that a table is divided into row ranges, that is, ranges of keys, which are stored on the region servers, and it is these region servers that serve the clients.

Refer Slide Time: (16:50)


Similarly, in the table's logical view, we can see that the rows are split by row key into key ranges, and these ranges are sharded onto the region servers; here, the rows from A to Z are stored on three different region servers.

Refer Slide Time: (17:11)

This layout is produced automatically, and that is what is called 'auto sharding' and 'distribution'. So, the unit of scalability in HBase is the region, as we have seen, which is managed by the region servers. Regions are sorted, contiguous ranges of rows, spread randomly across the region servers and moved around for load balancing and failover, as we saw in the previous slide. Regions are split automatically, or manually, to scale with the growing data, and capacity is simply a factor of the number of cluster nodes versus the number of regions per node.
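
As a sketch of steering this sharding from the client side, a table can also be created pre-split into row-key ranges, mirroring the A-to-Z split across three region servers shown above; the split points and table name are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    // Regions split automatically as they grow, but a table can also be
    // created pre-split into key ranges, here [A,H), [H,Q) and [Q,...).
    public static void createPreSplit(Connection conn) throws IOException {
        byte[][] splitKeys = { Bytes.toBytes("H"), Bytes.toBytes("Q") };
        try (Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personaldata"))
                    .build(),
                splitKeys);
        }
    }
}
```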

Refer Slide Time: (17:46)

Now, bloom filters are also used in HBase. Bloom filters are generated when an HFile is persisted; they are stored at the end of the HFile and loaded into memory. A bloom filter allows checks at the row and column level, and it can filter entire store files out of a read. This is useful when the data is grouped, and also useful when many misses are expected during reads.
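
A sketch of choosing the bloom filter granularity per column family with the HBase 2.x API: ROW filters on row keys alone, while ROWCOL also folds in the column qualifier, matching the row- and column-level checks mentioned above. The table and family names are the assumed ones from earlier.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterConfig {
    // The chosen bloom filter is generated when each HFile is persisted and
    // lets reads skip store files that cannot contain the requested key.
    public static void enableRowBloom(Connection conn) throws IOException {
        try (Admin admin = conn.getAdmin()) {
            admin.modifyColumnFamily(TableName.valueOf("users"),
                ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("personaldata"))
                    .setBloomFilterType(BloomType.ROW) // or BloomType.ROWCOL
                    .build());
        }
    }
}
```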

Refer Slide Time: (18:15)


So, let us see the positioning of the bloom filter within the HFile. At the bottom of the HFile, the bloom filter information is stored, and it helps identify which of the different blocks can actually contain the requested key. Using the bloom filter, at most two blocks need to be accessed, rather than reading the entire set of blocks sequentially; hence, the access cost is reduced from about six seeks to two seeks, and the bloom filter itself is stored as part of the HFile.

Refer Slide Time: (19:05)


Now, as far as fold, store and shift is concerned, you can see here that for a particular row the same data is stored in multiple instances, because time-series data is stored at different timestamps under one row key. That is why several values for a particular row key appear, varying only in their timestamps T1, T2 and T3, and they are all stored in the HFile.

Refer Slide Time: (19:44)

The logical layout does not match the physical one: physically, all the values are stored with their full coordinates, including the row key, column family, column qualifier and timestamp. This folds each column into a "row per column"; nulls are cost-free, as nothing is stored for them, and versions appear as multiple rows in the folded table.
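
A small sketch that makes this folded layout visible from the client: scanning with all versions requested surfaces each stored version as its own cell, each with its full coordinates. The users table is again an assumption.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FoldedView {
    // Each retained version comes back as a separate cell with full
    // coordinates, mirroring the physical "row per column" layout.
    public static void dumpAllVersions(Connection conn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("users"));
             ResultScanner rs = table.getScanner(new Scan().readAllVersions())) {
            for (Result row : rs) {
                for (Cell cell : row.rawCells()) {
                    System.out.printf("%s/%s:%s@%d%n",
                        Bytes.toString(CellUtil.cloneRow(cell)),
                        Bytes.toString(CellUtil.cloneFamily(cell)),
                        Bytes.toString(CellUtil.cloneQualifier(cell)),
                        cell.getTimestamp());
                }
            }
        }
    }
}
```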

Refer Slide Time: (20:06)


Conclusion: traditional databases (RDBMSs) work with strong consistency and offer the ACID properties, but many modern workloads do not need such strong guarantees; they do need fast response times, that is, availability. Unfortunately, CAP gives us only two properties out of the three, so, given partition tolerance, HBase prefers consistency over availability. Key-value stores and NoSQL systems offer the BASE properties in these scenarios. So, Cassandra offers eventual consistency, and the other varieties of consistency models are striving towards strong consistency. In this lecture we have covered the HBase architecture, its components, the data model, the storage hierarchy, cross-datacenter replication, auto sharding and distribution, bloom filters, and the fold, store and shift operation. Thank you.
