lec18
Design of HBase
1. What is HBase?
2. HBase architecture
3. HBase components
4. Data model
5. HBase storage hierarchy
6. Cross-data center replication
7. Auto sharding and distribution
8. Bloom filter and fold, store and shift.
Let us see some of the important aspects of the HBase architecture. So, HBase has region servers, and these region servers handle the regions; there is one HBase master, and ZooKeeper interacts with the HBase master and with the other components. HBase also deals with the data nodes of HDFS. So, the HBase master has to communicate with the region servers and with ZooKeeper; we will see this in more detail. In the HBase architecture, a table is split into regions, which are served by the region servers. Regions are vertically divided by column families into 'stores', which we will discuss later on, and the stores are saved as files on HDFS. HBase uses ZooKeeper as its distributed coordination service.
So, tables are divided into sequences of rows, by key range, called 'Regions'. Here we can see that one key range, say the rowkey range R1, groups a set of rows together, and these rows are stored together in one region. These regions are then assigned to data nodes in the cluster, which are called 'Region Servers'. That is shown here: for example, the rowkey range R2 will hold another sequence of rows, and it will be stored on another region server. So, the regions are managed by the data nodes, and these data nodes are called 'Region Servers'; the region servers run alongside the data nodes.
Let us see the HBase architecture in more detail once again. So, HBase has a client, and the client can access the HRegionServers. There are many HRegionServers; one such HRegionServer is shown here, which has an HLog, and each HRegionServer is further divided into different HRegions. One such HRegion is shown here: an HRegion contains a store, and the store has a MemStore. Within the store there are StoreFiles, and a StoreFile contains the basic storage unit called an 'HFile', and the HFile is stored in HDFS. Now, there is one HMaster, and the HMaster communicates with ZooKeeper, with the HRegionServers, and with HDFS. We have seen the HMaster; so what is ZooKeeper? It is a small group of servers that runs a consensus protocol such as Paxos. ZooKeeper is the coordination service for HBase and assigns the different nodes and servers to this service; if ZooKeeper is not there, then HBase will stop functioning.
Now, HBase prefers strong consistency over availability. So, HBase uses a write-ahead log: whenever a client comes with keys (K1, K2, K3, K4) and gives them to the HRegionServer, then, let us say, (K1, K2) will fall on one HRegion and (K3, K4) on another HRegion. These values are stored in the stores, and each store has a MemStore and StoreFiles, and internally they are stored as HFiles. So, the write goes to the HLog before being written to the MemStore; this is there to ensure fault tolerance and recovery from failures, because it allows recovery from a failure by replaying the HLog.
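The write path described above, log first and MemStore second, can be sketched as follows. This is an assumed simplification, not HBase's actual code (a real HLog lives durably on HDFS, while here it is just a list that survives a simulated MemStore crash):

```python
# Minimal sketch of the write path: every put is appended to the
# write-ahead log (HLog) before the in-memory MemStore is updated,
# so a lost MemStore can be rebuilt by replaying the log.
class RegionServer:
    def __init__(self):
        self.hlog = []       # durable write-ahead log (on HDFS in HBase)
        self.memstore = {}   # in-memory store, lost on crash

    def put(self, key, value):
        self.hlog.append((key, value))   # 1. log first
        self.memstore[key] = value       # 2. then apply in memory

    def crash(self):
        self.memstore = {}               # MemStore contents are lost

    def recover(self):
        # Replay the HLog in order to reconstruct the MemStore.
        for key, value in self.hlog:
            self.memstore[key] = value

rs = RegionServer()
rs.put("k1", "v1")
rs.put("k2", "v2")
rs.crash()
rs.recover()
print(rs.memstore)  # {'k1': 'v1', 'k2': 'v2'}
```

If the order of the two steps in `put` were reversed, a crash between them could acknowledge a write that no replay can ever recover; logging first is what makes the recovery guarantee hold.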
Now, let us see how the auto sharding is done. Auto sharding means that a table is divided into row ranges, that is, ranges of keys, and these ranges are stored on the region servers, and these region servers then serve the regions to the clients.
This layout is maintained automatically, and that is what is called 'Auto Sharding' and 'Distribution'. So, the unit of scalability in HBase is the region, which, as we have seen, is managed by the region servers. Regions are sorted, contiguous ranges of rows, spread randomly across the RegionServers and moved around for load balancing and failover, as we have already seen in the previous slide. A region is split, automatically or manually, to scale with the growing data, and capacity is only a factor of the number of cluster nodes versus the number of regions per node.
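The splitting step can be illustrated with a toy recursion: when a region holds more rowkeys than some threshold, it splits at its middle key into two daughter regions. The threshold of four keys is made up for illustration (HBase actually splits on region size in bytes, per its split policy):

```python
# Hypothetical illustration of auto-sharding: a region that grows past
# a threshold splits at its middle key into two daughter regions, and
# the daughters split again if they are still too large.
def split_region(rowkeys, max_keys=4):
    """Return the list of regions (each a sorted list of rowkeys)."""
    keys = sorted(rowkeys)
    if len(keys) <= max_keys:
        return [keys]                      # small enough: one region
    mid = len(keys) // 2                   # split at the middle key
    return (split_region(keys[:mid], max_keys)
            + split_region(keys[mid:], max_keys))

rows = ["r%02d" % i for i in range(10)]    # 10 sorted rowkeys
regions = split_region(rows)
for r in regions:
    print(r[0], "-", r[-1])                # each region's key range
```

Note that every daughter region is still a sorted, contiguous key range, which is what lets the region map from earlier stay a simple ordered lookup after a split.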
Now, there is a use of the Bloom filter here in HBase as well. So, Bloom filters are generated when an HFile is persisted; they are stored at the end of the HFile and loaded into memory. The Bloom filter allows a check at the row level and at the column level, and it can filter entire StoreFiles out of a read. This is useful when the data is grouped, and also useful when many misses are expected during reads.
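A toy Bloom filter makes the "filter the entire StoreFile" idea concrete: if the filter says a row is absent, the file can be skipped with certainty; if it says "maybe", the file must be read. This is an assumed sketch, not HBase's implementation (HBase uses optimized hash functions and sized bit arrays):

```python
# Toy Bloom filter: k hash functions set k bits per inserted key.
# A lookup that finds any unset bit proves the key was never inserted.
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _hashes(self, key):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._hashes(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False -> definitely absent (skip the StoreFile).
        # True  -> possibly present (must read the StoreFile).
        return all(self.bits[pos] for pos in self._hashes(key))

bf = BloomFilter()
for row in ["row1", "row2", "row3"]:   # rows persisted in this HFile
    bf.add(row)
print(bf.might_contain("row2"))        # True: file must be read
print(bf.might_contain("rowX"))        # most likely False: file skipped
```

False positives are possible but false negatives are not, which is exactly why many expected misses make the filter pay off: most misses are answered from memory without touching the StoreFile on disk.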
Finally, the logical layout does not match the physical one: physically, every value is stored with its full coordinates, including the row key, column family, column qualifier, and timestamp. This folds the columns into a "row per column" layout. Nulls are cost free, as nothing is stored for them, and versions are simply multiple rows in this folded table.
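The folded layout above can be sketched as a flat list of fully-coordinated cells. The table contents and the `latest` helper are hypothetical, for illustration only; the point is that a missing cell simply has no entry, and a versioned cell is just several entries under the same coordinates:

```python
# Sketch of the folded physical layout: each non-null cell is one
# KeyValue carrying its full coordinates, so nulls cost nothing and
# versions are just additional rows with different timestamps.
from collections import namedtuple

KeyValue = namedtuple("KeyValue", "row family qualifier timestamp value")

store = [
    KeyValue("row1", "cf1", "name", 2, "alice-v2"),   # newer version
    KeyValue("row1", "cf1", "name", 1, "alice"),      # older version
    KeyValue("row1", "cf1", "city", 1, "delhi"),
    # row1 has no "phone" cell: nothing at all is stored for the null.
    KeyValue("row2", "cf1", "name", 1, "bob"),
]

def latest(cells, row, family, qualifier):
    """Return the newest version of one logical cell, or None."""
    matches = [kv for kv in cells
               if (kv.row, kv.family, kv.qualifier) == (row, family, qualifier)]
    return max(matches, key=lambda kv: kv.timestamp).value if matches else None

print(latest(store, "row1", "cf1", "name"))   # 'alice-v2'
print(latest(store, "row1", "cf1", "phone"))  # None: the null was free
```

Reading one logical cell therefore means scanning the folded entries for matching coordinates and taking the newest timestamp, which matches the idea that versions are multiple rows in the folded table.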