R18 DBMS Unit-V

The document discusses different types of data storage and file organization methods in databases. It describes primary, secondary, and tertiary storage, with primary storage being volatile memory like RAM and cache, secondary storage being non-volatile storage like flash memory and magnetic disks, and tertiary storage being slow but large capacity offline storage like tapes and optical disks. It also summarizes common file organization methods like sequential file organization where records are stored in order, and heap file organization where records are inserted without sorting.


UNIT - V

Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and Secondary
Indexes, Index Data Structures, Hash-Based Indexing, Tree-Based Indexing, Comparison of File
Organizations, Indexes and Performance Tuning, Intuitions for Tree Indexes, Indexed Sequential
Access Methods (ISAM), B+ Trees: A Dynamic Index Structure.
***************************************************************************************************

Storage System in DBMS:

 A database system presents an abstract, unified view of the stored data. Physically, however, the
data is stored as bits and bytes on different storage devices.

Types of Data Storage:

For storing data, several types of storage are available. These storage types differ from one
another in speed, cost, and accessibility. The following types of storage devices are used for
storing data:

 Primary Storage
 Secondary Storage
 Tertiary Storage

Primary Storage:

It is the storage area that offers the quickest access to data. Primary storage is also known as
volatile storage, because it does not store data permanently: as soon as the system suffers a
power cut or a crash, the data is lost. Main memory and cache are the two types of primary storage.
 Main Memory: This memory holds the data currently being operated on and handles each
instruction of the computer. It can store gigabytes of data, but it is generally too small (and
too expensive) to hold an entire database. Main memory also loses its whole contents if the
system shuts down because of a power failure or other reasons.

 Cache: The cache is the costliest storage medium, but also the fastest. It is a tiny storage
area usually maintained by the computer hardware. When designing algorithms and query
processors for data structures, designers take cache effects into account.

Secondary Storage:

Secondary storage is also called online storage. It is the storage area that allows the user to save
and store data permanently. This type of memory does not lose data on a power failure or system
crash, which is why it is also called non-volatile storage.

There are some commonly described secondary storage media which are available in almost every
type of computer system:

 Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus)
keys, which are plugged into the USB slots of a computer system. USB keys help transfer data
between computer systems and vary in capacity. Unlike main memory, flash memory retains
its stored data across a power cut. It is commonly used in server systems for caching
frequently used data, which leads to high performance, and it can store larger amounts of
data than main memory.

 Magnetic Disk Storage: This type of storage media is also known as online storage media. A
magnetic disk is used for storing data for a long time and is capable of holding an entire
database. The computer system is responsible for making data available from disk in main
memory for access; if any operation modifies the data, the modified data must be written
back to the disk. The great strength of a magnetic disk is that its data survives a system
crash or failure; a disk failure, however, can easily ruin or destroy the stored data.
Tertiary Storage:

It is storage that is external to the computer system. It has the slowest speed but can hold a very
large amount of data. It is also known as offline storage and is generally used for data backup.
The following tertiary storage devices are available:

 Optical Storage: Optical storage can hold megabytes or gigabytes of data. A Compact
Disk (CD) can store 700 megabytes of data, with a playtime of around 80 minutes. A Digital
Video Disk (DVD) can store 4.7 or 8.5 gigabytes of data on each side of the disk.

 Tape Storage: Tape is a cheaper storage medium than disk. Tapes are generally used for
archiving or backing up data. Access is slow because data is read sequentially from the start,
so tape storage is also known as sequential-access storage. Disk storage, by contrast, is
known as direct-access storage, because we can directly access the data at any location on
the disk.

Storage Hierarchy:

 Besides the above, various other storage devices reside in the computer system. These
storage media can be organized on the basis of data access speed, cost per unit of data, and
reliability. Thus, we can build a hierarchy of storage media based on cost and speed.
 Arranging the storage media described above into such a hierarchy, the higher levels are
expensive but fast. Moving down the hierarchy, the cost per bit decreases and the access
time increases. The media from main memory upward are volatile; everything below main
memory is non-volatile.

File Organization:

o A file is a collection of records. Using the primary key, we can access the records. The type
and frequency of access are determined by the type of file organization used for a given set
of records.
o File organization is a logical relationship among various records. It defines how file records
are mapped onto disk blocks.
o File organization describes the way records are stored in terms of blocks, and how the
blocks are placed on the storage medium.
o The first approach to mapping the database to files is to use several files and store only
fixed-length records in any given file. An alternative approach is to structure our files so
that they can hold records of multiple lengths.
o Files of fixed-length records are easier to implement than files of variable-length records.

Objective of file organization

o Records should be selectable as fast as possible.
o Insert, delete, and update operations on records should be quick and easy.
o Duplicate records must not be introduced as a result of an insert, update, or delete.
o Records should be stored efficiently, at minimal storage cost.

Types of file organization:

File organization comprises various methods, each with pros and cons in terms of access and
selection. The types of file organization are as follows:
1. Sequential File Organization:

This is the most straightforward file arrangement technique: records are saved in sequential
order.
Methods of Sequential File Organization:
There are two approaches to implementing this method:
1. Pile File Method:
It’s a straightforward procedure. In this method, we store the records in sequential order, one
after the other, in the same order in which they are inserted into the tables.
When a record is updated or deleted, the memory blocks are searched for the record. When it is
found, it is marked for deletion, and a new record can be added in its place.

 Insertion of the New Record:

Assume we have four records in the order R1, R3, R9, and R8; a record is nothing more than a
table row. If we wish to add a new record R2 to the sequence, we have to place it at the end of
the file.

2. Sorted File Method:

In this approach the new record is always added at the end of the file, and the sequence is then
sorted in ascending or descending order. Records are sorted using the primary key or some
other key.

If a record is modified, the record is updated first, the file is then sorted, and the revised
record is stored in the correct location.
Insertion of the New Record:

Assume there is a pre-existing sorted sequence of four records R1, R3, R9, and R8. If a new
record R2 needs to be added, it is added at the end of the file, and the series is then sorted.
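The two insertion methods above can be sketched in a few lines of Python. This is an illustrative model only (records as key/data tuples, names made up), not a real DBMS file layout: the pile method appends at the end, while the sorted method appends and then re-sorts.

```python
# Hypothetical sketch of the pile and sorted sequential file methods.
# Records are modeled as (key, data) tuples; names are illustrative only.
def pile_insert(records, record):
    """Pile file method: new records are simply appended at the end."""
    records.append(record)

def sorted_insert(records, record):
    """Sorted file method: append, then keep the file ordered by key."""
    records.append(record)
    records.sort(key=lambda r: r[0])  # re-sort after every insertion

records = [("R1", "..."), ("R3", "..."), ("R9", "..."), ("R8", "...")]

pile = list(records)
pile_insert(pile, ("R2", "..."))        # R2 lands at the end of the file
print([k for k, _ in pile])             # ['R1', 'R3', 'R9', 'R8', 'R2']

ordered = list(records)
sorted_insert(ordered, ("R2", "..."))   # R2 ends up in key order
print([k for k, _ in ordered])          # ['R1', 'R2', 'R3', 'R8', 'R9']
```

Re-sorting the whole file on every insert is what makes the sorted method slower, as the cons below note.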

Sequential File Organization Pros:

 It includes a way of dealing with large amounts of data that is both quick and efficient.

 Files can be stored easily in this fashion using less expensive storage mechanisms such as
magnetic cassettes.

 It has a simple design. The data and info can be stored with comparatively little effort.

 This approach is used when a large number of records must be accessed, such as when
calculating a student’s grade or generating a wage slip.

 This strategy is employed in the creation of reports and statistical calculations.

Sequential File Organization Cons:

 Access to a specific requested record is slow: we cannot jump to it directly, but must read
through the file sequentially, which consumes time.

 Sorting records in a sorted file approach consumes more time and space.
2. Heap File Organization:

 It is the most fundamental and basic form of file organization, based on data blocks. In heap
file organization, records are inserted at the end of the file; no ordering or sorting of records is
required when entries are added.

 When a data block is full, the new record is stored in some other block, which need not be the
next data block in memory: the DBMS can store new entries in any data block. A heap file is
thus the same as an unordered file. Every record in the file has a unique id, and every page in
the file is the same size. The DBMS is in charge of storing and managing the new records.

Insertion of a New Record:


Let’s say we have five records in a heap, R1, R3, R6, R4, and R5, and we wish to add a new record,
R2. If data block 3 is full, the DBMS will insert R2 into whichever data block it chooses, such as
data block 1.
In a heap file organization, if we wish to search, update, or delete data, we must traverse the
file from the beginning until we find the desired record.

Because there is no ordering or sorting of records, searching, updating, or removing records will
take a long time if the database is huge. We must check all of the data in the heap file organization
until we find the necessary record.
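The heap behavior described above can be sketched as follows. This is a toy model (fixed-capacity blocks, tuple records, made-up names), assuming a record goes into any block with free space and that search is a linear scan from the start of the file.

```python
# A minimal sketch of heap file organization, assuming fixed-size data blocks.
BLOCK_CAPACITY = 2

def heap_insert(blocks, record):
    """Place the record in any block with free space, else open a new block."""
    for block in blocks:
        if len(block) < BLOCK_CAPACITY:
            block.append(record)
            return
    blocks.append([record])

def heap_search(blocks, key):
    """Linear scan from the beginning of the file until the record is found."""
    for block in blocks:
        for rec in block:
            if rec[0] == key:
                return rec
    return None

blocks = [[("R1", "...")], [("R3", "..."), ("R6", "...")]]
heap_insert(blocks, ("R2", "..."))   # goes into the first block with room
print(heap_search(blocks, "R2"))     # ('R2', '...')
```

Note that the search touches every record in the worst case, which is exactly why heap files become slow for large databases.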

Pros of Heap File Organization

 It’s a great way of organizing files for bulk insertion. This method is best suited when a
significant amount of data needs to be loaded into the database at once.

 In a small database, fetching and retrieving records is faster than with sequential
organization.

Cons of Heap File Organization:

 Because it takes time to find or alter a record in a large database, this method is
comparatively inefficient.

 This type of organization is not suitable for huge or complicated databases.

3. Hash File Organization:

 Hash file organization is also known as direct file organization. In this approach, a hash
function is computed for each record, and its output gives the address of the block that stores
the record. Any mathematical function, simple or intricate, can be used as the hash function.

Hash File Organization uses the computation of the hash function on some fields of a record. The
output of the hash function defines the position of the disc block where the records will be stored.
 When a record is requested using the hash key columns, an address is generated, and the entire
record is fetched using that address. When a new record needs to be inserted, the hash key is
used to generate the address, and the record is then directly placed. In the case of removing and
updating, the same procedure is followed.

 With this method there is no effort involved in searching and sorting the entire file: each
record is placed directly at its computed memory address rather than sequentially.
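The steps above can be sketched in Python. This is an illustrative model, not a real storage engine: the mod-based hash function, the number of blocks, and the record values are all assumptions for the sketch.

```python
# Sketch of hash file organization: a hash of the key field gives the
# block address directly, for storing, fetching, and deleting alike.
NUM_BLOCKS = 5
blocks = {i: [] for i in range(NUM_BLOCKS)}

def block_address(key):
    return key % NUM_BLOCKS        # hash function computed on the key field

def store(key, row):
    blocks[block_address(key)].append((key, row))

def fetch(key):
    # Only the computed block is examined -- no full-file scan is needed.
    for k, row in blocks[block_address(key)]:
        if k == key:
            return row
    return None

store(103, "employee A")
store(108, "employee B")           # 108 % 5 == 3, same block as 103
print(fetch(103))                  # employee A
```

Deletion and update follow the same pattern: compute the address, then modify the record found in that one block.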

4. B+ File Organization:

 The B+ tree file organization is an advanced form of the indexed sequential access
mechanism. Records are stored in the file in a tree-like structure. It employs the same
key-index idea, in which the primary key is utilised to sort the records; an index value is
generated for each primary key and mapped to the record.

 Unlike a binary search tree (BST), a B+ tree node can have more than two children. All
records are stored solely at the leaf nodes in this method; the intermediate nodes hold no
records and only point to the leaf nodes.

Consider an example B+ tree:

 The tree has only one root node, which holds the value 25.

 There is an intermediate layer of nodes. These nodes do not keep the original records; the
only things they hold are pointers to the leaf nodes.

 The node to the left of the root stores the preceding key value (15), while the node to the
right stores the next key value (30).

 The leaf nodes contain only values, namely 10, 12, 17, 20, 24, 27, and 29.

 Because all of the leaf nodes are balanced, finding any record is much easier.

 This method allows you to search for any record by following a single path and accessing it
quickly.
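The single-path search described above can be sketched over a hand-built version of this example tree. This is a simplified model (dict-based nodes, records only in leaves, no splitting or linked leaves), not a full B+ tree implementation.

```python
# Hand-built sketch of the example B+ tree: root 25, intermediate nodes
# 15 and 30, records only in the leaves. Search follows one root-to-leaf path.
def make_node(keys, children=None, records=None):
    return {"keys": keys, "children": children, "records": records}

leaves = [make_node([10, 12], records=[10, 12]),
          make_node([17, 20, 24], records=[17, 20, 24]),
          make_node([27, 29], records=[27, 29])]
left  = make_node([15], children=[leaves[0], leaves[1]])
right = make_node([30], children=[leaves[2]])
root  = make_node([25], children=[left, right])

def bplus_search(node, key):
    while node["records"] is None:            # descend until a leaf is reached
        i = 0
        while i < len(node["keys"]) and key >= node["keys"][i]:
            i += 1                            # pick the child subtree for key
        node = node["children"][min(i, len(node["children"]) - 1)]
    return key in node["records"]

print(bplus_search(root, 20))   # True
print(bplus_search(root, 23))   # False
```

Each lookup inspects only one node per level, which is why B+ tree search stays fast even as the file grows.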
Pros of B+ Tree File Organization:

 Because all records are stored solely in the leaf nodes and ordered in a sequential linked
list, searching becomes very simple with this method.

 It’s easier and faster to navigate the tree structure.

 The size of a B+ tree is unrestricted, so the number of records and the structure of the
tree can both grow and shrink.

 It is a balanced tree structure: inserts, updates, and deletions do not degrade the tree’s
performance.

Cons of B+ Tree File Organization:

 The B+ tree file organization method is very inefficient for static tables.
Indexing in DBMS:

 Indexing refers to a data structure technique used for quickly retrieving entries from
database files using certain attributes that have been indexed. Indexing in database systems
is comparable to the index of a book.

What is Indexing in DBMS?

Indexing is a technique for improving database performance by reducing the number of disk
accesses necessary when a query is run. An index is a form of data structure. It’s used to swiftly
identify and access data and information present in a database table.

Structure of Index:

We can create indices using some columns of the database.

 The search key is the index’s first column; it contains a copy of the table’s candidate key or
primary key. The key values are saved in sorted order so that the corresponding data can be
accessed quickly.

 The data reference is the index’s second column. It contains a set of pointers to the disk
blocks where the value of the corresponding key can be found.

Methods of Indexing:
Ordered Indices:

To make searching easier and faster, the indices are frequently arranged/sorted. Ordered indices
are indices that have been sorted.

Example

Let’s say we have a table of employees with thousands of records, each ten bytes long. If their
IDs begin with 1, 2, 3, …, and we are looking for the employee with ID 543:

 Without an index, we must scan the disk blocks from the beginning until we reach ID 543;
the DBMS reads 543*10 = 5430 bytes before finding the record.

 With an index (assuming each index entry is 2 bytes), the DBMS reads about 542*2 = 1084
bytes before locating the record, which is significantly less than in the previous case.
Primary Index:
 If the index is created on the basis of the primary key of the table, it is known as a
primary index. Primary keys are unique to each record, so there is a 1:1 relation between
index entries and records.
 As primary keys are stored in sorted order, the performance of the searching operation is
quite efficient.
 The primary index can be classified into two types: dense index and sparse index.

 Dense index
 The dense index contains an index record for every search key value in the data file, which
makes searching faster.
 The number of records in the index table is the same as the number of records in the
main table.
 It needs more space to store the index records themselves. Each index record holds the
search key and a pointer to the actual record on the disk.

 Sparse Index:

 In a sparse index, an index record appears only for a few items in the data file. Each entry
points to a block.
 Instead of pointing to every record in the main table, the index points to records in the
main table at intervals.
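The dense/sparse distinction can be sketched as follows. This is an illustrative model (in-memory blocks, made-up keys): a dense index has one entry per record and gives a direct hit, while a sparse index has one entry per block and requires a short scan within the located block.

```python
# Sketch contrasting a dense index (one entry per record) with a sparse
# index (one entry per block). Data layout is illustrative only.
import bisect

blocks = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e"), (6, "f")]]

# Dense: every search key appears in the index, mapping to (block, slot).
dense = {key: (b, i) for b, blk in enumerate(blocks)
         for i, (key, _) in enumerate(blk)}

# Sparse: only the first key of each block appears in the index.
sparse_keys = [blk[0][0] for blk in blocks]       # [1, 3, 5]

def sparse_lookup(key):
    b = bisect.bisect_right(sparse_keys, key) - 1  # block covering this key
    for k, v in blocks[b]:                         # then scan inside the block
        if k == key:
            return v
    return None

b, i = dense[4]
print(blocks[b][i][1])     # d  (direct hit via the dense index)
print(sparse_lookup(4))    # d  (block found via sparse index, then scanned)
```

The trade-off matches the text: the dense index is faster per lookup but needs an entry for every record, while the sparse index stays small at the cost of a scan inside the chosen block.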
Clustering Index:

 A clustered index is defined over an ordered data file. Sometimes the index is created on
non-primary key columns, which may not be unique for each record.
 In this case, to identify the records faster, we group two or more columns to obtain a
unique value and create an index on them. This method is called a clustering index.
 Records that have similar characteristics are grouped together, and indexes are created for
these groups.

Example: suppose a company has several employees in each department. With a clustering
index, all employees who belong to the same Dept_ID are considered to be within a single
cluster, and the index pointers point to the cluster as a whole. Here Dept_ID is a non-unique
key.

This scheme can be confusing when one disk block is shared by records belonging to different
clusters; using a separate disk block for each cluster is the better technique.
Secondary Index:

 In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that address fetches are fast.
 Secondary memory is then searched for the actual data, based on the address obtained
from the mapping. If the mapping size grows, fetching the address itself becomes slower.
 In that case, the sparse index is no longer efficient. To overcome this problem, secondary
indexing is introduced.
 In secondary indexing, another level of indexing is introduced to reduce the size of the
mapping. A large range for the columns is selected initially, so that the mapping size of the
first level stays small.
 Each range is then further divided into smaller ranges. The first-level mapping is stored in
primary memory, so that address fetches are fast. The second-level mapping and the actual
data are stored in secondary memory (hard disk).
Example:

 To find the record with roll number 111, we search the first level of the index for the
highest entry that is smaller than or equal to 111; at this level we get 100.

 In the second index level, we again look for the largest entry <= 111 and get 110. Using
address 110, we go to the data block and scan each record until we find 111.

 This is how a search is performed in this method. Inserting, updating, and deleting are done
in the same manner.
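The two-level lookup above can be sketched in Python. The index contents mirror the roll-111 example; the specific ranges and block layout are assumptions made for the sketch, not taken from a real system.

```python
# Sketch of the two-level index search: the first level (kept in primary
# memory) narrows the range, the second level gives the data block address,
# and the block is then scanned for the record.
import bisect

first_level  = [100, 200, 300]                                  # coarse ranges
second_level = {100: [100, 110, 120], 200: [200], 300: [300]}   # finer ranges
data_blocks  = {110: list(range(110, 120))}                     # block at 110

def two_level_search(roll):
    # Largest first-level entry <= roll  (111 -> 100)
    g = first_level[bisect.bisect_right(first_level, roll) - 1]
    # Largest second-level entry <= roll (111 -> 110)
    addr = second_level[g][bisect.bisect_right(second_level[g], roll) - 1]
    # Finally scan the data block at that address.
    return roll in data_blocks.get(addr, [])

print(two_level_search(111))   # True
```

Each level is searched with binary search (`bisect`), so only the small first-level mapping ever needs to live in primary memory.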

Hash Based Indexing:

 In a huge database structure, it is very inefficient to search through all the index values to
reach the desired data. The hashing technique calculates the direct location of a data record
on the disk without using an index structure.
 In this technique, data is stored in data blocks whose addresses are generated by a hash
function. The memory locations where these records are stored are known as data buckets
or data blocks.
 The hash function can use any column value to generate the address; most of the time, it
uses the primary key. A hash function can be anything from a simple to a complex
mathematical function.
 We can even use the primary key itself as the address of the data block, i.e., each row is
stored in the data block whose address equals its primary key.

 The data block address can thus be the same as the primary key value. The hash function
can also be a simple mathematical function such as mod, exponential, cos, sin, etc. Suppose
we use a mod (5) hash function to determine the address of the data block.
 In this case, applying mod (5) to the primary keys generates the addresses 3, 3, 1, 4 and 2
respectively, and the records are stored at those data block addresses.
Hash Organization:
Bucket – A bucket is a type of storage container. Data is stored in bucket format in a hash file.
Typically, a bucket stores one entire disc block, which can then store one or more records.

Hash Function – A hash function, abbreviated as h, refers to a mapping function that connects all of
the search-keys K to that address in which the actual records are stored. From the search keys to
the bucket addresses, it’s a function.

Types of Hashing:

Static Hashing:

 In static hashing, the resultant data bucket address is always the same. That means that if
we generate an address for EMP_ID = 103 using the hash function mod (5), it will always
result in the same bucket address, 3. There is no change in the bucket address.
 Hence, in static hashing, the number of data buckets in memory remains constant
throughout. In this example, we have five data buckets in memory to store the data.
Operations of Static Hashing:

 Searching a record
When a record needs to be searched, the same hash function retrieves the address of the
bucket where the data is stored.
 Inserting a record
When a new record is inserted into the table, we generate an address for the new record
based on the hash key, and the record is stored at that location.
 Deleting a record
To delete a record, we first fetch the record that is to be deleted, then delete the record at
that address in memory.
 Updating a record
To update a record, we first search for it using the hash function, and then the data record
is updated.

If we want to insert a new record but the data bucket address generated by the hash function is
not empty (data already exists at that address), the situation is known as bucket overflow. This
is a critical situation for this method.

To overcome it, various methods exist. Some commonly used methods are as follows:

1. Closed Hashing:
When the hash function generates an address at which data is already stored, the next bucket
is allocated to the record. This mechanism is called Linear Probing.

For example: suppose R3 is a new record that needs to be inserted, and the hash function
generates address 112 for it. That address is already full, so the system searches for the next
available data bucket, 113, and assigns R3 to it.
Linear Probing:

 This hashing technique finds the hash key value through the hash function and maps the
key to a position in the hash table:
index = key % size;
 If that hash address is already occupied (a collision), it finds the next empty position in the
hash table:
index = (key + i) % size; where i ranges from 0 to size-1
 We treat the hash table as a circular array: if the table size is N, then after position N-1 the
search continues from position 0 of the array.

Example: 01

Suppose we have a hash table of size 20 (m = 20). We want to insert some elements in linear
probing fashion. The elements/keys are {96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}.

Data Item | Value % No. of Slots                                                    | Hash Value | Probes
96        | 96 % 20 = 16                                                            | 16         | 1
48        | 48 % 20 = 8                                                             | 8          | 1
63        | 63 % 20 = 3                                                             | 3          | 1
29        | 29 % 20 = 9                                                             | 9          | 1
87        | 87 % 20 = 7                                                             | 7          | 1
77        | 77 % 20 = 17                                                            | 17         | 1
48        | 48 % 20 = 8 (collision), (48+1) % 20 = 9 (collision), (48+2) % 20 = 10  | 10         | 3
65        | 65 % 20 = 5                                                             | 5          | 1
69        | 69 % 20 = 9 (collision), (69+1) % 20 = 10 (collision), (69+2) % 20 = 11 | 11         | 3
94        | 94 % 20 = 14                                                            | 14         | 1
61        | 61 % 20 = 1                                                             | 1          | 1
Hash Table

 The order of elements/keys in the hash table (slots 0-19) is:
--, 61, --, 63, --, 65, --, 87, 48, 29, 48, 69, --, --, 94, --, 96, 77, --, --

 The total number of probes is 15.
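Example 01 can be reproduced with a short linear probing routine. This sketch uses the same probe sequence `(key + i) % m` as the formulas above; counting one probe per slot examined gives the same total of 15.

```python
# Linear probing sketch reproducing Example 01 above (m = 20).
def linear_probe_insert(table, key):
    m = len(table)
    for i in range(m):
        slot = (key + i) % m          # probe sequence from the formula above
        if table[slot] is None:
            table[slot] = key
            return i + 1              # number of probes used for this key
    raise RuntimeError("hash table full")

table = [None] * 20
keys = [96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61]
total = sum(linear_probe_insert(table, k) for k in keys)
print(total)                                      # 15
print(table[8], table[9], table[10], table[11])   # 48 29 48 69
```

The second 48 and the key 69 each cost three probes, exactly as in the table; every other key lands on its home slot in one probe.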
2. Open Hashing:

 When a bucket is full, a new data bucket is allocated for the same hash result and is linked
after the previous one. This mechanism is known as overflow chaining.
 For example: suppose R3 is a new record that needs to be inserted into the table, and the
hash function generates address 110 for it. That bucket is too full to store the new data, so a
new bucket is appended after bucket 110 and linked to it.

Example: 01

Using the hash function ‘key mod 7’, insert the following sequence of keys into the hash table:
50, 700, 76, 85, 92, 73 and 101. Use the separate chaining (open hashing) technique for
collision resolution.

Data Item | Value % No. of Slots     | Hash Value
50        | 50 % 7 = 1               | 1
700       | 700 % 7 = 0              | 0
76        | 76 % 7 = 6               | 6
85        | 85 % 7 = 1 (collision)   | 1 (chained)
92        | 92 % 7 = 1 (collision)   | 1 (chained)
73        | 73 % 7 = 3               | 3
101       | 101 % 7 = 3 (collision)  | 3 (chained)
Example: 02

Let the keys be 100, 200, 25, 125, 76, 86, 96 and let m = 10, with h(k) = k mod 10. Use the
separate chaining (open hashing) technique for collision resolution.

Data Item | Value % No. of Slots      | Hash Value
100       | 100 % 10 = 0              | 0
200       | 200 % 10 = 0 (collision)  | 0 (chained)
25        | 25 % 10 = 5               | 5
125       | 125 % 10 = 5 (collision)  | 5 (chained)
76        | 76 % 10 = 6               | 6
86        | 86 % 10 = 6 (collision)   | 6 (chained)
96        | 96 % 10 = 6 (collision)   | 6 (chained)
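Example 02 can be reproduced with a few lines of separate chaining: each bucket is a list, and colliding keys simply extend the chain at the same bucket.

```python
# Separate chaining (open hashing) sketch reproducing Example 02 above.
m = 10
buckets = [[] for _ in range(m)]   # one chain per bucket

for key in [100, 200, 25, 125, 76, 86, 96]:
    buckets[key % m].append(key)   # a collision just extends the chain

print(buckets[0])   # [100, 200]
print(buckets[5])   # [25, 125]
print(buckets[6])   # [76, 86, 96]
```

Unlike linear probing, no key ever moves to another slot; the cost of collisions shows up as longer chains to scan.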

Dynamic Hashing:
o The dynamic hashing method is used to overcome the problems of static hashing, such as
bucket overflow.
o In this method, data buckets grow or shrink as the records increase or decrease. This
method is also known as the extendible hashing method.
o It makes hashing dynamic, i.e., it allows insertion and deletion without degrading
performance.

How to search a key

o First, calculate the hash address of the key.
o Check how many bits are used in the directory; call this number i.
o Take the least significant i bits of the hash address. This gives an index into the directory.
o Using that index, go to the directory and find the bucket address where the record might be.

How to insert a new record

o Firstly, you have to follow the same procedure for retrieval, ending up in some bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.
Example:01

 Example based on Extendible Hashing: Now, let us consider a prominent example of


hashing the following elements: 16,4,6,22,24,10,31,7,9,20,26.
Bucket Size: 3 (Assume)
Hash Function: Suppose the global depth is X. Then the Hash Function returns X LSBs.

Solution:
First, calculate the binary forms of each of the given numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010
 Initially, the global depth and local depth are both 1, so the directory is indexed by a single bit.

Inserting 16:
The binary form of 16 is 10000 and the global depth is 1. The hash function returns the 1 LSB of
10000, which is 0. Hence, 16 is mapped to the directory with id = 0.

Inserting 4 and 6:
Both 4 (100) and 6 (110) have 0 as their LSB, so they hash to the same directory.
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by directory 0
is already full, so an overflow occurs.

Since local depth = global depth, the bucket splits and directory expansion takes place, and the
numbers in the overflowing bucket are rehashed after the split. The global depth is incremented
by 1, so the global depth is now 2, and 16, 4, 6, 22 are rehashed w.r.t. their 2 LSBs
[16 (10000), 4 (100), 6 (110), 22 (10110)].
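The hash function used throughout this example can be sketched directly: with global depth X, a key maps to the directory named by its X least significant bits.

```python
# Sketch of the extendible hashing function: with global depth X,
# a key maps to the directory given by its X least significant bits.
def lsb_hash(key, global_depth):
    bits = format(key, "b").zfill(global_depth)   # binary form, padded
    return bits[-global_depth:]                   # keep the X LSBs

for key in [16, 4, 6, 22]:
    print(key, "->", lsb_hash(key, 2))   # rehash w.r.t. 2 LSBs
# 16 -> 00, 4 -> 00, 6 -> 10, 22 -> 10
```

This matches the rehashing step above: after the split, 16 and 4 share directory 00, while 6 and 22 move to directory 10.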

*Notice that the bucket that did not overflow remains untouched. But since the number of
directories has doubled, two directories, 01 and 11, now point to the same bucket, because the
local depth of that bucket has remained 1. Any bucket whose local depth is less than the global
depth is pointed to by more than one directory.
Inserting 24 and 10: 24 (11000) and 10 (01010) hash to the directories with ids 00 and 10.
Here, we encounter no overflow condition.

Inserting 31, 7, 9: all of these elements [31 (11111), 7 (111), 9 (1001)] have either 01 or 11 as
their 2 LSBs. Hence, they map to the buckets pointed to by directories 01 and 11. We do not
encounter any overflow condition here.

Inserting 20: insertion of data element 20 (10100) again causes an overflow. 20 hashes to the
bucket pointed to by directory 00. Since the local depth of the bucket equals the global depth,
directory expansion (doubling) takes place along with bucket splitting, and the elements in the
overflowing bucket are rehashed with the new global depth.

Inserting 26: The global depth is now 3, so the 3 LSBs of 26 (11010) are considered. Therefore
26 belongs in the bucket pointed to by directory 010.

That bucket overflows; since the local depth of the bucket < global depth (2 < 3), the
directories are not doubled. Only the bucket is split, and its elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained.
 Hashing of 11 Numbers is thus completed.
Example:02

Consider the following grouping of keys into buckets, depending on the suffix of their hash
address:

The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6 are 01,
so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into bucket B2. The
last two bits of 7 are 11, so it goes into bucket B3.

Insert key 9 with hash address 10001 into the above structure:

 Since key 9 has hash address 10001 (last two bits 01), it must go into bucket B1. But bucket
B1 is full, so it gets split.

 The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go into
bucket B1, while the last three bits of 6 are 101, so it goes into bucket B5.

 Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 entries,
because the last two bits of both entries are 00.

 Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 entries,
because the last two bits of both entries are 10.

 Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries, because the
last two bits of both entries are 11.
Advantages of dynamic hashing:

 In this method, performance does not decrease as the data in the system grows. The structure
simply increases the size of memory to accommodate the data.

 In this method, memory is well utilized as it grows and shrinks with the data. There will not
be any unused memory lying around.

 This method is good for the dynamic database where data grows and shrinks frequently.

Disadvantages of dynamic hashing:

 In this method, if the data size increases then the number of buckets also increases. The
addresses of the data are maintained in the bucket address table, because the data
addresses keep changing as buckets grow and shrink. If there is a huge increase in data,
maintaining the bucket address table becomes tedious.

 A bucket overflow situation can still occur in this case, although it takes longer to reach
this situation than in static hashing.

Comparison of File Organizations:


The differences between sequential, heap/direct, hash, and B+ tree file organization in a database
management system (DBMS) are shown below.
Performance Tuning using Indexes:

What is Index tuning?


 Indexes can improve both the query performance and the overall speed of a database.
 The process of enhancing the selection of indexes is called index tuning.

Index tuning is the part of database tuning concerned with selecting and creating indexes. The goal
of index tuning is to reduce query processing time. Predicting the potential use of indexes in
advance is difficult in dynamic environments with many ad-hoc queries. Index tuning analyzes the
queries that rely on indexes, and indexes can be created automatically on-the-fly; no explicit
actions are needed from database users.

 Effective indexes are one of the best ways to improve performance in a database application.
Without an index, the SQL Server engine is like a reader trying to find a word in a book by
examining each page.

 By using the index in the back of a book, a reader can complete the task in a much shorter time.
In database terms, a table scan happens when there is no index available to help a query. In a
table scan SQL Server examines every row in the table to satisfy the query results.

 Table scans are sometimes unavoidable, but on large tables, scans have a significant impact on
performance.

 One of the most important jobs for the database is finding the best index to use when
generating an execution plan.

 First, let’s cover the scenarios where indexes help performance, and when indexes can hurt
performance.

Useful Index Queries

 Just like the reader searching for a word in a book, an index helps when you are looking for a
specific record or set of records with a WHERE clause.

 This includes queries looking for a range of values, queries designed to match a specific
value, and queries performing a join on two tables.

 For example, both of the queries against the Northwind database below will benefit from an
index on the UnitPrice column.
DELETE FROM Products WHERE UnitPrice = 1

SELECT * FROM Products
WHERE UnitPrice BETWEEN 14 AND 16

 Since index entries are stored in sorted order, indexes also help when processing ORDER BY
clauses. Without an index the database has to load the records and sort them during execution.
 An index on UnitPrice will allow the database to process the following query by simply scanning
the index and fetching rows as they are referenced. To order the records in descending order,
the database can simply scan the index in reverse.

SELECT * FROM Products ORDER BY UnitPrice ASC


 Grouping records with a GROUP BY clause will often require sorting, so a UnitPrice index will
also help the following query to count the number of products at each price.

SELECT Count(*), UnitPrice FROM Products
GROUP BY UnitPrice
 By retrieving the records in sorted order through the UnitPrice index, the database sees
matching prices appear in consecutive index entries, and can easily keep a count of products at
each price.
 Indexes are also useful for maintaining unique values in a column, since the database can easily
search the index to see if an incoming value already exists. Primary keys are always indexed for
this reason.

Index Drawbacks:

 Indexes are a performance drag when the time comes to modify records. Any time a query
modifies the data in a table the indexes on the data must change also.

 Achieving the right number of indexes will require testing and monitoring of your database to
see where the best balance lies. Static systems, where databases are used heavily for reporting,
can afford more indexes to support the read only queries.

 A database with a heavy volume of transactions that modify data will need fewer indexes to
allow for higher throughput. Indexes also consume disk space.

The exact size depends on the number of records in the table as well as the number and
size of the columns in the index. Generally this is not a major concern, as disk space is easy to
trade for better performance.
Indexed sequential access method (ISAM):

The ISAM method is an advanced sequential file organization. In this method, records are stored in the
file using the primary key. An index value is generated for each primary key and mapped to the
record. This index contains the address of the record in the file.

If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
Pros of ISAM:
 In this method, since each record has the address of its data block, searching for a record in a
huge database is quick and easy.
 This method supports range retrieval and partial retrieval of records. Since the index is
based on the primary key values, we can retrieve the data for a given range of values. In the
same way, a partial value can also be easily searched, e.g., student names starting with
'JA' can be easily found.
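Range and prefix retrieval over a sorted index can be sketched with two binary searches: one locates the first qualifying key, the other the first key past the range. The index list and helper names below are illustrative only, not part of any ISAM implementation:

```python
import bisect

# A sorted index over a name column (illustrative values).
index = sorted(["JACK", "JAMES", "JANE", "JOHN", "KAVYA", "RAVI"])

def range_lookup(index, low, high):
    """Return all keys in [low, high] using two binary searches."""
    lo = bisect.bisect_left(index, low)
    hi = bisect.bisect_right(index, high)
    return index[lo:hi]

def prefix_lookup(index, prefix):
    """Return keys starting with prefix, e.g. names beginning with 'JA'."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix + "\uffff")  # just past any key with this prefix
    return index[lo:hi]

print(prefix_lookup(index, "JA"))  # ['JACK', 'JAMES', 'JANE']
```

Because the index is kept in key order, both operations touch only the qualifying slice, which is why ISAM handles range and partial-match queries efficiently.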
Cons of ISAM:
 This method requires extra space in the disk to store the index value.
 When the new records are inserted, then these files have to be reconstructed to maintain the
sequence.
 When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
B+ Tree:
 The B+ tree is a balanced multi-way search tree (not a binary tree). It follows a multi-level index format.
 In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf nodes
remain at the same height.
 In the B+ tree, the leaf nodes are connected by a linked list. Therefore, a B+ tree can support
sequential access as well as random access.

Structure of B+ Tree:
 In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.
 It contains an internal node and leaf node.

Internal node

 An internal node of the B+ tree can contain at least ceil(n/2) child pointers, except the root node.
 At most, an internal node of the tree contains n child pointers.

Leaf node

 The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
 At most, a leaf node contains n record pointers and n key values.
 Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.

Searching a record in B+ Tree:

 Suppose we have to search for 55 in the below B+ tree structure. First, we will fetch the
intermediary node, which will direct us to the leaf node that can contain the record for 55.
 So, in the intermediary node, we will find a branch between the 50 and 75 nodes. Then at the end,
we will be redirected to the third leaf node. Here the DBMS will perform a sequential search to
find 55.
B+ Tree Insertion:

 Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node, after
55. Since this is a balanced tree and that leaf node is already full, we cannot insert 60 there.
 In this case, we have to split the leaf node, so that it can be inserted into the tree without
affecting the fill factor, balance, and order.

 The 3rd leaf node has the values (50, 55, 60, 65, 70), and its current parent key is 50. We will
split the leaf node of the tree in the middle so that its balance is not altered. So we can group
(50, 55) and (60, 65, 70) into 2 leaf nodes.
 If these two have to be leaf nodes, the intermediate node cannot branch from 50 alone. It should
have 60 added to it, and then we can have pointers to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy to
find the node where it fits and then place it in that leaf node.

B+ Tree Deletion:

Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from the
intermediate node as well as from the 4th leaf node too. If we remove it from the intermediate
node, then the tree will not satisfy the rule of the B+ tree. So we need to modify it to have a
balanced tree.

After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
Practice Problems on B+ Tree in Data Structure

1. What’s the max number of keys in a B+ Tree of height 3 and order 3?

a. 27

b. 3

c. 26

d. 80

Answer- (c) 26
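The answer can be sanity-checked by counting nodes level by level: with order 3 every node has at most 3 children and 2 keys, and a tree with 3 levels has at most 3^0 + 3^1 + 3^2 = 13 nodes, giving 13 × 2 = 26 keys. A quick sketch (counting height as the number of levels, as the question intends):

```python
def max_keys(height, order):
    """Max keys in a B+ tree whose nodes have at most `order` children
    and `order - 1` keys, with `height` levels of nodes."""
    nodes = sum(order ** level for level in range(height))
    return nodes * (order - 1)

print(max_keys(3, 3))  # 26
```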

2. Which of these data structures would be preferred in DB-system implementation?

a. B Tree

b. AVL Tree

c. Splay Tree

d. B+ Tree

Answer- (d) B+ Tree

3. In the case of a B+ Tree, both the leaves and the internal nodes consist of keys.

a. False

b. True

Answer- (a) False

4. What would be the min number of keys in the leaves in case any B+ Tree would contain a max of
7 pointers in the node?

a. 7

b. 3

c. 6

d. 4

Answer- (b) 3
During insertion the following properties of a B+ Tree must be maintained:
 Each node except the root can have a maximum of M children and at least ceil(M/2) children.
 Each node can contain a maximum of M – 1 keys and a minimum of ceil(M/2) – 1 keys.
 The root has at least two children and at least one search key.
 During insertion, overflow of a node occurs when it contains more than M – 1 search key
values.
Here M is the order of the B+ tree.
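These capacity constraints can be computed directly from the order M; a small sketch (the helper name is illustrative):

```python
import math

def bplus_node_limits(M):
    """Min/max children and keys for a non-root node of a B+ tree of order M."""
    return {
        "max_children": M,
        "min_children": math.ceil(M / 2),
        "max_keys": M - 1,
        "min_keys": math.ceil(M / 2) - 1,
    }

print(bplus_node_limits(5))
# {'max_children': 5, 'min_children': 3, 'max_keys': 4, 'min_keys': 2}
```

For the order-5 examples that follow, a node overflows when it would hold 5 keys, i.e. more than M – 1 = 4.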

Steps for insertion in B+ Tree:


1. Every element is inserted into a leaf node. So, go to the appropriate leaf node.
2. Insert the key into the leaf node in increasing order if there is no overflow. If there is an
overflow, follow the steps below to deal with it while maintaining the B+ Tree
properties.

Properties for insertion B+ Tree


Case 1: Overflow in leaf node
 Split the leaf node into two nodes.
 The first node contains ceil((m-1)/2) values.
 The second node contains the remaining values.
 Copy the smallest search key value from the second node to the parent node (right biased).
Below is the illustration of inserting 8 into B+ Tree of order of 5:

Case 2: Overflow in non-leaf node:

 Split the non-leaf node into two nodes.


 The first node contains ceil(m/2)-1 values.
 Move the smallest among the remaining keys to the parent.
 The second node contains the remaining keys.
Below is the illustration of inserting 15 into B+ Tree of order of 5:

Example:01
Problem: Insert the following key values 6, 16, 26, 36, 46 on a B+ tree with order = 3.
Solution:
Step 1: The order is 3, so a node can contain at most 2 search key values. As insertion
happens only on a leaf node in a B+ tree, insert search key values 6 and 16 in
increasing order in the node. Below is the illustration of the same:

Step 2: We cannot insert 26 in the same node, as it causes an overflow in the leaf node. We have
to split the leaf node according to the rules.

First part contains ceil((3-1)/2) values i.e., only 6. The second node contains the remaining
values i.e., 16 and 26. Then also copy the smallest search key value from the second node to the
parent node i.e., 16 to the parent node. Below is the illustration of the same:
Step 3: Now the next value is 36 that is to be inserted after 26 but in that node, it causes an
overflow again in that leaf node.

 Again follow the above steps to split the node. First part contains ceil((3-1)/2) values i.e.,
only 16. The second node contains the remaining values i.e., 26 and 36.
 Then also copy the smallest search key value from the second node to the parent node
i.e., 26 to the parent node. Below is the illustration of the same:

The illustration is shown in the diagram below.

Step 4: Now we have to insert 46 which is to be inserted after 36 but it causes an overflow in the
leaf node. So we split the node according to the rules. The first part contains 26 and the second
part contains 36 and 46 but now we also have to copy 36 to the parent node but it causes
overflow as only two search key values can be accommodated in a node.

Now follow the steps to deal with overflow in the non-leaf node.

 The first node contains ceil(3/2)-1 values, i.e., '16'.


 Move the smallest among the remaining to the parent, i.e., '26' will be the new parent node.
 The second node contains the remaining keys, i.e., '36', and the rest of the leaf nodes remain
the same.
Below is the illustration of the same:
RAID (Redundant Array of Independent Disks)
RAID stands for Redundant Array of Independent Disks. It is a technology used to
connect multiple secondary storage devices for increased performance, data redundancy, or both. It
gives you the ability to survive one or more drive failures, depending upon the RAID level used.

It consists of an array of disks in which multiple disks are connected to achieve different goals.

RAID technology:

There are 7 levels of RAID schemes, named RAID 0, RAID 1, ..., RAID 6. These levels have the
following characteristics:

 It contains a set of physical disk drives.


 In this technology, the operating system views these separate disks as a single logical disk.
 In this technology, data is distributed across the physical drives of the array.
 Redundant disk capacity is used to store parity information.
 In case of disk failure, the parity information can be used to recover the data.

Standard RAID levels:

RAID 0
 RAID level 0 provides data striping, i.e., data is placed across multiple disks. Because it is based
on striping, if one disk fails then all data in the array is lost.
 This level doesn't provide fault tolerance but increases system performance.

Example:

Disk 0 Disk 1 Disk 2 Disk 3


20 21 22 23
24 25 26 27
28 29 30 31
32 33 34 35

In this figure, block 0, 1, 2, 3 form a stripe.

In this level, instead of placing just one block into a disk at a time, we can place two or more
blocks into a disk before moving on to the next one.
Disk 0 Disk 1 Disk 2 Disk 3
20 22 24 26
21 23 25 27
28 30 32 34
29 31 33 35
In this above figure, there is no duplication of data. Hence, a block once lost cannot be recovered.
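The striping pattern in the first figure can be sketched as a simple modular mapping: logical block i lands on disk i mod n, at stripe position i div n. The function name and block numbering here are illustrative:

```python
def raid0_location(block, num_disks):
    """Return (disk, stripe) for a logical block under round-robin RAID 0 striping."""
    return block % num_disks, block // num_disks

# Blocks 20..23 form one stripe across 4 disks, matching the figure above.
print([raid0_location(b, 4) for b in [20, 21, 22, 23]])
# [(0, 5), (1, 5), (2, 5), (3, 5)]
```

Since every logical block exists on exactly one disk, losing any disk loses a quarter of the blocks with no way to rebuild them, which is the fault-tolerance weakness noted above.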

Pros of RAID 0:
o In this level, throughput is increased because multiple data requests are probably not on the
same disk and can be served in parallel.
o This level fully utilizes the disk space and provides high performance.
o It requires a minimum of 2 drives.

Cons of RAID 0:
o It doesn't contain any error detection mechanism.
o RAID 0 is not a true RAID because it is not fault-tolerant.
o In this level, failure of either disk results in complete data loss in the array.

RAID 1

This level is called mirroring of data as it copies the data from drive 1 to drive 2. It provides 100%
redundancy in case of a failure.
Example:

Disk 0 Disk 1 Disk 2 Disk 3


A A B B
C C D D
E E F F
G G H H

Only half of the drive space is used to store the data. The other half of the drive is just a mirror of
the already stored data.

Pros of RAID 1:
o The main advantage of RAID 1 is fault tolerance. In this level, if one disk fails, then the other
automatically takes over.
o In this level, the array will function even if any one of the drives fails.

Cons of RAID 1:
o In this level, one extra drive is required per drive for mirroring, so the expense is higher.

RAID 2

o RAID 2 consists of bit-level striping using Hamming code parity. In this level, each data bit in
a word is recorded on a separate disk, and the ECC (Error Correcting Code) of the data words
is stored on a different set of disks.
o Due to its high cost and complex structure, this level is not commercially used. The same
performance can be achieved by RAID 3 at a lower cost.
Pros of RAID 2:
o This level uses designated drives to store parity.
o It uses the Hamming code for error detection.

Cons of RAID 2:
o It requires an additional drive for error detection.

RAID 3:

o RAID 3 consists of byte-level striping with dedicated parity. In this level, the parity
information is stored for each disk section and written to a dedicated parity drive.
o In case of drive failure, the parity drive is accessed, and data is reconstructed from the
remaining devices. Once the failed drive is replaced, the missing data can be restored on the
new drive.
o In this level, data can be transferred in bulk. Thus high-speed data transmission is possible.

Disk 0 Disk 1 Disk 2 Disk 3


A B C P(A, B, C)
D E F P(D, E, F)
G H I P(G, H, I)
J K L P(J, K, L)

Pros of RAID 3:
o In this level, data is regenerated using parity drive.
o It contains high data transfer rates.
o In this level, data is accessed in parallel.

Cons of RAID 3:
o It requires an additional drive for parity.
o It gives slow performance when operating on small-sized files.

RAID 4:

o RAID 4 consists of block-level striping with a parity disk. Instead of duplicating data,
RAID 4 adopts a parity-based approach.
o This level allows recovery of at most 1 disk failure due to the way parity works. In this level,
if more than one disk fails, then there is no way to recover the data.
o Levels 3 and 4 both require at least three disks to implement RAID.
o Note that level 3 uses byte-level striping, whereas level 4 uses block-level striping.
Disk 0 Disk 1 Disk 2 Disk 3
A B C P0
D E F P1
G H I P2
J K L P3

In this figure, we can observe one disk dedicated to parity.

In this level, parity can be calculated using the XOR function. If the data bits are 0, 1, 0, 0, then the
parity bit is XOR(0,1,0,0) = 1. If the data bits are 0, 0, 1, 1, then the parity bit is XOR(0,0,1,1) = 0.
That means an even number of ones results in parity 0 and an odd number of ones results in parity 1.

C1 C2 C3 C4 Parity
0 1 0 0 1
0 0 1 1 0

Suppose that in the above figure, C2 is lost due to some disk failure. Then using the values of all the
other columns and the parity bit, we can recompute the data bit stored in C2. This level allows us to
recover lost data.
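The recovery step described above can be sketched in a few lines: XOR-ing the surviving data bits with the parity bit regenerates the lost one, because XOR is its own inverse. A minimal illustration using the first row of the table above:

```python
from functools import reduce

def parity(bits):
    """Even parity: XOR of all bits."""
    return reduce(lambda a, b: a ^ b, bits)

row = [0, 1, 0, 0]          # C1..C4 from the table above
p = parity(row)             # parity bit = 1

# Suppose C2 is lost to a disk failure: rebuild it from the other
# columns and the parity bit.
survivors = [row[0], row[2], row[3]]
recovered = parity(survivors + [p])
print(p, recovered)  # 1 1
```

The same XOR relation underlies RAID 4, RAID 5, and one of the two parity blocks in RAID 6.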

RAID 5

o RAID 5 is a slight modification of the RAID 4 system. The only difference is that in RAID 5,
the parity rotates among the drives.
o It consists of block-level striping with distributed parity.
o Same as RAID 4, this level allows recovery of at most 1 disk failure. If more than one disk
fails, then there is no way for data recovery.

Disk 0 Disk 1 Disk 2 Disk 3 Disk 4


0 1 2 3 P0
5 6 7 P1 4
10 11 P2 8 9
15 P3 12 13 14
P4 16 17 18 19

This figure shows how the parity block rotates among the drives.
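The parity placement in the figure follows a simple left-rotation pattern; with the 5-disk layout shown and stripes numbered from 0, the disk holding each stripe's parity can be sketched as (function name illustrative):

```python
def raid5_parity_disk(stripe, num_disks):
    """Disk holding the parity block for a given stripe (left rotation)."""
    return num_disks - 1 - (stripe % num_disks)

# Stripes 0..4 place parity on disks 4, 3, 2, 1, 0, as in the figure.
print([raid5_parity_disk(s, 5) for s in range(5)])  # [4, 3, 2, 1, 0]
```

Because each stripe's parity lives on a different disk, parity updates are spread across all drives instead of bottlenecking on a single parity disk as in RAID 4.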

This level was introduced to make the random write performance better.

Pros of RAID 5:
o This level is cost effective and provides high performance.
o In this level, parity is distributed across the disks in an array.
o It is used to make the random write performance better.
Cons of RAID 5:
o In this level, disk failure recovery takes a longer time, as parity has to be calculated from all
available drives.
o This level cannot survive two concurrent drive failures.

RAID 6

o This level is an extension of RAID 5. It contains block-level striping with two parity blocks per stripe.

o In RAID 6, you can survive 2 concurrent disk failures. Suppose you are using RAID 5: when a
disk fails, you need to replace it promptly, because if another disk fails at the same time you
won't be able to recover any of the data. This is where RAID 6 plays its part, as you can
survive two concurrent disk failures before you run out of options.

Disk 1 Disk 2 Disk 3 Disk 4


A0 B0 Q0 P0
A1 Q1 P1 D1
Q2 P2 C2 D2
P3 B3 C3 Q3

Pros of RAID 6:
o This level can survive two concurrent disk failures, so it offers higher fault tolerance than
RAID 5.
o As in RAID 5, parity is distributed across the drives in the array.

Cons of RAID 6:
o It requires the capacity of two drives for parity, so usable disk capacity is reduced.
o Write performance is slower than RAID 5, because two parity blocks must be updated on
every write.
