ch6 Distributed Database
Distributed Database System
Advantages
◦ Management of distributed data with different
levels of transparency:
The physical placement of data (files, relations, etc.) is
not known to the user (distribution transparency).
◦ Distribution and Network transparency:
Users do not have to worry about operational details of the
network
Location transparency means that a command can be issued
from any location without affecting how it works
◦ Replication transparency:
Copies of data can be stored at multiple sites for better
availability.
The user is unaware of the existence of copies
Replication also minimizes access time to the required data.
◦ Fragmentation transparency:
A relation can be fragmented horizontally (a subset of its
tuples) or vertically (a subset of its columns)
The user is unaware of the existence of fragments
◦ Increased reliability and availability:
Reliability refers to system lifetime; that is, the system is
running efficiently most of the time
Availability is the probability that the system is
continuously available (usable or accessible) during a
time interval
A distributed database system has multiple nodes
(computers) and if one fails then others are available to
do the job.
◦ Improved performance:
A distributed DBMS fragments the database to keep data
closer to where it is needed most
This reduces data management overhead (access and
modification time) significantly
◦ Easier expansion (scalability):
Refers to expansion of the system in terms of adding
more data, increasing database sizes or adding more
processors
Data Fragmentation, Replication and Allocation
Data Fragmentation
◦ Split a relation into logically related and correct parts. A
relation can be fragmented in two ways:
Horizontal fragmentation
Vertical fragmentation
Horizontal fragmentation
◦ It is a horizontal subset of a relation which contains those
tuples that satisfy a selection condition.
◦ Consider the Employee relation with selection condition
(DNO = 5). All tuples that satisfy this condition will create a
subset which will be a horizontal fragment of Employee
relation.
◦ A selection condition may be composed of several conditions
connected by AND / OR
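The selection described above can be sketched in a few lines of Python. This is only an illustration: the sample rows and the helper name horizontal_fragment are assumptions, and only the attribute DNO and the condition DNO = 5 come from the slide.

```python
# Hypothetical sample rows; only the DNO attribute and the DNO = 5 condition come from the text.
employee = [
    {"SSN": "111", "Name": "Ann", "DNO": 5},
    {"SSN": "222", "Name": "Bob", "DNO": 4},
    {"SSN": "333", "Name": "Cy", "DNO": 5},
]


def horizontal_fragment(relation, condition):
    """Return the tuples of `relation` that satisfy `condition` (a selection)."""
    return [row for row in relation if condition(row)]


# Fragment defined by the single condition DNO = 5.
dno5 = horizontal_fragment(employee, lambda r: r["DNO"] == 5)

# A compound condition connected by OR is handled the same way.
dno4_or_5 = horizontal_fragment(employee, lambda r: r["DNO"] == 4 or r["DNO"] == 5)
```

Every tuple in a fragment keeps all of its columns; only the set of rows shrinks.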
Vertical fragmentation
◦ It is a subset of a relation which is created by a subset of
columns. Thus a vertical fragment of a relation will
contain values of selected columns. There is no selection
condition used in vertical fragmentation.
◦ Consider the Employee relation. A vertical fragment of
Employee can be created by keeping the values of Name,
Bdate, Sex, and Address.
◦ Because there is no condition for creating a vertical
fragment, each fragment must include the primary key
attribute of the parent relation Employee.
In this way all vertical fragments of a relation are
connected.
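A minimal sketch of why the primary key matters, assuming SSN is the primary key of Employee and using made-up rows. The helper names vertical_fragment and rejoin are hypothetical.

```python
# Hypothetical rows; SSN is assumed to be the primary key of Employee.
employee = [
    {"SSN": "111", "Name": "Ann", "Bdate": "1990-01-01", "Sex": "F",
     "Address": "1 Oak St", "Salary": 100, "DNO": 5},
    {"SSN": "222", "Name": "Bob", "Bdate": "1985-06-30", "Sex": "M",
     "Address": "2 Elm St", "Salary": 120, "DNO": 4},
]


def vertical_fragment(relation, columns, key="SSN"):
    """Project `relation` onto `columns`, always keeping the primary key."""
    kept = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in kept} for row in relation]


def rejoin(frag_a, frag_b, key="SSN"):
    """Reconnect two vertical fragments by matching primary-key values."""
    by_key = {row[key]: row for row in frag_b}
    return [{**row, **by_key[row[key]]} for row in frag_a]


personal = vertical_fragment(employee, ["Name", "Bdate", "Sex", "Address"])
work = vertical_fragment(employee, ["Salary", "DNO"])
rebuilt = rejoin(personal, work)
```

Without the shared key column there would be no way to tell which row of one fragment belongs with which row of the other.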
Representing horizontal fragmentation
◦ Each horizontal fragment on a relation can be specified by
a σ_Ci(R) operation in the relational algebra
◦ Complete horizontal fragmentation
A set of horizontal fragments whose conditions C1, C2,
…, Cn include all the tuples in R; that is, every tuple in
R satisfies
(C1 OR C2 OR … OR Cn)
◦ Disjoint complete horizontal fragmentation: No tuple in R
satisfies (Ci AND Cj) where i ≠ j
◦ To reconstruct R from horizontal fragments a UNION is
applied
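The completeness and disjointness conditions, and reconstruction by UNION, can be checked mechanically. The following is a sketch under assumed sample data; the function names are illustrative and not part of the slides.

```python
# Hypothetical relation R.
R = [
    {"SSN": "111", "DNO": 5},
    {"SSN": "222", "DNO": 4},
    {"SSN": "333", "DNO": 5},
]


def is_complete(relation, conditions):
    """Every tuple satisfies (C1 OR C2 OR ... OR Cn)."""
    return all(any(c(row) for c in conditions) for row in relation)


def is_disjoint(relation, conditions):
    """No tuple satisfies (Ci AND Cj) for i != j."""
    return all(sum(1 for c in conditions if c(row)) <= 1 for row in relation)


def reconstruct(fragments):
    """UNION of the fragments: concatenate them and drop duplicate tuples."""
    seen, result = set(), []
    for fragment in fragments:
        for row in fragment:
            marker = tuple(sorted(row.items()))
            if marker not in seen:
                seen.add(marker)
                result.append(row)
    return result


disjoint_conds = [lambda r: r["DNO"] == 4, lambda r: r["DNO"] == 5]
overlap_conds = [lambda r: r["DNO"] >= 4, lambda r: r["DNO"] == 5]

fragments = [[row for row in R if c(row)] for c in overlap_conds]
rebuilt = reconstruct(fragments)
```

With overlapping conditions the UNION still rebuilds R, because duplicate tuples are removed; disjointness only guarantees that no tuple is stored twice.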
Representing vertical fragmentation
◦ A vertical fragment on a relation can be specified by a
π_Li(R) operation in the relational algebra.
◦ Complete vertical fragmentation
A set of vertical fragments whose projection lists L1, L2, …,
Ln include all the attributes in R but share only the primary
key of R. In this case the projection lists satisfy the following
two conditions:
L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of
attributes of R and PK(R) is the primary key of R.
Mixed (Hybrid) fragmentation
◦ A combination of Vertical fragmentation and
Horizontal fragmentation
◦ This is achieved by SELECT-PROJECT operations,
represented by π_Li(σ_Ci(R))
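A mixed fragment is a selection followed by a projection. The sketch below assumes made-up Employee rows and a hypothetical helper mixed_fragment.

```python
# Hypothetical rows of the Employee relation.
employee = [
    {"SSN": "111", "Name": "Ann", "Salary": 100, "DNO": 5},
    {"SSN": "222", "Name": "Bob", "Salary": 120, "DNO": 4},
    {"SSN": "333", "Name": "Cy", "Salary": 90, "DNO": 5},
]


def mixed_fragment(relation, condition, columns):
    """Apply a selection, then a projection: project(columns, select(condition, R))."""
    return [{c: row[c] for c in columns} for row in relation if condition(row)]


# Names of the employees in department 5, keyed by SSN.
dept5_names = mixed_fragment(employee, lambda r: r["DNO"] == 5, ["SSN", "Name"])
```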
Data Replication
Replication refers to the distribution of the whole or part of
the data to a number of sites
◦ Useful in improving availability of data
◦ Improves performance of global queries, since the result of
such a query can be obtained from any one site
◦ In full replication, the entire database is replicated and in
partial replication some selected part is replicated to some
of the sites
◦ The disadvantage of full replication is that it can slow
down update operation since a single logical update must
be performed on every copy of the database to keep the
copies consistent
Types of Distributed Database Systems
Homogeneous
◦ All sites of the database system have identical setup,
i.e., same database system software.
For example, all sites run Oracle or DB2, or Sybase or
some other but the same database system software.
◦ The underlying operating systems may be different (can
be a mixture of Linux, Windows, Unix, etc.)
[Figure: Sites 1 to 5 connected by a communications network, all running Oracle on a mix of Unix, Linux, and Windows.]
Heterogeneous
◦ Federated: Each site may run a different database system, but data
access is managed through a single conceptual schema.
This implies that the degree of local autonomy is minimal. Each site
must adhere to a centralized access policy. There may be a global
schema.
◦ Multidatabase: There is no single conceptual global schema. For data
access, a schema is constructed dynamically as needed by the
application software.
[Figure: Sites 1 to 5 connected by a communications network, running different kinds of DBMS (relational, object-oriented, hierarchical, network) on Unix, Linux, and Windows.]
Query Processing in Distributed Databases
Issues
◦ Cost of transferring data (files and results) over
the network.
This cost is usually high. So, some optimization is
necessary.
Example: Employee at site 1 and Department at site 2
Employee at site 1. 10,000 rows. Row size = 100 bytes.
This means, table size = 10,000 * 100 = 10^6 bytes.
Department at site 2. 100 rows. Row size = 35 bytes.
This means, table size = 100 * 35 = 3,500 bytes.
◦ Example
Q: For each employee, retrieve the employee name and the
name of the department where the employee works.
Q: π_Fname,Lname,Dname (Employee ⋈_Dno=Dnumber Department)
Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
Result
◦ If every employee is related to a department, the result
of this query will have 10,000 tuples
◦ Suppose that each result tuple is 40 bytes long. The
query is submitted at site 3 and the result is sent to this
site
◦ Suppose that Employee and Department relations are
not present at site 3
[Figure: Employee is stored at Site 1 and Department at Site 2; the query is submitted at Site 3.]
Strategies (Available options):
1. Transfer Employee and Department to site 3.
Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and
send the result to site 3.
Query result size = 40 * 10,000 = 400,000 bytes.
Total transfer size = 1,000,000 + 400,000 = 1,400,000
bytes.
3. Transfer Department relation to site 1, execute the join at
site 1, and send the result to site 3
Total bytes transferred = 3500 + 400,000 = 403,500 bytes.
Optimization criteria: minimizing data transfer.
◦ Preferred strategy: strategy 3.
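The arithmetic behind the three strategies can be reproduced with a short script. The sizes come from the example above; the function name strategy_costs and its parameters are illustrative assumptions.

```python
# Table sizes from the example: rows * row size in bytes.
EMPLOYEE_BYTES = 10_000 * 100   # 1,000,000 bytes at site 1
DEPARTMENT_BYTES = 100 * 35     # 3,500 bytes at site 2


def strategy_costs(result_tuples, result_tuple_bytes=40):
    """Bytes transferred by each strategy when the result is needed at site 3."""
    result_bytes = result_tuples * result_tuple_bytes
    return {
        1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both relations to site 3
        2: EMPLOYEE_BYTES + result_bytes,      # ship Employee to site 2, result to site 3
        3: DEPARTMENT_BYTES + result_bytes,    # ship Department to site 1, result to site 3
    }


costs_q = strategy_costs(result_tuples=10_000)
best = min(costs_q, key=costs_q.get)
```

Calling strategy_costs with a different result size recomputes all three totals, which makes it easy to see how the preferred strategy depends on the size of the join result.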
Execution strategies for a second query Q’, whose result has 100 tuples of 40 bytes each:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
Query result size = 40 * 100 = 4000 bytes.
Total transfer size = 1,000,000 +4000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
Total transfer size = 3500 + 4000 = 7500 bytes.
Preferred strategy: Choose strategy 3.
Now suppose the result site is 2.
Possible strategies :
1. Transfer Employee relation to site 2, execute the query and
present the result to the user at site 2
Total transfer size = 1,000,000 bytes for both queries Q and Q’.
2. Transfer Department relation to site 1, execute join at site 1 and
send the result back to site 2
Total transfer size for Q:
3500 +400,000 = 403,500 bytes
Total transfer size for Q’:
3500 +4000 = 7500 bytes
Semijoin:
◦ Objective is to reduce the number of tuples in a relation before
transferring it to another site.
Example execution of Q or Q’:
1. Project the join attributes of Department at site 2, and transfer them
to site 1.
Assume size of Dnumber=4 bytes and size of Mgrssn=9 bytes
Assume size of fname and lname is 15 bytes each
◦ For Q, 4 * 100 = 400 bytes are transferred and for Q’, 9 * 100 =
900 bytes are transferred
2. Join the transferred file with the Employee relation at site 1, and
transfer the required attributes from the resulting file to site 2.
For Q, (4 + 15 + 15) * 10,000 = 34 * 10,000 = 340,000 bytes are transferred and
For Q’, (9 + 15 + 15) * 100 = 39 * 100 = 3,900 bytes are transferred
3. Execute the query by joining the transferred file with Department and
present the result to the user at site 2.
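The semijoin byte counts can be reproduced as follows. This is a sketch: it assumes that each row sent back in step 2 consists of the join attribute plus Fname and Lname, which is how the 34 and 39 byte row sizes appear to be derived, and the helper names and sample rows are hypothetical.

```python
FNAME_BYTES = LNAME_BYTES = 15  # sizes assumed in the example


def semijoin_transfer(join_attr_bytes, dept_rows, matching_emp_rows):
    """Bytes moved in step 1, in step 2, and in total."""
    step1 = join_attr_bytes * dept_rows                       # join column sent to site 1
    row_bytes = join_attr_bytes + FNAME_BYTES + LNAME_BYTES   # assumed layout of returned rows
    step2 = row_bytes * matching_emp_rows                     # reduced Employee sent to site 2
    return step1, step2, step1 + step2


def semijoin(employee, join_values, attr):
    """Keep only Employee tuples whose join attribute appears in the shipped column."""
    return [row for row in employee if row[attr] in join_values]


q = semijoin_transfer(join_attr_bytes=4, dept_rows=100, matching_emp_rows=10_000)
q_prime = semijoin_transfer(join_attr_bytes=9, dept_rows=100, matching_emp_rows=100)

# Tiny illustration of the reduction itself, on made-up rows.
sample_employee = [
    {"Fname": "Ann", "Lname": "Lee", "Dno": 5},
    {"Fname": "Bob", "Lname": "Ray", "Dno": 7},
]
reduced = semijoin(sample_employee, {4, 5}, "Dno")
```

Under these assumptions the totals are 340,400 bytes for Q and 4,800 bytes for Q’, compared with 403,500 and 7,500 bytes when the whole Department relation is shipped to site 1 and the result is sent back to site 2.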
Concurrency Control and Recovery
Distributed Databases encounter a
number of concurrency control and
recovery problems which are not present
in centralized databases.
Some of these problems are listed below:
◦ Dealing with multiple copies of data items
◦ Failure of individual sites
◦ Communication link failure
◦ Distributed commit
◦ Distributed deadlock
Details
◦ Dealing with multiple copies of data items:
The concurrency control must maintain global
consistency
Likewise, the recovery mechanism must recover all
copies and maintain consistency after recovery
◦ Failure of individual sites:
Database availability must not be affected due to the
failure of one or two sites and the recovery scheme
must recover them before they are available for use
Communication link failure:
◦ This failure may create network partition which would affect
database availability even though all database sites may be
running.
Distributed commit:
◦ A transaction may be split into parts that are executed
by a number of sites. This requires a two-phase commit
approach for transaction commit.
Distributed deadlock:
◦ Since transactions are processed at multiple sites, two or
more sites may get involved in deadlock. This must be
resolved in a distributed manner.
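Distributed commit is commonly handled with a two-phase commit protocol: the coordinator first collects a vote from every participating site and commits only if all vote yes. The sketch below is a simplified in-memory illustration; the class and function names are assumptions, and logging, time-outs, and failure handling are left out.

```python
class Participant:
    """One site executing part of a distributed transaction (hypothetical model)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether the local part of the transaction can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    if all(votes):
        for p in participants:                   # phase 2: global commit
            p.commit()
        return "committed"
    for p in participants:                       # phase 2: global abort
        p.abort()
    return "aborted"


ok_sites = [Participant("site1"), Participant("site2")]
ok_outcome = two_phase_commit(ok_sites)

bad_sites = [Participant("site1"), Participant("site2", can_commit=False)]
bad_outcome = two_phase_commit(bad_sites)
```

The point of the two phases is that no site commits its part until it is known that every site is able to commit.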
Transaction management (primary site technique):
◦ Concurrency control and commit are managed
by a single designated site, the primary site
◦ All locks are kept at that site and all requests for
locking or unlocking are sent there
◦ In two phase locking, this site manages locking
and releasing of data items
◦ If all transactions follow two-phase policy at
all sites, then serializability is guaranteed
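A minimal sketch of a lock table kept at one coordinating site, as described above. It assumes exclusive locks only and a hypothetical class name; shared locks, queues of waiting transactions, and network messages are omitted.

```python
class PrimarySiteLockManager:
    """All lock and unlock requests, for every data item, are sent to this one site."""

    def __init__(self):
        self.locks = {}  # data item -> id of the transaction holding it

    def lock(self, txn, item):
        """Grant the lock if the item is free or already held by `txn`."""
        holder = self.locks.get(item)
        if holder is None or holder == txn:
            self.locks[item] = txn
            return True
        return False

    def release_all(self, txn):
        """Shrinking phase of two-phase locking: release every lock held by `txn`."""
        for item in [i for i, t in self.locks.items() if t == txn]:
            del self.locks[item]


manager = PrimarySiteLockManager()
t1_gets_x = manager.lock("T1", "X")       # granted
t2_gets_x = manager.lock("T2", "X")       # refused while T1 holds X
manager.release_all("T1")
t2_retry = manager.lock("T2", "X")        # granted after T1 releases
```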
Advantages:
It is an extension of centralized two-phase locking and
hence simple to implement and manage
Data items are locked only at one site but they can be
accessed at any site at which they reside
Disadvantages:
All transaction management activities go to primary site
which is likely to overload the site.
If the primary site fails, the entire system is inaccessible
To aid recovery, a backup site is designated which
behaves as a shadow of primary site.
◦ In case of primary site failure, backup site can act as
primary site.
Primary Copy Technique:
◦ In this approach, instead of a site, a data item partition is
designated as primary copy
Load of lock coordination is distributed among the
various sites
To lock a data item, just the primary copy of the data
item is locked
◦ Advantages:
Since primary copies are distributed at various sites, a
single site is not overloaded with locking and unlocking
requests
◦ Disadvantages:
Identification of a primary copy is complex...
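The idea of spreading lock coordination can be illustrated with a small directory that maps each data item to the site holding its primary copy. The directory contents, site names, and function name below are assumptions made for the sketch, and only exclusive locks are modeled.

```python
# Hypothetical directory: which site holds the primary copy of each data item.
primary_site_of = {"X": "site1", "Y": "site2", "Z": "site3"}

# One lock table per site; each maps data item -> holding transaction.
lock_tables = {site: {} for site in set(primary_site_of.values())}


def lock(txn, item):
    """Send the lock request to the site of the item's primary copy only."""
    site = primary_site_of[item]
    table = lock_tables[site]
    if table.get(item, txn) != txn:   # held by a different transaction
        return site, False
    table[item] = txn
    return site, True


first = lock("T1", "X")    # handled by site1
second = lock("T2", "X")   # refused by site1
third = lock("T2", "Y")    # handled by site2, independent of site1
```

Requests for X and Y go to different sites, so no single site handles all locking traffic. The cost is that every site needs access to the directory.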
Recovery from a coordinator failure
◦ In both approaches, a coordinator site or copy may become
unavailable. This will require the selection of a new
coordinator.
Primary site approach with no backup site:
◦ Aborts and restarts all active transactions at all sites. Elects
a new coordinator and initiates transaction processing.
Primary site approach with backup site:
◦ Suspends all active transactions, designates the backup site
as the primary site and identifies a new backup site.
Primary site receives all transaction management
information to resume processing.
Primary and backup sites fail or no backup site:
◦ Use election process to select a new coordinator site.
Concurrency control based on voting:
◦ In a voting method, a lock request is sent to all the
sites that have the copy of the data item
◦ Each copy maintains its own lock and can grant or
deny request for it
◦ If a majority of sites grant the lock, the requesting
transaction gets the data item and informs all copies
that it has been granted the lock
◦ To avoid unacceptably long wait, a time-out period is
defined. If the requesting transaction does not get
any vote information, the transaction is aborted.
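The majority rule can be sketched as follows. This is a simplified model: a time-out is represented by sites that are not in the reachable set (their votes never arrive), the site names and function names are hypothetical, and informing the non-granting copies is omitted.

```python
def vote(site_locks, txn, item):
    """One copy's local decision: grant if the item is free or already held by `txn`."""
    return site_locks.get(item, txn) == txn


def request_lock(txn, item, sites, reachable):
    """Ask every site holding a copy; succeed only if a majority of all copies grant."""
    grants = [name for name in sites
              if name in reachable and vote(sites[name], txn, item)]
    if len(grants) * 2 > len(sites):
        for name in grants:
            sites[name][item] = txn   # record the lock at the copies that granted it
        return "granted"
    return "aborted"


# Three sites each hold a copy; every site keeps its own lock table.
sites = {"s1": {}, "s2": {}, "s3": {}}
everyone = {"s1", "s2", "s3"}

r1 = request_lock("T1", "X", sites, everyone)       # 3 of 3 grant
r2 = request_lock("T2", "X", sites, everyone)       # 0 of 3 grant, X is held by T1
r3 = request_lock("T3", "Y", sites, {"s1"})         # only 1 of 3 votes arrives
r4 = request_lock("T3", "Y", sites, {"s1", "s2"})   # 2 of 3 grant
```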
Client-Server Database Architecture
It consists of clients running client software, a set of
servers which provide all database functionalities and a
reliable communication infrastructure.
[Figure: Servers 1 to n and Clients 1 to n connected through the communication infrastructure.]
Server: is responsible for local data management at a
site, much like centralized DBMS software
Client: is responsible for most of the distribution
function; it accesses data distribution information from
the DBMS catalog and processes all requests that require
access to more than one site
The communication software manages communication
among clients and servers
The processing of an SQL query goes as
follows:
◦ Client parses a user query and decomposes it
into a number of independent sub-queries.
◦ Each server processes its sub-query and sends the
result to the client.
◦ The client combines the results of sub queries
and produces the final result.
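The three steps above can be sketched with in-memory lists standing in for the servers. The sketch assumes each server holds a horizontal fragment of Employee and that the same predicate is sent to every server as its sub-query; the names and rows are made up.

```python
# Hypothetical servers, each holding a horizontal fragment of Employee.
servers = {
    "server1": [{"SSN": "111", "Name": "Ann", "DNO": 5},
                {"SSN": "222", "Name": "Bob", "DNO": 4}],
    "server2": [{"SSN": "333", "Name": "Cy", "DNO": 5}],
}


def server_execute(rows, predicate):
    """A server evaluates its sub-query against local data only."""
    return [row for row in rows if predicate(row)]


def client_query(all_servers, predicate):
    """The client sends one sub-query per server and combines the partial results."""
    partial = [server_execute(rows, predicate) for rows in all_servers.values()]
    combined = [row for part in partial for row in part]
    return sorted(combined, key=lambda r: r["SSN"])


dept5 = client_query(servers, lambda r: r["DNO"] == 5)
```

The servers never talk to each other here; all knowledge of how the data is distributed sits in the client, which matches the division of responsibilities described for this architecture.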