Distributed

The document discusses distributed databases and their design. Key points include:
- A distributed database is a collection of interconnected databases spread across locations that communicate over a network and appear as a single logical database.
- Distributed database management systems (DDBMSs) ensure data modified at any site is universally updated.
- Distributed database design involves fragmenting relations, allocating fragments to sites, and potentially replicating fragments for availability and reliability.

Uploaded by

mehari kiros
Chapter 4

Distributed Database: Concepts and Design

Distributed Databases
 A distributed database is a collection of
multiple interconnected databases, spread
physically across various locations, that
communicate via a computer network.
 It is a single logical database that is physically
divided among networked computers.
 Distributed DBMS: a centralized software
system that supports and manipulates
distributed databases.
 It ensures that data modified at any site is
updated universally.
Concepts
Distributed Database.
A logically interrelated collection of shared data
(and a description of this data), physically
distributed over a computer network.

Distributed DBMS.
Software system that permits the management of
the distributed database and makes the
distribution transparent to users.
Concepts (continued)
 Users access the distributed database
via applications, which are classified as:
 local applications, which do not require
data from other sites; and
 global applications, which do require
data from other sites.
 We require a DDBMS to have at least one
global application.
Summary
 Collection of logically-related shared data.
 Data split into fragments.
 Fragments may be replicated.
 Fragments/replicas allocated to sites.
 Sites linked by a communications network.
 Data at each site is under control of a DBMS.
 DBMSs handle local applications autonomously.
 Each DBMS participates in at least one global
application.
Distributed Processing and Distributed Database

 Distributed processing shares the database’s
logical processing among two or more
physically independent sites that are
connected through a network.
 A distributed database stores a logically
related database over two or more physically
independent sites connected via a computer
network.
[Figure: Distributed Processing Environment]
[Figure: Distributed Database Environment]
Distributed Processing and Distributed Database

 Distributed processing does not require a
distributed database, but a distributed
database requires distributed processing.
 Distributed processing may be based on a single
database located on a single computer. In order
to manage distributed data, copies or parts of
the database processing functions must be
distributed to all data storage sites.
 Both distributed processing and distributed
databases require a network to connect all
components.
Distributed Databases (DDB) Concept
 Multiple independent systems
 Each has DBMS engine, queries, locking,
transactions, etc.
 Usually on different machines & locations
 Can be different hardware, OS, software.

SELECT Sales
FROM Bahridar.Sales
UNION
SELECT Sales
FROM Mekelle.Sales
UNION
SELECT Sales
FROM Adama.Sales

[Figure: sites A.A, Bahir Dar, Mekelle, and Adama]
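The UNION query on this slide can be tried out locally. The following is only a rough sketch, assuming Python's built-in sqlite3 module: three attached in-memory databases stand in for the Bahridar, Mekelle, and Adama sites (a real DDBMS would reach separate servers over the network), and the sample sales figures are made up for illustration.

```python
import sqlite3

# Assumption: each attached in-memory database plays the role of one site.
conn = sqlite3.connect(":memory:")
sites = ["Bahridar", "Mekelle", "Adama"]
for site in sites:
    conn.execute(f"ATTACH DATABASE ':memory:' AS {site}")
    conn.execute(f"CREATE TABLE {site}.Sales (Sales INTEGER)")

# Made-up sample rows; the value 200 is stored at two sites on purpose.
sample = {"Bahridar": [100, 200], "Mekelle": [200, 300], "Adama": [400]}
for site, values in sample.items():
    conn.executemany(
        f"INSERT INTO {site}.Sales VALUES (?)", [(v,) for v in values]
    )

# Build the same global query as on the slide: one SELECT per site.
query = " UNION ".join(f"SELECT Sales FROM {site}.Sales" for site in sites)
all_sales = sorted(row[0] for row in conn.execute(query))
print(all_sales)
```

Because UNION removes duplicate rows, the value stored at two sites appears once in the result; UNION ALL would keep both copies.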
Distributed Database System (DDBS) Concept

[Figure: separate databases at Bahir Dar, Mekelle, and Adama, connected by a network]

• There must be different databases as opposed to a central
database
• Network is a part of DDB system
Features of Distributed database
Data is stored at a number of sites
Sites are interconnected by a network
DDB is logically a single database
DDBMS has full functionality of DBMS.
Advantages of DDBMSs
 Organizational Structure
 Many organizations are naturally distributed
over several locations.
 Shareability and Local Autonomy
 The users at one site can access data stored at
other sites. Data can be placed at the site close
to the users who normally use that data. In this
way, users have local control of the data and
they can consequently establish and enforce
local policies regarding the use of this data.
 A global database administrator (DBA) is
responsible for the entire system.
Advantages of DDBMSs
 Improved Availability
 In a centralized DBMS, a computer failure
terminates the operations of the DBMS.
However, a failure at one site of a DDBMS, or a
failure of a communication link making some
sites inaccessible, does not make the entire
system inoperable. Distributed DBMSs are
designed to continue to function despite such
failures. If a single node fails, the system may
be able to reroute the failed node’s requests to
another site.
Advantages of DDBMSs
 Improved Reliability
 As data may be replicated so that it exists at
more than one site, the failure of a node or a
communication link does not necessarily make
the data inaccessible.
 Improved Performance
 Since each site handles only a part of the
entire database, there may not be the same
contention for CPU and I/O services as
characterized by a centralized DBMS.
Advantages of DDBMSs
 Economics
 cost saving occurs where databases are
geographically remote and the applications
require access to distributed data. In such
cases, owing to the relative expense of data
being transmitted across the network as
opposed to the cost of local access, it may be
much more economical to partition the
application and perform the processing locally
at each site.
 It is also much more cost-effective to add
workstations to a network than to update a
mainframe system.
Advantages of DDBMSs
 Modular Growth
 In a distributed environment, it is much easier
to handle expansion. New sites can be added to
the network without affecting the operations
of other sites. This flexibility allows an
organization to expand relatively easily.
Increasing database size can usually be handled
by adding processing and storage power to the
network. In a centralized DBMS, growth may
entail changes to both hardware (the
procurement of a more powerful system) and
software (the procurement of a more powerful
or more configurable DBMS).
Disadvantages of DDBMSs
 Complexity
 Cost
 Security
 Integrity Control More Difficult
 Lack of Standards
 Lack of Experience
 Database Design More Complex
Types of DDBMS
 Homogeneous DDBMS
 Heterogeneous DDBMS
Homogeneous DDBMS
 All sites use same DBMS product.
 Much easier to design and manage.
 Share a common global schema
 Each site gives up part of its autonomy in
terms of its right to change the schema or software.
 The approach provides incremental growth,
making the addition of a new site to the
DDBMS easy, and allows increased performance
by exploiting the parallel processing capability
of multiple sites.
Heterogeneous DDBMS
 Sites may run different DBMS products, with
possibly different underlying data models.
 Occurs when sites have implemented their own
databases and integration is considered later.
 Translations required to allow for:
 Different hardware and different DBMS
products.
Heterogeneous DDBMS
 For example, relations in the relational data model
are mapped to records and sets in the network
model.
 It is also necessary to translate the query
language used (for example, SQL SELECT
statements are mapped to the network FIND and
GET statements).
 Typical solution is to use gateways, which convert
the language and model of each different DBMS
into the language and model of the relational
system.
Distributed Database Design
 Three key issues:

 Fragmentation.
 Allocation
 Replication
Distributed Database Design
 Fragmentation
 A relation may be divided into a number of sub-relations,
which are then distributed.

 Allocation
 Each fragment is stored at the site with "optimal"
distribution.

 Replication
 A copy of a fragment may be maintained at several sites.
Fragmentation
 Definition and allocation of fragments
carried out strategically to achieve:
 Locality of Reference
 Improved Reliability and Availability
 Improved Performance
 Balanced Storage Capacities and Costs
 Minimal Communication Costs.
 Involves analyzing the most important
applications, based on
quantitative/qualitative information.
Fragmentation
 Quantitative information may include:
 frequency with which an application is run;
 site from which an application is run;
 performance criteria for transactions and
applications.
 Qualitative information may include
transactions that are executed by
application, type of access (read or write),
and predicates of read operations.
Why Fragment?
 Usage
 Applications work with views rather than entire
relations.
 Efficiency
 Data is stored close to where it is most
frequently used.
 Data that is not needed by local applications is
not stored.
Why Fragment?
 Parallelism
 With fragments as unit of distribution,
transaction can be divided into several
subqueries that operate on fragments.
 Security
 Data not required by local applications is not
stored and so not available to unauthorized
users.
 Disadvantages
 Performance
 Integrity.
Why Fragment?
 Disadvantages
 Performance: the performance of global
applications that require data from several
fragments located at different sites may
be slower.
 Integrity: integrity control may be more
difficult if data and functional dependencies are
fragmented and located at different sites.
Correctness of Fragmentation
 Fragmentation cannot be carried out
randomly. There are three rules that must
be followed during fragmentation:
 Completeness
 Reconstruction
 Disjointness.
Correctness of Fragmentation
 Completeness
 If relation R is decomposed into fragments R1, R2, ... Rn,
each data item that can be found in R must appear in at least one
fragment.
 Reconstruction
 Must be possible to define a relational operation that will
reconstruct R from the fragments. This rule ensures that
functional dependencies are preserved.
 Disjointness.
 If a data item di appears in fragment Ri, then it should not appear
in any other fragment. Vertical fragmentation is the exception to
this rule, where primary key attributes must be repeated to allow
reconstruction. This rule ensures minimal data redundancy.
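The three correctness rules can be checked mechanically for a horizontal fragmentation. The sketch below is an illustration only: it assumes a relation is modelled as a Python set of tuples, and the function name and sample rows are hypothetical.

```python
def check_horizontal_fragmentation(relation, fragments):
    """Check the three correctness rules for horizontal fragments.

    relation  -- set of tuples (the original relation R)
    fragments -- list of sets of tuples (R1, R2, ..., Rn)
    """
    union = set().union(*fragments)
    return {
        # Completeness: every tuple of R appears in at least one fragment.
        "completeness": relation <= union,
        # Reconstruction: the union of the fragments gives back exactly R.
        "reconstruction": union == relation,
        # Disjointness: no tuple is stored in more than one fragment.
        "disjointness": sum(len(f) for f in fragments) == len(union),
    }


# Made-up sample relation: (propertyNo, type)
R = {(1, "House"), (2, "Flat"), (3, "House")}

good = [{(1, "House"), (3, "House")}, {(2, "Flat")}]
bad = [{(1, "House")}, {(1, "House"), (2, "Flat")}]  # overlaps and misses tuple 3

good_result = check_horizontal_fragmentation(R, good)
bad_result = check_horizontal_fragmentation(R, bad)
print(good_result)
print(bad_result)
```

For vertical fragments the disjointness test would have to ignore the repeated primary key attributes, as the rule itself notes.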
Types of Fragmentation
 Four types of fragmentation:
 Horizontal: horizontal fragments are subsets
of tuples, defined using the Selection
operation of the relational algebra.
 Vertical: vertical fragments are subsets of
attributes, defined using the Projection
operation of the relational algebra.
 Mixed: consists of a horizontal fragment that
is subsequently vertically fragmented, or a
vertical fragment that is then horizontally
fragmented; defined using the Selection and
Projection operations of the relational algebra.
Types of Fragmentation
 Four types of fragmentation (continued):
 Derived: a horizontal fragment that is based on
the horizontal fragmentation of a parent
relation.
 We use the term child to refer to the relation
that contains the foreign key and parent to the
relation containing the targeted primary key.
Derived fragmentation is defined using the
Semijoin operation of the relational algebra.
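Derived fragmentation can be sketched in a few lines. The example below is a hedged illustration, not a DDBMS feature: relations are modelled as lists of dictionaries, the `semijoin` helper is hypothetical, and the Staff (parent) and property (child) rows are made-up sample data.

```python
def semijoin(child, parent_fragment, key):
    """Return the child rows whose foreign key matches a row in parent_fragment."""
    parent_keys = {row[key] for row in parent_fragment}
    return [row for row in child if row[key] in parent_keys]


# Parent relation (holds the primary key staffNo), made-up rows.
staff = [
    {"staffNo": "SG37", "branchNo": "B003"},
    {"staffNo": "SG14", "branchNo": "B003"},
    {"staffNo": "SL41", "branchNo": "B005"},
]
# Child relation (holds staffNo as a foreign key), made-up rows.
properties = [
    {"propertyNo": "PG4", "staffNo": "SG37"},
    {"propertyNo": "PG36", "staffNo": "SG14"},
    {"propertyNo": "PL94", "staffNo": "SL41"},
]

# Horizontal fragmentation of the parent by branch.
staff_fragments = {
    branch: [s for s in staff if s["branchNo"] == branch]
    for branch in ("B003", "B005")
}
# Derived fragmentation of the child: one semijoin per parent fragment.
property_fragments = {
    branch: semijoin(properties, fragment, "staffNo")
    for branch, fragment in staff_fragments.items()
}
for branch, rows in property_fragments.items():
    print(branch, [r["propertyNo"] for r in rows])
```

Each child row ends up at the same site as the parent row it refers to, so a join between the two can be evaluated locally.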
Horizontal and Vertical Fragmentation

Horizontal Fragmentation

 Consists of a subset of the tuples of a relation.
 Defined using Selection operation of relational
algebra: $\sigma_{p}(R)$
 For example:

$P_1 = \sigma_{\text{type}=\text{'House'}}(\text{Rent})$
$P_2 = \sigma_{\text{type}=\text{'Flat'}}(\text{Rent})$

 This strategy is determined by looking at predicates
used by transactions.
 Reconstruction involves using a union, e.g. $R = R_1 \cup R_2$
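As a small illustration of the Selection-based definition, the sketch below splits a made-up Rent relation by the predicate on type and rebuilds it with a union. The `select` helper and the sample rows are assumptions for illustration only.

```python
def select(relation, predicate):
    """Relational Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# Made-up sample rows for the Rent relation.
rent = [
    {"propertyNo": "PA14", "type": "House"},
    {"propertyNo": "PL94", "type": "Flat"},
    {"propertyNo": "PG21", "type": "House"},
]

p1 = select(rent, lambda r: r["type"] == "House")
p2 = select(rent, lambda r: r["type"] == "Flat")

# The fragments are disjoint, so concatenating them acts as the union.
reconstructed = sorted(p1 + p2, key=lambda r: r["propertyNo"])
print(reconstructed)
```

If a row had a type other than 'House' or 'Flat', it would fall into neither fragment and the completeness rule would be violated, so the predicates must cover every tuple.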
Vertical Fragmentation
 Consists of a subset of attributes of a relation.
 Defined using Projection operation of relational algebra:
$\Pi_{a_1, \ldots, a_n}(R)$
 For example:
$S_1 = \Pi_{\text{staffNo, position, sex, DOB, salary}}(\text{Staff})$
$S_2 = \Pi_{\text{staffNo, fName, lName, branchNo}}(\text{Staff})$
 Determined by establishing affinity of one attribute to
another.
 For vertical fragments, reconstruction involves the join
operation; the fragments are disjoint except for the
primary key.
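The projection and join steps can be sketched as follows. This is an illustration under stated assumptions: relations are lists of dictionaries, `project` and `join_on_key` are hypothetical helpers, and the Staff rows are made-up sample data.

```python
def project(relation, attributes):
    """Relational Projection: keep only the named attributes of every row."""
    return [{a: row[a] for a in attributes} for row in relation]


def join_on_key(left, right, key):
    """Join two fragments on their shared primary key."""
    right_by_key = {row[key]: row for row in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]


# Made-up sample rows for the Staff relation.
staff = [
    {"staffNo": "SL21", "fName": "John", "lName": "White", "position": "Manager",
     "sex": "M", "DOB": "1965-10-01", "salary": 30000, "branchNo": "B005"},
    {"staffNo": "SG37", "fName": "Ann", "lName": "Beech", "position": "Assistant",
     "sex": "F", "DOB": "1980-11-10", "salary": 12000, "branchNo": "B003"},
]

s1 = project(staff, ["staffNo", "position", "sex", "DOB", "salary"])
s2 = project(staff, ["staffNo", "fName", "lName", "branchNo"])

# staffNo is repeated in both fragments so that the join can rebuild Staff.
rebuilt = join_on_key(s1, s2, "staffNo")
print(rebuilt == staff)
```

Dropping staffNo from either fragment would leave no way to match the two halves of a row, which is why the primary key is the allowed exception to disjointness.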
Relational Algebra
[Figure: relational algebra operations: Select, Project, Product, Union, Intersection, Difference, Divide, Join]
Mixed Fragmentation

 Consists of a horizontal fragment that is
vertically fragmented, or a vertical fragment
that is horizontally fragmented.
 Defined using Selection and Projection
operations of relational algebra:

$\sigma_{p}(\Pi_{a_1, \ldots, a_n}(R))$ or
$\Pi_{a_1, \ldots, a_n}(\sigma_{p}(R))$
Example - Mixed Fragmentation

$S_1 = \Pi_{\text{staffNo, position, sex, DOB, salary}}(\text{Staff})$
$S_2 = \Pi_{\text{staffNo, fName, lName, branchNo}}(\text{Staff})$

$S_{21} = \sigma_{\text{branchNo}=\text{'B003'}}(S_2)$
$S_{22} = \sigma_{\text{branchNo}=\text{'B005'}}(S_2)$
$S_{23} = \sigma_{\text{branchNo}=\text{'B007'}}(S_2)$

Explain and illustrate the result of the above example.
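One way to explore the exercise is to run the fragmentation on a tiny data set. The sketch below is only an illustration: the three Staff rows are made up, and the helper functions are hypothetical stand-ins for the Projection and Selection operations.

```python
def project(relation, attributes):
    return [{a: row[a] for a in attributes} for row in relation]


def select(relation, predicate):
    return [row for row in relation if predicate(row)]


# Made-up Staff rows, one per branch used in the example.
staff = [
    {"staffNo": "SG37", "fName": "Ann", "lName": "Beech", "position": "Assistant",
     "sex": "F", "DOB": "1980-11-10", "salary": 12000, "branchNo": "B003"},
    {"staffNo": "SL21", "fName": "John", "lName": "White", "position": "Manager",
     "sex": "M", "DOB": "1965-10-01", "salary": 30000, "branchNo": "B005"},
    {"staffNo": "SA9", "fName": "Mary", "lName": "Howe", "position": "Assistant",
     "sex": "F", "DOB": "1990-02-19", "salary": 9000, "branchNo": "B007"},
]

# Step 1: vertical fragmentation of Staff.
s1 = project(staff, ["staffNo", "position", "sex", "DOB", "salary"])
s2 = project(staff, ["staffNo", "fName", "lName", "branchNo"])

# Step 2: horizontal fragmentation of the vertical fragment S2 by branch.
s21 = select(s2, lambda r: r["branchNo"] == "B003")
s22 = select(s2, lambda r: r["branchNo"] == "B005")
s23 = select(s2, lambda r: r["branchNo"] == "B007")

# Reconstruction: union the horizontal pieces, then join with S1 on staffNo.
s2_rebuilt = {r["staffNo"]: r for r in s21 + s22 + s23}
staff_rebuilt = [{**r, **s2_rebuilt[r["staffNo"]]} for r in s1]

for name, fragment in (("S21", s21), ("S22", s22), ("S23", s23)):
    print(name, fragment)
```

In this run S1 holds the payroll-style attributes for all staff, while S21, S22, and S23 each hold the name and branch attributes for one branch only.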
Transparencies in a DDBMS
 Distribution Transparency

 Fragmentation Transparency
 Location Transparency
 Replication Transparency
 Local Mapping Transparency
 Naming Transparency
[Table: Comparison of Strategies for Data Distribution]
Transparency
 Transparency hides implementation details
from the user.
 The goal is to make the use of the distributed
database equivalent to that of a
centralized database. We can identify four
main types of transparency in a DDBMS:
 distribution transparency;
 transaction transparency;
 performance transparency;
 DBMS transparency.
Distribution Transparency
 Distribution transparency allows the user
to perceive the database as a single, logical
entity. If a DDBMS exhibits distribution
transparency, then the user does not need
to know the data is fragmented
(fragmentation transparency) or the
location of data items (location
transparency).
Fragmentation transparency
 Fragmentation transparency is the highest level of
distribution transparency.
 If fragmentation transparency is provided
by the DDBMS, then the user does not
need to know that the data is fragmented.
As a result, database accesses are based
on the global schema, so the user does not
need to specify fragment names or data
locations.
Location transparency
 Location transparency is the middle level of
distribution transparency. With location
transparency, the user must know how the data
has been fragmented but still does not have to
know the location of the data. The above query
under location transparency now becomes:
Naming transparency
 As a corollary to the above distribution
transparencies, we have naming transparency:
each item in a distributed database must have
a unique name.
 Therefore, the DDBMS must ensure that no two
sites create a database object with the same
name. One solution to this problem is to create a
central name server, which has the responsibility
for ensuring uniqueness of all names in the system.
However, this approach results in:
 loss of some local autonomy;
 performance problems, if the central site becomes a
bottleneck;
 low availability; if the central site fails, the remaining
sites cannot create any new database objects.
Naming transparency
 An alternative solution is to prefix an object with
the identifier of the site that created it.
 For example, the relation Branch created
at site S1 might be named S1.Branch.
Similarly, we need to be able to identify
each fragment and each of its copies. Thus,
copy 2 of fragment 3 of the Branch
relation created at site S1 might be
referred to as S1.Branch.F3.C2.
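The site-prefix naming scheme can be modelled with a small helper. This is a hedged sketch: the `ObjectName` class and `parse_name` function are hypothetical names introduced here, and only the S1.Branch.F3.C2 layout comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectName:
    site: str                       # site that created the object, e.g. "S1"
    relation: str                   # relation name, e.g. "Branch"
    fragment: Optional[int] = None  # fragment number, if any
    copy: Optional[int] = None      # copy number, if any

    def __str__(self):
        parts = [self.site, self.relation]
        if self.fragment is not None:
            parts.append(f"F{self.fragment}")
        if self.copy is not None:
            parts.append(f"C{self.copy}")
        return ".".join(parts)


def parse_name(text):
    """Split a name such as 'S1.Branch.F3.C2' back into its components."""
    site, relation, *rest = text.split(".")
    fragment = copy = None
    for part in rest:
        if part.startswith("F"):
            fragment = int(part[1:])
        elif part.startswith("C"):
            copy = int(part[1:])
    return ObjectName(site, relation, fragment, copy)


name = ObjectName("S1", "Branch", fragment=3, copy=2)
print(name)
print(parse_name("S1.Branch"))
```

Because the creating site is part of the name, two sites can each create a relation called Branch without consulting a central name server.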
Replication transparency
 Closely related to location transparency is
replication transparency, which means that
the user is unaware of the replication of
fragments. Replication transparency is
implied by location transparency. However,
it is possible for a system not to have
location transparency but to have
replication transparency.
Local mapping transparency
 Local mapping transparency is the
lowest level of distribution transparency.
With local mapping transparency, the user
needs to specify both fragment names and
the location of data items, taking into
consideration any replication that may
exist.
Distribution Transparency
 Distribution transparency allows user to
perceive database as single, logical entity.
 If DDBMS exhibits distribution
transparency, user does not need to know:
 that data is fragmented (fragmentation
transparency),
 the location of data items (location transparency).
 Otherwise, we call this local mapping transparency.
 With replication transparency, user is
unaware of replication of fragments.
Naming Transparency
 Each item in a DDB must have a unique
name.
 DDBMS must ensure that no two sites
create a database object with same name.
 One solution is to create central name
server. However, this results in:
 loss of some local autonomy;
 central site may become a bottleneck;
 low availability; if the central site fails,
remaining sites cannot create any new objects.
Transaction Transparency
 Ensures that all distributed transactions
maintain distributed database’s integrity and
consistency.
 Distributed transaction accesses data stored
at more than one location.
 Each transaction is divided into number of
sub-transactions, one for each site that has
to be accessed.
 DDBMS must ensure the indivisibility of both
the global transaction and each of its
sub-transactions.
Concurrency Transparency
 All transactions must execute independently and
be logically consistent with results obtained if
transactions executed one at a time, in some
arbitrary serial order.
 Same fundamental principles as for centralized
DBMS.
 DDBMS must ensure both global and local
transactions do not interfere with each other.
 Similarly, DDBMS must ensure consistency of all
sub-transactions of global transaction.
Concurrency Transparency
 Replication makes concurrency more complex.
 If a copy of a replicated data item is updated,
update must be propagated to all copies.
 Could propagate changes as part of original
transaction, making it an atomic operation.
 However, if one site holding copy is not reachable,
then transaction is delayed until site is reachable.
Concurrency Transparency
 Could limit update propagation to only
those sites currently available. Remaining
sites updated when they become available
again.
 Could allow updates to copies to happen
asynchronously, sometime after the
original update. Delay in regaining
consistency may range from a few seconds
to several hours.
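The propagation options above can be contrasted in a toy model. The sketch below is an illustration only: the `Replica` class and the three functions are hypothetical, a site is just an in-memory object, and no real networking or locking is involved.

```python
class Replica:
    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.value = None
        self.pending = []  # updates queued while the site was unavailable


def eager_update(replicas, value):
    """Propagate inside the original transaction: all copies or none."""
    if not all(r.available for r in replicas):
        return False  # the transaction has to wait for the unreachable site
    for r in replicas:
        r.value = value
    return True


def lazy_update(replicas, value):
    """Update the available copies now and queue the change for the rest."""
    for r in replicas:
        if r.available:
            r.value = value
        else:
            r.pending.append(value)


def recover(replica):
    """Bring a site back and apply the updates it missed, in order."""
    replica.available = True
    for value in replica.pending:
        replica.value = value
    replica.pending.clear()


s1, s2 = Replica("S1"), Replica("S2", available=False)
eager_ok = eager_update([s1, s2], 10)  # S2 is down, so nothing is written
lazy_update([s1, s2], 10)              # S1 is updated, S2 queues the change
stale_value = s2.value                 # S2 is temporarily inconsistent
recover(s2)                            # S2 catches up
print(eager_ok, s1.value, stale_value, s2.value)
```

The window between `lazy_update` and `recover` is the period of inconsistency that the slide says may last from a few seconds to several hours.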
Failure Transparency
 DDBMS must ensure atomicity and durability of
global transaction.
 Means ensuring that sub-transactions of global
transaction either all commit or all abort.
 Thus, DDBMS must synchronize global transaction
to ensure that all sub-transactions have completed
successfully before recording a final COMMIT for
global transaction.
 Must do this in presence of site and network
failures.
Performance Transparency
 DDBMS must perform as if it were a
centralized DBMS.
 DDBMS should not suffer any performance
degradation due to distributed architecture.
 DDBMS should determine most cost-effective
strategy to execute a request.
Performance Transparency
 Distributed Query Processor (DQP) maps
data request into ordered sequence of
operations on local databases.
 Must consider fragmentation, replication,
and allocation schemas.
 DQP has to decide:
 which fragment to access;
 which copy of a fragment to use;
 which location to use.
Performance Transparency
 DQP produces execution strategy
optimized with respect to some cost
function.
 Typically, costs associated with a
distributed request include:

 I/O cost;
 CPU cost;
 communication cost.
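A cost-based choice between copies can be sketched as below. This is a simplified illustration, assuming the cost function is a plain sum of the three cost components; the `Plan` class, the plan names, and the cost figures are all made up.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    io_cost: float
    cpu_cost: float
    comm_cost: float

    @property
    def total(self):
        # Assumed cost function: an unweighted sum of the three components.
        return self.io_cost + self.cpu_cost + self.comm_cost


def cheapest(plans):
    """Pick the execution strategy with the lowest total cost."""
    return min(plans, key=lambda p: p.total)


plans = [
    Plan("read local copy", io_cost=8.0, cpu_cost=1.0, comm_cost=0.0),
    Plan("read remote copy", io_cost=2.0, cpu_cost=1.0, comm_cost=12.0),
]
best = cheapest(plans)
print(best.name, best.total)
```

With these figures the remote copy has cheaper I/O but loses on communication cost, which is often the dominant term across a wide-area network.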
Data Allocation
 Four alternative strategies regarding
placement of data:
 Centralized
 Partitioned(or Fragmented)
 Complete Replication
 Selective Replication
Data Allocation
 Centralized
 Consists of single database and DBMS stored
at one site with users distributed across the
network.

 Partitioned
 Database partitioned into disjoint fragments,
each fragment assigned to one site.
Data Allocation
 Complete Replication
 Consists of maintaining complete copy of
database at each site.

 Selective Replication
 Combination of partitioning, replication, and
centralization.
Data Replication
 Data replication refers to the storage of
data copies at multiple sites served by a
computer network.
 Fragment copies can be stored at several
sites to serve specific information
requirements.
 The existence of fragment copies can
enhance data availability and response time,
reducing communication and total query
costs.

Figure 10.20
Data Replication
 Mutual Consistency Rule
 Replicated data are subject to the mutual
consistency rule, which requires that all
copies of data fragments be identical.
 DDBMS must ensure that a database update
is performed at all sites where replicas
exist.
 Data replication imposes additional DDBMS
processing overhead.
Data Replication
 Replication Conditions
 A fully replicated database stores multiple
copies of all database fragments at multiple
sites.
 A partially replicated database stores
multiple copies of some database fragments
at multiple sites.
 Factors for Data Replication Decision
 Database Size
 Usage Frequency
Data Allocation
 Data allocation describes the process of
deciding where to locate data.
 Data Allocation Strategies
 Centralized
The entire database is stored at one site.
 Partitioned
The database is divided into several disjoint
parts (fragments) and stored at several sites.
 Replicated
Copies of one or more database fragments
are stored at several sites.
Data Allocation
 Data allocation algorithms take into consideration
a variety of factors:
 Performance and data availability goals
 Size, number of rows, the number of relations
that an entity maintains with other entities.
 Types of transactions to be applied to the
database, the attributes accessed by each of
those transactions.
Failure Transparency

 DDBMS must ensure atomicity and durability
of global transaction.
 Means ensuring that sub-transactions of global
transaction either all commit or all abort.
 Thus, DDBMS must synchronize global
transaction to ensure that all sub-transactions
have completed successfully before recording
a final COMMIT for global transaction.
 Must do this in the presence of site and
network failures.
Commit Protocols

 Commit protocols are used to ensure
atomicity across sites.
 Atomicity states that database modifications
must follow an “all or nothing” rule.
 A transaction which executes at multiple sites
must either be committed at all the sites, or
aborted at all the sites.
The Two-Phase Commit (2PC) Protocol
 What is this?
 Two-phase commit is a transaction protocol
designed for the complications that arise with
distributed resource managers.
 Two-phase commit technology is used for hotel
and airline reservations, stock market
transactions, banking applications, and credit card
systems.
 With a two-phase commit protocol, the
distributed transaction manager employs a
coordinator to manage the individual resource
managers. The commit process proceeds as
follows:
Phase 1: Obtaining a Decision

 Site that initiates T is the coordinator.
 When coordinator wants to commit
(complete T), it sends a “prepare T” message
to all participant sites.
 Every other site receiving “prepare T”
sends either “ready T” or “don’t commit T”.
 A site can wait for a while until it reaches a
decision (coordinator will wait a reasonable
time to hear from the others).
Phase 2: Recording the Decision

 IF coordinator received “ready T” from all sites
(remember, no one has committed yet):
 Coordinator sends “commit T” to all participant sites.
 Every site receiving “commit T” commits its
transaction.
 IF coordinator received any “don’t commit T”:
 Coordinator sends “abort T” to all participant sites.
 Every site receiving “abort T” aborts its
transaction.
 These messages are written to local logs.
Phase 1: Obtaining a Decision

 Step 1  Coordinator Ci asks all participants
to prepare to commit transaction T.
 Ci adds the record <prepare T> to the log and
forces log to stable storage (a log is a file
which maintains a record of all changes to the
database)
 Ci sends prepare T messages to all sites where T
executed
Phase 1: Making a Decision

 Step 2  Upon receiving message,
transaction manager at site determines if it
can commit the transaction
 if not:
add a record <no T> to the log and send
abort T message to Ci
 if the transaction can be committed, then:
1). add the record <ready T> to the log
2). force all records for T to stable
storage
3). send ready T message to Ci
Phase 2: Recording the Decision

 Step 1  T can be committed if Ci received a ready
T message from all the participating sites; otherwise
T must be aborted.
 Step 2  Coordinator adds a decision record,
<commit T> or <abort T>, to the log and forces record
onto stable storage. Once the record is in stable
storage, it cannot be revoked (even if failures occur)
 Step 3  Coordinator sends a message to each
participant informing it of the decision (commit or
abort)
 Step 4  Participants take appropriate action locally.
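The two phases can be simulated in memory to see how the log records line up with the steps. This is a minimal sketch, not a production protocol: the `Participant` class and `two_phase_commit` function are hypothetical, Python lists stand in for logs on stable storage, and site failures, timeouts, and lost messages are not modelled.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []          # stands in for the site's log on stable storage
        self.state = "active"

    def prepare(self, t):
        # Phase 1, step 2: log the vote before replying to the coordinator.
        if self.can_commit:
            self.log.append(f"<ready {t}>")
            return "ready"
        self.log.append(f"<no {t}>")
        return "abort"

    def decide(self, t, decision):
        # Phase 2, step 4: act locally on the coordinator's decision.
        self.log.append(f"<{decision} {t}>")
        self.state = "committed" if decision == "commit" else "aborted"


def two_phase_commit(t, participants, coordinator_log):
    # Phase 1, step 1: log <prepare T>, then ask every participant to vote.
    coordinator_log.append(f"<prepare {t}>")
    votes = [p.prepare(t) for p in participants]
    # Phase 2, steps 1-2: commit only if every vote is "ready"; log the decision.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append(f"<{decision} {t}>")
    # Phase 2, step 3: inform every participant of the decision.
    for p in participants:
        p.decide(t, decision)
    return decision


log_ok, log_bad = [], []
sites_ok = [Participant("A"), Participant("B")]
sites_bad = [Participant("A"), Participant("B", can_commit=False)]
outcome_ok = two_phase_commit("T", sites_ok, log_ok)
outcome_bad = two_phase_commit("T", sites_bad, log_bad)
print(outcome_ok, log_ok)
print(outcome_bad, [p.state for p in sites_bad])
```

A single "abort" vote is enough to abort the transaction at every site, which is how the protocol enforces the all-or-nothing rule.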


Two-Phase Commit Diagram
Performance Transparency

 DDBMS must perform as if it were a
centralized DBMS.
 DDBMS should not suffer any performance
degradation due to distributed architecture.
 DDBMS should determine most cost-effective
strategy to execute a request, i.e. query
optimisation (the order of selects and
projects) applied to a distributed database.
Performance Transparency

 Must consider fragmentation, replication,
and allocation schemas.
 DQP has to decide, e.g.:
 which fragment to access;
 which copy of a fragment to use;
 which location to use.
Performance Transparency

 DQP produces execution strategy
optimized with respect to some cost
function.
 Typically, costs associated with a
distributed request include:
 I/O cost;
 Communication cost.
RECOVERY IN DDBS
- More complicated than in a centralized system.
- Failures related to distributed DB
- Log managers

Problems:
Difficult to know
- failure of a site
- failure of link
- loss of messages
If server is down, elect new server. What about network partitioning?
[Figure: original server, server's link, and newly elected server]