1
Distributed Databases
  • Frank Chen
  • Elise Edson
  • Jian Gao
  • Nathan Waldman
  • Phil Wang
  • Hui Zhang

2
Why Distribute Data?
  • Centralized DBMS vs. DDBMS
  • Several factors have led to the development of
    DDBMS
  • Distributed nature of some database applications
  • Increased reliability and availability
  • Allowing data sharing while maintaining some
    measure of local control
  • Improved performance

3
The Voltron Analogy
4
Additional Functionality
  • Distribution leads to increased complexity in the
    system design and implementation
  • A DDBMS must be able to provide additional
    functions beyond those of a centralized DBMS. Some of
    these are:
  • To keep track of the data distribution and
    replication in the DDBMS catalog.
  • To devise execution strategies for queries and
    transactions that access data from more than one
    site.
  • To decide on which copy of a replicated data item
    to access.
  • To maintain the global conceptual schema of the
    distributed database

5
Date's 12 Rules of a DDBMS
  • 1. Local autonomy
  • 2. No reliance on a central site
  • 3. Continuous operation (Failure independence)
  • 4. Location independence
  • 5. Fragmentation independence
  • 6. Replication independence

6
Date's 12 Rules of a DDBMS
  • 7. Distributed query processing
  • 8. Distributed transaction processing
  • 9. Hardware independence
  • 10. Operating system independence
  • 11. Network independence
  • 12. Database independence

7
Today's Topics
  • Architecture Design
  • Transparency
  • Functionality

8
Architecture Design
9
Introduction
  • Distributed database
  • A logically interrelated collection of shared
    data (and a description of this data) physically
    distributed over a computer network.
  • Distributed DBMS
  • The software system that permits the management
    of the distributed database and makes the
    distribution transparent to users.

10
Overview
11
Distributed DBMS vs. Distributed Processing
[Figure: two networks of four sites (Site 1 - Site 4) connected by a computer network. With distributed processing, a database resides at only one of the sites; with a distributed DBMS, databases reside at several of the sites.]
12
Distributed DBMS VS. Parallel DBMS
  • Parallel DBMS
  • A DBMS running across multiple processors and
    disks that is designed to execute operations in
    parallel, whenever possible, in order to improve
    performance.
  • Parallel DB Architectures
  • Shared Memory
  • Shared Disk
  • Shared Nothing

13
Architectural Dimensions
  • Three dimensions of distributed database systems
  • Autonomy of the individual sites comprising the
    system
  • Distribution of data throughout the system
  • Heterogeneity of hardware, operating system, data
    model and/or DBMS across the various sites

14
Autonomy
  • No Autonomy
  • The various sites/machines are tightly
    integrated and cannot function independently
  • Semi-Autonomous
  • The various sites/machines are able to process
    independently, but their cooperation involves
    some modifications to their otherwise independent
    processing
  • Fully Autonomous
  • The various sites/machines function in complete
    isolation, knowing nothing of and being
    unaffected by other sites' processing

15
Distribution
  • No Distribution
  • All of the data is located on one site/machine
  • Semi-Distributed
  • The data is spread across some sites but not all
    there is a differentiation between sites, with
    some being servers and others clients
  • Fully Distributed
  • There is no distinction between sites/machines,
    at least in the sense that data is spread across
    all of them

16
Heterogeneity
  • Homogeneous
  • All sites use the same DBMS product
  • Are aware of each other and agree to cooperate in
    processing user requests.
  • Each site surrenders part of its autonomy in
    terms of right to change schemas or software
  • Appears to user as a single system

17
Heterogeneity
  • Heterogeneous
  • Different sites may use different DBMS products
  • Need not be based on the same underlying data
    model
  • May be composed of relational, network,
    hierarchical, and object-oriented DBMSs.
  • Sites may not be aware of each other and may
    provide only limited facilities for cooperation
    in transaction processing

18
Interesting Architectures
  • Client/server, with multiple servers
  • No autonomy; semi-distributed; homogeneous
  • Peer-to-peer distributed database
  • Semi-autonomous; fully distributed; homogeneous
  • Federated (multi-) database
  • Full autonomy; fully distributed; heterogeneous

19
Architectural Dimensions
20
Reference Architecture for a DDBMS
  • DDBMS can be viewed as an extension of the
    ANSI/SPARC 3-schema architecture
  • The schema consists of
  • A set of global external schemas
  • A global conceptual schema
  • A fragmentation schema and allocation schema
  • A set of schemas for each local DBMS

21
Reference Architecture for a DDBMS
[Figure: each global external schema (GES1-GES3) maps onto the global conceptual schema (GCS), which maps through the fragmentation schema (FS) and allocation schema (AS) onto a local mapping schema (LMS), local conceptual schema (LCS), and local internal schema (LIS) over the database at each site.
GES = Global External Schema, GCS = Global Conceptual Schema, FS = Fragmentation Schema, AS = Allocation Schema, LMS = Local Mapping Schema, LCS = Local Conceptual Schema, LIS = Local Internal Schema]
22
Reference Architecture for a Federated MDBS
  • From an architectural viewpoint, multi-database
    distributed systems come in two basic varieties
  • Those that use a global conceptual schema (tightly
    coupled)
  • Those that don't use a global conceptual schema
    (loosely coupled)
  • The former case is a generalization, in a sense,
    of the peer-to-peer system

23
Reference Architecture for a Federated MDBS
[Figure (tightly coupled): global external schemas (GES1-GES3) map onto a global conceptual schema (GCS), which maps onto local external schemas (LES) defined over the local conceptual schema (LCS) and local internal schema (LIS) of each site's own ANSI/SPARC architecture.]
24
Component Architecture for a DDBMS
  • Major Components
  • Local DBMS (LDBMS) component
  • Data communications (DC) component
  • Global system catalog (GSC)
  • Distributed DBMS (DDBMS) component
  • Comparison of strategies for data allocation

25
Distributed Relational Database Design
  • Fragmentation
  • A relation may be divided into a number of
    sub-relations, called fragments, which are then
    distributed.
  • Allocation
  • Each fragment is stored at the site that gives an
    optimal distribution.
  • Replication
  • The DDBMS may maintain a copy of a fragment at
    several different sites.

26
Distributed Relational DB Design (Data Allocation)
  • Strategic Objectives
  • Centralized
  • Fragmented (or Partitioned)
  • Complete Replication
  • Selective Replication

27
Distributed Relational DB Design (Fragmentation)
  • Why Fragment?
  • Usage
  • Efficiency
  • Parallelism
  • Security
  • Primary disadvantages
  • Performance
  • Integrity
  • Rules to follow during fragmentation
  • Completeness
  • Reconstruction
  • Disjointness

28
Distributed Relational DB Design (Fragmentation)
  • Types of fragmentation
  • Horizontal fragmentation
  • -- Consists of a subset of the tuples of a
    relation
  • Vertical fragmentation
  • -- Consists of a subset of the attributes of a
    relation
  • Mixed fragmentation
  • -- Consists of a horizontal fragment that is
    subsequently vertically fragmented, or a vertical
    fragment that is then horizontally fragmented.
  • Derived horizontal fragmentation
  • -- A horizontal fragment that is based on the
    horizontal fragmentation of a parent relation.
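
A minimal sketch of these fragmentation types on a toy Staff relation, using plain Python lists of dicts (the relation, attribute, and predicate names are illustrative, not taken from the slides); the asserts check the reconstruction and disjointness rules from the previous slide.

    # Toy relation: Staff(staff_no, name, branch, salary)
    staff = [
        {"staff_no": 1, "name": "Ann", "branch": "NY",     "salary": 45000},
        {"staff_no": 2, "name": "Bob", "branch": "London", "salary": 40000},
        {"staff_no": 3, "name": "Cho", "branch": "NY",     "salary": 50000},
    ]

    # Horizontal fragmentation: a subset of the tuples (a selection on branch).
    ny_frag     = [t for t in staff if t["branch"] == "NY"]
    london_frag = [t for t in staff if t["branch"] != "NY"]

    # Vertical fragmentation: a subset of the attributes (a projection); each
    # fragment keeps the primary key so the relation can be rebuilt.
    frag_a = [{"staff_no": t["staff_no"], "name": t["name"], "branch": t["branch"]}
              for t in staff]
    frag_b = [{"staff_no": t["staff_no"], "salary": t["salary"]} for t in staff]

    # Reconstruction rule: horizontal fragments recombine by UNION ...
    assert sorted(ny_frag + london_frag, key=lambda t: t["staff_no"]) == staff
    # ... and vertical fragments recombine by JOIN on the key.
    rebuilt = [{**a, **b} for a in frag_a for b in frag_b
               if a["staff_no"] == b["staff_no"]]
    assert sorted(rebuilt, key=lambda t: t["staff_no"]) == staff
    # Disjointness rule: no tuple appears in more than one horizontal fragment.
    assert not {t["staff_no"] for t in ny_frag} & {t["staff_no"] for t in london_frag}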

29
Distributed Relational DB Design (Methodology
Summary)
  • Design global relations.
  • Examine the topology of the system.
  • Analyse the most important transactions and
    identify where horizontal or vertical
    fragmentation may be appropriate.
  • Decide the relations not to be fragmented (but to
    be replicated).
  • Examine the relations that are on the one side of
    one-to-many relationships and decide a suitable
    fragmentation schema.
  • Check for situations where either vertical or
    mixed fragmentation would be appropriate during
    the previous step.

30
Transparency
31
Transparency
  • Q: Why do we need transparency?

32
Transparency
  • Q: Why do we need transparency?
  • A: Without DDBMS transparency, applications must
    be written using special knowledge about the
    stored data

33
Transparency
  • Distribution Transparency
  • Transaction Transparency
  • Performance Transparency
  • Heterogeneity Transparency

34
Distribution Transparency
  • Idea: Users perceive the DB as a single, unified
    entity
  • Types of Distribution Transparency
  • Fragmentation
  • Location
  • Replication
  • Local-mapping
  • Naming

35
Distribution Transparency: Fragmentation
  • What is a fragment?
  • Horizontal fragments
  • Fragmented by SELECTION
  • Composed by UNION
  • Vertical fragments
  • Fragmented by PROJECTION
  • Composed by JOIN

36
Distribution Transparency: Fragmentation
  • Mixed/Hybrid fragments
  • Horizontal → Vertical fragmentation
  • Vertical → Horizontal fragmentation
  • Derived (Horizontal) fragments
  • Fragmenti ⋉ Fragmentj
  • Fragmented by SEMIJOIN
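
A small sketch of a derived horizontal fragment, assuming a parent Department relation that is already fragmented by site and a child Employee relation that should follow it; the semijoin keeps only the Employee tuples whose department appears in the given Department fragment (all data is illustrative).

    # Parent relation Department, already horizontally fragmented by site.
    dept_new_york = [{"dept": "DX"}, {"dept": "DZ"}]
    dept_london   = [{"dept": "DY"}]

    employee = [
        {"emp": "E1", "dept": "DX"},
        {"emp": "E2", "dept": "DY"},
        {"emp": "E3", "dept": "DZ"},
    ]

    def semijoin(child, parent_frag, key):
        """Derived fragment: child tuples whose key value matches some parent tuple."""
        parent_keys = {p[key] for p in parent_frag}
        return [c for c in child if c[key] in parent_keys]

    # Employee is fragmented to follow Department's fragmentation
    # (Employee SEMIJOIN Department_fragment).
    emp_new_york = semijoin(employee, dept_new_york, "dept")   # E1, E3
    emp_london   = semijoin(employee, dept_london, "dept")     # E2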

37
Fragment Types
38
Distribution Transparency: Fragmentation
  • Idea: Tables can be fragmented to different sites
    without user awareness
  • Example
  • SELECT name
  • FROM tblStaff
  • WHERE salary > 30000

39
Distribution Transparency: Fragmentation Example

User perception (Employee relation):
  EMP  DEPT  SALARY
  E1   DX    45K
  E2   DY    40K
  E3   DZ    50K
  E4   DY    63K
  E5   DZ    40K

New York fragment:
  EMP  DEPT  SALARY
  E1   DX    45K
  E3   DZ    50K
  E5   DZ    40K

London fragment:
  EMP  DEPT  SALARY
  E2   DY    40K
  E4   DY    63K
40
Distribution Transparency: Location
  • Idea: Users don't know where the data is but must
    specify fragment names
  • Example
  • SELECT name
  • FROM tblStaff21
  • WHERE staffNo IN
  • (SELECT staffNo FROM tblStaff1
  • WHERE salary > 30000)

41
Distribution Transparency: Replication
  • Idea: Objects can be copied, and copies are
    maintained automatically and without user
    knowledge
  • Location Transparency → Replication Transparency

42
Distribution Transparency: Replication Example

User perception (Employee relation):
  EMP  DEPT  SALARY
  E1   DX    45K
  E2   DY    40K
  E3   DZ    50K
  E4   DY    63K
  E5   DZ    40K

New York fragment (with a replica of the London fragment stored alongside it):
  EMP  DEPT  SALARY
  E1   DX    45K
  E3   DZ    50K
  E5   DZ    40K

London fragment (with a replica of the New York fragment stored alongside it):
  EMP  DEPT  SALARY
  E2   DY    40K
  E4   DY    63K
43
Distribution Transparency: Local-mapping
  • Idea: User needs to specify fragment names and
    location of data items
  • Example
  • SELECT name
  • FROM tblStaff21 @ SITE 3
  • WHERE staffNo IN
  • (SELECT staffNo FROM tblStaff1 @ SITE 9
  • WHERE salary > 30000)

44
Distribution Transparency: Naming
  • Idea: Each attribute in the DDB must have a
    unique name; the DDBMS ensures this by either
  • Creating a central name server
  • Prefixing attributes with a site identifier
  • Using aliases (best solution)

45
Transaction Transparency
  • Idea: Distributed transactions look like
    single-site transactions
  • Example

46
Transaction Transparency
  • Concurrency Transparency
  • Failure Transparency

47
Transaction Transparency: Concurrency
  • Idea: DDBMS ensures non-interference among local
    and global transactions
  • Replication complicates concurrency
  • Solution 1: Establishing atomicity
  • Solution 2: Limit update
  • Solution 3: Asynchronous update

48
Transaction Transparency: What if we goof up?
  • Several problems can occur when concurrent
    transactions execute in an uncontrolled manner
  • The lost update problem
  • The temporary update problem
  • The incorrect summary problem
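
A minimal sketch of the first of these, the lost update problem, assuming two transactions that both read a shared balance before either one writes back (the numbers are illustrative).

    balance = 100              # shared data item

    # T1 and T2 both read the same initial value ...
    t1_read = balance
    t2_read = balance

    # ... each applies its own change to a private copy ...
    t1_result = t1_read + 50   # T1 deposits 50
    t2_result = t2_read - 30   # T2 withdraws 30

    # ... and both write back: T2's write silently overwrites (loses) T1's update.
    balance = t1_result
    balance = t2_result

    print(balance)             # 70, although the correct serial result is 120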

49
Transaction Transparency: Failure
  • Idea: Recovery procedure must handle
  • Loss of a message
  • Failure of a communication link
  • Failure of a site
  • Network partitioning
  • Must ensure all sub-transactions have completed
    successfully before committing the transaction.

50
Transaction Transparency: Commit Protocol
  • Idea: Ensure atomicity of both local and global
    transactions by employing
  • Transaction Manager (Coordinator)
  • Resource Managers (Participants)
  • Two-Phase Commit (2PC)
  • Three-Phase Commit (3PC)

51
Transaction Transparency: Two-Phase Commit (2PC)
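
The original slide shows 2PC as a diagram; below is a minimal sketch of the same decision logic, assuming each participant exposes prepare/commit/abort calls and ignoring messaging, logging, and failure handling (all class and function names are illustrative).

    class Participant:
        """A resource manager at one site (timeouts and recovery omitted)."""
        def __init__(self, name, can_commit=True):
            self.name, self.can_commit, self.state = name, can_commit, "ACTIVE"

        def prepare(self):
            # Phase 1 (voting): force local log records, then vote.
            self.state = "READY" if self.can_commit else "ABORTED"
            return self.can_commit                 # vote READY_COMMIT or abort

        def commit(self):
            self.state = "COMMITTED"               # on GLOBAL_COMMIT

        def abort(self):
            self.state = "ABORTED"                 # on GLOBAL_ABORT

    def two_phase_commit(participants):
        """Coordinator: commit only if every participant votes READY_COMMIT."""
        votes = [p.prepare() for p in participants]        # Phase 1
        decision = all(votes)
        for p in participants:                             # Phase 2
            p.commit() if decision else p.abort()
        return "GLOBAL_COMMIT" if decision else "GLOBAL_ABORT"

    sites = [Participant("P1"), Participant("P2"), Participant("P3", can_commit=False)]
    print(two_phase_commit(sites))                 # GLOBAL_ABORT, because P3 voted no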
52
Transaction Transparency: 2PC Topologies
  • Centralized 2PC
  • Linear 2PC
  • Distributed 2PC

53
Transaction Transparency: 2PC Topology - Centralized
[Figure: centralized 2PC message flow. C = coordinator, Pi = participant, RC = READY_COMMIT, GC = GLOBAL_COMMIT, GA = GLOBAL_ABORT]
54
Transaction Transparency: 2PC Topology - Linear
[Figure: linear 2PC message flow. C = coordinator, Pi = participant, RC = READY_COMMIT, GC = GLOBAL_COMMIT, GA = GLOBAL_ABORT]
55
Transaction Transparency: 2PC Topology - Distributed
[Figure: distributed 2PC message flow. C = coordinator, Pi = participant]

56
Transaction Transparency: Three-Phase Commit (3PC)
57
Performance Transparency: Query Optimization
  • Idea: A DDBMS performs as if it were a
    centralized DBMS
  • Decisions
  • Which fragment to access
  • Which copy of a fragment to use, if the fragment
    is replicated
  • Which location to use
  • Costs
  • Access time (I/O), Communication time, CPU time
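
A small sketch of the kind of cost function these decisions weigh, combining access, CPU, and communication time; the per-unit costs and the two candidate strategies are purely illustrative assumptions.

    def query_cost(disk_ios, cpu_ops, messages, bytes_shipped,
                   t_io=0.01, t_cpu=1e-6, t_msg=0.1, t_byte=1e-6):
        """Estimated time = access (I/O) + CPU + communication components."""
        return (disk_ios * t_io + cpu_ops * t_cpu
                + messages * t_msg + bytes_shipped * t_byte)

    # Same query, two strategies: ship a whole remote fragment to the query site,
    # or ask the remote site to filter first and ship only the matching tuples.
    ship_whole_fragment = query_cost(disk_ios=100, cpu_ops=10_000,
                                     messages=1, bytes_shipped=2_000_000)
    ship_matches_only   = query_cost(disk_ios=100, cpu_ops=50_000,
                                     messages=2, bytes_shipped=50_000)
    print(ship_whole_fragment, ship_matches_only)   # pick the cheaper plan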

58
Heterogeneity Transparency
  • Idea: Shouldn't matter what DBMS is running at
    each site

[Figure: N sites, one running Oracle and another running Sybase, together forming a single distributed database]
59
But wait, there's a caveat
  • It's generally not a good idea to implement full
    transparency!

60
Functionality
61
Overview
  • Transaction Management
  • Concurrency Control
  • Deadlock Management
  • Database Failure Recovery
  • The X/Open Distributed Transaction Processing
    Model and Replication
  • Query Optimization
  • Mobile Databases

62
Transaction Management
  • Local Database
  • Transaction Manager
  • Scheduler
  • Recovery Manager
  • Buffer Manager
  • Database Site
  • Transaction Coordinator
  • Data Communication Component

63
Concurrency Control
  • Objectives
  • Serializability
  • Locking Protocols
  • Centralized 2PL
  • Primary Copy 2PL
  • Distributed 2PL
  • Timestamp Protocols

64
Concurrency Control
  • Objectives
  • Preserve consistency of data items
  • Enable parallel processing for efficiency
  • Minimize overhead
  • Work efficiently over networks subject to
    communication delays
  • Reduce constraints on structure of atomic actions

65
Concurrency Control
  • Serializability
  • Simple answer
  • one transaction has full, locked control
  • Better method?
  • Want
  • Parallel processing
  • Maximization of concurrent operations
  • How to implement in Distributed DBMS?

66
Concurrency Control
  • Methodology
  • Locking
  • Timestamping
  • How do we implement locking mechanism?

67
Concurrency Control
  • Centralized 2PL (two phase locking)
  • Single site that maintains all locking
    information
  • One lock manager
  • Advantages
  • Straightforward implementation
  • Disadvantages
  • Bottleneck
  • Lower reliability

68
Concurrency Control
  • Primary Copy 2PL
  • Many sites have lock managers
  • For replicated data, one copy is chosen as the
    primary copy, and the rest are slave copies
  • On update, transaction coordinator determines
    location of primary copy
  • Update is propagated to slave copies
  • Advantages
  • Fixes Centralized 2PL disadvantages
  • Disadvantages
  • Deadlock managing more complex
  • The primary copy's site can become a bottleneck

69
Concurrency Control
  • Distributed 2PL
  • Every site has a lock manager
  • If data is replicated, ROWA (read-one-write-all)
    control is used: a read may use any one copy, but
    an update must lock all copies.
  • Advantages
  • Decentralized lock handling
  • Disadvantages
  • Again, deadlock managing more complex
  • High communication costs (all items locked before
    update)
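
A minimal sketch of read-one-write-all under distributed 2PL, assuming one lock-manager object per site and ignoring lock waits, deadlock, and message passing (all names are illustrative).

    class SiteLockManager:
        """One lock manager per site, holding locks for that site's replicas."""
        def __init__(self, name):
            self.name, self.locks = name, {}       # item -> "S" or "X"

        def lock(self, item, mode):
            # Simplified: assume the lock is granted immediately (no waiting).
            self.locks[item] = mode

    def read(item, replica_sites):
        """ROWA read: a shared lock on any single replica is enough."""
        replica_sites[0].lock(item, "S")
        return f"read {item} at {replica_sites[0].name}"

    def write(item, replica_sites):
        """ROWA write: exclusive locks on all replicas before updating any of them."""
        for site in replica_sites:
            site.lock(item, "X")                   # the source of the extra messages
        return f"wrote {item} at {len(replica_sites)} sites"

    sites = [SiteLockManager("NY"), SiteLockManager("London"), SiteLockManager("Tokyo")]
    print(read("x", sites))    # one copy locked
    print(write("x", sites))   # every copy locked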

70
Timestamp Protocols
  • Objective
  • Order transactions globally
  • Give older transactions priority in conflict
  • i.e. breaks ties

71
Deadlock Management
  • Centralized Deadlock Detection
  • Hierarchical Deadlock Detection
  • Distributed Deadlock Detection

72
Deadlock Management
  • Centralized Deadlock Detection
  • Single site appointed Deadlock Detection
    Coordinator (DDC)
  • Constructs aggregate wait-for-graphs (WFGs), and
    rolls back deadlocked sites
  • Advantages
  • Need only send delta of WFGs
  • Disadvantages
  • Single point of failure

73
Wait for Graph
74
Deadlock Management
  • Hierarchical Deadlock Detection
  • Sites organized into hierarchy
  • Each node is responsible for checking for
    deadlock in child nodes.
  • Advantages
  • Less network communication
  • Disadvantages
  • Much more complex to implement

75
Deadlock Management
  • Distributed Deadlock Detection
  • Each site contains partial deadlock detection
    information
  • Advantages
  • Extremely robust
  • Disadvantages
  • High network traffic

76
Obermarck Detail
  • Transactions are ordered
  • An external (Ex) node is used to represent remote
    sites in the local WFG
  • Upon receiving deadlock detection
  • Create local WFG
  • Resolve local deadlocks
  • For cycles that go from an Ex node through
    local nodes then back out to an Ex node, send
    local node graph out to other sites to fix
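
A rough sketch of the local step described above, assuming the wait-for graph is stored as a dict of edges and the string "EX" stands for the external node; it finds cycles in the local WFG and collects EX -> local -> EX paths that would be forwarded to other sites (a simplification of Obermarck's algorithm, not its full form).

    def find_cycle(wfg, start):
        """Depth-first search for a cycle through `start` in the local WFG."""
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in wfg.get(node, []):
                if nxt == start:
                    return path + [start]          # cycle found in the local WFG
                if nxt not in path:
                    stack.append((nxt, path + [nxt]))
        return None

    def paths_through_ex(wfg):
        """Paths EX -> local transactions -> EX: potential global deadlocks to forward."""
        found, stack = [], [("EX", ["EX"])]
        while stack:
            node, path = stack.pop()
            for nxt in wfg.get(node, []):
                if nxt == "EX" and len(path) > 1:
                    found.append(path + ["EX"])
                elif nxt != "EX" and nxt not in path:
                    stack.append((nxt, path + [nxt]))
        return found

    # Local WFG at this site: T1 waits for T2, T2 waits for a remote transaction
    # (represented by EX), and some remote transaction waits for T1.
    local_wfg = {"T1": ["T2"], "T2": ["EX"], "EX": ["T1"]}
    print(find_cycle(local_wfg, "T1"))   # ['T1', 'T2', 'EX', 'T1'] -- goes through EX
    print(paths_through_ex(local_wfg))   # [['EX', 'T1', 'T2', 'EX']] -- send to the remote site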

77
Database Failure Recovery
  • Failures in a Distributed Environment
  • How Failures Affect Recovery
  • Distributed Recovery Protocols
  • Network Partitioning
  • Identifying Updates
  • Maintaining Integrity
  • Pessimistic Protocols
  • Optimistic Protocols

78
Failures in a Distributed Environment
  • Distributed DBMS-specific failures
  • Packet loss
  • Communications link failure
  • Site failure
  • Network partitioning
  • Sites split into multiple groups that can
    communicate with each other, but not outside the
    group
  • Hard to tell if network failure or busy site

79
Failure and Recovery
  • Goal
  • Maintain globally atomic transactions
  • Leave operational sites unblocked
  • Upon failure
  • Abort affected transactions
  • Flag site as failed
  • Wait for site to recover
  • Initiate recovery procedure to remove partial
    transactions
  • Update its copy of the database to maintain
    consistency with the overall database.

80
Distributed Recovery Protocols
  • Assumptions
  • Every global transaction has a transaction manager
    (TM) site.
  • Two common protocols
  • Two-phase commit (2PC)
  • Three-phase commit (3PC)

81
Network Partitioning
  • Problem with replicated data
  • Classic unsynchronized bank account
  • Constraints on the account (e.g. balance can't go
    below 0) may be violated
  • Generally not possible to design a non-blocking,
    atomic commit protocol for arbitrary partitioned
    networks

82
Network Partitioning
  • Pessimistic protocols
  • Choose consistency of database over availability.
    Block transactions in a partitioned state if
    consistency cannot be maintained.
  • Optimistic protocols
  • Availability at the expense of consistency.
    Updates are allowed to proceed independently in
    the various partitions.

83
The X/Open Distributed Transaction Processing
Model
  • X/Open is a vendor-neutral, international consortium
    of users, software vendors, and hardware vendors whose
    mission is to cause the creation of a viable, global
    information infrastructure.
  • Goals and Achievements (DTP Standard)
  • Develop APIs for atomic, consistent, isolated,
    and durable transaction processing
  • Created idea of Transaction Manager (TM) and
    Resource Manager (RM) discussed previously

84
The X/Open Distributed Transaction Processing
Model
  • DDBMS Tasks
  • Remote invocation mechanism
  • Provided by Remote Operations Service (ROSE) and
    Remote Procedure Call (RPC)
  • Distributed transaction mechanism
  • Uses Open System Interconnection Transaction
    Processing (OSI-TP) protocol
  • Supports flat, chained, and nested transactions

85
Replication
  • Replication is the process of generating and
    reproducing multiple copies of data at one or
    more sites.
  • Data Ownership
  • Master/Slave Ownership
  • Workflow Ownership
  • Update Anywhere Ownership

86
Master-Slave Ownership
  • Asynchronously replicated data is owned by one
    site and can only be updated by that site.
  • Other sites subscribe to the data owned by the
    publisher site. (publish and subscribe metaphor)

87
Workflow Ownership
  • Similar to Master-Slave Ownership, but allows the
    right to update to move from site to site.
  • Only one site can have the right to update at any
    given moment.

88
Update Anywhere Ownership
  • P2P style functionality
  • Everyone can update
  • Provides local autonomy even if the rest of the
    network is dead.

89
Replication Mechanisms
  • Snapshots
  • Database Triggers
  • Conflict Detection and Resolution

90
Snapshot
  • Allow for asynchronous distribution of changes to
    individual relations, collections of relations,
    views, or fragments of relations according to a
    predefined schedule.

91
Database Triggers
  • Shifts the responsibility to the user's code
  • Executes whenever an appropriate event occurs
  • Disadvantages
  • Increases overhead
  • Have no knowledge of transactional nature of the
    modification
  • Frequent updates stress the network and
    application
  • Cannot be scheduled
  • Refreshing related relations is complex
  • Difficult to rollback

92
Conflict Detection and Resolution
  • Detection
  • Before and after images
  • Resolution
  • Earliest and latest timestamps
  • Site priority
  • Additive and average updates
  • Minimum and maximum values
  • User-defined
  • Hold for manual resolution
  • Resolve Duplicate Primary Key or Unique Keys
  • Append site name to duplicate value
  • Append sequence to duplicate value
  • Discard duplicate value
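
A minimal sketch of before/after-image conflict detection with a latest-timestamp rule and a site-priority tiebreak, two of the resolution options listed above (field names, sites, and priorities are illustrative).

    def detect_conflict(before_image, current_row):
        """Conflict: the target row no longer matches the before-image the update saw."""
        return current_row != before_image

    def resolve(update_a, update_b, site_priority=("HQ", "NY", "London")):
        """Latest timestamp wins; equal timestamps fall back to site priority."""
        if update_a["ts"] != update_b["ts"]:
            return max(update_a, update_b, key=lambda u: u["ts"])
        return min(update_a, update_b, key=lambda u: site_priority.index(u["site"]))

    before  = {"emp": "E1", "salary": 45000}
    current = {"emp": "E1", "salary": 47000}        # another site already changed it
    print(detect_conflict(before, current))          # True: resolution is needed

    a = {"site": "NY",     "ts": 10, "after": {"emp": "E1", "salary": 48000}}
    b = {"site": "London", "ts": 12, "after": {"emp": "E1", "salary": 46000}}
    print(resolve(a, b)["site"])                     # London: the later update wins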

93
Current X/Open DTP and Replication Support
94
Query Optimization
  • Need for speed
  • Distributed Query Transformation
  • Technique Primary Horizontal Fragmentation
  • Technique Distributed Joins

95
Need for speed
  • Why worry?
  • For query optimization, you want to avoid disk
    reads, because they are slow
  • WAN or even LAN speeds are much slower than disks

96
Distributed Query Transformation
  • Same idea as was presented by query optimization
    group, but extended in the following ways
  • Once you apply the heuristic processing, take
    data distribution into account
  • Replace global relations at the leaves of the
    tree with their reconstruction algorithms

97
Technique Primary Horizontal Fragmentation
  • (R1 UNION R2) JOIN R3 =
  • (R1 JOIN R3) UNION (R2 JOIN R3)
  • Allows for join of two relations to be
    implemented as a union of partial joins (that can
    be performed in parallel)
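
A small sketch checking the rewrite above on toy data: joining the union of two horizontal fragments with R3 produces the same tuples as the union of the two partial joins, which is what lets the partial joins run in parallel at different sites (relation contents are illustrative).

    def join(r, s, key):
        return [{**a, **b} for a in r for b in s if a[key] == b[key]]

    r1 = [{"dept": "DX", "emp": "E1"}]                            # fragment at site 1
    r2 = [{"dept": "DY", "emp": "E2"}]                            # fragment at site 2
    r3 = [{"dept": "DX", "mgr": "M1"}, {"dept": "DY", "mgr": "M2"}]

    whole_join     = join(r1 + r2, r3, "dept")                    # (R1 UNION R2) JOIN R3
    union_of_joins = join(r1, r3, "dept") + join(r2, r3, "dept")  # (R1 JOIN R3) UNION (R2 JOIN R3)

    assert sorted(whole_join, key=lambda t: t["emp"]) == \
           sorted(union_of_joins, key=lambda t: t["emp"])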

98
Technique Distributed Joins
  • Join is very expensive
  • To optimize, replace the join by a combination of
    semi-joins. This reduces the size of the operand
    relation.
  • R1 JOIN(x) R2 =
  • (R1 SEMIJOIN(x) R2) JOIN(x) R2

99
Technique Distributed Joins
  • Evaluation
  • R' = PROJ(x)(R2) at Site 2
  • Transfer R' to Site 1
  • R'' = R1 SEMIJOIN(x) R' at Site 1
  • Transfer R'' to Site 2
  • Result = R'' JOIN(x) R2 at Site 2
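
A minimal sketch of the evaluation steps above, counting how many tuples each step would ship between the two sites; R1, R2, and the join attribute x are toy data, not taken from the slides.

    def project(rel, attr):
        """R' = PROJ(x)(R2): only the join-attribute values, duplicates removed."""
        return [{attr: v} for v in {t[attr] for t in rel}]

    def semijoin(r, s, attr):
        """R'' = R1 SEMIJOIN(x) R': the R1 tuples whose x appears in R'."""
        keys = {t[attr] for t in s}
        return [t for t in r if t[attr] in keys]

    def join(r, s, attr):
        return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

    r1 = [{"x": i, "a": f"a{i}"} for i in range(1, 7)]   # 6 tuples at Site 1
    r2 = [{"x": 1, "b": "u"}, {"x": 2, "b": "v"}]        # 2 tuples at Site 2

    r_prime  = project(r2, "x")              # step 1, at Site 2
    shipped1 = len(r_prime)                  # step 2: transfer R' to Site 1
    r_dprime = semijoin(r1, r_prime, "x")    # step 3, at Site 1
    shipped2 = len(r_dprime)                 # step 4: transfer R'' to Site 2
    result   = join(r_dprime, r2, "x")       # step 5: final join at Site 2

    print(result)                                    # same tuples as R1 JOIN(x) R2
    print(shipped1 + shipped2, "tuples shipped vs", len(r1), "for shipping all of R1")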

100
Mobile Databases
  • Problem
  • Laptops and PDAs
  • Want database functionality, but have limited
    bandwidth and storage
  • Solution
  • Mobile Databases
  • Portable and physically separate from a
    centralized database server, but capable of
    communicating with that server from remote sites

101
Mobile Databases
  • Advantages
  • Mobile!
  • Poll centralized database for up-to-date data and
    statistics
  • Can do a one-shot update of centralized
    database
  • Disadvantages
  • Typically have pre-defined SQL queries
  • May change in future devices

102
Questions?
103
References Web
  • http://www.csee.umbc.edu/kalpakis/Courses/621-fa03/Lectures/DistributedDeadlocks.pdf
  • http://www.itee.adfa.edu.au/courses/ACSC7360/
  • http://www.postgresql.org/
  • http://www.oracle.com/
  • http://www.microsoft.com/
  • http://www.mysql.com/
  • http://www.cis.usouthal.edu/share/allen_324/Presentr/CH10.ppt
  • http://www.cs.rpi.edu/noel/distr_scaleup/distributed.html
  • http://db.cs.berkeley.edu/jmh/cs262b/ddbms.html
  • http://www.doc.mmu.ac.uk/STAFF/P.Quick/distribu.ppt

104
References Text
  • Database Systems: Design, Implementation, and
    Management, 5th Edition, Rob and Coronel
  • Distributed and Parallel Database Systems, M.
    Tamer Ozsu and Patrick Valduriez
  • Database System Concepts, Silberschatz, Korth, and
    Sudarshan, McGraw-Hill

105
References Review Article
  • Mariposa: A Wide-Area Distributed Database
    System, Michael Stonebraker, Paul M. Aoki, Avi
    Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, and
    Andrew Yu