Distributed Databases - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Distributed Databases

Description:

... network operates in an autonomous fashion, with selected hardware and operating system setup. ... Conceptual Design the Global Conceptual Schema ... – PowerPoint PPT presentation

Number of Views:88
Avg rating:3.0/5.0
Slides: 42
Provided by: MarkRo75
Category:

less

Transcript and Presenter's Notes

Title: Distributed Databases


1
Section 2
  • Distributed Databases

2
Section Content
  • 2.1 Concepts
  • 2.2 Advantages
  • 2.3 Classification of Distributed Systems
  • 2.4 Database Design
  • 2.5 Distributed Query Processing

3
2.1 Concepts
  • A Distributed Database (DDB) is a collection of
    nodes, connected via a communication network.
  • Each site is autonomous, but a partnership exists
    among a set of independent but co-operating
    centralised systems.
  • A Distributed Database Management System (DDBMS)
    is the software that permits the management of
    the DDBs and makes distribution transparent to
    users.
  • There are three basic architectures networked
    with a single centralised database shared
    memory, and shared nothing.

4
Centralised in a Networked Architecture
Client Interface
DBMS Interface
Network
Client Interface
Client Interface
Client Interface
5
Centralised in a Networked Architecture
  • Storage exists at a single site (with a shared
    disk architecture).
  • Architecture resembles a typical client server
    architecture although DDB transparencies exist.
  • This architecture is suited to a (conceptually)
    fully replicated environment. Each client site
    sees the same data as all other sites.
  • This architecture also suits a (conceptually)
    fully fragmented site where each client sees a
    different view of the overall schema.

6
Shared Memory Architecture
NT2000 O/S Workstation
SQL Server
NT2000 O/S Workstation
DBMS Interface
Network
SQL Server
DBMS Interface
NT2000 O/S Workstation
SQL Server
DBMS Interface
7
Shared Memory Architecture
  • Each node on the network operates in an
    autonomous fashion, with selected hardware and
    operating system setup.
  • However, each system runs (for example)
    distributed Oracle where each system shares a
    common memory space in which transactions are
    processed.
  • Each site may have copies of data which belong
    to other sites will require synchronisation of
    updates.

8
Shared Nothing
UNIX cluster
Oracle
NT2000 O/S Node
DBMS Interface
Network
Oracle
VMS Mainframe
DBMS Interface
Oracle
DBMS Interface
9
Shared Nothing Architecture
  • Each processor has its own autonomous processing
    and storage capabilities.
  • Each node is homogenous with respect to operating
    system, database management system protocols and
    storage.
  • Communication is (typically) through a high-speed
    interconnection network.

10
Sections Covered
  • 2.1 Concepts
  • 2.2 Advantages
  • 2.3 Classification of Distributed Systems
  • 2.4 Database Design
  • 2.5 Distributed Query Processing

11
2.2 Advantages
  • Management of distributed data with different
    levels of transparency. Transparencies
  • Distribution location transparency ensures that
    the user need not worry about the location or
    local name of data objects.
  • Replication The user is unaware of data copies.
    These copies provide better availability,
    performance and reliability.
  • Fragmentation horizontal and vertical
    fragmentation details are hidden from the user.
  • Increased reliability and availability.
  • Reliability is improved with a decrease in
    downtime. This is due to replication.
  • Availability is the probability that the DDB runs
    for a predetermined time interval.

12
Advantages (ii)
  • Improved Performance
  • A distributed DBMS fragments so that data is
    stored at the site where it is needed most.
  • Fragmentation also implies that the database is
    smaller instead of a single CPU processing one
    large database, multiple CPUs process many
    smaller databases.
  • Inter-query and intra-query parallelism can be
    achieved as multiple queries can be run in
    parallel at separate sites.
  • Easier Expansion
  • Expansion is easier as it may involve adding a
    new site.
  • Expansion can be planned to suit the current
    distribution scheme.

13
System Overheads (i)
  • Controlling Data. It is necessary to monitor data
    distribution, fragmentation and replication by
    expanding the system catalog.
  • Distributed Query Processing. It is necessary to
    access multiple sites during the execution of
    global queries.
  • Optimisation. It is necessary to devise execution
    strategies based on factors such as the movement
    of data between sites and the speed of network
    connections between those sites.
  • Replicated Data Management. It is necessary to
    propagate changes form one site to all copies.
    This requires an ability to decide which copy is
    master, and to maintain consistency among
    replicated sites.

14
System Overheads (ii)
  • Distributed Database Recovery. A requirement to
    handle new types of failure (based on
    communication), and to recover from individual
    site crashes.
  • Security. Global transactions require the
    negotiation of different security systems.
    Authorisation and access privileges must be
    maintained.
  • Distributed Catalog Management. The hold holds
    metadata for the entire DDBMS. A decision must be
    made at design time as to the fragmentation or
    replication (or both) of the system catalog.

15
Sections Covered
  • 2.1 Concepts
  • 2.2 Advantages
  • 2.3 Classification of Distributed Systems
  • 2.4 Database Design
  • 2.5 Distributed Query Processing

16
2.3 Classification of Distributed Systems
  • Distributed databases have design alternatives
    along three dimensions
  • Autonomy,
  • Distribution,
  • Heterogeneity.
  • Autonomy refers to the distribution of control,
    and indicates the degree to which individual
    DBMSs can operate independently.
  • The distribution dimension deals with data. There
    are only two possibilities data is distributed
    across multiple sites, or is stored at a single
    site.
  • Heterogeneity can occur in various forms
    hardware, networking protocols, variations in
    database managers. The important ones relate to
    data models, query languages, and transaction
    management protocols.

17
Distribution
Distributed Homogeneous federated DBMSs
Distributed Homogeneous DBMSs
Distributed Heterogeneous DBMSs
Logically integrated Homogeneous Multiple DBMSs
Single site homogeneous Federated DBMSs
Autonomy
Heterogeneous integrated DBMSs
Heterogeneity
18
Sections Covered
  • 2.1 Concepts
  • 2.2 Advantages
  • 2.3 Classification of Distributed Systems
  • 2.4 Database Design
  • 2.5 Distributed Query Processing

19
2.4 Database Design
  • Early research into DDBSs suggests the
    organisation of distributed systems along three
    orthogonal dimensions
  • level of sharing
  • behaviour of access patterns
  • level of knowledge on access pattern behaviour.
  • The first property looks at how data is shared
    between users the second looks at issues such as
    static and dynamic access patterns and the third
    looks at how much information is available
    regarding access patterns.

20
Top-down design
  • Top-down design is suited to a green-field type
    of application, whereas bottom-up design is
    generally employed where systems already exist.
  • Requirements Analysis ? Objectives
  • Conceptual Design ? the Global Conceptual Schema
  • View Design ? Access Information and External
    Schema Definitions
  • Distributed Design ? Local Conceptual Schemas
  • Physical Design ? Physical Schema
  • Observation Monitoring ? Feedback

21
(No Transcript)
22
Issues
  • Why fragment ?
  • How should fragmentation be performed ?
    (horizontally v vertical)
  • How much should be fragmented? An important issue
    as it effects the performance of query execution
    aim to find a nice balance between large and
    small units.
  • Can we test the correctness of decomposition ?
    (Observe rules)
  • How is allocation performed ? (choose sites,
    replication required ?)
  • What is the necessary information for
    fragmentation and allocation? (database
    information, application information,
    communication network information and computer
    system information).

23
Correctness Rules of Fragmentation
  • The following three rules should be enforced
    during fragmentation, which, together ensure that
    the database does not undergo semantic change
    during fragmentation.
  • Completeness. If a relation instance R is
    decomposed into fragments R1,R2,,Rn, each data
    item that can be found in R can also be found in
    one or more of each Ri. This property is
    identical to the lossless decomposition property
    of normalisation.
  • Reconstruction. If a relation R is decomposed
    into fragments R1,R2,,Rn, it should be possible
    to define a relational operator ? such that
  • R ?Ri ? Ri ? FR
  • The operator ? will be different for different
    fragmentations, but the operation must be
    identified.

24
Rules
  • Disjointness. If a relation instance R is
    decomposed into fragments R1,R2,,Rn, and data
    item di resides in Rj, it cannot reside in any
    other fragment Rk (k?j). This criterion ensures
    that the horizontal fragments are disjoint. Note
    that the primary key is often repeated in all
    fragments for vertical partitioning, thus,
    disjointness is defined only on the non-primary
    key attributes of a relation.

25
Sections Covered
  • 2.1 Concepts
  • 2.2 Advantages
  • 2.3 Classification of Distributed Systems
  • 2.4 Database Design
  • 2.5 Distributed Query Processing

26
2.5 Query Processing
  • The main function of a relational query processor
    is to transform a high-level query into an
    equivalent lower-level query.
  • The low-level query (contains the information
    required to) implements the execution strategy
    for the query.
  • The transformation must achieve correctness and
    efficiency. The well-defined mapping between
    relational calculus and algebra makes the
    correctness issue easy.
  • However, producing an execution strategy that is
    efficient is more complex. A relational calculus
    query may have many equivalent transformations in
    relational algebra. The issue is to select that
    execution strategy that minimises resource
    consumption.
  • In a distributed system, relational algebra is
    not enough to express execution strategies. It
    must be supplemented with operations for
    exchanging data between sites. For example, the
    distributed query processor must select the best
    sites to process data.

27
Sample DB
  • Site 1 (containing a table called EMPLOYEE)
  • Fname, Lname, RSI, DOB, Address, Sex, Salary,
    DeptNo
  • 10,000 tuples (each 100 bytes in length)
  • RSI is 9 bytes DeptNo is 4 bytes Fname is 15
    bytes Lname is 15 bytes
  • Site 2 (containing a table called DEPARTMENT)
  • Dname, Dnumber, MgrRSI, MGRStartdate
  • 100 tuples (35 bytes in length)
  • Dnumber is 4 bytes Dname is 10 bytes MgrRSI is
    9 bytes
  • Properties
  • Size of EMPLOYEE is 10,000 100 1,000,000
    bytes
  • Size of DEPARTMENT is 100 35 3,500 bytes
  • EMPLOYEE.DeptNo DEPARTMENT.Dnumber

28
Sample Query 1
  • For each employee, retrieve the employee name and
    the department in which that employee works.
  • The result of the query will include 10,000
    tuples (assuming that every employee has a valid
    department). We know that 40 bytes are required
    for each tuple in the result.
  • The query is executed at Site 3 (result site).
    Three strategies exist for execution of the
    distributed query.
  • If minimising the amount of data transfer is the
    optimisation criterion, which strategy is
    selected?

29
Strategy 1
  • Transfer both the EMPLOYEE and DEPARTMENT
    relations to the result site, and perform the
    join there.

Site 1
Site 2
Employee
Dept
D Dname, Dnumber, MgrRSI, MGRStartdate
E Fname, Lname, RSI, DOB, Address, Sex,
Salary, DeptNo
Site 3
Transfer amount 1,000,000 3,500 1,003,500
bytes
30
Strategy 2
  • Transfer the EMPLOYEE relation to site 2, execute
    the join at site 2, and send the result to site 3.

Site 1
Site 2
Employee
Dept
E Fname, Lname, RSI, DOB, Address, Sex,
Salary, DeptNo
R Fname, Lname, Dname
Site 3
Transfer 1,000,000 bytes to Site 2 Query result
size 40 10,000 400,000 bytes Transfer
amount 1,000,000 400,000 1,400,000 bytes.
31
Strategy 3
  • Transfer the DEPARTMENT relation to site 1,
    execute the join at site 2, and transfer the
    result to site 3.

Site 1
Site 2
Employee
Dept
D Dname, Dnumber, MgrRSI, MRGStartdate
Site 3
R Fname, Lname, Dname
Transfer 3,500 bytes to Site 1 Query result size
40 10,000 400,000 bytes Transfer amount
3,500 400,000 403,500 bytes.
32
Sample Query 2
  • For each department, retrieve the department
    name, and the name of the department manager.
  • Assume the query is again submitted at site 3,
    and that the result contains 100 tuples (of 40
    bytes).

33
Strategy 1
  • Transfer both EMPLOYEE and DEPARTMENT to site 3,
    and perform the join there.

Site 1
Site 2
Employee
Dept
D Dname, Dnumber, MgrRSI, MRGStartdate
E Fname, Lname, RSI, DOB, Address, Sex,
Salary, DeptNo
Site 3
Transfer amount 1,000,000 3,500 1,003,500
bytes
34
Strategy 2
  • Transfer the EMPLOYEE relation to site 2, execute
    the join at site 2, and send the result to site 3.

Site 1
Site 2
Employee
Dept
E Fname, Lname, RSI, DOB, Address, Sex,
Salary, DeptNo
R Fname, Lname, Dname
Site 3
Transfer 1,000,000 bytes to Site 2 Query result
size 40 100 4,000 bytes Transfer amount
1,000,000 4,000 1,004,000 bytes.
35
Strategy 3
  • Transfer the DEPARTMENT relation to site 1,
    execute the join at site 2, and transfer the
    result to site 3.

Site 1
Site 2
Employee
Dept
D Dname, Dnumber, MgrRSI, MRGStartdate
Site 3
R Fname, Lname, Dname
Transfer 3,500 bytes to Site 1 Query result size
40 100 4,000 bytes Transfer amount 3,500
4,000 7,500 bytes.
36
Exercises
  • Determine what the result would be if the
    projection of each table was executed before they
    left the site (eg. ? Dnumber(Department) and ?
    ltDeptNo, Fname, Lnamegt(Employee) for query 1).
  • Determine the best strategy if the query is
    executed at site 2.

37
Processing Layers
Used for processing
Output from Step 1
Output from Step 2
Output from Step 3
38
Query Decomposition
  • The first layer decomposes the distributed
    calculus query into an algebraic query.
  • Query decomposition can be viewed as four
    successive steps
  • rewrite the calculus query in a normalised form
    (suitable for subsequent manipulations)
  • analyse the normalised query to detect incorrect
    queries (reject them early)
  • simplify the correct query (eg. eliminate
    redundant predicates)
  • transform the calculus query into an algebraic
    query.

39
Data Localisation
  • The input into this layer is the algebraic
    transformation of the query.
  • The main role of this layer is to localise the
    querys data using data distribution information
    determine which fragments are involved in the
    query and transform the distributed query into
    fragment queries.
  • There are two steps
  • The distributed query is mapped into a fragment
    query by substituting each distributed relation
    by its materialisation program.
  • The fragment query is simplified and restructured
    to another correct query.

40
Global Query Optimisation
  • The input to this layer is a (algebraic) query
    fragment .
  • The goal of the query optimiser is to locate an
    execution strategy for the query that is close to
    optimal.
  • This consists of finding the best ordering of
    operations in the fragment query.
  • An important aspect of query optimisation is join
    ordering, since permutations of joins within the
    query may lead to improvements of orders of
    magnitude.

41
Local Query Optimisation
  • The final layer is performed by all sites having
    fragments involved in the query.
  • Each sub-query executing at local sites is
    optimised using the local schema of the site.
Write a Comment
User Comments (0)
About PowerShow.com