The%20Sum%20is%20Greater%20Than%20the%20Parts%20Global%20Query%20Optimization%20in%20Federated%20Systems

About This Presentation

Title:

The%20Sum%20is%20Greater%20Than%20the%20Parts%20Global%20Query%20Optimization%20in%20Federated%20Systems

Description:

... quickly to store inventory information, call tickets, etc. and are tech savvy. HR may adopt slowly as they use many paper forms and are not so tech savvy ... – PowerPoint PPT presentation

Number of Views:57

Avg rating:3.0/5.0

Slides: 101

Provided by: TonyY9

Category:

more less

Transcript and Presenter's Notes

Title: The%20Sum%20is%20Greater%20Than%20the%20Parts%20Global%20Query%20Optimization%20in%20Federated%20Systems

1
The Sum is GreaterThan the PartsGlobal Query
Optimization in Federated Systems

Tony Young
M.Math Candidate
CS 848 - Fall 2004

2
Outline

Introduction
Motivation
Issues
System Overview
Optimization
Semijoin Algorithm
Reduction Algorithm
The Garlic Approach
Conclusion
My Project

3
Outline

Introduction
Motivation
Issues
System Overview
Optimization
Semijoin Algorithm
Reduction Algorithm
The Garlic Approach
Conclusion
My Project

4
Introduction

References
L. M. Haas, E. T. Lin, and M. A. Roth. Data
integration through database federation. IBM
Systems Journal, 41(4)578596, 2002.
David K. Hsiao. Federated databases and systems
part i a tutorial on their data sharing. The
VLDB Journal, 1(1)127180, 1992.
David K. Hsiao. Federated databases and systems
part ii a tutorial on their resource
consolidation. The VLDB Journal, 1(2)285310,
1992.
Hongjun Lu, Beng-Chin Ooi, and Cheng-Hian Goh. On
global multidatabase query optimization. SIGMOD
Rec., 21(4)611, 1992.

5
Introduction

References
Neil Coburn and Per-Ake Larson. Multidatabase
services issues and architectural design. In
Proceedings of the 1992 conference of the Centre
for Advanced Studies on Collaborative research,
pages 5766, Toronto, Ontario, Canada, 1992. IBM
Press.
Qiang Zhu. Query optimization in multidatabase
systems. In Proceedings of the 1992 conference of
the Centre for Advanced Studies on Collaborative
research, pages 111127. IBM Press, 1992.

6
Introduction

References
H. Lu, B. C. Ooi, and C. H. Goh. Multidatabase
query optimization Issues and solutions. In
Proceedings of Third International Workshop on
Research Issues in Data Engineering
Interoperability in Multidatabase Systems, pages
137143, 1993.

7
Introduction
8
Introduction
9
Introduction
10
Introduction

What is a federated database system (FDBS)?
Also referred to as a multidatabase system (MDBS)
FDBSs combine multiple heterogeneous data
sources into one global view
Users think that the data all resides in one place

11
Introduction

The notion of an FDBS has been around for many
years
First semi-commercial product was Mermaid (later
became InterViso) in 1984
Flurry of papers began to surface in late 1980s
- early 1990s
What motivated the development of these systems?

12
Motivation

Replacement of Data Processing With Databases
Organizations were moving away from traditional
data processing techniques (such as storing
information in flat files
The rise in popularity of DBMS's can be
attributed to powerful data mining applications
as well as ease of data access

13
Motivation

Proliferation of Heterogeneous Databases Within
an Organization
It is not uncommon for different departments
within an organization to make use of their own
database servers
Departments often do not coordinate to ensure
that a corporation is using a homogeneous DBMS to
store data
There is no guarantee that the schema individual
departments use to store data will be homogeneous

14
Motivation

Data Sharing Within Organizations
Many organizations seek to share data between
different departments
Finance department may require information
regarding projects in progress in the marketing
department
Such information sharing is difficult without the
guarantee of schema, data model or access
language homogeneity
Finance may use SQL on a relational database and
marketing may use Xquery on an XML database

15
Motivation

Different Rates of Technology Adoption
Different departments will adopt technology at
different rates
IT will adopt quickly to store inventory
information, call tickets, etc. and are tech
savvy
HR may adopt slowly as they use many paper forms
and are not so tech savvy
Older systems are in place, and newer systems to
come online use different products

16
Motivation

Geographic Separation of Teams
Different teams may be broken up across
geographic locations
Different sites will hold teams working on
different projects
Each site may have their own IT staff and make
their own purchasing and installation decisions
Often there is no coordination between sites

17
Motivation

Mergers and Acquisitions
When companies join forces, their IT systems must
be joined as well
Old applications will depend on the old software,
and users might be reluctant to learn an entirely
new system
If we can merge the two system so that each user
can make use of their old applications and old
access language, transitions might be easier

18
Issues

Several issues affect query optimization in FDBSs

19
Issues

Site Autonomy
Data - Local database administrators have direct
and complete control over the schemas
This information cannot be modified in any way
Design - Local database administrators decide
when and how to replicate and fragment data
Communication - Each site decides locally whether
or not to communicate with the FDBS
Execution - Each site can determine how, when and
whether to execute global queries, as well as how
queries are prioritized

20
Issues

Local Parameters
Local cost parameters for individual sites are
not always available to the FDBS
The FDBS often doesn't know what indices are
available for relations at local sites
Cant predict what access methods will be used by
local sites
No idea what page size and disk latency are
etc.

21
Issues

Translation
Queries must be translated to and from the local
schema, query language, and data model on-the-fly
This requires additional query processing time

22
Issues

Heterogeneous Capabilities
Not all local sites have the same capabilities.
Some sites may not implement any ranking
operations
This means that intermediate results might have
to be shifted to sites that can provide these
capabilities, further increasing processing time

23
Issues

Additional Costs
Cost based optimization needs to take into
consideration some additional factors such as
Transmission speeds
Network loads
Local site configurations
As with local parameters, this information is not
always available to the FDBS

24
Issues

Overriding theme - the FDBS is just another
application as far as the local sites are
concerned
We can make no assumptions about the local sites
We can assume no global control over the local
sites
We have no hooks into the local sites to directly
access information

25
Outline

Introduction
Motivation
Issues
System Overview
Optimization
Semijoin Algorithm
Reduction Algorithm
The Garlic Approach
Conclusion
My Project

26
System Overview

References
Qiang Zhu. Query optimization in multidatabase
systems. In Proceedings of the 1992 conference of
the Centre for Advanced Studies on Collaborative
research, pages 111127. IBM Press, 1992.
M. T. Ozsu and P. Valduriez. Principles of
Distributed Database Systems. Prentice Hall,
Upper Saddle River, NJ, 2nd edition, 1999.
John Grant, Witold Litwin, Nick Roussopoulos, and
Timos Sellis. Query languages for relational
multidatabases. The VLDB Journal, 2(2)153172,
1993.

27
System Overview

Two main approaches to a federated system
Multidatabase Language Approach
Users must learn a special access language
Users must use one standardized data model
Users must know the sites they are contacting and
how data is organized at those sites
Users must enter their authentication information
each time they use a site
Users must

28
System Overview
29
System Overview

Main idea
User specifies the sites, relations and columns
used in the query by their name at the local site
Queries are still submitted to a middleware, but
user must know where things are stored

30
System Overview

Two main approaches to federated system
Global Schema Approach
Global DBA implements wrappers/agents to convert
access language and data model before sending to
source
Global DBA generates an integrated global schema
Global DBA stores authentication for individual
users at each local site and FDBS handles login
Global DBA

31
System Overview
32
System Overview

Main idea
User specifies the relations and columns used in
the query by their global name
Queries are submitted to a middleware that does
conversion and subquery generation

33
System Overview

The difference is how users perceive the system
MDBL As far as the user is concerned, the data
is stored at separate sites and their query must
explicitly use those sites
GS As far as the user is concerned, the data is
stored in the middleware and they can access it
directly

34
Outline

Introduction
Motivation
Issues
System Overview
Optimization
Semijoin Algorithm
Reduction Algorithm
The Garlic Approach
Conclusion
My Project

35
Optimization

There are many many optimization algorithms for
federated systems
2-Phase
Statistical Sampling
Adaptive
Probing Query-based
etc.
We will look at three
Semijoin and Reduction From Mermaid - state of
the art for many years and pioneer in the field
Garlic From IBM - incorporated into a shipping
product (DB2 Information Integrator)

36
Semijoin Optimization Algorithm

References
David Brill, Marjorie Templeton, and Clement T.
Yu. Distributed query processing strategies in
mermaid, a frontend to data management systems.
In Proceedings of the First International
Conference on Data Engineering, pages 211218.
IEEE Computer Society, 1984.

37
Semijion Optimization Algorithm

The semijoin algorithm was proposed in Mermaid
(1984)
Assumes that the cost of data transfer through a
network outweighs local site CPU overhead
Seeks to reduce the size of relations required
for a query at local sites before transferring
results back to the Controller
Four steps

38
Step 1 Site Selection

A set of sites that will be used to perform a
query must first be chosen
Requires finding a set of minimal size that
includes one copy of each local, partitioned and
replicated relation
i.e. each site holding a data fragment must be in
the set, but only one replica of a relation must
be in the set
Some sites may hold more than one relation
required by the query
Allows us to further reduce the size of the site
set

39
Step 1 Site Selection

The system statistics that can be used to
optimize this selection, such as link speeds and
system loads, remains an open problem

40
Step 1 Site Selection
41
Step 1 Site Selection
42
Step 2 Local Reduction

In parallel at each local site in the chosen site
set, reduce each relation by performing
selections and projections
Parameters used to perform these operations are
taken from select, where and join conditions in
the original query
It might be possible to optimize the order in
which site reduction queries are performed by
exploiting network traffic and speed, CPU load at
local sites, etc.
i.e. submit queries to slow sites first and hope
they don't increase the overall execution time
too much

43
Step 2 Local Reduction
44
Step 3 Global Reduction

Find and execute an efficient sequence of
semijoins that will reduce the set of records to
be transmitted
Mermaid uses a hill-climbing algorithm to
determine this set
Once the semijoins are performed, the smallest
amount of data required to answer the query is
ready for transport
Some other algorithms to determine the optimal
semijoin sequence should be investigated as this
one is slow!

45
Step 3 Global Reduction
46
Step 4 Assembly

Transfer the data to one central query site and
generate the result set. Return the result set to
the user
May perform joins at local sites or wait until we
get to the FDBS

47
Step 4 Assembly
48
Step 4 Assembly
49
Step 4 Assembly
50
Step 4 Assembly
51
Step 4 Assembly

It may be less costly to generate the result set
at one central site and then transfer the data
back to the user
It may also be less costly to assemble the result
set at the user's site

52
Semijoin Optimization Algorithm

This algorithm exploits the capabilities of the
DBMS's in the federation
Attempts to reduce the transmission overhead
required to send data to sites
Statistics are computed on-the-fly and are
discarded once the query is complete
No attempt is made to store or make statistics
more accurate

53
Reduction Algorithm

References
David Brill, Marjorie Templeton, and Clement T.
Yu. Distributed query processing strategies in
mermaid, a frontend to data management systems.
In Proceedings of the First International
Conference on Data Engineering, pages 211218.
IEEE Computer Society, 1984.

54
Reduction Algorithm

The replicate algorithm was proposed in Mermaid
(1984)
Assumes that CPU overhead at local sites
outweighs transfer costs between them
Seeks to transfer data to local sites in order to
exploit the differences in processing speeds of
each system
Four steps

55
Step 1 Site Selection

As with the semijoin algorithm, we choose a
minimal site set
Instead of choosing only one replica for each
replicated relation, we include all replicas of
the data
Allows us to run queries in parallel at each
replica

56
Step 1 Site Selection
57
Step 2 Data Transfer

Copy each relation to each site where it is to be
used to process a subquery, but does not already
exist
I.e. if site 1 holds relation A and requires
relation B, transfer B to site 1
This may require composing fragmented relations
into one large relation
After this step, each site should have a copy of
the relations that are to be used to form the
partial query result for which that site is
responsible
As per the subquery sent to it by the Controller

58
Step 2 Data Transfer
59
Step 3 Query Execution

Once each site has the data it needs to run its
partial query, the queries are executed
After this step, each site should have a partial
answer to the user's query

60
Step 3 Query Execution
61
Step 4 Assembly

Transfer the partial answers from local sites and
create the final result set at the user's home
site
Return the results to the user

62
Step 4 Assembly
63
Step 4 Assembly
64
Step 4 Assembly

It may be less costly to generate the result set
at one central site and then transfer the data
back to the user
It may also be less costly to assemble the result
set at the user's site

65
Reduction Algorithm

This algorithm exploits the configurations of the
DBMS's in the federation
Attempts to reduce the processing time for the
query by working at sites of varying capabilities
Statistics are computed on-the-fly and are
discarded once the query is complete
No attempt is made to store or make statistics
more accurate

66
Garlic

References
L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar,
J. E. Rice, and W. C. Swope. Discoverylink a
system for integrated access to life sciences
data sources. IBM Syst. J., 40(2)489511, 2001.
Laura M. Haas, Donald Kossmann, Edward L.
Wimmers, and Jun Yang. Optimizing queries across
diverse data sources. In Proceedings of the 23rd
International Conference on Very Large Data
Bases, pages 276285. Morgan Kaufmann Publishers
Inc., 1997.

67
Garlic

References
L. M. Haas, P. Kodali, J. E. Rice, P. M. Schwarz,
and W. C. Swope. Integrating life sciences
data-with a little garlic. In Proceedings of the
1st IEEE International Symposium on
Bioinformatics and Biomedical Engineering, page
5. IEEE Computer Society, 2000.
Mary Tork Roth, Fatma Ozcan, and Laura M. Haas.
Cost models DO matter Providing cost information
for diverse data sources in a federated system.
In The VLDB Journal, pages 599610, 1999.

68
Garlic

References
Guy M. Lohman. Grammar-like functional rules for
representing query optimization alternatives. In
Proceedings of the 1988 ACM SIGMOD international
conference on Management of data, pages 1827.
ACM Press, 1988.

69
Garlic

Proposed by IBM for use in several products
DB2 II and DiscoveryLink
The Garlic optimizer uses wrappers to gather
costing info for developing a query plan
Goal is to allow Garlic to find a good plan
without knowledge of the capabilities of the
local site
Optimizer uses a set of strategy alternative
rules (STARs) that are used to rewrite plans
Plans are a set of plan operators (POPs) that
compose the query plan tree (sort, filter, scan,
etc).
A generic pushdown POP that encapsulates work to
be done at a local site is also included

70
Garlic

STARs are fired over the query in order to
generate POPs
STARs can be seen as grammatical production
rules
STAR's generate cost and cardinality information
using input from the wrapper
Works in three phases

71
Step 1 Fire Access STARs

Access STARs are applied in order to enumerate
plans that read data from a source
Plan space is pruned in order to remove plans
that have the same or weaker cost properties
Just what is weaker is not detailed!

72
Step 1 Fire Access STARs
73
Step 1 Fire Access STARs
74
Step 2 Fire Join STARs

Join STARs are applied in order to enumerate all
plans involving joins
Plan space is filled with all possible join
combinations
Garlic considers bushy plans and left-deep plans
as collocated data may make a bushy plan more
efficient
I.e. joins might be able to be performed at a
local site
Plan space is pruned in order to remove plans
that have the same or weaker cost properties

75
Step 2 Fire Join STARs
76
Step 2 Fire Join STARs
77
Step 3 Fire FinishRoot STAR

The FinishRoot STAR is applied to provide
orderings, selects, projects, etc. that were not
already completed by some local site
I.e. the local site did not have the proper
capabilities to perform that operation
The plan with lowest cost is then chosen for
execution

78
Step 3 Fire FinishRoot STAR
79
Garlic

Garlic plans are built bottom-up using dynamic
programming
This method will take large amounts of time to
optimize plans containing even a small number of
joins
An 8-way clique join (i.e. there is a join
between every pair of relations) was shown to
take approximately 4.5 min to optimize!
Considered the upper limit of the algorithm.
Wrappers are assumed to be able to gather
statistics directly from the local site
Not always possible!

80
Outline

Introduction
Motivation
Issues
System Overview
Optimization
Semijoin Algorithm
Reduction Algorithm
The Garlic Approach
Conclusion
My Project

81
Conclusion

Many good algorithms exist for optimizing queries
in federated systems
Those queries make use of statistics that must be
gathered somehow
Probing queries
Utilities
Statistical sampling of data
etc.

82
Conclusion

Statistical accuracy is important
As we all know, accuracy directly affects
estimates, which may directly affect algorithm
choices
The optimizer is only as good as the statistics!

83
Questions?
84
Outline

Introduction
Motivation
Issues
System Overview
Optimization
Semijoin Algorithm
Reduction Algorithm
The Garlic Approach
Conclusion
My Project

85
My Project

There are many methods of statistics collection
that have been proposed
Utility based methods dispatch an agent to gather
data from a database and build statistics
Utility is invoked on a regular schedule to keep
statistics up to date
Imposes great overhead at the site
Requires a DBA to set the schedule and rerun if
the stats are out of date between scheduled runs

86
My Project

There are many methods of statistics collection
that have been proposed
Piggybacking requests more data than is required
by a user query and builds statistics using it
Ex ask for an additional column or remove filter
predicates
We can develop accurate statistics with the
returned data
Has the potential to significantly increase the
overhead of a query at a local site as well as
the running time of a query

87
My Project

There are many methods of statistics collection
that have been proposed
Fuzzy cost models attempt to model the changing
state of the database using probabilities about
the parameters needed for costing
Builds a parameter by taking portions of the
possible values (according to the probability
that they will occur)
Models must be built offline adding a significant
task to the DBA
What do we do to update the model once its
built?
How do we determine what parameters to add to the
model?

88
My Project

There are many methods of statistics collection
that have been proposed
Data sampling allows us to build a statistically
accurate value for a statistic
Different methods are used to draw a sample
Random draw a random sample of tuples
Stratified draw a sample of tuples from tuples
classified into groups
Adaptive draw n tuples where n is determined to
be a statistically relevant sample size or
provides an error less than the relative error
required by the application
Systemic draw a certain percentage of the number
of tuples in the table
etc.

89
My Project

There are many methods of statistics collection
that have been proposed
Among sample bias and estimation errors, sampling
methods require direct access to the data in the
tables
FDBS does not have direct access!
Not practical to run online
Might as well use a utility to get more accurate
stats

90
My Project

The overriding theme
We want something that imposes limited overhead
We want something that calculates reasonably
accurate statistics

91
My Project

WE CANNOT ASSUME ANY ACCESS TO DATA WILL BE
AVAILABLE!!!
To the local sites, our FDBS is just another
application, and can only do the things an
application can do!
I.e. we cant dictate the access plan to use, etc.

92
My Project

Meet FLO
Federated Learning cOllector (FLO) is able to
learn statistics from the data that is returned
to answer user queries
Calculates statistics regularly according to the
amount that they change
I.e. if a stat changes by 2, recalculate it in a
day if a stat changes by 20, recalculate it at
the next chance

93
My Project

Meet FLO
FLO does not request any additional data as
piggybacking
FLO does not run any additional queries as
utilities
FLO does not impose any additional overhead on
the local sites, and stats are calculated offline

94
My Project

Meet FLO
Tradeoffs exist
Stats wont be perfectly accurate
Stats wont necessarily exist for a range of
values
Questions exist
How often is often enough to recalculate stats?
How do we get initial statistics?
?

95
My Project

Statistics Gathering References
E. Whalen. Oracle Performance Tuning and
Optimization. SAMS Publishing, 1996.
Amira Rahal, Qiang Zhu, and Per-Ake Larson.
Evolutionary techniques for updating query cost
models in a dynamic multidatabase environment.
The VLDB Journal, 13(2)162176, 2004.
A. H. H. Ngu, B. Harangsri, and J. Shepherd.
Query size estimation for joins using systematic
sampling. Distrib. Parallel Databases,
15(3)237275, 2004.

96
My Project

Statistics Gathering References
Peter J. Haas and Arun N. Swami. Sequential
sampling procedures for query size estimation.
SIGMOD Rec., 21(2)341350, 1992.
Richard J. Lipton, Jeffrey F. Naughton, and
Donovan A. Schneider. Practical selectivity
estimation through adaptive sampling. In
Proceedings of the 1990 ACM SIGMOD international
conference on Management of data, pages 111. ACM
Press, 1990.

97
My Project

Statistics Gathering References
Qiang Zhu, Brian Dunkel, Wing Lau, Suyun Chen,
and Berni Schiefer. Piggyback statistics
collection for query optimization Towards a
self-maintaining database management system. The
Computer Journal, 47(2)221244, March 2004.
Qiang Zhu and P. A. Larson. Query optimization
using fuzzy set theory for multidatabase systems.
In Proceedings of the 1993 conference of the
Centre for Advanced Studies on Collaborative
research, pages 848859, Toronto, Ontario,
Canada, 1993. IBM Press.

98
My Project

Statistics Gathering References
Qiang Zhu and Per-Ake Larson. A query sampling
method of estimating local cost parameters in a
multidatabase system. In Proceedings of the Tenth
International Conference on Data Engineering,
February 14-18, 1994, Houston, Texas, USA, pages
144153. IEEE Computer Society, 1994.
Qiang Zhu. An integrated method for estimating
selectivities in a multidatabase system. In
Proceedings of the 1993 conference of the Centre
for Advanced Studies on Collaborative research,
pages 832847. IBM Press, 1993.

99
My Project

Statistics Gathering References
Per-Ake Larson Qiang Zhu. Establishing a fuzzy
cost model for query optimization in a
multidatabase system. In System Sciences, 1994.
Vol.II Software Technology, Proceedings of the
Twenty-Seventh Hawaii International Conference
on, volume 2, pages 263272, 1994.

100
Questions?

Write a Comment

User Comments (0)