The%20Sum%20is%20Greater%20Than%20the%20Parts%20Global%20Query%20Optimization%20in%20Federated%20Systems - PowerPoint PPT Presentation

About This Presentation
Title:

The%20Sum%20is%20Greater%20Than%20the%20Parts%20Global%20Query%20Optimization%20in%20Federated%20Systems

Description:

... quickly to store inventory information, call tickets, etc. and are tech savvy. HR may adopt slowly as they use many paper forms and are not so tech savvy ... – PowerPoint PPT presentation

Number of Views:57
Avg rating:3.0/5.0
Slides: 101
Provided by: TonyY9
Category:

less

Transcript and Presenter's Notes

Title: The%20Sum%20is%20Greater%20Than%20the%20Parts%20Global%20Query%20Optimization%20in%20Federated%20Systems


1
The Sum is GreaterThan the PartsGlobal Query
Optimization in Federated Systems
  • Tony Young
  • M.Math Candidate
  • CS 848 - Fall 2004

2
Outline
  • Introduction
  • Motivation
  • Issues
  • System Overview
  • Optimization
  • Semijoin Algorithm
  • Reduction Algorithm
  • The Garlic Approach
  • Conclusion
  • My Project

3
Outline
  • Introduction
  • Motivation
  • Issues
  • System Overview
  • Optimization
  • Semijoin Algorithm
  • Reduction Algorithm
  • The Garlic Approach
  • Conclusion
  • My Project

4
Introduction
  • References
  • L. M. Haas, E. T. Lin, and M. A. Roth. Data
    integration through database federation. IBM
    Systems Journal, 41(4)578596, 2002.
  • David K. Hsiao. Federated databases and systems
    part i a tutorial on their data sharing. The
    VLDB Journal, 1(1)127180, 1992.
  • David K. Hsiao. Federated databases and systems
    part ii a tutorial on their resource
    consolidation. The VLDB Journal, 1(2)285310,
    1992.
  • Hongjun Lu, Beng-Chin Ooi, and Cheng-Hian Goh. On
    global multidatabase query optimization. SIGMOD
    Rec., 21(4)611, 1992.

5
Introduction
  • References
  • Neil Coburn and Per-Ake Larson. Multidatabase
    services issues and architectural design. In
    Proceedings of the 1992 conference of the Centre
    for Advanced Studies on Collaborative research,
    pages 5766, Toronto, Ontario, Canada, 1992. IBM
    Press.
  • Qiang Zhu. Query optimization in multidatabase
    systems. In Proceedings of the 1992 conference of
    the Centre for Advanced Studies on Collaborative
    research, pages 111127. IBM Press, 1992.

6
Introduction
  • References
  • H. Lu, B. C. Ooi, and C. H. Goh. Multidatabase
    query optimization Issues and solutions. In
    Proceedings of Third International Workshop on
    Research Issues in Data Engineering
    Interoperability in Multidatabase Systems, pages
    137143, 1993.

7
Introduction
8
Introduction
9
Introduction
10
Introduction
  • What is a federated database system (FDBS)?
  • Also referred to as a multidatabase system (MDBS)
  • FDBSs combine multiple heterogeneous data
    sources into one global view
  • Users think that the data all resides in one place

11
Introduction
  • The notion of an FDBS has been around for many
    years
  • First semi-commercial product was Mermaid (later
    became InterViso) in 1984
  • Flurry of papers began to surface in late 1980s
    - early 1990s
  • What motivated the development of these systems?

12
Motivation
  • Replacement of Data Processing With Databases
  • Organizations were moving away from traditional
    data processing techniques (such as storing
    information in flat files
  • The rise in popularity of DBMS's can be
    attributed to powerful data mining applications
    as well as ease of data access

13
Motivation
  • Proliferation of Heterogeneous Databases Within
    an Organization
  • It is not uncommon for different departments
    within an organization to make use of their own
    database servers
  • Departments often do not coordinate to ensure
    that a corporation is using a homogeneous DBMS to
    store data
  • There is no guarantee that the schema individual
    departments use to store data will be homogeneous

14
Motivation
  • Data Sharing Within Organizations
  • Many organizations seek to share data between
    different departments
  • Finance department may require information
    regarding projects in progress in the marketing
    department
  • Such information sharing is difficult without the
    guarantee of schema, data model or access
    language homogeneity
  • Finance may use SQL on a relational database and
    marketing may use Xquery on an XML database

15
Motivation
  • Different Rates of Technology Adoption
  • Different departments will adopt technology at
    different rates
  • IT will adopt quickly to store inventory
    information, call tickets, etc. and are tech
    savvy
  • HR may adopt slowly as they use many paper forms
    and are not so tech savvy
  • Older systems are in place, and newer systems to
    come online use different products

16
Motivation
  • Geographic Separation of Teams
  • Different teams may be broken up across
    geographic locations
  • Different sites will hold teams working on
    different projects
  • Each site may have their own IT staff and make
    their own purchasing and installation decisions
  • Often there is no coordination between sites

17
Motivation
  • Mergers and Acquisitions
  • When companies join forces, their IT systems must
    be joined as well
  • Old applications will depend on the old software,
    and users might be reluctant to learn an entirely
    new system
  • If we can merge the two system so that each user
    can make use of their old applications and old
    access language, transitions might be easier

18
Issues
  • Several issues affect query optimization in FDBSs

19
Issues
  • Site Autonomy
  • Data - Local database administrators have direct
    and complete control over the schemas
  • This information cannot be modified in any way
  • Design - Local database administrators decide
    when and how to replicate and fragment data
  • Communication - Each site decides locally whether
    or not to communicate with the FDBS
  • Execution - Each site can determine how, when and
    whether to execute global queries, as well as how
    queries are prioritized

20
Issues
  • Local Parameters
  • Local cost parameters for individual sites are
    not always available to the FDBS
  • The FDBS often doesn't know what indices are
    available for relations at local sites
  • Cant predict what access methods will be used by
    local sites
  • No idea what page size and disk latency are
  • etc.

21
Issues
  • Translation
  • Queries must be translated to and from the local
    schema, query language, and data model on-the-fly
  • This requires additional query processing time

22
Issues
  • Heterogeneous Capabilities
  • Not all local sites have the same capabilities.
  • Some sites may not implement any ranking
    operations
  • This means that intermediate results might have
    to be shifted to sites that can provide these
    capabilities, further increasing processing time

23
Issues
  • Additional Costs
  • Cost based optimization needs to take into
    consideration some additional factors such as
  • Transmission speeds
  • Network loads
  • Local site configurations
  • As with local parameters, this information is not
    always available to the FDBS

24
Issues
  • Overriding theme - the FDBS is just another
    application as far as the local sites are
    concerned
  • We can make no assumptions about the local sites
  • We can assume no global control over the local
    sites
  • We have no hooks into the local sites to directly
    access information

25
Outline
  • Introduction
  • Motivation
  • Issues
  • System Overview
  • Optimization
  • Semijoin Algorithm
  • Reduction Algorithm
  • The Garlic Approach
  • Conclusion
  • My Project

26
System Overview
  • References
  • Qiang Zhu. Query optimization in multidatabase
    systems. In Proceedings of the 1992 conference of
    the Centre for Advanced Studies on Collaborative
    research, pages 111127. IBM Press, 1992.
  • M. T. Ozsu and P. Valduriez. Principles of
    Distributed Database Systems. Prentice Hall,
    Upper Saddle River, NJ, 2nd edition, 1999.
  • John Grant, Witold Litwin, Nick Roussopoulos, and
    Timos Sellis. Query languages for relational
    multidatabases. The VLDB Journal, 2(2)153172,
    1993.

27
System Overview
  • Two main approaches to a federated system
  • Multidatabase Language Approach
  • Users must learn a special access language
  • Users must use one standardized data model
  • Users must know the sites they are contacting and
    how data is organized at those sites
  • Users must enter their authentication information
    each time they use a site
  • Users must

28
System Overview
29
System Overview
  • Main idea
  • User specifies the sites, relations and columns
    used in the query by their name at the local site
  • Queries are still submitted to a middleware, but
    user must know where things are stored

30
System Overview
  • Two main approaches to federated system
  • Global Schema Approach
  • Global DBA implements wrappers/agents to convert
    access language and data model before sending to
    source
  • Global DBA generates an integrated global schema
  • Global DBA stores authentication for individual
    users at each local site and FDBS handles login
  • Global DBA

31
System Overview
32
System Overview
  • Main idea
  • User specifies the relations and columns used in
    the query by their global name
  • Queries are submitted to a middleware that does
    conversion and subquery generation

33
System Overview
  • The difference is how users perceive the system
  • MDBL As far as the user is concerned, the data
    is stored at separate sites and their query must
    explicitly use those sites
  • GS As far as the user is concerned, the data is
    stored in the middleware and they can access it
    directly

34
Outline
  • Introduction
  • Motivation
  • Issues
  • System Overview
  • Optimization
  • Semijoin Algorithm
  • Reduction Algorithm
  • The Garlic Approach
  • Conclusion
  • My Project

35
Optimization
  • There are many many optimization algorithms for
    federated systems
  • 2-Phase
  • Statistical Sampling
  • Adaptive
  • Probing Query-based
  • etc.
  • We will look at three
  • Semijoin and Reduction From Mermaid - state of
    the art for many years and pioneer in the field
  • Garlic From IBM - incorporated into a shipping
    product (DB2 Information Integrator)

36
Semijoin Optimization Algorithm
  • References
  • David Brill, Marjorie Templeton, and Clement T.
    Yu. Distributed query processing strategies in
    mermaid, a frontend to data management systems.
    In Proceedings of the First International
    Conference on Data Engineering, pages 211218.
    IEEE Computer Society, 1984.

37
Semijion Optimization Algorithm
  • The semijoin algorithm was proposed in Mermaid
    (1984)
  • Assumes that the cost of data transfer through a
    network outweighs local site CPU overhead
  • Seeks to reduce the size of relations required
    for a query at local sites before transferring
    results back to the Controller
  • Four steps

38
Step 1 Site Selection
  • A set of sites that will be used to perform a
    query must first be chosen
  • Requires finding a set of minimal size that
    includes one copy of each local, partitioned and
    replicated relation
  • i.e. each site holding a data fragment must be in
    the set, but only one replica of a relation must
    be in the set
  • Some sites may hold more than one relation
    required by the query
  • Allows us to further reduce the size of the site
    set

39
Step 1 Site Selection
  • The system statistics that can be used to
    optimize this selection, such as link speeds and
    system loads, remains an open problem

40
Step 1 Site Selection
41
Step 1 Site Selection
42
Step 2 Local Reduction
  • In parallel at each local site in the chosen site
    set, reduce each relation by performing
    selections and projections
  • Parameters used to perform these operations are
    taken from select, where and join conditions in
    the original query
  • It might be possible to optimize the order in
    which site reduction queries are performed by
    exploiting network traffic and speed, CPU load at
    local sites, etc.
  • i.e. submit queries to slow sites first and hope
    they don't increase the overall execution time
    too much

43
Step 2 Local Reduction
44
Step 3 Global Reduction
  • Find and execute an efficient sequence of
    semijoins that will reduce the set of records to
    be transmitted
  • Mermaid uses a hill-climbing algorithm to
    determine this set
  • Once the semijoins are performed, the smallest
    amount of data required to answer the query is
    ready for transport
  • Some other algorithms to determine the optimal
    semijoin sequence should be investigated as this
    one is slow!

45
Step 3 Global Reduction
46
Step 4 Assembly
  • Transfer the data to one central query site and
    generate the result set. Return the result set to
    the user
  • May perform joins at local sites or wait until we
    get to the FDBS

47
Step 4 Assembly
48
Step 4 Assembly
49
Step 4 Assembly
50
Step 4 Assembly
51
Step 4 Assembly
  • It may be less costly to generate the result set
    at one central site and then transfer the data
    back to the user
  • It may also be less costly to assemble the result
    set at the user's site

52
Semijoin Optimization Algorithm
  • This algorithm exploits the capabilities of the
    DBMS's in the federation
  • Attempts to reduce the transmission overhead
    required to send data to sites
  • Statistics are computed on-the-fly and are
    discarded once the query is complete
  • No attempt is made to store or make statistics
    more accurate

53
Reduction Algorithm
  • References
  • David Brill, Marjorie Templeton, and Clement T.
    Yu. Distributed query processing strategies in
    mermaid, a frontend to data management systems.
    In Proceedings of the First International
    Conference on Data Engineering, pages 211218.
    IEEE Computer Society, 1984.

54
Reduction Algorithm
  • The replicate algorithm was proposed in Mermaid
    (1984)
  • Assumes that CPU overhead at local sites
    outweighs transfer costs between them
  • Seeks to transfer data to local sites in order to
    exploit the differences in processing speeds of
    each system
  • Four steps

55
Step 1 Site Selection
  • As with the semijoin algorithm, we choose a
    minimal site set
  • Instead of choosing only one replica for each
    replicated relation, we include all replicas of
    the data
  • Allows us to run queries in parallel at each
    replica

56
Step 1 Site Selection
57
Step 2 Data Transfer
  • Copy each relation to each site where it is to be
    used to process a subquery, but does not already
    exist
  • I.e. if site 1 holds relation A and requires
    relation B, transfer B to site 1
  • This may require composing fragmented relations
    into one large relation
  • After this step, each site should have a copy of
    the relations that are to be used to form the
    partial query result for which that site is
    responsible
  • As per the subquery sent to it by the Controller

58
Step 2 Data Transfer
59
Step 3 Query Execution
  • Once each site has the data it needs to run its
    partial query, the queries are executed
  • After this step, each site should have a partial
    answer to the user's query

60
Step 3 Query Execution
61
Step 4 Assembly
  • Transfer the partial answers from local sites and
    create the final result set at the user's home
    site
  • Return the results to the user

62
Step 4 Assembly
63
Step 4 Assembly
64
Step 4 Assembly
  • It may be less costly to generate the result set
    at one central site and then transfer the data
    back to the user
  • It may also be less costly to assemble the result
    set at the user's site

65
Reduction Algorithm
  • This algorithm exploits the configurations of the
    DBMS's in the federation
  • Attempts to reduce the processing time for the
    query by working at sites of varying capabilities
  • Statistics are computed on-the-fly and are
    discarded once the query is complete
  • No attempt is made to store or make statistics
    more accurate

66
Garlic
  • References
  • L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar,
    J. E. Rice, and W. C. Swope. Discoverylink a
    system for integrated access to life sciences
    data sources. IBM Syst. J., 40(2)489511, 2001.
  • Laura M. Haas, Donald Kossmann, Edward L.
    Wimmers, and Jun Yang. Optimizing queries across
    diverse data sources. In Proceedings of the 23rd
    International Conference on Very Large Data
    Bases, pages 276285. Morgan Kaufmann Publishers
    Inc., 1997.

67
Garlic
  • References
  • L. M. Haas, P. Kodali, J. E. Rice, P. M. Schwarz,
    and W. C. Swope. Integrating life sciences
    data-with a little garlic. In Proceedings of the
    1st IEEE International Symposium on
    Bioinformatics and Biomedical Engineering, page
    5. IEEE Computer Society, 2000.
  • Mary Tork Roth, Fatma Ozcan, and Laura M. Haas.
    Cost models DO matter Providing cost information
    for diverse data sources in a federated system.
    In The VLDB Journal, pages 599610, 1999.

68
Garlic
  • References
  • Guy M. Lohman. Grammar-like functional rules for
    representing query optimization alternatives. In
    Proceedings of the 1988 ACM SIGMOD international
    conference on Management of data, pages 1827.
    ACM Press, 1988.

69
Garlic
  • Proposed by IBM for use in several products
  • DB2 II and DiscoveryLink
  • The Garlic optimizer uses wrappers to gather
    costing info for developing a query plan
  • Goal is to allow Garlic to find a good plan
    without knowledge of the capabilities of the
    local site
  • Optimizer uses a set of strategy alternative
    rules (STARs) that are used to rewrite plans
  • Plans are a set of plan operators (POPs) that
    compose the query plan tree (sort, filter, scan,
    etc).
  • A generic pushdown POP that encapsulates work to
    be done at a local site is also included

70
Garlic
  • STARs are fired over the query in order to
    generate POPs
  • STARs can be seen as grammatical production
    rules
  • STAR's generate cost and cardinality information
    using input from the wrapper
  • Works in three phases

71
Step 1 Fire Access STARs
  • Access STARs are applied in order to enumerate
    plans that read data from a source
  • Plan space is pruned in order to remove plans
    that have the same or weaker cost properties
  • Just what is weaker is not detailed!

72
Step 1 Fire Access STARs
73
Step 1 Fire Access STARs
74
Step 2 Fire Join STARs
  • Join STARs are applied in order to enumerate all
    plans involving joins
  • Plan space is filled with all possible join
    combinations
  • Garlic considers bushy plans and left-deep plans
    as collocated data may make a bushy plan more
    efficient
  • I.e. joins might be able to be performed at a
    local site
  • Plan space is pruned in order to remove plans
    that have the same or weaker cost properties

75
Step 2 Fire Join STARs
76
Step 2 Fire Join STARs
77
Step 3 Fire FinishRoot STAR
  • The FinishRoot STAR is applied to provide
    orderings, selects, projects, etc. that were not
    already completed by some local site
  • I.e. the local site did not have the proper
    capabilities to perform that operation
  • The plan with lowest cost is then chosen for
    execution

78
Step 3 Fire FinishRoot STAR
79
Garlic
  • Garlic plans are built bottom-up using dynamic
    programming
  • This method will take large amounts of time to
    optimize plans containing even a small number of
    joins
  • An 8-way clique join (i.e. there is a join
    between every pair of relations) was shown to
    take approximately 4.5 min to optimize!
  • Considered the upper limit of the algorithm.
  • Wrappers are assumed to be able to gather
    statistics directly from the local site
  • Not always possible!

80
Outline
  • Introduction
  • Motivation
  • Issues
  • System Overview
  • Optimization
  • Semijoin Algorithm
  • Reduction Algorithm
  • The Garlic Approach
  • Conclusion
  • My Project

81
Conclusion
  • Many good algorithms exist for optimizing queries
    in federated systems
  • Those queries make use of statistics that must be
    gathered somehow
  • Probing queries
  • Utilities
  • Statistical sampling of data
  • etc.

82
Conclusion
  • Statistical accuracy is important
  • As we all know, accuracy directly affects
    estimates, which may directly affect algorithm
    choices
  • The optimizer is only as good as the statistics!

83
Questions?
84
Outline
  • Introduction
  • Motivation
  • Issues
  • System Overview
  • Optimization
  • Semijoin Algorithm
  • Reduction Algorithm
  • The Garlic Approach
  • Conclusion
  • My Project

85
My Project
  • There are many methods of statistics collection
    that have been proposed
  • Utility based methods dispatch an agent to gather
    data from a database and build statistics
  • Utility is invoked on a regular schedule to keep
    statistics up to date
  • Imposes great overhead at the site
  • Requires a DBA to set the schedule and rerun if
    the stats are out of date between scheduled runs

86
My Project
  • There are many methods of statistics collection
    that have been proposed
  • Piggybacking requests more data than is required
    by a user query and builds statistics using it
  • Ex ask for an additional column or remove filter
    predicates
  • We can develop accurate statistics with the
    returned data
  • Has the potential to significantly increase the
    overhead of a query at a local site as well as
    the running time of a query

87
My Project
  • There are many methods of statistics collection
    that have been proposed
  • Fuzzy cost models attempt to model the changing
    state of the database using probabilities about
    the parameters needed for costing
  • Builds a parameter by taking portions of the
    possible values (according to the probability
    that they will occur)
  • Models must be built offline adding a significant
    task to the DBA
  • What do we do to update the model once its
    built?
  • How do we determine what parameters to add to the
    model?

88
My Project
  • There are many methods of statistics collection
    that have been proposed
  • Data sampling allows us to build a statistically
    accurate value for a statistic
  • Different methods are used to draw a sample
  • Random draw a random sample of tuples
  • Stratified draw a sample of tuples from tuples
    classified into groups
  • Adaptive draw n tuples where n is determined to
    be a statistically relevant sample size or
    provides an error less than the relative error
    required by the application
  • Systemic draw a certain percentage of the number
    of tuples in the table
  • etc.

89
My Project
  • There are many methods of statistics collection
    that have been proposed
  • Among sample bias and estimation errors, sampling
    methods require direct access to the data in the
    tables
  • FDBS does not have direct access!
  • Not practical to run online
  • Might as well use a utility to get more accurate
    stats

90
My Project
  • The overriding theme
  • We want something that imposes limited overhead
  • We want something that calculates reasonably
    accurate statistics

91
My Project
  • WE CANNOT ASSUME ANY ACCESS TO DATA WILL BE
    AVAILABLE!!!
  • To the local sites, our FDBS is just another
    application, and can only do the things an
    application can do!
  • I.e. we cant dictate the access plan to use, etc.

92
My Project
  • Meet FLO
  • Federated Learning cOllector (FLO) is able to
    learn statistics from the data that is returned
    to answer user queries
  • Calculates statistics regularly according to the
    amount that they change
  • I.e. if a stat changes by 2, recalculate it in a
    day if a stat changes by 20, recalculate it at
    the next chance

93
My Project
  • Meet FLO
  • FLO does not request any additional data as
    piggybacking
  • FLO does not run any additional queries as
    utilities
  • FLO does not impose any additional overhead on
    the local sites, and stats are calculated offline

94
My Project
  • Meet FLO
  • Tradeoffs exist
  • Stats wont be perfectly accurate
  • Stats wont necessarily exist for a range of
    values
  • Questions exist
  • How often is often enough to recalculate stats?
  • How do we get initial statistics?
  • ?

95
My Project
  • Statistics Gathering References
  • E. Whalen. Oracle Performance Tuning and
    Optimization. SAMS Publishing, 1996.
  • Amira Rahal, Qiang Zhu, and Per-Ake Larson.
    Evolutionary techniques for updating query cost
    models in a dynamic multidatabase environment.
    The VLDB Journal, 13(2)162176, 2004.
  • A. H. H. Ngu, B. Harangsri, and J. Shepherd.
    Query size estimation for joins using systematic
    sampling. Distrib. Parallel Databases,
    15(3)237275, 2004.

96
My Project
  • Statistics Gathering References
  • Peter J. Haas and Arun N. Swami. Sequential
    sampling procedures for query size estimation.
    SIGMOD Rec., 21(2)341350, 1992.
  • Richard J. Lipton, Jeffrey F. Naughton, and
    Donovan A. Schneider. Practical selectivity
    estimation through adaptive sampling. In
    Proceedings of the 1990 ACM SIGMOD international
    conference on Management of data, pages 111. ACM
    Press, 1990.

97
My Project
  • Statistics Gathering References
  • Qiang Zhu, Brian Dunkel, Wing Lau, Suyun Chen,
    and Berni Schiefer. Piggyback statistics
    collection for query optimization Towards a
    self-maintaining database management system. The
    Computer Journal, 47(2)221244, March 2004.
  • Qiang Zhu and P. A. Larson. Query optimization
    using fuzzy set theory for multidatabase systems.
    In Proceedings of the 1993 conference of the
    Centre for Advanced Studies on Collaborative
    research, pages 848859, Toronto, Ontario,
    Canada, 1993. IBM Press.

98
My Project
  • Statistics Gathering References
  • Qiang Zhu and Per-Ake Larson. A query sampling
    method of estimating local cost parameters in a
    multidatabase system. In Proceedings of the Tenth
    International Conference on Data Engineering,
    February 14-18, 1994, Houston, Texas, USA, pages
    144153. IEEE Computer Society, 1994.
  • Qiang Zhu. An integrated method for estimating
    selectivities in a multidatabase system. In
    Proceedings of the 1993 conference of the Centre
    for Advanced Studies on Collaborative research,
    pages 832847. IBM Press, 1993.

99
My Project
  • Statistics Gathering References
  • Per-Ake Larson Qiang Zhu. Establishing a fuzzy
    cost model for query optimization in a
    multidatabase system. In System Sciences, 1994.
    Vol.II Software Technology, Proceedings of the
    Twenty-Seventh Hawaii International Conference
    on, volume 2, pages 263272, 1994.

100
Questions?
Write a Comment
User Comments (0)
About PowerShow.com