A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences - PowerPoint PPT Presentation

About This Presentation
Title:

A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences

Description:

to Enable Cost-Based Optimization. of Statement Sequences ... SQL statement sequence algebraic operator trees. Everything is a bucket / histogram: ... – PowerPoint PPT presentation

Number of Views:52
Avg rating:3.0/5.0
Slides: 26
Provided by: kraf6
Learn more at: http://www.adbis.org
Category:

less

Transcript and Presenter's Notes

Title: A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences


1
A Statistics Propagation Approachto Enable
Cost-Based Optimizationof Statement Sequences
  • Tobias Kraft, Holger Schwarz, Bernhard Mitschang
  • Institute of Parallel and Distributed Systems
  • University of Stuttgart

2
Overview
  • Motivation
  • Cost Estimation Approach
  • Histogram Propagation
  • Related Work
  • Experiments
  • Conclusion Future Work

3
MotivationThe Case for Optimization
  • Many of todays applicationsembed query
    generators.
  • Some of these generators notonly produce a
    single query buta sequence of SQL
    statements(e.g. MicroStrategy DSS tools).
  • Rewriting these sequences maylead to significant
    performanceimprovements!
  • Development of an optimizer based on rewrite
    rules and a heuristic priority-based control
    strategy
  • Coarse-Grained Optimization (CGO) VLDB03

4
MotivationStatement Sequences
CREATE TABLE q1 (custkey INTEGER, turnover1990
FLOAT) CREATE TABLE q2 (custkey INTEGER,
turnover1991 FLOAT) CREATE TABLE q3 (custkey
INTEGER, name VARCHAR(25)) INSERT INTO
q1 SELECT o.custkey, SUM(o.totalprice) FROM orde
rs o WHERE o.orderyear 1990 GROUP BY
o.custkey INSERT INTO q2 SELECT o.custkey,
SUM(o.totalprice) FROM orders o WHERE o.orderyea
r 1991 GROUP BY o.custkey INSERT INTO
q3 SELECT c.custkey, c.name FROM q1, q2,
customer c WHERE q1.custkey c.custkey
AND q1.custkey q2.custkey
AND q2.turnover1991 gt q1.turnover1990 DROP
TABLE q1 DROP TABLE q2
  • We focus on statement sequences that
  • compute the final result of a requestin a set of
    subsequent steps,
  • allow to share intermediate results bymultiple
    subsequent steps,
  • temporarily store intermediate resultsin tables
    that are being created anddropped within the
    sequence,
  • store the final result in a table that isnot
    being dropped within the sequence.

q1
q2
q3
5
MotivationProblems of the Heuristic Control
Strategy
  • In some scenarios a rule application may lead to
    deteriorationof performance.
  • Different behavior on different platforms and
    database management systems.
  • Alternative sequences of rule applications lead
    to different results.
  • Need for a cost-based approach.

6
Cost Estimation ApproachProblems Solutions
  • Cost estimates depend on the physical layout of
    the database and the capabilities and strategies
    of the DBMSs query optimizer.
  • A cost model on top of the DBMS is no feasible
    solution.
  • Make use of the cost estimates provided by the
    DBMSs query optimizer.
  • DBMSs only provide cost estimates for statements
    on existing tables.
  • Execute CREATE TABLES statements before cost
    estimation.
  • Missing statistics for the created tables causes
    the DBMSs query optimizer to use default values
    for cardinality and selectivity.
  • Propagate statistics through the INSERT
    statements.
  • Make these propagated statistics available to the
    DBMSs optimizer.

7
Cost Estimation Approach Algorithm of Cost
Estimation
  • Input A statement sequence S.
  • Output A cost estimate for S.
  • totalcosts 0
  • foreach CREATE TABLE statement c in S
  • Execute c on the underlying database system.
  • foreach INSERT statement i in S (in the order
    given by S)
  • Retrieve a cost estimate for i from the
    optimizer of the underlying database system.
  • Add this cost estimate to totalcosts.
  • Translate i into an algebraic tree.
  • Retrieve histograms for the base tables from the
    underlying database system and propagate them
    through the algebraic tree to retrieve histograms
    for the target table of i.
  • Store the resulting histograms in the catalog of
    the underlying database system.
  • foreach CREATE TABLE statement c in S
  • Drop the table that has been created by c.
  • return totalcosts.

8
Cost Estimation Approach Architectural Overview
9
Histogram PropagationOverview
  • We have adopted techniques from approximate query
    answering and added some extensions
  • interval arithmetic for arithmetic terms,
  • heuristics for grouping and aggregation,
  • unification of comparison operators,
  • normalization of histograms,
  • common subexpressions
  • SQL statement sequence ? algebraic operator
    trees.
  • Everything is a bucket / histogram
  • NULL values are represented by a special bucket.
  • Constant values (used in arithmetic terms)
    arerepresented by a histogram with a single
    bucket.
  • Sets of constants (used in IN-predicates)
    arerepresented by a histogram.

10
Histogram PropagationHistograms
  • Buckets are 4-tuples(low, high, card, dv )
  • The NULL value(null, null, card, 1)
  • A constant value c (c, c, card, 1)

11
Histogram PropagationAlgebra and Propagation
  • Operators projection, selection, cartesian
    product, union, difference and grouping
    (including aggregation).
  • A join can be represented by a cartesian product
    followed by a selection that contains the join
    condition.
  • Arithmetic terms (projection / selection) and
    predicates (selection) are also represented by
    operator trees.
  • Propagation is done by recursively traversing the
    algebraic operator tree and the arithmetic trees
    in a post order manner.

12
Histogram PropagationAlgebra and Propagation
  • Arithmetic Operators
  • histograms as input
  • a result histogram as output
  • iterate over all bucket combinations
  • compute a result bucket for each bucket
    combination
  • Comparison Operators
  • histograms as input
  • selectivity as output
  • some operators also provide modified histograms
  • iterate over all bucket combinations
  • compute a selectivity for each bucket
    combination(and adapt the buckets)

13
Histogram PropagationInterval Arithmetic for
Arithmetic Terms
  • Example of using interval arithmetic for adding
    two buckets
  • BO.low BI1.low BI2.low
  • BO.high BI1.high BI2.high

14
Histogram PropagationHeuristics for Grouping and
Aggregation
  • Determine amount of groups and average group
    sizes.
  • Use these values together with the histogram of
    the attribute that should be aggregated to
    compute the aggregate buckets.

SUM(A) ?
100 groupswithgroup size 50
15
Histogram PropagationUnification of Comparison
Operators
  • Comparison operators compare histograms.
  • A single operator implementation can be used for
    different purposes.
  • No separate join implementation is necessary.
  • E.g., the comparison operator can be used in
  • a predicate comparing two attributes,
  • a predicate comparing an attribute and a
    constant,
  • a join condition of an equi-join? cartesian
    product followed by a selection thatcontains the
    join-condition as predicate,
  • an IN predicate comparing an attribute with a set
    of values? join with a table (represented by a
    histogram) thatcontains the values of the value
    set.

C
C1

Cn
16
Related Workon Histogram Propagation
  • Papers on Approximate Query Answering
  • Yannis E. Ioannidis, Viswanath PoosalaHistogram-
    Based Approximation of Set-Valued Query-Answers.
    VLDB 1999
  • Viswanath Poosala, Venkatesh Ganti, Yannis E.
    IoannidisApproximate Query Answering using
    Histograms.IEEE Data Eng. Bull. 1999
  • Transformation of SQL queries on tables into SQL
    queries on histograms.
  • Join is done by creating two tables under the
    uniform spread assumption such that each table
    represents a value distribution which fits to the
    respective input histogram.
  • Only equi-joins, no support for predicates that
    compare two attributes, no grouping and no
    support of arithmetic terms.

17
ExperimentsExperimental Setup
  • Database TPC-H benchmark database on IBM DB2 V9.
  • Sample sequences
  • Sequence S1 generated by the MicroStrategy DSS
    tool suite.
  • A set of semantically equivalent sequences that
    result from the application of the CGO rewrite
    rules.
  • Variants of those sequences resulting from the
    use of different values in the filter predicates
    of the sequence.

18
ExperimentsResults for Alternative Sequences
19
ExperimentsResults for Different Selectivities
20
Conclusion Future Work
  • Cost estimates for statement sequences are
    necessary to avoid rule applications that lead to
    performance deterioration.
  • Making use of the cost estimates of the
    underlying DBMS is a feasible solution.
  • Histogram propagation is necessary to get useable
    cost estimates for statements that access
    intermediate-result tables and to avoid bad
    plans.
  • Future Work
  • A cost-based control strategy.
  • Extensive measurements.

21
  • Thank you foryour attention!

22
(No Transcript)
23
Cost Estimation Approach Statistics API
  • is an interface that offers uniform
    DBMS-independent access to DBMS statistics, meta
    data and optimizer estimates
  • provides a flexible histogram format that
    abstracts from proprietary data structures used
    in different DBMSs
  • implementations exist for IBM DB2, Oracle and MS
    SQL Server

24
Histogram PropagationCommon Subexpressions
  • Identify arithmetic terms that appear multiple
    times in the different clauses of an SQL query.
  • Otherwise, the arithmetic term will be recomputed
    for each appearance and modifications of
    histograms may get lost.

25
Histogram PropagationNormalization of Histograms
  • Worst case
  • The number of buckets in the output histogram is
    the product of the number of buckets in the input
    histograms.
  • Serializing a histogram may double the number of
    buckets.
  • Normalization
  • prior to the enumeration phase and / or after an
    output histogram has been produced,
  • reduces the number of buckets by merging adjacent
    buckets,
  • trade-off between complexity / performance and
    quality.
Write a Comment
User Comments (0)
About PowerShow.com