Title: A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences
1A Statistics Propagation Approachto Enable
Cost-Based Optimizationof Statement Sequences
- Tobias Kraft, Holger Schwarz, Bernhard Mitschang
- Institute of Parallel and Distributed Systems
- University of Stuttgart
2Overview
- Motivation
- Cost Estimation Approach
- Histogram Propagation
- Related Work
- Experiments
- Conclusion Future Work
3MotivationThe Case for Optimization
- Many of todays applicationsembed query
generators. - Some of these generators notonly produce a
single query buta sequence of SQL
statements(e.g. MicroStrategy DSS tools). - Rewriting these sequences maylead to significant
performanceimprovements! - Development of an optimizer based on rewrite
rules and a heuristic priority-based control
strategy - Coarse-Grained Optimization (CGO) VLDB03
4MotivationStatement Sequences
CREATE TABLE q1 (custkey INTEGER, turnover1990
FLOAT) CREATE TABLE q2 (custkey INTEGER,
turnover1991 FLOAT) CREATE TABLE q3 (custkey
INTEGER, name VARCHAR(25)) INSERT INTO
q1 SELECT o.custkey, SUM(o.totalprice) FROM orde
rs o WHERE o.orderyear 1990 GROUP BY
o.custkey INSERT INTO q2 SELECT o.custkey,
SUM(o.totalprice) FROM orders o WHERE o.orderyea
r 1991 GROUP BY o.custkey INSERT INTO
q3 SELECT c.custkey, c.name FROM q1, q2,
customer c WHERE q1.custkey c.custkey
AND q1.custkey q2.custkey
AND q2.turnover1991 gt q1.turnover1990 DROP
TABLE q1 DROP TABLE q2
- We focus on statement sequences that
- compute the final result of a requestin a set of
subsequent steps, - allow to share intermediate results bymultiple
subsequent steps, - temporarily store intermediate resultsin tables
that are being created anddropped within the
sequence, - store the final result in a table that isnot
being dropped within the sequence.
q1
q2
q3
5MotivationProblems of the Heuristic Control
Strategy
- In some scenarios a rule application may lead to
deteriorationof performance. - Different behavior on different platforms and
database management systems. - Alternative sequences of rule applications lead
to different results. - Need for a cost-based approach.
6Cost Estimation ApproachProblems Solutions
- Cost estimates depend on the physical layout of
the database and the capabilities and strategies
of the DBMSs query optimizer. - A cost model on top of the DBMS is no feasible
solution. - Make use of the cost estimates provided by the
DBMSs query optimizer. - DBMSs only provide cost estimates for statements
on existing tables. - Execute CREATE TABLES statements before cost
estimation. - Missing statistics for the created tables causes
the DBMSs query optimizer to use default values
for cardinality and selectivity. - Propagate statistics through the INSERT
statements. - Make these propagated statistics available to the
DBMSs optimizer.
7Cost Estimation Approach Algorithm of Cost
Estimation
- Input A statement sequence S.
- Output A cost estimate for S.
- totalcosts 0
- foreach CREATE TABLE statement c in S
- Execute c on the underlying database system.
- foreach INSERT statement i in S (in the order
given by S) - Retrieve a cost estimate for i from the
optimizer of the underlying database system. - Add this cost estimate to totalcosts.
- Translate i into an algebraic tree.
- Retrieve histograms for the base tables from the
underlying database system and propagate them
through the algebraic tree to retrieve histograms
for the target table of i. - Store the resulting histograms in the catalog of
the underlying database system. - foreach CREATE TABLE statement c in S
- Drop the table that has been created by c.
- return totalcosts.
8Cost Estimation Approach Architectural Overview
9Histogram PropagationOverview
- We have adopted techniques from approximate query
answering and added some extensions - interval arithmetic for arithmetic terms,
- heuristics for grouping and aggregation,
- unification of comparison operators,
- normalization of histograms,
- common subexpressions
- SQL statement sequence ? algebraic operator
trees. - Everything is a bucket / histogram
- NULL values are represented by a special bucket.
- Constant values (used in arithmetic terms)
arerepresented by a histogram with a single
bucket. - Sets of constants (used in IN-predicates)
arerepresented by a histogram.
10Histogram PropagationHistograms
- Buckets are 4-tuples(low, high, card, dv )
- The NULL value(null, null, card, 1)
- A constant value c (c, c, card, 1)
11Histogram PropagationAlgebra and Propagation
- Operators projection, selection, cartesian
product, union, difference and grouping
(including aggregation). - A join can be represented by a cartesian product
followed by a selection that contains the join
condition. - Arithmetic terms (projection / selection) and
predicates (selection) are also represented by
operator trees. - Propagation is done by recursively traversing the
algebraic operator tree and the arithmetic trees
in a post order manner.
12Histogram PropagationAlgebra and Propagation
- Arithmetic Operators
- histograms as input
- a result histogram as output
- iterate over all bucket combinations
- compute a result bucket for each bucket
combination - Comparison Operators
- histograms as input
- selectivity as output
- some operators also provide modified histograms
- iterate over all bucket combinations
- compute a selectivity for each bucket
combination(and adapt the buckets)
13Histogram PropagationInterval Arithmetic for
Arithmetic Terms
- Example of using interval arithmetic for adding
two buckets - BO.low BI1.low BI2.low
- BO.high BI1.high BI2.high
14Histogram PropagationHeuristics for Grouping and
Aggregation
- Determine amount of groups and average group
sizes. - Use these values together with the histogram of
the attribute that should be aggregated to
compute the aggregate buckets.
SUM(A) ?
100 groupswithgroup size 50
15Histogram PropagationUnification of Comparison
Operators
- Comparison operators compare histograms.
- A single operator implementation can be used for
different purposes. - No separate join implementation is necessary.
- E.g., the comparison operator can be used in
- a predicate comparing two attributes,
- a predicate comparing an attribute and a
constant, - a join condition of an equi-join? cartesian
product followed by a selection thatcontains the
join-condition as predicate, - an IN predicate comparing an attribute with a set
of values? join with a table (represented by a
histogram) thatcontains the values of the value
set.
C
C1
Cn
16Related Workon Histogram Propagation
- Papers on Approximate Query Answering
- Yannis E. Ioannidis, Viswanath PoosalaHistogram-
Based Approximation of Set-Valued Query-Answers.
VLDB 1999 - Viswanath Poosala, Venkatesh Ganti, Yannis E.
IoannidisApproximate Query Answering using
Histograms.IEEE Data Eng. Bull. 1999 - Transformation of SQL queries on tables into SQL
queries on histograms. - Join is done by creating two tables under the
uniform spread assumption such that each table
represents a value distribution which fits to the
respective input histogram. - Only equi-joins, no support for predicates that
compare two attributes, no grouping and no
support of arithmetic terms.
17ExperimentsExperimental Setup
- Database TPC-H benchmark database on IBM DB2 V9.
- Sample sequences
- Sequence S1 generated by the MicroStrategy DSS
tool suite. - A set of semantically equivalent sequences that
result from the application of the CGO rewrite
rules. - Variants of those sequences resulting from the
use of different values in the filter predicates
of the sequence.
18ExperimentsResults for Alternative Sequences
19ExperimentsResults for Different Selectivities
20Conclusion Future Work
- Cost estimates for statement sequences are
necessary to avoid rule applications that lead to
performance deterioration. - Making use of the cost estimates of the
underlying DBMS is a feasible solution. - Histogram propagation is necessary to get useable
cost estimates for statements that access
intermediate-result tables and to avoid bad
plans. - Future Work
- A cost-based control strategy.
- Extensive measurements.
21- Thank you foryour attention!
22(No Transcript)
23Cost Estimation Approach Statistics API
- is an interface that offers uniform
DBMS-independent access to DBMS statistics, meta
data and optimizer estimates - provides a flexible histogram format that
abstracts from proprietary data structures used
in different DBMSs - implementations exist for IBM DB2, Oracle and MS
SQL Server
24Histogram PropagationCommon Subexpressions
- Identify arithmetic terms that appear multiple
times in the different clauses of an SQL query. - Otherwise, the arithmetic term will be recomputed
for each appearance and modifications of
histograms may get lost.
25Histogram PropagationNormalization of Histograms
- Worst case
- The number of buckets in the output histogram is
the product of the number of buckets in the input
histograms. - Serializing a histogram may double the number of
buckets. - Normalization
- prior to the enumeration phase and / or after an
output histogram has been produced, - reduces the number of buckets by merging adjacent
buckets, - trade-off between complexity / performance and
quality.