A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences

About This Presentation

Title:

A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences

Description:

to Enable Cost-Based Optimization. of Statement Sequences ... SQL statement sequence algebraic operator trees. Everything is a bucket / histogram: ... – PowerPoint PPT presentation

Number of Views:52

Avg rating:3.0/5.0

Slides: 26

Provided by: kraf6

Learn more at: http://www.adbis.org

Category:

more less

Transcript and Presenter's Notes

Title: A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences

1
A Statistics Propagation Approachto Enable
Cost-Based Optimizationof Statement Sequences

Tobias Kraft, Holger Schwarz, Bernhard Mitschang
Institute of Parallel and Distributed Systems
University of Stuttgart

2
Overview

Motivation
Cost Estimation Approach
Histogram Propagation
Related Work
Experiments
Conclusion Future Work

3
MotivationThe Case for Optimization

Many of todays applicationsembed query
generators.
Some of these generators notonly produce a
single query buta sequence of SQL
statements(e.g. MicroStrategy DSS tools).
Rewriting these sequences maylead to significant
performanceimprovements!
Development of an optimizer based on rewrite
rules and a heuristic priority-based control
strategy
Coarse-Grained Optimization (CGO) VLDB03

4
MotivationStatement Sequences
CREATE TABLE q1 (custkey INTEGER, turnover1990
FLOAT) CREATE TABLE q2 (custkey INTEGER,
turnover1991 FLOAT) CREATE TABLE q3 (custkey
INTEGER, name VARCHAR(25)) INSERT INTO
q1 SELECT o.custkey, SUM(o.totalprice) FROM orde
rs o WHERE o.orderyear 1990 GROUP BY
o.custkey INSERT INTO q2 SELECT o.custkey,
SUM(o.totalprice) FROM orders o WHERE o.orderyea
r 1991 GROUP BY o.custkey INSERT INTO
q3 SELECT c.custkey, c.name FROM q1, q2,
customer c WHERE q1.custkey c.custkey
AND q1.custkey q2.custkey
AND q2.turnover1991 gt q1.turnover1990 DROP
TABLE q1 DROP TABLE q2

We focus on statement sequences that
compute the final result of a requestin a set of
subsequent steps,
allow to share intermediate results bymultiple
subsequent steps,
temporarily store intermediate resultsin tables
that are being created anddropped within the
sequence,
store the final result in a table that isnot
being dropped within the sequence.

q1
q2
q3
5
MotivationProblems of the Heuristic Control
Strategy

In some scenarios a rule application may lead to
deteriorationof performance.
Different behavior on different platforms and
database management systems.
Alternative sequences of rule applications lead
to different results.
Need for a cost-based approach.

6
Cost Estimation ApproachProblems Solutions

Cost estimates depend on the physical layout of
the database and the capabilities and strategies
of the DBMSs query optimizer.
A cost model on top of the DBMS is no feasible
solution.
Make use of the cost estimates provided by the
DBMSs query optimizer.
DBMSs only provide cost estimates for statements
on existing tables.
Execute CREATE TABLES statements before cost
estimation.
Missing statistics for the created tables causes
the DBMSs query optimizer to use default values
for cardinality and selectivity.
Propagate statistics through the INSERT
statements.
Make these propagated statistics available to the
DBMSs optimizer.

7
Cost Estimation Approach Algorithm of Cost
Estimation

Input A statement sequence S.
Output A cost estimate for S.
totalcosts 0
foreach CREATE TABLE statement c in S
Execute c on the underlying database system.
foreach INSERT statement i in S (in the order
given by S)
Retrieve a cost estimate for i from the
optimizer of the underlying database system.
Add this cost estimate to totalcosts.
Translate i into an algebraic tree.
Retrieve histograms for the base tables from the
underlying database system and propagate them
through the algebraic tree to retrieve histograms
for the target table of i.
Store the resulting histograms in the catalog of
the underlying database system.
foreach CREATE TABLE statement c in S
Drop the table that has been created by c.
return totalcosts.

8
Cost Estimation Approach Architectural Overview
9
Histogram PropagationOverview

We have adopted techniques from approximate query
answering and added some extensions
interval arithmetic for arithmetic terms,
heuristics for grouping and aggregation,
unification of comparison operators,
normalization of histograms,
common subexpressions
SQL statement sequence ? algebraic operator
trees.
Everything is a bucket / histogram
NULL values are represented by a special bucket.
Constant values (used in arithmetic terms)
arerepresented by a histogram with a single
bucket.
Sets of constants (used in IN-predicates)
arerepresented by a histogram.

10
Histogram PropagationHistograms

Buckets are 4-tuples(low, high, card, dv )
The NULL value(null, null, card, 1)
A constant value c (c, c, card, 1)

11
Histogram PropagationAlgebra and Propagation

Operators projection, selection, cartesian
product, union, difference and grouping
(including aggregation).
A join can be represented by a cartesian product
followed by a selection that contains the join
condition.
Arithmetic terms (projection / selection) and
predicates (selection) are also represented by
operator trees.
Propagation is done by recursively traversing the
algebraic operator tree and the arithmetic trees
in a post order manner.

12
Histogram PropagationAlgebra and Propagation

Arithmetic Operators
histograms as input
a result histogram as output
iterate over all bucket combinations
compute a result bucket for each bucket
combination
Comparison Operators
histograms as input
selectivity as output
some operators also provide modified histograms
iterate over all bucket combinations
compute a selectivity for each bucket
combination(and adapt the buckets)

13
Histogram PropagationInterval Arithmetic for
Arithmetic Terms

Example of using interval arithmetic for adding
two buckets
BO.low BI1.low BI2.low
BO.high BI1.high BI2.high

14
Histogram PropagationHeuristics for Grouping and
Aggregation

Determine amount of groups and average group
sizes.
Use these values together with the histogram of
the attribute that should be aggregated to
compute the aggregate buckets.

SUM(A) ?
100 groupswithgroup size 50
15
Histogram PropagationUnification of Comparison
Operators

Comparison operators compare histograms.
A single operator implementation can be used for
different purposes.
No separate join implementation is necessary.
E.g., the comparison operator can be used in
a predicate comparing two attributes,
a predicate comparing an attribute and a
constant,
a join condition of an equi-join? cartesian
product followed by a selection thatcontains the
join-condition as predicate,
an IN predicate comparing an attribute with a set
of values? join with a table (represented by a
histogram) thatcontains the values of the value
set.

C
C1

Cn
16
Related Workon Histogram Propagation

Papers on Approximate Query Answering
Yannis E. Ioannidis, Viswanath PoosalaHistogram-
Based Approximation of Set-Valued Query-Answers.
VLDB 1999
Viswanath Poosala, Venkatesh Ganti, Yannis E.
IoannidisApproximate Query Answering using
Histograms.IEEE Data Eng. Bull. 1999
Transformation of SQL queries on tables into SQL
queries on histograms.
Join is done by creating two tables under the
uniform spread assumption such that each table
represents a value distribution which fits to the
respective input histogram.
Only equi-joins, no support for predicates that
compare two attributes, no grouping and no
support of arithmetic terms.

17
ExperimentsExperimental Setup

Database TPC-H benchmark database on IBM DB2 V9.
Sample sequences
Sequence S1 generated by the MicroStrategy DSS
tool suite.
A set of semantically equivalent sequences that
result from the application of the CGO rewrite
rules.
Variants of those sequences resulting from the
use of different values in the filter predicates
of the sequence.

18
ExperimentsResults for Alternative Sequences
19
ExperimentsResults for Different Selectivities
20
Conclusion Future Work

Cost estimates for statement sequences are
necessary to avoid rule applications that lead to
performance deterioration.
Making use of the cost estimates of the
underlying DBMS is a feasible solution.
Histogram propagation is necessary to get useable
cost estimates for statements that access
intermediate-result tables and to avoid bad
plans.
Future Work
A cost-based control strategy.
Extensive measurements.

Thank you foryour attention!

22
(No Transcript)
23
Cost Estimation Approach Statistics API

is an interface that offers uniform
DBMS-independent access to DBMS statistics, meta
data and optimizer estimates
provides a flexible histogram format that
abstracts from proprietary data structures used
in different DBMSs
implementations exist for IBM DB2, Oracle and MS
SQL Server

24
Histogram PropagationCommon Subexpressions

Identify arithmetic terms that appear multiple
times in the different clauses of an SQL query.
Otherwise, the arithmetic term will be recomputed
for each appearance and modifications of
histograms may get lost.

25
Histogram PropagationNormalization of Histograms

Worst case
The number of buckets in the output histogram is
the product of the number of buckets in the input
histograms.
Serializing a histogram may double the number of
buckets.
Normalization
prior to the enumeration phase and / or after an
output histogram has been produced,
reduces the number of buckets by merging adjacent
buckets,
trade-off between complexity / performance and
quality.

Write a Comment

User Comments (0)

About PowerShow.com

A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences - PowerPoint PPT Presentation

A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences

to Enable Cost-Based Optimization. of Statement Sequences ... SQL statement sequence algebraic operator trees. Everything is a bucket / histogram: ... – PowerPoint PPT presentation