Relational Query Optimization presentation

About This Presentation

Transcript and Presenter's Notes

Title: Relational Query Optimization

1
Relational Query Optimization

Chapter 15 Ramakrishnan Gehrke
(Sections 15.1-15.6)

2
What you will learn from this lecture

Cost-based query optimization (System R)
Plan space considered
How cost is estimated
Selectivity estimation using histograms.

3
Highlights of System R Optimizer

Impact
Most widely used currently works well for lt 10
joins.
Cost estimation Approximations used.
Statistics, maintained in system catalogs, used
to estimate cost of operations and result sizes.
System R considers combination of CPU and I/O
costs.
Plan Space Too large, must be pruned.
Only the space of left-deep plans is considered.
Left-deep plans allow output of each operator to
be pipelined into the next operator without
storing it in a temporary relation.
Cartesian products avoided.

4
Schema for Examples
Ratings(uid integer, sid integer, time date,
rating integer) Songs(sid integer, sname
string, genre string, year date)

Ratings
Each tuple is 40 bytes long, 100 tuples per
page, 1000 pages.
Songs
Each tuple is 50 bytes long, 80 tuples per page,
500 pages.

5
Query Blocks Units of Optimization
SELECT S.sname FROM Songs S WHERE S.sid IN
(SELECT R.sid FROM Ratings R
WHERE R.ratinggt7 AND R.time gt S.year)

A SQL query is parsed into a collection of query
blocks, and these are optimized one block at a
time.
Nested blocks are usually treated as calls to a
subroutine, made once per outer tuple. (This is
an over-simplification, but serves for now.)

Outer block
Nested block

For each block, the plans considered are
All available access methods, for each reln in
FROM clause.
All left-deep join trees (i.e., all ways to
join the relations one-at-a-time, with the inner
reln in the FROM clause, considering all reln
permutations and join methods.)
i.e., we pipeline what is being computed.

6
Cost Estimation (Recap 1)

For each plan considered, must estimate cost
Must estimate cost of each operation in plan
tree.
Depends on input cardinalities.
Weve already discussed how to estimate the cost
of operations (sequential scan, index scan,
joins, etc.)
Must estimate size of result for each operation
in tree!
Use information about the input relations.
For selections and joins, assume independence of
predicates.
Well discuss the System R cost estimation
approach.
Inexact, but works ok in practice.
More sophisticated techniques known now.

7
Statistics and Catalogs (Recap 2)

Need information about the relations and indexes
involved. Catalogs typically contain at least
tuples (NTuples) and pages (NPages) for each
relation.
distinct key values (NKeys) and NPages for each
index.
Index height, low/high key values (Low/High) for
each tree index.
Catalogs updated periodically.
Updating whenever data changes is too expensive
lots of approximation anyway, so slight
inconsistency ok.
More detailed information (e.g., histograms of
the values in some field) are sometimes stored.

8
Size Estimation and Reduction Factors (Recap 3)
SELECT attribute list FROM relation list WHERE
term1 AND ... AND termk

Consider a query block
Maximum tuples in result is the product of the
cardinalities of relations in the FROM clause.
Reduction factor (RF) associated with each term
reflects the impact of the term in reducing
result size. Result cardinality Max tuples
product of all RFs.
Implicit assumption that terms are independent!
Term colvalue has RF 1/NKeys(I), given index I
on col
Term col1col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
Term colgtvalue has RF (High(I)-value)/(High(I)-Low
(I) 1)

9
Histograms

Example Songs(sid, sname, genre, year). Suppose
recorded years ?1994,2008 and there are 45
tuples.
Suppose the year values occur in the following
pattern

year frequency
94 2
95 3
96 3
97 1
98 2
99 1
00 3
01 8
02 4
03 2
04 0
05 1
06 2
07 4
08 9

For year gt 07, what does our
uniformity assumption tell us?
RF 1/15, so 45 x 1/15 3
tuples qualify. But in reality, 9
tuples do!

frequency table

Can we improve estimation
by maintaining small amount of
additional info.?
Idea repeat what we did for the
interval 94,08 to subintervals!

10
Histogram Generation Use
Can slice up the frequency table using different
schemes. Each scheme leads to a type of
histogram.
Suppose we want to create a Histogram with 5
buckets.
age frequency
94 2
95 3
96 3
97 1
98 2
99 1
00 3
01 8
02 4
03 2
04 0
05 1
06 2
07 4
08 9
Bucket (Bin)
Split up the range into subintervals of equal
width.
Make uniformity assumption inside each bucket.
e.g., 95 is estimated to occur 8/3 2.67
times (approx).
11
Equiwidth Histograms

Equiwidth split 94,08 into 5 buckets (i.e.,
bins) of equal width sum up frequencies
bin-wide.

Frequency ? Ntuples. Note this is already much
closer to actual distribution. Whats our
current estimate for year gt 07? 1/3 x 15 5
tuples. Compare with actual data. Compare with
uniformity assumption. Uniformity assumption
histogram with 1 bucket!
bucket frequency
94-96 8
97-99 4
00-02 15
03-05 3
06-08 15
Equiwidth histogram With 5 buckets (bins).
12
Equidepth Histogram Generation
Can slice up the frequency table using different
schemes. Each scheme leads to a type of
histogram.
Want to make a 5 bucket Histogram again.
year frequency
94 2
95 3
96 3
97 1
98 2
99 1
00 3
01 8
02 4
03 2
04 0
05 1
06 2
07 4
08 9
Split up the range into subintervals such that
each contains about the same total
occurrences, i.e., 45/5 9 in our example.
13
Equidepth Histograms

Equidepth

Estimate for year gt 07 is 9.
Even better than equiwidth HG.
Happens to be exact in this case.
What is the maximum error in
estimate for EW HG and ED HG
in the present example?
What are the main differences
between equiwidth and
equidepth HGs?

bucket frequency
94-97 9
98-00 6
01-02 12
03-07 9
08 9
14
A slightly different slicing

Some of the bars have been moved. The slicing
cannot be perfect in general.

year frequency
94 2
95 3
96 3
97 1
98 2
99 1
00 3
01 8
02 4
03 2
04 0
05 1
06 2
07 4
08 9
15
Equidepth Histograms

Equidepth

How does this HG compare with
the previous ED HG?

bucket frequency
94-98 11
99-01 12
02-05 7
06-07 6
08 9
First ED HG.
bucket frequency
94-97 9
98-00 6
01-02 12
03-07 9
08 9
Which HG would you put your money on?
16
Histogram Use

How do you calculate RF using a histogram?
Straightforward for Attrval.
What about Attr lt val? (we use 1st EDHG below.)
E.g. year gt 98
tuplesuniform 10/15 x 45 30. RFuniform
30/45 10/15 2/3.
tuplesewhg 1 x 4/3 33 34.33. RFewhg
34.33/45.
tuplesedhg 2 x 6/3 30 34. RFedhg 34/45.
Actual (tuples with year gt 98) 34. Actual RF
34/45.
In general, can define optimal histograms (given
buckets) to minimize expected error.
Interesting algorithmic problem with lots of
research literature.
HGs useful for selectivity estimation for
multi-dimensional queries.

17
Relational Algebra Equivalences

Allow us to choose different join orders and to
push selections and projections ahead of joins.
Selections
(Cascade)

(Commute)

Projections

(Cascade)
when a1 ? a2 ? ? an.
(Associative)

Joins

R (S T) (R S) T
(Commutative)
(R S) (S R)
R (S T) (T R) S

Show that

18
More Equivalences

A projection commutes with a selection that only
uses attributes retained by the projection.
Selection between attributes of the two arguments
of a cross-product converts cross-product to a
join.
A selection on just attributes of R commutes with
R S. (i.e., (R S)
(R) S )
Similarly, if a projection follows a join R
S, we can push it by retaining only attributes
of R (and S) that are needed for the join or are
kept by the projection.

19
Some more equivalences

Join vs. Group-by
?_R.sid max(time)(Songs Ratings) ?
?_R.sid, max(time) ( ?_R.sid
max(time)(Ratings) Songs)
? ?_R.sid max(time)(Ratings).
provided the RIC Ratings.sid ? Songs.sid holds
(needed only for second equivalence).
Does the equivalence hold for other aggregate
functions e.g., what about avg(rating)? or
count(time)?

R.sidS.sid
R.sidS.sid
20
One more on Group-By and Join.

??_A agg(D)r(A,B) s(B,C,D) ??
?_A,agg(D)(r(A,B) ? ?_Bagg(D)(s(B,C,D))).
Does this equivalence always hold? What
constraints, if any, are needed for it to hold?
If the constraint is relaxed, what then?

21
Enumeration of Alternative Plans

There are two main cases
Single-relation plans
Multiple-relation plans
For queries over a single relation, queries
consist of a combination of selects, projects,
and aggregate ops
Each available access path (file i.e., table
scan/index scan/index probe) is considered, and
the one with the least estimated cost is chosen.
The different operations are essentially carried
out together (e.g., if an index is used for a
selection, projection is done for each retrieved
tuple, and the resulting tuples are pipelined
into the aggregate computation).

22
Cost Estimates for Single-Relation Plans

Index I on primary key matches equality
selection
Cost is Height(I)1 for a B tree, about 1.2 for
hash index.
Additional cost if alternative 2 or 3 is used.
Clustered index I matching one or more selects
(NPages(I)NPages(R)) product of RFs of
matching selects. this is mainly for secondary
keys.
Non-clustered index I matching one or more
selects
(NPages(I)NTuples(R)) product of RFs of
matching selects.
Sequential scan of file
NPages(R).
Note Typically, no duplicate elimination on
projections! (Exception Done on answers if user
says DISTINCT.)

secondary key
23
Example
SELECT S.sid FROM Songs S WHERE
S.genrefriendship

If we have an index on genre
(1/NKeys(I)) NTuples(R) (1/10) 40000 tuples
retrieved. assuming, distinct genres 10.
Suppose index size Npages(I) 50 pages.
Clustered index (1/NKeys(I))
(NPages(I)NPages(R)) (1/10) (50500) pages
are retrieved. (This is the cost.)
Unclustered index (1/NKeys(I))
(NPages(I)NTuples(R)) (1/10) (5040000)
pages are retrieved.
Doing a file scan
We retrieve all file pages (500).

24
An Example with Aggregation

SELECT year, COUNT()
FROM Songs
WHERE yeargt2000 AND genrefriendship
GROUP BY year
HAVING COUNT DISTINCT (sname) gt 2
PROJ_year, count(sname) (
HAVING_count disinct (sname)gt2(
GROUP BY_year count distinct(sname) (
PROJ_year,sname(
SEL_yeargt2000genrefriendship
(Songs))))).

25
Plan w/o indexes

Scan Songs file apply SEL/PROJ on the fly.
Write out result into temp.
Sort by year (for GROUP BY) aggregate during
merge do HAVING on the fly.
Whats the cost?
500 I/Os for scan
RF for yeargt2000 is 0.5 (approx.) RF for
genrefriendship is 0.1 (used default value).
RF 0.5 x 0.1 PROJ reduces size by 0.5 (if all
fields have the same size). Temp. size 500 x
0.5 x 0.1 x 0.5 13 pages (approx), so 13 I/Os
for writing it.
3 x 13 for sorting it. can be less if we have
sufficient buffer space.
Total 552 I/Os.

26
Plans w/ indexes.

Pick one (whatever QOzer estimates is best) index
among those available probe the index and verify
unmatched selection conditions as pages are read
in.
Use multiple indexes do a rid intersection then
fetch resulting pages.
If GROUP-BY attrs are a prefix of a Btree index,
use the index to retrieve tuples in sort order
(simplifies GROUP-BY). Rest of the ops done on
each tuple, then each group.
Suppose there is a dense index that includes all
attrs mentioned in query. Then only need to scan
the indexs data entries. If index matches some
of the SEL conditions, even better. If index is
Btree and GROUP-BY attrs form a prefix, can
avoid sorting!
Does it matter if this index is clustered or not?
(what is a dense index? and what is a sparse
index?)
Note we can keep attr-values as data entries
even if they are not indexed! E.g., index on
genre, but data entry -- (genre, year). Why
/When would this help?

27
Plans w/ indexes Example.

Revisit query data entries (key, rid) pairs.
Btree index on year, hash index on genre, Btree
index on (year, genre, sname).
1. Fetch data entries using hash index on genre
then records then apply remaining ops.
Does clustering make a difference?
2. Use both Btree index on year and hash index
on genre intersect rids and fetch data.
3. Use only Btree index on year (in sorted
order). Helps greatly for GROUP-BY.
4. Scan only the Btree index on (year, genre,
sname) fetch data entries with yeargt2000 and in
sorted order. Then apply remaining ops. (index
could be unclustered.) index must be dense for
this to be work!

28
Queries Over Multiple Relations

Fundamental decision in System R only left-deep
join trees are considered.
As the number of joins increases, the number of
alternative plans grows rapidly we need to
restrict the search space.
Left-deep trees allow us to generate all fully
pipelined plans.
Intermediate results not written to temporary
files.
Not all left-deep plans are fully pipelined
(e.g., Sort-Merge join).

29
Enumeration of Left-Deep Plans

Left-deep plans differ only in the order of
relations, the access method for each relation,
and the join method for each join.
Enumerated using N passes (if N relations
joined)
Pass 1 Find best 1-relation plan for each
relation.
Pass 2 Find best way to join result of each
1-relation plan (as outer) to another relation.
(All 2-relation plans.)
Pass N Find best way to join result of a
(N-1)-relation plan (as outer) to the N-th
relation. (All N-relation plans.)
For each subset of relations, retain only
Cheapest plan overall, plus
Cheapest plan for each interesting order of the
tuples.
Where does this help?

30
Enumeration of Plans (Contd.)

ORDER BY, GROUP BY, aggregates etc. handled as a
final step, using either an interestingly
ordered plan or an additional sorting operator.
Of course, hashing is an option too.
An N-1 way plan is not combined with an
additional relation unless there is a join
condition between them, unless all predicates in
WHERE have been used up.
i.e., avoid Cartesian products if possible.
In spite of pruning plan space, this approach is
still exponential in the of tables.

31
Example
Songs B tree on year. Hash on
sid Ratings B tree on uid

Pass1
Songs B tree matches yeargt2000, and is
probably cheapest. However, if this selection
is expected to retrieve a lot of tuples, and
index is unclustered, file scan may be cheaper.
Still, B tree plan kept (because tuples are in
year order).
Ratings B tree on uid matches uid100 perhaps
cheapest.

Pass 2
We consider each plan retained from Pass 1 as
the outer, and consider how to join it with the
(only) other relation.
e.g., Ratings as outer Hash index can be used
to get Songs tuples
that satisfy sid outer tuples sid value.
What other plans might be considered?

32
Example (contd.)

If we have gt 2 relations to join, for pass 2 ?
pass 3 (indeed, for pass k ? pass k1), the same
reasoning applies.
Use same criteria as before for deciding which
plans to keep for the join of each subset of k
relations. (k1)th rel inner for join plan.
For each k relations Join chosen (k1)th
rel, consider all possible strategies and choose
cheapest plans and/or plans w/ interesting tuple
orders.

33
Nested Queries
SELECT S.sname FROM Songs S WHERE S.sid IN
(SELECT R.sid FROM Ratings R
WHERE R.ratinggt7 AND R.time gt S.year)

Nested block is optimized independently, with the
outer tuple considered as providing a selection
condition.
Outer block is optimized with the cost of
calling a nested block computation taken into
account.
Implicit ordering of these blocks means that some
good strategies are not considered. The
non-nested version of the query is typically
optimized better.
Not all queries can be flattened this way!

Nested block to optimize SELECT R.sid FROM
Ratings R WHERE R.ratinggt7 AND R.timegt
outer value
Equivalent non-nested query SELECT S.sname FROM
Songs S, Ratings R WHERE S.sidR.sid AND
R.ratinggt7 AND R.time gt S.year
34
Summary

Query optimization is an important task in a
relational DBMS.
Must understand optimization in order to
understand the performance impact of a given
database design (relations, indexes) on a
workload (set of queries).
Two parts to optimizing a query
Consider a set of alternative plans.
Must prune search space typically, left-deep
plans only.
Must estimate cost of each plan that is
considered.
Must estimate size of result and cost for each
plan node.
Key issues Statistics, indexes, operator
implementations.

35
Summary (Contd.)

Single-relation queries
All access paths considered, cheapest is chosen.
Issues Selections that match index, whether
index key has all needed fields and/or provides
tuples in a desired order. Is the index
dense/sparse, clustered/unclustered. (Not all of
this matters all the time.)
Multiple-relation queries
All single-relation plans are first enumerated.
Selections/projections considered as early as
possible.
Next, for each 1-relation plan, all ways of
joining another relation (as inner) are
considered.
Next, for each 2-relation plan that is
retained, all ways of joining another relation
(as inner) are considered, etc.
At each level, for each subset of relations, only
best plan for each interesting order of tuples is
retained, in addition to the global cheapest
plan (which may or may not come with an
interesting order).

Write a Comment

User Comments (0)

About PowerShow.com

Relational Query Optimization PowerPoint PPT Presentation