Loading...

PPT – Relational Query Optimization PowerPoint presentation | free to download - id: 761d00-NDIzO

The Adobe Flash plugin is needed to view this content

Relational Query Optimization

- Chapter 15 Ramakrishnan Gehrke
- (Sections 15.1-15.6)

What you will learn from this lecture

- Cost-based query optimization (System R)
- Plan space considered
- How cost is estimated
- Selectivity estimation using histograms.

Highlights of System R Optimizer

- Impact
- Most widely used currently works well for lt 10

joins. - Cost estimation Approximations used.
- Statistics, maintained in system catalogs, used

to estimate cost of operations and result sizes. - System R considers combination of CPU and I/O

costs. - Plan Space Too large, must be pruned.
- Only the space of left-deep plans is considered.
- Left-deep plans allow output of each operator to

be pipelined into the next operator without

storing it in a temporary relation. - Cartesian products avoided.

Schema for Examples

Ratings(uid integer, sid integer, time date,

rating integer) Songs(sid integer, sname

string, genre string, year date)

- Ratings
- Each tuple is 40 bytes long, 100 tuples per

page, 1000 pages. - Songs
- Each tuple is 50 bytes long, 80 tuples per page,

500 pages.

Query Blocks Units of Optimization

SELECT S.sname FROM Songs S WHERE S.sid IN

(SELECT R.sid FROM Ratings R

WHERE R.ratinggt7 AND R.time gt S.year)

- A SQL query is parsed into a collection of query

blocks, and these are optimized one block at a

time. - Nested blocks are usually treated as calls to a

subroutine, made once per outer tuple. (This is

an over-simplification, but serves for now.)

Outer block

Nested block

- For each block, the plans considered are
- All available access methods, for each reln in

FROM clause. - All left-deep join trees (i.e., all ways to

join the relations one-at-a-time, with the inner

reln in the FROM clause, considering all reln

permutations and join methods.) - i.e., we pipeline what is being computed.

Cost Estimation (Recap 1)

- For each plan considered, must estimate cost
- Must estimate cost of each operation in plan

tree. - Depends on input cardinalities.
- Weve already discussed how to estimate the cost

of operations (sequential scan, index scan,

joins, etc.) - Must estimate size of result for each operation

in tree! - Use information about the input relations.
- For selections and joins, assume independence of

predicates. - Well discuss the System R cost estimation

approach. - Inexact, but works ok in practice.
- More sophisticated techniques known now.

Statistics and Catalogs (Recap 2)

- Need information about the relations and indexes

involved. Catalogs typically contain at least - tuples (NTuples) and pages (NPages) for each

relation. - distinct key values (NKeys) and NPages for each

index. - Index height, low/high key values (Low/High) for

each tree index. - Catalogs updated periodically.
- Updating whenever data changes is too expensive

lots of approximation anyway, so slight

inconsistency ok. - More detailed information (e.g., histograms of

the values in some field) are sometimes stored.

Size Estimation and Reduction Factors (Recap 3)

SELECT attribute list FROM relation list WHERE

term1 AND ... AND termk

- Consider a query block
- Maximum tuples in result is the product of the

cardinalities of relations in the FROM clause. - Reduction factor (RF) associated with each term

reflects the impact of the term in reducing

result size. Result cardinality Max tuples

product of all RFs. - Implicit assumption that terms are independent!
- Term colvalue has RF 1/NKeys(I), given index I

on col - Term col1col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
- Term colgtvalue has RF (High(I)-value)/(High(I)-Low

(I) 1)

Histograms

- Example Songs(sid, sname, genre, year). Suppose

recorded years ?1994,2008 and there are 45

tuples. - Suppose the year values occur in the following

pattern

year frequency

94 2

95 3

96 3

97 1

98 2

99 1

00 3

01 8

02 4

03 2

04 0

05 1

06 2

07 4

08 9

- For year gt 07, what does our
- uniformity assumption tell us?
- RF 1/15, so 45 x 1/15 3
- tuples qualify. But in reality, 9
- tuples do!

frequency table

- Can we improve estimation
- by maintaining small amount of
- additional info.?
- Idea repeat what we did for the
- interval 94,08 to subintervals!

Histogram Generation Use

Can slice up the frequency table using different

schemes. Each scheme leads to a type of

histogram.

Suppose we want to create a Histogram with 5

buckets.

age frequency

94 2

95 3

96 3

97 1

98 2

99 1

00 3

01 8

02 4

03 2

04 0

05 1

06 2

07 4

08 9

Bucket (Bin)

Split up the range into subintervals of equal

width.

Make uniformity assumption inside each bucket.

e.g., 95 is estimated to occur 8/3 2.67

times (approx).

Equiwidth Histograms

- Equiwidth split 94,08 into 5 buckets (i.e.,

bins) of equal width sum up frequencies

bin-wide.

Frequency ? Ntuples. Note this is already much

closer to actual distribution. Whats our

current estimate for year gt 07? 1/3 x 15 5

tuples. Compare with actual data. Compare with

uniformity assumption. Uniformity assumption

histogram with 1 bucket!

bucket frequency

94-96 8

97-99 4

00-02 15

03-05 3

06-08 15

Equiwidth histogram With 5 buckets (bins).

Equidepth Histogram Generation

Can slice up the frequency table using different

schemes. Each scheme leads to a type of

histogram.

Want to make a 5 bucket Histogram again.

year frequency

94 2

95 3

96 3

97 1

98 2

99 1

00 3

01 8

02 4

03 2

04 0

05 1

06 2

07 4

08 9

Split up the range into subintervals such that

each contains about the same total

occurrences, i.e., 45/5 9 in our example.

Equidepth Histograms

- Equidepth

- Estimate for year gt 07 is 9.
- Even better than equiwidth HG.
- Happens to be exact in this case.
- What is the maximum error in
- estimate for EW HG and ED HG
- in the present example?
- What are the main differences
- between equiwidth and
- equidepth HGs?

bucket frequency

94-97 9

98-00 6

01-02 12

03-07 9

08 9

A slightly different slicing

- Some of the bars have been moved. The slicing

cannot be perfect in general.

year frequency

94 2

95 3

96 3

97 1

98 2

99 1

00 3

01 8

02 4

03 2

04 0

05 1

06 2

07 4

08 9

Equidepth Histograms

- Equidepth

- How does this HG compare with
- the previous ED HG?

bucket frequency

94-98 11

99-01 12

02-05 7

06-07 6

08 9

First ED HG.

bucket frequency

94-97 9

98-00 6

01-02 12

03-07 9

08 9

Which HG would you put your money on?

Histogram Use

- How do you calculate RF using a histogram?
- Straightforward for Attrval.
- What about Attr lt val? (we use 1st EDHG below.)
- E.g. year gt 98
- tuplesuniform 10/15 x 45 30. RFuniform

30/45 10/15 2/3. - tuplesewhg 1 x 4/3 33 34.33. RFewhg

34.33/45. - tuplesedhg 2 x 6/3 30 34. RFedhg 34/45.
- Actual (tuples with year gt 98) 34. Actual RF

34/45. - In general, can define optimal histograms (given

buckets) to minimize expected error. - Interesting algorithmic problem with lots of

research literature. - HGs useful for selectivity estimation for

multi-dimensional queries.

Relational Algebra Equivalences

- Allow us to choose different join orders and to

push selections and projections ahead of joins. - Selections

(Cascade)

(Commute)

- Projections

(Cascade)

when a1 ? a2 ? ? an.

(Associative)

- Joins

R (S T) (R S) T

(Commutative)

(R S) (S R)

R (S T) (T R) S

- Show that

More Equivalences

- A projection commutes with a selection that only

uses attributes retained by the projection. - Selection between attributes of the two arguments

of a cross-product converts cross-product to a

join. - A selection on just attributes of R commutes with

R S. (i.e., (R S)

(R) S ) - Similarly, if a projection follows a join R

S, we can push it by retaining only attributes

of R (and S) that are needed for the join or are

kept by the projection.

Some more equivalences

- Join vs. Group-by
- ?_R.sid max(time)(Songs Ratings) ?
- ?_R.sid, max(time) ( ?_R.sid

max(time)(Ratings) Songs) - ? ?_R.sid max(time)(Ratings).
- provided the RIC Ratings.sid ? Songs.sid holds

(needed only for second equivalence). - Does the equivalence hold for other aggregate

functions e.g., what about avg(rating)? or

count(time)?

R.sidS.sid

R.sidS.sid

One more on Group-By and Join.

- ??_A agg(D)r(A,B) s(B,C,D) ??
- ?_A,agg(D)(r(A,B) ? ?_Bagg(D)(s(B,C,D))).

- Does this equivalence always hold? What

constraints, if any, are needed for it to hold?

If the constraint is relaxed, what then?

Enumeration of Alternative Plans

- There are two main cases
- Single-relation plans
- Multiple-relation plans
- For queries over a single relation, queries

consist of a combination of selects, projects,

and aggregate ops - Each available access path (file i.e., table

scan/index scan/index probe) is considered, and

the one with the least estimated cost is chosen. - The different operations are essentially carried

out together (e.g., if an index is used for a

selection, projection is done for each retrieved

tuple, and the resulting tuples are pipelined

into the aggregate computation).

Cost Estimates for Single-Relation Plans

- Index I on primary key matches equality

selection - Cost is Height(I)1 for a B tree, about 1.2 for

hash index. - Additional cost if alternative 2 or 3 is used.
- Clustered index I matching one or more selects
- (NPages(I)NPages(R)) product of RFs of

matching selects. this is mainly for secondary

keys. - Non-clustered index I matching one or more

selects - (NPages(I)NTuples(R)) product of RFs of

matching selects. - Sequential scan of file
- NPages(R).
- Note Typically, no duplicate elimination on

projections! (Exception Done on answers if user

says DISTINCT.)

secondary key

Example

SELECT S.sid FROM Songs S WHERE

S.genrefriendship

- If we have an index on genre
- (1/NKeys(I)) NTuples(R) (1/10) 40000 tuples

retrieved. assuming, distinct genres 10. - Suppose index size Npages(I) 50 pages.
- Clustered index (1/NKeys(I))

(NPages(I)NPages(R)) (1/10) (50500) pages

are retrieved. (This is the cost.) - Unclustered index (1/NKeys(I))

(NPages(I)NTuples(R)) (1/10) (5040000)

pages are retrieved. - Doing a file scan
- We retrieve all file pages (500).

An Example with Aggregation

- SELECT year, COUNT()
- FROM Songs
- WHERE yeargt2000 AND genrefriendship
- GROUP BY year
- HAVING COUNT DISTINCT (sname) gt 2
- PROJ_year, count(sname) (
- HAVING_count disinct (sname)gt2(
- GROUP BY_year count distinct(sname) (
- PROJ_year,sname(

SEL_yeargt2000genrefriendship - (Songs))))).

Plan w/o indexes

- Scan Songs file apply SEL/PROJ on the fly.
- Write out result into temp.
- Sort by year (for GROUP BY) aggregate during

merge do HAVING on the fly. - Whats the cost?
- 500 I/Os for scan
- RF for yeargt2000 is 0.5 (approx.) RF for

genrefriendship is 0.1 (used default value). - RF 0.5 x 0.1 PROJ reduces size by 0.5 (if all

fields have the same size). Temp. size 500 x

0.5 x 0.1 x 0.5 13 pages (approx), so 13 I/Os

for writing it. - 3 x 13 for sorting it. can be less if we have

sufficient buffer space. - Total 552 I/Os.

Plans w/ indexes.

- Pick one (whatever QOzer estimates is best) index

among those available probe the index and verify

unmatched selection conditions as pages are read

in. - Use multiple indexes do a rid intersection then

fetch resulting pages. - If GROUP-BY attrs are a prefix of a Btree index,

use the index to retrieve tuples in sort order

(simplifies GROUP-BY). Rest of the ops done on

each tuple, then each group. - Suppose there is a dense index that includes all

attrs mentioned in query. Then only need to scan

the indexs data entries. If index matches some

of the SEL conditions, even better. If index is

Btree and GROUP-BY attrs form a prefix, can

avoid sorting! - Does it matter if this index is clustered or not?

- (what is a dense index? and what is a sparse

index?) - Note we can keep attr-values as data entries

even if they are not indexed! E.g., index on

genre, but data entry -- (genre, year). Why

/When would this help?

Plans w/ indexes Example.

- Revisit query data entries (key, rid) pairs.

Btree index on year, hash index on genre, Btree

index on (year, genre, sname). - 1. Fetch data entries using hash index on genre

then records then apply remaining ops. - Does clustering make a difference?
- 2. Use both Btree index on year and hash index

on genre intersect rids and fetch data. - 3. Use only Btree index on year (in sorted

order). Helps greatly for GROUP-BY. - 4. Scan only the Btree index on (year, genre,

sname) fetch data entries with yeargt2000 and in

sorted order. Then apply remaining ops. (index

could be unclustered.) index must be dense for

this to be work!

Queries Over Multiple Relations

- Fundamental decision in System R only left-deep

join trees are considered. - As the number of joins increases, the number of

alternative plans grows rapidly we need to

restrict the search space. - Left-deep trees allow us to generate all fully

pipelined plans. - Intermediate results not written to temporary

files. - Not all left-deep plans are fully pipelined

(e.g., Sort-Merge join).

Enumeration of Left-Deep Plans

- Left-deep plans differ only in the order of

relations, the access method for each relation,

and the join method for each join. - Enumerated using N passes (if N relations

joined) - Pass 1 Find best 1-relation plan for each

relation. - Pass 2 Find best way to join result of each

1-relation plan (as outer) to another relation.

(All 2-relation plans.) - Pass N Find best way to join result of a

(N-1)-relation plan (as outer) to the N-th

relation. (All N-relation plans.) - For each subset of relations, retain only
- Cheapest plan overall, plus
- Cheapest plan for each interesting order of the

tuples. - Where does this help?

Enumeration of Plans (Contd.)

- ORDER BY, GROUP BY, aggregates etc. handled as a

final step, using either an interestingly

ordered plan or an additional sorting operator.

Of course, hashing is an option too. - An N-1 way plan is not combined with an

additional relation unless there is a join

condition between them, unless all predicates in

WHERE have been used up. - i.e., avoid Cartesian products if possible.
- In spite of pruning plan space, this approach is

still exponential in the of tables.

Example

Songs B tree on year. Hash on

sid Ratings B tree on uid

- Pass1
- Songs B tree matches yeargt2000, and is
- probably cheapest. However, if this selection
- is expected to retrieve a lot of tuples, and
- index is unclustered, file scan may be cheaper.
- Still, B tree plan kept (because tuples are in
- year order).
- Ratings B tree on uid matches uid100 perhaps

cheapest.

- Pass 2
- We consider each plan retained from Pass 1 as

the outer, and consider how to join it with the

(only) other relation. - e.g., Ratings as outer Hash index can be used

to get Songs tuples - that satisfy sid outer tuples sid value.

What other plans might be considered?

Example (contd.)

- If we have gt 2 relations to join, for pass 2 ?

pass 3 (indeed, for pass k ? pass k1), the same

reasoning applies. - Use same criteria as before for deciding which

plans to keep for the join of each subset of k

relations. (k1)th rel inner for join plan. - For each k relations Join chosen (k1)th

rel, consider all possible strategies and choose

cheapest plans and/or plans w/ interesting tuple

orders.

Nested Queries

SELECT S.sname FROM Songs S WHERE S.sid IN

(SELECT R.sid FROM Ratings R

WHERE R.ratinggt7 AND R.time gt S.year)

- Nested block is optimized independently, with the

outer tuple considered as providing a selection

condition. - Outer block is optimized with the cost of

calling a nested block computation taken into

account. - Implicit ordering of these blocks means that some

good strategies are not considered. The

non-nested version of the query is typically

optimized better. - Not all queries can be flattened this way!

Nested block to optimize SELECT R.sid FROM

Ratings R WHERE R.ratinggt7 AND R.timegt

outer value

Equivalent non-nested query SELECT S.sname FROM

Songs S, Ratings R WHERE S.sidR.sid AND

R.ratinggt7 AND R.time gt S.year

Summary

- Query optimization is an important task in a

relational DBMS. - Must understand optimization in order to

understand the performance impact of a given

database design (relations, indexes) on a

workload (set of queries). - Two parts to optimizing a query
- Consider a set of alternative plans.
- Must prune search space typically, left-deep

plans only. - Must estimate cost of each plan that is

considered. - Must estimate size of result and cost for each

plan node. - Key issues Statistics, indexes, operator

implementations.

Summary (Contd.)

- Single-relation queries
- All access paths considered, cheapest is chosen.
- Issues Selections that match index, whether

index key has all needed fields and/or provides

tuples in a desired order. Is the index

dense/sparse, clustered/unclustered. (Not all of

this matters all the time.) - Multiple-relation queries
- All single-relation plans are first enumerated.
- Selections/projections considered as early as

possible. - Next, for each 1-relation plan, all ways of

joining another relation (as inner) are

considered. - Next, for each 2-relation plan that is

retained, all ways of joining another relation

(as inner) are considered, etc. - At each level, for each subset of relations, only

best plan for each interesting order of tuples is

retained, in addition to the global cheapest

plan (which may or may not come with an

interesting order).