Advanced Database System - PowerPoint PPT Presentation

1 / 52

About This Presentation

Title:

Advanced Database System

Description:

When we turn the preferred logical query plan into a physical plan, we need to ... Disjointed: T(R S)=0. Y might be the key of S and a foreign key of R: T(R S)=T(R) ... – PowerPoint PPT presentation

Number of Views:215

Avg rating:3.0/5.0

Slides: 53

Provided by: Zhiy2

Category:

more less

Transcript and Presenter's Notes

Title: Advanced Database System

1
Advanced Database System

CS 641
Lecture 13

2
Cost-based enumeration

When we turn the preferred logical query plan
into a physical plan, we need to consider
different physical plans and evaluate or estimate
the cost of each
We pick the physical query plan with the least
cost and pass this plan to the query execution
engine.

3
Generating physical plans

To estimate each physical plan, we select
An order and grouping for associative-and-commutat
ive operations (joins, unions and intersections)
An algorithm for each operator in the logical
plan
Additional operators that are needed
The way in which arguments are passed from one
operator to the next.

4
Estimate sizes of the intermediate relations

We need
Given accurate estimates
Easy to compute
Are logically consistent

5
Estimate the size of a projection

For projection, the size of the result is
computable.
It produces a result tuple for every argument
tuple.
The only change is the length of tuples
Consider as bag projection
For set projection, d operator need to be used.

6
Example

R(a,b,c)
a, b 4 bytes integer c 100 bytes string
Tuple header 12 bytes
120 bytes per tuple
Block 1024 bytes, block header 24 bytes.
T(R)10,000 ? B(R)1250
Spab, c(R)
116 bytes per tuple, T(S)10,000, B(S)1250
Upa, b(R)
20 bytes per tuple, T(U)10,000, B(U)200

7
Estimate the size of a selection

Selection operation reduces the number of tuples,
but the size of each tuple remains the same
SsAc(R)
Estimation T(S)T(R)/V(R,a)
Ssaltc(R)
Estimation T(S)T(R)/3
SsA?c(R)
Estimation T(S)T(R) or T(S)T(R)(V(R,a)-1)/V(R,a
)
For condition C is the AND of several equalities
and inequalities, the effect will be the
multiplication of the selectivity factor for each
condition.

8
Example 16.21

R(a,b,c), Ssa10 AND blt20(R)
T(R)10,000, V(R,a)50
Estimation T(R)/(503)
Special cases
Ssa10 AND agt20(R)
Estimation T(R)/(503)
actual result 0
Query optimizer have to deal with special cases.

9
OR of conditions

SsC1 OR C2(R)
If assume no tuple will satisfy both conditions
The size will be the sum of the tuples satisfy
each
It will even larger than size of R!
We can take the smaller one as the estimation
Another estimation (C1 and C2 are independent)
n(1-(1-m1/n)(1-m2/n))
n number of tuples
m1 number of tuples satisfy C1
m2 number of tuples satisfy C2

10
Example 16.22

R(a,b), T(R)10,000, V(R,a)50
Ssa10 OR blt20(R)
n10,000 m1200, m23333
Estimation
10000(1-(1-200/10000)(1-3333/10000))

11
NOT in a selection

For NOT C, the estimation number of tuples
T(R) the estimated number that satisfy C

12
Estimation the size of a join

R(X,Y) ? S(Y,Z), Y is a single attribute
According to Y values
Disjointed T(R ? S)0
Y might be the key of S and a foreign key of R
T(R ? S)T(R)
Almost all the tuples of R and S could have the
same Y-values T(R ? S)T(R)T(S)

13
Simplified assumption

Containment of value sets
If Y appearing in many relations, then each
relation get its values from a common set
y1,y2,y3,. Thus, if V(R,Y)V(S,Y), every Y-value
of R will be a Y-value in S.
Preservation of value sets
An attribute which is not the join attribute does
not lose values from its set of possible values.
For example, A V(R ? S, A)V(R,A) as well as V(S
? R, A)V(R,A)

14
Estimation under assumptions

R(X,Y) ? S(Y,Z), Y is a single attribute
Let V(R,Y)V(S,Y)
For each tuple t in R, it has 1/(V(S,Y) chance to
join with a given tuple in S, thus expected
number of tuples that t joins is T(S)/V(S,Y)
The estimated size of R ? S will be
T(R)T(S)/V(S,Y)
If V(R,Y)V(S,Y)
The estimated size of R ? S will be
T(R)T(S)/V(R,Y)
In general
Size T(R ? S) will be T(R)T(S)/max(V(R,Y),
V(S,Y))

15
Example 16.23
Compute R ? S ? U
16
Example (Cont.)

(R ? S) ? U
T(R ? S) T(R)T(S)/max(V(R,b), V(S,b))40,000
Next
T(R ? S)T(U)/max(V(R ? S,c), V(U,c))400,000
R ? (S ? U)
T(S ? U) T(S)T(U)/max(V(S,c), V(U,c))20,000
Next
T(R)T(S ? U)/max(V(R,b), V(S ? U,b))400,000
Coincidence?
NO!

17
Multiple join attributes

R(X,Y) ? S(Y,Z), Y contains several attributes
Suppose R(x,y1,y2) ? S(y1,y2,z)
For a give tuple r in R, the possibility to agree
on y1 with a tuple s in S is
1/max(V(R,y1), V(S,y1))
For y2 1/max(V(R,y2), V(S,y2))
Thus, the expected joined pairs
T(R)T(S)/(max(V(R,y1),V(S,y1)) max(V(R,y2),
V(S,y2)))

18
General rule

The size of natural join between two relations
The estimate of the size of R ? S is computed by
multiplying T(R) by T(S) and dividing by the
larger of V(R,y) and V(S,y) for each attribute y
that is common to R and S.

19
Example 16.24
Compute
The estimate size is T(R)T(S)/max(V(R,b),V(S,d))m
ax(V(R,c),V(S,e))400
20
Example 16.25
21
Example 16.25

R(a,b) ? U(c,d)
Size T(R)T(U)5,000,000
Then, join S(b,c)
T(R)T(U)T(S)/(max(V(R,b),V(S,b))max(V(U,c),V(S,c)
))400,000
Same result generated!

22
Equijoins

The number of tuples in the result of an equijoin
can be computed exactly as for a natural join,
after accounting for the change in variable names.

23
Other theta-joins

A selection following a product, and
The number of tuples in a product is the number
of tuples in the relations involved
Equality and inequality comparisons can be
estimated using the techniques developed for
selections

24
Joins of many relations

SR1?R2?R3??Rn, attribute A appears in k of the
Ris, and the number of its sets of values in
these k relations viV(Ri,A) are v1v2v3
vk
Suppose pick a tuple from each relation, what is
the probability that all tuples selected agree on
A?
Answer
1/v2v3...vk
start with the product of the number of tuples in
each relation. Then, for each attribute A
appearing at least twice, divided by all but the
least of these V(R,A)s.
The number of values that remains in A after join
will be the least of these V(R,A)s

25
Example 16.26

R(a,b,c)?S(b,c,d)?U(b,e)

26
Example 16.26 (Cont.)

The estimate size
T(R)T(S)T(U)/(( two largest of (V(R,b), V(S,b),
V(U,b)max(V(R,c),V(S,c)100020005000/((50200)
200)5000
The number of values for each of the attribute
the least value count
a 100 b20 c100 d400 e500

27
Conclusion

No matter how to order the natural joins of n
relations
The estimate rules, applied to each join
individually, yields the same estimate size, and
This size is the same as we apply the rules for
the join all n relations as a whole.

28
The size of Union

Bag union
The sum
Set union
The larger of two relations to the sum
Using something in the middle

29
The size of intersection

result
Maybe 0 to the smaller of the two arguments
Estimation
Half of the smaller
Consider as a special case of the natural join
Not good! (Sometimes for bags)
E.g. R(a,b) ?B S(a,b), R 2 copies of (0,1), S 3
copies
V(R,a)V(S,a)V(R,b)V(S,b)1
T(R)2, T(S)3
Estimation will be 2x3/(max(1,1)max(1,1))6, not
correct.

30
Difference

R-S
Result between T(R) to T(R)-T(S)
Estimation T(R)-(1/2)T(S)

31
Duplication elimination

Result
Between 1 to the size of R.
Estimation smaller of (1/2)T(R) and the product
of all the V(R,ai)s

32
Grouping and aggregation

Similar to duplication elimination
Estimation smaller of (1/2)T(R) or the product
of V(R,A)s

33
Exercises

16.4.1 a) b) g) i)
16.4.2

34
Estimate the cost of certain expressions

The cost (number of I/Os) is influenced by
The particular logical operator chosen to
implement the query
The size of intermediate results
The physical operators used to implement logical
operators
The ordering of similar operations
The method of passing arguments from one physical
operator to the next

35
Statistical information

DBMS have to keep important information such as
T(R), B(R), V(R,a) etc.
Histogram used to compute values for a given
attribute
Equal-wdith a width w is chosen, along with a
constant v0(the smallest value). Counts for
v0vltv0w, v0wvltv02w,
Equal-height common percentiles, pick some
fraction p, and list the lowest value, the
fraction 2p, and so on
Most-frequent-values list most common values and
their numbers of occurrences. Can be combined
with previous two mechanisms

36
Example 16.27

R(a,b)?S(b,c)
R.b 1200, 0150, 5100, others 550
S.b 0100, 180, 270, others 250
V(R,b)14, V(S,b)13.
R 50 occurrence of each other value
S 25 occurrence of each other value
For 0 15010015,000
For 1 200 7016,000
For others
2 70503500 5 100252500
Other nine values 5025911,250
Total 1500016000350025001125048250
compare to simple estimation 1000500/1435714

37
Example 16.28

Jan(day, temp)
July(day, temp)
Query
SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.tempJuly.temp
Using equal-width histogram.

38
Example 16.28 (Cont.)
39
Example 16.28 (Cont.)

We calculate for each bands
V10 for each band
40-49 105/105
50-59 520/1010
All other ranges have 0 joined tuples
Total 51015 tuples
If no histogram information
V100 for 0-100
245245/100600 tuples

40
Computation of statistics

Recomputation too often is not good
Statistics tend not to change radically in a
short time
Somehow, inaccurate statistics are still useful
Keeping up-to-date will cause hot spot
Recomputation should be triggered automatically
after some period of time or after some number of
updates
Database administrator can issue recomputation
command directly if poor-performing query plans
are observed.
Recomputation is complex and time-consuming.
Sometimes, we can sample a fraction of data when
doing recomputation.

41
Heuristics for reducing cost of logical query plan

Estimating cost both before and after a
transformation can reduce the cost or avoid the
transformation otherwise.

42
Example 16.29
43
Example 16.29 (Cont.)
44
Example 16.29 (Cont.)

To compare two plans, we add the estimated sizes
for all the nodes except the root and the leaves.
(a) 1005010001150
(b) 10010001100
Different answer if we have a smaller V(R,b) or
V(S,b)

45
Enumerating physical plans

Baseline approach
Exhaustive consider all the combinations of
choices, and choose the one with the smallest
cost
Other approaches are more efficient
Top-down work down the tree of the logical query
plan from the root.
Bottom-up for each subexpression, compute the
costs of all possible ways.
Both approaches can limit the choices effectively

46
Bottom-up approaches

Dynamic programming
Simplified bottom-up method
Consider only the best plan for each
subexpression
Not guarantee to yield the best plan, but it
often does
Selinger-style(system-R-style)
Exploit additional properties that some of the
plans for a subexpression may have
Produce optimal overall plan.

47
Heuristic selection

If the logical plan calls for a selection sAc
(R), and R has an index on A, then perform an
index-scan to obtain the tuples with Ac
More generally, for selections involving Ac and
other conditions, implement as the index-scan
selection followed by a further selection.
If an argument of a join has an index on the join
attribute(s), use an index-join with that
relation in the inner loop
If one argument of a join is sorted on the join
attribute(s), prefer a sort-join to a hash-join,
but not necessarily to an index-join
Computing the union or intersection of three or
more relations, group the smallest relations
first

48
Branch-and-bound enumeration

Using heuristics to find a physical plan first,
let its cost be C
Consider other plans for subqueries
If cost gt C, ignore
If cost lt C, replace the corresponding
subexpression
Advantage
We can choose to cut off the search and take the
best plan found so far. Reduce the calculation
time

49
Hill climbing

Start with a heuristically selected physical plan
Then, make small changes to the plan
Replacing one method for an operator by another
Reordering join, etc.
If find a plan with smaller cost, change
If no small modification can yield lower cost,
choose the current plan

50
Dynamic programming

Keep for each subexpression only the plan of
least cost
As we work up the tree, consider possible
implementations of each node, assuming the best
plan for each subexpression is also used

51
Selinger-style optimization

Improvement on dynamic programming
Keeping other plans that have higher cost for
each subexpression, and produce a result that is
sorted in an order that may useful in higher
layers
The attribute(s) specified in a sort operator at
the root
The grouping attribute(s) of a later group-by
operator
The join attribute(s) of a later join

52
Exercises