# Advanced Database System - PowerPoint PPT Presentation

PPT – Advanced Database System PowerPoint presentation | free to view - id: 12ffed-YWRmZ

The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
Title:

Description:

### When we turn the preferred logical query plan into a physical plan, we need to ... Disjointed: T(R S)=0. Y might be the key of S and a foreign key of R: T(R S)=T(R) ... – PowerPoint PPT presentation

Number of Views:211
Avg rating:3.0/5.0
Slides: 53
Provided by: Zhiy2
Category:
Transcript and Presenter's Notes

1
• CS 641
• Lecture 13

2
Cost-based enumeration
• When we turn the preferred logical query plan
into a physical plan, we need to consider
different physical plans and evaluate or estimate
the cost of each
• We pick the physical query plan with the least
cost and pass this plan to the query execution
engine.

3
Generating physical plans
• To estimate each physical plan, we select
• An order and grouping for associative-and-commutat
ive operations (joins, unions and intersections)
• An algorithm for each operator in the logical
plan
• Additional operators that are needed
• The way in which arguments are passed from one
operator to the next.

4
Estimate sizes of the intermediate relations
• We need
• Given accurate estimates
• Easy to compute
• Are logically consistent

5
Estimate the size of a projection
• For projection, the size of the result is
computable.
• It produces a result tuple for every argument
tuple.
• The only change is the length of tuples
• Consider as bag projection
• For set projection, d operator need to be used.

6
Example
• R(a,b,c)
• a, b 4 bytes integer c 100 bytes string
• 120 bytes per tuple
• Block 1024 bytes, block header 24 bytes.
• T(R)10,000 ? B(R)1250
• Spab, c(R)
• 116 bytes per tuple, T(S)10,000, B(S)1250
• Upa, b(R)
• 20 bytes per tuple, T(U)10,000, B(U)200

7
Estimate the size of a selection
• Selection operation reduces the number of tuples,
but the size of each tuple remains the same
• SsAc(R)
• Estimation T(S)T(R)/V(R,a)
• Ssaltc(R)
• Estimation T(S)T(R)/3
• SsA?c(R)
• Estimation T(S)T(R) or T(S)T(R)(V(R,a)-1)/V(R,a
)
• For condition C is the AND of several equalities
and inequalities, the effect will be the
multiplication of the selectivity factor for each
condition.

8
Example 16.21
• R(a,b,c), Ssa10 AND blt20(R)
• T(R)10,000, V(R,a)50
• Estimation T(R)/(503)
• Special cases
• Ssa10 AND agt20(R)
• Estimation T(R)/(503)
• actual result 0
• Query optimizer have to deal with special cases.

9
OR of conditions
• SsC1 OR C2(R)
• If assume no tuple will satisfy both conditions
• The size will be the sum of the tuples satisfy
each
• It will even larger than size of R!
• We can take the smaller one as the estimation
• Another estimation (C1 and C2 are independent)
• n(1-(1-m1/n)(1-m2/n))
• n number of tuples
• m1 number of tuples satisfy C1
• m2 number of tuples satisfy C2

10
Example 16.22
• R(a,b), T(R)10,000, V(R,a)50
• Ssa10 OR blt20(R)
• n10,000 m1200, m23333
• Estimation
• 10000(1-(1-200/10000)(1-3333/10000))

11
NOT in a selection
• For NOT C, the estimation number of tuples
• T(R) the estimated number that satisfy C

12
Estimation the size of a join
• R(X,Y) ? S(Y,Z), Y is a single attribute
• According to Y values
• Disjointed T(R ? S)0
• Y might be the key of S and a foreign key of R
T(R ? S)T(R)
• Almost all the tuples of R and S could have the
same Y-values T(R ? S)T(R)T(S)

13
Simplified assumption
• Containment of value sets
• If Y appearing in many relations, then each
relation get its values from a common set
y1,y2,y3,. Thus, if V(R,Y)V(S,Y), every Y-value
of R will be a Y-value in S.
• Preservation of value sets
• An attribute which is not the join attribute does
not lose values from its set of possible values.
For example, A V(R ? S, A)V(R,A) as well as V(S
? R, A)V(R,A)

14
Estimation under assumptions
• R(X,Y) ? S(Y,Z), Y is a single attribute
• Let V(R,Y)V(S,Y)
• For each tuple t in R, it has 1/(V(S,Y) chance to
join with a given tuple in S, thus expected
number of tuples that t joins is T(S)/V(S,Y)
• The estimated size of R ? S will be
T(R)T(S)/V(S,Y)
• If V(R,Y)V(S,Y)
• The estimated size of R ? S will be
T(R)T(S)/V(R,Y)
• In general
• Size T(R ? S) will be T(R)T(S)/max(V(R,Y),
V(S,Y))

15
Example 16.23
Compute R ? S ? U
16
Example (Cont.)
• (R ? S) ? U
• T(R ? S) T(R)T(S)/max(V(R,b), V(S,b))40,000
• Next
• T(R ? S)T(U)/max(V(R ? S,c), V(U,c))400,000
• R ? (S ? U)
• T(S ? U) T(S)T(U)/max(V(S,c), V(U,c))20,000
• Next
• T(R)T(S ? U)/max(V(R,b), V(S ? U,b))400,000
• Coincidence?
• NO!

17
Multiple join attributes
• R(X,Y) ? S(Y,Z), Y contains several attributes
• Suppose R(x,y1,y2) ? S(y1,y2,z)
• For a give tuple r in R, the possibility to agree
on y1 with a tuple s in S is
• 1/max(V(R,y1), V(S,y1))
• For y2 1/max(V(R,y2), V(S,y2))
• Thus, the expected joined pairs
• T(R)T(S)/(max(V(R,y1),V(S,y1)) max(V(R,y2),
V(S,y2)))

18
General rule
• The size of natural join between two relations
• The estimate of the size of R ? S is computed by
multiplying T(R) by T(S) and dividing by the
larger of V(R,y) and V(S,y) for each attribute y
that is common to R and S.

19
Example 16.24
Compute
The estimate size is T(R)T(S)/max(V(R,b),V(S,d))m
ax(V(R,c),V(S,e))400
20
Example 16.25
21
Example 16.25
• R(a,b) ? U(c,d)
• Size T(R)T(U)5,000,000
• Then, join S(b,c)
• T(R)T(U)T(S)/(max(V(R,b),V(S,b))max(V(U,c),V(S,c)
))400,000
• Same result generated!

22
Equijoins
• The number of tuples in the result of an equijoin
can be computed exactly as for a natural join,
after accounting for the change in variable names.

23
Other theta-joins
• A selection following a product, and
• The number of tuples in a product is the number
of tuples in the relations involved
• Equality and inequality comparisons can be
estimated using the techniques developed for
selections

24
Joins of many relations
• SR1?R2?R3??Rn, attribute A appears in k of the
Ris, and the number of its sets of values in
these k relations viV(Ri,A) are v1v2v3
vk
• Suppose pick a tuple from each relation, what is
the probability that all tuples selected agree on
A?
• 1/v2v3...vk
each relation. Then, for each attribute A
appearing at least twice, divided by all but the
least of these V(R,A)s.
• The number of values that remains in A after join
will be the least of these V(R,A)s

25
Example 16.26
• R(a,b,c)?S(b,c,d)?U(b,e)

26
Example 16.26 (Cont.)
• The estimate size
• T(R)T(S)T(U)/(( two largest of (V(R,b), V(S,b),
V(U,b)max(V(R,c),V(S,c)100020005000/((50200)
200)5000
• The number of values for each of the attribute
the least value count
• a 100 b20 c100 d400 e500

27
Conclusion
• No matter how to order the natural joins of n
relations
• The estimate rules, applied to each join
individually, yields the same estimate size, and
• This size is the same as we apply the rules for
the join all n relations as a whole.

28
The size of Union
• Bag union
• The sum
• Set union
• The larger of two relations to the sum
• Using something in the middle

29
The size of intersection
• result
• Maybe 0 to the smaller of the two arguments
• Estimation
• Half of the smaller
• Consider as a special case of the natural join
• Not good! (Sometimes for bags)
• E.g. R(a,b) ?B S(a,b), R 2 copies of (0,1), S 3
copies
• V(R,a)V(S,a)V(R,b)V(S,b)1
• T(R)2, T(S)3
• Estimation will be 2x3/(max(1,1)max(1,1))6, not
correct.

30
Difference
• R-S
• Result between T(R) to T(R)-T(S)
• Estimation T(R)-(1/2)T(S)

31
Duplication elimination
• Result
• Between 1 to the size of R.
• Estimation smaller of (1/2)T(R) and the product
of all the V(R,ai)s

32
Grouping and aggregation
• Similar to duplication elimination
• Estimation smaller of (1/2)T(R) or the product
of V(R,A)s

33
Exercises
• 16.4.1 a) b) g) i)
• 16.4.2

34
Estimate the cost of certain expressions
• The cost (number of I/Os) is influenced by
• The particular logical operator chosen to
implement the query
• The size of intermediate results
• The physical operators used to implement logical
operators
• The ordering of similar operations
• The method of passing arguments from one physical
operator to the next

35
Statistical information
• DBMS have to keep important information such as
T(R), B(R), V(R,a) etc.
• Histogram used to compute values for a given
attribute
• Equal-wdith a width w is chosen, along with a
constant v0(the smallest value). Counts for
v0vltv0w, v0wvltv02w,
• Equal-height common percentiles, pick some
fraction p, and list the lowest value, the
fraction 2p, and so on
• Most-frequent-values list most common values and
their numbers of occurrences. Can be combined
with previous two mechanisms

36
Example 16.27
• R(a,b)?S(b,c)
• R.b 1200, 0150, 5100, others 550
• S.b 0100, 180, 270, others 250
• V(R,b)14, V(S,b)13.
• R 50 occurrence of each other value
• S 25 occurrence of each other value
• For 0 15010015,000
• For 1 200 7016,000
• For others
• 2 70503500 5 100252500
• Other nine values 5025911,250
• Total 1500016000350025001125048250
• compare to simple estimation 1000500/1435714

37
Example 16.28
• Jan(day, temp)
• July(day, temp)
• Query
• SELECT Jan.day, July.day
• FROM Jan, July
• WHERE Jan.tempJuly.temp
• Using equal-width histogram.

38
Example 16.28 (Cont.)
39
Example 16.28 (Cont.)
• We calculate for each bands
• V10 for each band
• 40-49 105/105
• 50-59 520/1010
• All other ranges have 0 joined tuples
• Total 51015 tuples
• If no histogram information
• V100 for 0-100
• 245245/100600 tuples

40
Computation of statistics
• Recomputation too often is not good
• Statistics tend not to change radically in a
short time
• Somehow, inaccurate statistics are still useful
• Keeping up-to-date will cause hot spot
• Recomputation should be triggered automatically
after some period of time or after some number of
• Database administrator can issue recomputation
command directly if poor-performing query plans
are observed.
• Recomputation is complex and time-consuming.
• Sometimes, we can sample a fraction of data when
doing recomputation.

41
Heuristics for reducing cost of logical query plan
• Estimating cost both before and after a
transformation can reduce the cost or avoid the
transformation otherwise.

42
Example 16.29
43
Example 16.29 (Cont.)
44
Example 16.29 (Cont.)
• To compare two plans, we add the estimated sizes
for all the nodes except the root and the leaves.
• (a) 1005010001150
• (b) 10010001100
• Different answer if we have a smaller V(R,b) or
V(S,b)

45
Enumerating physical plans
• Baseline approach
• Exhaustive consider all the combinations of
choices, and choose the one with the smallest
cost
• Other approaches are more efficient
• Top-down work down the tree of the logical query
plan from the root.
• Bottom-up for each subexpression, compute the
costs of all possible ways.
• Both approaches can limit the choices effectively

46
Bottom-up approaches
• Dynamic programming
• Simplified bottom-up method
• Consider only the best plan for each
subexpression
• Not guarantee to yield the best plan, but it
often does
• Selinger-style(system-R-style)
• Exploit additional properties that some of the
plans for a subexpression may have
• Produce optimal overall plan.

47
Heuristic selection
• If the logical plan calls for a selection sAc
(R), and R has an index on A, then perform an
index-scan to obtain the tuples with Ac
• More generally, for selections involving Ac and
other conditions, implement as the index-scan
selection followed by a further selection.
• If an argument of a join has an index on the join
attribute(s), use an index-join with that
relation in the inner loop
• If one argument of a join is sorted on the join
attribute(s), prefer a sort-join to a hash-join,
but not necessarily to an index-join
• Computing the union or intersection of three or
more relations, group the smallest relations
first

48
Branch-and-bound enumeration
• Using heuristics to find a physical plan first,
let its cost be C
• Consider other plans for subqueries
• If cost gt C, ignore
• If cost lt C, replace the corresponding
subexpression
• We can choose to cut off the search and take the
best plan found so far. Reduce the calculation
time

49
Hill climbing
• Then, make small changes to the plan
• Replacing one method for an operator by another
• Reordering join, etc.
• If find a plan with smaller cost, change
• If no small modification can yield lower cost,
choose the current plan

50
Dynamic programming
• Keep for each subexpression only the plan of
least cost
• As we work up the tree, consider possible
implementations of each node, assuming the best
plan for each subexpression is also used

51
Selinger-style optimization
• Improvement on dynamic programming
• Keeping other plans that have higher cost for
each subexpression, and produce a result that is
sorted in an order that may useful in higher
layers
• The attribute(s) specified in a sort operator at
the root
• The grouping attribute(s) of a later group-by
operator
• The join attribute(s) of a later join

52
Exercises
• 16.5.2
• 16.5.3
• 16.5.4 a) b)