Advanced Database System - PowerPoint PPT Presentation

1 / 52
About This Presentation
Title:

Advanced Database System

Description:

When we turn the preferred logical query plan into a physical plan, we need to ... Disjointed: T(R S)=0. Y might be the key of S and a foreign key of R: T(R S)=T(R) ... – PowerPoint PPT presentation

Number of Views:215
Avg rating:3.0/5.0
Slides: 53
Provided by: Zhiy2
Category:

less

Transcript and Presenter's Notes

Title: Advanced Database System


1
Advanced Database System
  • CS 641
  • Lecture 13

2
Cost-based enumeration
  • When we turn the preferred logical query plan
    into a physical plan, we need to consider
    different physical plans and evaluate or estimate
    the cost of each
  • We pick the physical query plan with the least
    cost and pass this plan to the query execution
    engine.

3
Generating physical plans
  • To estimate each physical plan, we select
  • An order and grouping for associative-and-commutat
    ive operations (joins, unions and intersections)
  • An algorithm for each operator in the logical
    plan
  • Additional operators that are needed
  • The way in which arguments are passed from one
    operator to the next.

4
Estimate sizes of the intermediate relations
  • We need
  • Given accurate estimates
  • Easy to compute
  • Are logically consistent

5
Estimate the size of a projection
  • For projection, the size of the result is
    computable.
  • It produces a result tuple for every argument
    tuple.
  • The only change is the length of tuples
  • Consider as bag projection
  • For set projection, d operator need to be used.

6
Example
  • R(a,b,c)
  • a, b 4 bytes integer c 100 bytes string
  • Tuple header 12 bytes
  • 120 bytes per tuple
  • Block 1024 bytes, block header 24 bytes.
  • T(R)10,000 ? B(R)1250
  • Spab, c(R)
  • 116 bytes per tuple, T(S)10,000, B(S)1250
  • Upa, b(R)
  • 20 bytes per tuple, T(U)10,000, B(U)200

7
Estimate the size of a selection
  • Selection operation reduces the number of tuples,
    but the size of each tuple remains the same
  • SsAc(R)
  • Estimation T(S)T(R)/V(R,a)
  • Ssaltc(R)
  • Estimation T(S)T(R)/3
  • SsA?c(R)
  • Estimation T(S)T(R) or T(S)T(R)(V(R,a)-1)/V(R,a
    )
  • For condition C is the AND of several equalities
    and inequalities, the effect will be the
    multiplication of the selectivity factor for each
    condition.

8
Example 16.21
  • R(a,b,c), Ssa10 AND blt20(R)
  • T(R)10,000, V(R,a)50
  • Estimation T(R)/(503)
  • Special cases
  • Ssa10 AND agt20(R)
  • Estimation T(R)/(503)
  • actual result 0
  • Query optimizer have to deal with special cases.

9
OR of conditions
  • SsC1 OR C2(R)
  • If assume no tuple will satisfy both conditions
  • The size will be the sum of the tuples satisfy
    each
  • It will even larger than size of R!
  • We can take the smaller one as the estimation
  • Another estimation (C1 and C2 are independent)
  • n(1-(1-m1/n)(1-m2/n))
  • n number of tuples
  • m1 number of tuples satisfy C1
  • m2 number of tuples satisfy C2

10
Example 16.22
  • R(a,b), T(R)10,000, V(R,a)50
  • Ssa10 OR blt20(R)
  • n10,000 m1200, m23333
  • Estimation
  • 10000(1-(1-200/10000)(1-3333/10000))

11
NOT in a selection
  • For NOT C, the estimation number of tuples
  • T(R) the estimated number that satisfy C

12
Estimation the size of a join
  • R(X,Y) ? S(Y,Z), Y is a single attribute
  • According to Y values
  • Disjointed T(R ? S)0
  • Y might be the key of S and a foreign key of R
    T(R ? S)T(R)
  • Almost all the tuples of R and S could have the
    same Y-values T(R ? S)T(R)T(S)

13
Simplified assumption
  • Containment of value sets
  • If Y appearing in many relations, then each
    relation get its values from a common set
    y1,y2,y3,. Thus, if V(R,Y)V(S,Y), every Y-value
    of R will be a Y-value in S.
  • Preservation of value sets
  • An attribute which is not the join attribute does
    not lose values from its set of possible values.
    For example, A V(R ? S, A)V(R,A) as well as V(S
    ? R, A)V(R,A)

14
Estimation under assumptions
  • R(X,Y) ? S(Y,Z), Y is a single attribute
  • Let V(R,Y)V(S,Y)
  • For each tuple t in R, it has 1/(V(S,Y) chance to
    join with a given tuple in S, thus expected
    number of tuples that t joins is T(S)/V(S,Y)
  • The estimated size of R ? S will be
    T(R)T(S)/V(S,Y)
  • If V(R,Y)V(S,Y)
  • The estimated size of R ? S will be
    T(R)T(S)/V(R,Y)
  • In general
  • Size T(R ? S) will be T(R)T(S)/max(V(R,Y),
    V(S,Y))

15
Example 16.23
Compute R ? S ? U
16
Example (Cont.)
  • (R ? S) ? U
  • T(R ? S) T(R)T(S)/max(V(R,b), V(S,b))40,000
  • Next
  • T(R ? S)T(U)/max(V(R ? S,c), V(U,c))400,000
  • R ? (S ? U)
  • T(S ? U) T(S)T(U)/max(V(S,c), V(U,c))20,000
  • Next
  • T(R)T(S ? U)/max(V(R,b), V(S ? U,b))400,000
  • Coincidence?
  • NO!

17
Multiple join attributes
  • R(X,Y) ? S(Y,Z), Y contains several attributes
  • Suppose R(x,y1,y2) ? S(y1,y2,z)
  • For a give tuple r in R, the possibility to agree
    on y1 with a tuple s in S is
  • 1/max(V(R,y1), V(S,y1))
  • For y2 1/max(V(R,y2), V(S,y2))
  • Thus, the expected joined pairs
  • T(R)T(S)/(max(V(R,y1),V(S,y1)) max(V(R,y2),
    V(S,y2)))

18
General rule
  • The size of natural join between two relations
  • The estimate of the size of R ? S is computed by
    multiplying T(R) by T(S) and dividing by the
    larger of V(R,y) and V(S,y) for each attribute y
    that is common to R and S.

19
Example 16.24
Compute
The estimate size is T(R)T(S)/max(V(R,b),V(S,d))m
ax(V(R,c),V(S,e))400
20
Example 16.25
21
Example 16.25
  • R(a,b) ? U(c,d)
  • Size T(R)T(U)5,000,000
  • Then, join S(b,c)
  • T(R)T(U)T(S)/(max(V(R,b),V(S,b))max(V(U,c),V(S,c)
    ))400,000
  • Same result generated!

22
Equijoins
  • The number of tuples in the result of an equijoin
    can be computed exactly as for a natural join,
    after accounting for the change in variable names.

23
Other theta-joins
  • A selection following a product, and
  • The number of tuples in a product is the number
    of tuples in the relations involved
  • Equality and inequality comparisons can be
    estimated using the techniques developed for
    selections

24
Joins of many relations
  • SR1?R2?R3??Rn, attribute A appears in k of the
    Ris, and the number of its sets of values in
    these k relations viV(Ri,A) are v1v2v3
    vk
  • Suppose pick a tuple from each relation, what is
    the probability that all tuples selected agree on
    A?
  • Answer
  • 1/v2v3...vk
  • start with the product of the number of tuples in
    each relation. Then, for each attribute A
    appearing at least twice, divided by all but the
    least of these V(R,A)s.
  • The number of values that remains in A after join
    will be the least of these V(R,A)s

25
Example 16.26
  • R(a,b,c)?S(b,c,d)?U(b,e)

26
Example 16.26 (Cont.)
  • The estimate size
  • T(R)T(S)T(U)/(( two largest of (V(R,b), V(S,b),
    V(U,b)max(V(R,c),V(S,c)100020005000/((50200)
    200)5000
  • The number of values for each of the attribute
    the least value count
  • a 100 b20 c100 d400 e500

27
Conclusion
  • No matter how to order the natural joins of n
    relations
  • The estimate rules, applied to each join
    individually, yields the same estimate size, and
  • This size is the same as we apply the rules for
    the join all n relations as a whole.

28
The size of Union
  • Bag union
  • The sum
  • Set union
  • The larger of two relations to the sum
  • Using something in the middle

29
The size of intersection
  • result
  • Maybe 0 to the smaller of the two arguments
  • Estimation
  • Half of the smaller
  • Consider as a special case of the natural join
  • Not good! (Sometimes for bags)
  • E.g. R(a,b) ?B S(a,b), R 2 copies of (0,1), S 3
    copies
  • V(R,a)V(S,a)V(R,b)V(S,b)1
  • T(R)2, T(S)3
  • Estimation will be 2x3/(max(1,1)max(1,1))6, not
    correct.

30
Difference
  • R-S
  • Result between T(R) to T(R)-T(S)
  • Estimation T(R)-(1/2)T(S)

31
Duplication elimination
  • Result
  • Between 1 to the size of R.
  • Estimation smaller of (1/2)T(R) and the product
    of all the V(R,ai)s

32
Grouping and aggregation
  • Similar to duplication elimination
  • Estimation smaller of (1/2)T(R) or the product
    of V(R,A)s

33
Exercises
  • 16.4.1 a) b) g) i)
  • 16.4.2

34
Estimate the cost of certain expressions
  • The cost (number of I/Os) is influenced by
  • The particular logical operator chosen to
    implement the query
  • The size of intermediate results
  • The physical operators used to implement logical
    operators
  • The ordering of similar operations
  • The method of passing arguments from one physical
    operator to the next

35
Statistical information
  • DBMS have to keep important information such as
    T(R), B(R), V(R,a) etc.
  • Histogram used to compute values for a given
    attribute
  • Equal-wdith a width w is chosen, along with a
    constant v0(the smallest value). Counts for
    v0vltv0w, v0wvltv02w,
  • Equal-height common percentiles, pick some
    fraction p, and list the lowest value, the
    fraction 2p, and so on
  • Most-frequent-values list most common values and
    their numbers of occurrences. Can be combined
    with previous two mechanisms

36
Example 16.27
  • R(a,b)?S(b,c)
  • R.b 1200, 0150, 5100, others 550
  • S.b 0100, 180, 270, others 250
  • V(R,b)14, V(S,b)13.
  • R 50 occurrence of each other value
  • S 25 occurrence of each other value
  • For 0 15010015,000
  • For 1 200 7016,000
  • For others
  • 2 70503500 5 100252500
  • Other nine values 5025911,250
  • Total 1500016000350025001125048250
  • compare to simple estimation 1000500/1435714

37
Example 16.28
  • Jan(day, temp)
  • July(day, temp)
  • Query
  • SELECT Jan.day, July.day
  • FROM Jan, July
  • WHERE Jan.tempJuly.temp
  • Using equal-width histogram.

38
Example 16.28 (Cont.)
39
Example 16.28 (Cont.)
  • We calculate for each bands
  • V10 for each band
  • 40-49 105/105
  • 50-59 520/1010
  • All other ranges have 0 joined tuples
  • Total 51015 tuples
  • If no histogram information
  • V100 for 0-100
  • 245245/100600 tuples

40
Computation of statistics
  • Recomputation too often is not good
  • Statistics tend not to change radically in a
    short time
  • Somehow, inaccurate statistics are still useful
  • Keeping up-to-date will cause hot spot
  • Recomputation should be triggered automatically
    after some period of time or after some number of
    updates
  • Database administrator can issue recomputation
    command directly if poor-performing query plans
    are observed.
  • Recomputation is complex and time-consuming.
  • Sometimes, we can sample a fraction of data when
    doing recomputation.

41
Heuristics for reducing cost of logical query plan
  • Estimating cost both before and after a
    transformation can reduce the cost or avoid the
    transformation otherwise.

42
Example 16.29
43
Example 16.29 (Cont.)
44
Example 16.29 (Cont.)
  • To compare two plans, we add the estimated sizes
    for all the nodes except the root and the leaves.
  • (a) 1005010001150
  • (b) 10010001100
  • Different answer if we have a smaller V(R,b) or
    V(S,b)

45
Enumerating physical plans
  • Baseline approach
  • Exhaustive consider all the combinations of
    choices, and choose the one with the smallest
    cost
  • Other approaches are more efficient
  • Top-down work down the tree of the logical query
    plan from the root.
  • Bottom-up for each subexpression, compute the
    costs of all possible ways.
  • Both approaches can limit the choices effectively

46
Bottom-up approaches
  • Dynamic programming
  • Simplified bottom-up method
  • Consider only the best plan for each
    subexpression
  • Not guarantee to yield the best plan, but it
    often does
  • Selinger-style(system-R-style)
  • Exploit additional properties that some of the
    plans for a subexpression may have
  • Produce optimal overall plan.

47
Heuristic selection
  • If the logical plan calls for a selection sAc
    (R), and R has an index on A, then perform an
    index-scan to obtain the tuples with Ac
  • More generally, for selections involving Ac and
    other conditions, implement as the index-scan
    selection followed by a further selection.
  • If an argument of a join has an index on the join
    attribute(s), use an index-join with that
    relation in the inner loop
  • If one argument of a join is sorted on the join
    attribute(s), prefer a sort-join to a hash-join,
    but not necessarily to an index-join
  • Computing the union or intersection of three or
    more relations, group the smallest relations
    first

48
Branch-and-bound enumeration
  • Using heuristics to find a physical plan first,
    let its cost be C
  • Consider other plans for subqueries
  • If cost gt C, ignore
  • If cost lt C, replace the corresponding
    subexpression
  • Advantage
  • We can choose to cut off the search and take the
    best plan found so far. Reduce the calculation
    time

49
Hill climbing
  • Start with a heuristically selected physical plan
  • Then, make small changes to the plan
  • Replacing one method for an operator by another
  • Reordering join, etc.
  • If find a plan with smaller cost, change
  • If no small modification can yield lower cost,
    choose the current plan

50
Dynamic programming
  • Keep for each subexpression only the plan of
    least cost
  • As we work up the tree, consider possible
    implementations of each node, assuming the best
    plan for each subexpression is also used

51
Selinger-style optimization
  • Improvement on dynamic programming
  • Keeping other plans that have higher cost for
    each subexpression, and produce a result that is
    sorted in an order that may useful in higher
    layers
  • The attribute(s) specified in a sort operator at
    the root
  • The grouping attribute(s) of a later group-by
    operator
  • The join attribute(s) of a later join

52
Exercises
  • 16.5.2
  • 16.5.3
  • 16.5.4 a) b)
Write a Comment
User Comments (0)
About PowerShow.com