Advanced Database System - PowerPoint PPT Presentation


PPT – Advanced Database System PowerPoint presentation | free to view - id: 12ffed-YWRmZ


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation

Advanced Database System


When we turn the preferred logical query plan into a physical plan, we need to ... Disjointed: T(R S)=0. Y might be the key of S and a foreign key of R: T(R S)=T(R) ... – PowerPoint PPT presentation

Number of Views:211
Avg rating:3.0/5.0
Slides: 53
Provided by: Zhiy2


Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Advanced Database System

Advanced Database System
  • CS 641
  • Lecture 13

Cost-based enumeration
  • When we turn the preferred logical query plan
    into a physical plan, we need to consider
    different physical plans and evaluate or estimate
    the cost of each
  • We pick the physical query plan with the least
    cost and pass this plan to the query execution

Generating physical plans
  • To estimate each physical plan, we select
  • An order and grouping for associative-and-commutat
    ive operations (joins, unions and intersections)
  • An algorithm for each operator in the logical
  • Additional operators that are needed
  • The way in which arguments are passed from one
    operator to the next.

Estimate sizes of the intermediate relations
  • We need
  • Given accurate estimates
  • Easy to compute
  • Are logically consistent

Estimate the size of a projection
  • For projection, the size of the result is
  • It produces a result tuple for every argument
  • The only change is the length of tuples
  • Consider as bag projection
  • For set projection, d operator need to be used.

  • R(a,b,c)
  • a, b 4 bytes integer c 100 bytes string
  • Tuple header 12 bytes
  • 120 bytes per tuple
  • Block 1024 bytes, block header 24 bytes.
  • T(R)10,000 ? B(R)1250
  • Spab, c(R)
  • 116 bytes per tuple, T(S)10,000, B(S)1250
  • Upa, b(R)
  • 20 bytes per tuple, T(U)10,000, B(U)200

Estimate the size of a selection
  • Selection operation reduces the number of tuples,
    but the size of each tuple remains the same
  • SsAc(R)
  • Estimation T(S)T(R)/V(R,a)
  • Ssaltc(R)
  • Estimation T(S)T(R)/3
  • SsA?c(R)
  • Estimation T(S)T(R) or T(S)T(R)(V(R,a)-1)/V(R,a
  • For condition C is the AND of several equalities
    and inequalities, the effect will be the
    multiplication of the selectivity factor for each

Example 16.21
  • R(a,b,c), Ssa10 AND blt20(R)
  • T(R)10,000, V(R,a)50
  • Estimation T(R)/(503)
  • Special cases
  • Ssa10 AND agt20(R)
  • Estimation T(R)/(503)
  • actual result 0
  • Query optimizer have to deal with special cases.

OR of conditions
  • SsC1 OR C2(R)
  • If assume no tuple will satisfy both conditions
  • The size will be the sum of the tuples satisfy
  • It will even larger than size of R!
  • We can take the smaller one as the estimation
  • Another estimation (C1 and C2 are independent)
  • n(1-(1-m1/n)(1-m2/n))
  • n number of tuples
  • m1 number of tuples satisfy C1
  • m2 number of tuples satisfy C2

Example 16.22
  • R(a,b), T(R)10,000, V(R,a)50
  • Ssa10 OR blt20(R)
  • n10,000 m1200, m23333
  • Estimation
  • 10000(1-(1-200/10000)(1-3333/10000))

NOT in a selection
  • For NOT C, the estimation number of tuples
  • T(R) the estimated number that satisfy C

Estimation the size of a join
  • R(X,Y) ? S(Y,Z), Y is a single attribute
  • According to Y values
  • Disjointed T(R ? S)0
  • Y might be the key of S and a foreign key of R
    T(R ? S)T(R)
  • Almost all the tuples of R and S could have the
    same Y-values T(R ? S)T(R)T(S)

Simplified assumption
  • Containment of value sets
  • If Y appearing in many relations, then each
    relation get its values from a common set
    y1,y2,y3,. Thus, if V(R,Y)V(S,Y), every Y-value
    of R will be a Y-value in S.
  • Preservation of value sets
  • An attribute which is not the join attribute does
    not lose values from its set of possible values.
    For example, A V(R ? S, A)V(R,A) as well as V(S
    ? R, A)V(R,A)

Estimation under assumptions
  • R(X,Y) ? S(Y,Z), Y is a single attribute
  • Let V(R,Y)V(S,Y)
  • For each tuple t in R, it has 1/(V(S,Y) chance to
    join with a given tuple in S, thus expected
    number of tuples that t joins is T(S)/V(S,Y)
  • The estimated size of R ? S will be
  • If V(R,Y)V(S,Y)
  • The estimated size of R ? S will be
  • In general
  • Size T(R ? S) will be T(R)T(S)/max(V(R,Y),

Example 16.23
Compute R ? S ? U
Example (Cont.)
  • (R ? S) ? U
  • T(R ? S) T(R)T(S)/max(V(R,b), V(S,b))40,000
  • Next
  • T(R ? S)T(U)/max(V(R ? S,c), V(U,c))400,000
  • R ? (S ? U)
  • T(S ? U) T(S)T(U)/max(V(S,c), V(U,c))20,000
  • Next
  • T(R)T(S ? U)/max(V(R,b), V(S ? U,b))400,000
  • Coincidence?
  • NO!

Multiple join attributes
  • R(X,Y) ? S(Y,Z), Y contains several attributes
  • Suppose R(x,y1,y2) ? S(y1,y2,z)
  • For a give tuple r in R, the possibility to agree
    on y1 with a tuple s in S is
  • 1/max(V(R,y1), V(S,y1))
  • For y2 1/max(V(R,y2), V(S,y2))
  • Thus, the expected joined pairs
  • T(R)T(S)/(max(V(R,y1),V(S,y1)) max(V(R,y2),

General rule
  • The size of natural join between two relations
  • The estimate of the size of R ? S is computed by
    multiplying T(R) by T(S) and dividing by the
    larger of V(R,y) and V(S,y) for each attribute y
    that is common to R and S.

Example 16.24
The estimate size is T(R)T(S)/max(V(R,b),V(S,d))m
Example 16.25
Example 16.25
  • R(a,b) ? U(c,d)
  • Size T(R)T(U)5,000,000
  • Then, join S(b,c)
  • T(R)T(U)T(S)/(max(V(R,b),V(S,b))max(V(U,c),V(S,c)
  • Same result generated!

  • The number of tuples in the result of an equijoin
    can be computed exactly as for a natural join,
    after accounting for the change in variable names.

Other theta-joins
  • A selection following a product, and
  • The number of tuples in a product is the number
    of tuples in the relations involved
  • Equality and inequality comparisons can be
    estimated using the techniques developed for

Joins of many relations
  • SR1?R2?R3??Rn, attribute A appears in k of the
    Ris, and the number of its sets of values in
    these k relations viV(Ri,A) are v1v2v3
  • Suppose pick a tuple from each relation, what is
    the probability that all tuples selected agree on
  • Answer
  • 1/v2v3...vk
  • start with the product of the number of tuples in
    each relation. Then, for each attribute A
    appearing at least twice, divided by all but the
    least of these V(R,A)s.
  • The number of values that remains in A after join
    will be the least of these V(R,A)s

Example 16.26
  • R(a,b,c)?S(b,c,d)?U(b,e)

Example 16.26 (Cont.)
  • The estimate size
  • T(R)T(S)T(U)/(( two largest of (V(R,b), V(S,b),
  • The number of values for each of the attribute
    the least value count
  • a 100 b20 c100 d400 e500

  • No matter how to order the natural joins of n
  • The estimate rules, applied to each join
    individually, yields the same estimate size, and
  • This size is the same as we apply the rules for
    the join all n relations as a whole.

The size of Union
  • Bag union
  • The sum
  • Set union
  • The larger of two relations to the sum
  • Using something in the middle

The size of intersection
  • result
  • Maybe 0 to the smaller of the two arguments
  • Estimation
  • Half of the smaller
  • Consider as a special case of the natural join
  • Not good! (Sometimes for bags)
  • E.g. R(a,b) ?B S(a,b), R 2 copies of (0,1), S 3
  • V(R,a)V(S,a)V(R,b)V(S,b)1
  • T(R)2, T(S)3
  • Estimation will be 2x3/(max(1,1)max(1,1))6, not

  • R-S
  • Result between T(R) to T(R)-T(S)
  • Estimation T(R)-(1/2)T(S)

Duplication elimination
  • Result
  • Between 1 to the size of R.
  • Estimation smaller of (1/2)T(R) and the product
    of all the V(R,ai)s

Grouping and aggregation
  • Similar to duplication elimination
  • Estimation smaller of (1/2)T(R) or the product
    of V(R,A)s

  • 16.4.1 a) b) g) i)
  • 16.4.2

Estimate the cost of certain expressions
  • The cost (number of I/Os) is influenced by
  • The particular logical operator chosen to
    implement the query
  • The size of intermediate results
  • The physical operators used to implement logical
  • The ordering of similar operations
  • The method of passing arguments from one physical
    operator to the next

Statistical information
  • DBMS have to keep important information such as
    T(R), B(R), V(R,a) etc.
  • Histogram used to compute values for a given
  • Equal-wdith a width w is chosen, along with a
    constant v0(the smallest value). Counts for
    v0vltv0w, v0wvltv02w,
  • Equal-height common percentiles, pick some
    fraction p, and list the lowest value, the
    fraction 2p, and so on
  • Most-frequent-values list most common values and
    their numbers of occurrences. Can be combined
    with previous two mechanisms

Example 16.27
  • R(a,b)?S(b,c)
  • R.b 1200, 0150, 5100, others 550
  • S.b 0100, 180, 270, others 250
  • V(R,b)14, V(S,b)13.
  • R 50 occurrence of each other value
  • S 25 occurrence of each other value
  • For 0 15010015,000
  • For 1 200 7016,000
  • For others
  • 2 70503500 5 100252500
  • Other nine values 5025911,250
  • Total 1500016000350025001125048250
  • compare to simple estimation 1000500/1435714

Example 16.28
  • Jan(day, temp)
  • July(day, temp)
  • Query
  • FROM Jan, July
  • WHERE Jan.tempJuly.temp
  • Using equal-width histogram.

Example 16.28 (Cont.)
Example 16.28 (Cont.)
  • We calculate for each bands
  • V10 for each band
  • 40-49 105/105
  • 50-59 520/1010
  • All other ranges have 0 joined tuples
  • Total 51015 tuples
  • If no histogram information
  • V100 for 0-100
  • 245245/100600 tuples

Computation of statistics
  • Recomputation too often is not good
  • Statistics tend not to change radically in a
    short time
  • Somehow, inaccurate statistics are still useful
  • Keeping up-to-date will cause hot spot
  • Recomputation should be triggered automatically
    after some period of time or after some number of
  • Database administrator can issue recomputation
    command directly if poor-performing query plans
    are observed.
  • Recomputation is complex and time-consuming.
  • Sometimes, we can sample a fraction of data when
    doing recomputation.

Heuristics for reducing cost of logical query plan
  • Estimating cost both before and after a
    transformation can reduce the cost or avoid the
    transformation otherwise.

Example 16.29
Example 16.29 (Cont.)
Example 16.29 (Cont.)
  • To compare two plans, we add the estimated sizes
    for all the nodes except the root and the leaves.
  • (a) 1005010001150
  • (b) 10010001100
  • Different answer if we have a smaller V(R,b) or

Enumerating physical plans
  • Baseline approach
  • Exhaustive consider all the combinations of
    choices, and choose the one with the smallest
  • Other approaches are more efficient
  • Top-down work down the tree of the logical query
    plan from the root.
  • Bottom-up for each subexpression, compute the
    costs of all possible ways.
  • Both approaches can limit the choices effectively

Bottom-up approaches
  • Dynamic programming
  • Simplified bottom-up method
  • Consider only the best plan for each
  • Not guarantee to yield the best plan, but it
    often does
  • Selinger-style(system-R-style)
  • Exploit additional properties that some of the
    plans for a subexpression may have
  • Produce optimal overall plan.

Heuristic selection
  • If the logical plan calls for a selection sAc
    (R), and R has an index on A, then perform an
    index-scan to obtain the tuples with Ac
  • More generally, for selections involving Ac and
    other conditions, implement as the index-scan
    selection followed by a further selection.
  • If an argument of a join has an index on the join
    attribute(s), use an index-join with that
    relation in the inner loop
  • If one argument of a join is sorted on the join
    attribute(s), prefer a sort-join to a hash-join,
    but not necessarily to an index-join
  • Computing the union or intersection of three or
    more relations, group the smallest relations

Branch-and-bound enumeration
  • Using heuristics to find a physical plan first,
    let its cost be C
  • Consider other plans for subqueries
  • If cost gt C, ignore
  • If cost lt C, replace the corresponding
  • Advantage
  • We can choose to cut off the search and take the
    best plan found so far. Reduce the calculation

Hill climbing
  • Start with a heuristically selected physical plan
  • Then, make small changes to the plan
  • Replacing one method for an operator by another
  • Reordering join, etc.
  • If find a plan with smaller cost, change
  • If no small modification can yield lower cost,
    choose the current plan

Dynamic programming
  • Keep for each subexpression only the plan of
    least cost
  • As we work up the tree, consider possible
    implementations of each node, assuming the best
    plan for each subexpression is also used

Selinger-style optimization
  • Improvement on dynamic programming
  • Keeping other plans that have higher cost for
    each subexpression, and produce a result that is
    sorted in an order that may useful in higher
  • The attribute(s) specified in a sort operator at
    the root
  • The grouping attribute(s) of a later group-by
  • The join attribute(s) of a later join

  • 16.5.2
  • 16.5.3
  • 16.5.4 a) b)