Title: Advanced Database System
1Advanced Database System
2Cost-based enumeration
- When we turn the preferred logical query plan
into a physical plan, we need to consider
different physical plans and evaluate or estimate
the cost of each - We pick the physical query plan with the least
cost and pass this plan to the query execution
engine.
3Generating physical plans
- To estimate each physical plan, we select
- An order and grouping for associative-and-commutat
ive operations (joins, unions and intersections) - An algorithm for each operator in the logical
plan - Additional operators that are needed
- The way in which arguments are passed from one
operator to the next.
4Estimate sizes of the intermediate relations
- We need
- Given accurate estimates
- Easy to compute
- Are logically consistent
5Estimate the size of a projection
- For projection, the size of the result is
computable. - It produces a result tuple for every argument
tuple. - The only change is the length of tuples
- Consider as bag projection
- For set projection, d operator need to be used.
6Example
- R(a,b,c)
- a, b 4 bytes integer c 100 bytes string
- Tuple header 12 bytes
- 120 bytes per tuple
- Block 1024 bytes, block header 24 bytes.
- T(R)10,000 ? B(R)1250
- Spab, c(R)
- 116 bytes per tuple, T(S)10,000, B(S)1250
- Upa, b(R)
- 20 bytes per tuple, T(U)10,000, B(U)200
7Estimate the size of a selection
- Selection operation reduces the number of tuples,
but the size of each tuple remains the same - SsAc(R)
- Estimation T(S)T(R)/V(R,a)
- Ssaltc(R)
- Estimation T(S)T(R)/3
- SsA?c(R)
- Estimation T(S)T(R) or T(S)T(R)(V(R,a)-1)/V(R,a
) - For condition C is the AND of several equalities
and inequalities, the effect will be the
multiplication of the selectivity factor for each
condition.
8Example 16.21
- R(a,b,c), Ssa10 AND blt20(R)
- T(R)10,000, V(R,a)50
- Estimation T(R)/(503)
- Special cases
- Ssa10 AND agt20(R)
- Estimation T(R)/(503)
- actual result 0
- Query optimizer have to deal with special cases.
9OR of conditions
- SsC1 OR C2(R)
- If assume no tuple will satisfy both conditions
- The size will be the sum of the tuples satisfy
each - It will even larger than size of R!
- We can take the smaller one as the estimation
- Another estimation (C1 and C2 are independent)
- n(1-(1-m1/n)(1-m2/n))
- n number of tuples
- m1 number of tuples satisfy C1
- m2 number of tuples satisfy C2
10Example 16.22
- R(a,b), T(R)10,000, V(R,a)50
- Ssa10 OR blt20(R)
- n10,000 m1200, m23333
- Estimation
- 10000(1-(1-200/10000)(1-3333/10000))
11NOT in a selection
- For NOT C, the estimation number of tuples
- T(R) the estimated number that satisfy C
12Estimation the size of a join
- R(X,Y) ? S(Y,Z), Y is a single attribute
- According to Y values
- Disjointed T(R ? S)0
- Y might be the key of S and a foreign key of R
T(R ? S)T(R) - Almost all the tuples of R and S could have the
same Y-values T(R ? S)T(R)T(S)
13Simplified assumption
- Containment of value sets
- If Y appearing in many relations, then each
relation get its values from a common set
y1,y2,y3,. Thus, if V(R,Y)V(S,Y), every Y-value
of R will be a Y-value in S. - Preservation of value sets
- An attribute which is not the join attribute does
not lose values from its set of possible values.
For example, A V(R ? S, A)V(R,A) as well as V(S
? R, A)V(R,A)
14Estimation under assumptions
- R(X,Y) ? S(Y,Z), Y is a single attribute
- Let V(R,Y)V(S,Y)
- For each tuple t in R, it has 1/(V(S,Y) chance to
join with a given tuple in S, thus expected
number of tuples that t joins is T(S)/V(S,Y) - The estimated size of R ? S will be
T(R)T(S)/V(S,Y) - If V(R,Y)V(S,Y)
- The estimated size of R ? S will be
T(R)T(S)/V(R,Y) - In general
- Size T(R ? S) will be T(R)T(S)/max(V(R,Y),
V(S,Y))
15Example 16.23
Compute R ? S ? U
16Example (Cont.)
- (R ? S) ? U
- T(R ? S) T(R)T(S)/max(V(R,b), V(S,b))40,000
- Next
- T(R ? S)T(U)/max(V(R ? S,c), V(U,c))400,000
- R ? (S ? U)
- T(S ? U) T(S)T(U)/max(V(S,c), V(U,c))20,000
- Next
- T(R)T(S ? U)/max(V(R,b), V(S ? U,b))400,000
- Coincidence?
- NO!
17Multiple join attributes
- R(X,Y) ? S(Y,Z), Y contains several attributes
- Suppose R(x,y1,y2) ? S(y1,y2,z)
- For a give tuple r in R, the possibility to agree
on y1 with a tuple s in S is - 1/max(V(R,y1), V(S,y1))
- For y2 1/max(V(R,y2), V(S,y2))
- Thus, the expected joined pairs
- T(R)T(S)/(max(V(R,y1),V(S,y1)) max(V(R,y2),
V(S,y2)))
18General rule
- The size of natural join between two relations
- The estimate of the size of R ? S is computed by
multiplying T(R) by T(S) and dividing by the
larger of V(R,y) and V(S,y) for each attribute y
that is common to R and S.
19Example 16.24
Compute
The estimate size is T(R)T(S)/max(V(R,b),V(S,d))m
ax(V(R,c),V(S,e))400
20Example 16.25
21Example 16.25
- R(a,b) ? U(c,d)
- Size T(R)T(U)5,000,000
- Then, join S(b,c)
- T(R)T(U)T(S)/(max(V(R,b),V(S,b))max(V(U,c),V(S,c)
))400,000 - Same result generated!
22Equijoins
- The number of tuples in the result of an equijoin
can be computed exactly as for a natural join,
after accounting for the change in variable names.
23Other theta-joins
- A selection following a product, and
- The number of tuples in a product is the number
of tuples in the relations involved - Equality and inequality comparisons can be
estimated using the techniques developed for
selections
24Joins of many relations
- SR1?R2?R3??Rn, attribute A appears in k of the
Ris, and the number of its sets of values in
these k relations viV(Ri,A) are v1v2v3
vk - Suppose pick a tuple from each relation, what is
the probability that all tuples selected agree on
A? - Answer
- 1/v2v3...vk
- start with the product of the number of tuples in
each relation. Then, for each attribute A
appearing at least twice, divided by all but the
least of these V(R,A)s. - The number of values that remains in A after join
will be the least of these V(R,A)s
25Example 16.26
26Example 16.26 (Cont.)
- The estimate size
- T(R)T(S)T(U)/(( two largest of (V(R,b), V(S,b),
V(U,b)max(V(R,c),V(S,c)100020005000/((50200)
200)5000 - The number of values for each of the attribute
the least value count - a 100 b20 c100 d400 e500
27Conclusion
- No matter how to order the natural joins of n
relations - The estimate rules, applied to each join
individually, yields the same estimate size, and - This size is the same as we apply the rules for
the join all n relations as a whole.
28The size of Union
- Bag union
- The sum
- Set union
- The larger of two relations to the sum
- Using something in the middle
29The size of intersection
- result
- Maybe 0 to the smaller of the two arguments
- Estimation
- Half of the smaller
- Consider as a special case of the natural join
- Not good! (Sometimes for bags)
- E.g. R(a,b) ?B S(a,b), R 2 copies of (0,1), S 3
copies - V(R,a)V(S,a)V(R,b)V(S,b)1
- T(R)2, T(S)3
- Estimation will be 2x3/(max(1,1)max(1,1))6, not
correct.
30Difference
- R-S
- Result between T(R) to T(R)-T(S)
- Estimation T(R)-(1/2)T(S)
31Duplication elimination
- Result
- Between 1 to the size of R.
- Estimation smaller of (1/2)T(R) and the product
of all the V(R,ai)s
32Grouping and aggregation
- Similar to duplication elimination
- Estimation smaller of (1/2)T(R) or the product
of V(R,A)s
33Exercises
- 16.4.1 a) b) g) i)
- 16.4.2
34Estimate the cost of certain expressions
- The cost (number of I/Os) is influenced by
- The particular logical operator chosen to
implement the query - The size of intermediate results
- The physical operators used to implement logical
operators - The ordering of similar operations
- The method of passing arguments from one physical
operator to the next
35Statistical information
- DBMS have to keep important information such as
T(R), B(R), V(R,a) etc. - Histogram used to compute values for a given
attribute - Equal-wdith a width w is chosen, along with a
constant v0(the smallest value). Counts for
v0vltv0w, v0wvltv02w, - Equal-height common percentiles, pick some
fraction p, and list the lowest value, the
fraction 2p, and so on - Most-frequent-values list most common values and
their numbers of occurrences. Can be combined
with previous two mechanisms
36Example 16.27
- R(a,b)?S(b,c)
- R.b 1200, 0150, 5100, others 550
- S.b 0100, 180, 270, others 250
- V(R,b)14, V(S,b)13.
- R 50 occurrence of each other value
- S 25 occurrence of each other value
- For 0 15010015,000
- For 1 200 7016,000
- For others
- 2 70503500 5 100252500
- Other nine values 5025911,250
- Total 1500016000350025001125048250
- compare to simple estimation 1000500/1435714
37Example 16.28
- Jan(day, temp)
- July(day, temp)
- Query
- SELECT Jan.day, July.day
- FROM Jan, July
- WHERE Jan.tempJuly.temp
- Using equal-width histogram.
38Example 16.28 (Cont.)
39Example 16.28 (Cont.)
- We calculate for each bands
- V10 for each band
- 40-49 105/105
- 50-59 520/1010
- All other ranges have 0 joined tuples
- Total 51015 tuples
- If no histogram information
- V100 for 0-100
- 245245/100600 tuples
40Computation of statistics
- Recomputation too often is not good
- Statistics tend not to change radically in a
short time - Somehow, inaccurate statistics are still useful
- Keeping up-to-date will cause hot spot
- Recomputation should be triggered automatically
after some period of time or after some number of
updates - Database administrator can issue recomputation
command directly if poor-performing query plans
are observed. - Recomputation is complex and time-consuming.
- Sometimes, we can sample a fraction of data when
doing recomputation.
41Heuristics for reducing cost of logical query plan
- Estimating cost both before and after a
transformation can reduce the cost or avoid the
transformation otherwise.
42Example 16.29
43Example 16.29 (Cont.)
44Example 16.29 (Cont.)
- To compare two plans, we add the estimated sizes
for all the nodes except the root and the leaves. - (a) 1005010001150
- (b) 10010001100
- Different answer if we have a smaller V(R,b) or
V(S,b)
45Enumerating physical plans
- Baseline approach
- Exhaustive consider all the combinations of
choices, and choose the one with the smallest
cost - Other approaches are more efficient
- Top-down work down the tree of the logical query
plan from the root. - Bottom-up for each subexpression, compute the
costs of all possible ways. - Both approaches can limit the choices effectively
46Bottom-up approaches
- Dynamic programming
- Simplified bottom-up method
- Consider only the best plan for each
subexpression - Not guarantee to yield the best plan, but it
often does - Selinger-style(system-R-style)
- Exploit additional properties that some of the
plans for a subexpression may have - Produce optimal overall plan.
47Heuristic selection
- If the logical plan calls for a selection sAc
(R), and R has an index on A, then perform an
index-scan to obtain the tuples with Ac - More generally, for selections involving Ac and
other conditions, implement as the index-scan
selection followed by a further selection. - If an argument of a join has an index on the join
attribute(s), use an index-join with that
relation in the inner loop - If one argument of a join is sorted on the join
attribute(s), prefer a sort-join to a hash-join,
but not necessarily to an index-join - Computing the union or intersection of three or
more relations, group the smallest relations
first
48Branch-and-bound enumeration
- Using heuristics to find a physical plan first,
let its cost be C - Consider other plans for subqueries
- If cost gt C, ignore
- If cost lt C, replace the corresponding
subexpression - Advantage
- We can choose to cut off the search and take the
best plan found so far. Reduce the calculation
time
49Hill climbing
- Start with a heuristically selected physical plan
- Then, make small changes to the plan
- Replacing one method for an operator by another
- Reordering join, etc.
- If find a plan with smaller cost, change
- If no small modification can yield lower cost,
choose the current plan
50Dynamic programming
- Keep for each subexpression only the plan of
least cost - As we work up the tree, consider possible
implementations of each node, assuming the best
plan for each subexpression is also used
51Selinger-style optimization
- Improvement on dynamic programming
- Keeping other plans that have higher cost for
each subexpression, and produce a result that is
sorted in an order that may useful in higher
layers - The attribute(s) specified in a sort operator at
the root - The grouping attribute(s) of a later group-by
operator - The join attribute(s) of a later join
52Exercises
- 16.5.2
- 16.5.3
- 16.5.4 a) b)