Loading...

PPT – Advanced Database System PowerPoint presentation | free to view - id: 12ffed-YWRmZ

The Adobe Flash plugin is needed to view this content

Advanced Database System

- CS 641
- Lecture 13

Cost-based enumeration

- When we turn the preferred logical query plan

into a physical plan, we need to consider

different physical plans and evaluate or estimate

the cost of each - We pick the physical query plan with the least

cost and pass this plan to the query execution

engine.

Generating physical plans

- To estimate each physical plan, we select
- An order and grouping for associative-and-commutat

ive operations (joins, unions and intersections) - An algorithm for each operator in the logical

plan - Additional operators that are needed
- The way in which arguments are passed from one

operator to the next.

Estimate sizes of the intermediate relations

- We need
- Given accurate estimates
- Easy to compute
- Are logically consistent

Estimate the size of a projection

- For projection, the size of the result is

computable. - It produces a result tuple for every argument

tuple. - The only change is the length of tuples
- Consider as bag projection
- For set projection, d operator need to be used.

Example

- R(a,b,c)
- a, b 4 bytes integer c 100 bytes string
- Tuple header 12 bytes
- 120 bytes per tuple
- Block 1024 bytes, block header 24 bytes.
- T(R)10,000 ? B(R)1250
- Spab, c(R)
- 116 bytes per tuple, T(S)10,000, B(S)1250
- Upa, b(R)
- 20 bytes per tuple, T(U)10,000, B(U)200

Estimate the size of a selection

- Selection operation reduces the number of tuples,

but the size of each tuple remains the same - SsAc(R)
- Estimation T(S)T(R)/V(R,a)
- Ssaltc(R)
- Estimation T(S)T(R)/3
- SsA?c(R)
- Estimation T(S)T(R) or T(S)T(R)(V(R,a)-1)/V(R,a

) - For condition C is the AND of several equalities

and inequalities, the effect will be the

multiplication of the selectivity factor for each

condition.

Example 16.21

- R(a,b,c), Ssa10 AND blt20(R)
- T(R)10,000, V(R,a)50
- Estimation T(R)/(503)
- Special cases
- Ssa10 AND agt20(R)
- Estimation T(R)/(503)
- actual result 0
- Query optimizer have to deal with special cases.

OR of conditions

- SsC1 OR C2(R)
- If assume no tuple will satisfy both conditions
- The size will be the sum of the tuples satisfy

each - It will even larger than size of R!
- We can take the smaller one as the estimation
- Another estimation (C1 and C2 are independent)
- n(1-(1-m1/n)(1-m2/n))
- n number of tuples
- m1 number of tuples satisfy C1
- m2 number of tuples satisfy C2

Example 16.22

- R(a,b), T(R)10,000, V(R,a)50
- Ssa10 OR blt20(R)
- n10,000 m1200, m23333
- Estimation
- 10000(1-(1-200/10000)(1-3333/10000))

NOT in a selection

- For NOT C, the estimation number of tuples
- T(R) the estimated number that satisfy C

Estimation the size of a join

- R(X,Y) ? S(Y,Z), Y is a single attribute
- According to Y values
- Disjointed T(R ? S)0
- Y might be the key of S and a foreign key of R

T(R ? S)T(R) - Almost all the tuples of R and S could have the

same Y-values T(R ? S)T(R)T(S)

Simplified assumption

- Containment of value sets
- If Y appearing in many relations, then each

relation get its values from a common set

y1,y2,y3,. Thus, if V(R,Y)V(S,Y), every Y-value

of R will be a Y-value in S. - Preservation of value sets
- An attribute which is not the join attribute does

not lose values from its set of possible values.

For example, A V(R ? S, A)V(R,A) as well as V(S

? R, A)V(R,A)

Estimation under assumptions

- R(X,Y) ? S(Y,Z), Y is a single attribute
- Let V(R,Y)V(S,Y)
- For each tuple t in R, it has 1/(V(S,Y) chance to

join with a given tuple in S, thus expected

number of tuples that t joins is T(S)/V(S,Y) - The estimated size of R ? S will be

T(R)T(S)/V(S,Y) - If V(R,Y)V(S,Y)
- The estimated size of R ? S will be

T(R)T(S)/V(R,Y) - In general
- Size T(R ? S) will be T(R)T(S)/max(V(R,Y),

V(S,Y))

Example 16.23

Compute R ? S ? U

Example (Cont.)

- (R ? S) ? U
- T(R ? S) T(R)T(S)/max(V(R,b), V(S,b))40,000
- Next
- T(R ? S)T(U)/max(V(R ? S,c), V(U,c))400,000
- R ? (S ? U)
- T(S ? U) T(S)T(U)/max(V(S,c), V(U,c))20,000
- Next
- T(R)T(S ? U)/max(V(R,b), V(S ? U,b))400,000
- Coincidence?
- NO!

Multiple join attributes

- R(X,Y) ? S(Y,Z), Y contains several attributes
- Suppose R(x,y1,y2) ? S(y1,y2,z)
- For a give tuple r in R, the possibility to agree

on y1 with a tuple s in S is - 1/max(V(R,y1), V(S,y1))
- For y2 1/max(V(R,y2), V(S,y2))
- Thus, the expected joined pairs
- T(R)T(S)/(max(V(R,y1),V(S,y1)) max(V(R,y2),

V(S,y2)))

General rule

- The size of natural join between two relations
- The estimate of the size of R ? S is computed by

multiplying T(R) by T(S) and dividing by the

larger of V(R,y) and V(S,y) for each attribute y

that is common to R and S.

Example 16.24

Compute

The estimate size is T(R)T(S)/max(V(R,b),V(S,d))m

ax(V(R,c),V(S,e))400

Example 16.25

Example 16.25

- R(a,b) ? U(c,d)
- Size T(R)T(U)5,000,000
- Then, join S(b,c)
- T(R)T(U)T(S)/(max(V(R,b),V(S,b))max(V(U,c),V(S,c)

))400,000 - Same result generated!

Equijoins

- The number of tuples in the result of an equijoin

can be computed exactly as for a natural join,

after accounting for the change in variable names.

Other theta-joins

- A selection following a product, and
- The number of tuples in a product is the number

of tuples in the relations involved - Equality and inequality comparisons can be

estimated using the techniques developed for

selections

Joins of many relations

- SR1?R2?R3??Rn, attribute A appears in k of the

Ris, and the number of its sets of values in

these k relations viV(Ri,A) are v1v2v3

vk - Suppose pick a tuple from each relation, what is

the probability that all tuples selected agree on

A? - Answer
- 1/v2v3...vk
- start with the product of the number of tuples in

each relation. Then, for each attribute A

appearing at least twice, divided by all but the

least of these V(R,A)s. - The number of values that remains in A after join

will be the least of these V(R,A)s

Example 16.26

- R(a,b,c)?S(b,c,d)?U(b,e)

Example 16.26 (Cont.)

- The estimate size
- T(R)T(S)T(U)/(( two largest of (V(R,b), V(S,b),

V(U,b)max(V(R,c),V(S,c)100020005000/((50200)

200)5000 - The number of values for each of the attribute

the least value count - a 100 b20 c100 d400 e500

Conclusion

- No matter how to order the natural joins of n

relations - The estimate rules, applied to each join

individually, yields the same estimate size, and - This size is the same as we apply the rules for

the join all n relations as a whole.

The size of Union

- Bag union
- The sum
- Set union
- The larger of two relations to the sum
- Using something in the middle

The size of intersection

- result
- Maybe 0 to the smaller of the two arguments
- Estimation
- Half of the smaller
- Consider as a special case of the natural join
- Not good! (Sometimes for bags)
- E.g. R(a,b) ?B S(a,b), R 2 copies of (0,1), S 3

copies - V(R,a)V(S,a)V(R,b)V(S,b)1
- T(R)2, T(S)3
- Estimation will be 2x3/(max(1,1)max(1,1))6, not

correct.

Difference

- R-S
- Result between T(R) to T(R)-T(S)
- Estimation T(R)-(1/2)T(S)

Duplication elimination

- Result
- Between 1 to the size of R.
- Estimation smaller of (1/2)T(R) and the product

of all the V(R,ai)s

Grouping and aggregation

- Similar to duplication elimination
- Estimation smaller of (1/2)T(R) or the product

of V(R,A)s

Exercises

- 16.4.1 a) b) g) i)
- 16.4.2

Estimate the cost of certain expressions

- The cost (number of I/Os) is influenced by
- The particular logical operator chosen to

implement the query - The size of intermediate results
- The physical operators used to implement logical

operators - The ordering of similar operations
- The method of passing arguments from one physical

operator to the next

Statistical information

- DBMS have to keep important information such as

T(R), B(R), V(R,a) etc. - Histogram used to compute values for a given

attribute - Equal-wdith a width w is chosen, along with a

constant v0(the smallest value). Counts for

v0vltv0w, v0wvltv02w, - Equal-height common percentiles, pick some

fraction p, and list the lowest value, the

fraction 2p, and so on - Most-frequent-values list most common values and

their numbers of occurrences. Can be combined

with previous two mechanisms

Example 16.27

- R(a,b)?S(b,c)
- R.b 1200, 0150, 5100, others 550
- S.b 0100, 180, 270, others 250
- V(R,b)14, V(S,b)13.
- R 50 occurrence of each other value
- S 25 occurrence of each other value
- For 0 15010015,000
- For 1 200 7016,000
- For others
- 2 70503500 5 100252500
- Other nine values 5025911,250
- Total 1500016000350025001125048250
- compare to simple estimation 1000500/1435714

Example 16.28

- Jan(day, temp)
- July(day, temp)
- Query
- SELECT Jan.day, July.day
- FROM Jan, July
- WHERE Jan.tempJuly.temp
- Using equal-width histogram.

Example 16.28 (Cont.)

Example 16.28 (Cont.)

- We calculate for each bands
- V10 for each band
- 40-49 105/105
- 50-59 520/1010
- All other ranges have 0 joined tuples
- Total 51015 tuples
- If no histogram information
- V100 for 0-100
- 245245/100600 tuples

Computation of statistics

- Recomputation too often is not good
- Statistics tend not to change radically in a

short time - Somehow, inaccurate statistics are still useful
- Keeping up-to-date will cause hot spot
- Recomputation should be triggered automatically

after some period of time or after some number of

updates - Database administrator can issue recomputation

command directly if poor-performing query plans

are observed. - Recomputation is complex and time-consuming.
- Sometimes, we can sample a fraction of data when

doing recomputation.

Heuristics for reducing cost of logical query plan

- Estimating cost both before and after a

transformation can reduce the cost or avoid the

transformation otherwise.

Example 16.29

Example 16.29 (Cont.)

Example 16.29 (Cont.)

- To compare two plans, we add the estimated sizes

for all the nodes except the root and the leaves. - (a) 1005010001150
- (b) 10010001100
- Different answer if we have a smaller V(R,b) or

V(S,b)

Enumerating physical plans

- Baseline approach
- Exhaustive consider all the combinations of

choices, and choose the one with the smallest

cost - Other approaches are more efficient
- Top-down work down the tree of the logical query

plan from the root. - Bottom-up for each subexpression, compute the

costs of all possible ways. - Both approaches can limit the choices effectively

Bottom-up approaches

- Dynamic programming
- Simplified bottom-up method
- Consider only the best plan for each

subexpression - Not guarantee to yield the best plan, but it

often does - Selinger-style(system-R-style)
- Exploit additional properties that some of the

plans for a subexpression may have - Produce optimal overall plan.

Heuristic selection

- If the logical plan calls for a selection sAc

(R), and R has an index on A, then perform an

index-scan to obtain the tuples with Ac - More generally, for selections involving Ac and

other conditions, implement as the index-scan

selection followed by a further selection. - If an argument of a join has an index on the join

attribute(s), use an index-join with that

relation in the inner loop - If one argument of a join is sorted on the join

attribute(s), prefer a sort-join to a hash-join,

but not necessarily to an index-join - Computing the union or intersection of three or

more relations, group the smallest relations

first

Branch-and-bound enumeration

- Using heuristics to find a physical plan first,

let its cost be C - Consider other plans for subqueries
- If cost gt C, ignore
- If cost lt C, replace the corresponding

subexpression - Advantage
- We can choose to cut off the search and take the

best plan found so far. Reduce the calculation

time

Hill climbing

- Start with a heuristically selected physical plan
- Then, make small changes to the plan
- Replacing one method for an operator by another
- Reordering join, etc.
- If find a plan with smaller cost, change
- If no small modification can yield lower cost,

choose the current plan

Dynamic programming

- Keep for each subexpression only the plan of

least cost - As we work up the tree, consider possible

implementations of each node, assuming the best

plan for each subexpression is also used

Selinger-style optimization

- Improvement on dynamic programming
- Keeping other plans that have higher cost for

each subexpression, and produce a result that is

sorted in an order that may useful in higher

layers - The attribute(s) specified in a sort operator at

the root - The grouping attribute(s) of a later group-by

operator - The join attribute(s) of a later join

Exercises

- 16.5.2
- 16.5.3
- 16.5.4 a) b)