Database Techniek Query Optimization - PowerPoint PPT Presentation

About This Presentation
Title:

Database Techniek Query Optimization

Description:

Dynamic Programming (System R/DB2) Heuristics (Ingres/Postgres) ... With dynamic programming time complexity of optimization with bushy trees is O(3n) ... – PowerPoint PPT presentation

Number of Views:293
Avg rating:3.0/5.0
Slides: 66
Provided by: homepa3
Category:

less

Transcript and Presenter's Notes

Title: Database Techniek Query Optimization


1
Database TechniekQuery Optimization(chapter
14)
2
Lecture 3
  • Query Rewriting
  • Equivalence Rules
  • Query Optimization
  • Dynamic Programming (System R/DB2)
  • Heuristics (Ingres/Postgres)
  • De-correlation of nested queries
  • Result Size Estimation
  • Practicum Assignment 2

3
Lecture 3
  • Query Rewriting
  • Equivalence Rules
  • Query Optimization
  • Dynamic Programming (System R/DB2)
  • Heuristics (Ingres/Postgres)
  • De-correlation of nested queries
  • Result Size Estimation
  • Practicum Assignment 2

4
Transformation of Relational Expressions
  • Two relational algebra expressions are said to be
    equivalent if on every legal database instance
    the two expressions generate the same set of
    tuples
  • Note order of tuples is irrelevant
  • In SQL, inputs and outputs are bags (multi-sets)
    of tuples
  • Two expressions in the bag version of the
    relational algebra are said to be equivalent if
    on every legal database instance the two
    expressions generate the same bag of tuples
  • An equivalence rule says that expressions of two
    forms are equivalent
  • Can replace expression of first form by second,
    or vice versa

5
Equivalence Rules
  • 1. Conjunctive selection operations can be
    deconstructed into a sequence of individual
    selections.
  • 2. Selection operations are commutative.
  • 3. Only the last in a sequence of projection
    operations is needed, the others can be
    omitted.
  • Selections can be combined with Cartesian
    products and theta joins.
  • ??(E1 X E2) E1 ? E2
  • ??1(E1 ?2 E2) E1 ?1??2 E2

6
Algebraic Rewritings for Selection
s
cond1
s
cond2
R
s
s
cond1 AND cond2
cond2
R
s
cond1
R
?
s
cond1 OR cond2
s
s
cond2
cond1
R
R
7
Equivalence Rules (Cont.)
  • 5. Theta-join operations (and natural joins) are
    commutative. E1 ? E2 E2 ? E1
  • 6. Natural join operations are associative
  • (E1 E2) E3 E1 (E2 E3)

8
Equivalence Rules for Joins
commutative
associative
9
Equivalence Rules (Cont.)
  • For pushing down selections into a (theta) join
    we have the following cases
  • (push 1) When all the attributes in ?0 involve
    only the attributes of one of the expressions
    (E1) being joined. ??0?E1 ?
    E2) (??0(E1)) ? E2
  • (split) When ? 1 involves only the attributes of
    E1 and ?2 involves only the attributes of E2.
  • ??1??? ?E1 ? E2)
    (??1(E1)) ? (??? (E2))
  • (impossible) When ? involves both attributes of
    E1 and E2 (it is a join condition)

10
Pushing Selection thru Cartesian Product and Join
s
?
cond
?
s
cond
R
R
S
The right direction requires that cond refers to
S attributes only
S
s
cond
s
cond
R
R
S
S
11
Projection Decomposition
pXYpXpY
totalX.taxY.price
p
p
total tax price
p
p
p
pXYpXpY
p
p
?
?
?
p
p
pXp?
pYp?
X
Y
X
Y
X.tax
Y.price
12
More Equivalence Rules
  • 8. The projections operation distributes over the
    theta join operation as follows
  • (a) if ? involves only attributes from L1 ?
    L2
  • (b) Consider a join E1 ? E2.
  • Let L1 and L2 be sets of attributes from E1 and
    E2, respectively.
  • Let L3 be attributes of E1 that are involved in
    join condition ?, but are not in L1 ? L2, and
  • Let L4 be attributes of E2 that are involved in
    join condition ?, but are not in L1 ? L2.

13
Join Ordering Example
  • For all relations r1, r2, and r3,
  • (r1 r2) r3 r1 (r2 r3 )
  • If r2 r3 is quite large and r1 r2 is
    small, we choose
  • (r1 r2) r3
  • so that we compute and store a smaller temporary
    relation.

14
Join Ordering Example (Cont.)
  • Consider the expression
  • ?customer-name ((?branch-city Brooklyn
    (branch))
    account depositor)
  • Could compute account depositor first, and
    join result with ?branch-city Brooklyn
    (branch)but account depositor is likely to be
    a large relation.
  • Since it is more likely that only a small
    fraction of the banks customers have accounts in
    branches located in Brooklyn, it is better to
    compute
  • ?branch-city Brooklyn (branch) account
  • first.

15
Lecture 3
  • Query Rewriting
  • Equivalence Rules
  • Query Optimization
  • Dynamic Programming (System R/DB2)
  • Heuristics (Ingres/Postgres)
  • De-correlation of nested queries
  • Result Size Estimation

16
Lecture 3
  • Query Rewriting
  • Equivalence Rules
  • Query Optimization
  • Dynamic Programming (System R/DB2)
  • Heuristics (Ingres/Postgres)
  • De-correlation of nested queries
  • Result Size Estimation

17
The role of Query Optimization
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
query execution
18
The role of Query Optimization
Compare different relational algebra plan ? on
result size (Practicum 2A)
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
query execution
19
The role of Query Optimization
SQL
parsing, normalization
logical algebra
physical query optimization
logical query optimization
physical algebra
  • Compare different execution algorithms
  • on true cost
  • (IO, CPU, cache)

query execution
20
Enumeration of Equivalent Expressions
  • Query optimizers use equivalence rules to
    systematically generate expressions equivalent to
    the given expression
  • repeated until no more expressions can be found
  • for each expression found so far, use all
    applicable equivalence rules, and add newly
    generated expressions to the set of expressions
    found so far
  • The above approach is very expensive in space and
    time
  • Time and space requirements are reduced by not
    generating all expressions

21
Finding A Good Join Order
  • Consider finding the best join-order for r1 r2
    . . . rn.
  • There are (2(n 1))!/(n 1)! different join
    orders for above expression. With n 7, the
    number is 665280, with n 10, the number is
    greater than 176 billion!
  • No need to generate all the join orders. Using
    dynamic programming, the least-cost join order
    for any subset of r1, r2, . . . rn is computed
    only once and stored for future use.

22
Dynamic Programming in Optimization
  • To find best join tree for a set of n relations
  • To find best plan for a set S of n relations,
    consider all possible plans of the form S1
    (S S1) where S1 is any non-empty subset of S.
  • Recursively compute costs for joining subsets of
    S to find the cost of each plan. Choose the
    cheapest of the 2n 1 alternatives.
  • When plan for any subset is computed, store it
    and reuse it when it is required again, instead
    of recomputing it
  • Dynamic programming

23
Join Order Optimization Algorithm
  • procedure findbestplan(S)if (bestplanS.cost ?
    ?) return bestplanS// else bestplanS has
    not been computed earlier, compute it nowfor
    each non-empty subset S1 of S such that S1 ?
    S P1 findbestplan(S1) P2 findbestplan(S -
    S1) A best algorithm for joining results of P1
    and P2 cost P1.cost P2.cost cost of A if
    cost lt bestplanS.cost bestplanS.cost
    cost bestplanS.plan execute P1.plan
    execute P2.plan join results of P1 and
    P2 using Areturn bestplanS

24
Left Deep Join Trees
  • In left-deep join trees, the right-hand-side
    input for each join is a relation, not the result
    of an intermediate join.

25
Cost of Optimization
  • With dynamic programming time complexity of
    optimization with bushy trees is O(3n).
  • With n 10, this number is 59000 instead of 176
    billion!
  • Space complexity is O(2n)
  • To find best left-deep join tree for a set of n
    relations
  • Consider n alternatives with one relation as
    right-hand side input and the other relations as
    left-hand side input.
  • Using (recursively computed and stored)
    least-cost join order for each alternative on
    left-hand-side, choose the cheapest of the n
    alternatives.
  • If only left-deep trees are considered, time
    complexity of finding best join order is O(n 2n)
  • Space complexity remains at O(2n)
  • Cost-based optimization is expensive, but
    worthwhile for queries on large datasets (typical
    queries have small n, generally lt 10)

26
Physical Query Optimization
  • Minimizes absolute cost
  • Minimize I/Os
  • Minimize CPU, cache miss cost (main memory DBMS)
  • Must consider the interaction of evaluation
    techniques when choosing evaluation plans
    choosing the cheapest algorithm for each
    operation independently may not yield best
    overall algorithm. E.g.
  • merge-join may be costlier than hash-join, but
    may provide a sorted output which reduces the
    cost for an outer level aggregation.
  • nested-loop join may provide opportunity for
    pipelining

27
Physical Optimization Interesting Orders
  • Consider the expression (r1 r2 r3) r4
    r5
  • An interesting sort order is a particular sort
    order of tuples that could be useful for a later
    operation.
  • Generating the result of r1 r2 r3 sorted on
    the attributes common with r4 or r5 may be
    useful, but generating it sorted on the
    attributes common only r1 and r2 is not useful.
  • Using merge-join to compute r1 r2 r3 may be
    costlier, but may provide an output sorted in an
    interesting order.
  • Not sufficient to find the best join order for
    each subset of the set of n given relations must
    find the best join order for each subset, for
    each interesting sort order
  • Simple extension of earlier dynamic programming
    algorithms
  • Usually, number of interesting orders is quite
    small and doesnt affect time/space complexity
    significantly

28
Heuristic Optimization
  • Cost-based optimization is expensive, even with
    dynamic programming.
  • Systems may use heuristics to reduce the number
    of choices that must be made in a cost-based
    fashion.
  • Heuristic optimization transforms the query-tree
    by using a set of rules that typically (but not
    in all cases) improve execution performance
  • Perform selection early (reduces the number of
    tuples)
  • Perform projection early (reduces the number of
    attributes)
  • Perform most restrictive selection and join
    operations before other similar operations.
  • Some systems use only heuristics, others combine
    heuristics with partial cost-based optimization.

29
Steps in Typical Heuristic Optimization
  • 1. Deconstruct conjunctive selections into a
    sequence of single selection operations (Equiv.
    rule 1.).
  • 2. Move selection operations down the query tree
    for the earliest possible execution (Equiv. rules
    2, 7a, 7b, 11).
  • 3. Execute first those selection and join
    operations that will produce the smallest
    relations (Equiv. rule 6).
  • 4. Replace Cartesian product operations that are
    followed by a selection condition by join
    operations (Equiv. rule 4a).
  • 5. Deconstruct and move as far down the tree as
    possible lists of projection attributes, creating
    new projections where needed (Equiv. rules 3, 8a,
    8b, 12).
  • 6. Identify those subtrees whose operations can
    be pipelined, and execute them using pipelining).

30
Heuristic Join Order the Wong-Youssefi algorithm
(INGRES)
Sample TPC-H Schema Nation(NationKey,
NName) Customer(CustKey, CName,
NationKey) Order(OrderKey, CustKey,
Status) Lineitem(OrderKey, PartKey,
Quantity) Product(SuppKey, PartKey,
PName) Supplier(SuppKey, SName)
Find the names of suppliers that sell a product
that appears in a line item of an order made by a
customer who is in Canada
SELECT SName FROM Nation, Customer, Order,
LineItem, Product, Supplier WHERE
Nation.NationKey Cuctomer.NationKey AND
Customer.CustKey Order.CustKey AND
Order.OrderKeyLineItem.OrderKey AND
LineItem.PartKey Product.Partkey AND
Product.Suppkey Supplier.SuppKey AND NName
Canada
31
Challenges with Large Natural Join Expressions
  • For simplicity, assume that in the query
  • All joins are natural
  • whenever two tables of the FROM clause have
    common
  • attributes we join on them
  • Consider Right-Index only

pSName
RI
RI
One possible order
RI
RI
RI
Index
sNNameCanada
Nation
Customer
Order
LineItem
Product
Supplier
32
Wong-Yussefi algorithm assumptions and objectives
  • Assumption 1 (weak) Indexes on all join
    attributes (keys and foreign keys)
  • Assumption 2 (strong) At least one selection
    creates a small relation
  • A join with a small relation results in a small
    relation
  • Objective Create sequence of index-based joins
    such that all intermediate results are small

33
Hypergraphs
Customer
CName CustKey
Nation
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
  • relation hyperedges
  • two hyperedges for same relation are possible
  • each node is an attribute
  • can extend for non-natural equality joins by
    merging nodes

34
Small Relations/Hypergraph Reduction
Nation is small because it has the equality
selection NName Canada
Customer
CName CustKey
Nation
NationKey NName
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
Index
Pick a small relation (and its conditions) to
start the plan
sNNameCanada
Nation
35
Customer
(1) Remove small relation (hypergraph reduction)
and color as small any relation that joins with
the removed small relation
CName CustKey
Nation
NationKey NName
NationKey NName
LineItem
Quantity PartKey
Order
Status OrderKey
Supplier
SName
SuppKey PName
Product
RI
(2) Pick a small relation (and its conditions if
any) and join it with the small relation that has
been reduced
Index
sNNameCanada
Customer
Nation
36
After a bunch of steps
pSName
RI
RI
RI
RI
RI
Index
sNNameCanada
Nation
Customer
Order
LineItem
Product
Supplier
37
Some Query Optimizers
  • The System R/Starburst dynamic programming on
    left-deep join orders. Also uses heuristics to
    push selections and projections down the query
    tree.
  • DB2, SQLserver are cost-based optimizers
  • SQLserver is transformation based, also uses
    dynamic programming.
  • MySQL optimizer is heuristics-based (rather weak)
  • Heuristic optimization used in some versions of
    Oracle
  • Repeatedly pick best relation to join next
  • Starting from each of n starting points. Pick
    best among these.

38
Lecture 3
  • Query Rewriting
  • Equivalence Rules
  • Query Optimization
  • Dynamic Programming (System R/DB2)
  • Heuristics (Ingres/Postgres)
  • De-correlation of nested queries
  • Result Size Estimation
  • Practicum Assignment 2

39
Lecture 3
  • Query Rewriting
  • Equivalence Rules
  • Query Optimization
  • Dynamic Programming (System R/DB2)
  • Heuristics (Ingres/Postgres)
  • De-correlation of nested queries
  • Result Size Estimation
  • Practicum Assignment 2

40
Optimizing Nested Subqueries
  • SQL conceptually treats nested subqueries in the
    where clause as functions that take parameters
    and return a single value or set of values
  • Parameters are variables from outer level query
    that are used in the nested subquery such
    variables are called correlation variables
  • E.g.select customer-namefrom borrowerwhere
    exists (select from
    depositor where
    depositor.customer-name

    borrower.customer-name)
  • Conceptually, nested subquery is executed once
    for each tuple in the cross-product generated by
    the outer level from clause
  • Such evaluation is called correlated evaluation
  • Note other conditions in where clause may be
    used to compute a join (instead of a
    cross-product) before executing the nested
    subquery

41
Optimizing Nested Subqueries (Cont.)
  • Correlated evaluation may be quite inefficient
    since
  • a large number of calls may be made to the nested
    query
  • there may be unnecessary random I/O as a result
  • SQL optimizers attempt to transform nested
    subqueries to joins where possible, enabling use
    of efficient join techniques
  • E.g. earlier nested query can be rewritten as
    select customer-namefrom borrower,
    depositorwhere depositor.customer-name
    borrower.customer-name
  • Note above query doesnt correctly deal with
    duplicates, can be modified to do so as we will
    see
  • In general, it is not possible/straightforward to
    move the entire nested subquery from clause into
    the outer level query from clause
  • A temporary relation is created instead, and used
    in body of outer level query

42
Optimizing Nested Subqueries (Cont.)
  • In general, SQL queries of the form below can be
    rewritten as shown
  • Rewrite select from L1
    where P1 and exists (select
    from L2 where P2)
  • To create table t1 as
    select distinct V from L2
    where P21 select
    from L1, t1 where P1
    and P22
  • P21 contains predicates in P2 that do not involve
    any correlation variables
  • P22 reintroduces predicates involving
    correlation variables, with relations renamed
    appropriately
  • V contains all attributes used in predicates with
    correlation variables

43
Optimizing Nested Subqueries (Cont.)
  • In our example, the original nested query would
    be transformed to create table t1 as
    select distinct customer-name from
    depositor select customer-name from
    borrower, t1 where t1.customer-name
    borrower.customer-name
  • The process of replacing a nested query by a
    query with a join (possibly with a temporary
    relation) is called decorrelation.
  • Decorrelation is more complicated when
  • the nested subquery uses aggregation, or
  • when the result of the nested subquery is used
    to test for equality, or
  • when the condition linking the nested subquery to
    the other query is not exists,
  • and so on.

44
Practicum Assignment 2A
  • Get the XML metadata description for TPC-H
  • xslt script for plotting histograms
  • Take our solution for your second query
    (assignment 1)
  • For each operator in the tree give
  • Selectivity
  • Intermediate Result size
  • Short description how you computed this
  • Explanation how to compute histograms on all
    result columns
  • Sum all intermediate result sizes into total
    query cost
  • DEADLINE march 31

45
The Big Picture
  • 1. Parsing and translation
  • 2. Optimization
  • 3. Evaluation

46
The Big Picture
  • 1. Parsing and translation
  • 2. Optimization
  • 3. Evaluation

47
Optimization
  • Query Optimization Amongst all equivalent
    evaluation plans choose the one with lowest cost.
  • Cost is estimated using statistical information
    from the database catalog
  • e.g. number of tuples in each relation, size of
    tuples, etc.
  • In this lecture we study logical cost estimation
  • introduction to histograms
  • estimating the amount of tuples in the result
    with perfect and equi-height histograms
  • propagation of histograms into result columns
  • How to compute result size from width and tuples

48
Cost Estimation
  • Physical cost estimation
  • predict I/O blocks, seeks, cache misses, RAM
    consumption,
  • Depends in the execution algorithm
  • In this lecture we study logical cost estimation
  • the plan with smallest intermediate result tends
    to be best
  • need estimations for intermediate result sizes
  • Histogram-based estimation (practicum, assignment
    2)
  • estimating the amount of tuples in the result
    with perfect and equi-height histograms
  • propagation of histograms into result columns
  • compute result size as tuple-width tuples

49
Selectivities
  • select
  • expr X(col,const) X in lt, lt, , gt, gt
  • expr expr expr expr
  • sop(R) R' / R.
  • join
  • only 1-n / n-1 foreign key joins
  • s join(R1,R2) R' / (R1R2).
  • aggr
  • s(A(Rg1a)) A(Rg1a) / R distinct(R.g1)
    / R.
  • s(A(Rg1,g2a)) (distinct(R.g1)
    distinct(R.g2)) / R.
  • project, order
  • s project(R) s order(R) 1
  • topn
  • s topN(R) min(N,R) / R min(N/R,1).

50
Result Size
  • tuples_max selectivity columns
  • We disregard differences in column-width
  • project
  • columns projectlist
  • tuples_max R
  • aggr
  • columns groupbys aggrs
  • tuples_max min(R, g1 .. gn)
  • join
  • columns child1 child2
  • tuples_max R1 R2
  • other
  • columns stays equal wrt child
  • tuples_max R

51
Selectivity estimation
  • We can estimate the selectivities using
  • domain constraints
  • min/max statistics
  • histograms

52
Histograms
  • Buckets
  • B ltmin, max, total, distinctgt
  • Leave min out (Bi.min Bi-1.max)

53
Different Kinds of Histograms
  • Perfect
  • Equi-width
  • Equi-height
  • In the practicum we use
  • Perfect histograms, when distinct(R.a) lt 25
  • Equi-height histograms of 10 buckets, otherwise
  • Not perfectly even-height disjunct value ranges
    between buckets
  • (i.e. frequent value is not split over
    even-height buckets. It may create a
    bigger-than-height bucket)

54
Perfect Histograms Equi-Selection
  • s(R.aC) Bk.total (1/R)
  • in case there is a k with Bk.max C
  • s(R.aC) 0
  • otherwise

s(R.ad)
total
a
c
f
d
55
Perfect Histograms Equi-Selection
  • s(R.aC) Bk.total (1/R)
  • in case there is a k with Bk.max C
  • s(R.aC) 0
  • otherwise

s(R.ad)
total
a
c
d
f
56
Perfect Histograms Range-Selection
  • s(R.altC) sum(Bi.total) (1/R),
  • for all 1 lt i lt k with B(k-1).max lt C lt Bk.max

s(R.altd)
total
a
c
d
f
57
Perfect Histograms Range-Selection
  • s(R.altC) sum(Bi.total) (1/R),
  • for all 1 lt i lt k with B(k-1).max lt C lt Bk.max

s(R.altd)
total
a
c
d
f
58
Equi-Height Histograms Equi-Selection
  • s(R.aC) avg_freq(Bk) (1/R)
  • in case there is a k with B(k-1).max lt C lt
    Bk.max
  • avg_freq(Bk) Bk.total / Bk.distinct
  • s(R.aC) 0
  • otherwise

s(R.ac)
total
a
d
f
e
59
Equi-Height Histograms Equi-Selection
  • s(R.aC) avg_freq(Bk) (1/R)
  • in case there is a k with B(k-1).max lt C lt
    Bk.max
  • avg_freq(Bk) Bk.total / Bk.distinct
  • s(R.aC) 0
  • otherwise

s(R.ac)
total
a
d
f
e
60
Equi-Height Histograms Range-Selection
  • s(R.altC) ( sum(Bi.total) freq_lt(Bk,C) )
    (1/R),
  • for all 1 lt i lt k with B(k-1).max lt C lt Bk.max.

s(R.altc)
total
a
d
f
e
61
Equi-Height Histograms Range-Selection
  • s(R.altC) ( sum(Bi.total) freq_lt(Bk,C) )
    (1/R),
  • for all 1 lt i lt k with B(k-1).max lt C lt Bk.max.

s(R.altc)
total
a
d
f
e
62
Select with And and Or
  • Assume no correlation between attributes
  • s(?a and ?c) s(?a) s(?c)
  • s(?a or ?c) s(?a) (1-s(?a)) s(?c)
  • Note must normalize ?a , ?c into non-overlapping
    conditions

63
Foreign-key Join Selectivity/Hitrate Estimation
  • Foreign-key constraint
  • R1 matches at most once with R2
  • each order matches on average with 7 lineitems
    ?hitrate 7
  • But what if R2 (e.g. order) is an intermediate
    result?
  • R2 may have multiple key occurrences due to a
    previous join
  • R2 may have less key occurrences (missing keys)
    due to a select (or join).
  • Simple Approach (practicum)
  • Hitrate R2/R2

64
Aggr(R,g1..gn,..)
  • Can only predict groupby columns and size
  • Expected result size
  • min(R,distinct(g1) . distinct(gn))

65
Histogram Propagation
  • order histogram stays identical
  • project histogram stays identical
  • Expression (e.g. l_taxl_price) not required for
    the practicum
  • possible to use cartesian product on histograms,
    followed by expression evaluation and
    re-bucketizing.
  • topn not required for the practicum
  • Use last bucket (and backwards) to take highest N
    distinct values and their frequencies
  • aggr not required for the practicum
  • Groupbys distinct is multiplication of
    distincts, freq1
  • Aggregates only possible for global aggregates
    (no groupbys)
  • fk-join multiply totals by join hitrate
  • distinct min(distinct,total) ? this is a
    simplicifation!
  • Select multiply totals by selectivity
  • distinct min(distinct,total)
  • Select (selection attribute)
  • Get totals/distincts from subset of buckets

66
Practicum Assignment 2
  • Get the XML metadata description for TPC-H
  • ps/pfd histograms also available
  • Take our solution for your second query
    (assignment 1)
  • For each operator in the tree give
  • Selectivity
  • Intermediate Result size
  • Short description how you computed this
  • Explanation how to compute histograms on all
    result columns
  • Sum all intermediate result sizes into total
    query cost
  • DEADLINE march 31
Write a Comment
User Comments (0)
About PowerShow.com