Query Optimization - PowerPoint PPT Presentation

1 / 36
About This Presentation
Title:

Query Optimization

Description:

Query Rewriter. Algebraic Representation. Plan Generator. Data Stats. Review ... We saw many for selections, joins. Estimate cost for each and choose best ... Rewriting: ... – PowerPoint PPT presentation

Number of Views:57
Avg rating:3.0/5.0
Slides: 37
Provided by: marily180
Category:

less

Transcript and Presenter's Notes

Title: Query Optimization


1
Query Optimization
2
General Overview
  • Relational model - SQL
  • Formal commercial query languages
  • Functional Dependencies
  • Normalization
  • Physical Design
  • Indexing
  • Query Processing and Optimization

3
Review QP O
SQL Query
Query Processor
Parser
Query Optimizer
Algebraic Expression
Execution plan

Evaluator
Data result of the query
4
Review QP O
Query Optimizer
Algebraic Representation
Query Rewriter
Algebraic Representation
Data Stats
Plan Generator
Query Execution Plan
5
Review-Plan Generation
  • Metadata DBMS maintains statistics about each
    relation, attribute and index.
  • Plan generation
  • Generate many alternative plans
  • We saw many for selections, joins
  • Estimate cost for each and choose best
  • Plans examined
  • Selection Linear, binary, PI, SI
  • Range PI, SI
  • Joins NLJ, BNLJ, INLJ, SMJ, HJ

6
Review-Plan Generation
  • Depends upon a cost model
  • For any query, must know
  • ?its estimated cardinality
  • ?its estimated cost (in of I/Os)
  • E.g. ?A K (R )
  • ? cardinality SC(A, R)
  • ? cost depends on the plan, attribute
  • Linear Scan bR /2
    bR
  • Binary Search ?log2(bR)? ?log2(bR)??sc(A,
    R)/fR? -1
  • PI Scan HTi 1 HTi ?sc(A, R) / fR?

7
Hybrid Merge Join
  • hybrid merge-join If one relation is sorted, and
    the other has a secondary B-tree index on the
    join attribute
  • Merge the sorted relation with the leaf entries
    of the B-tree .
  • Sort the result on the addresses of the unsorted
    relations tuples
  • Scan the unsorted relation in physical address
    order and merge with previous result, to replace
    addresses by the actual tuples
  • Sequential scan more efficient than random lookup

8
Hash- joins
  • Applicable only to natural joins, equijoins
  • Depends upon hash function h, used to partition
    both relations
  • ? must map values of join attributes to 0, ...,
    n-1 s.t. n partitions

9
Hash-Join Algorithm
Algorithm Hash Join
  • Partition the relation S using hashing function h
    so that each si fits in memory. Use 1 block of
    memory as the output buffer for each partition.
    (at least n blocks)
  • 2. Partition R using h.
  • For each partition i (0, n-1)
  • Use BNLJ to compute the join between Ri and Si
    Ri Si
  • (optimal since si fits in memory, inner relation)

S is called the build input and
R is called the probe input.
Note can reduce CPU costs by building in-memory
hash index for each si using a different hash
function than h.
10
Hash Join
  • Partitioning
  • must choose
  • of partitions, n
  • hashing function, h (each tuple ? 0, ..., n-1)
  • Goals (in order of importance)
  • 1. Each partition of build relation should fit
    in memory
  • (gt h is uniform, n is large)
  • 2. For partitioning step, can fit 1 output block
    of each partition in memory
  • (gt n is small (lt M-1))
  • Strategy
  • Ensure 1.
  • Deal with violations of 2 when needed.

11
Hash Join
  • Goal 1 Partitions of build relations should fit
    in memory

1
...
Memory (M blocks)
n
n should be? Ans (reserving 2
blocks for R partition, output of BNLJ) (In
practice, a little larger (fudge factor1.2) as
not all memory available for partition joins)
12
Hash Join
  • Goal 2 keep n lt M
  • what if not possible?
  • Recursive partitioning!
  • Idea
  • Iteration 1 Partition S into M-1 partitions
    using h1
  • Iteration 2 Partition each partition of S
    into M-1 partitions using a different hash
    function h2
  • ......
  • repeat until partition S into gt

13
Cost of Hash-Join
  • Cost
  • case 1 No recursive partitioning
  • 1. Partition S bS reads and bS
    n writes.
  • Why n?
  • 2. Rartition R bR reads and bR
    n writes.
  • 3. n partition joins bR bS 2n
    Reads
  • Total 3(bR bS ) 4 n

Typically n is small enough (roughly ) so
it can be ignored.
14
Cost of Hash-Join
  • case 2 Recursive Partitioning
  • Recall partition build relation M-1 ways at
    each time.
  • So, total number of iterations
  • ?logM1(n)? ?logM1(bS / M-2)? ?logM1(bS
    / M-1)?
  • ?logM1 bS - 1?
  • Cost
  • 1. partition S 2 bS (?logM1 bS - 1?)
  • 2. partition R 2 bR (?logM1 bS - 1?)
  • 3. n partition joins bR bS
  • Total cost estimate is
  • 2(bR bS )( ?logM1(bS)-1?) bR bS

15
Example of Cost of Hash-Join
customer depositor
  • Assume that memory size is M3 blocks
  • bdepositor 100 and bcustomer 400.
  • depositor is to be used as build input.

NO
Recursive partitioning 2(bcust bdep )
(?log2(bdep) -1?) bdep bcust 1000 (6)
500 6500 I/Os !
Why ever use SMJ? 1) both input relations
already sorted 2) skewless hash functions hard
sometimes.
16
Hybrid HashJoin
  • Useful when memory sizes are relatively large,
    and the build input is bigger than memory.
  • Main feature of hybrid hash join
  • Keep the smaller partition of the build
    relation in memory.

s0
Sn-2
Sn-1
input

R
S
s0
Rn-2
Rn-1
input
output

R0
17
Complex Joins
  • Join with a conjunctive condition
  • r ?1? ? 2?... ? ? n s
  • Either use nested loops/block nested loops, or
  • Compute the result of one of the simpler joins r
    ?i s
  • final result comprises those tuples in the
    intermediate result that satisfy the remaining
    conditions
  • ?1 ? . . . ? ?i 1 ? ?i 1 ? . . . ? ?n
  • Join with a disjunctive condition
  • r ?1 ? ?2 ?... ? ?n s
  • Either use nested loops/block nested loops, or
  • Compute as the union of the records in
    individual joins r ? i s
  • (r ?1 s) ? (r ?2 s) ? . . . ? (r
    ?n s)

18
Other Operations
  • Duplicate elimination can be implemented via
    hashing or sorting.
  • Optimization duplicates can be deleted during
    run generation as well as at intermediate merge
    steps in external sort-merge.
  • Hashing is similar duplicates will come into
    the same bucket.
  • Projection is implemented by performing
    projection on each tuple followed by duplicate
    elimination.

19
Other Operations Aggregation
  • Aggregation can be implemented in a manner
    similar to duplicate elimination.
  • Hashing or sorting
  • Optimization combine tuples in the same group
    during run generation and intermediate merges, by
    computing partial aggregate values
  • For count, min, max, sum keep aggregate values
    on tuples found so far in the group.
  • When combining partial aggregate for count, add
    up the aggregates
  • For avg, keep sum and count, and divide sum by
    count at the end
  • True for all distributive aggregates, i.e.
  • aggr(S) f(aggr(S1), aggr(S2)), S S1 ? S2

20
Evaluation of Expressions
  • So far we have seen algorithms for individual
    operations
  • Alternatives for evaluating an entire expression
    (operator) tree
  • Materialization generate results of an
    expression whose inputs are relations or are
    already computed, materialize (store) it on disk.
    Repeat.
  • Pipelining pass on tuples to parent operations
    even as an operation is being executed

21
Materialization
  • Materialized evaluation evaluate the expression
    bottom-up, storing intermediate results on disk
  • E.g., in figure below, compute and store
  • then compute the store its join with customer,
    and finally compute the projections on
    customer-name.

22
Materialization (Cont.)
  • Materialized evaluation is always applicable
  • Cost of writing results to disk and reading them
    back can be quite high
  • Overall cost Sum of costs of individual
    operations cost of
    writing intermediate results to disk
  • Double buffering use two output buffers for each
    operation, when one is full write it to disk
    while the other is getting filled
  • Allows overlap of disk writes with computation
    and reduces execution time

23
Pipelining
  • Pipelined evaluation evaluate several
    operations simultaneously, passing the results of
    one operation on to the next.
  • E.g., in previous expression tree, dont store
    result of
  • instead, pass tuples directly to the join..
    Similarly, dont store result of join, pass
    tuples directly to projection.
  • For pipelining to be effective, use evaluation
    algorithms that generate output tuples even as
    tuples are received for inputs to the operation.
  • Pipelines can be executed in two ways demand
    driven and producer driven

24
Pipelining (Cont.)
  • In demand driven or lazy evaluation
  • Each operation is implemented as an iterator
    implementing the following operations
  • init()
  • E.g. file scan initialize file scan, store
    pointer to beginning of file as state
  • next()
  • E.g. for file scan Output next tuple, and
    advance and store file pointer
  • close()

25
Pipelining (Cont.)
  • In produce-driven or eager pipelining
  • Operators produce tuples eagerly and pass them up
    to their parents
  • Buffer maintained between operators, child puts
    tuples in buffer, parent removes tuples from
    buffer
  • if buffer is full, child waits till there is
    space in the buffer, and then generates more
    tuples
  • System schedules operations that have space in
    output buffer and can process more input tuples

26
Query Optimization
  • Query Rewriting
  • Given a relational algebra expression produce and
    equivalent expression that can be evaluated more
    efficiently
  • Plan generator
  • Choose the best algorithm for each operator given
    statistics about the database, main memory
    constraints and available indices

27
Transformation of Relational Expressions
  • Two RA expressions are equivalent if they produce
    the same results on the same inputs
  • In SQL, inputs and outputs are multisets of
    tuples
  • Two expressions in the multiset version of the
    relational algebra are said to be equivalent if
    on every legal database instance the two
    expressions generate the same multiset of tuples
  • An equivalence rule says that expressions of two
    forms are equivalent
  • Can replace expression of first form by second,
    or vice versa

28
Equivalence Rules
  • 1. Conjunctive selection operations can be
    deconstructed into a sequence of individual
    selections.
  • 2. Selection operations are commutative.
  • 3. Only the last in a sequence of projection
    operations is needed, the others can be
    omitted.
  • Selections can be combined with Cartesian
    products and theta joins.
  • ??(R1 X R2) R1 ? R2
  • ??1(R1 ?2 R2) R1 ?1? ?2 R2

29
?
?
R1
R2
R2
R1
R3
R1
R1
R2
R2
R3
30
Equivalence Rules (Cont.)
  • 5. Theta-join operations (and natural joins) are
    commutative. R1 ? R2 R2 ? R1
  • 6. (a) Natural join operations are associative
  • (R1 R2) R3 R1 (R2 R3)(b)
    Theta joins are associative in the following
    manner (R1 ?1 R2) ?2? ? 3 R3 R1
    ?1? ?3 (R2 ?2 R3) where ?2
    involves attributes from only R2 and R3.

31
Equivalence Rules (Cont.)
  • The selection operation distributes over the
    theta join operation under the following two
    conditions
  • (a) When all the attributes in ?0 involve
    only the attributes of one of the
    expressions (R1) being joined.
    ??0?R1 ? R2) (??0(R1)) ? R2
  • (b) When ? 1 involves only the attributes of R1
    and ?2 involves only the attributes of
    R2.
  • ??1??? ?R1 ? R2)
    (??1(R1)) ? (??? (R2))

32
Equivalence Rules (Cont.)
  • 8. The projections operation distributes over the
    theta join operation as follows
  • (a) if ? involves only attributes from L1 ?
    L2
  • (b) Consider a join E1 ? E2.
  • Let L1 and L2 be sets of attributes from E1 and
    E2, respectively.
  • Let L3 be attributes of E1 that are involved in
    join condition ?, but are not in L1 ? L2, and
  • let L4 be attributes of E2 that are involved in
    join condition ?, but are not in L1 ? L2.

33
Transformation Example
  • Query Find the names of all customers who have
    an account at some branch located in Brooklyn.
  • ?customer-name(?branch-city Brooklyn (branch
    (account depositor)))
  • Transformation using rule 7a.
  • ?customer-name ((?branch-city Brooklyn
    (branch)) (account depositor))
  • Performing the selection as early as possible
    reduces the size of the relation to be joined.

34
Multiple Transformations (Cont.)
35
Enumeration of Equivalent Expressions
  • Query optimizers use equivalence rules to
    generate equivalent expressions
  • 1st Approach Generate all equivalent expressions
  • But... Very expensive
  • 2nd Approach Exploit common sub-expressions
  • when E1 is generated from E2 by an equivalence
    rule, usually only the top level of the two are
    different, subtrees below are the same and can be
    shared
  • E.g. when applying join associativity
  • Time requirements are reduced by not generating
    all expressions

36
Evaluation Plan Example
Write a Comment
User Comments (0)
About PowerShow.com