Query Optimization - PowerPoint PPT Presentation

1 / 36

About This Presentation

Title:

Query Optimization

Description:

Query Rewriter. Algebraic Representation. Plan Generator. Data Stats. Review ... We saw many for selections, joins. Estimate cost for each and choose best ... Rewriting: ... – PowerPoint PPT presentation

Number of Views:57

Avg rating:3.0/5.0

Slides: 37

Provided by: marily180

Category:

more less

Transcript and Presenter's Notes

Title: Query Optimization

1
Query Optimization
2
General Overview

Relational model - SQL
Formal commercial query languages
Functional Dependencies
Normalization
Physical Design
Indexing
Query Processing and Optimization

3
Review QP O
SQL Query
Query Processor
Parser
Query Optimizer
Algebraic Expression
Execution plan

Evaluator
Data result of the query
4
Review QP O
Query Optimizer
Algebraic Representation
Query Rewriter
Algebraic Representation
Data Stats
Plan Generator
Query Execution Plan
5
Review-Plan Generation

Metadata DBMS maintains statistics about each
relation, attribute and index.
Plan generation
Generate many alternative plans
We saw many for selections, joins
Estimate cost for each and choose best
Plans examined
Selection Linear, binary, PI, SI
Range PI, SI
Joins NLJ, BNLJ, INLJ, SMJ, HJ

6
Review-Plan Generation

Depends upon a cost model
For any query, must know
?its estimated cardinality
?its estimated cost (in of I/Os)
E.g. ?A K (R )
? cardinality SC(A, R)
? cost depends on the plan, attribute
Linear Scan bR /2
bR
Binary Search ?log2(bR)? ?log2(bR)??sc(A,
R)/fR? -1
PI Scan HTi 1 HTi ?sc(A, R) / fR?

7
Hybrid Merge Join

hybrid merge-join If one relation is sorted, and
the other has a secondary B-tree index on the
join attribute
Merge the sorted relation with the leaf entries
of the B-tree .
Sort the result on the addresses of the unsorted
relations tuples
Scan the unsorted relation in physical address
order and merge with previous result, to replace
addresses by the actual tuples
Sequential scan more efficient than random lookup

8
Hash- joins

Applicable only to natural joins, equijoins
Depends upon hash function h, used to partition
both relations
? must map values of join attributes to 0, ...,
n-1 s.t. n partitions

9
Hash-Join Algorithm
Algorithm Hash Join

Partition the relation S using hashing function h
so that each si fits in memory. Use 1 block of
memory as the output buffer for each partition.
(at least n blocks)
2. Partition R using h.
For each partition i (0, n-1)
Use BNLJ to compute the join between Ri and Si
Ri Si
(optimal since si fits in memory, inner relation)

S is called the build input and
R is called the probe input.
Note can reduce CPU costs by building in-memory
hash index for each si using a different hash
function than h.
10
Hash Join

Partitioning
must choose
of partitions, n
hashing function, h (each tuple ? 0, ..., n-1)
Goals (in order of importance)
1. Each partition of build relation should fit
in memory
(gt h is uniform, n is large)
2. For partitioning step, can fit 1 output block
of each partition in memory
(gt n is small (lt M-1))
Strategy
Ensure 1.
Deal with violations of 2 when needed.

11
Hash Join

Goal 1 Partitions of build relations should fit
in memory

1
...
Memory (M blocks)
n
n should be? Ans (reserving 2
blocks for R partition, output of BNLJ) (In
practice, a little larger (fudge factor1.2) as
not all memory available for partition joins)
12
Hash Join

Goal 2 keep n lt M
what if not possible?
Recursive partitioning!
Idea
Iteration 1 Partition S into M-1 partitions
using h1
Iteration 2 Partition each partition of S
into M-1 partitions using a different hash
function h2
......
repeat until partition S into gt

13
Cost of Hash-Join

Cost
case 1 No recursive partitioning
1. Partition S bS reads and bS
n writes.
Why n?
2. Rartition R bR reads and bR
n writes.
3. n partition joins bR bS 2n
Reads
Total 3(bR bS ) 4 n

Typically n is small enough (roughly ) so
it can be ignored.
14
Cost of Hash-Join

case 2 Recursive Partitioning
Recall partition build relation M-1 ways at
each time.
So, total number of iterations
?logM1(n)? ?logM1(bS / M-2)? ?logM1(bS
/ M-1)?
?logM1 bS - 1?
Cost
1. partition S 2 bS (?logM1 bS - 1?)
2. partition R 2 bR (?logM1 bS - 1?)
3. n partition joins bR bS
Total cost estimate is
2(bR bS )( ?logM1(bS)-1?) bR bS

15
Example of Cost of Hash-Join
customer depositor

Assume that memory size is M3 blocks
bdepositor 100 and bcustomer 400.
depositor is to be used as build input.

NO
Recursive partitioning 2(bcust bdep )
(?log2(bdep) -1?) bdep bcust 1000 (6)
500 6500 I/Os !
Why ever use SMJ? 1) both input relations
already sorted 2) skewless hash functions hard
sometimes.
16
Hybrid HashJoin

Useful when memory sizes are relatively large,
and the build input is bigger than memory.
Main feature of hybrid hash join
Keep the smaller partition of the build
relation in memory.

s0
Sn-2
Sn-1
input

R
S
s0
Rn-2
Rn-1
input
output

R0
17
Complex Joins

Join with a conjunctive condition
r ?1? ? 2?... ? ? n s
Either use nested loops/block nested loops, or
Compute the result of one of the simpler joins r
?i s
final result comprises those tuples in the
intermediate result that satisfy the remaining
conditions
?1 ? . . . ? ?i 1 ? ?i 1 ? . . . ? ?n
Join with a disjunctive condition
r ?1 ? ?2 ?... ? ?n s
Either use nested loops/block nested loops, or
Compute as the union of the records in
individual joins r ? i s
(r ?1 s) ? (r ?2 s) ? . . . ? (r
?n s)

18
Other Operations

Duplicate elimination can be implemented via
hashing or sorting.
Optimization duplicates can be deleted during
run generation as well as at intermediate merge
steps in external sort-merge.
Hashing is similar duplicates will come into
the same bucket.
Projection is implemented by performing
projection on each tuple followed by duplicate
elimination.

19
Other Operations Aggregation

Aggregation can be implemented in a manner
similar to duplicate elimination.
Hashing or sorting
Optimization combine tuples in the same group
during run generation and intermediate merges, by
computing partial aggregate values
For count, min, max, sum keep aggregate values
on tuples found so far in the group.
When combining partial aggregate for count, add
up the aggregates
For avg, keep sum and count, and divide sum by
count at the end
True for all distributive aggregates, i.e.
aggr(S) f(aggr(S1), aggr(S2)), S S1 ? S2

20
Evaluation of Expressions

So far we have seen algorithms for individual
operations
Alternatives for evaluating an entire expression
(operator) tree
Materialization generate results of an
expression whose inputs are relations or are
already computed, materialize (store) it on disk.
Repeat.
Pipelining pass on tuples to parent operations
even as an operation is being executed

21
Materialization

Materialized evaluation evaluate the expression
bottom-up, storing intermediate results on disk
E.g., in figure below, compute and store
then compute the store its join with customer,
and finally compute the projections on
customer-name.

22
Materialization (Cont.)

Materialized evaluation is always applicable
Cost of writing results to disk and reading them
back can be quite high
Overall cost Sum of costs of individual
operations cost of
writing intermediate results to disk
Double buffering use two output buffers for each
operation, when one is full write it to disk
while the other is getting filled
Allows overlap of disk writes with computation
and reduces execution time

23
Pipelining

Pipelined evaluation evaluate several
operations simultaneously, passing the results of
one operation on to the next.
E.g., in previous expression tree, dont store
result of
instead, pass tuples directly to the join..
Similarly, dont store result of join, pass
tuples directly to projection.
For pipelining to be effective, use evaluation
algorithms that generate output tuples even as
tuples are received for inputs to the operation.
Pipelines can be executed in two ways demand
driven and producer driven

24
Pipelining (Cont.)

In demand driven or lazy evaluation
Each operation is implemented as an iterator
implementing the following operations
init()
E.g. file scan initialize file scan, store
pointer to beginning of file as state
next()
E.g. for file scan Output next tuple, and
advance and store file pointer
close()

25
Pipelining (Cont.)

In produce-driven or eager pipelining
Operators produce tuples eagerly and pass them up
to their parents
Buffer maintained between operators, child puts
tuples in buffer, parent removes tuples from
buffer
if buffer is full, child waits till there is
space in the buffer, and then generates more
tuples
System schedules operations that have space in
output buffer and can process more input tuples

26
Query Optimization

Query Rewriting
Given a relational algebra expression produce and
equivalent expression that can be evaluated more
efficiently
Plan generator
Choose the best algorithm for each operator given
statistics about the database, main memory
constraints and available indices

27
Transformation of Relational Expressions

Two RA expressions are equivalent if they produce
the same results on the same inputs
In SQL, inputs and outputs are multisets of
tuples
Two expressions in the multiset version of the
relational algebra are said to be equivalent if
on every legal database instance the two
expressions generate the same multiset of tuples
An equivalence rule says that expressions of two
forms are equivalent
Can replace expression of first form by second,
or vice versa

28
Equivalence Rules

1. Conjunctive selection operations can be
deconstructed into a sequence of individual
selections.
2. Selection operations are commutative.
3. Only the last in a sequence of projection
operations is needed, the others can be
omitted.
Selections can be combined with Cartesian
products and theta joins.
??(R1 X R2) R1 ? R2
??1(R1 ?2 R2) R1 ?1? ?2 R2

29
?
?
R1
R2
R2
R1
R3
R1
R1
R2
R2
R3
30
Equivalence Rules (Cont.)

5. Theta-join operations (and natural joins) are
commutative. R1 ? R2 R2 ? R1
6. (a) Natural join operations are associative
(R1 R2) R3 R1 (R2 R3)(b)
Theta joins are associative in the following
manner (R1 ?1 R2) ?2? ? 3 R3 R1
?1? ?3 (R2 ?2 R3) where ?2
involves attributes from only R2 and R3.

31
Equivalence Rules (Cont.)

The selection operation distributes over the
theta join operation under the following two
conditions
(a) When all the attributes in ?0 involve
only the attributes of one of the
expressions (R1) being joined.
??0?R1 ? R2) (??0(R1)) ? R2
(b) When ? 1 involves only the attributes of R1
and ?2 involves only the attributes of
R2.
??1??? ?R1 ? R2)
(??1(R1)) ? (??? (R2))

32
Equivalence Rules (Cont.)

8. The projections operation distributes over the
theta join operation as follows
(a) if ? involves only attributes from L1 ?
L2
(b) Consider a join E1 ? E2.
Let L1 and L2 be sets of attributes from E1 and
E2, respectively.
Let L3 be attributes of E1 that are involved in
join condition ?, but are not in L1 ? L2, and
let L4 be attributes of E2 that are involved in
join condition ?, but are not in L1 ? L2.

33
Transformation Example

Query Find the names of all customers who have
an account at some branch located in Brooklyn.
?customer-name(?branch-city Brooklyn (branch
(account depositor)))
Transformation using rule 7a.
?customer-name ((?branch-city Brooklyn
(branch)) (account depositor))
Performing the selection as early as possible
reduces the size of the relation to be joined.

34
Multiple Transformations (Cont.)
35
Enumeration of Equivalent Expressions

Query optimizers use equivalence rules to
generate equivalent expressions
1st Approach Generate all equivalent expressions
But... Very expensive
2nd Approach Exploit common sub-expressions
when E1 is generated from E2 by an equivalence
rule, usually only the top level of the two are
different, subtrees below are the same and can be
shared
E.g. when applying join associativity
Time requirements are reduced by not generating
all expressions

36
Evaluation Plan Example

Write a Comment

User Comments (0)