SQL: Queries, Programming, Triggers

Database Systems II Query Execution
  • We have optimized the logical query plan,
    applying relational algebra equivalences.
  • In order to refine this plan into a physical
    query plan, we in particular need to choose one
    of the available algorithms to implement each
    basic operation (selection, join, . . . ) of the
    query plan.
  • For each alternative physical query plan, we
    estimate its cost.
  • The cost estimates are based on the size
    estimates discussed previously.

  • Disk I/O (read / write of a disk block) is orders
    of magnitude more expensive than CPU operations.
  • Therefore, we use the number of disk I/Os to
    measure the cost of a physical query plan.
  • We ignore CPU costs, timing effects, and double
    buffering requirements.
  • We assume that the arguments of an operator are
    found on disk, but the result of the operator is
    left in main memory.

  • We use the following parameters (statistics) to
    express the cost of an operator:
    - B(R): number of blocks containing tuples of R,
    - f(R): maximum number of tuples of R per block,
    - M: number of memory blocks available in the buffer,
    - HT(i): number of levels in index i,
    - LB(i): number of leaf blocks in index i.
  • M may comprise the entire main memory, but
    typically the main memory needs to be shared with
    other processes, and M is much (!) smaller.
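For a clustered relation, B(R) follows from T(R) and f(R) as B(R) = ⌈T(R)/f(R)⌉; a minimal sketch (the function name is ours, not from the slides):

```python
import math

def blocks(t_r: int, f_r: int) -> int:
    """B(R) = ceil(T(R) / f(R)): blocks needed when at most
    f(R) tuples of R fit per block (clustered storage)."""
    return math.ceil(t_r / f_r)

# 10,000 tuples at 10 tuples per block -> 1,000 blocks,
# matching the running example in these slides
b_r1 = blocks(10_000, 10)
print(b_r1)  # 1000
```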

  • The performance of relational operators depends
    on many parameters, such as the following.
  • Are the tuples of a relation stored physically
    contiguous (clustered)? If yes, the number of
    blocks to be read is greatly reduced compared to
    non-clustered storage.
  • Is a relation sorted by the relevant (selection,
    join) attribute? Otherwise, it may need to be
    sorted on-the-fly.
  • Which indexes exist? Some algorithms require the
    existence of a corresponding index.

  • Each operator (selection, join, . . .) in a
    logical query plan can be implemented by one of a
    fairly large number of alternative algorithms.
  • We distinguish three types of algorithms:
    - sorting-based algorithms,
    - hash-based algorithms,
    - index-based algorithms.
  • Sorting, building of a hash table, or building of
    an index can either have happened in advance or
    may happen on the fly.

  • We can also categorize algorithms according to
    the number of passes over the data:
    - one-pass algorithms read data only once from
      disk,
    - two-pass algorithms read data once from disk,
      write an intermediate relation back to disk,
      and then read the intermediate relation once,
    - multiple-pass algorithms perform more than two
      passes over the data; not considered in class.

One-Pass Algorithms for Unary Operations
  • Consider the unary, tuple-at-a-time operations,
    selection and projection on relation R.
  • Read all the blocks of R into the input buffer,
    one at a time.
  • Perform the operation on each tuple and move the
    selected / projected tuple to the output buffer.
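The loop above can be sketched in Python, modeling R as a list of blocks (helper names are illustrative):

```python
def select_project(r_blocks, predicate, project):
    """One-pass selection + projection: stream the blocks of R
    through a single input buffer, one block at a time, and move
    each selected/projected tuple to the output."""
    for block in r_blocks:        # only one block is "in memory" at once
        for t in block:
            if predicate(t):
                yield project(t)

# R as a list of blocks of (id, name) tuples
r_blocks = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
result = list(select_project(r_blocks,
                             predicate=lambda t: t[0] % 2 == 0,
                             project=lambda t: t[1]))
print(result)  # ['b', 'd']
```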

One-Pass Algorithms for Unary Operations
  • The output buffer may be the input buffer of
    another operation and is not counted.
  • Thus, the algorithm requires only M = 1 buffer.
  • I/O cost is B(R).
  • If some index is applicable for a selection, we
    have to read only the blocks that contain
    qualifying tuples.

One-Pass Algorithms for Binary Operations
  • Binary operations: union, intersection,
    difference, Cartesian product, and join.
  • Use subscripts S and B to distinguish between the
    set and bag versions of an operator, e.g. ∪S
    vs. ∪B.
  • The bag union R ∪B S can be computed using a very
    simple one-pass algorithm: copy each tuple of R
    to the output, and copy each tuple of S to the
    output.
  • I/O cost is B(R) + B(S), with M = 1.

One-Pass Algorithms for Binary Operations
  • Other binary operations require reading the
    smaller of the two input relations into main
    memory.
  • One buffer reads the blocks of the larger
    relation; M-1 buffers hold the entire smaller
    relation.
  • I/O cost is B(R) + B(S).
  • In main memory, a data structure is built that
    efficiently supports insertions and searches,
    e.g., a hash table or a balanced binary tree.
    Its space overhead can be neglected.

One-Pass Algorithms for Binary Operations
  • For set union, read the smaller relation (S) into
    M-1 buffers, representing it in a data structure
    whose search key consists of all attributes.
  • All these tuples are also copied to the output.
  • Read all blocks of R into the M-th buffer, one at
    a time.
  • For each tuple t of R, check whether t is in S.
    If not, copy t to the output.
  • For set intersection, copy t to the output only
    if it also appears in S.
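Assuming the smaller relation S fits in M-1 buffers, the two procedures above might look like this, with a Python set standing in for the in-memory search structure:

```python
def one_pass_set_union(s_blocks, r_blocks):
    """One-pass set union: S is held in memory in a hash-based
    structure keyed on all attributes; R is streamed block by block."""
    seen, out = set(), []
    for block in s_blocks:          # S fits in M-1 buffers
        for t in block:
            if t not in seen:
                seen.add(t)
                out.append(t)       # every S tuple is copied to the output
    for block in r_blocks:          # R uses the remaining buffer
        for t in block:
            if t not in seen:       # copy t only if it is not in S
                out.append(t)
    return out

def one_pass_set_intersection(s_blocks, r_blocks):
    """Copy a tuple t of R to the output only if it also is in S."""
    seen = {t for block in s_blocks for t in block}
    return [t for block in r_blocks for t in block if t in seen]

s_blocks = [[1, 2], [3]]
r_blocks = [[2, 4], [5]]
union = one_pass_set_union(s_blocks, r_blocks)
inter = one_pass_set_intersection(s_blocks, r_blocks)
print(union)  # [1, 2, 3, 4, 5]
print(inter)  # [2]
```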

Nested-Loop Joins
  • We now consider algorithms for the join operator.
  • The simplest one is the nested-loop join, a
    one-and-a-half pass algorithm.
  • One relation is read once, the other one multiple
    times.
  • It is not necessary that one relation fits in
    main memory.
  • Perform the join through two nested loops over
    the two input relations.

Nested-Loop Joins
  • Tuple-based nested-loop join
  • natural join R ⋈ S on join attribute C
  • for each r ∈ R do
  •   for each s ∈ S do
  •     if r.C = s.C then output (r,s)
  • Outer relation R, inner relation S.
  • One buffer for the outer relation, one buffer for
    the inner relation.
  • M = 2.
  • I/O cost is T(R) x T(S).
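The tuple-based loop translates directly; a sketch (attribute accessors are passed in, since the slides' tuples are abstract):

```python
def nested_loop_join(r, s, c_r, c_s):
    """Tuple-based nested-loop join on r.C = s.C; c_r / c_s
    extract the join attribute C from a tuple."""
    for rt in r:            # outer relation R
        for st in s:        # inner relation S, rescanned for every r
            if c_r(rt) == c_s(st):
                yield (rt, st)

r = [("r1", 10), ("r2", 20)]
s = [(10, "s1"), (20, "s2"), (20, "s3")]
out = list(nested_loop_join(r, s, c_r=lambda t: t[1], c_s=lambda t: t[0]))
print(out)
# [(('r1', 10), (10, 's1')), (('r2', 20), (20, 's2')), (('r2', 20), (20, 's3'))]
```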

Nested-Loop Joins
  • Example
  • Relations not clustered
  • T(R1) = 10,000, T(R2) = 5,000
  • R1 as the outer relation
  • Cost for each R1 tuple t1:
  •   read tuple t1 (1 IO), read relation R2 (5,000 IOs)
  • Total I/O cost is 10,000 x (1 + 5,000) = 50,010,000 IOs

Nested-Loop Joins
  • Can do much better by organizing access to both
    relations by blocks.
  • Use as much buffer space as possible (M-1) to
    store tuples of the outer relation.
  • Block-based nested-loop join
  • for each chunk of M-1 blocks of R do
  •   read these blocks into the buffer
  •   for each block b of S do
  •     read b into the buffer
  •     for each tuple t of b do
  •       find the tuples of R that join with t
  •       and output the join results
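A sketch of the block-based variant; the in-memory dict for finding the R tuples that join with each S tuple is an implementation choice, not from the slides:

```python
def block_nested_loop_join(r_blocks, s_blocks, m, key):
    """Block-based nested-loop join: load chunks of M-1 blocks of the
    outer relation R, then scan all of S once per chunk."""
    out = []
    chunk_size = m - 1
    for i in range(0, len(r_blocks), chunk_size):
        chunk = [t for b in r_blocks[i:i + chunk_size] for t in b]
        by_key = {}                 # index the chunk on the join attribute
        for rt in chunk:
            by_key.setdefault(key(rt), []).append(rt)
        for block in s_blocks:      # one S block in the buffer at a time
            for st in block:
                for rt in by_key.get(key(st), []):
                    out.append((rt, st))
    return out

r_blocks = [[(1, "a")], [(2, "b")], [(3, "c")]]
s_blocks = [[(2, "x"), (3, "y")]]
result = block_nested_loop_join(r_blocks, s_blocks, m=3, key=lambda t: t[0])
print(result)  # [((2, 'b'), (2, 'x')), ((3, 'c'), (3, 'y'))]
```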

Nested-Loop Joins
  • Example
  • R1 as the outer relation
  • T(R1) = 10,000, T(R2) = 5,000
  • S(R1) = S(R2) = 1/10 block
  • M = 101: 100 buffers for R1, 1 buffer for R2
  • 10 R1 chunks
  • Cost for each R1 chunk:
  •   read chunk: 1,000 IOs
  •   read R2: 5,000 IOs
  • Total I/O cost is 10 x 6,000 = 60,000 IOs

Nested-Loop Joins
  • Can do even better by reversing the join order:
    R2 ⋈ R1.
  • T(R1) = 10,000, T(R2) = 5,000
  • S(R1) = S(R2) = 1/10 block
  • M = 101: 100 buffers for R2, 1 buffer for R1
  • 5 R2 chunks
  • Cost for each R2 chunk:
  •   read chunk: 1,000 IOs
  •   read R1: 10,000 IOs
  • Total I/O cost is 5 x 11,000 = 55,000 IOs

Nested-Loop Joins
  • Finally, performance is dramatically improved
    when input relations are clustered.
  • With clustered relations, for each R2 chunk:
  •   read chunk: 100 IOs
  •   read R1: 1,000 IOs
  • Total I/O cost is 5 x 1,100 = 5,500 IOs.
  • Note that the IO cost of a one-pass join (which
    has the minimum IO of any join algorithm) in this
    example is 1,000 + 500 = 1,500 IOs.
  • For comparison, the one-pass join requires
    M = 501 buffer blocks, which is optimal.

Two-Pass Algorithms Based on Sorting
  • If the input relations are sorted, the efficiency
    of duplicate elimination, set-theoretic
    operations and join can be greatly improved.
  • Reserve one buffer for each of the input
    relations R and S and another buffer for the
    output.
  • Scan both relations simultaneously in sort order,
    merging matching tuples.
  • For example, for set intersection repeatedly
    consider the tuple t that is least in the sort
    order (w.r.t. primary key) among all tuples in
    the input buffer. If it appears in both R and S,
    output t.

Two-Pass Algorithms Based on Sorting
  • In the following, we present a simple sort-merge
    join algorithm.
  • It is called merge-join if step (1) can be
    skipped, since the input relations R1 and R2 are
    already sorted.
  • Sort-merge join
  • (1) if R1 and R2 are not sorted, sort them
  • (2) i ← 1; j ← 1;
  •     while (i ≤ T(R1)) ∧ (j ≤ T(R2)) do
  •       if R1[i].C = R2[j].C then outputTuples
  •       else if R1[i].C > R2[j].C then j ← j+1
  •       else if R1[i].C < R2[j].C then i ← i+1

Two-Pass Algorithms Based on Sorting
  • Procedure outputTuples produces all pairs of
    tuples from R1 and R2 with C = R1[i].C = R2[j].C.
  • In the worst case, we need to match each pair of
    tuples from R1 and R2 (nested-loop join).
  • Procedure outputTuples
  • while (R1[i].C = R2[j].C) ∧ (i ≤ T(R1)) do
  •   jj ← j
  •   while (R1[i].C = R2[jj].C) ∧ (jj ≤ T(R2)) do
  •     output pair (R1[i], R2[jj])
  •     jj ← jj + 1
  •   i ← i + 1
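The two procedures combine into the following sketch (0-based indexing instead of the slides' 1-based):

```python
def sort_merge_join(r1, r2, key=lambda t: t[0]):
    """Sort-merge join: sort both inputs on C, then merge;
    outputTuples pairs every run of equal C values."""
    r1 = sorted(r1, key=key)        # step (1): sort if not sorted
    r2 = sorted(r2, key=key)
    out, i, j = [], 0, 0
    while i < len(r1) and j < len(r2):
        c1, c2 = key(r1[i]), key(r2[j])
        if c1 < c2:
            i += 1
        elif c1 > c2:
            j += 1
        else:
            # outputTuples: pair r1[i] with every r2 tuple of this C value
            jj = j
            while jj < len(r2) and key(r2[jj]) == c1:
                out.append((r1[i], r2[jj]))
                jj += 1
            i += 1
    return out

r1 = [(10,), (20,), (20,), (30,), (40,)]
r2 = [(5,), (20,), (20,), (30,), (30,), (50,), (52,)]
pairs = sort_merge_join(r1, r2)
print(len(pairs))  # 6: 2x2 pairs for C=20, 1x2 pairs for C=30
```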

Two-Pass Algorithms Based on Sorting
Example
  i   R1[i].C   R2[j].C   j
  1   10        5         1
  2   20        20        2
  3   20        20        3
  4   30        30        4
  5   40        30        5
                50        6
                52        7
Two-Pass Algorithms Based on Sorting
  • Example
  • Both R1, R2 ordered by C; relations clustered.
  • R1 ⋈ R2
  • Total cost = read R1 cost + read R2 cost
    = 1,000 + 500 = 1,500 IOs
Two-Pass Algorithms Based on Sorting
  • What if input relations are not yet in the
    required sort order?
  • Do Two-Phase, Multiway Merge-Sort (2PMMS).
  • Phase 1: Sort M blocks of R at a time in main
    memory, writing the sorted sublists back to disk.
  • Phase 2: Merge all the B(R)/M sorted sublists.

[Figure: phase-2 merge, with one sorted input buffer per sublist, an output buffer, and a pointer to the first unchosen tuple of each sublist.]
Two-Pass Algorithms Based on Sorting
  • Each sorted sublist has a length of M blocks.
  • Number of sublists is B(R)/M.
  • Since phase 2 needs one buffer per sublist plus
    an output buffer, we require B(R)/M ≤ M - 1.
  • This means we require B(R) ≤ M(M-1), i.e.,
    roughly M ≥ √B(R).
  • In phase 1, each tuple is read and written once.
    In phase 2, each tuple is read again. We ignore
    the cost of writing the results to disk.
  • Thus, the IO cost is 3B(R).
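Under this buffer model, 2PMMS can be sketched with `heapq.merge` standing in for the k-way merge, and lists standing in for disk blocks:

```python
import heapq

def two_phase_merge_sort(blocks, m):
    """2PMMS sketch: phase 1 sorts M blocks at a time into sorted
    sublists; phase 2 merges all B(R)/M sublists in one pass.
    Feasible only if B(R)/M <= M - 1, i.e. roughly M >= sqrt(B(R))."""
    sublists = []
    for i in range(0, len(blocks), m):      # phase 1
        run = sorted(t for b in blocks[i:i + m] for t in b)
        sublists.append(run)                # "written back to disk"
    assert len(sublists) <= m - 1, "too many sublists for one merge pass"
    return list(heapq.merge(*sublists))     # phase 2: k-way merge

blocks = [[9, 3], [7, 1], [8, 2], [6, 4]]
result = two_phase_merge_sort(blocks, m=3)
print(result)  # [1, 2, 3, 4, 6, 7, 8, 9]
```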

Two-Pass Algorithms Based on Sorting
  • IO cost is 4B(R), if sorting is used as a first
    step of sort-join and the results must be written
    to the disk.
  • If relation R is too big, apply the idea
    recursively:
  • Divide R into chunks of size M(M-1), use 2PMMS to
    sort each one, and take the resulting sorted
    lists as input for a third (merge) phase.
  • This leads to Multi-Phase, Multiway Merge Sort.

Two-Pass Algorithms Based on Sorting
  • Example: M = 101
  • (i) For each 100-block chunk of R:
  •   - read the chunk,
  •   - sort it in memory,
  •   - write the sorted chunk to disk.
Two-Pass Algorithms Based on Sorting
  • (ii) Read all sorted chunks simultaneously,
    merge, and write out the sorted file.
Two-Pass Algorithms Based on Sorting
  • Sort cost: each tuple is read, written, read,
    and written again. Join cost: each tuple is read
    once more.
  • Sort cost R1: 4 x 1,000 = 4,000 IOs
  • Sort cost R2: 4 x 500 = 2,000 IOs
  • Total cost = sort cost + join cost
    = 6,000 + 1,500 = 7,500 IOs
Two-Pass Algorithms Based on Sorting
  • Nested-loop join (best version discussed above)
    needs only 5,500 IOs, i.e. it outperforms
    sort-join.
  • Note the memory requirements: R1 is 1,000 blocks,
    so sorting needs M ≥ 31.62; R2 is 500 blocks, so
    sorting needs M ≥ 22.36. I.e., we need at least
    M = 32 buffers.
  • However, the situation changes for the following
    scenario:
  • R1: 10,000 blocks, clustered
  • R2: 5,000 blocks, not ordered

Two-Pass Algorithms Based on Sorting
  • Nested-loop join: 50 x (100 + 10,000)
    = 505,000 IOs
  • Sort-join: 5 x (10,000 + 5,000) = 75,000 IOs
  • Sort-join clearly outperforms nested-loop join!
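The two cost formulas used in this comparison can be checked numerically; a sketch reproducing the slides' figures (clustered relations assumed for the nested-loop formula):

```python
import math

def nested_loop_cost(b_outer, b_inner, m):
    """Block nested-loop join over clustered relations:
    ceil(B(outer)/(M-1)) chunks, each costing (M-1) + B(inner) IOs."""
    chunks = math.ceil(b_outer / (m - 1))
    return chunks * ((m - 1) + b_inner)

def sort_join_cost(b_r, b_s):
    """Simple sort-join: 5 (B(R) + B(S)) IOs."""
    return 5 * (b_r + b_s)

# small example: B(R1) = 1,000, B(R2) = 500, M = 101
print(nested_loop_cost(500, 1000, 101), sort_join_cost(1000, 500))
# large example: B(R1) = 10,000, B(R2) = 5,000
print(nested_loop_cost(5000, 10000, 101), sort_join_cost(10000, 5000))
```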

Two-Pass Algorithms Based on Sorting
  • Simple sort-join costs 5(B(R) + B(S)) IOs.
  • It requires B(R) ≤ M² and B(S) ≤ M², i.e.,
    roughly M ≥ √max(B(R), B(S)).
  • It assumes that the tuples with the same join
    attribute value fit in M blocks.
  • If we do not have to worry about large numbers of
    tuples with the same join attribute value, then
    we can combine the second phase of the sort with
    the actual join (merge).
  • We can save the writing to disk in the sort step
    and the reading in the merge step.

Two-Pass Algorithms Based on Sorting
  • This algorithm is an advanced sort-merge join.
  • Repeatedly find the least C-value c among the
    tuples in all input buffers.
  • Instead of writing a sorted output buffer to
    disk, and reading it again later, identify all
    the tuples of both relations that have C = c.
  • Cost is only 3(B(R) + B(S)) IOs.
  • Since we have to simultaneously sort both input
    relations and keep them in memory, the memory
    requirement grows: approximately
    B(R) + B(S) ≤ M² must hold.

Two-Pass Algorithms Based on Hashing
  • If both input relations are too large to be
    stored in the buffer, hash all the tuples of both
    relations applying the same hash function to the
    join attribute(s).
  • Hash function h has domain of k hash values, i.e.
    k buckets.
  • Only tuples from R and S that fall into the same
    bucket i can join.
  • Hash first relation R, then relation S, writing
    the buckets to disk.

Two-Pass Algorithms Based on Hashing
  • To hash relation R, read it block by block.
  • Allocate one buffer block to each of the k
    buckets.
  • For each tuple t, move it to the buffer of h(t).
  • If a buffer is full, write it to disk and
    re-initialize it.
  • Finally, write all partially full buffer blocks
    to disk.
  • IO cost is B(R).
  • Memory requirement: M = k + 1.

Two-Pass Algorithms Based on Hashing
  • For each i, read the i-th bucket of R completely
    into memory, and read the i-th bucket of S into
    memory, one block at a time.
  • For each S tuple s in the buffer block, determine
    matching tuples r in R and output (r,s).
  • We assume that each bucket fits into main memory.

Two-Pass Algorithms Based on Hashing
  • Hash join
  • Hash function h, range 0 . . . k
  • Buckets for R1: G0, G1, ..., Gk
  • Buckets for R2: H0, H1, ..., Hk
  • Algorithm:
  • (1) hash R1 tuples into the G buckets
  • (2) hash R2 tuples into the H buckets
  • (3) for i = 0 to k do
  •       match tuples in buckets Gi and Hi
  •       and output the results
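A sketch of the algorithm with h(t) = C mod k; a Python dict plays the role of the second, in-memory hash function used when joining a bucket pair:

```python
def hash_join(r1, r2, k, key=lambda t: t[0]):
    """Two-pass hash join: partition both inputs with h(t) = key(t) % k,
    then join bucket Gi with bucket Hi."""
    g = [[] for _ in range(k)]      # buckets for R1
    h = [[] for _ in range(k)]      # buckets for R2
    for t in r1:
        g[key(t) % k].append(t)
    for t in r2:
        h[key(t) % k].append(t)
    out = []
    for i in range(k):
        by_key = {}                 # in-memory structure for bucket Gi
        for t in g[i]:
            by_key.setdefault(key(t), []).append(t)
        for s in h[i]:              # probe with bucket Hi
            for r in by_key.get(key(s), []):
                out.append((r, s))
    return out

# even/odd example: k = 2
r1 = [(2,), (4,), (3,), (5,), (8,), (9,)]
r2 = [(5,), (4,), (12,), (3,), (13,), (8,), (11,), (14,)]
result = hash_join(r1, r2, k=2)
print(result)  # matches on C = 4, 8, 5, 3
```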
Two-Pass Algorithms Based on Hashing
  • Example: hash function h = even/odd, i.e., two
    buckets.
  • R1 = {2, 4, 3, 5, 8, 9}
  • R2 = {5, 4, 12, 3, 13, 8, 11, 14}
  • Even bucket of R1: {2, 4, 8}
  • Even bucket of R2: {4, 12, 8, 14}
  • Odd bucket of R1: {3, 5, 9}
  • Odd bucket of R2: {5, 3, 13, 11}
Two-Pass Algorithms Based on Hashing
  • R, S clustered (un-ordered).
  • Use 100 hash buckets of 10 blocks each.
  • To hash R: read R, hash each tuple, and write the
    buckets to disk.
  • Hash S in the same way.
Two-Pass Algorithms Based on Hashing
  • Suppose R is the smaller of the input relations.
  • Read one R bucket and build an in-memory hash
    table for it (with another hash function).
  • Read the corresponding S bucket, one block at a
    time, and probe the hash table.
  • Repeat the same procedure for all buckets.
Two-Pass Algorithms Based on Hashing
  • Cost
  • Bucketize: read R + write R buckets,
    read S + write S buckets
  • Join: read R and S buckets
  • Total cost: 3 x (1,000 + 500) = 4,500 IOs
  • This is an approximation, since buckets will vary
    in size, and we have to round up to full blocks.

Two-Pass Algorithms Based on Hashing
  • Memory requirements:
  • Size of an R bucket is B(R)/(M-1), where
    k = M-1 is the number of hash buckets.
  • This assumes that all hash buckets of R have
    the same size.
  • Same calculation for S.
  • The buckets of the smaller input relation must
    fit into main memory: B(R)/(M-1) ≤ M, i.e.,
    roughly M ≥ √B(R).

Index-Based Algorithms
  • Index-based algorithms are especially useful for
    the selection operator, but also for the join
    operator.
  • We distinguish clustering and non-clustering
    indexes.
  • A clustering index is an index where all tuples
    with a given search key value appear on (roughly)
    as few blocks as possible.
  • One relation can have only one clustering index,
    but multiple non-clustering indexes.

Index-Based Algorithms
  • Index join
  • for each r ∈ R do
  •   X ← index(S, C, r.C)
  •   for each s ∈ X do
  •     output (r,s)
  • index(rel, attr, value) returns the set of rel
    tuples with attr = value.
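The pseudocode above, with a dict standing in for index(S, C, ·) (built here on the fly; the slides assume the index already exists):

```python
from collections import defaultdict

def build_index(rel, attr):
    """index(rel, attr, value) support: value -> list of matching tuples."""
    idx = defaultdict(list)
    for t in rel:
        idx[t[attr]].append(t)
    return idx

def index_join(r, s_index, attr):
    """For each r in R, probe the index on S.C and output the pairs."""
    for rt in r:
        for st in s_index.get(rt[attr], []):   # X <- index(S, C, r.C)
            yield (rt, st)

s = [{"C": 1, "v": "s1"}, {"C": 2, "v": "s2"}]
r = [{"C": 2, "w": "r1"}, {"C": 3, "w": "r2"}]
result = list(index_join(r, build_index(s, "C"), "C"))
print(result)  # [({'C': 2, 'w': 'r1'}, {'C': 2, 'v': 's2'})]
```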
Index-Based Algorithms
  • Example: assume an index on R.C exists with 2
    levels; assume S clustered, unordered; assume the
    R.C index fits in memory.
  • Cost: read of S = 500 IOs
  • For each S tuple:
  •   - probe the index: no IO
  •   - if there is a match, read the R tuple: 1 IO

Index-Based Algorithms
  • What is the expected number of matching tuples?
  • (a) Say R.C is a key and S.C a foreign key: then
    we expect 1 match.
  • (b) Say V(R,C) = 5,000 and T(R) = 10,000: with
    the uniform distribution assumption, we expect
    10,000 / 5,000 = 2 matching tuples.
Index-Based Algorithms
  • (c) Say DOM(R,C) = 1,000,000 and T(R) = 10,000:
    with the alternate assumption, we expect
    10,000 / 1,000,000 = 1/100 matches.
Index-Based Algorithms
  • Total cost of the index join:
  • (a) total cost = 500 + 5,000 x 1 = 5,500 IOs
  • (b) total cost = 500 + 5,000 x 2 = 10,500 IOs
  • (c) total cost = 500 + 5,000 x (1/100) = 550 IOs
Index-Based Algorithms
  • What if the index does not fit in memory?
  • Example: say the R1.C index occupies 201 blocks.
  • Keep the root and 99 leaf nodes in memory.
  • Expected cost of each probe is
    E = (0)(99/200) + (1)(101/200) ≈ 0.5 IOs.

Index-Based Algorithms
  • Total cost (including probes):
  • For case (b):
  •   500 + 5,000 x (probe + get records)
  •   = 500 + 5,000 x (0.5 + 2)   (uniform assumption)
  •   = 500 + 12,500 = 13,000 IOs
  • For case (c):
  •   500 + 5,000 x (0.5 + (1/100) x 1)
  •   = 500 + 2,500 + 50 = 3,050 IOs

Summary Join Algorithms
  • Nested-loop join: ok for small relations
    (relative to memory size).
  • Hash join: usually best for equi-joins, where the
    relations are not sorted and no indexes exist.
  • Sort-merge join: good for non-equi-joins, e.g.,
    R.C > S.C.
  • If the relations are already sorted, use merge
    join.
  • If an index exists, index join can be efficient
    (depends on the expected result size).