Query Execution - PowerPoint PPT Presentation

1 / 50
About This Presentation
Title:

Query Execution

Description:

Query Execution Optimizing Performance – PowerPoint PPT presentation

Number of Views:69
Avg rating:3.0/5.0
Slides: 51
Provided by: thomog
Category:

less

Transcript and Presenter's Notes

Title: Query Execution


1
Query Execution
  • Optimizing Performance

2
Resolving an SQL query
  • Since our SQL queries are very high level, the
    query processor must do a lot of additional
    processing to supply all of the missing details.
  • In practice, an SQL query is translated
    internally into a relational algebra expression.
  • One advantage of using relational algebra is that
    it makes alternative forms of a query easier to
    explore.
  • The different algebraic expressions for a query
    are called logical query plans.
  • We will focus first on the methods for executing
    the operations of the relational algebra.
  • Then we will focus on how transform logical query
    plans.

3
Preview
  • Parsing read SQL, output relational algebra tree
  • Query rewrite Transform tree to a form, which is
    more efficient to evaluate
  • Physical plan generation
  • select implementation for each operator in
    tree,
  • and for passing results up the tree.
  • In this chapter we will focus on the
    implementation for each operator.

4
Relational Algebra recap
  • RA union, intersection and difference correspond
    to UNION, INTERSECT, and EXCEPT in SQL
  • Selection ? corresponds to the WHERE-clause in
    SQL
  • Projection ? corresponds to SELECT-clause
  • Product ? corresponds to FROM-clause
  • Joins corresponds to JOIN, NATURAL JOIN, and
    OUTER JOIN in the SQL2 standard
  • Duplicate elimination ? corresponds to DISTINCT
    in SELECT-clause
  • Grouping ? corresponds to GROUP BY
  • Sorting ? corresponds to ORDER BY

5
Expression trees
MovieStar(name,addr,gender,birthdate) StarsIn(titl
e, year, starName) SELECT title, birthdate FROM
MovieStar, StarsIn WHERE year 1996 AND Gender
F AND starName name
6
How to generate such alternative expression trees
will be Chapter 16.
  • Join method?
  • Can we pipeline the result of one or both
    selections, and avoid storing the result on disk
    temporarily?
  • Are there indexes on MovieStar.gender and/or
    StarsIn.year that will make the ?'s efficient?

7
Relational algebra for real SQL
  • Keep in mind the following fact
  • A relation in algebra is a set, while a relation
    in SQL is probably a bag
  • In short, a bag allows duplicates.
  • Not surprisingly, this can effect the cost of
    related operations.

8
Bag union, intersection, and difference
  • Card(t,R) means the number of occurrences of
    tuple t in relation R
  • Card(t, R?S) Card(t,R) Card(t,S)
  • Card(t,R?S) minCard(t,R), Card(t,S)
  • Card(t,RS) maxCard(t,R)Card(t,S), 0
  • Example R A,B,B, S C,A,B,C
  • R ? S A,A,B,B,B,C,C
  • R ? S A,B
  • R S B

9
Beware Bag Laws ! Set Laws
  • Not all algebraic laws that hold for sets also
    hold for bags.
  • For one example, the commutative law for union
  • R ? S S ? R does hold for bags.
  • Since addition is commutative, adding the number
    of times that tuple x appears in R and S doesnt
    depend on the order of R and S.
  • Set union is idempotent, meaning that S ? S S.
  • However, for bags, if x appears n times in S,
    then it appears 2n times in S ? S.
  • Thus S ? S ! S in general.

10
Physical operators
  • Physical query plans are built from physical
    operators.
  • In most cases, the physical operators are direct
    implementations of the relational algebra
    operators.
  • However, there are several other physical
    operators for various supporting tasks.
  • Table-scan (the most basic operation we want to
    perform in a physical query plan)
  • Index-scan (E.g. if we have an index on some
    relation R we can retrieve the blocks of R by
    using the index)
  • Sort-scan (takes a relation R and a specification
    of the attributes on which the sort is to be
    made, and produces R in sorted order)

11
Computational Model
  • When comparing algorithms for the same operations
    we do not consider the cost of writing the
    output.
  • Because the cost of writing the output on the
    disk depends on the size of the result, not on
    the way the result was computed. In other words,
    it is the same for any computational alternative.
  • Also, we can often pipeline the result to other
    operators when the result is constructed in main
    memory. So a final output phase may not even be
    required.

12
Cost parameters
  • In order to evaluate or estimate the cost of
    query resolution, we need to identify the
    relevant parameters.
  • Typical cost parameters include
  • R the relation on disk
  • M number of main memory buffers available
    (1buffer 1block)
  • B(R) number of blocks of R
  • T(R) number of tuples of R
  • V(R, a) number of distinct values in column a
    of R
  • V(R, L) number of different tuples in R (where
    L is a list of attributes or columns)
  • Simple cost estimate
  • Basic scan B(R)
  • 2PMMS 3B(R)
  • Recall that final output is not counted

13
Algorithms for implementing RA-operators
  • Classification of algorithms
  • Sort based methods
  • Hash based methods
  • Index based methods
  • Degree of difficultness of algorithms
  • One pass (when one relation can fit into main
    memory)
  • Two pass (when no relation can fit in main
    memory, but again the relations are not very
    extremely large)
  • Multi pass (when the relations are very extremely
    large)
  • Classification of operators
  • Tuple-at-a-time, unary operations(s, p)
  • Full-relation, unary operations (d, g)
  • Full-relation, binary operations (union, join,)

14
Selection --
  • The condition C might involve
  • Arithmetic (,-, ?) or string operators such as
    LIKE
  • Comparison between terms, e.g. a lt b or ab 10.
  • Boolean connectives AND, OR, and NOT
  • Example R

a b ---- 2 3 4 5 2 3
a b ---- 0 1 2 3 4 5 2 3
a b ---- 4 5
One pass, tuple-at-a-time
15
Projection --
  • Argument L of ? is a sequence of elements of the
    following form
  • A single attribute in R, or
  • An expression x ? y, where x and y are attribute
    names, or
  • An expression E ? z, where E is an expression
    involving attributes in R and z is a new
    attribute name not in R
  • Example R

a b c ------ 0 1 2 0 1 2 3 4 5
a x ---- 0 3 0 3 3 9
x y ---- 1 1 1 1 1 1
One pass, tuple-at-a-time
16
One pass, tuple-at-a-time
  • Selection and projection
  • Cost B(R) or T(R)
  • (if the relation is
  • not clustered)
  • Space requirement
  • M ? 1 block
  • Principle
  • Read one block (or one tuple if the relation is
    not clustered) at a time
  • Filter in or out the tuples of this block.

17
Duplicate Elimination
  • R1 ?(R2).
  • R1 consists of one copy of each tuple that
    appears in R2 one or more times.

R A B 1 2 3 4 1 2
One pass, unary full-relation operations
18
One pass, unary full-relation operations
  • Duplicate elimination
  • for each tuple decide
  • seen before ignore
  • new output
  • Principle
  • It is the first time we have
  • seen this tuple, in which case we copy it to the
    output.
  • We have seen the tuple before,in which case we
    must not output this tuple.
  • We need a Main Memory hash-table to be efficient.
  • Requirement ?

19
One pass, unary full-relation operations
20
One pass, unary full-relation operations
21
One pass, unary full-relation operations
  • Grouping Accumulate the information on groups in
    main memory.
  • Need to keep one entry for each value of the
    grouping attributes, through a main memory
    search structure (hash table).
  • Then, we need for each group to keep an
    aggregated value (or values if the query asks for
    more than one aggregation).
  • For MIN/MAX we keep the min or max value seen so
    far for the group.
  • For COUNT aggregation we keep a counter which is
    incremented each time we encounter a tuple
    belonging to the group.
  • For SUM, we add the value if the tuple belongs to
    the group.
  • For AVG?
  • MM requirement.
  • Typically, a (group) tuple will be smaller than a
    tuple of the input relation,
  • Typically, the number of groups will be smaller
    than the number of tuples in the input relation.
    The number of groups is
  • Requirement ? lt M

22
One pass, binary operators
  • Requirement min(B(R),B(S)) M
  • Exception bag union
  • Cost B(R) B(S)
  • Assume R is larger than S.
  • How to perform the operations below
  • Set union, set intersection, set difference
  • Bag intersection, bag difference
  • Cartesian product, natural join
  • All these operators require reading the smaller
    of the relations into main memory using there a
    search scheme (e.g. main memory hash table) for
    easy search and insertion.

23
Set Union
  • Let R and S be sets.
  • We read S into M-1 buffers of main memory.
  • All these tuples are also copied to the output.
  • We then read each block of R into the Mth buffer,
    one at a time.
  • For each tuple t of R we see if t is in S, and if
    not, we copy t to output.
  • At the end we output S from main mem.

24
Set Intersection
  • Let R and S be sets.
  • The result will be set.
  • We read S into M-1 buffers of main memory.
  • We then read each block of R into the M-th
    buffer, one at a time.
  • For each tuple t of R we see if t is in S, and if
    so, we copy t to output.

25
Set Difference
  • Let R and S be sets.
  • Since difference is not a commutative operator,
    we must distinguish between R-S and S-R.
  • Read S into M-1 buffers of main memory.
  • Then read each block of R into the Mth buffer,
    one at a time.
  • To compute R-S
  • for each tuple t of R we see if t is not in S,
    and if so, we copy t to output.
  • To compute S-R
  • for each tuple t of R we see if t is is in S, we
    delete t from S in such a case. At the end we
    output those tuples of S that remain.

26
Bag Intersection
  • Let R and S be bags.
  • Read S into M-1 buffers of main memory.
  • Also, associate with each tuple a count, which
    initially measures the number of times the tuple
    occurs in S.
  • Then read each block of R into the M-th buffer,
    one at a time.
  • For each tuple t of R we see if t is in S. If not
    we ignore it.
  • Otherwise, if the counter is greater than zero,
    we output t and decrement the counter.

27
Bag Difference
  • We read S into M-1 buffers of main memory.
  • Also, we associate with each tuple a count, which
    initially measures the number of times the tuple
    occur in S.
  • We then read each block of R into the M-th
    buffer, one at a time.
  • To compute S-R
  • for each tuple t of R we see if t is is in S, we
    decrement its counter.
  • At the end we output those tuples of S that
    remain with counter positive.
  • To compute R-S
  • we may think of the counter c for tuple t as
    having c reasons to not output t.
  • Now, when we process a tuple of R we check to see
    if that tuple appears in S. If not we output t.
  • Otherwise, we check to see the counter c of t. If
    it is 0 we output t.
  • If not, we dont output t, and we decrement c.

28
Product
  • We read S into M-1 buffers of main memory. No
    special structure is needed.
  • We then read each block of R into the M-th
    buffer, one at a time. And combine each tuple
    with all the tuples of S.

29
(No Transcript)
30
Natural Join
  • We read S into M-1 buffers of main memory and
    build a search structure where the search key is
    the shared attributes X of R and S.
  • We then read each block of R into the M-th
    buffer, one at a time. For each tuple t of R we
    look for tuples t in S, and if tX uX, we
    copy t.u to the output.

31
Nested-Loop joins
  • one-and-a-half pass method, since one relation
    will be read just once.
  • Tuple-Based Nested-loop Join Algorithm
  • FOR each tuple s in S DO
  • FOR each tuple r in R DO
  • IF r and s join to make a tuple t THEN
  • output t
  • Improvement to Take Advantage of Disk I/O Model
  • Instead of retrieving tuples of R, T(S) times,
    load memory with as many tuples of S as can fit,
    and match tuples of R against all S tuples in
    memory.

32
Block-based nested loops
  • Assume B(S) B(R), and B(S) gt M
  • Read M-1 blocks of S into main memory and compare
    to all of R, block by block
  • FOR each chunk of M-1 blocks of S DO
  • FOR each block b of R DO
  • FOR each tuple t of b DO
  • find the tuples of S in memory that join
    with t
  • output the join of t with each of these
    tuples

33
Example
  • B(R) 1000, B(S) 500, M 101
  • Outer loop iterates 5 times
  • At each iteration we read M-1 (i.e. 100) blocks
    of S and all of R (i.e. 1000) blocks.
  • Total time 5(100 1000) 5500 I/Os
  • Question What if we reversed the roles of R and
    S?
  • We would iterate 10 times, and in each we would
    read 100500 blocks, for a total of 10(100500)
    6000 I/Os.
  • Compare with one-pass join, if it could be done!
  • We would need 1500 disk I/Os if B(S) ? M-1

34
Analysis of blocks nested loops
  • Number of disk I/Os
  • B(S)/(M-1) (M-1 B(R))
  • or
  • B(S) (B(S)B(R))/(M-1)
  • or approximately (B(S)B(R))/M

35
Summary of one-pass algorithms
36
Two-pass algorithms based on sorting
  • This special case of multi-pass algorithms is
    sufficient for most of the relation sizes.
  • Main idea for unary operations on R
  • Suppose B(R) ? M (main memory size in blocks)
  • First pass
  • Read M blocks of R into Main Memory
  • Sort the content of Main Memory
  • Write the sorted result (sublist/run) into M
    blocks on disk.
  • Second pass create final result

37
Duplicate elimination ? using sorting
  • In the second phase (merging) we dont sort but
    copy each tuple just once.
  • We can do that because the identical tuples will
    show up at the same time, i.e. they will be all
    the first ones at the buffers (for the sorted
    sublists).
  • As usual, if one buffer gets empty we refill it.

38
Duplicate-Elimination using Sorting Example
  • Assume M3, each buffer holds 2 records and
    relation R consists of the following 17 tuples
  • 2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1,
    3
  • After the first pass the following sorted
    sub-lists are created
  • 1, 2, 2, 2, 2, 5
  • 2, 3, 4, 4, 4, 5
  • 1, 1, 2, 3, 5
  • In the second pass we dedicate a memory buffer to
    each sub-list.

39
Example (Contd)
40
Example (Contd)
41
Example (Contd)
42
Analysis of ?(R)
  • 2B(R) to create sorted sublists, B(R) to read
    each sublist in phase 2. Total 3B(R)
  • How large can R be?
  • There can be no more than M sublists since we
    need one buffer for each one.
  • So, B(R)/M M, (B(R)/M is the number of
    sublists)
  • i.e. B(R) M2
  • To compute ?(R) we need at least sqrt(B(R))
    blocks of Main Memory.

43
Sort-based ?, ?, -
Example set union.
  • Create sorted sublists of R and S
  • Use input buffers for sorted sublists of R and S,
    one buffer per sublist.
  • Output each tuple once.
  • We can do that since all the identical tuples
    appear at the same time.
  • Analysis 3(B(R) B(S)) disk I/Os
  • Condition B(R) B(S) M2
  • Similar algorithms for sort based intersection
    and difference (bag or set versions).

44
Simple sort-based join
  • For R(X,Y) S(Y,Z) with M buffers of memory
  • Completely sort R on Y, sort S on Y
  • Merge phase
  • Use 2 input buffers 1 for R, 1 for S.
  • Pick tuple t with smallest Y value in the buffer
    for R
  • If t doesnt match with the first tuple in the
    buffer for S, then just remove t.
  • Otherwise, read all the tuples from R with the
    same Y value as t and put them in the M-2 part of
    the memory.
  • When the input buffer for R is exhausted fill it
    again and again.
  • Then, read the tuples of S that match. For each
    one we produce the join of it with all the tuples
    of R in the M-2 part of the memory.

45
Example of sort join
  • B(R) 1000, B(S) 500, M 101
  • To sort R, we need 4B(R) I/Os, same for S.
  • Number of I/Os 4(B(R) B(S))
  • Doing the join in the merge phase
  • Number of I/Os B(R) B(S)
  • Total disk I/Os 5(B(R) B(S)) 7500
  • Memory Requirement?
  • To be able to do the sort, we should have B(R)
    M2 and B(S) M2
  • Recall for nested-loop join, we needed 5500 disk
    I/Os, but the memory requirement was quadratic
    (it is linear, here), i.e., nested-loop join is
    not good for joining relations that are much
    larger than main memory.

46
Potential problem ...
S(Y, Z) --------- a z1 a z2 ...
a zm
R(X , Y) ----------- x1 a x2 a
xn a
What if Size of n tuples gt M-2 and Size of m
tuplesgt M-2?
  • If the tuples from R (or S) with the same value y
    of Y do not fit in M-2 buffers, then we use all
    M-2 buffers to do a nested-loop join on the
    tuples with Y-value y from both relations.
  • Observe that we can smoothly continue with the
    nested loop join when we see that the R tuples
    with Y-value y do not fit in M-2 buffers.

47
Can We Improve on Sort Join?
  • Do we really need the fully sorted files?
  • Suppose we are not worried about many common Y
    values

48
A more efficient sort-based join
  • Suppose we are not worried about many common Y
    values
  • Create Y-sorted sublists of R and S
  • Bring first block of each sublist into a buffer
    (assuming we have at most M sublists)
  • Find smallest Y-value from heads of buffers. Join
    with other tuples in heads of buffers, use other
    possible buffers, if there are many tuples with
    the same Y values.
  • Disk I/O 3(B(R) B(S))
  • Requirement B(R) B(S) M2

49
Example
  • B(R) 1000, B(S) 500, M 101
  • Total of 15 sorted sublists
  • If too many tuples join on a value Y, use the
    remaining 86 MM buffers for a one pass join on Y
  • Total cost 3(1000 500) 4500 disk I/Os
  • M2 10201 gt B(R) B(S), so the requirement is
    satisfied

50
Summary of sort-based algorithms
Write a Comment
User Comments (0)
About PowerShow.com