COMP 5138 Relational Database Management Systems - PowerPoint PPT Presentation

1 / 42
About This Presentation
Title:

COMP 5138 Relational Database Management Systems

Description:

translate the query into its internal form. This is then translated into ... path: An index or file scan that we estimate will require the fewest page I/Os. ... – PowerPoint PPT presentation

Number of Views:38
Avg rating:3.0/5.0
Slides: 43
Provided by: kellie7
Category:

less

Transcript and Presenter's Notes

Title: COMP 5138 Relational Database Management Systems


1
COMP 5138Relational Database Management Systems
Semester 2, 2007 Lecture 12 Query Processing and
Optimization
2
Todays Agenda
  • Parsing and Translation
  • Evaluation
  • Individual operations
  • Entire operations
  • Optimization
  • Cost-based
  • Heuristic

3
Overview Query Processing
  • Basic Steps
  • Parsing and Translation
  • Optimisation
  • Evaluation

Source Silberschatz/Korth/Sudarshan Database
System Concepts, 2002.
4
Basic Steps in Query Processing
  • Parsing and Translation
  • translate the query into its internal form. This
    is then translated into relational algebra.
  • Parser checks syntax, verifies relations
  • Query Optimization
  • Amongst all equivalent query-evaluation plans
    choose the one with lowest cost.
  • Query Evaluation
  • Query-execution engine takes a query-evaluation
    plan, executes that plan, and returns the answers
    to the query.

5
Parsing and Translation
  • SQL gets translated into relational algebra,
    which can be shown as expression tree.
  • Operators have one or more input sets and return
    one (or more) output set(s).
  • Leafs correspond to (base) relations.
  • Example SELECT name FROM students ,
    enrolled WHERE cid COMP5138

Projection
?name
Join
Selection
?cidCOMP5138
students
enrolled
6
Todays Agenda
  • Parsing and Translation
  • Evaluation
  • Individual operations
  • Entire operations
  • Optimization
  • Cost-based
  • Heuristic

7
Query Evaluation Plan
  • Each relational algebra operation can be
    evaluated using one of several different
    algorithms
  • Correspondingly, a relational-algebra expression
    can be evaluated in many ways.
  • Annotated expression specifying detailed
    evaluation strategy is called an evaluation-plan.
  • Example
  • We can use an index onbalance to find
    accountswith balancelt2500.
  • Or can perform completerelation scan and
    discardaccounts with balance ? 2500

? balance
?balance?2500
account
8
Expression Tree vs. Query Evaluation Plan
? balance
Project balance
Project balance
?balance?2500
Index Range Scan(account.balance, balance?2500)
Filterbalance?2500
account
Table Scan (account)
  • Relational algebra operators logical operators
    which represent the intermediate results.
  • Physical operators show how query is evaluated.

9
Measures of Query Costs
  • Query Optimization Amongst all equivalent
    evaluation plans choose the one with lowest cost.
  • Cost is generally measured as total elapsed time
    for answering query
  • Many factors contribute to time cost
  • disk accesses, CPU, or even network communication
  • Typically disk access is the predominant cost,
    and is also relatively easy to estimate.
  • For simplicity, we just use number of block
    transfers from disk as the cost measure
  • We ignore the difference in cost between
    sequential and random I/O for simplicity
  • We also ignore CPU costs for simplicity

10
Access Path
  • An access path is a method of retrieving tuples
  • File scan, or index that matches a selection (in
    the query)
  • A tree index matches (a conjunction of) terms
    that involve only attributes in a prefix of the
    search key.
  • E.g., Tree index on lta, b, cgt matches the
    selection a5 AND b3, and matches a5 AND bgt6,
    but not b3.
  • A hash index matches (a conjunction of) terms
    that has a term attribute value for every
    attribute in the search key of the index.
  • E.g., Hash index on lta, b, cgt matches a5 AND
    b3 AND c5 but it does not match b3, or a5
    AND b3, or agt5 AND b3 AND c5.

11
Selection
  • Find the most selective access path, retrieve
    tuples using it, and apply any remaining terms
    that dont match the index
  • Most selective access path An index or file scan
    that we estimate will require the fewest page
    I/Os.
  • Table scan search algorithms that locate and
    retrieve records that fulfill a selection
    condition.
  • Cost estimate (number of disk blocks scanned)
    bR
  • bR denotes number of blocks containing records
    from relation R
  • e.g. Oracle TABLE SCAN, TABLE ACCESS FULL
  • Index scan search algorithms that use an index
  • selection condition must be on search-key of
    index.
  • Cost HTi number of blocks containing
    retrieved records (PK 1 block)
  • HT number of index level
  • e.g. Oracle INDEX RANGE SCAN or INDEX UNIQUE SCAN

12
Projection
  • The expensive part is removing duplicates.
  • SQL systems dont remove duplicates unless the
    keyword DISTINCT is specified in a query.
  • Sorting Approach Sort on ltsid, bidgt and remove
    duplicates. (Can optimize this by dropping
    unwanted information while sorting.)
  • Hashing Approach Hash on ltsid, bidgt to create
    partitions. Load partitions into memory one at a
    time, build in-memory hash structure, and
    eliminate duplicates.

13
Sorting
  • Importance of sorting
  • SQL queries can specify that the output be sorted
  • SQL and relational algebra can be implemented
    efficiently if the input are sorted
  • We may build an index on the relation, and then
    use the index to read the relation in sorted
    order. May lead to one disk block access for
    each tuple.
  • For relations that fit in memory, techniques like
    quick sort can be used. For relations that dont
    fit in memory, external sort-merge is a good
    choice.

14
External Sort-Merge
?
Let M denote memory size (in pages).
  • Create sorted runs. Let i be 0 initially.
    Repeatedly do the following till the end of the
    relation (a) Read M blocks of relation
    into memory (b) Sort the in-memory blocks
    (c) Write sorted data to run Ri increment
    i.Let the final value of i be N
  • Merge the runs (N-way merge). We assume (for now)
    that N lt M.
  • Use N blocks of memory to buffer input runs, and
    1 block to buffer output. Read the first block of
    each run into its buffer page
  • repeat
  • Select the first record (in sort order) among all
    buffer pages
  • Write the record to the output buffer. If output
    is full, write it to disk.
  • If this is the last records of the input buffer
    page then read
    the next block (if any) of the run into the
    buffer.
  • until all input buffer pages are empty
  • If i ? M, several merge passes are required.
  • In each pass, contiguous groups of M - 1 runs are
    merged.

15
External Sort-Merge - Example
16
External Sort-Merge
?
  • Cost analysis
  • Total number of merge passes required
    ?logM1(br/M)?.
  • Disk accesses for initial run creation as well as
    in each pass is 2br
  • for final pass, we dont count write cost
  • we ignore final write cost for all operations
    since the output of an operation may be sent to
    the parent operation without being written to
    disk
  • Thus total number of disk accesses for external
    sorting
  • br ( 2 ?logM1(br / M)? 1)

17
Join Operations
  • Several different algorithms to implement joins
  • Nested-loop join
  • Block nested-loop join
  • Indexed nested-loop join
  • Merge-join
  • Hash-join
  • Choice based on cost estimate
  • Examples use the following information
  • Number of records of students 1,000 enrolled
    10,000
  • Number of blocks of students 100 enrolled 400

18
Nested-Loop Join
  • To compute the theta join R ? Sfor each tuple
    r in R do begin for each tuple s in S do begin
    if ?(r,s)true then add rs to the
    result endend
  • R is called the outer relation,S the inner
    relation of the join
  • Requires no indices and can be used with any kind
    of join condition.
  • Expensive since it examines every pair of tuples
    in the two relations.

19
Nested-Loop Join Cost Analysis
  • In the worst case, if there is memory only to
    hold one block of each relation, the estimated
    cost is R ? bS
    bR
  • If the smaller relation fits entirely in memory,
    use that as the inner relation. Reduces cost to
    bR bS disk accesses.
  • Example
  • In the worst case scenariostudents as outer
    relation 1000 ? 400 100 400,100 disk
    accessesenrolled as outer relation 10000 ?100
    400 1,000,400 disk accesses
  • If smaller relation (students) fits entirely in
    memory, the cost estimate will be 500 disk
    accesses.
  • Block nested-loops algorithm (next slide) is
    preferable.

20
Block Nested-Loop Join
  • Variant of nested-loop join in which every block
    of inner relation is paired with every block of
    outer relation.
  • for each block BR of R do begin for each block
    Bs of S do begin for each tuple r in BR do
    begin for each tuple s in Bs do begin
    if ?(r,s)true then add rs to the
    result end end endend

21
Block Nested-Loop Join Cost Analysis
  • Worst case estimate bR ? bS bR block
    accesses.
  • Each block in the inner relation S read once for
    each block in the outer relation (instead of once
    for each tuple in the outer relation)
  • Best case bR bS block accesses.
  • Improvements to nested loop block nested loop
    algorithms
  • In block nested-loop, use M 2 disk blocks as
    blocking unit for outer relations, where M
    memory size in blocks use remaining two blocks
    to buffer inner relation and output
  • Cost ?bR / (M-2)? ? bS bR
  • If equi-join attribute forms a key on inner
    relation, stop inner loop on first match
  • Scan inner loop forward and backward alternately,
    to make use of the blocks remaining in buffer
    (with LRU replacement)
  • Use index on inner relation if available (next
    slide)

22
Indexed Nested-Loop Join
  • Index lookups can replace file scans if
  • join is an equi-join or natural join and
  • an index is available on the inner relations
    join attribute
  • For each tuple r in the outer relation R, use
    the index to look up tuples in S that satisfy the
    join condition with tuple R.
  • Worst case buffer has space for only one page
    of R, and, for each tuple in R, we perform an
    index lookup on S.
  • Cost of the join bR R ? c
  • Where c is the cost of traversing index and
    fetching all matching S tuples for one tuple of R
  • c can be estimated as cost of a single selection
    on S using the join condition.
  • If an index exists on the appropriate column of
    each relation, you can choose which relation is
    the outer one
  • Compare the costs each way

23
Merge-Join
  • Sort both relations on their join attribute (if
    not already sorted on the join attributes).
  • Merge the sorted relations to join them
  • Join step is similar to the merge stage of the
    sort-merge algorithm.
  • Main difference is handling of duplicate values
    in join attribute
  • Can construct an index just to compute a join.
  • Detailed algorithm in textbook

24
Merge-Join Cost Analysis
  • Can be used only for equi-joins and natural joins
  • Each block needs to be read only once (assuming
    all tuples for any given value of the join
    attributes fit in memory
  • Thus number of block accesses for merge-join is
    bR bS the cost of sorting if
    relations are unsorted.
  • hybrid merge-join If one relation is sorted, and
    the other has a secondary B-tree index on the
    join attribute
  • Merge the sorted relation with the leaf entries
    of the B-tree .
  • Sort the result on the addresses of the unsorted
    relations tuples
  • Scan the unsorted relation in physical address
    order and merge with previous result, to replace
    addresses by the actual tuples
  • Sequential scan more efficient than random lookup

25
Hash-Join
  • Applicable for equi-joins and natural joins.
  • A hash function h is used to partition tuples of
    both relations
  • h maps JoinAttrs values to 0, 1, ..., n, where
    JoinAttrs denotes the common attributes of R and
    S used in the natural join.
  • R0, R1, . . ., Rn denote partitions of R tuples
  • Each tuple r ? R is put in partition Ri where i
    h(r JoinAttrs).
  • S0, S1. . ., Sn denotes partitions of S tuples
  • Each tuple s ? S is put in partition Si, where i
    h(sJoinAttrs).

26
Hash-Join (contd)
27
Other Operations
  • Duplicate elimination can be implemented via
    hashing or sorting.
  • On sorting duplicates will come adjacent to each
    other, and all but one set of duplicates can be
    deleted.
  • Hashing is similar duplicates will come into
    the same bucket.
  • Aggregation can be implemented in a manner
    similar to duplicate elimination.
  • Sorting or hashing can be used to bring tuples in
    the same group together, then the aggregate
    functions can be applied on each group.
  • Set Operations (UNION, INTERSECT, SET MINUS)

28
Evaluation of Expressions
  • So far we have seen algorithms for individual
    operations
  • Alternatives for evaluating an entire expression
    tree
  • Materialization (also set-at-a-time) simply
    evaluate one operation at a time. The result of
    each evaluation is materialized (stored) in a
    temporary relation for subsequent use.
  • Pipelining (also tuple-at-a-time) evaluate
    several operations simultaneously in a pipeline

29
Materialization
  • Materialized evaluation evaluate one operation
    at a time, starting at the lowest-level. Use
    intermediate results materialized into temporary
    relations to evaluate next-level operations.
  • E.g., in figure below, compute and storethen
    compute the store its join with customer, and
    finally compute the projections on customer-name.
  • Materialized evaluation is always
    applicable
  • Costs can be quite high

30
Pipelining
  • Pipelined evaluation evaluate several
    operations simultaneously, passing the results of
    one operation on to the next.
  • E.g., in previous expression tree, dont store
    result of
  • instead, pass tuples directly to the join.
    Similarly, dont store result of join, pass
    tuples directly to projection.
  • Much cheaper than materialization no need to
    store a temporary relation to disk.
  • Pipelining may not always be possible e.g.,
    sort, hash-join.

31
Todays Agenda
  • Parsing and Translation
  • Evaluation
  • Individual operations
  • Entire operations
  • Optimization
  • Cost-based
  • Heuristic

32
Intro to Query Optimization
  • A relational algebra expression may have many
    equivalent expressions
  • Example SELECT BALANCE
    FROM account WHERE
    balance lt 2500
  • Can be translated into
    ?balance?2500(?balance(account)) which is
    equivalent to
    ?balance(?balance?2500(account))

33
Query Optimization
  • Central Problems
  • Query is (by definition) declarative, e.g. it
    does not specify the execution order. But we
    need an executable plan.
  • The goal of query optimization is less to find
    the optimal plan, but more to avoid the worst.
    Time for query optimization adds to total
    query execution time.
  • Three evaluation plan
  • Interaction of evaluation techniques
  • Cost-based
  • Heuristic

34
Cost-based Query Optimization
  • Generation of query-evaluation plans for an
    expression involves several steps
  • Generating logically equivalent expressions
  • Use equivalence rules to transform an expression
    into an equivalent one.
  • Annotating resultant expressions to get
    alternative query plans
  • Choosing the cheapest plan based on estimated
    cost
  • The overall process is called cost-based
    optimization.
  • Some systems only have rule-based optimization,
    i.e. they dont take statistical information into
    account.

35
Statistic Information in the DB Catalog
  • Need information about the relations and indexes
    involved. Catalogs typically contain at least
  • tuples (NTuples) and pages (NPages) for each
    relation.
  • distinct key values (NKeys) and NPages for each
    index.
  • Index height, low/high key values (Low/High) for
    each tree index.
  • Catalogs are updated periodically.
  • Updating whenever data changes is too expensive
    lots of approximation anyway, so slight
    inconsistency ok.
  • More detailed information (e.g., histograms of
    the values in some field) are sometimes stored.

36
Heuristic Optimization
  • Cost-based optimization is expensive, even with
    dynamic programming.
  • Systems may use heuristics to reduce the number
    of choices that must be made in a cost-based
    fashion.
  • Heuristic optimization transforms the query-tree
    by using a set of rules that typically (but not
    in all cases) improve execution performance
  • Perform selection early (reduces the number of
    tuples)
  • Perform projection early (reduces the number of
    attributes)
  • Perform most restrictive selection and join
    operations before other similar operations.
  • Some systems use only heuristics, others combine
    heuristics with partial cost-based optimization.

37
Transformation Example
  • Query Find the names of all students who have
    enrolled in some course with 6 credit
    points. ?name(?credit_points6 (course
    (enrolled student)))
  • Transformation using distribution rule 3
    ?name( (?credit_points6 (course))
    (enrolled student)))
  • Performing the selection as early as possible
    reduces the size of the relation to be joined.

38
Example with Multiple Transformations
  • Query Find the names of all students which
    enrolled in a 6 credit_point course, whose grade
    is a distinction (D).?name(?gradeD ?
    credit_points6 (course (enrolled
    student)))
  • Transformation using join association
    rule?name(?gradeD ? credit_points6
    ((course enrolled) student))
  • Second form provides an opportunity to apply the
    perform selections early rule, resulting in the
    subexpression
  • ?credit_points6 (course) ?
    gradeD (enrolled)
  • Thus a sequence of transformations can be useful

39
Optimization Strategies
  • Ideally find the optimal plan.
  • This approach is very expensive in time and
    space.
  • Typically Find a good plan, avoiding the worst
    plan.
  • do not generate all expressions
  • Central Problem Join Ordering (Join
    Enumeration)
  • Typical Approach
  • use dynamic programming for generating join plans
  • restrict to left-deep join trees

40
Left Deep Join Trees
  • In left-deep join trees, the right-hand-side
    input for each join is a relation, not the result
    of an intermediate join.

41
Dynamic Programming in Optimization
  • To find best join tree for a set of n relations
  • To find best plan for a set S of n relations,
    consider all possible plans of the form S1
    (S S1) where S1 is any non-empty subset of S.
  • Recursively compute costs for joining subsets of
    S to find the cost of each plan. Choose the
    cheapest of the 2n 1 alternatives.
  • When plan for any subset is computed, store it
    and reuse it when it is required again, instead
    of recomputing it
  • Dynamic programming
  • E.g. r1, r2, r3, r4, r5
  • (r1, r2, r3), (r4, r5) 1212144
  • (r1, r2, r3) and (r4, r5) 1212 24

42
Wrap-Up
  • Parsing and Translation
  • Evaluation
  • Individual operations
  • Entire operations
  • Optimization
  • Cost-based
  • Heuristic
Write a Comment
User Comments (0)
About PowerShow.com