Access Path Selection in a RDBMS - PowerPoint PPT Presentation

1 / 43
About This Presentation
Title:

Access Path Selection in a RDBMS

Description:

Parse tree is replaced by executable machine code and its associated data structures. ... tuple is tagged with the identification of the relation to which it ... – PowerPoint PPT presentation

Number of Views:142
Avg rating:3.0/5.0
Slides: 44
Provided by: vishalt
Category:

less

Transcript and Presenter's Notes

Title: Access Path Selection in a RDBMS


1
Access Path Selection in a RDBMS
  • Shahram Ghandeharizadeh
  • Computer Science Department
  • University of Southern California

2
System R
  • Grand-daddy of RDBMS
  • Started in 1975 at IBM San Jose Research Lab.
  • Won the ACM Software System Award in 1988.
  • Introduced fundamental database concepts such as
    SQL, locking, logging, cost-based query
    optimization techniques, etc.

3
Four Phases of SQL Processing
  • Parsing
  • Checks for correct SQL syntax,
  • Computes the list of items to be retrieved, the
    table(s) referenced, and boolean combination of
    simple predicates.
  • Optimization
  • Looks up the tables in the database catalog for
    their existence and statistics, and available
    access paths.
  • Computes the execution plan with minimum cost.
  • Output Execution plan in the Access
    Specification Language (ASL).
  • Code generation
  • Code generator is a table-driven program which
    translates ASL tress into machine language code.
  • Parse tree is replaced by executable machine code
    and its associated data structures. This code
    can be stored away in the database for later
    execution.
  • Execution
  • Executes the machine code by invoking System R
    internal storage system (RSS) via the storage
    system interface (RSI) to scan each of the
    physically stored relations referenced by the
    query.

4
Research Storage System (RSS)
  • Maintains physical storage of relations, access
    paths on these relations.
  • Implements locking and logging.
  • RSS represents a relation as
  • A collection of tuples stored in 4KB pages,
  • Columns of a tuple are physically contiguous,
  • No tuple spans a page.
  • Pages are organized into logical units called
    segments.
  • Segments may contain one or more relations.
  • Each tuple is tagged with the identification of
    the relation to which it belongs.
  • At most one relation per segment.

5
RSS (Cont)
  • Access tuples using a scan OPEN, NEXT, and
    CLOSE. A scan returns a tuple at a time.
  • Supports two types of scans
  • Segment scan Find all tuples of a relation.
  • All non-empty pages of a segment are referenced
    only once.
  • Index scan B-trees

6
Optimizer
  • Formulates a cost prediction for each access
    plan, using the following cost formula
  • COST Page fetches W (RSI Calls)
  • W is an adjustable weighting factor between I/O
    and CPU.
  • RSI calls is an approximation for CPU
    utilization.
  • Assumptions
  • WHERE tree is considered to be in conjunctive
    normal form,
  • Every disjunct is called a boolean factor.

7
Optimizer (Motivation)
  • Given a query, there are many ways to execute it.
    The optimizer must identify the best execution
    plan.
  • Example
  • SELECT name, title, sal
  • FROM Emp, Job
  • WHERE Emp.Job Job.Job
  • and Title CLERK

8
Optimizer (Motivation)
  • Example
  • SELECT name, title, sal
  • FROM Emp, Job
  • WHERE Emp.Job Job.Job
  • and Title CLERK
  • Decide order to perform the different operators
  • process Title CLERK followed by the join
  • Process the join Emp.Job Job.Job followed by
    Title CLERK
  • Decide which index structure to use Segment
    scan, clustered index, non-clustered index.
  • Decide the join algorithm nested-loops versus
    merge-scan.
  • This paper tries to answer all the above
    questions!

9
How?
  • Enumerating the different execution plans,
  • Estimate the cost of performing each plan,
  • Pick the cheapest plan.
  • What is definition of cost?

10
How?
  • Enumerating the different execution plans,
  • Estimate the cost of performing each plan,
  • Pick the cheapest plan.
  • What is definition of cost?
  • COST Page fetches W (RSI Calls)

11
Conjunctive Normal Form
  • A formula is in conjunctive normal form if it is
    a conjunction of clauses
  • A AND B
  • A AND (B OR C)
  • (A OR B) AND (D OR E)
  • Is (B OR C) in CNF?

12
Conjunctive Normal Form
  • A formula is in conjunctive normal form if it is
    a conjunction of clauses
  • A AND B
  • A AND (B OR C)
  • (A OR B) AND (D OR E)
  • Is (B OR C) in CNF?
  • Fix it by carrying the negation inside
  • B AND C

13
Conjunctive Normal Form
  • A formula is in conjunctive normal form if it is
    a conjunction of clauses
  • A AND B
  • A AND (B OR C)
  • (A OR B) AND (D OR E)
  • How about (A AND B) OR C?

14
Conjunctive Normal Form
  • A formula is in conjunctive normal form if it is
    a conjunction of clauses
  • A AND B
  • A AND (B OR C)
  • (A OR B) AND (D OR E)
  • How about (A AND B) OR C?
  • Transform it to (A OR C) AND (B OR C)

15
CNF
  • Why?
  • Every tuple returned to the user must satisfy
    every boolean factor.
  • If a tuple fails a boolean factor, discard it
    from farther consideration.

16
Database Catalog
  • System R maintains statistics for each relation
    T
  • NCARD(T), number of records in T
  • TCARD(T), number of pages in the segment that
    holds tuples of T
  • P(T), fraction of data pages in the segment that
    hold tuples of relation T
  • P(T) TCARD(T) / ( of non-empty pages in the
    segment)
  • For each index I on relation T,
  • ICARD(I), number of distinct keys in index I.
  • NINDX(I), number of pages in index I.

17
Maintenance of Statistics
18
Selectivity Factor (F)
  • Corresponds to the expected fraction of tuples
    which will satisfy the predicate.
  • Column value
  • F 1 / ICARD(column index) with an index,
    assuming an even distribution of tuples among the
    index key values.
  • F 1 / 10 otherwise.

19
Clustered Index
  • Assume a student table Student(name, age, gpa,
    major)
  • t(Student) 16
  • P(Student) 4

Chris, 22, 3.9, CS
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chad, 28, 2.3, LS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Leila, 20, 3.5, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Shideh, 16, 4, CS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
20
Number of Records per GPA
21
ESTIMATING NUMBER OF RESULTING RECORDS
  • For exact match selection predicates assume a
    uniform distribution of records across the number
    of unique values. E.g., the selection predicate
    is gpa 3.3
  • For range selection predicates assume a uniform
    distribution of records across the range of
    available values defined by min and max. In this
    case, one must think about the interval. E.g.,
    gpa gt 3.5

22
Selectivity Factor (F)
  • Column gt value
  • F (high key value value) / (high key value
    low key value) as long as the column is an
    arithmetic type and value is known at access path
    selection time.
  • F 1/3 otherwise (column is not arithmetic)

23
Selectivity Factor (F)
  • Column lt value
  • ?

24
Selectivity Factor (F)
  • Column lt value
  • F (value - low key value) / (high key value
    low key value) as long as the column is an
    arithmetic type and value is known at access path
    selection time.
  • F 1/3 otherwise (column is not arithmetic)

25
Selectivity Factor (F)
  • Value1 lt Column lt Value2
  • ?

26
Selectivity Factor (F)
  • Value1 lt Column lt Value2
  • F (Value2 Value1) / (high key value low key
    value) as long as the column is arithmetic
  • F ¼ otherwise

27
Selectivity Factor (F)
  • Column in (list of values)
  • Join predicate, Column 1 Column 2
  • Disjunctive predicate

28
Selectivity Factor (F)
  • Conjunctive predicate
  • Negation

29
Interesting order
  • A query blocks GROUP BY or ORDER BY clauses may
    correspond to the order of records in an access
    path. This tuple order is an interesting order.
  • Example query

30
Interesting order
  • A query blocks GROUP BY or ORDER BY clauses may
    correspond to the order of records in an access
    path. This tuple order is an interesting order.
  • Example query
  • Student(name, age, gpa, major) with a B-tree on
    the gpa attribute
  • SELECT name
  • FROM Student
  • WHERE gpa lt 3.0
  • ORDER BY gpa

SELECT gpa, count() FROM Student WHERE gpa lt
3.0 GROUP BY gpa
31
B-Tree
  • A B-tree on the gpa attribute

3.6
(3.7, (3, 1))
(3.9, (4,1))
(2.3, (1, 1))
(3, (2,1))
(3.8, (3,2))
(3.9, (4,2))
(2.5, (1,2))
(3.1, (2,2))
(3.8, (3,3))
(4, (4,3))
(2.8, (1,3))
(3.2, (2,3)
(3.8, (3,4))
(2.8, (1,4))
(4, (4,4))
(3.5, (2,4))
Bob, 21, 3.7, CS
Mary, 24, 3, ECE
Chris, 22, 3.9, CS
Chad, 28, 2.3, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
James, 24, 3.1, ME
Chang, 18, 2.5, CS
Kane, 19, 3.8, ME
Lam, 22, 2.8, ME
Louis, 32, 4, LS
Tom, 20, 3.2, EE
Martha, 29, 3.8, CS
Pat, 19, 2.8, EE
Leila, 20, 3.5, LS
Shideh, 16, 4, CS
32
Single Relation Access Paths
  • Single relation access paths are simple selects
    with ORDER BY and GROUP BY clauses
  • SELECT name
  • FROM Student
  • WHERE age lt 20
  • Without an index, must perform a segment scan,
    what is the cost?
  • TCARD / P W RSISCAN
  • TCARD(T), number of pages in the segment that
    holds tuples of T
  • P(T), fraction of data pages in the segment that
    hold tuples of relation T
  • P(T) TCARD(T) / ( of non-empty pages in the
    segment)
  • Why?

33
Single Relation Access Paths
  • Single relation access paths are simple selects
    with ORDER BY and GROUP BY clauses
  • SELECT name
  • FROM Student
  • WHERE age lt 20
  • Without an index, must perform a segment scan,
    what is the cost?
  • TCARD / P W RSISCAN
  • TCARD(T), number of pages in the segment that
    holds tuples of T
  • P(T), fraction of data pages in the segment that
    hold tuples of relation T
  • P(T) TCARD(T) / ( of non-empty pages in the
    segment)
  • Tuples of Student might be inter-mixed with
    professors. Example the student table with
    TCARD 100 pages and P(T) 0.75. Note that
    P(T) 1 when the student table is not intermixed
    with another table.

34
Single Relation Access Paths
  • Cost of scanning leaf pages and data pages

3.6
(3.7, (3, 1))
(3.9, (4,1))
(2.3, (1, 1))
(3, (2,1))
(3.8, (3,2))
(3.9, (4,2))
(2.5, (1,2))
(3.1, (2,2))
(3.8, (3,3))
(4, (4,3))
(2.8, (1,3))
(3.2, (2,3)
(3.8, (3,4))
(2.8, (1,4))
(4, (4,4))
(3.5, (2,4))
Bob, 21, 3.7, CS
Mary, 24, 3, ECE
Chris, 22, 3.9, CS
Chad, 28, 2.3, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
James, 24, 3.1, ME
Chang, 18, 2.5, CS
Kane, 19, 3.8, ME
Lam, 22, 2.8, ME
Louis, 32, 4, LS
Tom, 20, 3.2, EE
Martha, 29, 3.8, CS
Pat, 19, 2.8, EE
Leila, 20, 3.5, LS
Shideh, 16, 4, CS
35
Single Relation Access Paths
  • Cost of scanning leaf pages and data pages
    containing the qualifying records

36
Non-Clustered B-Tree
  • A random I/O for every qualifying record

3.6
(3.7, (1, 1))
(3.9, (4,1))
(2.3, (4, 2))
(3, (1,2))
(3.8, (3,2))
(3.9, (2,4))
(2.5, (2,3))
(3.1, (3,3))
(3.8, (2,1))
(4, (3,1))
(2.8, (2,2))
(3.2, (1,3)
(3.8, (1,4))
(2.8, (3,4))
(4, (4,4))
(3.5, (4,3))
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chris, 22, 3.9, CS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
Shideh, 16, 4, CS
37
Non-Clustered B-Tree
  • A random I/O for every qualifying record

38
R EQUALITY JOIN S R.A S.A
  • Two algorithms for performing the join operator
    nested loops and merge-scan.
  • Tuple nested loops
  • for each tuple r in R do
  • for each tuple s in S do
  • if r.As.A then output r,s in
    the result relation
  • end-for
  • end-for
  • Estimated cost of tuple nested loops
  • TCARD(R)/P(R) NCARD(R) TCARD(S)/P(S)

NCARD(R)
TCARD(S)/P(S)
39
EQUALITY JOIN (Cont)
  • Merge-scan
  • Interesting order on R.A (sorted)
  • Interesting order on S.A (sorted)
  • Scan R and S in parallel, merging tuples with
    matching A values
  • Estimated cost of merge scan NINDX(IR)
    NINDX(IS)

40
N-Way Join
  • N-Way joins as a sequence of 2-way joins.
  • Utilize pipelining whenever appropriate
  • The ordering of the joins is important. Consider
    all ordering such that
  • Join predicates relate the two participating
    tables together do not consider cartesian
    products. For example if the join clause is (R.A
    S.A and R.B T.B) then it would be a mistake
    to use the following clause (S Cartesian product
    T) and R.A ST.A and R.B ST.B
  • Delay computation of cartesian products as much
    as possible.
  • Consider interesting orders in order to use
    merge-scan whenever possible.

41
Search Space
  • Rather large search space for expressions joining
    several tables
  • Heuristics prune the search space

42
Nested Queries
  • Correlation subquery A subquery with a
    reference to a value obtained from a candidate
    tuple of a higher level query block.

43
Non-Correlation sub-queries
  • Evaluate the inner query once and use its results
    to process the outer query.
Write a Comment
User Comments (0)
About PowerShow.com