Access Path Selection in a RDBMS - PowerPoint PPT Presentation

1 / 20
About This Presentation
Title:

Access Path Selection in a RDBMS

Description:

Estimate a selectivity factor F' for each Boolean factor in the predicate list ... Predicate Selectivity Estimation. attr=value. F=1/ICARD(attr index) if ... – PowerPoint PPT presentation

Number of Views:206
Avg rating:3.0/5.0
Slides: 21
Provided by: one121
Category:

less

Transcript and Presenter's Notes

Title: Access Path Selection in a RDBMS


1
Access Path Selectionin a RDBMS
  • P. Griffiths Selinger et al.
  • Presented by gwyou

2
System R
  • System R 1974-1978
  • DBMS based on relational model
  • IBM San Jose Labs, lots of PhD researchers
  • Gray coming from OS
  • lots of influence in RSS
  • ACM SIGMOD Innovation Award 1992
  • Turing Award 1998
  • User need not know
  • How the tuples are physically stored
  • What access paths are available
  • Optimizer choose optimal strategy to minimize
    total access cost
  • RSS(Storage System) storage/access methods
    (physical)
  • ? storage, access methods, buffer manager,
    lock, log/recovery

3
Query Processing in System R
SELECT list of items to be retrieved FROM the
table(s) referenced WHERE the boolean combination
of predicates ? Query Block
Optimal Access Path Selection
Code Generation
Parsed Query
Parser
High Level Query (SQL Statement)
Execution Query Results
4
Query Optimizer Access Path Selection
  • Access path selection based on expected costs
  • SELECT name
  • FROM employee
  • WHERE job Programmer and city Chicago
  • path 1 job index ? check city
  • path 2 city index ? check job

5
Outline
  • Research Storage System
  • Costs for single relation access paths
  • Access path selection for joins
  • Nested Queries

6
Research Storage System
  • Storage subsystem of System R responsible for
    maintaining
  • Physical storage of relations
  • Access paths of these relations
  • Locking (in a multi-user environment)
  • Logging
  • Recovery
  • Relations are stored in the RSS as a collection
    of tuples whose columns are physically contiguous
  • Tuples are stored on 4 Kbyte - No tuple spans a
    page
  • Pages are organized into logical units called
    segments - Segments may contain one or more
    relations, but no relation may span a segment

7
RSS Scans Access tuples
  • Two RSS Scans
  • Segment Scan
  • examines all pages of the segment
  • returns all tuples of the given relation
  • Index Scan
  • searches on the column of the relation using
    index (e.g., B-tree)
  • Clustered proximity of index corresponds to
    physical proximity
  • ? each data page will be touched only once
  • SARGS (Search arguments)
  • Both scans may optionally take a set of
    predicates called SARGS
  • Eliminate the overhead of making RSI calls
  • RSI User Interface of System R,
    Tuple-oriented interface
  • A SARGable predicate
  • One of the form (or which can be put into the
    form)
  • column comparison-operator value.
  • ex. Name Smith, Salary gt 10000

8
Outline
  • Research Storage System
  • Costs for single relation access paths
  • Access path selection for joins
  • Nested Queries

9
Cost Model
  • Cost page fetches W (RSI Calls)
  • W is a weighting factor
  • Pages fetched (I/O) vs. CPU cost
  • (RSI Calls) of tuples returned by RSS
  • Goal to minimize the cost

10
Statistics for Optimization
  • Each relation T
  • NCARD(T) Cardinality of relation T
  • TCARD(T) of pages in the segment containing
    tuples from T
  • P(T) TCARD(T)/( of non-empty pages in the
    segment)
  • ? Fraction of data pages in the segment that
    holds tuples of T
  • Each index I on relation T
  • ICARD(I) of distinct keys in index I
  • NINDEX(I) of pages in index I
  • ? These statistics are maintained in the System R
    catalogs

11
Query Optimization
  • Estimate a selectivity factor F for each
    Boolean factor in the predicate list using the
    statistics
  • Selectivity factor
  • ? Expected fraction of tuples satisfying the
    predicate

12
Predicate Selectivity Estimation
13
Cost Formulas
COST index pages fetched data pages fetched
w(RSI calls) 1 1 W F(preds)
(NINDX(I)TCARD) WRSICARD F(preds)
(NINDX(I)NCARD) WRSICARD (NINDX(I)
TCARD) WRSICARD TCARD/P WRSICARD
SITUATION Unique index matching an
equal predicate Clustered index I matching
one or more boolean factors Non clustered index
I matching one or more boolean factors Clustered
index I not matching any boolean
factors Segment scan
14
Outline
  • Research Storage System
  • Costs for single relation access paths
  • Access path selection for joins
  • 2-way Join n-way Join
  • Nested Queries

15
Two-way join method
  • Nested Loops

R
S
Tuple
R tuples
16
N-way joins
  • Visualized as a sequence of 2-way joins
  • Given n relations, there are n! ways of joining
  • Search space can be reduced
  • 1. Join of (k1) relation with previous k
    relations is independent of first k join order ?
    using dynamic programming O(2n) subsets of n
    tables
  • 2. Inconsistent Join condition
  • Example (2)
  • Given relations T1, T2, and T3
  • 1. Join predicates T1-T2 ? T1.a T2.b
  • 2. Join predicates T2-T3 ? T2.b T2.c
  • T1-T2-T3 T2-T1-T3 T3-T1-T2
  • T1-T3-T2 T2-T3-T1 T3-T2-T1

X
X
17
Cost formulae for Joins
  • C-outer(path1) cost of scanning the outer
    relation via path 1,
  • C-inner(path2) cost of scanning the inner
    relation
  • ? The costs of the scans computed using cost
    formulas for single relation
  • N (Product of cardinalities of all relations T
    of the join so far) (Product of selectivity
    factors of all applicable predicates)
  • ? cardinality of the outer relation tuples which
    satisfy the applicable predicates
  • C-nested loop join (path1,path2) C-outer(path1)
    N C-inner (path2)

18
Outline
  • Research Storage System
  • Costs for single relation access paths
  • Access path selection for joins
  • Nested Queries

19
Nested queries
SELECT NAME FROM EMPLOYEE WHERE SALARY (
SELECT AVG(SALARY) FROM EMPLOYEE )
? Subquery is evaluated before the main query and
is evaluated only once
SELECT NAME FROM EMPLOYEE X WHERE SALARY gt
(SELECT SALARY FROM EMPLOYEE
WHERE EMPLOYEE_NUMBER X.MANAGER)
  • Correlation Subquery must be re-evaluated for
    each candidate tuple
  • from the referenced query block

20
Q AAny Questions?
Write a Comment
User Comments (0)
About PowerShow.com