Access Path Selection in a RDBMS - PowerPoint PPT Presentation

1 / 43

About This Presentation

Title:

Access Path Selection in a RDBMS

Description:

Parse tree is replaced by executable machine code and its associated data structures. ... tuple is tagged with the identification of the relation to which it ... – PowerPoint PPT presentation

Number of Views:142

Avg rating:3.0/5.0

Slides: 44

Provided by: vishalt

Category:

more less

Transcript and Presenter's Notes

Title: Access Path Selection in a RDBMS

1
Access Path Selection in a RDBMS

Shahram Ghandeharizadeh
Computer Science Department
University of Southern California

2
System R

Grand-daddy of RDBMS
Started in 1975 at IBM San Jose Research Lab.
Won the ACM Software System Award in 1988.
Introduced fundamental database concepts such as
SQL, locking, logging, cost-based query
optimization techniques, etc.

3
Four Phases of SQL Processing

Parsing
Checks for correct SQL syntax,
Computes the list of items to be retrieved, the
table(s) referenced, and boolean combination of
simple predicates.
Optimization
Looks up the tables in the database catalog for
their existence and statistics, and available
access paths.
Computes the execution plan with minimum cost.
Output Execution plan in the Access
Specification Language (ASL).
Code generation
Code generator is a table-driven program which
translates ASL tress into machine language code.
Parse tree is replaced by executable machine code
and its associated data structures. This code
can be stored away in the database for later
execution.
Execution
Executes the machine code by invoking System R
internal storage system (RSS) via the storage
system interface (RSI) to scan each of the
physically stored relations referenced by the
query.

4
Research Storage System (RSS)

Maintains physical storage of relations, access
paths on these relations.
Implements locking and logging.
RSS represents a relation as
A collection of tuples stored in 4KB pages,
Columns of a tuple are physically contiguous,
No tuple spans a page.
Pages are organized into logical units called
segments.
Segments may contain one or more relations.
Each tuple is tagged with the identification of
the relation to which it belongs.
At most one relation per segment.

5
RSS (Cont)

Access tuples using a scan OPEN, NEXT, and
CLOSE. A scan returns a tuple at a time.
Supports two types of scans
Segment scan Find all tuples of a relation.
All non-empty pages of a segment are referenced
only once.
Index scan B-trees

6
Optimizer

Formulates a cost prediction for each access
plan, using the following cost formula
COST Page fetches W (RSI Calls)
W is an adjustable weighting factor between I/O
and CPU.
RSI calls is an approximation for CPU
utilization.
Assumptions
WHERE tree is considered to be in conjunctive
normal form,
Every disjunct is called a boolean factor.

7
Optimizer (Motivation)

Given a query, there are many ways to execute it.
The optimizer must identify the best execution
plan.
Example
SELECT name, title, sal
FROM Emp, Job
WHERE Emp.Job Job.Job
and Title CLERK

8
Optimizer (Motivation)

Example
SELECT name, title, sal
FROM Emp, Job
WHERE Emp.Job Job.Job
and Title CLERK
Decide order to perform the different operators
process Title CLERK followed by the join
Process the join Emp.Job Job.Job followed by
Title CLERK
Decide which index structure to use Segment
scan, clustered index, non-clustered index.
Decide the join algorithm nested-loops versus
merge-scan.
This paper tries to answer all the above
questions!

9
How?

Enumerating the different execution plans,
Estimate the cost of performing each plan,
Pick the cheapest plan.
What is definition of cost?

10
How?

Enumerating the different execution plans,
Estimate the cost of performing each plan,
Pick the cheapest plan.
What is definition of cost?
COST Page fetches W (RSI Calls)

11
Conjunctive Normal Form

A formula is in conjunctive normal form if it is
a conjunction of clauses
A AND B
A AND (B OR C)
(A OR B) AND (D OR E)
Is (B OR C) in CNF?

12
Conjunctive Normal Form

A formula is in conjunctive normal form if it is
a conjunction of clauses
A AND B
A AND (B OR C)
(A OR B) AND (D OR E)
Is (B OR C) in CNF?
Fix it by carrying the negation inside
B AND C

13
Conjunctive Normal Form

A formula is in conjunctive normal form if it is
a conjunction of clauses
A AND B
A AND (B OR C)
(A OR B) AND (D OR E)
How about (A AND B) OR C?

14
Conjunctive Normal Form

A formula is in conjunctive normal form if it is
a conjunction of clauses
A AND B
A AND (B OR C)
(A OR B) AND (D OR E)
How about (A AND B) OR C?
Transform it to (A OR C) AND (B OR C)

15
CNF

Why?
Every tuple returned to the user must satisfy
every boolean factor.
If a tuple fails a boolean factor, discard it
from farther consideration.

16
Database Catalog

System R maintains statistics for each relation
T
NCARD(T), number of records in T
TCARD(T), number of pages in the segment that
holds tuples of T
P(T), fraction of data pages in the segment that
hold tuples of relation T
P(T) TCARD(T) / ( of non-empty pages in the
segment)
For each index I on relation T,
ICARD(I), number of distinct keys in index I.
NINDX(I), number of pages in index I.

17
Maintenance of Statistics
18
Selectivity Factor (F)

Corresponds to the expected fraction of tuples
which will satisfy the predicate.
Column value
F 1 / ICARD(column index) with an index,
assuming an even distribution of tuples among the
index key values.
F 1 / 10 otherwise.

19
Clustered Index

Assume a student table Student(name, age, gpa,
major)
t(Student) 16
P(Student) 4

Chris, 22, 3.9, CS
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chad, 28, 2.3, LS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Leila, 20, 3.5, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Shideh, 16, 4, CS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
20
Number of Records per GPA
21
ESTIMATING NUMBER OF RESULTING RECORDS

For exact match selection predicates assume a
uniform distribution of records across the number
of unique values. E.g., the selection predicate
is gpa 3.3
For range selection predicates assume a uniform
distribution of records across the range of
available values defined by min and max. In this
case, one must think about the interval. E.g.,
gpa gt 3.5

22
Selectivity Factor (F)

Column gt value
F (high key value value) / (high key value
low key value) as long as the column is an
arithmetic type and value is known at access path
selection time.
F 1/3 otherwise (column is not arithmetic)

23
Selectivity Factor (F)

Column lt value
?

24
Selectivity Factor (F)

Column lt value
F (value - low key value) / (high key value
low key value) as long as the column is an
arithmetic type and value is known at access path
selection time.
F 1/3 otherwise (column is not arithmetic)

25
Selectivity Factor (F)

Value1 lt Column lt Value2
?

26
Selectivity Factor (F)

Value1 lt Column lt Value2
F (Value2 Value1) / (high key value low key
value) as long as the column is arithmetic
F ¼ otherwise

27
Selectivity Factor (F)

Column in (list of values)
Join predicate, Column 1 Column 2
Disjunctive predicate

28
Selectivity Factor (F)

Conjunctive predicate
Negation

29
Interesting order

A query blocks GROUP BY or ORDER BY clauses may
correspond to the order of records in an access
path. This tuple order is an interesting order.
Example query

30
Interesting order

A query blocks GROUP BY or ORDER BY clauses may
correspond to the order of records in an access
path. This tuple order is an interesting order.
Example query
Student(name, age, gpa, major) with a B-tree on
the gpa attribute
SELECT name
FROM Student
WHERE gpa lt 3.0
ORDER BY gpa

SELECT gpa, count() FROM Student WHERE gpa lt
3.0 GROUP BY gpa
31
B-Tree

A B-tree on the gpa attribute

3.6
(3.7, (3, 1))
(3.9, (4,1))
(2.3, (1, 1))
(3, (2,1))
(3.8, (3,2))
(3.9, (4,2))
(2.5, (1,2))
(3.1, (2,2))
(3.8, (3,3))
(4, (4,3))
(2.8, (1,3))
(3.2, (2,3)
(3.8, (3,4))
(2.8, (1,4))
(4, (4,4))
(3.5, (2,4))
Bob, 21, 3.7, CS
Mary, 24, 3, ECE
Chris, 22, 3.9, CS
Chad, 28, 2.3, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
James, 24, 3.1, ME
Chang, 18, 2.5, CS
Kane, 19, 3.8, ME
Lam, 22, 2.8, ME
Louis, 32, 4, LS
Tom, 20, 3.2, EE
Martha, 29, 3.8, CS
Pat, 19, 2.8, EE
Leila, 20, 3.5, LS
Shideh, 16, 4, CS
32
Single Relation Access Paths

Single relation access paths are simple selects
with ORDER BY and GROUP BY clauses
SELECT name
FROM Student
WHERE age lt 20
Without an index, must perform a segment scan,
what is the cost?
TCARD / P W RSISCAN
TCARD(T), number of pages in the segment that
holds tuples of T
P(T), fraction of data pages in the segment that
hold tuples of relation T
P(T) TCARD(T) / ( of non-empty pages in the
segment)
Why?

33
Single Relation Access Paths

Single relation access paths are simple selects
with ORDER BY and GROUP BY clauses
SELECT name
FROM Student
WHERE age lt 20
Without an index, must perform a segment scan,
what is the cost?
TCARD / P W RSISCAN
TCARD(T), number of pages in the segment that
holds tuples of T
P(T), fraction of data pages in the segment that
hold tuples of relation T
P(T) TCARD(T) / ( of non-empty pages in the
segment)
Tuples of Student might be inter-mixed with
professors. Example the student table with
TCARD 100 pages and P(T) 0.75. Note that
P(T) 1 when the student table is not intermixed
with another table.

34
Single Relation Access Paths

Cost of scanning leaf pages and data pages

Cost of scanning leaf pages and data pages
containing the qualifying records

36
Non-Clustered B-Tree

A random I/O for every qualifying record

3.6
(3.7, (1, 1))
(3.9, (4,1))
(2.3, (4, 2))
(3, (1,2))
(3.8, (3,2))
(3.9, (2,4))
(2.5, (2,3))
(3.1, (3,3))
(3.8, (2,1))
(4, (3,1))
(2.8, (2,2))
(3.2, (1,3)
(3.8, (1,4))
(2.8, (3,4))
(4, (4,4))
(3.5, (4,3))
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chris, 22, 3.9, CS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
Shideh, 16, 4, CS
37
Non-Clustered B-Tree

A random I/O for every qualifying record

38
R EQUALITY JOIN S R.A S.A

Two algorithms for performing the join operator
nested loops and merge-scan.
Tuple nested loops
for each tuple r in R do
for each tuple s in S do
if r.As.A then output r,s in
the result relation
end-for
end-for
Estimated cost of tuple nested loops
TCARD(R)/P(R) NCARD(R) TCARD(S)/P(S)

NCARD(R)
TCARD(S)/P(S)
39
EQUALITY JOIN (Cont)

Merge-scan
Interesting order on R.A (sorted)
Interesting order on S.A (sorted)
Scan R and S in parallel, merging tuples with
matching A values
Estimated cost of merge scan NINDX(IR)
NINDX(IS)

40
N-Way Join

N-Way joins as a sequence of 2-way joins.
Utilize pipelining whenever appropriate
The ordering of the joins is important. Consider
all ordering such that
Join predicates relate the two participating
tables together do not consider cartesian
products. For example if the join clause is (R.A
S.A and R.B T.B) then it would be a mistake
to use the following clause (S Cartesian product
T) and R.A ST.A and R.B ST.B
Delay computation of cartesian products as much
as possible.
Consider interesting orders in order to use
merge-scan whenever possible.

41
Search Space