Chapters 17 and 18: System Catalog and Query Optimization

About This Presentation

Title:

Chapters 17 and 18: System Catalog and Query Optimization

Description:

Platter. Track. Sector. Cylinder. Read/Write Heads. Chaps17&18-8. CSE. 255. Track. Sector ... Platters. R/W. Heads. Note: Parallel Read/Write Drives. Activate ... – PowerPoint PPT presentation

Number of Views:277

Avg rating:3.0/5.0

Slides: 117

Provided by: stevenad

Category:

more less

Transcript and Presenter's Notes

Title: Chapters 17 and 18: System Catalog and Query Optimization

1
Chapters 17 and 18 SystemCatalog and Query
Optimization
Prof. Steven A. Demurjian, Sr. Computer Science
Engineering Department The University of
Connecticut 191 Auditorium Road, Box
U-155 Storrs, CT 06269-3155
steve_at_engr.uconn.edu http//www.engr.uconn.edu/st
eve (860) 486 - 4818

A portion of these slides are being used with the
permission of Dr. Ling Lui, Associate Professor,
College of Computing, Georgia Tech.
Other slides have been adapted from the AWL web
site for the textbook.
Remaining slides represent new material.

2
Overview of Material

Key Background Topics
What are Typical Database Processing Actions?
Disk Drives and Disk Storage
Database Processing/Architectures
Motivating Query Optimization
Query Processing
Chapter 17 - System Catalog
What is it?
How is it Used?
Chapter 18 - Query Optimization in RDBMS
High-level Query Optimization (Algebraic)
Low-level Query Optimization (Cost-based)

3
Typical Database Processing
Parsed and Optimized User Trans.
Concurrency Control
Pre-Processing - Parser/Lexical - Optimizer/Views
Lock Request Response
Lock Request
User Transaction
Low-Level Processing - Enqueue Trans. - Request
Locks - Issue I/Os - Process Returned Data -
Integrity Checks - Security Checks - Logging for
Recovery - Release Locks - Dequeue Trans.
High-Level Processing - Enqueue Trans. - Request
Locks - Release Locks -Dequeue Trans.
Errors
Response to User
I/O Request
Post-Processing - Collection of Results -
Aggregation Operations - Security Checks
Errors Results
Results
Disk I/O
Recovery
4
What are the Processing Issues for DBs?

Database Applications of Today and Tomorrow
Require High Volumes of Information!
Increase of Information Still Requires High
Performance!
Throughput and Response Time
Where's the Bottleneck in DBS?
CPU ??
Main Memory Size/Speed ??
Virtual Memory Limitations ??
Communications Bus ??
I/O Channel ??

5
90-10 Rule for Database Processing

Load (Transaction per second) vs. Performance
(Response Time of Transactions)
Processing of Large Amounts of Raw Data
Addressed in Secondary Storage
Staged to Main Memory
Identifying Relevant Data
Large Amounts of Raw Data Discarded
Focus on Data Most Likely to Contain Answers
Possible Loss of CPU and Main Memory Cycles
This is Double Jeopardy!
Load of DBS Must be Reduced
Performance of DBS Degrades

6
90-10 Rule for Conventional DBS

Only 10 of Relevant Data has Answers
Note Naive Approach to Database Searching Often
Occurs (Little or No Indexing in Practice!)

Only 10 of Raw Data is Relevant

Application Programs
Operating System
Database Functions
On-Line I/O
Disk I/O
7
Randomly Accessed Storage Devices

Popular Media (Hard Drives, CDs, DVDs, etc.)
Access to Information in Any Order
Sequential Access Not Typically Supported or
Needed, Since Files Not Stored Sequentially
Recall, Disk Defragmentation on PC Platform
Block-Oriented Utilization of Device
Block Access to Optimize Transfer
Block Size is Device/Controller Dependent
Linear/Non-Linear Byte Orders with Blocks
Key Concepts
Platter
Track
Sector
Cylinder
Read/Write Heads

8
Rotating Storage
Track
R/W Heads
Platters
Sector
Cylinder
Top View of a Surface
Note Parallel Read/Write Drives Activate All
Heads Simultaneously
9
Disk Drive Components
10
Disk Characteristics and Access

Transfer Time Time to Copy Bits From Disk
Surface to Primary Memory
Disk Latency Time
Rotational Delay Waiting for Proper Sector to
Rotate Under R/W Head
Rotate to Next Sector to Process Next Request
Disk Seek Time
Delay While R/W Head Moves to the Destination
Track/Cylinder
Move Head In/Out to Seek Next Track/Cylinder
Access Seek (In/Out) Latency (Around)
Transfer (Bytes)
For DBMS - Key is Moving Data To/From Disk ASAP
w.r.t. Performance and Response Time
Improve on 90-10 via Processing/Optimization

11
Historical DB Architecture - Mainframe
12
Client/Server DBS Architecture
13
Mixed Architecture
14
Three and Four Tier Architectures
From http//java.sun.com/javaone/javaone98/sessio
ns/T400/index.html
15
What is MBDS?

MBDS is Multi-Process, Multi-Computer, Parallel
Database System
MBDS Composed of
Host for Issuing User Requests
Controller to Interact with Host (and User)
One or More Backend Database Processors
Goals of MBDS
Suppose Request Takes 4 Minutes with One Backend
Improve Response Time by Increasing Backends
Two Backends - Request 2 Minutes
Four Backends - Request 1 Minutes

16
What is MBDS Architecture?
Database Blocks are Distributed Across All
Backends
Backend (BE) DB Processors are Replicated
Host User
Database ControllerSends Same Query in Parallel
to all BEs
Database Controller
BEs work in Parallel on Each Query and
Communicate for Join
Results are Sent to and Collected by the DB
Controller - then to the User
17
What are MBDS Processes?
Request Preparation
Post Processing
Database Controller
Get Msg.
Put Msg.
Get Msg.
Put Msg.
Directory Management
Record Processing
Backend Database Processor
Concurrency Control
Disk I/O
18
What are MBDS Messages?
No. Type SRC
DST 1 New Request
Host ReqP 2 Results of Request
PoPr Host 3 Number of Reqs in Transaction
ReqP PoPr 4 Aggregate Operators (Sum,
etc.) ReqP PoPr 6 Parsed Request to
Backends ReqP DM 12 Backend
Aggregate Operator Results RecP PoPr 15 Ids
for Accessing Database Indexes DM DMs 16
Request and Disk Addresses DM
RecP 21 Ids for Accessing Database Records
DM CC 22 Locks Obtained Okay to Execute
CC RecP 23 Request ID of Finished Request
RecP CC
19
Sample Processing of Retrieve Request
B3
A1
K12
Request Preparation
Post Processing
C4
K12
D6
Get Msg.
Put Msg.
F15 From Other Backend
E15 To Backend(s)
Get Msg.
Put Msg.
K12
Record Processing
D6,F15
E15
I16
Directory Management
J23
H22
Concurrency Control
G21
Disk I/O
20
What are Synchronization Issues in MBDS?

Coordination of Synchronous Behavior
Within Controller and Backend to Allow Multiple
Active Requests within
Each Process
Requests at Different Stages in Different
Processes
Between Controller and Backends to Allow
A Request to be Processed by All Backends
A Request to be Processed by One Backend
Among Multiple Backends to Allow a Backend
to Synchronize its Work on one Request with Other
Backends
to Forward Results to Another Backend

21
Why is Query Optimization Needed?

Data Volume in any Type of Join or Cartesian
Product has the Potential to be Very Large!
Consider R(A, B) r1, r2 , ..., rn
Consider S(C, D) s1, s2 , ..., sm
R x S r1 s1, r1 s2, r1 s3, r1 s4, r2 s1,
r2 s2, r2 s3, r2 s4, which contains n x m
tuples!
What is the Issue?
If n is 10,000 and m is 20,000 then
Cartesian Product has 200,000,000 Tuples
Join must Perform 200,000,000 Comparisons

22
Why is Query Optimization Needed?

n/m - Number of Tuples of R/S Respectively
bR / bS - Number of Tuples/Block of Memory
Assume that K Blocks Fit into Primary Memory

n / bR (m / bS ) Number of Blocks for R/S
1 Block of R
(m / bS )/(K-1)
Number of Times that K-1 Memory Chunk Filled by S
1
2
(n / bR )(m / bS )/(K-1)
K-1 Blocks of S
Which if Filled for Each Block of R
3

(n / bR ) (n / bR )(m / bS )/(K-1)
Total Block Reads Must also Read Blocks of R
K-1
23
Why is Query Optimization Needed?
If n m 10,000 and bR bS 5, and K
100(10,000/5)(10,000/5)(10,000/5)/99 42,400
Blocks to Read At 20 Blocks/Second - 35 Minutes!
24
Observation

Cartesian Product Yields Unwanted Data
SELECT R.AFROM R, SWHERE R.B S.C and S.C 99
In Relational Algebra?A (? BC and D99 (R x
S)) ?A (? BC (R x ? D99 (S) )) ?A (R x BC
(? D99 (S)))
Has Performance Improved? How?

25
Evaluation

Cartesian Product for SELECT - 40,000 Blocks
SELECT R.AFROM R, SWHERE R.B S.C and S.C 99
Relational Algebra with Equijoin?A (R x BC (?
D99 (S)))The ? D99 (S) Limits the Size of S
DramaticallyAs a Result, the Equijoin of R and ?
D99 (S) Would Likely Reduce the Total Blocks
Required to 4,000
Thus, a Smart Query Execution Strategy Can
Dramatically Reduce the Amount of I/Os

26
Query Optimization Goal

Limit Costly Join Operation by Reducing Data to
be Scanned or that Participates in the Join
Query Optimization is Strategy to Achieve Goal
While Improving Selection and Projection can
Help, the Main Objective is Join
In Worst Case - Cartesian Product
Can Improve by Introducing Indices on the Join
Attributes (R.B and S.C) to Limit Product
Can Further Improve by Sorting on the Join
Attributes (R.B and S.C)
This Reduces Block Accesses by Limiting the
Number of Blocks that Must be Examined in a Join
If Bs Values Range from 0 to 100 and C from 50
to 150, only need to Compare from 50 to 100

27
Query Processing

Internal Data Structure
Memory Hierarchy
Main Memory Secondary Memory
Information Must be Staged from Secondary to
Primary Memory for Database Operation
Sequential Search
Brute force Approach
Direct Access (Indexed Search)
Hash, Inverted Index file, Binary Search Tree,
B-tree, B-tree
Improves Selection by Focusing on Subset of
Tuples that are Involved in the Answer and
Equijoin by Not Having to Compare All Blocks in
Two Relations

28
Algorithms for Database Query Operators

Largely Fall into Three Classes
Sorting-Based Methods
Hash-Based Methods
Index-Based Methods
Such Algorithms are Divided into Three Degrees of
Difficulty and Cost (Limiting Factor is Size of
Data)
One Pass Algorithms
Where Data is Only Read Once From Disk
Two-pass Algorithms
Data is Read from Disk, Processed in Some Way,
Written Back to Disk, Read Again for Processing,
etc.
Multi-pass Algorithms
Where 3 or More Passes Are Required, i.e.,
Recursive Generalization of the Two-pass
Algorithms

29
Database Join and Sort are External

Suppose that your DBS has 1,000 1K Blocks of
Memory Available for Performing Operations (e.g.,
Select, Project, Join, Union, Aggregation, etc.)
Suppose Sort R by R.B
R Contains 5000 Blocks
In order to Perform a Sort/Merge - You Must Use
External Algorithm since all 5000 Blocks Can Fit
Into Memory at the Same Time
Suppose Join R (500 Blocks) and S (800 Blocks)
Again - their Total Exceeds Memory - Hence you
Must Take an Approach that Compares One Block of
R with All Blocks of S, etc. (Slides 22,23)

30
Database Join and Sort are External

Whats True about Todays DBMS Like Oracle?
Oracle Recommends 2 Gigabytes of Primary Memory
That 2 Gigabytes Must be Shared by
Operating System
Other Applications Running on Same Server (Web
Server, etc.)
Database Management Software
Even if there was 1.5 Gigabytes Available, Modern
DBs can Exceed that size Very Easily
Moreover,
Cartesian Product Could Exceed Available Mem.
Join Could Require External Approach Since All
Tables Involved in Join Cant fit in 1.5
Gigabytes
External Sorting/Block Oriented Processing is Norm

31
Algorithms for DB Query Operators

Relational Algebra Operators can be Classified
into Three Groups
Tuple-at-a-time Unary Operators
Selection and Projection
No Need to Bring Entire Relation into Memory at
One Time
Full-Relation Unary Operators
Duplicate Elimination and Grouping
Requires Seeing All or Most of the Tuples in
Memory at Once
Full-Relation Binary Operators
Set and Bag Versions of Union, Intersection, and
Difference, Joins, and Cartesian Products
Requires Seeing the Tuples of Both Relations in
Memory

32
Query Access
SELECT EMP.ENAME FROM EMP, WORKS, PROJ WHERE
(EMP.ENO WORKS.ENO) AND (WORKS.PNO PROJ.PNO)
AND (PROJ.PNAME CAD/CAM)
33
Database Access
SELECT EMP.ENAME FROM EMP, WORKS, PROJ WHERE
(EMP.ENO WORKS.ENO) AND (WORKS.PNO PROJ.PNO)
AND (PROJ.PNAME CAD/CAM)
9
10
2
7
1
8
Schema
3
6
4
5
Physical/Internal Data Schema
Operating System
34
Database Access

User program A sends to DBMS an invoke command to
retrieve a (set of) record
DBMS analyzes the external schema of the user
program A and finds the database description of
the record.
DBMS checks with the schema to get the data types
and location information of record
DBMS checks with the physical schema to find out
which device the record is in and what access
methods can be used.
According to 4, DBMS sends OS a read command to
execute the search.
OS issues the page invoke command to the
correspond device, and then puts the page
fetched into the system buffer.
DBMS uses the schema and the external schema to
infer the logical structure of the retrieving
record.
DBMS places the relevant data to the UWA, and
provides the status information at the program
invocation exit

35
The System Catalog

Store the Meta Information that Describes Each
Database, Including a Description of
Conceptual Database Schema (Logical Data Model)
Relations, Attributes, Keys, Indexes, Views
Internal Schema
External Schema
Store Information Needed by Specific DBMS Modules
Query Optimization Module
Security and Authorization

36
Metadata - What is it?

System metadata
Where data came from
How data were changed
How data are stored
How data are mapped
Who owns data
Who can access data
Data usage history
Data usage statistics
System metadata are critical in a DBMS

Application metadata
What data are available
Where data are located
What the data mean
How to access the data
Predefined reports
Predefined queries
How current the data are
Application metadata are critical in a database
system

37
Metadata v.s. Data
relations

Meta schema
describes all schemata that can be defined in the
data model
Data Dictionary Schema
contains copy of metaschema schema for format
definitions schema for data about application
data
Data Dictionary Data
schema for application data metadata about
application data
Data
raw formatted application data

rel-name
att-name
dom-name
access-rights
user
relation
operation
relations
rel-name
att-name
dom-name
(u1, supplier, insert)
(u2, supplier, delete)
supplier
s
sname
location
(s1, smith, london)
(s2, jones, boston)
38
Example of Catalog Information
39
Relational DBMS Catalog

All Metadata Stored as Relations
Example of Metadata Tables are

40
EER Diagram for Relational Catalog
41
Metadata in Oracle

Complex Data Dictionary
All Schema Objects (Tables,Views, Indices, )
User, All, and DBA Views

SELECT FROM ALL_CATALOG WHERE
OWNERSMITH
SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH,
NUM_DISTINCT, LOW_VALUE, HIGH_VALUE FROM
USER_TAB_COLUMS WHERE TABLE_NAMEORDERS
42
Metadata in Oracle
SELECT PCT_FREE, INITIAL_EXTENT, NUM_ROWS,
BLOCKS, EMPTY_BLOCKS, AVG_ROW_LENGTH FROM
USER_TABLES WHERE TABLE_NAME ORDERS
SELECT INDEX_NAME, UNIQUENESS, BLEVEL,
LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_K
EY, AVG_DATA_BLOCKS_PER_KEY FROM
USER_INDEXES WHERE TABLE_NAME ORDERS
43
Uses of System Catalog

DDL Compilers
Correct Definition ofRelations and Attributes
DML (Query) Compiler
DML Parser
Guided by the Description of DML Syntax and the
Schema Information in the Catalog, Generates a
Query Tree after Parser
Optimizer
Generates Access Paths that is Relatively Optimal
for Executing a Query/ DML Command, by Accessing
the Database Structure Information (Schemas), and
Mapping High-level SQL Queries Into Low-level
File Access Commands

SELECT EMP.ENAME FROM EMP, WORKS, PROJ WHERE
(EMP.ENO WORKS.ENO) AND (WORKS.PNO PROJ.PNO)
AND (PROJ.PNAME CAD/CAM)
44
Revisit Typical Database Processing
Parsed and Optimized User Trans.
Concurrency Control
Pre-Processing - Parser/Lexical - Optimizer/Views
Lock Request Response
Lock Request
User Transaction
Low-Level Processing - Enqueue Trans. - Request
Locks - Issue I/Os - Process Returned Data -
Integrity Checks - Security Checks - Logging for
Recovery - Release Locks - Dequeue Trans.
High-Level Processing - Enqueue Trans. - Request
Locks - Release Locks -Dequeue Trans.
Errors
Response to User
I/O Request
Post-Processing - Collection of Results -
Aggregation Operations - Security Checks
Errors Results
Results
Disk I/O
Recovery
45
Typical Database Processing

Pre-Processing
Actions Taken Upon Receipt of a Query from User
SQL Query via Query Tool or JDBC Call
Compilation of DB Query
Check Syntax, Optimize, Develop Run-Time Strategy
(Similar to PL Compilation)
Query is Translated to DB Transaction
A Transaction Contains Multiple DB Operations
Transaction has Explicit Order of Operations
Database Transaction Must Succeed or Fail
There is no Intermediate State
Completely Executed and Committed or Aborts at
any Point and Undone
New State or Previous State of DB

46
Typical Database Processing

High-Level Processing
Enqueue Transaction from Pre-Processing
Transaction Must Wait for Earlier Transactions
Remember - Shared DB State!
Request Locks from Concurrency Control
All Locks Before Proceeding vs. Locks as Needed
Avoid Deadlock and Livelock
Release Locks
As Use of Data Completes to Increase Availability
What Happens if Failure of Later Step in
Transaction
Dequeue Transaction
Completes Transaction Processing
Return Result to Post-Processing

47
Typical Database Processing

Low-Level Processing
Enqueue Transaction - Do Actual DB Operations
Request Locks - Lower Granularity Level
Issue I/Os - Based on Operations to Access
Correct and Relevant DB Records
Process Returned Data - Aggregation, Sorting
Integrity Checks Do I/D/U Satisfy Constraints?
Security Checks Is DB R/I/D/U Allowed?
Logging for Recovery - Commit the Transaction
Release Locks - Available to Others
Dequeue Transaction - Return Results to
High-Level Processing
Note The Multiple Operations of Each DB
Transaction All Must be Successful

48
Typical Database Processing

Post Processing
Collection of Results
May be Passed Portions of Results as they
Complete
For Example, Sorted Blocks of Data that are then
Merged in a Final Step
Aggregation Operations
May be Passed Aggregate Intermediate Results
Sum for Different Departments to be Totaled
Security Checks
Last Step Filtering to Insure Only Allowed Data
is Returned
May Execute Query but Only see Aggregate Result
Send Results to User

49
Typical Database Processing

Concurrency Control
Control Access to Information
Data and Metadata
Prevent Simultaneous Updates
Ensure Database Always Correct and Consistent
Serial Schedule vs. Serializable Transaction
Two Types
Pessimistic - Locking-Based - Assume Collisions
Will Occur - e.g., Peoplesoft Course Registration
Optimistic - Time-Based - Fix Problems After the
Fact - e.g., ATM Machines Example
CC Manages Locks at Different Granularity Levels
(Table, Attribute, View, Tuple, Metadata, etc.)

50
Typical Database Processing

Disk I/O
Performs the Actual Disk I/O for Read/Writes
Block Oriented Activity
Maintain Queue of All I/O Requests
Ordering is Critical
Related to Concurrency Control and Consistency
Single DB Transactions can have Multiple DB
Operations
Disk I/Os for Different Operations at Different
Times
High and Low Level Processing will Determine What
Operations Needed When
Disk I/O - Relatively Dumb

51
Typical Database Processing

Recovery
Tightly Tied to DB Transaction Concept
Transactions Must be
Atomic - Happens or Doesnt
Durable - Once Committed, Results Survive Failure
Consistent - Follows Protocol/Correct DB State
When Failure Occurs, Can we
Recover to a Correct Earlier State
Reconcile all Active Transactions that were
Executing at Failure Time
Involves Logging of Database Actions
Objective High Availability and Reliability

52
Query Optimization

Not Really Optimizing, but Planning to Avoid Bad
Execution Strategies
Models
Heuristics-Based
Apply Transformation Rules According to a General
Strategy
Focus on Relational Algebra that Underlies Each
Query
Improve the Order of Relational Operations
Cost-Based
Minimize a Cost FunctionI/O Cost CPU Cost
Subject to a Set of Constraints

53
Query Processing Methodology
High-level Calculus-based Query
EXTERNAL SCHEMA
Query Preprocessing
LOGICAL SCHEMA
Algebraic Query (a tree structure)
Query Optimization
INTERNAL SCHEMA
Execution Schedule (file access plan)
54
Query Preprocessing

Input Calculus Query on Base Relations
Normalization
Manipulate Query Quantifiers and Qualification
Analysis
Detect and Reject Incorrect Queries
Possible for Only a Subset of Relational Calculus
Simplification
Eliminate Redundant Predicates
Restructuring
Calculus Query implies Algebraic Query
More Than One Translation is Possible
Use Transformation Rules

55
Normalization

Lexical and Syntactic Analysis (Similar to
Compilers)
Check Validity
Check for Attributes and Relations
Type Checking on the Qualification
Put into Normal Form
Conjunctive Normal Form
(p11?p12??p1n) ??????pm1?pm2??pmn)
Disjunctive Normal Form
(p11??p12??p1n) ??????pm1?pm2??pmn)
OR's Mapped into Union
AND's Mapped into Join or Selection

56
Refute Incorrect Queries

Example E(ENAME, ENO), P(JNO,JNAME),
W(ENO,PNO,DUR) SELECT ENAME, JNAME
FROM E, P, W
WHERE DUR gt 27 AND DUR lt 25
Incorrect
Disjoint Components are Useless
Multiple Relations, Missing Joins, may not be
incorrect, but may indicate Cartesian product
Contradictory
Qualification can not be Satisfied by any Tuple
DUR gt 27 AND DUR lt 25

57
Simplification

Why Simplify?
The Simpler the Query, the Less Work there is and
the Better the Performance
How? Use transformation rules
Elimination of Redundancy
Idempotency Rules
p1 ? (p1) false
(p1 ??p2) (p1) ? ?(p2)
p1 ? false p1
Application of Transitivity
Use of Integrity Rules
Example
x gt a and x gt b
DUR gt 27 AND DUR gt 25

58
Restructuring

Convert Relational Calculus to Relational Algebra
Make use of Query Trees
Example
Find the names of employees other than J.
Doe who worked on the CAD/CAM project for either
1 or 2 years.
SELECT ENAME
FROM E, W, P
WHERE E.ENOW.ENO
AND W.JNOP.JNO
AND E.ENAMEltgt"J. Doe"
AND P.JNAME"CAD/CAM"
AND (W.DUR12 OR W.DUR24)

59
Query Optimization Objectives

Improving Performance
Arriving at a Query Plan of Execution
Analyzing the Relational Algebra Query
Replace Costly Operations
Do Selections and Projections Early
Optimization Heuristics for the Relational
Algebra
Performing Selection and Projection Before Join
Combining Several Selections Over a Single
Relation Into One Selection
Find Common Subexpressions
Algebraic Rewriting/transformation Rules
General Transformation Rules for Relational
Algebra (Equivalence-preserving Algebraic
Rewriting Rules)

60
Query Optimization An Example

Why is it important?
SELECT ENAME
FROM E,W
WHERE E.ENO W.ENO
AND W.RESP "Manager"
Strategy 1
?ENAME(?RESP"Manager"?E.ENOW.ENO(E ? W))
Strategy 2
?ENAME( E ENO(?RESP"Manager"(W)))

61
Cost of Alternatives

Assume
card(E) 4,000 card(W)10,000
10 of tuples in W satisfy RESP"Manager"
(selection generates 1,000 tuples)
Execution time Proportional to the Sum of the
Cardinalities of the Temporary Relations
Searching is Done by Sequential Scanning
Strategy 1 Strategy 2
Cartesian prod. 40,000,000 Selection over W
10,000
Search over all 40,000,000 Join(40001000)
4,000,000
80,000,000 4,010,000

62
General Query Optimization Strategy

Perform Selections Early
Yields Smaller Intermediate Results
Direct Impact on Subsequent Join/Cartesian Prod.
Combine Selections with a Prior Cartesian Product
into a Theta or Equi Join
Join is a Cheaper Operation
Combine (Cascade) Selections and Projections
?AB(?B (R)) ? ?AB(R) ?p1 (? p2 (R)) ? ?p1 p2
(R) This Results in One Pass Instead of Two
over Table

63
General Query Optimization Strategy

Identify Common Subexpressions
Compute Once and Store
use Stored Version for Subsequent Times
Often Useful When Views are Employed
Preprocess Data via Sorts and Indexes
Speeds up Searches and Joins by Limiting Scope
Evaluate and Assess Different Options
For Cartesian Product, Use Smaller Relation for
Comparison
Use System Catalog (Meta-data) to Effect Order in
Query Execution Plan

64
Relational Algebra Transformations

Cascade of Selection
?p1 p2 pn(R)????p1(?p2(...(?pn(R))...))
Commutativity of Selection
?p1(??p2(R))????p2(?p1(R))
?p1 or?p2(R )????p1(R??? ?p2(R)
Cascade of Projection
?A1,A2, An(R)????A1(?A2(...(?An(R))...))
???A1(R) if A1? A2? ... ? An
Commuting Selection with Projection
?A1,A2,...,An(?p(R))????p(?A1,A2,...,An(R)

65
Relational Algebra Transformations

Commutativity of Theta Join and Cartesian Product
R A S???S A R
R ? S???S ? R
Commuting Selection with Theta Join (Cartesian)
?p(A)(R ??S) ????p(A)(R)) ??S A defined on R
only
?p(A)p(B)(R ??S) ??? ?p(A)(R)) ?? (?p(B)(S)) (A
defined on R, B defined on S)
Also Holds for Theta Join as Well
Commuting Projection with Theta Join (Cartesian)
?C(R ??S) ???A(R) ???B(S) where A?BC
A are Attributes in C for R and B are Attributes
in C for S

66
Relational Algebra Transformations

Commutativity of Set Operations
R ??S ??S ??R
R ??S ??S ??R
Associativity of Set Operations
(R ??S) ??T ??R ???S ??T)
(R S) T ??R (S T)
(R ??S) ??S ??R ? (S ? T)
(R ??S) ??S ??R ??(S ??T)
Commuting Select with Set Operations
?p(Ai)(R ??T) ???p(Ai)(R) ???p(Ai)(T) where Ai
is defined on both R and T
?p(Ai)(R ??T) ???p(Ai)(R) ???p(Ai)(T) where Ai
is defined on both R and T

67
Relational Algebra Transformations

11. Commuting Projection with Union
?C(R q(Aj,Bk) S) ???A(R) q(Aj,Bk)
?B(S)
?C(R ??S) ???A (R) ???B (S)where RA and SB
C A' ??B' where A' ? A, B ? B
12. Converting Selection/Cartesian Into Theta
Join
?C (R ??S) ? R S

C
68
Heuristics Algebraic Optimization Concepts

Using Cascade of Selections Rule, Break up Any
Selections With Conjunctive Conditions Into a
Cascade of Selections
Allows More Freedom in Moving Selections Down
Different Branches of the Tree
Using Commutativity of Selections with Other
Operations Rules, Move Each Selection Down the
Query Tree as far as Possible
If Possible, Combine a Cartesian Product With a
Selection Into a Join

69
Heuristics Algebraic Optimization Concepts

Using Associativity of Binary Operations,
Rearrange the Leaf Nodes So That the Most
Restrictive Selections Are Executed First
The Fewer Tuples the Resulting Relation Contains,
the More Restrictive the Selection
Reducing the Size of Intermediate Results
Improves Performance
Using Cascade of Projections and Commutativity of
Projections with Other Operations, Move
Projections Down the Query Tree as Far as
Possible
Identify Subtrees that Represent Groups of
Operations that can be Executed by a Single
Algorithm

70
Heuristic Algebraic Optimization Algorithm

Use Rule 1 to Break up Selects with Conjunctions
into a Cascade to Move them Down the Query Tree
Use Rules 2, 4, 6, and 10 to Commute Select with
Project, Join, Cart. Prod., Union, and
Intersection
Use Rule 5 (Commute) and 9 (Associative) to
Rearrange the Leaf Nodes of Query Tree to
Most Restrictive Select Executed First
Avoid Cartesian Product in Leaf Nodes
Use Rule 12 to Convert a Select/Cart Prod to Join
Use Rules 3, 4, 7, and 11 to Cascade and Commute
Project - Pushing Down Tree as Far as Possible
Identify Subtrees that Can Execute as Independent
Algorithms (Set of Operations)

71
Heuristic Optimization Example
Canonical query tree at the end of query
preprocessing phase
?ENAME
?(DUR12 OR DUR24) AND JNAMECAD/CAM AND
ENAMEltgtJ. DOE
E(ENAME, ENO) P(JNO,JNAME) W(ENO,PNO,DUR)
JNO
ENO
P
W
E
72
Heuristic Optimization Example
?ENAME
?DUR12 OR DUR24
?JNAMECAD/CAM
Use cascading of selections rule to decompose
selections
?ENAME ltgt J. DOE
JNO
ENO
P
W
E
73
Heuristic Optimization Example
?ENAME
?DUR12 OR DUR24
?JNAMECAD/CAM
Push selection down using commutativity of
selection over join
JNO
ENO
?ENAME ltgt "J. Doe"
W
P
E
74
Heuristic OptimizationExample
?ENAME
?DUR12 OR DUR24
Push selection down using commutativity of
selection over join
JNO
ENO
?JNAME "CAD/CAM"
?ENAME ltgt "J. Doe"
P
W
E
75
Heuristic OptimizationExample
?ENAME
JNO
Push selection down
ENO
?JNAME "CAD/CAM"
?ENAME ltgt "J. Doe"
?DUR 12 ??DUR24
P
E
W
76
Heuristic OptimizationExample
?ENAME
JNO
Do early projection
?JNO,ENAME
ENO
?JNO
?JNO,ENAME
?JNO,ENO
?JNAME "CAD/CAM"
?ENAME ltgt "J. Doe"
?DUR 12 ??DUR24
P
E
W
77
Heuristic OptimizationExample
?ENAME
Identify subtrees that can be implemented in one
algorithm
JNO
?JNO,ENAME
ENO
?JNO
?JNO,ENAME
?JNO,ENO
?JNAME "CAD/CAM"
?ENAME ltgt "J. Doe"
?DUR 12 ??DUR24
P
E
W
78
Heuristic Optimization A Second Example
BOOKS(Title, Author, Pname, LC_No) PUBLISHERS(Pnam
e, Paddr, Pcity) BORROWERS(Name, Addr, City,
Card_No) LOANS(Card_No, LC_No, Date)
Let XLOANS ?S(?F(Loans x Borrowers x
Books)) where S Title, Author, Pname, LC_No,
Name,
Addr, City, Card_No, Date and F
Borrower.Card_No Loans.Card_No
Books.LC_No Loans.LC_No
79
Heuristic Optimization A Second Example
XLOANS
80
Heuristic Optimization A Second Example
?
Title
?
Date ? 1/1/88
?
Title, Author, Pname, LC_No, Name, Addr, City,
Card_No, Date
?
Borrower.Card_No Loans.Card_No Books.LC_No
Loans.LC_No
X
Query ?TITLE(?Date ? 1/1/88 (XLOANS))
Books
X
Loans
Borrower
81
Heuristic Optimization A Second Example
?
?
Title
Date ? 1/1/88
Try to Cascade
X
Books
X
Loans
Borrower
82
Heuristic Optimization A Second Example
?
Title
Commute Selectand Project
X
Books
X
Loans
Borrower
83
Heuristic Optimization A Second Example
?
Title
Commute Selectand Select
X
Books
X
Loans
Borrower
84
Heuristic Optimization A Second Example
?
Title
X
Commute Select andCartesian Product Two Levels
Down
Books
X
Borrower
Loans
85
Heuristic Optimization A Second Example
?
Title
Try to Cascade
Borrower.Card_No Loans.Card_No
X
Books
X
Borrower
Loans
86
Heuristic Optimization A Second Example
?
Title
X
Commute Select andCartesian Product One Level
Down
Books
?
Borrower.Card_No Loans.Card_No
X
Whats Next?
Borrower
Loans
87
Heuristic Optimization A Second Example
?
Title
X
CombineProjections
Books
?
Borrower.Card_No Loans.Card_No
X
Borrower
What is Still a Problem? We are Not Projecting so
All Attributes are Still Collected Until the
Final Project!
Loans
88
Heuristic Optimization A Second Example
X
Books
?
Borrower.Card_No Loans.Card_No
X
Borrower
Add Strategic Projections to Send Only the
Minimum Up the Tree as Needed for Join/Result Set
Loans
89
Heuristic Optimization A Second Example
What is the Final Step? Combine Select and
Cartesian Product Result Equijoins!
X
Books
?
Borrower.Card_No Loans.Card_No
X
Borrower
Loans
90
Heuristics Query Optimization Summary

First Apply Operations that Reduce the Size of
Intermediate Results
Move Selections and Projections Down the Tree as
far as Possible
Early Selections Reduce the Number of Tuples
Early Projections Reduce the Number of Attributes
Selection and Join Should be Executed Before
Other Similar Operations.
This is Accomplished by Reordering the Leaf Nodes
of the Tree Among Themselves and Adjusting the
Rest of the Tree Appropriately

91
Cost-Based Optimization

Reduce Defined Cost of Executing Queries
What is Involved in the Cost of Executing a
Query?
Access Cost to Secondary Storage
Search for Data Block (Index)
Read/Write Index and Data Blocks
Storage Cost
Index and Data Blocks
Intermediate Files
Computation Cost
Query Planning - Optimization Effort
Record Search, Sort, Merge
Actual Transaction/Query Operations
Communications Cost
Transfer of Results to the User

92
Complexity of Relational Operations

Assuming
Relations of Cardinality n
Sequential Scan of Data in each Relation
Complexity of Each Operation is Indicated
Avoid Cartesian Product at All Costs!

Operation
Complexity
Select
Project
O(n)
(w/o duplicate elimination)
Project
O(nlog n)
(with duplicate elimination)
Group
Join
O(nlog n)
Division
Set Operators
Cartesian Product
O(n2)
93
Cost-Based Optimization

To Understand Cost-Based Operations, we Must
Focus on Implementation Strategy of
Select
Project
Join
For Select and Project - There is a Fixed Cost
that we Must Live With
For Join
Implementation Strategy
Different Join Strategies
Objective
Minimize the Number of Blocks Involved
Note that Cost-Based and Relational Algebra
Heuristic Optimization Can Complement One Another

94
Implementation of SELECT

Principles
Equality Eliminates Many Tuples
Index Focuses and Limits Search Scope
Sequential Scan
Brute Force
Search All Records to Find Matching Ones
Binary Search
Equality Comparison on a Key Attribute
Primary Index or Hash Key for Single Record
Equality Comparison on a Key Attribute With
Primary Index or Hash Key
Go Directly to Record No Need to Scan Entire
Table
Cost to Maintain Index/Hash

95
Implementation of SELECT

Primary Index for Multiple Records
Use Primary Key to Find the Equality Attribute
Go Forward (gt or ?) or Backward (lt or ?)
According to the Comparison Operator
Clustering Index for Multiple Records
Equality Comparison on a Non-key Attribute With a
Clustering Index (e.g., Sort-Merge Algorithm)
Secondary Index
Equality or Range Queries
Primary Indexes Play a Role Similar to Searching
Sorted Array
Well Discuss Indexing Techniques at a Later Time

96
Implementation of SELECT

Conjunctive Selection (C1? C2 ? ? CN)
If One of the Conjuncts has a Good Access Path,
Use it and Check the Other Conjuncts for Each of
These Records
Composite Index
If an Index has Been Established Jointly for a
Number of Attributes in the Conjunct
Equality Condition
Intersection of Pointers
If Secondary Indexes Exist on All or Most of the
Attributes in the Conjunct and the Indexes
Include Record Pointers
Retrieve Each Attribute Using These Indexes and
Then Take Their Intersection

97
Implementing PROJECT

If ltAttribute Listgt Includes Key
Simple Since the Cardinality of the Result is the
Same as the Cardinality of the Original Relation
No Need to Remove Duplicates - Key Attribute
If ltAttribute Listgt Does Not Include Key
Duplicates Allowed
Duplicate Elimination
Sort After Projection and then Eliminate
Consecutively Appearing Duplicates
See Textbook for Algorithms
Use Hashing Hash Each Record Into a Bucket and
Check Against Records Already in That Bucket
Size Estimation card(?A(R))card(R)

98
Implementing JOIN

Nested Loop
Simple Iteration and Block-Oriented Iteration
For Each Block in R do
Retrieve Every Record from S and Test Join
Condition
An Index for S may Speed up the Inner Loop
Smaller Relation should be Outer Loop
Calculation of I/O
Let bo (bi) be the Number of Blocks taken up by
Outer (Inner) Relation
Let nB (gt1) the Buffer Size (in blocks) Devoted
to Arguments
Let bR be the size of the Resulting Relation (in
blocks)
Total no. of Block Access bo ?bo/(nB-1)??bi
bR

99
Implementing JOIN

Sort-Merge Join
Physically Sort Relations R and S
Scan R and S in the Sorted Order and Merge
See Algorithm in Textbook
If Files are Not Physically Sorted, but Sorted on
the Join Attributes, a Variation May be used
Quite Inefficient Since Records are Scattered
Over the Disk
Total number of block access b bi
bolog2bo bilog2bi bR

100
Implementing JOIN

Hash Join
Hash R and S Using the Same Hash Function
If Hash File Can Be Memory-Resident, it is
Efficient and Easy to Implement
If Buffer Space is Insufficient, then Part of the
Hash File has to be on Disk
Various Optimizations for this Case
Hybrid Hash Join is Described in the Book
Again - Biggest Problem is Overhead Associated
with Maintaining Hash Index Over Time

101
Access Using Indices Estimation of Costs
Example Given a bank database consisting of
the following three relation schemas Branch(bank
-name, assets, bank-city) Deposit(bank-name,
account-number, customer-name, balance) Customer(c
ustomer-name, street, zipcode, customer-city) Co
nsider the SQL query for the bank
database Select account-number From Deposit
Where bank-name BofA and customer-name
Bill and balance gt 1000
102
Heuristic Optimization

Use Cascading of Selections Rule to Decompose,
Three Logical Query Plan Alternatives Are
Obtained
Objective - Choose the Best Alternative in
Terms of Execution Time (Block Reads)
What should be the Focus in Select Order?

?Account-Number
?Account-Number
?Account-Number
? bank-name BofA
?balance gt 1000
? bank-name BofA
?customer-name Bill
?balance gt 1000
?customer-name Bill
?balance gt 1000
?customer-name Bill
? bank-name BofA
Deposit
Deposit
Deposit
103
Access Using Indices Estimation of Costs
Assumptions 100 Different Banks
(bank-name) 1000 Customers (on average) per
bank Balance could range from 0 to 10,000
dollars Branch(bank-name, assets,
bank-city) Deposit(bank-name, account-number,
customer-name, balance) Customer(customer-name,
street, zipcode, customer-city) Select
account-number From Deposit Where bank-name
BofA and customer-name Bill and
balance gt 1000
104
Estimation of Cost of Access - Version 1
Branch(bank-name, assets, bank-city) Deposit(bank-
name, account-number, customer-name,
balance) Customer(customer-name, street, zipcode,
customer-city)

Recall Assumptions
100 Banks
1000 Customers/Bank
0 to 10,000 dollars/account that are Distributed
Evenly Across Accts.
Tuples in Deposit? 100,000
What Does balance gt 1000 do?
Retrieve 90 of Accounts
All Banks, All Customers
What Does customer-name bill do?
All Customers Named Bill Regardless of the Bank
Is this a Good Strategy?

105
Estimation of Cost of Access - Version 2
Branch(bank-name, assets, bank-city) Deposit(bank-
name, account-number, customer-name,
balance) Customer(customer-name, street, zipcode,
customer-city)

Recall Assumptions
100 Banks
1000 Customers/Bank
0 to 10,000 dollars/account that are Distributed
Evenly Across Accts.
Tuples in Deposit 100,000
What Does bank-name BofA do?
Retrieves 1000 Tuples for BofA on Average
What Does customer-name bill do?
The Customer Bill
What Does balance gt 1000 do?
Is this a Good Strategy?

106
Estimation of Cost of Access - Version 3
Branch(bank-name, assets, bank-city) Deposit(bank-
name, account-number, customer-name,
balance) Customer(customer-name, street, zipcode,
customer-city)

Recall Assumptions
100 Banks
1000 Customers/Bank
0 to 10,000 dollars/account that are Distributed
Evenly Across Accts.
Tuples in Deposit 100,000
What Does customer-name bill do?
Retrieves 100 Tuples
One per Bank
What Does balance gt 1000 do?
Do they Have Enough Money?
What Does bank-name BofA do?
Is this a Good Strategy?

107
Join Strategies

Several Factors Influence the Selection of an
Optimal
The Physical Order of Tuples in a Relation
The Presence of Indices and the Type of Index
(Clustering or Nonclustering)
The Cost of Computing a Temporary Index for the
Sole Purpose of Processing One Query

Example Consider the natural join
Deposit ? Customer
nDeposit 10,000.
nCustomer 200.
20 tuples fit in one block for both relations
buffersize 2 blocks

108
Join Strategies Block-Oriented Iteration

Block-oriented Iteration
Process the relations on a per-block basis
rather on a per-tuple basis
Using this approach, a major saving in block
accesses results
Example Consider the natural join Deposit ?
Customer
nDeposit 10,000.
nCustomer 200.
20 tuples fit in one block for both relations
Case 1 outerloop Deposit , inner loop Customer
reading Customer once for every block of Deposit
tuples requires (200/20) (10,000/20) 10
500
reading Deposit relation requires 10000/20 500
block reads
the total cost in terms of block accesses is
5500
-gt 5000 blocks accesses to Customer and
-gt 500 blocks accesses to Deposit

109
Join Strategies Block-oriented Iteration

Case 2 outerloop Customer, inner loop Deposit
Reading Deposit once for every block of Customer
tuples requires (10,000/20) (200/20) 5000
Reading Customer relation requires 200/20
10 block reads
The total cost in terms of block accesses is
5010
gt5000 accesses to Deposit blocks and
gt10 accesses to Customer blocks
Case 3 If Customer relation is smaller enough
to fit in main memory, our strategy requires only
500 blocks to read Deposit relation and 10 blocks
to read Customer relation.
The total comes to 510 blocks

110
Query Execution Cost Summary

Access Cost to Secondary Storage
Search for Data Block (Index)
Read/write Index and Data Blocks
Storage Cost
Index and Data Blocks
Intermediate Files
Computation Cost
Query Planning
Record Search, Sort, Merge
Actual Transaction/query Operations
Communications Cost
Data Transfer Across a Network

111
Access Plan

Access Plan is a Concrete Query Processing Plan
which Presents a Detailed Strategy for Processing
a Query
The Main Cost Factors to Be Considered Include
The Relational Operations to be Performed
Indices to be Used
The Order in Which Tuples are to be Accessed
The Order in Which Operations are to be Performed
Typical Focus is on Join and Optimizing its
Execution, Particularly when Multiple Tables are
Involved

112
Statistics

The Following are Kept in the System Catalog for
Optimization Purposes
File Parameters Block Size
Number of Tuples in Each Relation
Size of Tuples
Key Fields, Indices
Number of Levels in an Index
Highest Key, Lowest Key
Number of Distinct Values (Maybe)
Others Frequency of Operations, Join Keys, Etc.
All DBMSs Keep the First Four, Many Keep All

113
Join Ordering

Given R S T W Determine the Best
Ordering Alternative
((R S) T) W
(R (S T)) W
R (S (T W))
((R T) S) W
((R W) S) T
(R S) (T W)
Ordering is Critical to Arrive at Best Strategy
for Execution, Particularly as
Number of Relations Increase
Size of Relation (Tuples/Blocks) Increase

114
Query Optimization Search Strategies

Exhaustive Search
Optimal
Combinatorial Complexity in the Number of
Relations
Heuristics
Not Optimal
Group Common Sub-expressions
Perform Selection, Projection First
Replace a Join by a Series of Semi-joins
Reorder Operations to Reduce Intermediate
Relation Size
Optimize Individual Operations

115
Query Optimization Timing Issues

Static
Compilation gt Optimize Prior to the Execution
Difficult to Estimate the Size of the
Intermediate Results gt Error Propagation
Can Amortize Over Many Executions
Dynamic
Run Time Optimization
Exact Information on the Intermediate Relation
Sizes
Have to Reoptimize for Multiple Executions
Hybrid
Compile Using a Static Algorithm
If the Error in Estimate Sizes gt Threshold,
Reoptimize at Run Time

116
Concluding Remarks

Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by
Any Query Optimizer is Limited
Some Systems Reduce the Number of Strategies by
Making a Heuristic Guess of Strategy for Each
Query
The Optimizer Considers Every Possible Strategy,
but Terminates as Soon as it Determines the Cost
is Greater than the Pre-chosen Strategy
Thus Only a Few Competing Strategies Require Full
Analysis of the Cost
The Overhead of Query Optimization is Reduced
Remember - Trade off in Optimization Time
For PL - Optimization is Pre-Execution (Compile)
For DB - Optimization is Part of Execution (Run)