Gitte Christensen

About This Presentation

Title:

Gitte Christensen

Description:

Title: SQL Query Optimisation Subject: Ian Smith's Master Author: Dept of Computing Sciences Last modified by: Gitte Christensen Created Date: 9/15/1997 9:06:54 AM – PowerPoint PPT presentation

Number of Views:77

Avg rating:3.0/5.0

Slides: 65

Provided by: Deptof66

Category:

more less

Transcript and Presenter's Notes

Title: Gitte Christensen

1
Managing External Data 3

Gitte Christensen
Dyalog Ltd

2
Relational Algebra

Relational Algebra is
the formal description of how a relational
database operates
the mathematics which underpin SQL
operations.
Operators in relational algebra are not
necessarily the same as SQL operators, even if
they have the same name.

3
Terminology

Relation - a set of tuples.
Tuple - a collection of attributes which describe
some real world entity.
Attribute - a real world role played by a named
domain.
Domain - a set of atomic values.
Set - a mathematical definition for a collection
of objects which contains no duplicates.

4
Operators - Write

INSERT - provides a list of attribute values for
a new tuple in a relation. This operator is the
same as SQL.
DELETE - provides a condition on the attributes
of a relation to determine which tuple(s) to
remove from the relation. This operator is the
same as SQL.
MODIFY - changes the values of one or more
attributes in one or more tuples of a relation,
as identified by a condition operating on the
attributes of the relation. This is equivalent to
SQL UPDATE.

5
Operators - Retrieval

There are two groups of operations
Mathematical set theory based relations UNION,
INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT.
Special database operations SELECT (not the
same as SQL SELECT), PROJECT, and JOIN.

6
Relational SELECT

SELECT is used to obtain a subset of the tuples
of a relation that satisfy a select condition.
For example, find all employees born after 1st
Jan 1950
SELECT dob gt 01/JAN/1950 (employee)

7
Relational PROJECT

The PROJECT operation is used to select a subset
of the attributes of a relation by specifying the
names of the required attributes.
For example, to get a list of all employees
surnames and employee numbers
PROJECT surname,empno (employee)

8
SELECT and PROJECT
SELECT and PROJECT can be combined together. For
example, to get a list of employee numbers for
employees in department number 1
9
Set Operations - semantics

Consider two relations R and S.
UNION of R and Sthe union of two relations is a
relation that includes all the tuples that are
either in R or in S or in both R and S. Duplicate
tuples are eliminated.
INTERSECTION of R and Sthe intersection of R and
S is a relation that includes all tuples that are
both in R and S.
DIFFERENCE of R and Sthe difference of R and S
is the relation that contains all the tuples that
are in R but that are not in S.

10
SET Operations - requirements

For set operations to function correctly the
relations R and S must be union compatible. Two
relations are union compatible if
they have the same number of attributes
the domain of each attribute in column order is
the same in both R and S.

11
UNION Example
12
INTERSECTION Example
13
DIFFERENCE Example
14
CARTESIAN PRODUCT

The Cartesian Product is also an operator which
works on two sets. It is sometimes called the
CROSS PRODUCT or CROSS JOIN.
It combines the tuples of one relation with all
the tuples of the other relation.

15
CARTESIAN PRODUCT Example
16
JOIN Operator

JOIN is used to combine related tuples from two
relations
In its simplest form the JOIN operator is just
the cross product of the two relations.
As the join becomes more complex, tuples are
removed within the cross product to make the
result of the join more meaningful.
JOIN allows you to evaluate a join condition
between the attributes of the relations on which
the join is undertaken.
The notation used is
R JOIN join condition S

17
JOIN Example
18
Natural Join

Invariably the JOIN involves an equality test,
and thus is often described as an equi-join. Such
joins result in two attributes in the resulting
relation having exactly the same value. A
natural join will remove the duplicate
attribute(s).
In most systems a natural join will require that
the attributes have the same name to identify the
attribute(s) to be used in the join. This may
require a renaming mechanism.
If you do use natural joins make sure that the
relations do not have two attributes with the
same name by accident.

19
OUTER JOINs

Notice that much of the data is lost when
applying a join to two relations. In some cases
this lost data might hold useful information. An
outer join retains the information that would
have been lost from the tables, replacing missing
data with nulls.
There are three forms of the outer join,
depending on which data is to be kept.
LEFT OUTER JOIN - keep data from the left-hand
table
RIGHT OUTER JOIN - keep data from the right-hand
table
FULL OUTER JOIN - keep data from both tables

20
OUTER JOIN Example 1
21
OUTER JOIN Example 2
22
SQL query optimisation

Optimisation Concept
Implementation of Rel. Algebra Operations
Oracle Query Execution Plans.
Btree Indexing
Rtree Indexing

23
SQL query optimisation
Parse and Translate
Relational algebra expression
SQL query
Optimisation using data statistics
Query result
Execution plan
Evaluate against Database
24
Optimisation steps

Parse
check SQL syntax and columns tables valid
Translate
SQL into relational algebra expression
Select most efficient query execution plan
minimisation of the input/output and cpu
requirements
Evaluate expression
Call required code modules

25
Example SQL query
S(sno,sname,status,scity) SP(sno,pno,qty)
P(pno,pname,colour,weight,pcity) Get supplier
name for suppliers who supply red
parts. SQL select sname from S,P,SP where S.sno
SP.sno and P.pno Sp.pno and colour red
26
Example SQL query

How would you do this?

27
Query as relational algebra expression and graph
(((((( P restrict colour red ) project pno
) join SP pno pno) project sno ) join S
sno sno ) project sname )
project join
project join
project restrict
P SP S
28
Relational algebra transformations

The following transformations can be applied
without regard to the actual data values stored
in the tables referenced in the SQL query.
They are being stated to justify some of the
common manipulations of the relational algebra
during optimisation.

29
Distributive transformation
Distributive Law f(A o B) f(A) o
f(B) e.g. restrict(A join B) restrict (A)
join restrict (B) Restrict distributes over
union, intersection, difference and join. For
join the restriction must, at its most complex,
consist of the AND of two separate conditions one
for each operand. Project distributes over union,
intersection, join. For join the join attributes
must be included in the projection. Thus
restricts / projects can often be done before
joins.
30
Commutative transformation
Commutative Law The operator o is commutative if
- A o B B o A for all values of A and
B Union, intersection and join are
commutative. E.g S join SP SP join S Thus
optimiser can choose best order typically the
smallest table is used to drive the join
processing.
31
Associative transformation
Associative Law Operator o is associative if - A
o ( B o C ) ( A o B ) o C for all A,B,C
values e.g. A join ( B join C ) ( A join B )
join C Union, intersection and join are
associative. Thus for a 3 table join the
optimiser can choose any sequence of pairs of
joins.
32
Semantic transformations
Referential Integrity Based. select pno from SP,S
where SP.sno S.sno If declared S.sno as PK
and SP.sno as FK then optimise to- select pno
from SP Application of Negator Functions where
not (status gt 20) this cannot use an index scan
so must use table scan but if transform to
status lt 20 - now can use index scan Assumes
that an index on the status column already exists
33
Will the results be the same?
select sname from s,sp where s,sno sp.sno and
sp.pno p2 select sname from sp,s where
sp.pno p2 and sp.sno s.sno select sname
from s where s.sno in ( select sp.sno from
sp where sp.pno p2) Explain. What is
the implication for performance of the three?
34
Will the results be the same?
35
Yes - all produce the same result rows

join s to sp can be transformed into join sp to s
as join operation is commutative
restrict can be distributed past the join thus
first two order of operations can be ignored
set membership can be converted to a join
optimisation will convert all 3 queries the same
query so all must have the same performance -
user can chose which to use

36
Break
37
SQL query optimisation
38
Algorithm selection

The relational algebra operations typically have
multiple code implementations. It is necessary
for the optimiser to select the most appropriate
for the current circumstances.
For example the presence of a direct access path
to the data required will be preferred to a
complete table scan if only 10 -20 of the table
is to be accessed.

39
Restriction implementations
Full table scan or table scan with early
termination All blocks must be searched unless
an early termination condition is found e.g row
found and no duplicates or data is ordered and
already passed the required data location. Btree
Non Clustered Index Search Tree search to
starting key then linear scan of sorted
keys. Assume to access each record needs a disk
block read.
40
Multiple restrictions
Conjunction Andd together If only some
columns are indexed then use most selective index
to locate the data record and check out the other
restriction conditions in the buffer. If all
columns are indexed then can get rowids of all
records meeting the restriction conditions and
perform an intersection. Use the result set of
rowids to locate the data records. Disjunction
Ord together If all columns indexed get all
rowids and perform union. Use the result set of
rowids to locate the data records.Else must scan
table.
41
Join implementations

All give the same final result set of rows as
defined by the join operation.
Algorithm and cost differ.
Nested loop algorithm will function with any join
condition I.e lt, lt, , gt, gt
Hash algorithm only possible with
Relation S 500 rows 50 rows per disk block
Relation R 100000 rows 50 rows per disk block

42
Join - Block Nested Loop Inner file pass 1
S Outer File Pass.Block Buffers Pass.Block
R Inner File
1.1 2.1 1.2 2.2 1.3 2.3
1.1 1.2
Join col values equal
43
Join - Block Nested LoopInner file pass 2
S Outer File Pass.Block Buffers Pass.Block
R Inner File
1.1 2.1 1.2 2.2 1.3 2.3
1.1 1.2
Join col values equal
44
Join - Block Nested Loop - Cost

Cost
No. Outer Loop Blocks No. Inner Loop blocks
No.
Outer Loop Blocks
Use smallest table as outer loop
i.e. (500/50 100000/50 ) 500/50 20010
Reduce cost of inner loop by having -
index built on inner loop join column which is
also a Primary Key - lookup avoiding full file
scan assume needs 1 disk read per row in outer
file
(500 / 50 500 ) 510

45
Equi Join - Hash example
Relation R 91 1 40 900
Disk Hash Buckets R 0 1

Disk Hash Buckets S 0 1
Relation S 40 101 90 1
1000 91
Value allocated to 0 if even else allocated to
1 if odd
46
Equi Join - Hash example
Relation R 91 1 40 900
Disk Hash Buckets R 0 40 900
1 1 91
Disk Hash Buckets S 0 40 90 1000 1
101 1 91
Relation S 40 101 90 1
1000 91
Value allocated to 0 if even else allocated to
1 if odd
47
Equi Join - Hash cost

Only need to compare record values from matching
hash buckets to find which records need joining.
Assumes mechanism to retrieve required data rows.
Cost
Read files in for hashing Write out to allocated
hash bucket Read in each pair of buckets for the
search for matching join values.
3 ( No Blocks of R No of blocks of S)
3 ( 2000 10 ) 6030

48
Join before restrict Example
S(sno,sname,status,scity) 100 rows 20 rows/disk
block SP(sno,pno,qty) 10000 rows 100/disk
block 50 with pno p2 select from S,SP where
pno p2 and S.sno SP.sno Approach 1 Cost
in Block I/O restrict write out after
restrict 1 join nested loop
5 (1005) 505 S SP Total 506
49
Restrict before join Example
Approach 2 Cost in Block I/O write out
after join 1 join nested
loop in RAM full scan of S
5 restrict full scan of
SP 100 S SP Total 106
50
Three table join query
branch(b_name, b_city, manager, b_address) 100
rows account(b_name, a_number, balance)
10 000 rows account_holder(h_name,a_number,h_addre
ss) 50 000 rows select h_name, balance from
account a, account_holder h, branch b where
a.a_number h.a_number and a.b_name
b.b_name and b.city Leicester and balance gt
10000
51
Optimised query plan
project(h_name,balance)
join (a_number) project(a_number,balance)
join (b_name) project (b_name)
project(h_name,a_number) restrict (b_city
Leicester) restrict(balancegt10000) branc
h account account_holder
52
Three table join query plan justification
53
Three table join query plan justification

Projects and restricts distributed past joins
join branch and account as pair of smallest
tables and therefore assume that this will
provide the smallest result set.

54
Oracle query demo tables
Tables arel ( col1 number(3) constraint pk_arel
primary key, col2 number(2), col3
varchar(400)) Btree index col1 col2 900
rows brel (col1 number(3) constraint pk_brel
primary key, col2 number(2)) Btree index col1
col2 900 rows
55
Data distribution arel (1)
Check data distribution of arel by - select
colX , count() from arel group by colX for X
1 i.e col1 get 900 groups with count of 1
thus 900 unique values no duplicates for X 3
i.e. col3 get 1 group with count of 900
thus 1 value duplicated 900 times Both above
have a uniform distribution.
56
Data distribution arel (2)
for X 2 i.e col2 get - COL2 COUNT() COL2
COUNT() 1 120 8 60 2
120 9 30 3 120 10
30 4 120 15 30 5 60 20
30 6 60 29 30 7
60 30 30 A non uniform distribution of
values in col2
57
Analysis of arel table to provide Oracle with
statistics
analyze table arel compute statistics
COLUMN_NAME NUM_DISTINCT NUM_NULLS
NUM_BUCKETS COL1 900 0
1 COL2 14 0 1 COL3
1 0 1 As col2is known
to have a non uniform distribution best to
generate a histogram i.e need more than 1 bucket
58
Analysis of arel col2
analyze table arel compute statistics for columns
col2 Now show endpoint data generated using a
query against a system table NUMBER
VALUE NUMBER VALUE 120 1 720
8 240 2 750 9 360 3 780
10 480 4 810 15 540 5 840
20 600 6 870 29 660
7 900 30
59
Query execution plan

Oracle uses the explain plan for select
statement to generate a query plan
Results stored in a table generated in your
schema by running a script utlxplan
Results are displayed from the table by running
the script utlxpls

60
Oracle query Restrict Project via indexed PK
col1
Restrict Project explain plan for select
col1,col2 from arel where col1 56 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 39 2 TABLE
ACCESS BY INDEX ROWID AREL 1 39 2
INDEX UNIQUE SCAN PK_AREL 1 1
Query buffer reads 3
61
Oracle query Restrict Project via non indexed
col3
Restrict Project explain plan for select
col1,col2 from arel where col3 x start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 900 200K 35 TABLE
ACCESS FULL AREL 900 200K 35 Query
buffer reads 229
62
Oracle query Restrict Project via indexed col2
(1)
Restrict Project explain plan for select
col1,col2 from arel where col2 30 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 30 1K 31 TABLE
ACCESS BY INDEX ROWID AREL 30 1K 31
INDEX RANGE SCAN AREL_COL2 30
1 Query buffer reads 34
63
Oracle query Restrict Project via indexed col2
(2)
Restrict Project explain plan for select
col1,col2 from arel where col2 1 start utlxpls
/ Operation Name Rows Bytes Cost SELECT
STATEMENT 120 4K 35 TABLE ACCESS
FULL AREL 120 4K 35 Query buffer
reads 236 Why has it changed just because
using a different row value?
64
Oracle query Project Join all rows arel brel
Project Join explain plan for select count()
from arel a, brel b where a.col1 b.col1 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 23 2SORT
AGGREGATE 1 23 NESTED LOOPS
900 20K 2
INDEX FAST FULL SCAN PK_AREL 900 11K
INDEX UNIQUE SCAN PK_BREL 900 8K
Query buffer reads 912
65
Oracle query Restrict Join via indexed cols (1)
Restrict Join Query buffer reads 4 explain
plan for select count() from arel a, brel b
where a.col1 b.col1 and a.col1 500 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 36 1 SORT
AGGREGATE 1 36 NESTED LOOPS 1 36 1
INDEX UNIQUE SCAN PK_BREL 1 10 1 INDEX
UNIQUE SCAN PK_AREL 1 26 1
66
Oracle query Restrict Join via indexed cols (2)
explain plan for Query buffer reads 64 select
count() from arel a, brel b where a.col1
b.col1 and a.col2 10 Operation Name
Rows Bytes Cost SELECT STATEMENT 1 36
31 SORT AGGREGATE 1 36 NESTED LOOPS 30
1K 31 TABLE ACCESS BY INDEX
ROWID AREL INDEX RANGE SCAN AREL_COL2
30 1 INDEX UNIQUE SCAN PK_BREL 900
8K
67
Oracle query Restrict Join via indexed cols (3)
Restrict Join Query buffer reads
351 explain plan for select count() from arel
a, brel b where a.col1 b.col1 and a.col2
1 Operation Name Rows Bytes Cost SELECT
STATEMENT 1 36 35 SORT AGGREGATE
1 36 NESTED LOOPS 120 4K 35
TABLE ACCESS FULL AREL 120 3K 35 INDEX
UNIQUE SCAN PK_BREL 900 8K
68
Oracle query Restrict Project via indexed col2
(3)
Restrict Project explain plan for select
col1,col2 from brel where col2 30 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 30 300 2 TABLE
ACCESS FULL BREL 30 300 2 Query buffer
reads 18 Previous query against
arel used indexed access. Why not for brel?
69
Oracle query Restrict Project via indexed col2
(4)
analyze table brel delete statistics explain
plan for select col1,col2 from brel where col2
1 start utlxpls / Operation Name Rows
Bytes Cost SELECT STATEMENT TABLE ACCESS BY
INDEX ROWID BREL INDEX RANGE
SCAN BREL_COL2 Now no statistics
so chooses index access
70
Optimiser estimation methods

Hard coded values in the optimiser algorithm
e.g restriction 1 rows on , 10 on gt
No consideration of actual data values
Statistics based on number of different distinct
values in column
assumes uniform distribution
Statistics based on histogram of count of unique
values.

71
Importance of estimates

The use of an index is only efficient if less
than 20 of the data is to be returned.
Each row access may require a separate disk read.
Thus same disk block may be retrieved more than
once if buffer space limited. More efficient to
execute a full table scan.
Cost of statistics collection needs to be traded
off against improved performance of queries based
on more accurate estimates.

72
Database Implementations
73
Relational Databases

The Conceptual model has become the Physical
model
Row based storage
No ordering of data
Row based searches
Very many comparisons is required to find and
manipulate data

74
Relational Databases/APL

Current APL/K/J databases
all relational
all Column stores
kdb
Vstar
flipdb
TakeCare

75
Relational Databases/APL

The relational model holds
Row stores are implemented with great engineering
skills
Use them if you can

76
Relational Databases/APL

but
The physical laws still apply
Some problems might need different solutions

77
Semistructured data
78
title year length filmType studioName starName
Star Wars 1977 124 color Fox Carrie Fisher
Star Wars 1977 124 color Fox Mark Hamill
Star Wars 1977 124 color Fox Harrison Ford
Mighty Ducks 1991 104 color Disney Emilio Estevez
Wayne's World 1992 95 color Paramount Dana Carvey
Wayne's World 1992 95 color Paramount Mike Meyers
79
Root
star
movie
star
starIn
mh
sw
cf
starOf
year
name
title
city
address
address
name
street
Star Wars
1977
Mark Hamill
Oak
street
Brentwood
Carrie Fisher
street
city
city
starOf
Hollywood
Maple
Locust
Malibu
starIn
80
Multidimensional data
81
(No Transcript)

Write a Comment

User Comments (0)