Optimizing Query Execution presentation

About This Presentation

Transcript and Presenter's Notes

Title: Optimizing Query Execution

1
Optimizing Query Execution

Zachary G. Ives
University of Pennsylvania
CIS 650 Implementing Data Management Systems
January 26, 2005

Content on hashing and sorting courtesy
Ramakrishnan Gehrke
2
Administrivia

My office hour moved up ½ hour to 200-300 on
Tuesdays
Next week
Some initial suggestions for the project proposal
Scheduling of the deadline for your midterm
report

3
Todays Trivia Question
4
Query Execution

What are the goals?
Logical vs. physical plans what are the
differences?
Some considerations in building execution
engines
Efficiency minimize copying, comparisons
Scheduling make standard code-paths fast
Data layout how to optimize cache behavior,
buffer management, distributed execution, etc.

5
Speeding Operations over Data

Three general data organization techniques
Indexing
Associative lookup synopses
Both for selection and projection
Inner loop of nested loops join
And anywhere sorted data is useful
Sorting
Hashing

6
Speeding Operations over Data

Three general data organization techniques
Indexing
Sorting
Hashing

7
General External Merge Sort

To sort a file with N pages using B buffer pages
Pass 0 use B buffer pages. Produce dN / Be
sorted runs of B pages each
Pass 2, , etc. merge B-1 runs
Number of passes 1dlogB-1 dN / Bee
Cost 2N ( of passes)

INPUT 1
. . .
. . .
INPUT 2
. . .
OUTPUT
INPUT B-1
Disk
Disk
B Main memory buffers
8
Applicability of Sort Techniques

Aggregation
Duplicate removal as an instance of aggregation
XML nesting as an instance of aggregation
Join, semi-join, and intersection

9
Merge Join

Requires data sorted by join attributes
Merge and join sorted files, reading sequentially
a block at a time
Maintain two file pointers
While tuple at R lt tuple at S, advance R (and
vice versa)
While tuples match, output all possible pairings
Maintain a last in sequence pointer
Preserves sorted order of outer relation
Cost b(R) b(S) plus sort costs, if
necessaryIn practice, approximately linear, 3
(b(R) b(S))

10
Hashing

Several types of hashing
Static hashing
Extendible hashing
Consistent hashing (used in P2P well see later)

11
Static Hashing

Fixed number of buckets (and pages) overflow
when necessary
h(k) mod N bucket to which data entry with key
k belongs
Downside long overflow chains

0
h(key) mod N
2
key
h
N-1
Primary bucket pages
Overflow pages
12
Extendible Hashing

If a bucket becomes full split in half
Use directory of pointers to buckets, double the
directory, splitting just the bucket that
overflowed
Directory much smaller than file, so doubling it
is much cheaper
Only one page of data entries is split
Trick lies in how hash function is adjusted!

13
Insert h(r)20 (Causes Doubling)
2
LOCAL DEPTH
3
LOCAL DEPTH
Bucket A
16
32
GLOBAL DEPTH
32
16
Bucket A
GLOBAL DEPTH
2
2
2
3
Bucket B
1
5
21
13
00
1
5
21
13
000
Bucket B
01
001
2
10
2
010
Bucket C
10
11
10
Bucket C
011
100
2
2
DIRECTORY
101
Bucket D
15
7
19
15
19
7
Bucket D
110
111
2
3
Bucket A2
20
4
12
DIRECTORY
20
12
Bucket A2
4
(split image'
of Bucket A)
(split image'
of Bucket A)
14
Relevance of Hashing Techniques

Hash indices use extensible hashing
Uses of static hashing
Aggregation
Intersection
Joins
Why isnt extendible hashing used in hash joins
only as a disk indexing technique?

15
Hash Join

Read entire inner relation into hash table (join
attributes as key)
For each tuple from outer, look up in hash table
join
Not fully pipelined

16
Running out of Memory

Prevention First partition the data by value
into memory-sized groups
Partition both relations in the same way, write
to files
Recursively join the partitions
Resolution Similar, but do when hash tables
full
Split hash table into files along bucket
boundaries
Partition remaining data in same way
Recursively join partitions with diff. hash fn!
Hybrid hash join flush lazily a few buckets at
a time
Cost lt 3 (b(R) b(S))

17
The Duality of Hash and Sort

Different means of partitioning and merging data
when comparisons are necessary
Break on physical rule (mem size) in sorting
Merge on logical step, the merge
Break on logical rule (hash val) in hashing
Combine using physical step (concat)
When larger-than-memory sorting is necessary,
multiple operators use the same key, we can make
all operators work on the same in-memory portion
of data at the same time
Can we do this with hashing? Hash teams (Graefe)

18
What If I Want to Distribute Query Processing?

Where do I put the data in the first place (or do
I have a choice)?
How do we get data from point A -gt point B?
What about delays?
What about binding patterns?
Looks kind of like an index join with a sargable
predicate

19
Pipelined Hash Join Useful for Joining Web Sources

Two hash tables
As a tuple comes in, add to the appropriate side
join with opposite table
Fully pipelined, adaptive to source data rates
Can handle overflow as with hash join
Needs more memory

20
The Dependent Join

Take attributes from left and feed to the right
source as input/filter
Important in data integration
Simple method
for each tuple from left send to right
source get data back, join
More complex
Hash cache of attributes mappings
Dont send attribute already seen
Bloom joins (use bit-vectors to reduce traffic)

JoinA.x B.y
A
B
x
21
Wrap-Up of Execution

Query execution is all about engineering for
efficiency
O(1) and O(lg n) algorithms wherever possible
Avoid looking at or copying data wherever
possible
Note that larger-than-memory is of paramount
importance
Should that be so in todays world?
As weve seen it so far, its all about
pipelining things through as fast as possible
But may also need to consider other axes
Adaptivity/flexibility may sometimes need this
Information flow to the optimizer, the runtime
system

22
Query Optimization

Challenge pick the query execution plan that
has minimum cost
Sources of cost
Interactions with other work
Size of intermediate results
Choices of algorithms, access methods
Mismatch between I/O, CPU rates
Data properties skew, order, placement
Strategy Estimate the cost of every query plan,
find cheapest
Given
Some notion of CPU, disk speeds
Cost model for every operator
Some information about tables and data

23
The General Model of Optimization

Given an AST of a query
Build a logical query plan
(Tree of query algebraic operations)
Transform into better logical plan
Convert into a physical query plan
(Includes strategies for executing operations)

24
Which Operators Need Significant Optimization
Decisions?

We typically make the following assumptions
All predicates are evaluated as early as possible
All data is projected away as early as possible
As a general rule, those that produce
intermediate state or are blocking
Joins
Aggregation
Sorting
By choosing a join ordering, were automatically
choosing where selections and projections are
pushed why is this so?

25
The Basic Model System-R

Breaks a query into its blocks, separately
optimizes them
Focuses strictly on joins (and only a few kinds)
in dynamic programming enumeration
Principle of optimality best k-way join includes
best (k-1)-way join
Use simple table statistics when available, based
on indices magic numbers where unavailable
Heuristics
Push sargable selects, projects as low as
possible
Cartesian products after joins
Left-linear trees only n2n-1 cost-est.
operations
Grouping last
Extra interesting orders dimension
Grouping, ordering, join attributes

26
Next Time Beyond System-R

Cross-query-block optimizations
e.g., push a selection predicate from one block
to another
Better statistics
More general kinds of optimizations
Optimization of aggregation operations
Different cost and data models, e.g., OO, XML
Additional joins, e.g., containment joins
Can we build an extensible architecture for this?
Logical, physical, and logical-to-physical
transformations
Enforcers
Alternative search strategies
Left-deep plans arent always optimal
Perhaps we can prune more efficiently

Optimizing Query Execution PowerPoint PPT Presentation