Optimizing Query Execution - PowerPoint PPT Presentation

1 / 27
About This Presentation
Title:

Optimizing Query Execution

Description:

CIS 650 Implementing Data Management Systems. January 26, 2005 ... Some initial suggestions for the project proposal ... The Duality of Hash and Sort ... – PowerPoint PPT presentation

Number of Views:41
Avg rating:3.0/5.0
Slides: 28
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: Optimizing Query Execution


1
Optimizing Query Execution
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Implementing Data Management Systems
  • January 26, 2005

Content on hashing and sorting courtesy
Ramakrishnan Gehrke
2
Administrivia
  • My office hour moved up ½ hour to 200-300 on
    Tuesdays
  • Next week
  • Some initial suggestions for the project proposal
  • Scheduling of the deadline for your midterm
    report

3
Todays Trivia Question
4
Query Execution
  • What are the goals?
  • Logical vs. physical plans what are the
    differences?
  • Some considerations in building execution
    engines
  • Efficiency minimize copying, comparisons
  • Scheduling make standard code-paths fast
  • Data layout how to optimize cache behavior,
    buffer management, distributed execution, etc.

5
Speeding Operations over Data
  • Three general data organization techniques
  • Indexing
  • Associative lookup synopses
  • Both for selection and projection
  • Inner loop of nested loops join
  • And anywhere sorted data is useful
  • Sorting
  • Hashing

6
Speeding Operations over Data
  • Three general data organization techniques
  • Indexing
  • Sorting
  • Hashing

7
General External Merge Sort
  • To sort a file with N pages using B buffer pages
  • Pass 0 use B buffer pages. Produce dN / Be
    sorted runs of B pages each
  • Pass 2, , etc. merge B-1 runs
  • Number of passes 1dlogB-1 dN / Bee
  • Cost 2N ( of passes)

INPUT 1
. . .
. . .
INPUT 2
. . .
OUTPUT
INPUT B-1
Disk
Disk
B Main memory buffers
8
Applicability of Sort Techniques
  • Aggregation
  • Duplicate removal as an instance of aggregation
  • XML nesting as an instance of aggregation
  • Join, semi-join, and intersection

9
Merge Join
  • Requires data sorted by join attributes
  • Merge and join sorted files, reading sequentially
    a block at a time
  • Maintain two file pointers
  • While tuple at R lt tuple at S, advance R (and
    vice versa)
  • While tuples match, output all possible pairings
  • Maintain a last in sequence pointer
  • Preserves sorted order of outer relation
  • Cost b(R) b(S) plus sort costs, if
    necessaryIn practice, approximately linear, 3
    (b(R) b(S))

10
Hashing
  • Several types of hashing
  • Static hashing
  • Extendible hashing
  • Consistent hashing (used in P2P well see later)

11
Static Hashing
  • Fixed number of buckets (and pages) overflow
    when necessary
  • h(k) mod N bucket to which data entry with key
    k belongs
  • Downside long overflow chains

0
h(key) mod N
2
key
h
N-1
Primary bucket pages
Overflow pages
12
Extendible Hashing
  • If a bucket becomes full split in half
  • Use directory of pointers to buckets, double the
    directory, splitting just the bucket that
    overflowed
  • Directory much smaller than file, so doubling it
    is much cheaper
  • Only one page of data entries is split
  • Trick lies in how hash function is adjusted!

13
Insert h(r)20 (Causes Doubling)
2
LOCAL DEPTH
3
LOCAL DEPTH
Bucket A
16
32
GLOBAL DEPTH
32
16
Bucket A
GLOBAL DEPTH
2
2
2
3
Bucket B
1
5
21
13
00
1
5
21
13
000
Bucket B
01
001
2
10
2
010
Bucket C
10
11
10
Bucket C
011
100
2
2
DIRECTORY
101
Bucket D
15
7
19
15
19
7
Bucket D
110
111
2
3
Bucket A2
20
4
12
DIRECTORY
20
12
Bucket A2
4
(split image'
of Bucket A)
(split image'
of Bucket A)
14
Relevance of Hashing Techniques
  • Hash indices use extensible hashing
  • Uses of static hashing
  • Aggregation
  • Intersection
  • Joins
  • Why isnt extendible hashing used in hash joins
    only as a disk indexing technique?

15
Hash Join
  • Read entire inner relation into hash table (join
    attributes as key)
  • For each tuple from outer, look up in hash table
    join
  • Not fully pipelined

16
Running out of Memory
  • Prevention First partition the data by value
    into memory-sized groups
  • Partition both relations in the same way, write
    to files
  • Recursively join the partitions
  • Resolution Similar, but do when hash tables
    full
  • Split hash table into files along bucket
    boundaries
  • Partition remaining data in same way
  • Recursively join partitions with diff. hash fn!
  • Hybrid hash join flush lazily a few buckets at
    a time
  • Cost lt 3 (b(R) b(S))

17
The Duality of Hash and Sort
  • Different means of partitioning and merging data
    when comparisons are necessary
  • Break on physical rule (mem size) in sorting
  • Merge on logical step, the merge
  • Break on logical rule (hash val) in hashing
  • Combine using physical step (concat)
  • When larger-than-memory sorting is necessary,
    multiple operators use the same key, we can make
    all operators work on the same in-memory portion
    of data at the same time
  • Can we do this with hashing? Hash teams (Graefe)

18
What If I Want to Distribute Query Processing?
  • Where do I put the data in the first place (or do
    I have a choice)?
  • How do we get data from point A -gt point B?
  • What about delays?
  • What about binding patterns?
  • Looks kind of like an index join with a sargable
    predicate

19
Pipelined Hash Join Useful for Joining Web Sources
  • Two hash tables
  • As a tuple comes in, add to the appropriate side
    join with opposite table
  • Fully pipelined, adaptive to source data rates
  • Can handle overflow as with hash join
  • Needs more memory

20
The Dependent Join
  • Take attributes from left and feed to the right
    source as input/filter
  • Important in data integration
  • Simple method
  • for each tuple from left send to right
    source get data back, join
  • More complex
  • Hash cache of attributes mappings
  • Dont send attribute already seen
  • Bloom joins (use bit-vectors to reduce traffic)

JoinA.x B.y
A
B
x
21
Wrap-Up of Execution
  • Query execution is all about engineering for
    efficiency
  • O(1) and O(lg n) algorithms wherever possible
  • Avoid looking at or copying data wherever
    possible
  • Note that larger-than-memory is of paramount
    importance
  • Should that be so in todays world?
  • As weve seen it so far, its all about
    pipelining things through as fast as possible
  • But may also need to consider other axes
  • Adaptivity/flexibility may sometimes need this
  • Information flow to the optimizer, the runtime
    system

22
Query Optimization
  • Challenge pick the query execution plan that
    has minimum cost
  • Sources of cost
  • Interactions with other work
  • Size of intermediate results
  • Choices of algorithms, access methods
  • Mismatch between I/O, CPU rates
  • Data properties skew, order, placement
  • Strategy Estimate the cost of every query plan,
    find cheapest
  • Given
  • Some notion of CPU, disk speeds
  • Cost model for every operator
  • Some information about tables and data

23
The General Model of Optimization
  • Given an AST of a query
  • Build a logical query plan
  • (Tree of query algebraic operations)
  • Transform into better logical plan
  • Convert into a physical query plan
  • (Includes strategies for executing operations)

24
Which Operators Need Significant Optimization
Decisions?
  • We typically make the following assumptions
  • All predicates are evaluated as early as possible
  • All data is projected away as early as possible
  • As a general rule, those that produce
    intermediate state or are blocking
  • Joins
  • Aggregation
  • Sorting
  • By choosing a join ordering, were automatically
    choosing where selections and projections are
    pushed why is this so?

25
The Basic Model System-R
  • Breaks a query into its blocks, separately
    optimizes them
  • Focuses strictly on joins (and only a few kinds)
    in dynamic programming enumeration
  • Principle of optimality best k-way join includes
    best (k-1)-way join
  • Use simple table statistics when available, based
    on indices magic numbers where unavailable
  • Heuristics
  • Push sargable selects, projects as low as
    possible
  • Cartesian products after joins
  • Left-linear trees only n2n-1 cost-est.
    operations
  • Grouping last
  • Extra interesting orders dimension
  • Grouping, ordering, join attributes

26
Next Time Beyond System-R
  • Cross-query-block optimizations
  • e.g., push a selection predicate from one block
    to another
  • Better statistics
  • More general kinds of optimizations
  • Optimization of aggregation operations
  • Different cost and data models, e.g., OO, XML
  • Additional joins, e.g., containment joins
  • Can we build an extensible architecture for this?
  • Logical, physical, and logical-to-physical
    transformations
  • Enforcers
  • Alternative search strategies
  • Left-deep plans arent always optimal
  • Perhaps we can prune more efficiently

27
Upcoming Readings
  • For Monday
  • Read EXODUS and Starburst papers
  • Write one review contrasting the two on the major
    issues
Write a Comment
User Comments (0)
About PowerShow.com