15. Query Optimization - PowerPoint PPT Presentation

About This Presentation
Title:

15. Query Optimization

Description:

EXPLAIN the output of an optimizer. Overview of internals. Dynamic programming ... The output is displayed by the EXPLAIN statement ... – PowerPoint PPT presentation

Number of Views:341
Avg rating:3.0/5.0
Slides: 31
Provided by: RaghuRamak
Learn more at: http://web.cecs.pdx.edu
Category:

less

Transcript and Presenter's Notes

Title: 15. Query Optimization


1
15. Query Optimization
  • System Design
  • Parsing
  • Examples
  • Modern Optimizers
  • EXPLAIN the output of an optimizer
  • Overview of internals
  • Dynamic programming
  • VP view the internals
  • Examples
  • Variants
  • Top Down Optimization
  • Optimizer Hints
  • Unnesting Queries

2
Learning Objectives
  • Explain EXPLAIN
  • Explain VPs output
  • Explain each variant
  • Flatten a nested query.

3
Overview of Query Processing
SQL
Parser
Security
Catalog
Relational Algebra(RA)
Optimizer
Executable Plan (RAAlgorithms)
Plan Executor
Concurrency
Files, Indexes Access Methods
Crash Recovery
Database, Indexes
4
Now we focus on the top of this diagram
Relation Algebra Query
SQL Query
Parser
Query Optimizer
Relational Operator Algs.
Files and Access Methods
Buffer Management
Disk Space Management
DB
5
Detail of the top
Query Parser
SQL Query(SELECT )
Relational Algebra Expression (Query Tree)
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Query Tree Algorithms (Plan)
Plan Evaluator
6
Parsing and Optimization
  • The Parser
  • Verifies that the SQL query is syntactically
    correct, that the tables and attributes exist,
    and that the user has the appropriate
    permissions.
  • Translates the SQL query into a simple query tree
    (operators relational algebra plus a few other
    ones)
  • The Optimizer
  • Generates other, equivalent query trees
    (Actually builds these trees bottom up)
  • For each query tree generated
  • Selects algorithms for each operator (producing
    a query plan)
  • estimates the cost of the plan
  • Chooses the plan with lowest cost (of the plans
    considered, which is not necessarily all possible
    plans)

7
Schema for Examples
  • Download Postgres source from
  • postgresql.org/download
  • Logical and physical schema is at
  • src/test/bench/create.source
  • Simplified version is at
  • www.cs.pdx.edu/587/create.bench
  • Log into the database at
  • https//www.cat.pdx.edu/pgCS587/
  • Click on PostgreSQL
  • User CS587, Password 587dbms, see notes view
  • Browse the tables, especially tenk1 and its
    attributes unique1, unique2, stringu1

8
Heres what the parser does
Relational Algebra Tree
SQL Query
SELECT tenk1.unique1 FROM tenk1 JOIN
tenk2 USING unique2 WHERE
tenk1.stringu1'xxx'
?tenk1.unique1
? tenk1.stringu1'xxx'
?
Unique2unique2
tenk2
tenk1
9
Practice Parse a Query
  • Describe the parser's output when the input is
  • SELECT stringu2
  • FROM tenk1 JOIN tenk2
  • USING unique1
  • WHERE tenk1.stringu1'abc'

10
How Can We View the Optimizer?
  • Postgres calls its optimizer the Planner
  • Postgres' planner algorithm 668 is the same as
    all modern DBMSs' optimizer algorithms
  • Except SQL Server
  • We have good news and good news.
  • We can see both the planner's output AND its
    internal processing.
  • Its output is available to anyone its internal
    processing is avaliable only to us and a few
    others.
  • The output is displayed by the EXPLAIN statement
  • Every DBMS has a version of EXPLAIN (e.g., SHOW
    PLAN)

11
Postgres EXPLAIN
  • Output for
  • EXPLAIN SELECT FROM tenk1
  • Seq Scan on tenk1 (cost0.00.. 445.00 rows10000
    width244)
  • These values are estimates from sampling.
  • Very useful when a query runs longer than
    expected.
  • All our examples are from
  • www.postgresql.org/docs/8.3/interactive/using-exp
    lain.html
  • Actually this includes CPU costs but we will
    call it I/O costs to simplify

Sequential Scan
I/Os to get first row
I/Os to get last row
Rows retrieved
Average Row Width
12
More EXPLAIN examples
  • EXPLAIN SELECT FROM tenk1 WHERE unique1 lt 7000
  • Seq Scan on tenk1 (cost0.00..470.00 rows7124
    width244)
  • Filter (unique1 lt 7000)
  • Cost is higher because of CPU cost for filtering
  • rows is off because of estimation using
    histogram
  • EXPLAIN SELECT FROM tenk1 WHERE unique1 lt 1
  • Index Scan using tenk1_unique1 on tenk1
    (cost0.00..8.27 rows1 width244)
  • Index Cond (unique1 lt 1)
  • Why is the cost so much less?
  • EXPLAIN SELECT FROM tenk1 WHERE unique1 lt 10
  • Bitmap Heap Scan on tenk1 (cost4.34..42.58
    rows11 width244)
  • Recheck Cond (unique1 lt 10)
  • -gt Bitmap Index Scan on tenk1_unique1
    (cost0.00..4.33 rows11 width0)
  • Index Cond (unique1 lt 10)
  • Shopping list optimization!

13
The planner's internal processing
  • So far we've seen the planner's output its
    opinion as to what is the fastest plan. How does
    it reach that conclusion?
  • Fortunately, our TA, Tom Raney, has just added a
    patch to PostgreSQL (PG) that allows anyone to
    look inside the planner.
  • This was part of a Google Summer of Code project
  • One of the lead PG developers says its like
    finding Sasquatch.
  • No other DBMS has this capability.
  • Well use Toms patch to view the planner's
    internals.

14
Overview of DBMS Optimizers
  • Recall that "optimizing a query" consists of
    these 4 tasks
  • Generate all trees equivalent to the
    parser-generated tree
  • Assign algorithms to each node of each tree
  • A tree with algorithms is called a plan.
  • Calculate the cost of each generated plan
  • Using the join cost formulas we learned in
    previous slides
  • Choose the cheapest plan
  • A nice independent study project would be to
    write a visualizer for the parser
  • Use Raney's Visual Planner here to look at a
    plan
  • Statistics for calculating these costs are
    kept in the system catalog.

15
Dynamic Programming
  • A no-brainer approach to these 4 tasks could take
    forever. For medium-large queries there are
    millions of plans and it can take a millisecond
    to compute each plan cost, resulting in hours to
    optimize a query.
  • This problem was solved in 1979 668 by Patsy
    Selinger's IBM team using Dynamic Programming.
  • The trick is to solve the problem bottom-up
  • First optimize all one-table subqueries
  • Then use those optimal plans to optimize all
    two-table subqueries
  • Use those results to optimize all three-table
    subqueries, etc.

16
Consider A Query and its Parsed Form
  • SELECT tenk1.unique1
  • FROM tenk1 JOIN tenk2 USING (unique2)
  • WHERE tenk1.unique1lt 100

?tenk1.unique1
? tenk1.unique1lt100
?
unique2unique2
tenk2
tenk1
17
What Will a Selinger-type Optimizer Do?
  • Optimize one table subqueries
  • tenk1 WHERE unique1lt100
  • This is called "pushing selects"
  • Then optimize tenk2
  • Use the results of the previous steps to Optimize
    two-table queries
  • The entire query
  • Let's use Raney's patch, the Visual Planner, to
    see what PG's Planner does.

18
How to Set Up Your Visual Planner
  • Download, then unzip, in Windows or NIX
  • cs.pdx.edu/len/587/VP1.7.zip
  • Read README.TXT, don't worry about details
  • Be sure your machine has a Java VM
  • http//www.java.com/en/download/index.jsp
  • Click on Visual_Planner.jar
  • If that does not work, use this at the command
    line
  • java -jar Visual_Planner.jar
  • In the resulting window
  • File/Open
  • Navigate to the directory where you put VP1.7
  • Navigating to C may take a while
  • Choose plan1.pln

19
Windows in the Visual Planner
  • The SQL window holds the (canned) query
  • The Plan Tree window holds the optimal plan for
    the query (in this VP view).
  • The Statistics window holds statistics about the
    highlighted node of the Plan Tree's plan
  • Click a Plan Tree node to see its statistics
  • Why is a nested loop the optimal plan?

20
Why Not?
  • To see other plans, click on tenk1/tenk2 in the
    ROI window.
  • Then shift-click the plan you want to see
  • Plans are in alphabetical order, then by total
    cost
  • Why isn't a merge join cheaper?
  • Why isn't a hash join cheaper?

21
Visualize Dynamic Programming
  • Recall the first steps of Dynamic Programming
  • Optimize tenk1 where unique1lt100
  • Optimize tenk2.
  • VP calls these the ROI steps and they are
    displayed in the ROI window of VP.
  • In the ROI window, click on the symbol next to
    tenk2 to see how the PG Planner optimized tenk2.
  • Note that blue plans are saved for later steps,
    red plans are discarded.
  • Postgres uses an internal data structure called
    RelOptInfo to hold the relations currently being
    optimized

22
Optimizing tenk2 Interesting Orders
  • The cheapest access path(plan) is a sequential
    scan.
  • However, an index scan is also saved. Why?
    Because the index scan has an order associated
    with it, and the order is an interesting order.
  • The order is unique2, and unique2 is the joining
    attribute for the later join.
  • It may be worth sacrificing some cost here to
    save the cost of a sort later!
  • plan access path since tenk2 is a single table

23
Optimizing tenk1
  • Explain each of the planner's decisions in its
    optimization of tenk1.

24
Variants on what we've discussed
  • SQLServer Top down
  • Graefe, McKenna, The Cascades Framework for
    query optimization, DEBulletin, 1995.
  • Hints
  • Rewrite optimization rules unnesting

25
Top Down Optimization
  • Begin with original query
  • Consider subqueries, optimize them.
  • Depth first search
  • Example A ? B ? C
  • First optimize, say, (A ? B) ?C.
  • If its cost is less than B ? C, need not
    calculate the cost of A ? (B ?C).
  • Memo structure used to keep track of optimized
    subqueries.

26
Optimizer Hints
  • A hint tells the optimizer to ignore its
    algorithm in part, for example
  • Order the joins in a certain way
  • Use a particular index
  • Use a type of join for a pair of tables.
  • Oracle has over 120 possible hints
  • www.dba-oracle.com/art_otn_cbo_p7.htm
  • SQL Server
  • www.sql-server-performance.com/tips/hints_general_
    p1.aspx

27
15.5 Nested Queries
  • No-brainer method for executing nested queries
  • Tuple iteration semantics
  • For each outer tuple, evaluate inner block
  • Equivalent to simple nested loop join
  • Optimizer optimizes inner block, then outer block
  • Is there a better way?

SELECT S.sid FROM Sailors S WHERE EXISTS
(SELECT FROM Reserves R WHERE
R.bid103 AND R.sidS.sid)
Nested block to optimize SELECT FROM
Reserves R WHERE R.bid103 AND R.sid
outer value
28
Unnesting queries
Equivalent non-nested query SELECT DISTINCT
S.sid FROM Sailors S, Reserves R WHERE
S.sidR.sid AND R.bid103
  • Optimizer (proprietary systems mostly) or you
    (open source systems mostly) unnest nested
    queries.
  • If the query is unnested, the optimizer can use
    bulk join algorithms (merge, hash join) and
    performance can be much better.

29
Unnesting with COUNT
SELECT S.sname FROM Sailors S WHERE S.rank gt
(SELECT COUNT(R.) FROM Reserves R
WHERE R.sidS.sid)
  • Beware if there is a COUNT in the subquery
  • The query may appear to unnest into a join with a
    GROUP BY.
  • But consider a sailor with a high rank and no
    reservations

SELECT S.sname FROM Sailors S, Reserves R WHERE
S.sidR.sid GROUP BY S.sid HAVING S.rank gt
COUNT(R.)
30
Unnesting The Count Bug 298
  • The query may not unnest into a join, but rather
    an outer join.
  • Many queries are much harder or impossible to
    unnest!

SELECT S.sname FROM Sailors S WHERE S.rank gt
(SELECT FROM Reserves R WHERE
R.sidS.sid)
SELECT S.sname FROM Sailors S NATURAL RIGHT
OUTER JOIN Reserves R GROUP BY S.sid HAVING
S.rank gt COUNT(R.)
Write a Comment
User Comments (0)
About PowerShow.com