Cost-based Query Scrambling for Initial Delays

Slides: 29

Provided by: zmao

Learn more at: http://db.cs.berkeley.edu

Transcript and Presenter's Notes

1
Cost-based Query Scrambling for Initial Delays

Tolga Urhan
Michael J. Franklin
Laurent Amsaleg

2
Introduction

Problem: Remote data access (e.g., from the Web)
during query processing introduces unpredictable
delays.
Traditional query scrambling, based on simple
heuristics, is susceptible to problems from bad
scrambling decisions.
Three different approaches to using query
optimization for scrambling are proposed.
These approaches make more intelligent scrambling
decisions.

3
Query Scrambling Overview

Goal: To hide the delays encountered when
obtaining data from remote sources by performing
other useful work.
Consists of two phases:
Rescheduling phase
Operator Synthesis phase
The objective function of the optimization is
based on either
total work
or response time

4
Goals of this paper

Focus on the problem of initial delay:
a delay in receiving the first tuple from a
particular remote source.
Due to:
difficulty in establishing a connection to a
remote source,
heavy load on the remote source,
large amount of work the remote source needs to
perform (lack of global optimization)
Investigate the use of both response time-based
and total work-based optimization for query
scrambling

5
Cost-based Query Scrambling

Assumptions:
The query execution environment consists of a
query site and remote data sources.
Processing work occurs at both the query site and
the remote sites.
[Figure: example of a complex query tree. Labels:
Query Result, Select, Join, Communication Link;
relations A, B, C, D, E]

6
Iterative Process of Query Scrambling

(1) Rescheduling: the execution plan of a query is
dynamically rescheduled when a delay is detected.
(2) Operator Synthesis: new operators can be
created when no other operators can execute.
E.g., stalls in getting A tuples:
(1) retrieve B tuples, execute D join E
(2) execute a new join between
B and (D join E) join C

7
Cost-based Rescheduling

Identify runnable subtrees: subtrees made up
entirely of nonblocked operators.
Runnable subtrees can be scheduled out of order
by inserting a materialization operator at the
root of each.
Materialization operators: they issue Open, Next,
and Close calls to the root of the subtree and
save the results in a temporary relation.
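
To make the definition of a runnable subtree concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Op` class, the `blocked` flag, and the example tree shape are all hypothetical, chosen only to mirror the "stalls in getting A tuples" example from the earlier slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """Hypothetical plan operator; `blocked` marks a scan of a delayed source."""
    name: str
    children: List["Op"] = field(default_factory=list)
    blocked: bool = False


def is_runnable(op: Op) -> bool:
    # A subtree is runnable only if every operator in it is nonblocked.
    return not op.blocked and all(is_runnable(c) for c in op.children)


def maximal_runnable_subtrees(root: Op) -> List[Op]:
    # Return the largest runnable subtrees, scanning top-down, left to right.
    if is_runnable(root):
        return [root]
    found: List[Op] = []
    for child in root.children:
        found.extend(maximal_runnable_subtrees(child))
    return found


# Assumed example tree: ((A join B) join C) join (D join E), with A delayed.
a = Op("A", blocked=True)
b, c, d, e = Op("B"), Op("C"), Op("D"), Op("E")
plan = Op("root join", [
    Op("join", [Op("A join B", [a, b]), c]),
    Op("D join E", [d, e]),
])

runnable_names = [op.name for op in maximal_runnable_subtrees(plan)]
print(runnable_names)
```

Each subtree returned would then get a materialization operator at its root, so that its result is saved to a temporary relation while A is still delayed.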

8
Cost-based Rescheduling

Selection of runnable subtrees to execute:
Traditional way: the maximal one.
Maximal efficiency: (P - MR)/(P + MW)
MW: cost of writing the materialized temporary
result to disk
MR: cost of reading the temporary result from disk
P: cost of executing the subtree
efficiency: improvement in response time per unit
of scrambling execution
Another iteration of rescheduling is started
until the delayed data has arrived.
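
The efficiency metric can be illustrated with a small Python sketch. It reads the formula as (P - MR)/(P + MW), which assumes a plus sign in the denominator (the operator is missing in the transcript). The candidate names and cost values are made up for illustration.

```python
def efficiency(p: float, mr: float, mw: float) -> float:
    """Response-time improvement per unit of scrambling work.

    p  : cost of executing the subtree
    mr : cost of reading the materialized result back from disk
    mw : cost of writing the materialized result to disk
    Assumes the denominator is P + MW.
    """
    return (p - mr) / (p + mw)


# Hypothetical candidates: name -> (P, MR, MW), in arbitrary cost units.
candidates = {
    "scan B": (10.0, 4.0, 5.0),
    "D join E": (40.0, 6.0, 8.0),
}

# Pick the runnable subtree with the highest efficiency.
best = max(candidates, key=lambda name: efficiency(*candidates[name]))
print(best, round(efficiency(*candidates[best]), 3))
```

With these numbers, "D join E" scores 34/48 and "scan B" scores 6/15, so the join is scheduled first. A subtree whose result costs more to read back than to recompute (MR > P) gets a negative efficiency and would not be worth materializing.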

9
Cost-based Operator Synthesis

The second phase starts when no more progress can
be made in phase 1.
Three optimization strategies:
Pair
(IN) Include Delayed
(ED) Estimated Delay

10
Pair: total work-based optimizer

Constructs a query plan containing only a single
join using two unblocked relations.
Analyzes each pair of unblocked relations sharing
a join predicate.
Chooses the join with the least total cost to
execute.
Materializes the result of the join to disk.
Avoids Cartesian products and joins whose results
take longer to read from disk than to compute
from scratch.
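
One iteration of the Pair policy could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pick_pair_join`, the dictionary-based cost tables, and all numbers are hypothetical, and a real optimizer would estimate these costs itself.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Optional, Set, Tuple

Pair = FrozenSet[str]


def pick_pair_join(
    unblocked: Set[str],
    predicates: Set[Pair],
    join_cost: Dict[Pair, float],
    read_cost: Dict[Pair, float],
) -> Optional[Tuple[Pair, float]]:
    """Return the cheapest qualifying join, or None if no join qualifies."""
    best: Optional[Tuple[Pair, float]] = None
    for r, s in combinations(sorted(unblocked), 2):
        pair = frozenset((r, s))
        if pair not in predicates:
            continue  # no shared join predicate: would be a Cartesian product
        cost = join_cost[pair]
        if read_cost[pair] > cost:
            continue  # reading the result back costs more than recomputing
        if best is None or cost < best[1]:
            best = (pair, cost)
    return best


# Hypothetical inputs: A is delayed, so only B, C, D, E are unblocked.
unblocked = {"B", "C", "D", "E"}
predicates = {frozenset("BC"), frozenset("CD"), frozenset("DE")}
join_cost = {frozenset("BC"): 50.0, frozenset("CD"): 20.0, frozenset("DE"): 30.0}
read_cost = {frozenset("BC"): 15.0, frozenset("CD"): 25.0, frozenset("DE"): 10.0}

choice = pick_pair_join(unblocked, predicates, join_cost, read_cost)
print(choice)
```

Here C join D is the cheapest join but is rejected because its result is more expensive to read back than to recompute, so D join E is chosen. A `None` result corresponds to the case where no qualified join exists and the policy simply waits for the delayed data.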

11
Pair continued

At the end of each join, checks for the arrival
of delayed data. If it has not arrived, does
another iteration.
If no qualified joins exist, waits for the delayed
data to arrive.
Reconstruction phase:
when all blocked relations become available, a
single query tree must be constructed
necessary because the Pair policy works only on
pairs of relations and does not maintain a
complete query plan

12
IN (Include Delayed): response time-based optimizer

Each iteration generates a complete alternative
plan
Chooses a very long delay duration (relative to
response time) to postpone any access to the
delayed data.
Chooses a plan with the greatest benefit
(potential improvement in response time) whose
risk (duration of the optimization step) can be
overlapped with the expected delay duration.

13
IN continued

Uses a risk/benefit knob (Rbknob) to prevent the
optimizer from choosing high-risk plans for
relatively small potential gains over low-risk
plans.
Rbknob: ratio of the amount of benefit the
optimizer is willing to give up for a given
savings in risk.
Increasing Rbknob -> more conservative plans.
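
The slides do not say how the knob enters the plan comparison, so the following Python sketch is one possible reading, not the authors' method. It assumes a linear score, benefit minus Rbknob times risk, under which a lower-risk plan wins whenever the benefit given up is less than Rbknob times the risk saved. Plan names and numbers are hypothetical.

```python
from typing import List, NamedTuple


class Plan(NamedTuple):
    name: str
    benefit: float  # potential improvement in response time
    risk: float     # duration of the scrambling step that must be overlapped


def choose_plan(plans: List[Plan], rbknob: float) -> Plan:
    # Assumed linear trade-off: each unit of risk costs `rbknob` units of benefit.
    return max(plans, key=lambda p: p.benefit - rbknob * p.risk)


plans = [
    Plan("aggressive", benefit=100.0, risk=40.0),
    Plan("conservative", benefit=70.0, risk=10.0),
]

low_knob_choice = choose_plan(plans, rbknob=0.5).name
high_knob_choice = choose_plan(plans, rbknob=2.0).name
print(low_knob_choice, high_knob_choice)
```

With a knob of 0.5 the aggressive plan scores 80 against 65 and is chosen. With a knob of 2.0 it scores 20 against 50, so the conservative plan wins, matching the slide's statement that a larger Rbknob yields more conservative plans.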

14
ED (Estimated Delay): response time-based optimizer

Similar to IN, except that it uses relatively
short delay estimates (relative to the response
time of the non-delayed plan).
Delay estimates are successively increased when
necessary to make more progress.
Motivation: use low-risk plans when delays are
short, and high-risk/high-payoff plans for
longer delays.

15
ED Continued

Execution steps:
Start by picking an estimated delay value equal
to 25% of the original query response time.
Repeat iterations until progress is too small.
Increase the delay value to 50% of response time.
Increase to 100% of response time if progress is
still too small.
For short delays: scrambling is more useful.
For long delays: scrambling is more aggressive.
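
The escalation schedule can be sketched in Python as below. The sketch reads the values 25, 50, and 100 as percentages of the original response time. The callbacks `run_iteration` and `data_arrived` and the threshold `min_progress` are hypothetical stand-ins for the optimizer and the runtime; the slides do not define what "too small" means.

```python
from typing import Callable, List, Tuple


def ed_scramble(
    response_time: float,
    run_iteration: Callable[[float], float],
    data_arrived: Callable[[], bool],
    min_progress: float,
) -> List[Tuple[float, float]]:
    """Run ED-style scrambling; return (fraction, progress) per iteration."""
    log: List[Tuple[float, float]] = []
    for fraction in (0.25, 0.50, 1.00):
        estimate = fraction * response_time
        while not data_arrived():
            progress = run_iteration(estimate)
            log.append((fraction, progress))
            if progress < min_progress:
                break  # progress too small: move to the next delay estimate
        if data_arrived():
            break  # delayed data is here: stop scrambling
    return log


# Simulated run: progress values are made up, and the data never arrives.
simulated_progress = iter([5.0, 1.0, 4.0, 0.5, 3.0, 0.2])
log = ed_scramble(
    response_time=100.0,
    run_iteration=lambda estimate: next(simulated_progress),
    data_arrived=lambda: False,
    min_progress=2.0,
)
fractions_used = [fraction for fraction, _ in log]
print(fractions_used)
```

In this simulated run each delay estimate is used for two iterations before progress falls below the threshold and the estimate is raised. Once the 100% estimate also stops making progress, the sketch simply returns, which corresponds to waiting for the delayed data.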

16
Experimental Setup

Two-phase randomized query optimizer
Workload based on queries from TPC-D benchmarks
Single query site, six remote data source sites.
Experimental methodology: plot the duration of
the initial delay of a remote source vs. the
response time achieved using scrambling

17
One case study

18
Lessons learned

1. With sufficient memory, all cost-based
approaches (Pair, IN, ED) can effectively hide
initial delays.
When delayed relations are encountered early in
the query execution, a delay as long as the normal
response time can be absorbed.
The heuristic-based algorithm performs worse than
the original query without scrambling, unless
there is a substantial amount of extra memory for
scrambling.

19
Lessons learned

2. Cost-based scrambling: a tradeoff between
conservative approaches and aggressive ones.
conservative: safer for short delays
aggressive: bigger savings for long delays
The amount of delay that can be hidden is limited
by the normal response time of the query.
As the delay increases beyond the normal response
time, the benefits of scrambling as a percentage
of total execution time decrease.
So, maybe more conservative plans?

20
Lessons Learned

3. As memory available for scrambling is reduced,
scrambling plans are more expensive.
A longer delay duration is needed for scrambling
to pay off.
Good predictions of delay duration are needed to
make good scrambling decisions.

21
Lessons Learned

4. Aggressiveness of IN and ED policies can be
adjusted using Rbknob.
Give up potential gains for long delays in order
to reduce risks for short delays
This tradeoff is useful in the absence of
accurate predictions of delay duration.

22
Lessons Learned

5. Pair (total work-based optimizer) may perform
unnecessary work:
lacks a global view of the scrambled plan
unable to pick a slightly sub-optimal plan to
generate interesting orders
Therefore, response time should be used as the
objective function to generate a complete and
reasonable scrambled plan.

23
Discussion

How to tune Rbknob?
Query scrambling can be very dangerous when the
delay duration is short. How to reduce the risks?
Cross products might be okay sometimes?
How realistic are the scenarios studied?
QS treats remote sites as black boxes. Additional
processing at data source sites?
Nonblocking join algorithm instead of hash join?

24
Discussion continued

Better delay duration estimates? (using probes)
Quality of decisions is limited by that of the
optimizer.
Replicas complicate the problem?
Query scrambling decisions should take
selectivity, size of intermediate results, and
importance of operators into consideration.
Only addresses the problem of initial delay;
ignores the bandwidth problem.

25
Discussion continued

Checking for arrival of delayed data during a
scrambling step?

26
Related Work

Approaches that do not adapt to changes in the
system parameters during query execution:
Volcano optimizer
introduces choose-plan operators into a query
plan to compensate for the lack of information
about system parameters at compile time.
Y. Ioannidis et al.
generate multiple alternative plans and choose
among them when the query is initialized.

27
Related Work

Rdb/VMS
starts multiple different executions of the same
logical operator, chooses the best one, and
terminates the others.
MIND heterogeneous database project
divides a query into subqueries, sends each
subquery to a site for execution, and composes
the results incrementally
resembles Pair.
Reordering left-deep join trees into bushy join
trees
Mermaid, InterViso

28
Conclusion

Three different approaches (Pair, IN, ED) are
investigated to make intelligent query scrambling
decisions
In general, use of a response time optimizer has
the advantage of being able to construct complete
query execution plans that include access to
delayed data
Fundamental trade-offs between risk aversion and
ability to hide large delays in ED and IN.
