Cost-based Query Scrambling for Initial Delays

Provided by: zmao

1
Cost-based Query Scrambling for Initial Delays
  • Tolga Urhan
  • Michael J. Franklin
  • Laurent Amsaleg

2
Introduction
  • Problem: Remote data access (e.g. from the Web)
    in query processing introduces unpredictable
    delays.
  • Traditional query scrambling based on simple
    heuristics is susceptible to problems from bad
    scrambling decisions.
  • Three different approaches to using query
    optimization for scrambling are proposed.
  • More intelligent scrambling decisions are made.

3
Query Scrambling Overview
  • Goal: To hide the delays encountered when
    obtaining data from remote sources by performing
    other useful work.
  • Consists of two phases:
      • Rescheduling phase
      • Operator Synthesis phase
  • Objective function of optimization is based on
    either:
      • total work
      • or response time

4
Goals of this paper
  • Focus on the problem of initial delay:
      • a delay in receiving the first tuple from a
        particular remote source.
  • Initial delay may be due to:
      • difficulty in establishing a connection to a
        remote source,
      • heavy load on the remote source,
      • a large amount of work that the remote source
        needs to perform (lack of global optimization)
  • Investigate the use of both response time-based
    and total work-based optimization for query
    scrambling

5
Cost-based Query Scrambling
  • Assumptions:
      • The query execution environment consists of a
        query site and remote data sources.
      • Processing work occurs at both the query site
        and the remote sites.
  • Example of a complex query tree:

[Figure: query tree with Select and Join operators over relations A, B, C, D and E; labels: Communication Link, Query Result]

6
Iterative Process of Query Scrambling
  • (1) Rescheduling: the execution plan of a query is
    dynamically rescheduled when a delay is detected.
  • (2) Operator Synthesis: new operators can be
    created when there are no other operators that
    can execute.
  • E.g., when retrieval of A tuples stalls:
      • (1) retrieve B tuples, execute D join E
      • (2) execute a new join between
        B and (D join E) join C

7
Cost-based Rescheduling
  • Identify runnable subtrees: subtrees made up
    entirely of non-blocked operators.
  • A runnable subtree can be scheduled out of order
    by inserting a materialization operator at its
    root.
  • Materialization operators: they issue Open, Next,
    and Close calls to the root of the subtree and
    save the results in a temporary relation.
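
As a rough illustration of how runnable subtrees might be identified, the sketch below walks a plan tree and returns the roots of the largest subtrees that contain no blocked operator. The `Node` class, the `blocked` flag, and the tree shape are all assumptions made for this example; none of them come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical plan node; `blocked` marks a scan stalled on a remote source."""
    name: str
    children: List["Node"] = field(default_factory=list)
    blocked: bool = False


def is_runnable(node: Node) -> bool:
    """A subtree is runnable when no operator inside it is blocked."""
    return not node.blocked and all(is_runnable(c) for c in node.children)


def maximal_runnable_subtrees(root: Node) -> List[Node]:
    """Roots of runnable subtrees that are not part of a larger runnable subtree."""
    if is_runnable(root):
        return [root]
    found: List[Node] = []
    for child in root.children:
        found.extend(maximal_runnable_subtrees(child))
    return found


# Made-up tree: Select(((A join B) join C) join (D join E)), with A delayed.
plan = Node("select", [
    Node("join3", [
        Node("join2", [
            Node("join1", [Node("A", blocked=True), Node("B")]),
            Node("C"),
        ]),
        Node("D join E", [Node("D"), Node("E")]),
    ]),
])
print([n.name for n in maximal_runnable_subtrees(plan)])  # ['B', 'C', 'D join E']
```

In a real engine each returned root would get a materialization operator placed above it; that step is left out here.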

8
Cost-based Rescheduling
  • Selection of runnable subtrees to execute:
      • Traditional way: the maximal one.
      • Cost-based way: the one with maximal efficiency,
        where efficiency = (P - MR)/(P + MW)
      • MW: cost of writing the materialized temporary
        result to disk
      • MR: cost of reading the temporary result from disk
      • P: cost of executing the subtree
      • efficiency: improvement in response time per unit
        of scrambling execution
  • Another iteration of rescheduling is started
    until the delayed data has arrived.
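
The selection rule can be sketched in a few lines. The code reads the efficiency denominator as the sum P + MW, which is an assumption about the formula; the `Subtree` class, the cost numbers, and the decision to skip candidates with non-positive efficiency are likewise illustrative choices rather than details taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subtree:
    name: str
    p: float   # P: cost of executing the subtree
    mw: float  # MW: cost of writing the materialized result to disk
    mr: float  # MR: cost of reading the temporary result back from disk


def efficiency(s: Subtree) -> float:
    """Response-time improvement (P - MR) per unit of scrambling work (P + MW)."""
    return (s.p - s.mr) / (s.p + s.mw)


def pick_subtree(candidates: List[Subtree]) -> Optional[Subtree]:
    """Pick the most efficient candidate; ignore ones that cannot pay off (assumption)."""
    useful = [s for s in candidates if efficiency(s) > 0]
    return max(useful, key=efficiency, default=None)


candidates = [
    Subtree("D join E", p=100.0, mw=20.0, mr=10.0),  # (100 - 10) / (100 + 20) = 0.75
    Subtree("scan B", p=40.0, mw=30.0, mr=30.0),     # (40 - 30) / (40 + 30), about 0.14
]
best = pick_subtree(candidates)
print(best.name if best else None)  # D join E
```

A subtree whose result costs as much to read back as to recompute gets an efficiency of zero or less, so materializing it early would not shorten the response time.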

9
Cost-based Operator Synthesis
  • Second phase starts when no more progress can be
    made in phase 1.
  • Three optimization strategies:
      • Pair
      • (IN) Include Delayed
      • (ED) Estimated Delay

10
Pair: total work-based optimizer
  • Construct a query plan containing only a single
    join using two unblocked relations.
  • Analyzes each pair of unblocked relations sharing
    a join predicate.
  • Chooses the join with the least total cost to
    execute.
  • Materialize the results of the join to disk.
  • Avoids Cartesian products and joins whose
    results would take longer to read from disk than
    to compute from scratch.
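
One iteration of the Pair policy might look like the sketch below. It assumes the cost estimates arrive as a dictionary keyed by relation pair, that a missing entry means the pair shares no join predicate, and that the execution cost also stands in for the cost of computing the result from scratch. `JoinEstimate` and `choose_join` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Optional, Tuple


@dataclass(frozen=True)
class JoinEstimate:
    cost: float       # estimated total cost of executing the join
    read_cost: float  # estimated cost of reading its materialized result from disk


def choose_join(
    unblocked: Iterable[str],
    estimates: Dict[FrozenSet[str], JoinEstimate],
) -> Optional[Tuple[str, str]]:
    """Return the cheapest qualifying join between two unblocked relations."""
    best_pair: Optional[Tuple[str, str]] = None
    best_cost = float("inf")
    for a, b in combinations(sorted(unblocked), 2):
        est = estimates.get(frozenset((a, b)))
        if est is None:
            continue  # no shared join predicate: this would be a Cartesian product
        if est.read_cost > est.cost:
            continue  # cheaper to recompute later than to read back from disk
        if est.cost < best_cost:
            best_pair, best_cost = (a, b), est.cost
    return best_pair


estimates = {
    frozenset(("B", "C")): JoinEstimate(cost=50.0, read_cost=5.0),
    frozenset(("D", "E")): JoinEstimate(cost=30.0, read_cost=4.0),
    frozenset(("C", "D")): JoinEstimate(cost=10.0, read_cost=40.0),  # rejected
}
print(choose_join({"B", "C", "D", "E"}, estimates))  # ('D', 'E')
```

Returning `None` corresponds to the case where no qualified join exists and the policy can only wait for the delayed data.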

11
Pair continued
  • At the end of each join, checks for the arrival
    of the delayed data. If it has not arrived,
    performs another iteration.
  • If no qualified joins exist, waits for the delayed
    data to arrive.
  • Reconstruction phase:
      • When all blocked relations become available, a
        single query tree must be constructed.
      • Necessary because the Pair policy works only on
        pairs of relations and does not maintain a
        complete query plan.

12
IN (Include Delayed): response time-based optimizer
  • Each iteration generates a complete alternative
    plan
  • Chooses a very long delay duration (relative to
    response time) to postpone any access to the
    delayed data.
  • Chooses a plan with the greatest benefit
    (potential improvement in response time) whose
    risk (duration of the optimization step) can be
    overlapped with the expected delay duration.

13
IN continued
  • Use a risk/benefit knob (Rbknob) to prevent the
    optimizer from choosing high-risk plans for
    relatively small potential gains over low-risk
    plans.
  • Rbknob: the ratio of the amount of benefit the
    optimizer is willing to give up for a given
    savings in risk.
  • Increasing Rbknob --> more conservative plans.
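
One way to read the knob is as a penalty weight on risk: a plan is scored as benefit minus Rbknob times risk, so the optimizer gives up one unit of benefit only if that saves at least 1/Rbknob units of risk. The sketch below uses that reading; the scoring formula, the `Plan` class, and the numbers are assumptions made for illustration and may differ from how the knob is actually applied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    name: str
    benefit: float  # potential improvement in response time
    risk: float     # duration of the scrambling step


def choose_plan(plans: List[Plan], expected_delay: float, rbknob: float) -> Optional[Plan]:
    """Pick the best-scoring plan whose risk can overlap with the expected delay."""
    feasible = [p for p in plans if p.risk <= expected_delay]
    # Assumed scoring: trade benefit against risk at the rate given by rbknob.
    return max(feasible, key=lambda p: p.benefit - rbknob * p.risk, default=None)


plans = [
    Plan("conservative", benefit=40.0, risk=10.0),
    Plan("aggressive", benefit=60.0, risk=50.0),
]
print(choose_plan(plans, expected_delay=100.0, rbknob=0.0).name)  # aggressive
print(choose_plan(plans, expected_delay=100.0, rbknob=1.0).name)  # conservative
```

With the knob at zero only benefit counts; raising it shifts the choice toward the low-risk plan, which matches the direction described on the slide.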

14
ED (Estimated Delay): response time-based optimizer
  • Similar to IN except that it uses relatively
    short delay estimates (relative to the response
    time of the non-delayed plan)
  • Delay estimates successively increase when
    necessary to make more progress
  • Motivation: use low-risk plans when delays are
    short, and high-risk/high-payoff plans for
    larger delays.

15
ED Continued
  • Execution steps:
      • Start by picking an estimated delay value equal
        to 25% of the original query response time
      • Repeat iterations until progress is too small
      • Increase the delay value to 50% of response time
      • Increase to 100% of response time if progress is
        still too small.
  • For short delays, scrambling is more useful.
  • For long delays, scrambling is more aggressive.
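
The escalation steps above can be sketched as a loop over delay estimates. The callbacks `run_iteration` and `delayed_data_arrived`, the `min_progress` threshold, and the way progress is measured are all hypothetical; only the 25%, 50%, 100% sequence follows the slide.

```python
from typing import Callable, List, Sequence


def ed_scramble(
    response_time: float,
    run_iteration: Callable[[float], float],
    delayed_data_arrived: Callable[[], bool],
    min_progress: float,
    fractions: Sequence[float] = (0.25, 0.50, 1.00),
) -> List[float]:
    """Run scrambling iterations, raising the delay estimate when progress stalls.

    Returns the delay estimate used for each iteration, for inspection.
    """
    used: List[float] = []
    for fraction in fractions:
        estimated_delay = fraction * response_time
        while not delayed_data_arrived():
            used.append(estimated_delay)
            progress = run_iteration(estimated_delay)
            if progress < min_progress:
                break  # too little progress: move on to a larger estimate
        else:
            return used  # the delayed data arrived, so scrambling stops
    return used  # every estimate exhausted; the caller waits for the data


# Toy run: progress stalls twice, and the data arrives after four iterations.
progress_values = iter([10.0, 0.0, 0.0, 10.0])
calls: List[float] = []


def fake_iteration(delay: float) -> float:
    calls.append(delay)
    return next(progress_values)


print(ed_scramble(100.0, fake_iteration, lambda: len(calls) >= 4, min_progress=1.0))
# [25.0, 25.0, 50.0, 100.0]
```

The `while ... else` form keeps the two exit conditions apart: `break` means the current estimate stopped paying off, while falling through to `else` means the delayed data showed up.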

16
Experimental Setup
  • Two-phase randomized query optimizer
  • Workload based on queries from the TPC-D benchmark
  • Single query site, six remote data source sites.
  • Experimental methodology: plot the duration of the
    initial delay of a remote source vs. the response
    time achieved using scrambling

17
One case study

18
Lessons learned
  • 1. With sufficient memory, all cost-based
    approaches (Pair, IN, ED) can effectively hide
    initial delays.
  • When delayed relations are encountered early in
    the query execution, a delay as long as the normal
    response time can be absorbed.
  • The heuristic-based algorithm performs worse than
    the original query without scrambling, unless
    there is a substantial amount of extra memory for
    scrambling.
19
Lessons learned
  • 2. Cost-based scrambling involves a tradeoff
    between conservative and aggressive approaches.
      • conservative: safer for short delays
      • aggressive: bigger savings for long delays
  • The amount of delay that can be hidden is limited
    by the normal response time of the query.
  • As the delay increases beyond the normal response
    time, the benefits of scrambling as a percentage
    of total execution time decrease.
  • So, maybe more conservative plans?

20
Lessons Learned
  • 3. As memory available for scrambling is reduced,
    scrambling plans are more expensive.
  • Longer delay duration is needed for scrambling to
    pay off
  • Good predictions of delay duration are needed to
    make good scrambling decisions

21
Lessons Learned
  • 4. Aggressiveness of IN and ED policies can be
    adjusted using Rbknob.
  • Give up potential gains for long delays in order
    to reduce risks for short delays
  • This tradeoff is useful in the absence of
    accurate predictions of delay duration.

22
Lessons Learned
  • 5. Pair (total work-based optimizer) may perform
    unnecessary work
  • lack of a global view of the scrambled plan
  • unable to pick a slightly sub-optimal plan that
    generates interesting orders
  • Therefore, response time should be used as the
    objective function to generate a complete and
    reasonable scrambled plan.

23
Discussion
  • How to tune Rbknob?
  • Query scrambling can be very dangerous, when
    delay duration is short. How to reduce the risks?
  • Cross products might be okay sometimes?
  • How realistic are the scenarios studied?
  • QS treats remote sites as black boxes. What about
    additional processing at data source sites?
  • Use a nonblocking join algorithm instead of hash
    join?

24
Discussion continued
  • Better delay duration estimates? (using probes)
  • Quality of decisions limited by that of
    optimizer.
  • Replicas complicate problems?
  • Query scrambling decision should take
    selectivity, size of intermediate results,
    importance of operators into consideration.
  • Only addresses the problem of initial delay;
    ignores bandwidth problems.

25
Discussion continued
  • Checking for arrival of delayed data during a
    scrambling step?

26
Related Work
  • Approaches that do not adapt to changes in the
    system parameters during query execution:
  • Volcano optimizer
      • introduces choose-plan operators into a query
        plan to compensate for the lack of information
        about system parameters at compile time.
  • Y. Ioannidis et al.
      • generates multiple alternative plans and chooses
        among them when the query is initialized.

27
Related Work
  • Rdb/VMS
      • Starts multiple different executions of the same
        logical operator, chooses the best one, and
        terminates the others.
  • MIND heterogeneous database project
      • divides a query into subqueries, sends each
        subquery to a site for execution, and composes
        the results incrementally
      • resembles Pair.
  • Reorder left-deep join trees into bushy join trees
      • Mermaid, InterViso
28
Conclusion
  • Three different approaches (Pair, IN, ED) are
    investigated to make intelligent query scrambling
    decisions
  • In general, use of a response time optimizer has
    the advantage of being able to construct complete
    query execution plans that include access to
    delayed data
  • Fundamental trade-offs between risk aversion and
    ability to hide large delays in ED and IN.