Transcript and Presenter's Notes

Title: Interactive Query Processing


1
Interactive Query Processing
  • Vijayshankar Raman
  • Computer Science Division
  • University of California at Berkeley

2
Motivation: Nature of querying
  • querying = extracting information from data sets
  • different techniques in different settings
  • intrinsically slow
  • significant user-system interaction
  • info. seekers work iteratively, gradually
    refining requests based on feedback (O'Day and
    Jeffries '93, Koenemann and Belkin '96)

3
Problems with traditional solutions
  • mismatch between system functionality and
    iterative mode of HCI
  • black box functionality
  • batch processing
  • frustrating delays in iterative process

Query: find the average grade of students in each
college
4
Interactive processing
(diagram: query → process → exact answer)
  • HCI requirements
  • users must get continual feedback on results of
    processing
  • allow users to control processing based on prior
    feedback

5
Interactive processing
(diagram: query → process → result estimates)
  • HCI requirements
  • users must get continual feedback on results of
    processing
  • allow users to control processing based on prior
    feedback
  • performance goals
  • not to minimize time to give complete results
  • give continually improving partial results
  • adapt to dynamically specified performance goals

6
My Research
  • interactive query processing
  • interactive data cleaning (Potter's
    Wheel)
  • assume dataflow thru pipelined operators
  • support for dynamic user control in
    traditional query proc. architectures
  • adaptively generating partial results in
    response to user control
  • still more aggressive adaptation -- state
    modules

(diagram: pipelined dataflow over relations R, S, T)
7
Talk Outline
  • motivation and context
  • support for dynamic user control in traditional
    query proc. architectures
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • user interface for displaying partial results
  • impact on routing
  • wrapup

8
Design goals in supporting user control
  • make minimal change to system architecture
  • must be independent of particular query
    processing algorithms
  • no delay in processing

9
Online Reordering (Raman et al. '99, '00)
  • users perceive data being processed over time
  • prioritize processing for interesting tuples
  • interest based on user-specified preferences
  • reorder dataflow so that interesting tuples go
    first
  • encapsulate reordering as pipelined dataflow
    operator

(diagram: reorder operator inserted into a dataflow over R, S, T)
10
Context: an application of reordering
  • online aggregation (Hellerstein/Haas/Wang '97,
    Haas/Hellerstein '99)
  • for SQL aggregate queries, give gradually
    improving estimates
  • with confidence intervals
  • allow users to speed up estimate refinement for
    groups of interest
  • prioritize for processing at a per-group
    granularity

SELECT AVG(gpa) FROM students GROUP BY college
11
Online Aggregation Screenshot
SELECT AVG(gpa) FROM students GROUP BY college
12
Framework for Online Reordering
(diagram: produce (network xfer.) → reorder → process; the stream acddbadb... is reordered to abcdabc.. according to user interest f(t))
  • want no delay in processing
  • in general, reordering can only be best-effort
  • typically process/consume slower than produce
  • exploit throughput difference to reorder
  • two aspects
  • mechanism for best-effort reordering
  • reordering policy

13
Juggle mechanism for reordering
(diagram: produce → [prefetch] buffer [getNext] → process/consume, with spool/enrich between the buffer and a side disk)
  • two threads -- prefetch from input
    -- spool/enrich from auxiliary side disk
  • juggle data between buffer and side disk
  • keep buffer full of interesting items
  • getNext chooses best item currently on buffer
  • getNext, enrich/spool decisions -- based on
    reordering policy
  • side disk management
  • hash index, populated in a way that postpones
    random I/O
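A rough sketch of the juggle's bookkeeping, in Python. This is illustrative only: class and method names are invented, the side disk is modeled as an in-memory dict, and the real operator runs prefetch and spool/enrich as concurrent threads over a hash-indexed on-disk side file.

```python
import collections, math

class Juggle:
    """Best-effort reordering between a producer and a slower consumer (sketch)."""

    def __init__(self, prefs, buffer_cap=1000):
        self.prefs = prefs                                      # group -> user preference UP_i
        self.buffer_cap = buffer_cap
        self.buffer = collections.defaultdict(collections.deque)
        self.side = collections.defaultdict(collections.deque)  # stand-in for the side disk
        self.delivered = collections.Counter()                  # n_i, tuples handed to consumer
        self.buffered = 0

    def _score(self, g):
        # marginal QOF gain of delivering one more group-g tuple
        # (confidence-interval policy: UP_i / (n_i * sqrt(n_i)))
        n = max(self.delivered[g], 1)
        return self.prefs.get(g, 0.0) / (n * math.sqrt(n))

    def prefetch(self, group, tup):
        """Producer side: keep the buffer full of interesting tuples."""
        if self.buffered < self.buffer_cap:
            self.buffer[group].append(tup)
            self.buffered += 1
            return
        worst = min((g for g in self.buffer if self.buffer[g]), key=self._score)
        if self._score(group) > self._score(worst):
            self.side[worst].append(self.buffer[worst].popleft())  # spool a dull tuple
            self.buffer[group].append(tup)
        else:
            self.side[group].append(tup)                           # spool the new tuple

    def get_next(self):
        """Consumer side: hand out the tuple that raises QOF the most."""
        live = [g for g in self.buffer if self.buffer[g]] or \
               [g for g in self.side if self.side[g]]
        if not live:
            return None
        g = max(live, key=self._score)
        if self.buffer[g]:
            src = self.buffer
            self.buffered -= 1
        else:
            src = self.side                                        # enrich on demand
        self.delivered[g] += 1
        return g, src[g].popleft()
```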

14
Reordering policies
GOAL: a good permutation of items t1…tn to t'1…t'n
(plot: QOF rising over time toward the goal)
  • quality of feedback for a prefix t'1, t'2, …, t'k
  • QOF(UP(t'1), UP(t'2), …, UP(t'k)), UP = user
    preference
  • determined by application
  • goodness of reordering = dQOF/dt
  • implication for juggle mechanism
  • process gets item from buffer that increases QOF
    the most
  • juggle tries to maintain buffer with such items

15
QOF in Online Aggregation
  • avg. weighted confidence interval
  • preference acts as weight on confidence interval
  • QOF ∝ Σi UPi / √ni , ni = number of tuples
    processed from group i
  • process pulls items from the group with max
    UPi / (ni √ni)
  • desired ratio of group i tuples on buffer =
    UPi^(2/3) / Σj UPj^(2/3)
  • juggle tries to maintain this by enrich/spool
    (see the sketch below)
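The group-selection rule and the buffer-composition target above reduce to a few lines of arithmetic; a minimal Python sketch (UP_i are the user-preference weights, n_i the tuples processed so far; function names are mine):

```python
import math

def next_group(prefs, processed):
    """Pick the group whose next tuple helps the weighted confidence
    intervals most: argmax_i UP_i / (n_i * sqrt(n_i))."""
    return max(prefs, key=lambda g: prefs[g] / (max(processed.get(g, 0), 1) ** 1.5))

def desired_buffer_ratios(prefs):
    """Buffer composition the juggle aims for: UP_i^(2/3) / sum_j UP_j^(2/3)."""
    w = {g: up ** (2.0 / 3.0) for g, up in prefs.items()}
    total = sum(w.values())
    return {g: v / total for g, v in w.items()}

# example: three colleges, with engineering weighted 4x
prefs = {"engineering": 4.0, "letters": 1.0, "business": 1.0}
print(next_group(prefs, {"engineering": 10, "letters": 10, "business": 10}))
print(desired_buffer_ratios(prefs))
```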

16
Other QOF functions
  • rate of processing (for a group) ∝ preference
  • QOF ∝ -Σi (ni - n·UPi)²  (variance from ideal
    proportions)
  • process pulls items from the group with max (n·UPi -
    ni)
  • desired ratio of group i tuples in buffer = UPi

17
Results Reordering in Online Aggregation
  • implemented in Informix UDO server
  • experiments with modified TPC-D queries
  • questions
  • how much throughput difference is needed for
    reordering?
  • can reordering handle skewed data?
  • one stress test: skew + very small proc. cost
  • index-only join
  • 5 order priorities, Zipf distribution

SELECT AVG(o_totalprice), o_orderpriority FROM
order WHERE EXISTS (SELECT * FROM lineitem
WHERE l_orderkey = o_orderkey) GROUP BY
o_orderpriority
(plan: index scan → juggle → process → consume)
18
Performance results
(plots: tuples processed vs. time and confidence interval vs. time, per group)
  • 3 times faster for interesting groups
  • 2% completion time overhead

19
Overall Findings
  • higher processing costs
  • index/hash join, subquery,
  • reordering easy
  • very low processing costs
  • juggle constrained by density of interesting
    tuples
  • outlier groups hard to speed up
  • better to use index stride (Hellerstein/Haas/Wang
    '97)
  • needs pre-computed index
  • reordering becomes easier over time
  • question to answer
  • where to place juggle?

20
Outline
  • motivation and context
  • support for dynamic user control in traditional
    query proc. architectures
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • user interface for displaying partial results
  • impact on routing
  • wrapup

21
Incremental results
  • traditional arch. also generate continual result
    tuples
  • arises from continual dataflow thru pipelining
    operators
  • much work on pipelining joins (Wilschut/Apers '91,
    Haas/Hellerstein '99, Ives et al. '99,
    Urhan/Franklin '00)
  • this is too rigid
  • especially in distributed environments

22
Context: Query processing in Telegraph
  • Telegraph: an adaptive dataflow system to query
    diverse, distributed sources
  • much data available as services over the Internet
  • currently only accessible by browse/search/forms
  • semi-automatically combine this data
  • examples
  • campaign finance information (Election 2000):
    Federal Election Commission ⋈ Yahoo Home Prices ⋈
    Census ⋈ APBnews Crime Ratings ⋈ Maps ⋈ IMDB
  • restaurant information
  • SwitchBoard ⋈ Fodors ⋈ S.F. Chronicle ⋈
    MapQuest ⋈ Health inspection reports

23
Partial results
  • complete result tuples too rigid
  • source latencies high, diverse
  • dynamic source variations, delays
  • query does not capture user desires
  • non-expert and even expert users query too broad

24
Partial results
  • complete tuples too rigid
  • source latencies high, diverse
  • dynamic source variations, delays
  • query does not capture user desires
  • non-expert and even expert users query too broad
  • want to process queries flexibly
  • give partial result tuples asap
  • adapt dynamically to user preferences and source
    variations

25
Correctness of partial results
  • some columns essential for UI
  • e.g. group-by columns / sort-by columns
  • for maximum flexibility -- outer-join semantics
  • good idea for Web sources
  • or -- strict join semantics
  • no partial results without ensuring match exists
  • key constraints helpful
  • aggregates
  • update early, and compensate later
  • statistical guarantees for aggregates difficult
  • fanouts unknown (can be 0!)
  • key constraints helpful
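One way to read "update early, and compensate later" for an AVG over outer-join-style partial results is a running aggregate whose contributions can be retracted; a hypothetical sketch, not Telegraph's actual bookkeeping:

```python
class CompensatingAvg:
    """Running average that is updated eagerly from partial results
    and compensated later if a contribution must be withdrawn."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def add(self, value):          # eager update from a partial result
        self.total += value
        self.count += 1

    def compensate(self, value):   # retract a value that turned out not to join
        self.total -= value
        self.count -= 1

    def current(self):
        return self.total / self.count if self.count else None

avg = CompensatingAvg()
avg.add(3.2); avg.add(3.8)         # early, optimistic updates
avg.compensate(3.8)                # later found to have no matching tuple
print(avg.current())               # 3.2
```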

26
Dynamic query plans
(diagram: eddy adaptively routing tuples from R, S, T through join modules)
  • Eddy (Avnur and Hellerstein 2000)
  • router for directing data thru modules
  • minimize completion time by adaptively choosing
    join order
  • all partial tuples generable
  • limitations
  • prioritizes on modules, not data
  • small memory buffer

27
Partial result generation
(diagram: eddy routing inputs R, P, S through a set of modules)
  • our focus: continual partial results
  • Eddy must decide
  • what tuple to route next
  • where to route it
  • based on user preferences and module properties
  • need routing policy and a reordering mechanism
  • eddy memory buffer and module queues bounded

28
Prioritizing tuples
(diagram: eddy with juggles between inputs R, P, S and the modules)
  • enhance Eddy with Juggle
  • reorder exploiting slowness of modules
  • synergy -- juggle location problem solved!
  • juggle everywhere, to the extent of throughput
    difference

29
Routing and reordering policy
(plot: QOF rising over time toward the goal)
  • GOAL: at any time, route so as to maximize dQOF/dt =
    benefit of sending tuple to module / cost
    (sketched below)
  • cost: estimated from data rates to/from module
  • benefit: depends on application and UI
  • how partial results impact the UI
  • user preferences
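A minimal sketch of that routing rule in Python; the benefit and cost estimators here are stand-ins for the UI-derived benefit and the data-rate-derived cost described above:

```python
def choose_route(tup, modules, benefit, cost):
    """Send the tuple to the module maximizing dQOF/dt,
    estimated as expected benefit / expected cost."""
    return max(modules, key=lambda m: benefit(tup, m) / cost(m))

# example with made-up estimators: cost from observed per-tuple latency,
# benefit from how much the module's output would improve the display
modules = ["income_index", "crime_index"]
cost = {"income_index": 0.08, "crime_index": 0.30}.__getitem__
benefit = lambda tup, m: 1.7 if m == "crime_index" else 1.0
print(choose_route({"state": "CA"}, modules, benefit, cost))
```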

30
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • Telegraph UI displaying results and inferring
    preferences
  • experimental results
  • wrapup

31
Telegraph UI
  • screenshot

32
Telegraph UI
  • partial results displayed on screen as they
    arrive
  • multi-resolution spreadsheet
  • values clustered into groups
  • can roll-up or drill-down to see different
    granularities
  • different columns visible at different drilldown
    levels
  • client has hash table mapping groups to values
  • navigation
  • vertical/horizontal scrolling
  • rollup/drilldown

33
Getting user preferences
  • infer from navigation
  • row/column scrolling, group drill down and rollup
  • prioritize visible rows/columns
  • query evolution
  • subset one-size-fits-all queries
  • future work: query expansion
  • explicit
  • up/down buttons on columns
  • at the cell level -- need for some expensive
    sources
  • map to QOF metric on partial results

34
QOF: Benefit of a partial result
  • depends on user preferences
  • benefit of updating a cell in output
  • = row priority × column weight × Δ cell resolution
  • incremental cell resolution: how much does one
    extra update add to the cell's value
  • scalars -- 1
  • aggregations -- change in confidence interval
  • informing user about execution progress
  • convey cell resolution on UI
  • future work

35
QOF: Benefit of Routing a Tuple
  • route tuple according to expected benefit and
    cost
  • benefit of sending a tuple t to a module M and
    forming result set T = Σ over cells c ∈ T of the
    benefit of updating c (see the sketch below)
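Combining the two slides, a hedged sketch of the benefit computation; the function names and example numbers are illustrative, with Δ resolution taken as 1 for scalar cells and the expected confidence-interval change for aggregate cells:

```python
def cell_benefit(row_priority, column_weight, delta_resolution):
    """Benefit of updating one output cell:
    row priority x column weight x incremental cell resolution."""
    return row_priority * column_weight * delta_resolution

def routing_benefit(result_cells):
    """Benefit of routing tuple t to module M and forming result set T:
    sum over cells c in T of the benefit of updating c."""
    return sum(cell_benefit(*c) for c in result_cells)

# example: a tuple whose result would update two aggregate cells and one scalar
print(routing_benefit([(1.0, 2.0, 0.5), (1.0, 1.0, 0.5), (0.3, 1.0, 1.0)]))
```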

36
Throughput difference
  • Bush Contributors ⋈ Income (index) ⋈ APBnews
    Crime Ratings (index)

(plot: number of tuples read vs. time (s) for Bush Contributors, Income, and Crime Ratings)
37
Benefit of giving partial results
  • Bush Contributors ⋈ Income (index) ⋈ APBnews
    Crime Ratings (index)

(plot: number of partial results vs. time (s), comparing partial results against complete results for Bush Contributors, Income, and Crime Ratings)
38
Effect of Delays
Bush Contributors ⋈ Income (index) ⋈ APBnews
Crime Ratings (index), with a delay in APBnews Crime
Ratings
(plot, log scale: number of partial results and number of complete results vs. time (s), with the delay marked)
39
Prioritizing particular values
  • SELECT AVG(Income) FROM Bush Donors, Census
    GROUP BY BushDonors.State
  • scrolling: AZ, AR, CA, CO, CT,
    KY, LA, MA, MD, MI, TX, UT,
    VA, VI, VT

(plot: aggregate updates vs. time (s))
42
Distribution of contributions
(bar chart: number of Bush Contributors by state)
43
Prioritizing particular columns (1)
  • how do we prioritize external sources?
  • increase number of threads used to probe it
  • resource scheduling rather than data scheduling
  • if the eddy sends na tuples to A and nb to B per second:
  • na < threadsA / latency(A)
  • nb < threadsB / latency(B)
  • threadsA + threadsB < number of threads per
    query
  • GOAL: maximize na·ΔQOF(A) + nb·ΔQOF(B)
    (see the allocation sketch below)
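One possible way to pick per-source thread counts under these constraints is to hand out threads greedily by marginal gain ΔQOF/latency, after reserving one thread per source so neither column starves; the reservation and the numbers are assumptions, not stated on the slide:

```python
def allocate_threads(sources, total_threads):
    """sources: {name: (latency_seconds, delta_qof_per_tuple)}.
    Maximizes sum over sources of n_s * dQOF(s), with
    n_s = threads_s / latency_s and sum threads_s <= total_threads."""
    threads = {s: 1 for s in sources}                # assumed: one thread each
    for _ in range(total_threads - len(sources)):
        # each extra thread on s buys (1/latency_s) more probes per second,
        # each probe worth delta_qof_s
        best = max(sources, key=lambda s: sources[s][1] / sources[s][0])
        threads[best] += 1
    return threads

# example: Crime Ratings is slower but currently more interesting to the user
print(allocate_threads({"Income": (0.1, 1.0), "CrimeRatings": (0.5, 8.0)}, 6))
```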

44
Prioritizing particular columns (2)
  • with 3 threads for each source

(plot: number of tuples read vs. time (s) for Income and Crime Ratings)
45
Prioritizing particular columns (2)
  • with 5 threads for Crime Ratings, 1 for Income

(plot: number of tuples read vs. time (s) for Income and Crime Ratings)
46
Prioritizing particular columns (3)
  • more generally, optimize with multiple resource
    constraints
  • netizenship
  • future work

47
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • more aggressive adaptation state modules
  • wrapup

48
Granularity of Query Operators
  • relational operators are logical abstractions
  • encapsulate multiple physical effects
  • inflexible in handling unexpected changes
  • cannot gracefully adapt
  • access method/data source selection
  • join algorithm selection
  • resource allocation, e.g. memory
  • delays (Query Scrambling, XJoin)
  • want to encapsulate at level of physical operators

49
State Modules
(diagram: eddy with hash join and index join modules over sources R and P)
  • isolate (and share) state in State Modules
    (SteMs)
  • work sharing
  • routing flexibility
  • can devise routing constraints so that query
    execution ≡ routing
  • can adapt access methods, join algorithms
    gracefully
  • directly measure and adapt resource consumption ??

50
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for generating more aggressive
    partial results
  • policy for generating partial results
  • more aggressive adaptation state modules
  • wrapup

51
Related Work
  • information retrieval
  • ranked retrieval, relevance feedback
  • search strategies, Berry Picking
  • incremental query processing
  • pipelining hash joins (Haas/Hellerstein '99,
    Ives et al. '99, Urhan/Franklin '00)
  • top N / fast-first queries (Carey/Kossmann '97,
    Antoshenkov/Ziauddin '96)
  • adaptivity
  • parametric query plans (Graefe/Cole '94)
  • mid-query reoptimization (Urhan/Franklin/Amsaleg '98,
    Kabra/DeWitt '98, Ives et al. '99)
  • competition (Antoshenkov/Ziauddin '96)
  • miscellaneous
  • precomputed summaries (OLAP, materialized views,
    AQUA)
  • parachute queries (Bonnet/Tomasic '98)
  • APPROXIMATE (Vrbsky/Liu '93)

52
Summary
  • applications, query processors need tighter
    coupling
  • online reordering
  • effective way of supporting dynamic user control
  • partial results as desired
  • embed reordering within dynamically controlled
    dataflow
  • hard to map user-interaction needs into concrete
    algorithm performance goals
  • wanted: benchmarks based on user/application
    traces

53
Future Work
  • session-granularity query processing
  • query evolution
  • lazy evaluation: user/client navigates through
    partial results
  • helps thin clients
  • closer dbms-application interaction
  • evaluation tools
  • benchmarks driven by traces/user studies
  • for more information
  • http://telegraph.cs.berkeley.edu/
  • http://control.cs.berkeley.edu/

54
HCI Motivation: Why Interactivity?
  • Berry picking (Bates '90, '93)
  • user studies
  • decision support (O'Day and Jeffries '93)
  • relevance feedback (Koenemann and Belkin '96)

55
Disk data layout during juggling
  • performance goal
  • favor early results: optimize Phase 1 at the expense
    of Phase 2
  • spool = sequential I/O, enrich = random I/O
  • reordering in Phase 2 much easier than in Phase 1
  • enrich needs approx. index on side-disk
  • have hash-index on tuples, according to user
    interest
  • done at group granularity

56
Other applications of reordering
  • can add reorder to any dataflow
  • will later discuss application for query systems
    that give aggressive partial results
  • also useful in batch query processing
  • sorting often used in query plans for performance
    gains
  • can be replaced by best-effort reordering
  • little performance hit, but plan is now
    pipelining
  • will not discuss further in this talk
  • spreadsheets

57
Telegraph-Client API
(diagram: client sends an SQL query and preferences to Telegraph; Telegraph streams partial results back)
  • preferences
  • col priorities
  • <predicate, row priority, row-col priorities>
  • e.g. <1, 1.0, none>, <2, 1.7, none>,
    <3, 1.3, map:high>, <8/33014, 1.0,
    none>, <8/60610, 1.0, none>
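The preference records above might be encoded roughly as follows; field names and the priority values are hypothetical, not the actual Telegraph client API:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Preference:
    predicate: str                      # which rows this applies to, e.g. "8/60610"
    row_priority: float = 1.0           # relative weight for matching rows
    row_col_priorities: Dict[str, float] = field(default_factory=dict)  # per-column overrides

prefs = [
    Preference("1", 1.0),
    Preference("2", 1.7),
    Preference("3", 1.3, {"map": 2.0}),  # e.g. boost the expensive map column
    Preference("8/33014", 1.0),
    Preference("8/60610", 1.0),
]
```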

58
Granularity of Reordering/Routing
  • reorder and route tuples at granularity of a
    group
  • group = <base relations, predicate>
  • groups created and deleted dynamically, as user
    navigates in UI
  • group predicate may be a subset of
    application-specified row predicate
  • final row depends on values this tuple joins with

59
Interactive query processing with non-pipelining
operators
  • basic techniques independent of pipelined
    operators.
  • reordering --- can be used in general query
    plans, although effectiveness may be hindered by
    blocking operators
  • State Modules applicable in general -- helps with
    adaptivity
  • much work on making plan operators pipelining --
    ripple join, XJoin, Tukwila
  • reordering itself can be used to avoid blocking
    sorts in some cases

60
Query processing using SteMs (1)
  • SteM operations -- a SteM works like an index on a table
  • build(tuple): add tuple to SteM
  • probe(tuple): find all matches among built values
  • bounce back the tuple if all matches have not been found
    and the probe tuple is not cached elsewhere
  • performing joins by routing through SteMs
  • index joins
  • regular synchronous index joins easy -- have no
    state
  • over distributed sources, lookups cached by
    building into SteM
  • separating cache allows Eddy to distinguish
    access cost
  • asynchronous index joins GW00
  • helps for high-throughput Internet sources
  • SteM acts as rendezvous buffer for pending probe
    results
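A minimal sketch of the SteM interface just described; the data structures and the bounce-back test are simplified stand-ins for the real module:

```python
import collections

class SteM:
    """State Module: shared, index-like state over one relation's tuples."""

    def __init__(self, key):
        self.key = key                                 # join-key attribute
        self.index = collections.defaultdict(list)     # key value -> built tuples
        self.complete = False                          # has the whole source been built?

    def build(self, tup):
        """Add a tuple of this SteM's relation."""
        self.index[tup[self.key]].append(tup)

    def probe(self, tup, probe_key):
        """Find matches for a tuple from another relation.
        Returns (matches, bounced): the probe is bounced back to the eddy
        when the source is not fully built, so missing matches can still arrive."""
        matches = self.index.get(tup[probe_key], [])
        bounced = not self.complete
        return matches, bounced

# usage: build R into a SteM on 'orderkey', then probe with an S tuple
stem_r = SteM("orderkey")
stem_r.build({"orderkey": 7, "price": 120.0})
print(stem_r.probe({"orderkey": 7, "qty": 3}, "orderkey"))
```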

61
Query processing using SteMs (2)
  • performing joins by routing through SteMs (contd)
  • doubly pipelined hash join
  • use two SteMs, one on each source
  • can also do extensions (e.g. Tukwila [IFF '99]),
    and even non-pipelined joins, using appropriate
    SteMs
  • thus can simulate any static query plan by
    appropriate routing

(diagram: eddy routing R and S tuples to build into and probe SteMR and SteMS)
62
Query processing using SteMs (3)
  • how to do this in general?
  • don't want the Eddy to be a super-join
  • want it to route as per user desires and source
    properties, independent of any join algorithm
  • routing constraint
  • each tuple must have a non-zero probability of
    being sent to every appropriate module
  • can avoid repeated looping by marking tuples
  • Theorem: any routing that satisfies the above
    constraint will produce all query results and
    terminate
  • could get duplicates
  • prune at application, or
  • enforce atomicity in tuple routing
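A sketch of a router that respects the constraint: every not-yet-visited module keeps a non-zero probability, and tuples are marked with the modules they have already visited to avoid repeated looping (structure and names are invented, not the Telegraph implementation):

```python
import random

def route(tup, modules, weight):
    """Pick the next module for a tuple. Every module the tuple has not yet
    visited keeps a non-zero probability, so any such routing produces all
    query results and terminates."""
    eligible = [m for m in modules if m not in tup["visited"]]
    if not eligible:
        return None                                       # tuple is done
    weights = [max(weight(m), 1e-9) for m in eligible]    # never exactly zero
    choice = random.choices(eligible, weights=weights, k=1)[0]
    tup["visited"].add(choice)                            # mark to avoid repeated loops
    return choice

tup = {"payload": {"orderkey": 7}, "visited": set()}
print(route(tup, ["SteM_R", "SteM_S", "filter_P"], weight=lambda m: 1.0))
```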

63
Benefits of SteMs
  • adaptivity
  • join algorithm and access path selection are
    determined by eddy routing
  • hence Eddy can dynamically choose these based on
    source properties
  • in fact, Eddy can do hybrid join algorithms
  • adapt to dynamic delays, etc.
  • work sharing -- for query evolution, competition
  • better information to the Eddy
  • speeds, memory consumption

64
Lazy Evaluation
  • User navigates thru partial results
  • can see only a few at a time
  • exploit to delay processing
  • do work on demand, for the region the user is
    currently navigating over
  • things to delay
  • expensive source lookups (e.g. map)
  • joins (e.g. )
  • any other formatting etc.

65
Giving probabilistic results
  • view a partial result as something that throws
    light on the outer-join space
  • positive result: probabilistic
  • negative result: deterministic
  • benefit of positive result
  • weighted sum of cells it is likely to show -
    penalty for false result
  • repudiation of earlier false results
  • benefit of negative result
  • direct benefit unlikely
  • repudiation of earlier false results
  • complications
  • many tuples may contribute to single output tuple
    -- aggregation

66
Probabilistic partial results
  • partial result ⇒ may not apply all filters
  • still want to show result probabilistically
  • filters ill-specified
  • data on Internet often inconsistent
  • show all partial results: full disjunction
    (Galindo-Legaria '94)
  • positive and negative results

(diagram: output space vs. outer-join space)
67
What does user see?
  • user navigates thru partial results
  • with complete-row partial results
  • can either explore these results in detail
    (sort/scroll along columns as in a spreadsheet)
  • or compute approx. aggregates on them in an online
    fashion
  • errors shrinking as more results come in
  • with probabilistic, incomplete results?
  • easy if primary key in partial result
  • what to show otherwise?

68
Quantifying benefit of partial results
  • partial result sheds light on outer-join space
  • positive result: probabilistic
  • negative result: deterministic
  • benefit of positive result
  • weighted sum of cells it is likely to show
  • penalty for false result ?
  • repudiation of earlier false results ?
  • benefit of negative result
  • direct benefit unlikely
  • repudiation of earlier false results ?

69
Why probabilistic results are good
  • data on Internet is often inconsistent
  • hence even exact processing cannot give perfect
    answers
  • people are used to sloppy answers
  • can allow negations in expert interfaces only

70
Data growth vs. Computer Speedup
  • Moore's Law -- # of transistors/chip doubles
    every 18 months (1965)
  • data growth

Source: J. Porter, Disk/Trend, Inc.
(http://www.disktrend.com/pdf/portrpkg.pdf)
71
Disk Appetite, cont'd.
  • Greg Papadopoulos, CTO Sun
  • Disk sales doubling every 9 months
  • similar results from Winter VLDB survey
  • time to process all your data doubles every 18
    months!