1
Content-Based Routing: Different Plans for
Different Data
  • Pedro Bizarro
  • Joint work with Shivnath Babu, David DeWitt,
    Jennifer Widom
  • September 1, 2005
  • VLDB 2005

2
Introduction
  • The opportunity to improve
  • Optimizers pick a single plan for a query
  • However, different subsets of data may have very
    different statistical properties
  • May be more efficient to use different plans for
    different subsets of data

3
Overview of CBR
  • Eliminates single plan assumption
  • Identifies tuple classes
  • Uses multiple plans, each customized for a
    different tuple class
  • CBR applies to any streaming data
  • E.g., stream systems, regular DBMS operators
    using iterators, and acquisitional systems
  • Adaptive, low-overhead algorithm
  • Implemented in TelegraphCQ as an extension to
    Eddies

4
Overview of Eddies
  • Eddy routes tuples through a pool of operators
  • Routing decisions based on operator
    characteristics: selectivity, cost, queue size,
    etc.
  • Tuples not differentiated based on content
  • We call it SBR (Source-Based Routing)

5
Content-Based Routing Example
  • Consider stream S processed by O1, O2, O3

              O1    O2    O3
Selectivity   30%   40%   60%
Overall operator selectivities
  • Best routing order is O1, then O2, then O3

6
Content-Based Routing Example
  • Let A be an attribute with domain {a, b, c}

Value of A    O1    O2    O3
A=a           32%   10%   55%
A=b           31%   20%   65%
A=c           27%   90%   60%
Overall       30%   40%   60%
Content-specific selectivities
  • Best routing order for A=a: O2, O1, O3
  • Best routing order for A=b: O2, O1, O3
  • Best routing order for A=c: O1, O3, O2
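The orders above follow from sorting operators by ascending selectivity (assuming uniform per-operator costs); a minimal sketch, not from the paper, checks them:

```python
# Content-specific selectivities (%) from the table above.
content_sel = {
    "a": {"O1": 32, "O2": 10, "O3": 55},
    "b": {"O1": 31, "O2": 20, "O3": 65},
    "c": {"O1": 27, "O2": 90, "O3": 60},
}

def best_order(sel):
    """With uniform operator costs, the cheapest serial order for
    independent filters applies the most selective operator first."""
    return sorted(sel, key=sel.get)

for value, sel in sorted(content_sel.items()):
    print(f"A={value}: {best_order(sel)}")
# A=a: ['O2', 'O1', 'O3'], A=b: ['O2', 'O1', 'O3'], A=c: ['O1', 'O3', 'O2']
```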

7
Classifier Attributes
  • Goal: identify tuple classes
  • Each with a different optimal operator ordering
  • CBR considers
  • Tuple classes distinguished by content, i.e.,
    attribute values
  • Classifier attribute (informal definition)
  • Attribute A is a classifier attribute for
    operator O if the value of A is correlated with
    the selectivity of O

8
Best Classifier Attribute Example
  • Attribute A with domain {a, b, c}
  • Attribute B with domain {x, y, z}
  • Which is the best to use for routing decisions?
  • Similar to the AI problem of picking classifier
    attributes for decision trees
  • Two labels: passes operator, dropped by operator
  • AI solution: use GainRatio to pick the best
    classifier attribute

          s     1-s
B=x      43%    57%
B=y      38%    62%
B=z      39%    61%
Overall  40%    60%

          s     1-s
A=a      10%    90%
A=b      20%    80%
A=c      90%    10%
Overall  40%    60%
9
GainRatio to Measure Correlation
          s     1-s
A=a      10%    90%
A=b      20%    80%
A=c      90%    10%
Overall  40%    60%

          s     1-s
B=x      43%    57%
B=y      38%    62%
B=z      39%    61%
Overall  40%    60%

GainRatio(R, A) = 0.87        GainRatio(R, B) = 0.002
  • R: a random sample of tuples processed by operator O

Formulas from T. Mitchell, Machine Learning.
McGraw-Hill, '97.
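The gain ratio computation can be sketched as follows (our reconstruction from Mitchell's formulas; partition sizes are assumed equal here, so the absolute values differ from the slide's 0.87 and 0.002, but A still dominates B by orders of magnitude either way):

```python
import math

def _entropy(p):
    """Binary entropy (bits) of pass probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gain_ratio(partitions):
    """partitions: one (passed, dropped) count pair per attribute value."""
    total = sum(p + d for p, d in partitions)
    base = _entropy(sum(p for p, _ in partitions) / total)
    cond = sum((p + d) / total * _entropy(p / (p + d)) for p, d in partitions)
    split = -sum((p + d) / total * math.log2((p + d) / total)
                 for p, d in partitions)
    return (base - cond) / split if split else 0.0

A = [(10, 90), (20, 80), (90, 10)]   # A = a, b, c (pass/drop per 100 tuples)
B = [(43, 57), (38, 62), (39, 61)]   # B = x, y, z
print(gain_ratio(A), gain_ratio(B))  # A's gain ratio is roughly 300x B's
```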
10
Classifier Attributes: Definition
  • An attribute A is a classifier attribute for
    operator O if, for any large random sample R of
    tuples processed by O, GainRatio(R, A) > θ, for
    some threshold θ

11
Content-Learns Algorithm: Learning Routes
Automatically
  • Content-Learns consists of two continuous,
    concurrent steps
  • Optimization: for each Ol ∈ {O1, …, On}, either
  • find that Ol does not have a classifier attribute, or
  • find the best classifier attribute, Cl, of Ol
  • Routing: route tuples according to
  • the overall selectivity of Ol, if Ol does not have
    a classifier attribute, or
  • the content-specific selectivities of the pair
    ⟨Ol, Cl⟩, if Cl is the best classifier
    attribute of Ol

12
Content-Learns Optimization Step
  • Find Cl by profiling Ol
  • Route a fraction of input tuples to Ol
  • For each sampled tuple, map attribute values to d
    partitions
  • For each (attribute, partition) pair, update
    pass/fail counters
  • When all sample tuples have been seen, compute Cl

[Figure: a sampled tuple's attribute values (3 attributes shown) are
mapped to their corresponding partitions (2 partitions each), and the
per-(attribute, partition) pass/fail counters are updated]
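A minimal sketch of the profiling loop above (our reconstruction; the partition count `D`, the hash-based partitioning, the threshold `THETA`, and the sample construction are illustrative choices, not the paper's exact values):

```python
import math
import random

D, THETA = 4, 0.05      # partitions per attribute, gain-ratio threshold

def gain_ratio(counts):
    """counts: per-partition [passed, failed] pairs for one attribute."""
    def ent(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    total = sum(p + f for p, f in counts) or 1
    base = ent(sum(p for p, _ in counts) / total)
    cond = split = 0.0
    for p, f in counts:
        n = p + f
        if n:
            cond += n / total * ent(p / n)
            split -= n / total * math.log2(n / total)
    return (base - cond) / split if split else 0.0

def profile(sample, attrs, passes_op):
    """Profile one operator over a sample of tuples; return its best
    classifier attribute, or None if no gain ratio exceeds THETA."""
    counts = {a: [[0, 0] for _ in range(D)] for a in attrs}
    for t in sample:
        passed = passes_op(t)
        for a in attrs:
            part = hash(t[a]) % D        # map value to one of D partitions
            counts[a][part][0 if passed else 1] += 1
    best = max(attrs, key=lambda a: gain_ratio(counts[a]))
    return best if gain_ratio(counts[best]) > THETA else None

# Synthetic sample: attribute A determines the outcome, B is noise.
random.seed(7)
sample = [{"A": i % 2, "B": random.randrange(4)} for i in range(150)]
print(profile(sample, ["A", "B"], lambda t: t["A"] == 1))  # -> A
```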
13
Content-Learns Routing Step
  • SBR routes to Ol with probability inversely
    proportional to Ol's selectivity, Wl
  • CBR routes to the operator with minimum ρ
  • If Ol does not have a classifier attribute, its
    ρ = Wl
  • If Ol has a classifier attribute, its ρ = Sl,j,i,
    where j = CAl and i = fj(t.Cj)

[Figure: routing example over 2 partitions, choosing between overall
selectivities (e.g., 40, 55) and partition-specific selectivities
(e.g., 50 vs. 20)]
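The per-tuple routing decision can be sketched as follows (our reconstruction; the dict layout and the hash-based partition function stand in for the paper's structures):

```python
def route(tup, ops):
    """Route tup to the operator with minimum rho: the content-specific
    selectivity S[partition] when a classifier attribute CA is known,
    the overall selectivity W otherwise."""
    def rho(op):
        if op["CA"] is None:
            return op["W"]
        part = hash(tup[op["CA"]]) % len(op["S"])   # i = f_j(t.C_j)
        return op["S"][part]
    return min(ops, key=rho)

ops = [
    {"name": "O1", "W": 0.40, "CA": None, "S": None},
    {"name": "O2", "W": 0.50, "CA": "A",  "S": [0.10, 0.90]},
]
print(route({"A": 0}, ops)["name"])  # O2: partition-0 selectivity 0.10 < 0.40
print(route({"A": 1}, ops)["name"])  # O1: O2's partition-1 selectivity 0.90 > 0.40
```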
14
Adaptivity and Overhead
  • CBR introduces new routing and learning overheads
  • Overheads are at odds with adaptivity
  • Adaptivity: the ability to find an efficient plan
    quickly when data or system characteristics change

15
CBR Update Overheads
  • Once per tuple:
  • keep selectivities as fresh as possible
  • Once per sampled tuple:
  • update correlations between operators and content
  • Once per sample (2500 tuples):
  • compute GainRatio and update one entry in
    array CA

[Figure: run-time data structures. W: operator selectivities, one per
operator 1..n (e.g., 25, 40, 50, 60). CA: best classifier attribute
per operator (-1 if none). S: detailed selectivities, indexed by
operator 1..n, attribute 1..k, and partition 1..d, maintained from
tuples-in/tuples-out counts]
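The figure's three structures can be sketched as plain arrays (our reconstruction; the sizes n, k, d and the sample values are illustrative, not the paper's):

```python
n, k, d = 6, 9, 24   # operators, stream attributes, partitions (illustrative)

# W: overall selectivity per operator -- refreshed once per tuple.
W = [0.0] * n
# CA: best classifier attribute per operator (-1 = none) -- recomputed
# once per sample, after GainRatio is evaluated.
CA = [-1] * n
# S: detailed selectivities per (operator, attribute, partition) --
# derived from pass/fail counters updated once per sampled tuple.
S = [[[0.0] * d for _ in range(k)] for _ in range(n)]

# E.g., after profiling, operator 0 might get classifier attribute 2:
CA[0] = 2
S[0][2][5] = 0.20   # selectivity of operator 0 for tuples whose
                    # attribute 2 falls in partition 5
```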
16
Experimental Results: Datasets
  • Stream-star dataset
  • Synthetic dataset based on a star-schema
  • SELECT * FROM stream S, d1, d2, …, dN
  • WHERE s.fkd1 = d1.pk // Operator O1
  • AND s.fkdN = dN.pk // Operator ON
  • Attribute S.attrC is the best classifier
    attribute for all operators
  • 8 other attributes in S are not correlated with
    operator selectivities
  • 100K records in stream, 10K records in dimension
    tables
  • Lab dataset
  • Explained later

17
Experimental Results: System, Metrics, Defaults
  • TelegraphCQ version 0.2
  • Tao Linux release 1, 512 MB RAM, Pentium 4 2.4
    GHz
  • Metrics: improvement in running time and routing
    calls
  • Default values

Parameter    Default      Comment
P            6%           Tuple sampling probability
R            150 tuples   Sample size
d            24           Number of partitions
Confidence   95%          Conf. interval in graphs
18
Experimental Results: Run-time Overheads
  • Routing overhead
  • time to perform routing decisions (SBR, CBR)
  • Learning overhead
  • Time to update data structures (SBR, CBR) plus
  • Time to compute gain ratio (CBR only).

Overhead increase: 30-45%
19
Experimental Results: Varying Skew
  • One operator with selectivity A, all others with
    selectivity B
  • Skew is A-B; A varied from 5% to 95%
  • Overall selectivity: 5%

6 joins
20
Experimental Results: Random Selectivities
  • Attribute attrC is correlated with the
    selectivities of the operators
  • Other attributes in stream tuples are not
    correlated with selectivities
  • Random selectivities in each operator

21
Experimental Results: Varying Aggregate Selectivity
  • Aggregate selectivity in previous experiments was
    5% or 8%
  • Here we vary aggregate selectivity between 5% and
    35%
  • Random selectivities within these bounds

6 joins
22
Experimental Results: Datasets
  • Lab dataset
  • Real data
  • 2.2 M readings from 54 sensors (Intel Research
    Berkeley Lab)
  • Single stream with attributes
  • Light
  • Humidity
  • Temperature
  • Voltage
  • sensorID
  • Year
  • Month
  • Day
  • Hours
  • Minutes
  • Seconds

23
Experimental Results: Challenging Adaptivity
Experiment (1)
  • Using Lab dataset
  • Example query: SELECT * FROM sensors WHERE
    light > 500
  • Observations
  • Very high variation in selectivity
  • Best classifier attributes change with time
  • No classifier attribute found for over half the
    time

24
Experimental Results: Challenging Adaptivity
Experiment (2)
  • Query: SELECT * FROM sensors
  • WHERE light BETWEEN lowL AND highL
  • AND temperature BETWEEN lowT AND highT
  • AND humidity BETWEEN lowH AND highH
  • AND voltage BETWEEN lowV AND highV
  • lowX: a random number from the lower 25% of the
    domain
  • highX: a random number from the upper 25%
  • Results for 50 different queries.
  • Average improvements:
  • 8% in routing calls
  • 5% in execution time
  • 7% in time spent evaluating operators
  • 18% in routing calls until a tuple is dropped

25
Experimental Results: Varying Operator Cost
  • Ran the random query from the previous slide
  • Ran the query during periods with correlations
  • Varied operator cost by running CPU-intensive
    computations

4 operators
26
Conclusions
  • CBR eliminates single plan assumption
  • Explores correlation between tuple content and
    operator selectivities
  • Adaptive learner of correlations with negligible
    overhead
  • Performance improvements over non-CBR routing
  • Selectivity changes much more than correlations

27
Acknowledgements
  • Sam Madden and Amol Deshpande for providing the
    Lab dataset.
  • Sailesh Krishnamurthy, Amol Deshpande, Joe
    Hellerstein, and the rest of the TelegraphCQ team
    for providing TelegraphCQ and answering all my
    questions.

28
Q & A?
29
Extra slides
30
Motivational Example (1): Intrusion Detection
Query
  • Track packets with a destination address matching
    a prefix in table T, and containing the 100-byte
    and 256-byte sequences 0xa...8 and 0x7...b,
    respectively, as subsequences
  • SELECT * FROM packets
    WHERE matches(destination, T) // Operator O1
    AND contains(data, 0xa...8) // Operator O2
    AND contains(data, 0x7...b) // Operator O3
31
Motivational Example (2): Intrusion Detection
Query
  • Assume
  • costs are c3 > c1 > c2
  • selectivities are σ3 > σ1 > σ2
  • SBR routing converges to O2, O1, O3

[Figure: under SBR, almost all tuples in the stream follow the route
O2, O1, O3]
32
Motivational Example (3): Intrusion Detection
Query
  • Suppose an attack (O2 and O3) on a network whose
    prefix is not in T (O1) is underway
  • σ2 and σ3 will be very high, σ1 will be very low
  • O1, O2, O3 will be the most efficient ordering
    for attack tuples

[Figure: SBR sends almost all tuples along O2, O1, O3; CBR examines
addr and sends attack tuples along O1 first, while non-attack tuples
follow their own route]
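The expected per-tuple cost of an ordering makes the gain concrete. A small sketch with illustrative numbers (chosen only to satisfy c3 > c1 > c2 and, for attack tuples, high σ2, σ3 and low σ1; none of these values are from the slides):

```python
def expected_cost(order, cost, sel):
    """Expected work per tuple for a serial filter order: each
    operator runs only if the tuple survived all earlier ones."""
    total, surviving = 0.0, 1.0
    for o in order:
        total += surviving * cost[o]
        surviving *= sel[o]
    return total

cost = {"O1": 1.0, "O2": 0.5, "O3": 2.0}          # c3 > c1 > c2
attack_sel = {"O1": 0.05, "O2": 0.9, "O3": 0.9}   # O1 drops almost all attack tuples

print(expected_cost(["O1", "O2", "O3"], cost, attack_sel))  # 1.115
print(expected_cost(["O2", "O1", "O3"], cost, attack_sel))  # 1.49: SBR's global order loses here
```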
33
Experimental Results: Varying Skew
  • One operator with selectivity A, all others with
    selectivity B
  • Skew is A-B; A varied from 5% to 95%
  • Overall selectivity: 5%

2 joins
6 joins
34
Related Work
  • Adaptivity in Stream Systems
  • Avnur: Eddies: Continuously Adaptive Query
    Processing. SIGMOD'00.
  • Arasu: STREAM: The Stanford Stream Data
    Manager. DEBul 26(1).
  • Babu: Adaptive ordering of pipelined stream
    filters. SIGMOD'04.
  • Babu: StreaMon: An Adaptive Engine for Stream
    Query Processing. SIGMOD'04.
  • Carney: Monitoring streams: a new class of
    data management applications. VLDB'02.
  • Chandrasekaran: TelegraphCQ: Continuous
    dataflow processing for an uncertain world.
    CIDR'03.
  • Chandrasekaran: PSoup: a system for streaming
    queries over streaming data. VLDBj 12(2).
  • Deshpande: An initial study of overheads of
    eddies. SIGMODrec 33(1).
  • Deshpande: Lifting the Burden of History from
    Adaptive Query Processing. VLDB'04.
  • Madden: Continuously adaptive continuous
    queries over streams. SIGMOD'02.
  • Raman: Using state modules for adaptive query
    processing. ICDE'03.
  • Tian: Tuple Routing Strategies for Distributed
    Eddies. VLDB'03.
  • Viglas: Maximizing the Output Rate of Multi-Way
    Join Queries over Streaming Information Sources.
    VLDB'03.
  • AQP Surveys
  • Babu: Adaptive Query Processing in the Looking
    Glass. CIDR'05.
  • Hellerstein: Adaptive query processing:
    Technology in evolution. DEBul 23(2).
  • Adaptivity in DBMS
  • Babu: Proactive Re-optimization. SIGMOD'05.
  • Graefe: Dynamic query evaluation plans.
    SIGMOD'89.
  • Kabra: Efficient Mid-Query Re-Optimization of
    Sub-Optimal Query Execution Plans. SIGMOD'98.
  • Markl: Robust Query Processing through
    Progressive Optimization. SIGMOD'04.
  • Wong: Decomposition: a strategy for query
    processing. TODS 1(3).
  • Adaptivity in Distributed Systems
  • Bouganim: Dynamic query scheduling in data
    integration systems. ICDE'00.
  • Ives: An adaptive query execution system for
    data integration. SIGMOD'99.
  • Ives: Adaptive query processing for internet
    applications. DEBul 23(2).
  • Ives: Adapting to Source Properties in
    Processing Data Integration Queries. SIGMOD'04.
  • Ives: Efficient Query Processing for Data
    Integration. PhD thesis.
  • Ng: Dynamic query re-optimization. ICSSDM'99.
  • Urhan: Dynamic pipeline scheduling for
    improving interactive performance of online
    queries. VLDB'01.
  • Urhan: Cost based query scrambling for initial
    delays. SIGMOD'98.
  • Zhu: Dynamic plan migration for continuous
    queries over data streams. SIGMOD'04.

35
Related Work (cont'd)
  • Acquisitional Systems
  • Madden: The Design of an Acquisitional Query
    Processor for Sensor Networks. SIGMOD'03.
  • Deshpande: Model-Driven Data Acquisition in
    Sensor Networks. VLDB'04.
  • Deshpande: Exploiting Correlated Attributes in
    Acquisitional Query Processing. ICDE'05.
  • Multiple Plans in Same Query
  • Antoshenkov: Query processing and optimization
    in Oracle Rdb. VLDBj 5(4).
  • Bizarro: Content-Based Routing: Different Plans
    for Different Data. VLDB'05.
  • Polyzotis: Selectivity-Based Partitioning: A
    Divide-and-Union Paradigm For Effective Query
    Optimization. Unpublished.
  • URDs and Modeling Uncertainty
  • Anderson: Index key range estimator. U.S.
    Patent 4,774,657.
  • Babcock: Towards a Robust Query Optimizer: A
    Principled and Practical Approach. SIGMOD'05.
  • Chu: Least expected cost query optimization: An
    exercise in utility. PODS'99.
  • Ramamurthy: Buffer-pool Aware Query
    Optimization. CIDR'05.
  • Viglas: Novel Query Optimization and Evaluation
    Techniques. PhD thesis.
  • Optimization, Cardinality Estimation,
    Correlations
  • Selinger: Access Path Selection in a Relational
    Database Management System. SIGMOD'79.
  • Acharya: Join Synopses for Approximate Query
    Answering. SIGMOD'99.
  • Christodoulakis: Implications of certain
    assumptions in database performance evaluation.
    TODS 9(2).
  • Graefe: Query Evaluation Techniques for Large
    Databases. ACM Comput. Surv. 25(2).
  • Getoor: Selectivity Estimation using
    Probabilistic Models. SIGMOD'01.
  • Ioannidis: On the Propagation of Errors in the
    Size of Join Results. SIGMOD'91.
  • Ilyas: CORDS: Automatic discovery of
    correlations and soft functional dependencies.
    SIGMOD'04.
  • Stillger: LEO: DB2's LEarning Optimizer.
    VLDB'01.
  • AI and Learning from Streams
  • Mitchell: Machine Learning. McGraw-Hill, '97.
  • Guha: Data-streams and histograms. ACM Symp. on
    Theory of Computing, '01.
  • Domingos: Mining high-speed data streams.
    SIGKDD'00.
  • Gehrke: On computing correlated aggregates over
    continual data streams. SIGMOD'01.