1
Content-Based Routing: Different Plans for
Different Data
  • Pedro Bizarro
  • Joint work with Shivnath Babu, David DeWitt,
    Jennifer Widom
  • September 1, 2005
  • VLDB 2005

2
Introduction
  • The opportunity to improve
  • Optimizers pick a single plan for a query
  • However, different subsets of data may have very
    different statistical properties
  • May be more efficient to use different plans for
    different subsets of data

3
Overview of CBR
  • Eliminates single plan assumption
  • Identifies tuple classes
  • Uses multiple plans, each customized for a
    different tuple class
  • CBR applies to any streaming data
  • E.g., stream systems, regular DBMS operators
    using iterators, and acquisitional systems
  • Adaptive, low-overhead algorithm
  • Implemented in TelegraphCQ as an extension to
    Eddies

4
Overview of Eddies
  • Eddy routes tuples through a pool of operators
  • Routing decisions based on operator
    characteristics: selectivity, cost, queue size,
    etc.
  • Tuples not differentiated based on content
  • We call it SBR (Source-Based Routing)

5
Content-Based Routing Example
  • Consider stream S processed by O1, O2, O3

              O1    O2    O3
Selectivity   30%   40%   60%
Overall operator selectivities
  • Best routing order is O1, then O2, then O3

6
Content-Based Routing Example
  • Let A be an attribute with domain {a, b, c}

Value of A    O1    O2    O3
A=a           32%   10%   55%
A=b           31%   20%   65%
A=c           27%   90%   60%
Overall       30%   40%   60%
Content-specific selectivities
  • Best routing order for A=a: O2, O1, O3
  • Best routing order for A=b: O2, O1, O3
  • Best routing order for A=c: O1, O3, O2
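The orders above follow from sorting operators by ascending selectivity (assuming uniform per-operator costs); a minimal sketch, not from the paper, checks them:

```python
# Content-specific selectivities (%) from the table above.
content_sel = {
    "a": {"O1": 32, "O2": 10, "O3": 55},
    "b": {"O1": 31, "O2": 20, "O3": 65},
    "c": {"O1": 27, "O2": 90, "O3": 60},
}

def best_order(sel):
    """With uniform operator costs, the cheapest serial order for
    independent filters applies the most selective operator first."""
    return sorted(sel, key=sel.get)

for value, sel in sorted(content_sel.items()):
    print(f"A={value}: {best_order(sel)}")
# A=a: ['O2', 'O1', 'O3'], A=b: ['O2', 'O1', 'O3'], A=c: ['O1', 'O3', 'O2']
```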

7
Classifier Attributes
  • Goal: identify tuple classes
  • Each with a different optimal operator ordering
  • CBR considers
  • Tuple classes distinguished by content, i.e.,
    attribute values
  • Classifier attribute (informal definition)
  • Attribute A is a classifier attribute for
    operator O if the value of A is correlated with
    the selectivity of O

8
Best Classifier Attribute Example
  • Attribute A with domain {a, b, c}
  • Attribute B with domain {x, y, z}
  • Which is the best to use for routing decisions?
  • Similar to the AI problem of picking classifier
    attributes for decision trees
  • Two labels: passes operator, dropped by operator
  • AI solution: use GainRatio to pick the best
    classifier attribute

          s     1-s
B=x      43%    57%
B=y      38%    62%
B=z      39%    61%
Overall  40%    60%

          s     1-s
A=a      10%    90%
A=b      20%    80%
A=c      90%    10%
Overall  40%    60%
9
GainRatio to Measure Correlation
          s     1-s
A=a      10%    90%
A=b      20%    80%
A=c      90%    10%
Overall  40%    60%

          s     1-s
B=x      43%    57%
B=y      38%    62%
B=z      39%    61%
Overall  40%    60%

GainRatio(R, A) = 0.87        GainRatio(R, B) = 0.002
  • R: a random sample of tuples processed by operator O

Formulas from T. Mitchell, Machine Learning.
McGraw-Hill, '97.
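The gain ratio computation can be sketched as follows (our reconstruction from Mitchell's formulas; partition sizes are assumed equal here, so the absolute values differ from the slide's 0.87 and 0.002, but A still dominates B by orders of magnitude either way):

```python
import math

def _entropy(p):
    """Binary entropy (bits) of pass probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gain_ratio(partitions):
    """partitions: one (passed, dropped) count pair per attribute value."""
    total = sum(p + d for p, d in partitions)
    base = _entropy(sum(p for p, _ in partitions) / total)
    cond = sum((p + d) / total * _entropy(p / (p + d)) for p, d in partitions)
    split = -sum((p + d) / total * math.log2((p + d) / total)
                 for p, d in partitions)
    return (base - cond) / split if split else 0.0

A = [(10, 90), (20, 80), (90, 10)]   # A = a, b, c (pass/drop per 100 tuples)
B = [(43, 57), (38, 62), (39, 61)]   # B = x, y, z
print(gain_ratio(A), gain_ratio(B))  # A's gain ratio is roughly 300x B's
```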
10
Classifier Attributes: Definition
  • An attribute A is a classifier attribute for
    operator O if, for any large random sample R of
    tuples processed by O, GainRatio(R, A) > θ, for
    some threshold θ

11
Content-Learns Algorithm: Learning Routes
Automatically
  • Content-Learns consists of two continuous,
    concurrent steps
  • Optimization: for each Ol ∈ {O1, …, On}, either
  • find that Ol does not have a classifier attribute, or
  • find the best classifier attribute, Cl, of Ol
  • Routing: route tuples according to
  • the overall selectivity of Ol, if Ol does not have
    a classifier attribute, or
  • the content-specific selectivities of the pair
    ⟨Ol, Cl⟩, if Cl is the best classifier
    attribute of Ol

12
Content-Learns Optimization Step
  • Find Cl by profiling Ol
  • Route a fraction of input tuples to Ol
  • For each sampled tuple, map attribute values to d
    partitions
  • For each (attribute, partition) pair, update
    pass/fail counters
  • When all sample tuples have been seen, compute Cl

[Figure: a sampled tuple's attribute values (3 attributes shown) are
mapped to their corresponding partitions (2 partitions each), and the
per-(attribute, partition) pass/fail counters are updated]
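A minimal sketch of the profiling loop above (our reconstruction; the partition count `D`, the hash-based partitioning, the threshold `THETA`, and the sample construction are illustrative choices, not the paper's exact values):

```python
import math
import random

D, THETA = 4, 0.05      # partitions per attribute, gain-ratio threshold

def gain_ratio(counts):
    """counts: per-partition [passed, failed] pairs for one attribute."""
    def ent(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    total = sum(p + f for p, f in counts) or 1
    base = ent(sum(p for p, _ in counts) / total)
    cond = split = 0.0
    for p, f in counts:
        n = p + f
        if n:
            cond += n / total * ent(p / n)
            split -= n / total * math.log2(n / total)
    return (base - cond) / split if split else 0.0

def profile(sample, attrs, passes_op):
    """Profile one operator over a sample of tuples; return its best
    classifier attribute, or None if no gain ratio exceeds THETA."""
    counts = {a: [[0, 0] for _ in range(D)] for a in attrs}
    for t in sample:
        passed = passes_op(t)
        for a in attrs:
            part = hash(t[a]) % D        # map value to one of D partitions
            counts[a][part][0 if passed else 1] += 1
    best = max(attrs, key=lambda a: gain_ratio(counts[a]))
    return best if gain_ratio(counts[best]) > THETA else None

# Synthetic sample: attribute A determines the outcome, B is noise.
random.seed(7)
sample = [{"A": i % 2, "B": random.randrange(4)} for i in range(150)]
print(profile(sample, ["A", "B"], lambda t: t["A"] == 1))  # -> A
```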
13
Content-Learns Routing Step
  • SBR routes to Ol with probability inversely
    proportional to Ol's selectivity, Wl
  • CBR routes to the operator with minimum ρ
  • If Ol does not have a classifier attribute, its
    ρ = Wl
  • If Ol has a classifier attribute, its ρ = Sl,j,i,
    where j = CAl and i = fj(t.Cj)

[Figure: routing example over 2 partitions, choosing between overall
selectivities (e.g., 40, 55) and partition-specific selectivities
(e.g., 50 vs. 20)]
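The per-tuple routing decision can be sketched as follows (our reconstruction; the dict layout and the hash-based partition function stand in for the paper's structures):

```python
def route(tup, ops):
    """Route tup to the operator with minimum rho: the content-specific
    selectivity S[partition] when a classifier attribute CA is known,
    the overall selectivity W otherwise."""
    def rho(op):
        if op["CA"] is None:
            return op["W"]
        part = hash(tup[op["CA"]]) % len(op["S"])   # i = f_j(t.C_j)
        return op["S"][part]
    return min(ops, key=rho)

ops = [
    {"name": "O1", "W": 0.40, "CA": None, "S": None},
    {"name": "O2", "W": 0.50, "CA": "A",  "S": [0.10, 0.90]},
]
print(route({"A": 0}, ops)["name"])  # O2: partition-0 selectivity 0.10 < 0.40
print(route({"A": 1}, ops)["name"])  # O1: O2's partition-1 selectivity 0.90 > 0.40
```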
14
Adaptivity and Overhead
  • CBR introduces new routing and learning overheads
  • Overheads are at odds with adaptivity
  • Adaptivity: the ability to find an efficient plan
    quickly when data or system characteristics change

15
CBR Update Overheads
  • Once per tuple:
  • keep selectivities as fresh as possible
  • Once per sampled tuple:
  • update correlations between operators and content
  • Once per sample (2500 tuples):
  • compute GainRatio and update one entry in
    array CA

[Figure: run-time data structures. W: operator selectivities, one per
operator 1..n (e.g., 25, 40, 50, 60). CA: best classifier attribute
per operator (-1 if none). S: detailed selectivities, indexed by
operator 1..n, attribute 1..k, and partition 1..d, maintained from
tuples-in/tuples-out counts]
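The figure's three structures can be sketched as plain arrays (our reconstruction; the sizes n, k, d and the sample values are illustrative, not the paper's):

```python
n, k, d = 6, 9, 24   # operators, stream attributes, partitions (illustrative)

# W: overall selectivity per operator -- refreshed once per tuple.
W = [0.0] * n
# CA: best classifier attribute per operator (-1 = none) -- recomputed
# once per sample, after GainRatio is evaluated.
CA = [-1] * n
# S: detailed selectivities per (operator, attribute, partition) --
# derived from pass/fail counters updated once per sampled tuple.
S = [[[0.0] * d for _ in range(k)] for _ in range(n)]

# E.g., after profiling, operator 0 might get classifier attribute 2:
CA[0] = 2
S[0][2][5] = 0.20   # selectivity of operator 0 for tuples whose
                    # attribute 2 falls in partition 5
```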
16
Experimental Results: Datasets
  • Stream-star dataset
  • Synthetic dataset based on a star-schema
  • SELECT * FROM stream S, d1, d2, …, dN
  • WHERE s.fkd1 = d1.pk // Operator O1
  • AND s.fkdN = dN.pk // Operator ON
  • Attribute S.attrC is the best classifier
    attribute for all operators
  • 8 other attributes in S are not correlated with
    operator selectivities
  • 100K records in stream, 10K records in dimension
    tables
  • Lab dataset
  • Explained later

17
Experimental Results: System, Metrics, Defaults
  • TelegraphCQ version 0.2
  • Tao Linux release 1, 512 MB RAM, Pentium 4 2.4
    GHz
  • Metrics: improvement in running time and routing
    calls
  • Default values

Parameter    Default      Comment
P            6%           Tuple sampling probability
R            150 tuples   Sample size
d            24           Number of partitions
Confidence   95%          Conf. interval in graphs
18
Experimental Results: Run-time Overheads
  • Routing overhead
  • time to perform routing decisions (SBR, CBR)
  • Learning overhead
  • Time to update data structures (SBR, CBR) plus
  • Time to compute gain ratio (CBR only).

Overhead increase: 30-45%
19
Experimental Results: Varying Skew
  • One operator with selectivity A, all others with
    selectivity B
  • Skew is A-B; A varied from 5% to 95%
  • Overall selectivity: 5%

6 joins
20
Experimental Results: Random Selectivities
  • Attribute attrC is correlated with the
    selectivities of the operators
  • Other attributes in stream tuples are not
    correlated with selectivities
  • Random selectivities in each operator

21
Experimental Results: Varying Aggregate Selectivity
  • Aggregate selectivity in previous experiments was
    5% or 8%
  • Here we vary aggregate selectivity between 5% and
    35%
  • Random selectivities within these bounds

6 joins
22
Experimental Results: Datasets
  • Lab dataset
  • Real data
  • 2.2 M readings from 54 sensors (Intel Research
    Berkeley Lab)
  • Single stream with attributes
  • Light
  • Humidity
  • Temperature
  • Voltage
  • sensorID
  • Year
  • Month
  • Day
  • Hours
  • Minutes
  • Seconds

23
Experimental Results: Challenging Adaptivity
Experiment (1)
  • Using Lab dataset
  • Example query: SELECT * FROM sensors WHERE
    light > 500
  • Observations
  • Very high variation in selectivity
  • Best classifier attributes change with time
  • No classifier attribute found for over half the
    time

24
Experimental Results: Challenging Adaptivity
Experiment (2)
  • Query: SELECT * FROM sensors
  • WHERE light BETWEEN lowL AND highL
  • AND temperature BETWEEN lowT AND highT
  • AND humidity BETWEEN lowH AND highH
  • AND voltage BETWEEN lowV AND highV
  • lowX: a random number from the lower 25% of the
    domain
  • highX: a random number from the upper 25%
  • Results for 50 different queries.
  • Average improvements:
  • 8% in routing calls
  • 5% in execution time
  • 7% in time spent evaluating operators
  • 18% in routing calls until a tuple is dropped

25
Experimental Results: Varying Operator Cost
  • Ran the random query from the previous slide
  • Ran the query during periods with correlations
  • Varied operator cost by running CPU-intensive
    computations

4 operators
26
Conclusions
  • CBR eliminates single plan assumption
  • Explores correlation between tuple content and
    operator selectivities
  • Adaptive learner of correlations with negligible
    overhead
  • Performance improvements over non-CBR routing
  • Selectivity changes much more than correlations

27
Acknowledgements
  • Sam Madden and Amol Deshpande for providing the
    Lab dataset.
  • Sailesh Krishnamurthy, Amol Deshpande, Joe
    Hellerstein, and the rest of the TelegraphCQ team
    for providing TelegraphCQ and answering all my
    questions.

28
Q & A?
29
Extra slides
30
Motivational Example (1): Intrusion Detection
Query
  • Track packets with a destination address matching
    a prefix in table T, and containing the 100-byte
    and 256-byte sequences 0xa...8 and 0x7...b,
    respectively, as subsequences
  • SELECT * FROM packets
    WHERE matches(destination, T) // Operator O1
    AND contains(data, 0xa...8) // Operator O2
    AND contains(data, 0x7...b) // Operator O3
31
Motivational Example (2): Intrusion Detection
Query
  • Assume
  • costs are c3 > c1 > c2
  • selectivities are σ3 > σ1 > σ2
  • SBR routing converges to O2, O1, O3

[Figure: under SBR, almost all tuples in the stream follow the route
O2, O1, O3]
32
Motivational Example (3): Intrusion Detection
Query
  • Suppose an attack (O2 and O3) on a network whose
    prefix is not in T (O1) is underway
  • σ2 and σ3 will be very high, σ1 will be very low
  • O1, O2, O3 will be the most efficient ordering
    for attack tuples

[Figure: SBR sends almost all tuples along O2, O1, O3; CBR examines
addr and sends attack tuples along O1 first, while non-attack tuples
follow their own route]
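The expected per-tuple cost of an ordering makes the gain concrete. A small sketch with illustrative numbers (chosen only to satisfy c3 > c1 > c2 and, for attack tuples, high σ2, σ3 and low σ1; none of these values are from the slides):

```python
def expected_cost(order, cost, sel):
    """Expected work per tuple for a serial filter order: each
    operator runs only if the tuple survived all earlier ones."""
    total, surviving = 0.0, 1.0
    for o in order:
        total += surviving * cost[o]
        surviving *= sel[o]
    return total

cost = {"O1": 1.0, "O2": 0.5, "O3": 2.0}          # c3 > c1 > c2
attack_sel = {"O1": 0.05, "O2": 0.9, "O3": 0.9}   # O1 drops almost all attack tuples

print(expected_cost(["O1", "O2", "O3"], cost, attack_sel))  # 1.115
print(expected_cost(["O2", "O1", "O3"], cost, attack_sel))  # 1.49: SBR's global order loses here
```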
33
Experimental Results: Varying Skew
  • One operator with selectivity A, all others with
    selectivity B
  • Skew is A-B; A varied from 5% to 95%
  • Overall selectivity: 5%

2 joins
6 joins
34
Related Work
  • Adaptivity in Stream Systems
  • Avnur: Eddies: Continuously Adaptive Query
    Processing. SIGMOD'00.
  • Arasu: STREAM: The Stanford Stream Data
    Manager. DEBul 26(1).
  • Babu: Adaptive ordering of pipelined stream
    filters. SIGMOD'04.
  • Babu: StreaMon: An Adaptive Engine for Stream
    Query Processing. SIGMOD'04.
  • Carney: Monitoring streams: a new class of
    data management applications. VLDB'02.
  • Chandrasekaran: TelegraphCQ: Continuous
    dataflow processing for an uncertain world.
    CIDR'03.
  • Chandrasekaran: PSoup: a system for streaming
    queries over streaming data. VLDBj 12(2).
  • Deshpande: An initial study of overheads of
    eddies. SIGMODrec 33(1).
  • Deshpande: Lifting the Burden of History from
    Adaptive Query Processing. VLDB'04.
  • Madden: Continuously adaptive continuous
    queries over streams. SIGMOD'02.
  • Raman: Using state modules for adaptive query
    processing. ICDE'03.
  • Tian: Tuple Routing Strategies for Distributed
    Eddies. VLDB'03.
  • Viglas: Maximizing the Output Rate of Multi-Way
    Join Queries over Streaming Information Sources.
    VLDB'03.
  • AQP Surveys
  • Babu: Adaptive Query Processing in the Looking
    Glass. CIDR'05.
  • Hellerstein: Adaptive query processing:
    Technology in evolution. DEBul 23(2).
  • Adaptivity in DBMS
  • Babu: Proactive Re-optimization. SIGMOD'05.
  • Graefe: Dynamic query evaluation plans.
    SIGMOD'89.
  • Kabra: Efficient Mid-Query Re-Optimization of
    Sub-Optimal Query Execution Plans. SIGMOD'98.
  • Markl: Robust Query Processing through
    Progressive Optimization. SIGMOD'04.
  • Wong: Decomposition: a strategy for query
    processing. TODS 1(3).
  • Adaptivity in Distributed Systems
  • Bouganim: Dynamic query scheduling in data
    integration systems. ICDE'00.
  • Ives: An adaptive query execution system for
    data integration. SIGMOD'99.
  • Ives: Adaptive query processing for internet
    applications. DEBul 23(2).
  • Ives: Adapting to Source Properties in
    Processing Data Integration Queries. SIGMOD'04.
  • Ives: Efficient Query Processing for Data
    Integration. PhD thesis.
  • Ng: Dynamic query re-optimization. ICSSDM'99.
  • Urhan: Dynamic pipeline scheduling for
    improving interactive performance of online
    queries. VLDB'01.
  • Urhan: Cost based query scrambling for initial
    delays. SIGMOD'98.
  • Zhu: Dynamic plan migration for continuous
    queries over data streams. SIGMOD'04.

35
Related Work (cont'd)
  • Acquisitional Systems
  • Madden: The Design of an Acquisitional Query
    Processor for Sensor Networks. SIGMOD'03.
  • Deshpande: Model-Driven Data Acquisition in
    Sensor Networks. VLDB'04.
  • Deshpande: Exploiting Correlated Attributes in
    Acquisitional Query Processing. ICDE'05.
  • Multiple Plans in Same Query
  • Antoshenkov: Query processing and optimization
    in Oracle Rdb. VLDBj 5(4).
  • Bizarro: Content-Based Routing: Different Plans
    for Different Data. VLDB'05.
  • Polyzotis: Selectivity-Based Partitioning: A
    Divide-and-Union Paradigm For Effective Query
    Optimization. Unpublished.
  • URDs and Modeling Uncertainty
  • Anderson: Index key range estimator. U.S.
    Patent 4,774,657.
  • Babcock: Towards a Robust Query Optimizer: A
    Principled and Practical Approach. SIGMOD'05.
  • Chu: Least expected cost query optimization: An
    exercise in utility. PODS'99.
  • Ramamurthy: Buffer-pool Aware Query
    Optimization. CIDR'05.
  • Viglas: Novel Query Optimization and Evaluation
    Techniques. PhD thesis.
  • Optimization, Cardinality Estimation,
    Correlations
  • Selinger: Access Path Selection in a Relational
    Database Management System. SIGMOD'79.
  • Acharya: Join Synopses for Approximate Query
    Answering. SIGMOD'99.
  • Christodoulakis: Implications of certain
    assumptions in database performance evaluation.
    TODS 9(2).
  • Graefe: Query Evaluation Techniques for Large
    Databases. ACM Comput. Surv. 25(2).
  • Getoor: Selectivity Estimation using
    Probabilistic Models. SIGMOD'01.
  • Ioannidis: On the Propagation of Errors in the
    Size of Join Results. SIGMOD'91.
  • Ilyas: CORDS: Automatic discovery of
    correlations and soft functional dependencies.
    SIGMOD'04.
  • Stillger: LEO: DB2's LEarning Optimizer.
    VLDB'01.
  • AI and Learning from Streams
  • Mitchell: Machine Learning. McGraw-Hill, '97.
  • Guha: Data-streams and histograms. ACM Symp. on
    Theory of Computing, '01.
  • Domingos: Mining high-speed data streams.
    SIGKDD'00.
  • Gehrke: On computing correlated aggregates over
    continual data streams. SIGMOD'01.