Sangeetha Seshadri - PowerPoint PPT Presentation


Title: Sangeetha Seshadri


1
Data Stream Processing: An Overview
CS 4440 Lecture 6
  • Sangeetha Seshadri
  • sangeeta_at_cc.gatech.edu

2
Agenda
  • Data Streams
  • What are they?
  • Why now? Applications..
  • DSMS Architecture Issues
  • Query Processing

3
Data Streams: What and Where?
  • Continuous, unbounded, rapid, time-varying
    streams of data elements (tuples).
  • Occur in a variety of modern applications
  • Network monitoring and traffic engineering
  • Sensor networks, RFID tags
  • Telecom call records
  • Financial applications
  • Web logs and click-streams
  • Manufacturing processes
  • DSMS: Data Stream Management System

4
DBMS versus DSMS
  • DBMS
      • Persistent relations
      • One-time queries
      • Random access
      • Access plan determined by query processor and
        physical DB design
  • DSMS
      • Transient streams (and persistent relations)
      • Continuous queries
      • Sequential access
      • Unpredictable data characteristics and arrival
        patterns

5
Continuous Queries
  • One-time queries: run once to completion over
    the current data set.
  • Continuous queries: issued once and then
    continuously evaluated over the data.
  • Examples:
  • Notify me when the temperature drops below X.
  • Tell me when the price of stock Y > 300.
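
A continuous query can be pictured as a predicate that is registered once and then applied to every tuple as it arrives. The sketch below is only an illustration in plain Python, not the API of any DSMS; the function name `continuous_query`, the `(timestamp, temperature)` schema and the threshold of 10.0 are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp, temperature); hypothetical schema


def continuous_query(stream: Iterable[Reading],
                     predicate: Callable[[Reading], bool]) -> Iterator[Reading]:
    """Yield every tuple that satisfies the predicate, as soon as it arrives."""
    for item in stream:
        if predicate(item):
            yield item


# "Notify me when the temperature drops below X", with X = 10.0 (assumed).
readings = [(1, 12.5), (2, 9.8), (3, 11.0), (4, 7.2)]
alerts = list(continuous_query(readings, lambda r: r[1] < 10.0))
print(alerts)
```

Because the function is a generator, it never needs the whole input: it would work unchanged on an unbounded source.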

6
The (Simplified) Big Picture
[Figure: DSMS with Scratch Store and Stored Relations]
7
(Simplified) Network Monitoring
[Figure: DSMS with Scratch Store and Lookup Tables. Input: network measurements, packet traces. Registered: monitoring queries. Output: intrusion warnings, online performance metrics.]
8
Triggers?
  • Recall triggers in traditional DBMSs?
  • Why not use triggers to process continuous
    queries over data streams?

9
Making Things Concrete
[Figure: callers ALICE and BOB, with their call streams feeding the DSMS]
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
10
Query 1 (self-join)
  • Find all outgoing calls longer than 2 minutes
    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
  • Result requires unbounded storage
  • Can provide result as data stream
  • Can output after 2 min, without seeing end
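
The last bullet can be made concrete: a call is known to be longer than 2 minutes as soon as the stream's clock passes its start time plus 2, whether or not the end event has arrived. The following is a minimal sketch under two assumptions that are not on the slide: events arrive in non-decreasing time order, and time is measured in minutes. The name `long_calls` is made up for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time in minutes, event)


def long_calls(outgoing: Iterable[Event],
               limit: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Emit (call_ID, caller) once a call is known to exceed `limit` minutes.

    Assumes events arrive in non-decreasing time order.
    """
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start)
    for call_id, caller, time, event in outgoing:
        # Any open call that started more than `limit` ago is already long,
        # so report it now and forget it.
        overdue = [c for c, (_, t0) in open_calls.items() if time - t0 > limit]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Ended within the limit, or was already reported above.
            open_calls.pop(call_id, None)


events = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 1.0, "start"),
    (2, "bob", 1.5, "end"),
    (3, "carol", 3.0, "start"),  # alice's call is reported here, before its end
    (3, "carol", 4.0, "end"),
    (1, "alice", 9.0, "end"),
]
reported = list(long_calls(events))
print(reported)
```

In this sketch the state holds only calls that are open and not yet reported. Output is triggered by arrivals rather than by a timer, so a quiet stream delays the report.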

11
Query 2 (join)
  • Pair up callers and callees
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID
  • Can still provide result as data stream
  • Requires unbounded temporary storage
  • unless streams are near-synchronized
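
One way to see why the temporary storage is unbounded unless the streams are near-synchronized is a symmetric hash join: each arriving tuple probes the tuples waiting from the other stream and, if nothing matches yet, waits itself. This is a sketch, not the STREAM implementation. It assumes, for simplicity, that each call_ID appears once per stream (for example, only start events are fed in), and the tagged input format and the name `pair_up` are invented for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (source, call_ID, party): source is "O" for Outgoing or "I" for Incoming.
Item = Tuple[str, int, str]


def pair_up(merged: Iterable[Item]) -> Iterator[Tuple[str, str]]:
    """Symmetric hash join on call_ID over an interleaving of both streams."""
    waiting_out: Dict[int, str] = {}  # callers still waiting for a callee
    waiting_in: Dict[int, str] = {}   # callees still waiting for a caller
    for source, call_id, party in merged:
        if source == "O":
            if call_id in waiting_in:
                yield party, waiting_in.pop(call_id)
            else:
                waiting_out[call_id] = party
        else:
            if call_id in waiting_out:
                yield waiting_out.pop(call_id), party
            else:
                waiting_in[call_id] = party


merged = [("O", 1, "alice"), ("I", 1, "bob"), ("I", 2, "dave"), ("O", 2, "carol")]
pairs = list(pair_up(merged))
print(pairs)
```

When matching tuples arrive close together, the two dictionaries stay small. If one stream lags far behind the other, the waiting side grows without bound.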

12
Query 3 (group-by aggregation)
  • Total connection time for each caller
    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
    GROUP BY O1.caller
  • Cannot provide result in (append-only) stream
  • Output updates?
  • Provide current value on demand?
  • Memory?
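
The "output updates" option can be sketched as follows: since a caller's total keeps changing, the operator emits a new (caller, total) pair every time one of that caller's calls ends, and a consumer treats the latest pair as the current value. This is an illustrative sketch only; `connection_time_updates` is a made-up name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit (caller, new running total) each time one of the caller's calls ends."""
    starts: Dict[int, float] = {}                   # open calls: call_ID -> start time
    totals: Dict[str, float] = defaultdict(float)   # one entry per distinct caller
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = time
        elif event == "end" and call_id in starts:
            totals[caller] += time - starts.pop(call_id)
            yield caller, totals[caller]


events = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 1.0, "start"),
    (1, "alice", 3.0, "end"),
    (3, "alice", 4.0, "start"),
    (2, "bob", 4.5, "end"),
    (3, "alice", 6.0, "end"),
]
updates = list(connection_time_updates(events))
print(updates)
```

On the memory question, `totals` needs one entry per distinct caller, so its size depends on the number of groups rather than on the length of the stream.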

13
DSMS Architecture Issues
  • Data streams and stored relations: architectural
    differences.
  • Declarative language for registering continuous
    queries
  • Flexible query plans and execution strategies
  • Centralized? Distributed?

14
Agenda
  • Data Streams
  • What are they?
  • Why now? Applications..
  • DSMS Architecture Issues
  • Query Processing

15
DSMS Issues
  • Relation: tuple set or sequence?
  • Updates: modifications or appends?
  • Query answer: exact or approximate?
  • Query evaluation: one or multiple passes?
  • Query plan: fixed or adaptive?

16
Architectural Issues
  • DBMS
      • Resource (memory, disk, per-tuple computation)
        rich
      • Extremely sophisticated query processing,
        analysis
      • Useful to audit query results of data stream
        systems
      • Query evaluation: arbitrary
      • Query plan: fixed
  • DSMS
      • Resource (memory, per-tuple computation) limited
      • Reasonably complex, near-real-time query
        processing
      • Useful to identify what data to populate in
        database
      • Query evaluation: one pass
      • Query plan: adaptive

N. Koudas, D. Srivastava (2003) AT&T Labs-Research
17
STREAM System Challenges
  • Must cope with
  • Stream rates that may be high, variable, bursty
  • Stream data that may be unpredictable, variable
  • Continuous query loads that may be high, variable

18
STREAM System Challenges
  • Must cope with
  • Stream rates that may be high, variable, bursty
  • Stream data that may be unpredictable, variable
  • Continuous query loads that may be high, variable
  • Overload

19
STREAM System Challenges
  • Must cope with
  • Stream rates that may be high, variable, bursty
  • Stream data that may be unpredictable, variable
  • Continuous query loads that may be high, variable
  • Overload: need to use resources very carefully.
  • Changing conditions: adaptive strategy.

20
Query Model
[Figure labels: User/Application, DSMS]
21
Agenda
  • Data Streams
  • What are they?
  • Why now? Applications..
  • DSMS Architecture Issues
  • Query Processing
  • Language
  • Operators
  • Optimization
  • Multi-Query Optimization

22
Stream Query Language
  • SQL extension
  • Queries reference/produce relations or streams
  • Examples: GSQL (Gigascope), CQL (STREAM)

[Figure: Stream or Finite Relation → Stream Query Language → Stream or Finite Relation]
23
Example: Continuous Query Language (CQL)
  • Start with SQL
  • Then add
  • Streams as new data type
  • Continuous instead of one-time semantics
  • Windows on streams (derived from SQL-99)
  • Sampling on streams (basic)

24
Impact of Limited Memory
  • Continuous streams grow unboundedly
  • Queries may require unbounded memory
  • One solution: approximate query evaluation

25
Approximate Query Evaluation
  • Why?
  • Handling load: streams coming too fast
  • Avoid unbounded storage and computation
  • Ad hoc queries need approximate history
  • How? Sliding windows, synopses, samples,
    load shedding
  • Major Issues?
  • Metric for set-valued queries
  • Composition of approximate operators
  • How is it understood/controlled by user?
  • Integrate into query language
  • Query planning and interaction with resource
    allocation
  • Accuracy-efficiency-storage tradeoff and global
    metric
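
As one concrete instance of the "samples" technique, a fixed-size uniform sample of an unbounded stream can be kept with reservoir sampling (Algorithm R). The slides do not say which sampling scheme is meant, so this is a standard technique swapped in for illustration; the name `reservoir_sample` and the parameter `k` are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of size k using O(k) memory (Algorithm R)."""
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:                  # happens with probability k / (i + 1)
                sample[j] = item       # replace a random slot
    return sample


sample = reservoir_sample(range(1000), 10, random.Random(0))
print(len(sample))
```

An approximate answer, such as an estimated average, is then computed over the sample instead of the full history. The accuracy against storage trade-off listed above is controlled here by `k`.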

26
Windows
  • Mechanism for extracting a finite relation from
    an infinite stream
  • Various window proposals for restricting operator
    scope.
  • Windows based on ordering attribute (e.g. time)
  • Windows based on tuple counts
  • Windows based on explicit markers (e.g.
    punctuations)
  • Variants (e.g., partitioning tuples in a window)

[Figure: Stream → (window specifications) → finite relations manipulated using SQL → (streamify) → Stream]
N. Koudas, D. Srivastava (2003) AT&T Labs-Research
27
Windows
  • Terminology

[Figure: timeline from start time to current time with marks t1 to t5, showing a sliding window and a tumbling window]
N. Koudas, D. Srivastava (2003) AT&T Labs-Research
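
The difference between the two window types in the terminology figure can be sketched with a running sum. A sliding window moves with every arrival and overlaps its predecessor; a tumbling window covers fixed, non-overlapping intervals. This is a minimal sketch assuming time-based windows over `(timestamp, value)` tuples in timestamp order; the function names are invented.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional, Tuple

Item = Tuple[int, float]  # (timestamp, value); hypothetical schema


def sliding_sum(stream: Iterable[Item], width: int) -> Iterator[Item]:
    """After each arrival at time t, emit the sum over timestamps in (t - width, t]."""
    window: Deque[Item] = deque()
    total = 0.0
    for t, v in stream:
        window.append((t, v))
        total += v
        while window[0][0] <= t - width:   # expire tuples that slid out
            total -= window.popleft()[1]
        yield t, total


def tumbling_sum(stream: Iterable[Item], width: int) -> Iterator[Item]:
    """Emit (window start, sum) once per window [k*width, (k+1)*width) with data."""
    current: Optional[int] = None
    total = 0.0
    for t, v in stream:
        start = (t // width) * width
        if current is not None and start != current:
            yield current, total           # previous window is complete
            total = 0.0
        current = start
        total += v
    if current is not None:
        yield current, total               # flush; only meaningful for finite input


data = [(1, 1.0), (2, 2.0), (4, 4.0), (7, 8.0)]
sliding = list(sliding_sum(data, 3))
tumbling = list(tumbling_sum(data, 3))
print(sliding)
print(tumbling)
```

The sliding version produces one result per input tuple, while the tumbling version produces one result per window.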
28
Query Operators
  • Selections - Where clause
  • Projections - Select clause
  • Joins - From clause
  • Group-by (aggregations) - Group-by clause

29
Query Operators
  • Selections and projections on streams -
    straightforward
  • Local per-element operators
  • Projection may need to include ordering
    attribute.
  • Joins: problematic
  • May need to join tuples that are arbitrarily far
    apart.
  • Equijoin on stream ordering attributes may be
    tractable.
  • Majority of the work focuses on joins using
    windows.

30
Blocking Operators
  • Blocking
  • No output until entire input seen
  • Streams: input never ends
  • Simple aggregates: output update stream
  • Set output (sort, group-by)
  • Root could maintain output data structure
  • Intermediate nodes try non-blocking analogs
  • Join
  • Apply sliding-window restrictions
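
Applying a sliding-window restriction to a join means each side remembers only its recent tuples, so state stays bounded and results can be produced without waiting for the end of the input. The sketch below assumes a time-based window, an equijoin on a single key, and input that interleaves both streams in timestamp order; `window_join` and the tagged tuple format are made up for the example.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Item = Tuple[str, int, int]  # (source "R" or "S", timestamp, join key)


def window_join(merged: Iterable[Item], width: int) -> Iterator[Tuple[int, int, int]]:
    """Emit (key, r_time, s_time) for R and S tuples with equal keys whose
    timestamps differ by less than `width`."""
    windows: Dict[str, Deque[Tuple[int, int]]] = {"R": deque(), "S": deque()}
    for source, t, key in merged:
        # Expire tuples on both sides that can no longer match anything new.
        for window in windows.values():
            while window and window[0][0] <= t - width:
                window.popleft()
        other = "S" if source == "R" else "R"
        # Probe the opposite window, then remember the new tuple.
        for t2, key2 in windows[other]:
            if key2 == key:
                yield (key, t, t2) if source == "R" else (key, t2, t)
        windows[source].append((t, key))


merged = [("R", 1, 10), ("S", 2, 10), ("S", 3, 20), ("R", 6, 20), ("R", 7, 10)]
joined = list(window_join(merged, 4))
print(joined)
```

The last tuple, R at time 7 with key 10, finds no partner because the matching S tuple from time 2 has already expired. That loss of old matches is the price of bounded state.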

31
Optimization in DSMS
  • Traditionally, table-based cardinalities are used
    in the query optimizer.
  • Goal of the query optimizer: minimize the size of
    intermediate results.
  • Problematic in a streaming environment: all
    streams are unbounded (infinite size)!
  • Need novel optimization objectives that are
    relevant when the input sources are streams.

N. Koudas, D. Srivastava (2003) AT&T Labs-Research
32
Query Optimization in DSMS
  • Novel notions of optimization
  • Stream-rate based, e.g. NiagaraCQ
  • Resource based, e.g. STREAM
  • QoS based, e.g. Aurora
  • Continuous adaptive optimization
  • Possibility that objectives cannot be met
  • Resource constraints
  • Bursty arrivals under limited processing
    capabilities.

N. Koudas, D. Srivastava (2003) AT&T Labs-Research
33
Stream Projects
  • Amazon/Cougar (Cornell): sensors
  • Aurora (Brown/MIT): sensor monitoring, dataflow
  • Hancock (AT&T): telecom streams
  • Niagara (OGI/Wisconsin): Internet XML databases
  • OpenCQ (Georgia Tech): triggers, incremental view
    maintenance
  • STREAM (Stanford): general-purpose DSMS
  • Tapestry (Xerox): pub/sub content-based
    filtering
  • Telegraph (Berkeley): adaptive engine for
    sensors
  • Tribeca (Bellcore): network monitoring

34
Optimizing Multiple Distributed Stream Queries
Using Hierarchical Network Partitions
  • Sangeetha Seshadri
  • Jointly with Vibhore Kumar, Brian F. Cooper,
    Ling Liu and Karsten Schwan
  • College of Computing
  • Georgia Tech
  • Yahoo! Research
  • IPDPS '07
  • March 29th 2007

35
Talk Outline
  • Motivation
  • Challenges
  • Our Approach
  • Experimental Results
  • Future Work

36
Distributed Data Stream Systems
[Figure: sources (flight information, weather, web sources), a centralized DB, and consumers (local weather, travel agent) asking queries such as "Can low-capacity flights be cancelled?" and "What is the status of my flight?"]
37
Motivation
  • Lots of data produced in lots of places
  • Examples: operational information systems,
    scientific collaborations, web traffic data,
    financial applications
  • Centralized processing does not scale

38
Challenges
  • Choosing efficient deployments.
  • Fast and efficient initial deployments.
  • Utilize reuse opportunities.
  • Handling dynamic nature of system.
  • Queries arrive or leave.
  • Nodes join (recover) or leave (fail).
  • Network conditions change.
  • Data conditions (e.g. rate) change.

39
Approach Outline
40
Query Planning
[Figure: alternative join plans for SELECT * FROM A ⋈ B ⋈ C, each delivering results to Sink: (A ⋈ B) ⋈ C and (B ⋈ C) ⋈ A]
41
Query Deployment
[Figure: one deployment of the plan on a network of nodes N1 to N5 with sinks Sink1 to Sink5: sources A, B and C feed operators A ⋈ B and (A ⋈ B) ⋈ C]
42
An Illustrative Example..
SELECT * FROM A ⋈ C
SELECT * FROM A ⋈ B ⋈ C
43
Why an integrated approach?
  • Integrated approach decreases cost by > 50%
  • Setup: 64-node network, 100 queries over 5 stream
    sources each. Y-axis represents communication
    costs.

44
Problem
  • Massive Search Space.
  • Example: 5 stream sources, 64 nodes
  • Approximately 2,880,000,000 plans considered.
  • Lemma 1
  • Our Solution
  • Trade some optimality for smaller search space

45
Solution
  • Organize the nodes into a virtual Network
    Hierarchy.
  • Operator reuse through Stream Advertisements
  • Two approximation based algorithms
  • Top-Down
  • Bottom-Up

46
Optimization Metric
  • Minimize network usage
  • Network usage: total amount of data in transit
    at any point in time.
  • Encapsulates both bandwidth and latency of links.
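
One way to make this metric concrete, assumed here rather than taken from the slides, is to sum data rate × link latency over every link that a stream crosses: a link carrying r units per second with a delay of d seconds holds r·d units in flight. The node names, rates and latencies below are invented to compare two hypothetical placements of a join A ⋈ B.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]
Flow = Tuple[float, List[Link]]  # (data rate, links on the path)


def network_usage(flows: List[Flow], latency: Dict[Link, float]) -> float:
    """Total data in transit: sum of rate * latency over every link crossed."""
    return sum(rate * latency[link] for rate, path in flows for link in path)


# Hypothetical topology: source A at N1, source B at N2, sink at N3.
latency = {("N1", "N2"): 0.5, ("N2", "N3"): 0.25, ("N1", "N3"): 1.0}

# Placement X: join at N2. A travels N1 -> N2; the smaller join result goes to N3.
join_at_n2 = [(100.0, [("N1", "N2")]), (10.0, [("N2", "N3")])]

# Placement Y: join at the sink N3. Both raw streams travel all the way.
join_at_n3 = [(100.0, [("N1", "N3")]), (80.0, [("N2", "N3")])]

usage_x = network_usage(join_at_n2, latency)
usage_y = network_usage(join_at_n3, latency)
print(usage_x, usage_y)
```

With these made-up numbers, joining close to the sources and shipping only the reduced result costs less than shipping both raw streams to the sink.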

47
Network Hierarchy
  • Cluster network nodes based on cost.
  • User-defined parameter maxcs

[Figure label: Coordinator Nodes]
48
Stream Advertisements for Reuse
[Figure: coordinator nodes holding stream advertisements for A, B, C and A ⋈ C]
49
Optimization Algorithms
Top-Down
Bottom-Up
50
Planning algorithms
  • Top down

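[Figure: top-down planning. The query A ⋈ B ⋈ C ⋈ D is split into sub-queries A ⋈ B and C ⋈ D, which are passed down the hierarchy toward sources A, B, C and D]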
51
Top-Down Algorithm Features
  • Reduced search space
  • Search space reduced by a factor β.
  • (h = height of hierarchy, N = network size, K =
    number of sources).
  • User-defined parameter maxcs tunes the
    trade-off between search space and
    sub-optimality.
  • Operators reused when beneficial through stream
    advertisements.

52
Planning algorithms
  • Bottom up

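[Figure: bottom-up planning. The sub-query A ⋈ B is deployed near sources A and B, and the joins with C and D are added higher in the hierarchy until A ⋈ B ⋈ C ⋈ D is complete]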
53
Bottom-Up Algorithm Features
  • Reduced search space.
  • Deploys only sub-queries within current cluster.
  • Analytical bounds: search space reduced by a
    factor β.
  • Operators re-used when beneficial.
  • But, may choose sub-optimal join-orders.

54
Experiments
  • Simulation and prototype based experiments.
  • 128-node network, generated with the GT-ITM
    internetwork topology generator.
  • Uniformly random workload generator: 10 sources,
    100 queries, 2-5 join operators, random sink
    placements.

55
Cost with Bottom-Up Algorithm
56
Comparison with existing approaches
57
Comparison of Search Space
58
Future Work
  • We have built a prototype based on IFLOW, a
    distributed data stream system built at Georgia
    Tech.
  • Aggregations
  • Modifying existing deployments at runtime
  • Relaxing filter conditions
  • Modifying join ordering at runtime.

59
Related Work
  • Distributed query optimization
  • Distributed INGRES, R*, SDD-1
  • Stream data processing engines
  • Centralized - STREAM, Aurora, TelegraphCQ
  • Distributed - Borealis, Flux

60
Conclusion
  • Integrated approach to query optimization
  • Hierarchical clustering of network and stream
    advertisements.
  • Approximation based algorithms
  • Top-Down
  • Bottom-Up
  • Design Highlights
  • Trade some optimality for smaller search space.
  • Decrease search space while offering bounds on
    the sub-optimality.

61
For further information
  • http://www.cc.gatech.edu/sangeeta
  • Contact sangeeta_at_cc.gatech.edu

Thank You!
62
Deployment Times
63
Example
  • Simple use-case for pushing down selections
  • Query 1
    SELECT FLIGHTS.Number, FLIGHTS.Status,
           CARRIER_CODES.Name
    FROM FLIGHTS, CARRIER_CODES
    WHERE FLIGHTS.Departing = 'ATLANTA'
      AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
      AND FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'
  • Query 2
    SELECT FLIGHTS.Number, FLIGHTS.Status,
           CARRIER_CODES.Name
    FROM FLIGHTS, CARRIER_CODES
    WHERE FLIGHTS.Departing = 'ATLANTA'
      AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
      AND FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
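
The two queries differ only in the terminal predicate, so the selection on Departing and the join with CARRIER_CODES can be evaluated once and shared, with the selection applied before the join. The sketch below illustrates that idea in plain Python. The dictionary-based tuples, the sample rows and the name `shared_plan` are assumptions for the example, not part of the system described in the talk.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str]  # (Number, Status, carrier Name)


def shared_plan(flights: Iterable[dict],
                carrier_names: Dict[str, str]) -> Dict[str, List[Row]]:
    """Evaluate Query 1 and Query 2 together, sharing their common work."""
    results: Dict[str, List[Row]] = {"Query 1": [], "Query 2": []}
    for f in flights:
        # Shared selection, applied before the join so fewer tuples reach it.
        if f["Departing"] != "ATLANTA":
            continue
        # Shared join with CARRIER_CODES on Carrier_Code = Code.
        name = carrier_names.get(f["Carrier_Code"])
        if name is None:
            continue
        row = (f["Number"], f["Status"], name)
        # Only the terminal predicate differs between the two queries.
        if f["Departure_terminal"] == "TERMINAL SOUTH":
            results["Query 1"].append(row)
        elif f["Departure_terminal"] == "TERMINAL NORTH":
            results["Query 2"].append(row)
    return results


flights = [
    {"Number": "DL100", "Status": "ON TIME", "Departing": "ATLANTA",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL SOUTH"},
    {"Number": "DL200", "Status": "DELAYED", "Departing": "ATLANTA",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL NORTH"},
    {"Number": "AA300", "Status": "ON TIME", "Departing": "BOSTON",
     "Carrier_Code": "AA", "Departure_terminal": "TERMINAL SOUTH"},
]
carrier_names = {"DL": "Delta", "AA": "American"}
results = shared_plan(flights, carrier_names)
print(results)
```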

64
The Big Picture
  • Large number of possibilities
  • System Model
  • Stream processing systems (SQL-style queries)
  • Pub-sub systems
  • Runtime annotators (keyword-based queries).
  • Trade-offs: cost with
  • Search space
  • Reliability
  • Availability.
  • Adaptivity
  • Admission Control
  • Moving operators
  • Dropping data
  • Migrating plans.

65
Real Enterprise Workload
  • Delta Air Lines operational information system
  • Q1 (15%): Terminal Overhead Display (lifetime:
    12 hours)
  • Q2 (80%): Gate Agent Query (lifetime: 2 hours)
  • Q3 (5%): Ad-hoc flight status monitoring queries
    (lifetime: 6 hours)

66
Real Enterprise Workload
67
Backups
68
Data Model
  • Append-only: call records
  • Updates: stock tickers
  • Deletes: transactional data
  • Meta-data: control signals, punctuations
  • System internals probably need all of the above

69
Aurora/STREAM Overview
[Figure: input streams enter the system; query plans made of running, ready and waiting operators, with their synopses, produce output streams; historical storage sits alongside]
  • Applications register continuous queries
  • Users issue continuous and ad-hoc queries
  • Administrator monitors query execution and
    adjusts run-time parameters
70
Sliding Window Approximation
0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1
0
  • Why?
  • Approximation technique for bounded memory
  • Natural in applications (emphasizes recent data)
  • Well-specified and deterministic semantics
  • Issues
  • Extend relational algebra, SQL, query
    optimization
  • Algorithmic work
  • Timestamps?
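
As a baseline for the bit stream shown on this slide, the number of 1s in the last N bits can be maintained exactly by keeping the window itself, at a cost of O(N) memory. The sketch below does exactly that. It is not an approximation algorithm, the window size of 8 is an arbitrary choice, and `ones_in_window` is a made-up name.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def ones_in_window(bits: Iterable[int], n: int) -> Iterator[int]:
    """After each bit, emit the number of 1s among the most recent n bits."""
    window: Deque[int] = deque(maxlen=n)
    count = 0
    for b in bits:
        if len(window) == n:
            count -= window[0]   # this bit is about to be pushed out
        window.append(b)         # deque(maxlen=n) drops the oldest element
        count += b
        yield count


# The 20 bits from the first line of the slide's bit stream.
bits = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
counts = list(ones_in_window(bits, 8))
print(counts[-1])
```

Bounded-memory techniques aim to answer the same question with far less than N bits of state, in exchange for a controlled error.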

71
Adaptivity (Telegraph)
[Figure: input streams R, S and T pass through grouped filters (R.A) and (S.B) and through STeMs for the join R x S x T, routed by an EDDY to output queues]
  • Runtime Adaptivity
  • Multi-query Optimization
  • Framework implements arbitrary schemes

72
Query-Split Scheme (Niagara)
[Figure: query-split plan. Quotes.XML and a constant table (Symbol, Const.Value) are scanned and joined; a split operator writes to file i and file j, whose scans feed trig.Act.i and trig.Act.j]
  • Aggregate subscription for efficiency
  • Split: evaluate trigger only when file updated
  • Triggers multi-query optimization

73
Shared Predicates Niagara, Telegraph
[Figure: shared index over the predicates on R.A, probed with the tuple A = 8]
Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9
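
The idea behind shared predicates is that many queries filtering on the same attribute can be answered with one index lookup per tuple instead of one test per query. The sketch below uses sorted lists with binary search for the inequality predicates and a set for equality. These structures are stand-ins chosen for illustration, not the structures used in Niagara or Telegraph, and the class name `PredicateIndex` is invented.

```python
import bisect
from typing import Iterable, List


class PredicateIndex:
    """Index over predicates of the form A > c, A < c, A = c and A != c."""

    def __init__(self, gt: Iterable[int], lt: Iterable[int],
                 eq: Iterable[int], ne: Iterable[int]) -> None:
        self.gt = sorted(gt)
        self.lt = sorted(lt)
        self.eq = set(eq)
        self.ne = sorted(ne)

    def matches(self, a: int) -> List[str]:
        """Return the labels of all predicates that the value `a` satisfies."""
        out: List[str] = []
        # a > c holds for every constant strictly below a.
        out += [f"A>{c}" for c in self.gt[:bisect.bisect_left(self.gt, a)]]
        # a < c holds for every constant strictly above a.
        out += [f"A<{c}" for c in self.lt[bisect.bisect_right(self.lt, a):]]
        if a in self.eq:
            out.append(f"A={a}")
        # Inequality predicates are scanned linearly in this sketch.
        out += [f"A!={c}" for c in self.ne if c != a]
        return out


# The predicates listed on the slide, probed with the tuple A = 8.
index = PredicateIndex(gt=[1, 7, 11], lt=[3, 5], eq=[6, 8], ne=[9])
satisfied = index.matches(8)
print(satisfied)
```

For the tuple A = 8, the two binary searches locate the satisfied greater-than and less-than predicates without testing each one individually.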