Semantics and Evaluation Techniques for Window Aggregates in Data Streams

Provided by: JinL8
Learn more at: http://web.cs.wpi.edu
1
Semantics and Evaluation Techniques for Window
Aggregates in Data Streams
  • Jin Li, David Maier, Kristin Tufte, Vassilis
    Papadimos, Peter Tucker

This work was supported by NSF grant IIS 0086002
2
Motivation
  • Traffic <sid, speed, ts>, with ts in hhmmss format
  • Q1: For every minute, find the max speed of the
    past 5 minutes for each sensor.

Input from the traffic sensors (sid, speed, ts):
t1 ( s5, 40, 010630 )
t2 ( s6, 42, 010745 )
t3 ( s5, 45, 010815 )
t4 ( s5, 47, 011010 )
t5 ( s6, 48, 011030 )
t6 ( s6, 46, 011102 )

Windows:
010600 - 011100
010700 - 011200
010800 - 011300

Output (sid, max):
( s5, 47 )
( s6, 48 )
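
As a point of reference, the answer for the first window (010600 to 011100) can be reproduced by a direct scan over buffered tuples. This is only an illustrative sketch: the helper names are hypothetical, and treating the window as half-open (start inclusive, end exclusive) is an assumption that the slide does not state.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an hhmmss timestamp string to seconds since midnight."""
    return int(hhmmss[0:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:6])


# The six example tuples (sid, speed, ts) from the Motivation slide.
TUPLES = [
    ("s5", 40, "010630"),
    ("s6", 42, "010745"),
    ("s5", 45, "010815"),
    ("s5", 47, "011010"),
    ("s6", 48, "011030"),
    ("s6", 46, "011102"),
]


def max_speed_per_sensor(tuples, start, end):
    """Max speed per sensor over tuples with start <= ts < end.

    The half-open interval is an assumption made for this sketch.
    """
    lo, hi = to_seconds(start), to_seconds(end)
    result = {}
    for sid, speed, ts in tuples:
        if lo <= to_seconds(ts) < hi:
            result[sid] = max(result.get(sid, speed), speed)
    return result


first_window = max_speed_per_sensor(TUPLES, "010600", "011100")
print(first_window)
```

Every window that contains a tuple re-reads it here, which is the repeated tuple access and buffering cost that the following slides set out to avoid.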
3
Limitations
  • Window semantics definition and implementation
      • Assumptions on data arrival order
      • Data arrival affects query answer and result
        production
  • Query evaluation performance
      • Space: internal buffer space is needed to hold a window
      • Time: each tuple is accessed multiple times
      • Latency: window aggregate computation is tied
        to window completion

4
Outline
  • WID overview
  • Window semantics definition and its
    implementation in WID
  • Disorder
  • Sharing panes: an optimization technique using
    sub-windows (panes)
  • Conclusion

5
WID Overview
  • Q1:
    SELECT sid, max(speed)
    FROM Traffic
    RANGE 5 minutes
    SLIDE 1 minute
    WATTR ts
    GROUP-BY sid
  • A punctuation is a message embedded in the data
    to indicate the end of a sub-stream

Output (sid, window-id, max):
( s6, 70, 48 )

After tagging with window-id (sid, speed, ts, window-id):
t1 ( s5, 40, 010630, 70-74 )
t2 ( s6, 42, 010745, 70-74 )
t3 ( s5, 45, 010815, 70-74 )
t4 ( s5, 47, 011010, 70-74 )
t5 ( s6, 48, 011030, 70-74 )
p1 ( s6, , , 70 )
t6 ( s6, 46, 011102, 71-75 )

Input (sid, speed, ts):
t1 ( s5, 40, 010630 )
t2 ( s6, 42, 010745 )
t3 ( s5, 45, 010815 )
t4 ( s5, 47, 011010 )
t5 ( s6, 48, 011030 )
p1 ( s6, , 011100 )
t6 ( s6, 46, 011102 )
6
Window Semantics Framework
  • T: the set of all tuples in the input stream
  • S: a window specification
  • W: a set of window-ids
  • windows(T, S) → W
      • Defines the set of window-ids to be used
  • extent(T, S, w) → U ⊆ T, where w ∈ W
      • Specifies which tuples belong to a given window
  • wids(T, S, t) → V ⊆ W, where t ∈ T
      • Determines the set of window-ids to which a tuple
        belongs
      • Is the dual of extent
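
The relationship between the three functions can be sketched in a few lines. The example below uses a deliberately simple, hypothetical specification (tumbling windows of `width` time units, with each tuple reduced to an integer timestamp); it is not a specification from the talk. Here wids is derived from extent, which is one way to read "dual": a window-id w is in wids(t) exactly when t is in extent(w).

```python
def windows(T, width):
    """Window-ids in use: one id per block of `width` time units (toy spec)."""
    return {t // width for t in T}


def extent(T, width, w):
    """Tuples of T that belong to window w."""
    return {t for t in T if t // width == w}


def wids(T, width, t):
    """Dual of extent: the window-ids whose extent contains tuple t."""
    return {w for w in windows(T, width) if t in extent(T, width, w)}


T = {3, 7, 12, 18, 25}   # tuples, represented only by their timestamps
W = windows(T, 10)

# t is in extent(w) exactly when w is in wids(t).
dual_holds = all(
    (t in extent(T, 10, w)) == (w in wids(T, 10, t)) for t in T for w in W
)
print(sorted(W), sorted(extent(T, 10, 1)), dual_holds)
```

An operator that works tuple-at-a-time would want wids in closed form rather than derived from extent as done here, since extent needs the whole window to be available.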

7
Defining Window Semantics - sliding window
Q1: SELECT sid, max(speed) FROM Traffic
    RANGE 5 minutes SLIDE 1 minute
    WATTR ts GROUP-BY sid
For t4 (s5, 47, 011010):
wids(t4, T, S[5, 1, ts])
    = { w ∈ W | t4.ts / 1 - 1 < w ≤ (t4.ts + 5) / 1 - 1 }
    = { w ∈ W | 69.17 < w ≤ 74.17 }
    = { w ∈ W | 70 ≤ w ≤ 74 }
where t4.ts is 011010, i.e., minute 70.17
T: the set of all tuples in the input stream; S: a window specification; W: a set of window-ids
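
The arithmetic for t4 can be replayed with a small function. This is a sketch under stated assumptions: the function names are hypothetical, timestamps are converted from hhmmss to fractional minutes, and the bounds are read as "lower exclusive, upper inclusive", matching the 69.17 and 74.17 endpoints worked out for t4.

```python
import math


def minutes(hhmmss: str) -> float:
    """Convert an hhmmss timestamp to fractional minutes since midnight."""
    return int(hhmmss[0:2]) * 60 + int(hhmmss[2:4]) + int(hhmmss[4:6]) / 60


def sliding_wids(ts, range_, slide):
    """Window-ids w with ts/slide - 1 < w <= (ts + range_)/slide - 1."""
    lower = ts / slide - 1              # exclusive bound
    upper = (ts + range_) / slide - 1   # inclusive bound
    return list(range(math.floor(lower) + 1, math.floor(upper) + 1))


# t4 = (s5, 47, 011010) under RANGE 5 minutes, SLIDE 1 minute.
t4_wids = sliding_wids(minutes("011010"), 5, 1)
print(t4_wids)
```

No state is consulted: the window-ids depend only on the tuple's own timestamp and on the RANGE and SLIDE values.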
8
Window Semantics Implementation in WID - sliding window
SELECT sid, max(speed) FROM Traffic
RANGE 5 minutes SLIDE 1 minute
WATTR ts GROUP BY sid

Output (sid, window-id, max):
( s5, 70, 40 )

After Bucket (sid, speed, ts, window-id):
t1 ( s5, 40, 010630, 70-74 )
p1 ( s5, , , 70 )

Input (sid, speed, ts):
t1 ( s5, 40, 010630 )
p1 ( s5, , 011100 )

  1. Bucket implements the wids function
  2. Bucket for sliding windows is stateless
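
The pipeline on this slide can be approximated as two small functions: a stateless bucket that tags items with window-ids, and an aggregate that keeps one running max per (window-id, sid) and emits results when a punctuation arrives. This is a sketch, not the authors' implementation. All names are hypothetical, and it assumes that a punctuation (sid, ts) promises no further tuple for that sensor with a timestamp below ts. Window-ids come from the closed-form wids for sliding windows, which need not reproduce the exact tag ranges printed in the slide's example.

```python
import math

RANGE_MIN, SLIDE_MIN = 5, 1   # RANGE 5 minutes, SLIDE 1 minute


def minutes(hhmmss: str) -> float:
    """Convert an hhmmss timestamp to fractional minutes since midnight."""
    return int(hhmmss[0:2]) * 60 + int(hhmmss[2:4]) + int(hhmmss[4:6]) / 60


def bucket(item):
    """Stateless tagging: data gets a window-id range, punctuation a window-id."""
    kind, sid, speed, ts = item
    m = minutes(ts)
    if kind == "punct":
        # Assumed meaning: every window ending at or before ts is complete.
        return ("punct", sid, None, math.floor(m / SLIDE_MIN) - 1)
    first = math.floor(m / SLIDE_MIN - 1) + 1
    last = math.floor((m + RANGE_MIN) / SLIDE_MIN - 1)
    return ("data", sid, speed, (first, last))


def window_max(stream):
    """Max(speed) grouped on (window-id, sid); emits on punctuation."""
    state = {}     # (window-id, sid) -> running max
    results = []   # emitted (sid, window-id, max) rows
    for kind, sid, speed, wid in map(bucket, stream):
        if kind == "data":
            first, last = wid
            for w in range(first, last + 1):
                state[(w, sid)] = max(state.get((w, sid), speed), speed)
        else:
            closed = sorted(k for k in state if k[1] == sid and k[0] <= wid)
            for w, s in closed:
                results.append((s, w, state.pop((w, s))))
    return results


stream = [("data", "s5", 40, "010630"), ("punct", "s5", None, "011100")]
rows = window_max(stream)
print(rows[-1])
```

Only the running max per open window is kept, so no tuple is buffered or revisited.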
9
Defining Window Semantics - partitioned window
Q2: SELECT sid, max(speed) FROM Traffic
    RANGE 1000 rows SLIDE 100 rows
    WATTR row-num PATTR sid
T: the set of all tuples in the input stream; S: a window specification; W: a set of window-ids
windows(T, S[RANGE, SLIDE, row-num, PATTR]) =
    { (i, p) | i ∈ {0, 1, 2, …}, p ∈ T.PATTR }
extent((i, p), T, S[RANGE, SLIDE, row-num, PATTR]) =
    { t ∈ T | t.PATTR = p,
      ((i+1) * SLIDE) - RANGE ≤ rank(t.row-num, PATTR, T) < (i+1) * SLIDE }
10
Defining Window Semantics - partitioned window
(cont.)
wids(t, T, S[RANGE, SLIDE, row-num, PATTR]) =
    { (i, p) ∈ W | t.PATTR = p,
      r / SLIDE - 1 < i ≤ (r + RANGE) / SLIDE - 1 }
where r = rank(t, row-num, PATTR, T)
Q2: SELECT sid, max(speed) FROM Traffic
    RANGE 1000 rows SLIDE 100 rows
    WATTR row-num PATTR sid
T: the set of all tuples in the input stream; S: a window specification; W: a set of window-ids
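
The partitioned wids can be cross-checked against the extent condition by brute force. The sketch below is an illustration only: the function names and the sample rank of 307 are hypothetical, and the bounds assume a strict lower and an inclusive upper comparison, which is the reading that agrees with an extent of (i+1) * SLIDE - RANGE ≤ rank < (i+1) * SLIDE.

```python
def partitioned_wids(r, p, range_rows, slide_rows):
    """Window-ids (i, p) with r/SLIDE - 1 < i <= (r + RANGE)/SLIDE - 1.

    r is the tuple's rank within partition p (a non-negative integer).
    """
    first = r // slide_rows                      # smallest i above r/SLIDE - 1
    last = (r + range_rows) // slide_rows - 1    # largest i within the bound
    return [(i, p) for i in range(first, last + 1)]


def in_extent(i, r, range_rows, slide_rows):
    """Extent condition for window i, expressed on the rank r."""
    return (i + 1) * slide_rows - range_rows <= r < (i + 1) * slide_rows


# Q2 parameters: RANGE 1000 rows, SLIDE 100 rows; rank 307 is a made-up value.
ids = partitioned_wids(307, "s5", 1000, 100)
print(ids[0], ids[-1])

# wids and extent agree for every rank and window-id in a test range.
consistent = all(
    in_extent(i, r, 1000, 100) == ((i, "s5") in partitioned_wids(r, "s5", 1000, 100))
    for r in range(0, 2500)
    for i in range(0, 40)
)
print(consistent)
```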
11
Window Semantics Implementation in WID - partitioned window
SELECT sid, max(speed) FROM Traffic
RANGE 1000 rows SLIDE 100 rows
WATTR row-num PATTR sid

Query plan (output at top, input at bottom):

  (sid, window-id, max)
  ( s5, 3, 47 )
Max(speed) (group on window-id, sid)
  (sid, window-id, speed, row-num)
  t1 ( s5, 3-12, 47, 507 )
  p1 ( s5, 3, , )
bucket: RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid
  (sid, speed, row-num)
  t1 ( s5, 47, 507 )
streamscan

  1. Bucket generates punctuations
  2. Bucket for partitioned windows maintains state
    (a count for each partition)
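
A stateful bucket for partitioned windows might look like the following. This is a hedged sketch rather than the operator from the talk: the class and method names are hypothetical, ranks are taken to be 0-based per-partition arrival counts (which assumes tuples of a partition arrive in row-num order), and a punctuation is emitted as soon as the last rank of a window has been seen.

```python
from collections import defaultdict


class PartitionedBucket:
    """Tags tuples with window-id ranges and emits window-id punctuations."""

    def __init__(self, range_rows, slide_rows):
        self.range_rows = range_rows
        self.slide_rows = slide_rows
        self.counts = defaultdict(int)   # the per-partition state

    def process(self, sid, speed, row_num):
        r = self.counts[sid]             # 0-based rank within the partition
        self.counts[sid] = r + 1
        first = r // self.slide_rows
        last = (r + self.range_rows) // self.slide_rows - 1
        out = [("data", sid, (first, last), speed, row_num)]
        # Window `first` covers ranks below (first + 1) * SLIDE, so it is
        # complete once the partition has delivered that many tuples.
        if (r + 1) % self.slide_rows == 0:
            out.append(("punct", sid, first))
        return out


# Small parameters (RANGE 10 rows, SLIDE 5 rows) keep the trace short.
b = PartitionedBucket(range_rows=10, slide_rows=5)
trace = []
for row_num, sid in enumerate(["s5", "s5", "s6", "s5", "s5", "s5", "s5"], start=1):
    trace.extend(b.process(sid, 47, row_num))
print([item for item in trace if item[0] == "punct"])
```

The single s6 tuple does not advance the s5 count, which is why the state has to be kept per partition.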
12
WID Advantages
  • Window semantics definition
      • Separated from physical implementation and data
        arrival order
      • Flexible: covers a variety of windows, e.g.,
        sliding, tumbling, landmark, time-based,
        tuple-based; allows a user-specified windowing
        attribute
  • Implementation of query evaluation
      • Window semantics localized in Bucket
      • Insensitive to data arrival order
      • Punctuation can guarantee progress
      • Gaps in tuple arrival need not affect result
        production
      • Performance gains in space, execution time and
        latency

13
WID vs. Buffering: execution time comparison (overview)
14
WID vs. Buffering: execution time comparison (zoom-in)
15
Outline
  • WID overview
  • Window semantics definition and its
    implementation in WID
  • Disorder
  • Sharing panes: an optimization technique using
    sub-windows
  • Conclusion

16
Sources of Disorder
  • Merging different data sources
  • Varying network transmission delays
  • Data prioritization
  • Query processing algorithms, e.g., shared window
    joins (Hammad et al.)
  • Multiple possible windowing attributes, e.g., two
    timestamps

17
Handling Disorder
  • Generally dealt with by buffering
      • Slack, BSort in Aurora
      • Output buffering in a shared-window join
  • Punctuation and window-id
  • Heartbeat

18
Disorder Handling - WID
  • Q1:
    SELECT sid, max(speed)
    FROM Traffic
    RANGE 5 minutes
    SLIDE 1 minute
    WATTR ts
    GROUP-BY sid

Output (sid, window-id, max):
( s6, 70, 48 )

After bucket (sid, speed, ts, window-id):
p1 ( s6, , , 70 )
t6 ( s6, 46, 011102, 71-75 )
t7 ( s5, 52, 011015, 70-74 )

Input (sid, speed, ts):
p1 ( s6, , 011100 )
t7 ( s5, 52, 011015 )
t6 ( s6, 46, 011102 )
19
Outline
  • WID overview
  • Window semantics definition and its
    implementation in WID
  • Disorder
  • Sharing panes: an optimization technique using
    sub-windows
  • Conclusion

20
Sharing Panes
Q3: SELECT sid, count(*) FROM Traffic
    RANGE 4 minutes SLIDE 1 minute
    WATTR ts GROUP BY sid
21
Pane Implementation
SELECT sid, count(*) FROM Traffic
RANGE 4 minutes SLIDE 1 minute
WATTR ts GROUP BY sid

Query plan (output at top, input at bottom):

sum() (group on window-id, sid)
  (sid, ts, pane-id, count, window-id)
  m0 ( s5, 011010, 70, 8, 70-74 )
bucket B2 as window-id: RANGE 4 SLIDE 1 WATTR pane-id
  (sid, ts, pane-id, count)
  m0 ( s5, 011010, 70, 8 )
count(*) (group on pane-id, sid)
  (sid, speed, ts, pane-id)
  t1 ( s5, 47, 011010, 70-70 )
  p1 ( s5, , , 70 )
  t2 ( s6, 48, 011030, 70-70 )
bucket B1 as pane-id: RANGE 1 min SLIDE 1 min WATTR ts
  (sid, speed, ts)
  t1 ( s5, 47, 011010 )
  p1 ( s5, , 011100 )
  t2 ( s6, 48, 011030 )
streamscan
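
The two-level plan can be mimicked in a few lines: count tuples per one-minute pane, then sum pane counts into each window that covers the pane, and compare with counting per window directly. This is a sketch under assumptions: names are hypothetical, timestamps are given directly in fractional minutes, and pane-ids and window-ids both follow the sliding-window wids formula, so a pane p feeds windows p through p + 3 for RANGE 4. The sample data is made up.

```python
import math
from collections import Counter

RANGE_MIN = 4   # RANGE 4 minutes, SLIDE 1 minute


def pane_id(ts_min):
    """Bucket B1 (RANGE 1, SLIDE 1): each tuple falls into exactly one pane."""
    return math.floor(ts_min)


def counts_via_panes(tuples):
    """count(*) per pane, then a sum of pane counts per (window-id, sid)."""
    pane_counts = Counter((pane_id(ts), sid) for sid, ts in tuples)
    window_counts = Counter()
    for (p, sid), c in pane_counts.items():
        for w in range(p, p + RANGE_MIN):    # bucket B2 over pane-id
            window_counts[(w, sid)] += c
    return window_counts


def counts_direct(tuples):
    """count(*) per (window-id, sid), touching every tuple once per window."""
    window_counts = Counter()
    for sid, ts in tuples:
        first = math.floor(ts - 1) + 1
        last = math.floor(ts + RANGE_MIN - 1)
        for w in range(first, last + 1):
            window_counts[(w, sid)] += 1
    return window_counts


# Made-up (sid, ts in minutes) sample.
sample = [("s5", 70.17), ("s5", 70.5), ("s6", 70.5), ("s5", 71.2), ("s5", 72.0)]
via_panes = counts_via_panes(sample)
print(via_panes[(71, "s5")], via_panes == counts_direct(sample))
```

With panes, each tuple updates one pane counter, and the per-window work scales with the number of panes rather than the number of tuples.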
22
When are panes better than windows?
SELECT sid, max() FROM Traffic
RANGE X rows SLIDE Y rows
WATTR row-num GROUP BY sid
  1. Panes are better when the cost ratio is less than 1
  2. The number of tuples per pane affects whether
    using panes is better

23
Conclusion and Future Work
  • Conclusion
      • A framework for defining window semantics
      • A one-pass, non-buffering, disorder-tolerant
        query evaluation technique
      • Initial investigation of disorder
      • Sharing panes
  • Future work
      • Disorder-tolerant window join
      • Sharing panes among multiple aggregate queries

24
Related Work
  • STREAM @ Stanford: Heartbeat, Sub-aggregation
  • TelegraphCQ @ Berkeley: Sliding window aggregates
  • Aurora/Borealis @ Brown/MIT/Brandeis: Slack
  • Gigascope @ AT&T: Ordering Update Token,
    Sub-aggregation