Title: Semantics and Evaluation Techniques for Window Aggregates in Data Streams
1. Semantics and Evaluation Techniques for Window Aggregates in Data Streams
- Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker
This work was supported by NSF grant IIS 0086002
2. Motivation
- Traffic sensors produce the stream Traffic <sid, speed, ts>, with ts in hhmmss format.
- Q1: For every minute, find the max speed of the past 5 minutes for each sensor.
- Input tuples (sid, speed, ts):
  t1 (s5, 40, 010630)
  t2 (s6, 42, 010745)
  t3 (s5, 45, 010815)
  t4 (s5, 47, 011010)
  t5 (s6, 48, 011030)
  t6 (s6, 46, 011102)
- Windows: 010600 to 011100, 010700 to 011200, 010800 to 011300
- Output (sid, max) for the window 010600 to 011100: (s5, 47), (s6, 48)
3. Limitations
- Window semantics definition and implementation
  - Assumptions on data arrival order
  - Data arrival affects query answer and result production
- Query evaluation performance
  - Space: internal buffer space is needed to hold a window
  - Time: each tuple is accessed multiple times
  - Latency: window aggregate computation is tied to window completion
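The space and time costs listed above can be seen in a naive buffering evaluator for Q1. The following is a minimal sketch, not the evaluator of any particular system: the function names `to_seconds` and `buffered_window_max` are hypothetical, and windows are assumed to be half-open intervals ending at the given hhmmss boundaries.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an hhmmss timestamp such as '010630' to seconds since midnight."""
    return int(hhmmss[0:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:6])


def buffered_window_max(buffer, window_ends, range_s):
    """Naive evaluation: rescan the whole buffer once for every window."""
    results = {}
    accesses = 0
    for end in window_ends:
        end_s = to_seconds(end)
        best = {}
        for sid, speed, ts in buffer:
            accesses += 1  # each buffered tuple is touched once per window
            if end_s - range_s <= to_seconds(ts) < end_s:
                best[sid] = max(best.get(sid, speed), speed)
        results[end] = best
    return results, accesses


# Tuples t1..t6 from the Motivation slide.
TRAFFIC = [
    ("s5", 40, "010630"), ("s6", 42, "010745"), ("s5", 45, "010815"),
    ("s5", 47, "011010"), ("s6", 48, "011030"), ("s6", 46, "011102"),
]

results, accesses = buffered_window_max(
    TRAFFIC, ["011100", "011200", "011300"], range_s=300
)
print(results["011100"])  # {'s5': 47, 's6': 48}
print(accesses)           # 18: six tuples scanned for each of three windows
```

The buffer must hold every tuple of a window, each tuple is read once per overlapping window, and no result is available until the window boundary has passed.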
4. Outline
- WID overview
- Window semantics definition and its implementation in WID
- Disorder
- Sharing panes: an optimization technique using sub-windows (panes)
- Conclusion
5. WID Overview
- Q1:
  SELECT sid, max(speed)
  FROM Traffic
  RANGE 5 minutes
  SLIDE 1 minute
  WATTR ts
  GROUP-BY sid
- A punctuation is a message embedded in the data to indicate the end of a sub-stream.
- Input stream (sid, speed, ts):
  t1 (s5, 40, 010630)
  t2 (s6, 42, 010745)
  t3 (s5, 45, 010815)
  t4 (s5, 47, 011010)
  t5 (s6, 48, 011030)
  p1 (s6, , 011100)
  t6 (s6, 46, 011102)
- Tag window-id, producing (sid, speed, ts, window-id):
  t1 (s5, 40, 010630, 70-74)
  t2 (s6, 42, 010745, 70-74)
  t3 (s5, 45, 010815, 70-74)
  t4 (s5, 47, 011010, 70-74)
  t5 (s6, 48, 011030, 70-74)
  p1 (s6, , , 70)
  t6 (s6, 46, 011102, 71-75)
- Output (sid, window-id, max): (s6, 70, 48)
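The aggregation step of this overview can be sketched as a one-pass operator that keeps a running max per (sid, window-id) and emits a result when a punctuation closes a window. This is an illustrative sketch only: the function name `wid_max` and the message encoding are hypothetical, the window-id ranges are copied from the slide as given, and a punctuation `(sid, w)` is assumed to mean that no further tuples for that sid will carry window-id `w`.

```python
from collections import defaultdict


def wid_max(stream):
    """One-pass max aggregate grouped on (sid, window-id)."""
    state = defaultdict(dict)  # sid -> {window-id: running max}
    results = []
    for kind, sid, payload in stream:
        if kind == "tuple":
            speed, (first, last) = payload
            for wid in range(first, last + 1):
                current = state[sid].get(wid)
                state[sid][wid] = speed if current is None else max(current, speed)
        else:
            # Punctuation: window `payload` is complete for this sid,
            # so its result can be emitted and its state discarded.
            wid = payload
            if wid in state[sid]:
                results.append((sid, wid, state[sid].pop(wid)))
    return results


# Tagged tuples and punctuation as listed on the slide.
TAGGED = [
    ("tuple", "s5", (40, (70, 74))),  # t1
    ("tuple", "s6", (42, (70, 74))),  # t2
    ("tuple", "s5", (45, (70, 74))),  # t3
    ("tuple", "s5", (47, (70, 74))),  # t4
    ("tuple", "s6", (48, (70, 74))),  # t5
    ("punct", "s6", 70),              # p1
    ("tuple", "s6", (46, (71, 75))),  # t6
]

print(wid_max(TAGGED))  # [('s6', 70, 48)]
```

No tuple is buffered: each one updates the partial aggregates of its windows and is then dropped.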
6. Window Semantics Framework
- T: the set of all tuples in the input stream
- S: a window specification
- W: a set of window-ids
- $windows(T, S) \to W$
  - Defines the set of window-ids to be used
- $extent(T, S, w) \to U \subseteq T$, where $w \in W$
  - Specifies which tuples belong to a given window
- $wids(T, S, t) \to V \subseteq W$, where $t \in T$
  - Determines the set of window-ids to which a tuple belongs
  - Is the dual of extent
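The three functions and the duality between extent and wids can be illustrated for a time-based sliding window. The sketch below is an assumption-laden example rather than the framework's definition: tuples are reduced to their windowing attribute in seconds, S is encoded as a hypothetical `(range, slide)` pair, `windows` is assumed to return only the window-ids that contain at least one tuple, and `extent` uses the half-open interval obtained by inverting the sliding-window wids formula.

```python
def wids(T, S, t):
    """Window-ids that tuple t belongs to (T is unused for time-based windows)."""
    range_s, slide_s = S
    first = t // slide_s
    last = (t + range_s) // slide_s - 1
    return range(first, last + 1)


def extent(T, S, w):
    """Tuples of T that fall into window w."""
    range_s, slide_s = S
    end = (w + 1) * slide_s
    return [t for t in T if end - range_s <= t < end]


def windows(T, S):
    """Window-ids in use: here, those containing at least one tuple (assumption)."""
    return sorted({w for t in T for w in wids(T, S, t)})


S = (300, 60)                             # RANGE 5 minutes, SLIDE 1 minute
T = [3990, 4065, 4095, 4210, 4230, 4262]  # t1..t6 timestamps in seconds

print(list(wids(T, S, 4210)))  # [70, 71, 72, 73, 74]
print(extent(T, S, 70))        # [3990, 4065, 4095, 4210, 4230]

# Duality: t is in extent(w) exactly when w is in wids(t).
dual = all((t in extent(T, S, w)) == (w in wids(T, S, t))
           for t in T for w in range(60, 80))
print(dual)  # True
```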
7. Defining Window Semantics - sliding window
Q1: SELECT sid, max(speed) FROM Traffic
    RANGE 5 minutes SLIDE 1 minute
    WATTR ts GROUP-BY sid
For t4 (s5, 47, 011010):
$wids(t_4, T, S[5, 1, ts]) = \{ w \in W \mid t_4.ts/1 - 1 < w \le (t_4.ts + 5)/1 - 1 \}$
$= \{ w \in W \mid 69.17 < w \le 74.17 \}$
$= \{ w \in W \mid 70 \le w \le 74 \}$
where t4.ts is 011010, i.e., minute 70.17.
T: the set of all tuples in the input stream; S: a window specification; W: a set of window-ids
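As a second worked instance of the same formula, consider t6 (s6, 46, 011102) from the running example. Its timestamp 011102 corresponds to minute 71 + 2/60, rounded here to 71.03; the rounding is only for display.

```latex
\begin{align*}
wids(t_6, T, S[5, 1, ts])
  &= \{\, w \in W \mid t_6.ts/1 - 1 < w \le (t_6.ts + 5)/1 - 1 \,\} \\
  &= \{\, w \in W \mid 70.03 < w \le 75.03 \,\} \\
  &= \{\, w \in W \mid 71 \le w \le 75 \,\}
\end{align*}
```

This agrees with the window-id range 71-75 shown for t6 in the WID overview.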
8. Window Semantics Implementation in WID - sliding window
Query: SELECT sid, max(speed) FROM Traffic
       RANGE 5 minutes SLIDE 1 minute
       WATTR ts GROUP BY sid
- Bucket implements the wids function
- Bucket for sliding windows is stateless
Data flow:
- Input (sid, speed, ts): t1 (s5, 40, 010630), p1 (s5, , 011100)
- After bucket (sid, speed, ts, window-id): t1 (s5, 40, 010630, 70-74), p1 (s5, , , 70)
- Output (sid, window-id, max): (s5, 70, 40)
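A stateless bucket for sliding windows can be sketched as a generator that applies the wids formula to each tuple independently and rewrites timestamp punctuations into window-id punctuations. The names `bucket` and `to_seconds` and the message layout are hypothetical, and a punctuation with bound `ts` is assumed to promise that no later tuple for that sid has a timestamp before `ts`. The example uses t4, the tuple from the worked wids computation.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an hhmmss timestamp to seconds since midnight."""
    return int(hhmmss[0:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:6])


def bucket(stream, range_s=300, slide_s=60):
    """Tag tuples with a window-id range; keeps no state between messages."""
    for kind, sid, speed, ts in stream:
        sec = to_seconds(ts)
        if kind == "tuple":
            first = sec // slide_s                 # smallest w with ts/SLIDE - 1 < w
            last = (sec + range_s) // slide_s - 1  # largest w with w <= (ts+RANGE)/SLIDE - 1
            yield ("tuple", sid, speed, ts, (first, last))
        else:
            # Every window ending at or before the bound is complete;
            # the newest such window-id is emitted as the punctuation.
            yield ("punct", sid, None, None, sec // slide_s - 1)


tagged = list(bucket([
    ("tuple", "s5", 47, "011010"),    # t4
    ("punct", "s5", None, "011100"),  # p1
]))
print(tagged[0])  # ('tuple', 's5', 47, '011010', (70, 74))
print(tagged[1])  # ('punct', 's5', None, None, 70)
```

Because the output for a message depends only on that message and the window specification, the operator needs no buffer and no per-group state.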
9. Defining Window Semantics - partitioned window
Q2: SELECT sid, max(speed) FROM Traffic
    RANGE 1000 rows SLIDE 100 rows
    WATTR row-num PATTR sid
$windows(T, S[RANGE, SLIDE, \text{row-num}, PATTR]) = \{ (i, p) \mid i \in \{0, 1, 2, \ldots\},\ p \in T.PATTR \}$
$extent((i, p), T, S[RANGE, SLIDE, \text{row-num}, PATTR]) = \{ t \in T \mid t.PATTR = p,\ (i+1) \cdot SLIDE - RANGE \le rank(t.\text{row-num}, PATTR, T) < (i+1) \cdot SLIDE \}$
T: the set of all tuples in the input stream; S: a window specification; W: a set of window-ids
10. Defining Window Semantics - partitioned window (cont.)
$wids(t, T, S[RANGE, SLIDE, \text{row-num}, PATTR]) = \{ (i, p) \in W \mid t.PATTR = p,\ r/SLIDE - 1 < i \le (r + RANGE)/SLIDE - 1 \}$
where $r = rank(t, \text{row-num}, PATTR, T)$
Q2: SELECT sid, max(speed) FROM Traffic
    RANGE 1000 rows SLIDE 100 rows
    WATTR row-num PATTR sid
T: the set of all tuples in the input stream; S: a window specification; W: a set of window-ids
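The wids definition for partitioned windows follows from the extent definition by solving its bounds for i. The derivation below is a sketch that assumes the lower bound in extent is inclusive and the upper bound strict, and writes r for the rank of t within its partition.

```latex
\begin{align*}
t \in extent((i,p), T, S)
  &\iff t.PATTR = p \;\wedge\; (i+1)\cdot SLIDE - RANGE \le r < (i+1)\cdot SLIDE \\
  &\iff t.PATTR = p \;\wedge\; r/SLIDE - 1 < i \le (r + RANGE)/SLIDE - 1 \\
  &\iff (i,p) \in wids(t, T, S)
\end{align*}
% Example with RANGE = 1000, SLIDE = 100 and a hypothetical rank r = 350:
% 2.5 < i <= 12.5, so i ranges over 3, ..., 12.
```

Any rank between 300 and 399 yields the same ten window-ids, 3 through 12.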
11. Window Semantics Implementation in WID - partitioned window
Query: SELECT sid, max(speed) FROM Traffic
       RANGE 1000 rows SLIDE 100 rows
       WATTR row-num PATTR sid
- Bucket generates punctuations
- Bucket for partitioned windows maintains state (a count for each partition)
Query plan, bottom to top:
- streamscan, producing (sid, speed, row-num): t1 (s5, 47, 507)
- bucket (RANGE 1000 rows, SLIDE 100 rows, WATTR row-num, PATTR sid), producing (sid, window-id, speed, row-num): t1 (s5, 3-12, 47, 507), p1 (s5, 3, , )
- max(speed) (group on window-id, sid), producing (sid, window-id, max): (s5, 3, 47)
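A stateful bucket for partitioned windows might look like the following sketch. It is not the WID implementation: the class name `PartitionedBucket` is hypothetical, ranks are assumed to be 0-based, and tuples within a partition are assumed to arrive in row-num order so that the per-partition count equals the rank. Under those assumptions the bucket can generate a punctuation itself whenever a partition's count reaches a multiple of SLIDE.

```python
from collections import defaultdict


class PartitionedBucket:
    """Tags tuples with window-id ranges using one counter per partition."""

    def __init__(self, range_rows, slide_rows):
        self.range_rows = range_rows
        self.slide_rows = slide_rows
        self.counts = defaultdict(int)  # tuples seen so far, per partition

    def process(self, sid, speed, row_num):
        r = self.counts[sid]  # 0-based rank within the partition (assumption)
        self.counts[sid] += 1
        first = r // self.slide_rows
        last = (r + self.range_rows) // self.slide_rows - 1
        out = [("tuple", sid, (first, last), speed, row_num)]
        if self.counts[sid] % self.slide_rows == 0:
            # All ranks below (closed + 1) * SLIDE have been seen,
            # so window `closed` is complete for this partition.
            closed = self.counts[sid] // self.slide_rows - 1
            out.append(("punct", sid, closed))
        return out


b = PartitionedBucket(range_rows=1000, slide_rows=100)
for k in range(350):                 # 350 earlier s5 tuples (made-up data)
    b.process("s5", 0, k)
tagged = b.process("s5", 47, 507)    # row-num 507 is taken from the slide
print(tagged)  # [('tuple', 's5', (3, 12), 47, 507)]
```

Unlike the sliding-window bucket, this operator cannot be stateless, because the rank of a tuple depends on how many tuples of the same partition preceded it.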
12. WID Advantages
- Window semantics definition
  - Separated from physical implementation and data arrival order
  - Flexible: covers varieties of windows, e.g., sliding, tumbling, landmark, time-based, tuple-based; allows a user-specified windowing attribute
- Implementation of query evaluation
  - Window semantics localized in Bucket
  - Insensitive to data arrival order
  - Punctuation can guarantee progress
  - Gaps in tuple arrival need not affect result production
  - Performance gains in space, execution time and latency
13. WID vs. Buffering: execution time comparison (overview)
14. WID vs. Buffering: execution time comparison (zoom-in)
15. Outline
- WID overview
- Window semantics definition and its implementation in WID
- Disorder
- Sharing panes: an optimization technique using sub-windows
- Conclusion
16. Sources of Disorder
- Merging different data sources
- Varying network transmission delays
- Data prioritization
- Query processing algorithms, e.g., shared window joins [Hammad et al.]
- Multiple possible windowing attributes, e.g., two timestamps
17. Handling Disorder
- Generally dealt with by buffering
- Slack BSort in Aurora
- Output buffering in a shared-window join
- Punctuation Window-id
- Heartbeat
18. Disorder Handling - WID
- Q1:
  SELECT sid, max(speed)
  FROM Traffic
  RANGE 5 minutes
  SLIDE 1 minute
  WATTR ts
  GROUP-BY sid
- Input stream (sid, speed, ts):
  p1 (s6, , 011100)
  t7 (s5, 52, 011015)
  t6 (s6, 46, 011102)
- After bucket (sid, speed, ts, window-id):
  p1 (s6, , , 70)
  t7 (s5, 52, 011015, 70-74)
  t6 (s6, 46, 011102, 71-75)
- Output (sid, window-id, max): (s6, 70, 48)
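The insensitivity to arrival order can be checked with a small experiment: a one-pass aggregate keyed on (sid, window-id) reaches the same state for every permutation of its input. This is a sketch under assumptions, with hypothetical function names; it uses the sliding-window wids formula from the earlier slide and omits punctuation handling to keep the example short.

```python
from itertools import permutations


def to_seconds(hhmmss: str) -> int:
    """Convert an hhmmss timestamp to seconds since midnight."""
    return int(hhmmss[0:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:6])


def wid_max(tuples, range_s=300, slide_s=60):
    """Running max per (sid, window-id); the window-id depends only on ts."""
    state = {}
    for sid, speed, ts in tuples:
        sec = to_seconds(ts)
        for wid in range(sec // slide_s, (sec + range_s) // slide_s):
            key = (sid, wid)
            state[key] = max(state.get(key, speed), speed)
    return state


TUPLES = [
    ("s5", 47, "011010"),  # t4
    ("s6", 48, "011030"),  # t5
    ("s6", 46, "011102"),  # t6
    ("s5", 52, "011015"),  # t7, shown arriving late on the slide
]

reference = wid_max(TUPLES)
same = all(wid_max(order) == reference for order in permutations(TUPLES))
print(same)                   # True
print(reference[("s5", 70)])  # 52
print(reference[("s6", 70)])  # 48
```

The late tuple t7 still lands in s5's window 70 because its window-ids are computed from its own timestamp, not from its position in the stream.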
19. Outline
- WID overview
- Window semantics definition and its implementation in WID
- Disorder
- Sharing panes: an optimization technique using sub-windows
- Conclusion
20. Sharing Panes
Q3: SELECT sid, count(*) FROM Traffic
    RANGE 4 minutes SLIDE 1 minute
    WATTR ts GROUP BY sid
21. Pane Implementation
Query: SELECT sid, count(*) FROM Traffic
       RANGE 4 minutes SLIDE 1 minute
       WATTR ts GROUP BY sid
Query plan, bottom to top:
- streamscan, producing (sid, speed, ts): t1 (s5, 47, 011010), p1 (s5, , 011100), t2 (s6, 48, 011030)
- bucket B1 as pane-id (RANGE 1 min, SLIDE 1 min, WATTR ts), producing (sid, speed, ts, pane-id): t1 (s5, 47, 011010, 70-70), p1 (s5, , , 70), t2 (s6, 48, 011030, 70-70)
- count(*) (group on pane-id, sid), producing (sid, ts, pane-id, count): m0 (s5, 011010, 70, 8)
- bucket B2 as window-id (RANGE 4, SLIDE 1, WATTR pane-id), producing (sid, ts, pane-id, count, window-id): m0 (s5, 011010, 70, 8, 70-74)
- sum(count) (group on window-id, sid)
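The two-level plan can be sketched as a pane-level count followed by a window-level sum, and compared against counting each tuple directly into its windows. The function names are hypothetical, the sample tuples are reused from the Motivation slide, and the pane-to-window mapping is derived by applying the sliding-window wids formula to the pane-id with RANGE 4 and SLIDE 1, which assigns pane p to windows p through p + 3.

```python
from collections import Counter


def to_seconds(hhmmss: str) -> int:
    """Convert an hhmmss timestamp to seconds since midnight."""
    return int(hhmmss[0:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:6])


def pane_counts(tuples, pane_s=60):
    """Level 1: count(*) grouped on (sid, pane-id); panes are 1-minute tumbling."""
    counts = Counter()
    for sid, _speed, ts in tuples:
        counts[(sid, to_seconds(ts) // pane_s)] += 1
    return counts


def window_counts_from_panes(panes, panes_per_window=4):
    """Level 2: sum the pane counts grouped on (sid, window-id)."""
    windows = Counter()
    for (sid, pane), n in panes.items():
        for wid in range(pane, pane + panes_per_window):
            windows[(sid, wid)] += n
    return windows


def window_counts_direct(tuples, range_s=240, slide_s=60):
    """Baseline: add every tuple to each of its windows individually."""
    windows = Counter()
    for sid, _speed, ts in tuples:
        sec = to_seconds(ts)
        for wid in range(sec // slide_s, (sec + range_s) // slide_s):
            windows[(sid, wid)] += 1
    return windows


TRAFFIC = [
    ("s5", 40, "010630"), ("s6", 42, "010745"), ("s5", 45, "010815"),
    ("s5", 47, "011010"), ("s6", 48, "011030"), ("s6", 46, "011102"),
]

via_panes = window_counts_from_panes(pane_counts(TRAFFIC))
print(via_panes == window_counts_direct(TRAFFIC))  # True
print(via_panes[("s5", 70)])                       # 2
```

With panes, each tuple updates one partial aggregate, and the per-window work is proportional to the number of panes rather than the number of tuples.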
22. When are panes better than windows?
Query: SELECT sid, max() FROM Traffic
       RANGE X rows SLIDE Y rows
       WATTR row-num GROUP BY sid
- Panes are better when the cost ratio is less than 1
- The number of tuples per pane affects whether using panes is better
23. Conclusion and Future Work
- Conclusion
  - A framework for defining window semantics
  - A one-pass, non-buffering, disorder-tolerant query evaluation technique
  - Initial investigation on disorder
  - Sharing panes
- Future work
  - Disorder-tolerant window join
  - Sharing panes among multiple aggregate queries
24. Related Work
- STREAM @ Stanford: heartbeat, sub-aggregation
- TelegraphCQ @ Berkeley: sliding window aggregates
- Aurora/Borealis @ Brown/MIT/Brandeis: slack
- Gigascope @ AT&T: ordering update token, sub-aggregation