SWAT: Hierarchical Stream Summarization in Large Networks

1
SWAT: Hierarchical Stream Summarization in Large
Networks
  • Ahmet Bulut, Ambuj K. Singh
  • Department of Computer Science
  • University of California, Santa Barbara
  • Santa Barbara, CA 93106
  • USA
  • Life is what happens to you while you are busy
    making other plans
  • --John Lennon

2
Motivations
  • Numerous real world applications generate streams
    of data
  • Call detail records in telecommunications
    networks
  • Transactions in retail chains, Sensor network
    data
  • Log records generated by Web Servers


NetFlow Data (Cisco's monitoring tool)
3
Network Data Processing
  • Traffic estimation
  • How many bytes were sent between a pair of IP
    addresses?
  • List the top 100 IP addresses in terms of
    traffic
  • Traffic analysis
  • What is the average duration of a TCP session?
  • Fraud detection
  • List all sessions that transmitted more than 1000
    bytes
  • Security/Denial of Service
  • Identify IP addresses involved in more than 1000
    sessions
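
The queries listed above can be made concrete with a small sketch. It assumes a hypothetical `Flow` record with only four fields (real NetFlow exports carry many more) and computes exact answers over a stored list, which is precisely what becomes infeasible at streaming rates and motivates the synopsis approach in the rest of the talk.

```python
from collections import Counter, namedtuple

# Hypothetical, simplified flow record (an assumption for illustration).
Flow = namedtuple("Flow", ["src", "dst", "num_bytes", "duration"])

flows = [
    Flow("10.0.0.1", "10.0.0.2", 500, 1.5),
    Flow("10.0.0.1", "10.0.0.2", 700, 0.5),
    Flow("10.0.0.3", "10.0.0.2", 2000, 4.0),
    Flow("10.0.0.1", "10.0.0.4", 100, 2.0),
]


def bytes_between(flows, a, b):
    """Traffic estimation: bytes exchanged between a pair of addresses."""
    return sum(f.num_bytes for f in flows if {f.src, f.dst} == {a, b})


def top_sources(flows, k):
    """Traffic estimation: the k source addresses with the most bytes sent."""
    traffic = Counter()
    for f in flows:
        traffic[f.src] += f.num_bytes
    return traffic.most_common(k)


def average_duration(flows):
    """Traffic analysis: mean session duration."""
    return sum(f.duration for f in flows) / len(flows)


def heavy_sessions(flows, threshold):
    """Fraud detection: sessions that moved more than `threshold` bytes."""
    return [f for f in flows if f.num_bytes > threshold]
```

Each function scans every stored record, so both memory and query time grow with the number of flows.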

4
Application characteristics
  • Data volume is massive (several terabytes)
  • AT&T collects 100 GB of NetFlow data each day!
  • New records arrive at a fast rate
  • Dealing with transactional data streams is like
    drinking from the proverbial fire hose.

Recent values are more important than old values.

5
Data Stream Computation Model
  • A data stream is an infinite sequence of
    elements ..., x_i, ...
  • Stream processing requirements
  • Bounded memory: small space usage
  • Real-time: low per-record processing time (to
    maintain synopsis)
  • Efficiency: fast response time and accurate
    results to user queries

[Figure: the data stream enters the stream processing engine, which keeps a synopsis in memory and returns an (approximate) answer.]
6
Data Stream Query Model
  • Inner (dot) product
  • Network data processing: What is the average
    duration of a TCP session?
  • Stock data processing: What is the average
    closing price of INTEL for the last month?
  • Medical sensor data processing: Notify when the
    weighted average of the last 20 body temperature
    measurements of a patient exceeds a threshold
    value.
  • Query: a triple (I, W, d)
  • I: data items of interest
  • W: individual weights corresponding to each data
    item
  • d: precision quality
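
A minimal sketch of the query triple, under two assumptions that the slides do not state: indices in I count back from the most recent element (index 0 is the newest), and d is read as an absolute error bound on the answer. The function names are illustrative only.

```python
def inner_product_query(window, indices, weights):
    """Exact answer to Q(I, W, .): the sum of w * x_i over the items of interest.

    `window[0]` is taken to be the most recent element (an assumption).
    """
    if len(indices) != len(weights):
        raise ValueError("I and W must have the same length")
    return sum(w * window[i] for i, w in zip(indices, weights))


def within_precision(approx, exact, d):
    """True if an approximate answer meets precision d (read as absolute error)."""
    return abs(approx - exact) <= d


window = [2, 8, 6, 4]  # newest value first
exact = inner_product_query(window, [0, 3], [10, 8])  # 10*2 + 8*4
```

With equal weights of 1/n the same query yields a plain average, which is how the "average duration" and "average closing price" examples fit the model.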

7
Outline
  • Introduction
  • Motivation and applications
  • Stream computation model
  • Stream query model
  • Centralized system design
  • Wavelet-based Approximation Tree
  • Handling user queries
  • De-centralized system design
  • Adaptive stream replication in a large network
  • Future directions

8
One-Dimensional Haar Wavelets
  • Wavelets: Mathematical tool for hierarchical
    decomposition of functions/signals
  • Haar wavelets: Simplest wavelet basis, easy to
    understand and implement
  • Recursive pairwise averaging and differencing at
    different resolutions

Resolution | Averages               | Detail coefficients
3          | 2, 2, 0, 2, 3, 5, 4, 4 | ----
2          | 2, 1, 4, 4             | 0, -1, -1, 0
1          | 1.5, 4                 | 0.5, 0
0          | 2.75                   | -1.25
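
The recursive pairwise averaging and differencing can be sketched as follows. The detail coefficient is taken as half the difference of each pair, (a - b) / 2, which is the convention that reproduces the 0, -1, -1, 0 row on the slide; the function name is illustrative.

```python
def haar_decompose(signal):
    """Return [(averages, details), ...] from the finest to the coarsest resolution."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    averages = list(signal)
    levels = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs]   # pairwise differencing
        averages = [(a + b) / 2 for a, b in pairs]  # pairwise averaging
        levels.append((averages, details))
    return levels


levels = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
for averages, details in levels:
    print(averages, details)
```

For the slide's signal this yields the averages 2, 1, 4, 4, then 1.5, 4, then the overall average 2.75, with detail coefficients 0, -1, -1, 0, then 0.5, 0, then -1.25.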
9
One-Dimensional Haar Wavelets
  • Recent-biased wavelet decomposition
  • Keep a few of the coefficients at each
    resolution
  • Approximation of the original signal

Recent values are more important than old values.

Resolution | Averages               | Detail coefficients
3          | 2, 2, 0, 2, 3, 5, 4, 4 | ----
2          | 2, 1, 4, 4             | 0, -1, -1, 0
1          | 1.5, 4                 | 0.5, 0
0          | 2.75                   | -1.25

Approximation: 2.75, 1.5, 4, 4, 4
10
Haar Wavelet Coefficients
  • Hierarchical decomposition structure
  • Keep a few coefficients at each resolution: O(log N)
    coefficients for a window size of N
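
To illustrate why keeping a few coefficients per resolution gives O(log N) space, here is a simplified sketch. It is not the SWAT algorithm itself: it builds a pyramid of non-overlapping pairwise averages and retains only the most recent `keep` averages at each level. The class name, the `keep` parameter, and the `updates` counter are all assumptions made for illustration.

```python
from collections import deque


class RecentBiasedPyramid:
    """Pairwise-average pyramid that keeps only the newest nodes per level."""

    def __init__(self, num_levels, keep=2):
        self.levels = [deque(maxlen=keep) for _ in range(num_levels)]
        self.pending = [None] * num_levels  # unpaired value waiting at each level
        self.updates = 0                    # total number of averages computed

    def append(self, value):
        carry = float(value)
        for level in range(len(self.levels)):
            if self.pending[level] is None:
                # No partner yet: wait for the next value at this level.
                self.pending[level] = carry
                return
            # Pair completed: store its average and pass it one level up.
            carry = (self.pending[level] + carry) / 2
            self.pending[level] = None
            self.levels[level].append(carry)
            self.updates += 1


pyramid = RecentBiasedPyramid(num_levels=3, keep=2)
for x in [2, 2, 0, 2, 3, 5, 4, 4]:
    pyramid.append(x)
```

After the eight values the pyramid holds 2.75 at the coarsest level, 1.5 and 4 at the middle level, and 4, 4 at the finest level, the same numbers as the approximation listed earlier, although the exact retention rule used by SWAT cannot be recovered from the slides. Storage is at most `keep` values per level, and seven averages were computed for eight items, which fits the constant amortized per-item cost claimed later in the talk.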

11
Execution Trace on SWAT
  • Sliding window size: 16

[Figure: execution trace of the SWAT tree (Levels 0 to 2) over the stream 2 8 6 4 2 10 16 6 8 2 2 0 4 2 12 14. Node labels are written as sum/count, for example 98/16 for the whole window. Additional values shown on the slide: 2, 10, 4, 6.]
12
Query by Example
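  • Q([0, 3, 8, 13], [10, 8, 4, 1])
  • 10*x_0 + 8*x_3 + 4*x_8 + 1*x_13

[Figure: the query is answered from tree nodes. Labels on the slide: V, R0, R2, L0, L1, S2, and the averages Avg(0,1), Avg(1,2), Avg(2,3), Avg(1,4), Avg(3,6), Avg(5,8), Avg(3,10), Avg(7,14), Avg(11,18), Avg(3,18).]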
13
Experimental Settings
  • Real dataset of size 3K: daily maximum
    temperature for the city of Santa Barbara, CA
    from 1994 to 2001 (http://www.ipm.ucdavis.edu/WEATHER)
  • Synthetic dataset of size 10M.
  • Data arrival period, Td, and query arrival period,
    Tq: 1 sec
  • Exponential inner product query: Q([0, 1, 2, 3],
    [8, 4, 2, 1], 20)
  • Linear inner product query: Q([8, 9, 10, 11],
    [4, 3, 2, 1], 40)
  • Execution at a query point
  • Fixed query mode: the same recent-biased query
    repeatedly
  • Random query mode: a new randomly chosen query
  • Compare with the incremental Histogram [Guha and
    Koudas 02]

14
Performance measurements
  • Dataset: real data; query mode: fixed
  • Exponential queries are more recent-biased than
    linear queries
  • Dataset: synthetic data; query mode: fixed

15
Performance measurements
  • Dataset: synthetic data
  • Query mode: random
  • Average query response times
  • SWAT: 2.80e-3 sec.
  • Histogram: 25.433 sec.
  • Parameter tuning for Histogram
  • Refer to the paper or the accompanying technical
    report for more experimental results.

16
Complexity analysis of SWAT
  • Small space usage
  • Real-time: per-record processing time (to
    maintain synopsis) must be low
  • Time needed to answer posed queries,
  • and the precision of answers
  • Space complexity is O(log N)
  • Amortized per-item processing time is O(1)

17
Outline
  • Introduction
  • Motivation and applications
  • Stream computation model
  • Stream query model
  • Centralized system design
  • Wavelet-based Approximation Tree
  • Handling user queries
  • De-centralized system design
  • Adaptive stream replication in a large network
  • Future directions

18
Stream Replication Motivations
  • Centralized model: the synopsis at a single
    site.
  • (+) easy system design, no replica consistency
  • (-) the central site becomes a bottleneck in
    query-intensive environments.
  • Decentralized model: the synopsis at multiple
    sites.
  • (+) fewer message transmissions in query-intensive
    environments
  • (-) replica consistency in data-intensive
    environments
  • Solution: an adaptive stream replication that
    minimizes the number of message transmissions

19
System topology
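[Figure: system topology. The data stream (2 8 6 4 2 10 16 6 8 2 2 0 4 2 12 14) arrives at source S; client nodes C1 to C4 are shown with queries Q1, Q3, Q4 and answers A, A3.]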
20
Computation model
  • Nodes maintain range values rather than exact
    data values

[dL, dH]: approximation for the data element d

Data window segment | Data range | Subscription list
(0,1)               | [25,45]    | C1, C2
(2,3)               | [30,40]    | C2
(4,7)               | [2,7]      | C2
(8,15)              | [4,10]     | C2
21
Query model
  • Clients issue inner product queries Q(I,W,d) over
    the stream.
  • When a client receives a query Q(I,W,d) with d as
    precision requirement
  • Check the local cache to see if there are
    approximations for I
  • If d
  • else return answer A

[Figure: an answer A bounded by [AL, AH], with d defined as AH - AL; compute an answer with precision quality d.]
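
The cache check can be sketched with interval arithmetic. This is a sketch under stated assumptions, not the authors' procedure: the cache is modelled as one [dL, dH] range per data item, the midpoint of [AL, AH] is returned as the answer, and `None` signals that the cached ranges are too coarse. What the client does in that case is not recoverable from the slide.

```python
def interval_inner_product(ranges, indices, weights):
    """Bound sum(w * x_i) given x_i in [dL, dH]; returns (AL, AH)."""
    low = high = 0.0
    for i, w in zip(indices, weights):
        d_low, d_high = ranges[i]
        if w >= 0:
            low += w * d_low
            high += w * d_high
        else:
            # A negative weight flips which end of the range is the bound.
            low += w * d_high
            high += w * d_low
    return low, high


def answer_from_cache(ranges, indices, weights, d):
    """Return an answer if the cached ranges give width AH - AL <= d, else None."""
    if any(i not in ranges for i in indices):
        return None
    a_low, a_high = interval_inner_product(ranges, indices, weights)
    if a_high - a_low <= d:
        return (a_low + a_high) / 2  # midpoint choice is an assumption
    return None


cache = {0: (30, 40), 1: (2, 7)}
bounds = interval_inner_product(cache, [0, 1], [1, 2])
```

With these ranges the answer is bounded by [34, 54], so a query with d = 20 can be served locally while one with d = 10 cannot.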
22
Adaptive Data Replication [Wolfson, Jajodia, and
Huang 97]
  • The replication scheme is a sub-tree of nodes
    that have the replica of the object
  • The replication scheme R consists of nodes 3, 7,
    and 8
  • The replication scheme expands and/or contracts
    adaptively, depending on read and write activities.

[Figure: a tree of nodes in which nodes 3, 7, and 8 form the replication scheme; node 4 is also shown.]
23
Adaptive Stream Replication (SWAT-ASR)
[Figure: SWAT-ASR phase change between source S and clients C1 and C3. Labels on the slide: ranges [32,38] and [34,35]; segment entries (2,3) [30,40] and (2,3) [32,38]; queries Q1(3,1,8) x 4 and Q0(3,1,20) x 3.]
24
Experimental Settings
  • Execute at each query point a new randomly chosen
    inner product query (random query mode)
  • Divergence Caching [Huang, Sloan, and Wolfson
    94]: mechanism to reduce the number of object
    transmissions in client-server architectures.
  • Compute the optimal refresh rate (dH - dL for
    [dL, dH]) using a window of past reads and writes
    (Poisson processes)
  • Adaptive Precision Setting [Olston, Widom, and
    Loo 01]
  • For every data value d, keep an approximation
    [dL, dH] with width W = dH - dL
  • In case of an update, the width stays the same or is
    enlarged
  • In case of an unsatisfied query, the width stays
    the same or is shrunk
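
The width adjustment described for adaptive precision setting can be sketched as below. This is a deliberately simplified, deterministic version: the doubling and halving `factor` and the recentering on the current value are assumptions for illustration, and the published algorithm is more refined than this.

```python
class AdaptiveRange:
    """Cached approximation [low, high] of one value, with adaptive width."""

    def __init__(self, value, width, factor=2.0):
        self.value = value
        self.width = width
        self.factor = factor  # hypothetical grow/shrink factor
        self._recenter()

    def _recenter(self):
        self.low = self.value - self.width / 2
        self.high = self.value + self.width / 2

    def on_update(self, value):
        """Apply a new data value; return True if the cache must be refreshed."""
        self.value = value
        if self.low <= value <= self.high:
            return False              # still inside the range: width unchanged
        self.width *= self.factor     # range was too tight: enlarge it
        self._recenter()
        return True

    def on_query(self, d):
        """Serve a query needing width <= d; return True if a refresh is needed."""
        if self.width <= d:
            return False              # query satisfied: width unchanged
        self.width /= self.factor     # range was too loose: shrink it
        self._recenter()
        return True


r = AdaptiveRange(value=35, width=10)  # range [30, 40]
```

Each `True` return corresponds to one message between source and cache, so a workload dominated by updates pushes the width up while a workload dominated by precise queries pushes it down.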

25
Performance measurements for adaptive stream
replication
26
Performance measurements for adaptive stream
replication (cont.)
27
Performance measurements for adaptive stream
replication (cont.)
Largest weight coefficient in weight vector
28
Future directions
  • Resource utilization in a multiple-streams
    environment
  • Multiple streams competing for limited resources
  • limited memory
  • limited bandwidth
  • limited CPU
  • How to schedule resources to optimize throughput?

  • Fraud and security monitoring
  • Application of stream mining techniques developed
    on logs for pattern analysis

29
References
  • [GGR02] M. N. Garofalakis, J. Gehrke, and R.
    Rastogi. Querying and Mining Data Streams: You
    Only Get One Look. VLDB 2002.
  • [GKMS01] A. C. Gilbert, Y. Kotidis, S.
    Muthukrishnan, and M. Strauss. Surfing Wavelets on
    Streams: One-Pass Summaries for Approximate
    Aggregate Queries. VLDB 2001.
  • [GK02] S. Guha and N. Koudas. Approximating a
    Data Stream for Querying and Estimation:
    Algorithms and Performance Evaluation. ICDE
    2002.
  • [HSW94] Y. Huang, R. H. Sloan, and O. Wolfson.
    Divergence Caching in Client Server
    Architectures. PDIS 1994.
  • [OWL01] C. Olston, J. Widom, and B. T. Loo.
    Adaptive Precision Setting for Cached Approximate
    Values. SIGMOD 2001.
  • [TR-BS02] A. Bulut and A. K. Singh. SWAT:
    Hierarchical Stream Summarization in Large
    Networks. University of California Santa Barbara,
    2002.
  • [WJH97] O. Wolfson, S. Jajodia, and Y. Huang.
    An Adaptive Data Replication Algorithm. ACM TODS
    1997.
  • [ZS02] Y. Zhu and D. Shasha. StatStream:
    Statistical Monitoring of Thousands of Data
    Streams in Real Time. VLDB 2002.

30
Thanks
you got to make some plans just in case