Title: Single-Pass Algorithms for Querying and Mining Data Streams
1 Single-Pass Algorithms for Querying and Mining Data Streams
- Rajeev Rastogi
- Internet Management Research
- Bell Labs
2 Processing Data Streams: Motivation
- A growing number of applications generate streams of data
  - Performance measurements in network monitoring and traffic management
  - Call detail records in telecommunications
  - Transactions in retail chains
  - ATM and credit card operations in banks
  - Financial tickers
  - Log records generated by Web servers
  - Sensor network data
- Application characteristics
  - Massive volumes of data (several terabytes)
  - Records arrive at a rapid rate
- Goal: Mine patterns, process queries and compute statistics on data streams in real time
3 Data Streams: Computation Model
- A data stream is a (massive) sequence of records
- Stream processing requirements
  - Single pass: Each record is examined at most once
  - Bounded storage: Limited memory (M) for storing synopsis
  - Real-time: Per-record processing time (to maintain synopsis) must be low
[Figure: data streams enter the stream processing engine, which keeps a synopsis in memory and outputs an (approximate) answer]
4 Network Management Application
- Network management involves monitoring and configuring network hardware and software to ensure smooth operation
  - Monitor link bandwidth usage, estimate traffic demands
  - Quickly detect faults, congestion and isolate root cause
  - Load balancing, improve utilization of network resources
[Figure: the network sends measurements and alarms to the Network Operations Center]
5 IP Network Measurement Data
- IP session data (collected using NetFlow)
- AT&T collects 100 GB of NetFlow data per day!

Source      Destination  Duration  Bytes  Protocol
10.1.0.2    16.2.3.7     12        20K    http
18.6.7.1    12.4.0.3     16        24K    http
13.9.4.3    11.6.8.2     15        20K    http
15.2.2.9    17.1.2.1     19        40K    http
12.4.3.8    14.8.7.4     26        58K    http
10.5.1.3    13.0.0.1     27        100K   ftp
11.1.0.6    10.3.4.5     32        300K   ftp
19.7.1.2    16.5.5.8     18        80K    ftp
6 Network Data Processing
- Traffic estimation
  - How many bytes were sent between a pair of IP addresses?
  - What fraction of network IP addresses are active?
  - List the top 100 IP addresses in terms of traffic
- Traffic analysis
  - What is the average duration of an IP session?
  - What is the median of the number of bytes in each IP session?
- Fraud detection
  - List all sessions that transmitted more than 1000 bytes
  - Identify all sessions whose duration was more than twice the normal
- Security/Denial of Service
  - List all IP addresses that have witnessed a sudden spike in traffic
  - Identify IP addresses involved in more than 1000 sessions
7 Data Stream Processing Algorithms
- Accurate answers
  - Algorithms compute answer accurately (without errors)
  - MAX, MIN, COUNT, SUM, AVG
- Approximate answers - Deterministic bounds
  - Algorithms only compute an approximate answer, but with bounds on the error
  - Clustering, equi-depth histograms
- Approximate answers - Probabilistic bounds
  - Algorithms compute an approximate answer with high probability
  - With probability at least $1 - \delta$, the computed answer is within a factor $\epsilon$ of the actual answer
  - Decision trees, distinct value estimation, join queries
- Single-pass algorithms for processing streams can also be applied to (massive) terabyte databases
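The exact aggregates listed above (MAX, MIN, COUNT, SUM, AVG) need only a constant-size synopsis. The following is a minimal sketch of that idea; the class name `RunningStats` and its fields are illustrative assumptions, and the input values are the session sizes (in KB) from the NetFlow table earlier in the deck.

```python
class RunningStats:
    """Exact MIN, MAX, COUNT, SUM and AVG kept in O(1) memory (illustrative name)."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def update(self, value):
        # Each record is examined once and then discarded.
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def average(self):
        # AVG is derived from the two running counters.
        return self.total / self.count if self.count else None


# Session sizes in KB from the NetFlow example table.
session_kbytes = [20, 24, 20, 40, 58, 100, 300, 80]
stats = RunningStats()
for kb in session_kbytes:
    stats.update(kb)
print(stats.count, stats.total, stats.minimum, stats.maximum, stats.average)
# prints: 8 642 20 300 80.25
```

The synopsis here is four numbers regardless of stream length, which is why these queries can be answered without error.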
8 Deterministic bounds
9 Clustering
- Problem (k-median): Find k centers in a stream S so as to minimize the sum of distances from data points in S to their closest cluster centers.
- Example (k = 1)
[Figure: five points labeled 1 to 5, with point 3 marked as the median, where the cost is minimum]
10 Simple One-Pass Algorithm [Guha et al.]
- For each successive set of M records, $S_i$, find O(k) centers in $S_i$
  - Assign each point in $S_i$ to its closest center
- Let $S'$ be centers for $S_1, \ldots, S_l$ with each center weighted by number of points assigned to it
- Cluster $S'$ to find k centers
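The two phases above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: points are one-dimensional numbers, each chunk is reduced to exactly k centers (the slide allows O(k)), and the clustering subroutine `k_median` is a brute-force exact search that is only practical for tiny inputs. All function names are hypothetical.

```python
from itertools import combinations


def k_median(points, weights, k):
    """Brute-force weighted k-median; centers are restricted to input points."""
    candidates = sorted(set(points))
    best = None
    for centers in combinations(candidates, min(k, len(candidates))):
        cost = sum(w * min(abs(p - c) for c in centers)
                   for p, w in zip(points, weights))
        if best is None or cost < best[1]:
            best = (list(centers), cost)
    return best


def summarize(chunk, k):
    """First phase: cluster one chunk and weight each center by its cluster size."""
    centers, _ = k_median(chunk, [1] * len(chunk), k)
    counts = dict.fromkeys(centers, 0)
    for p in chunk:
        counts[min(centers, key=lambda c: abs(p - c))] += 1
    return list(counts.items())


def one_pass_k_median(stream, k, M):
    """Second phase: cluster the weighted centers collected from all chunks."""
    weighted_centers, chunk = [], []
    for x in stream:
        chunk.append(x)
        if len(chunk) == M:          # memory holds at most M records at a time
            weighted_centers.extend(summarize(chunk, k))
            chunk = []
    if chunk:                        # last, possibly shorter, chunk
        weighted_centers.extend(summarize(chunk, k))
    points = [c for c, _ in weighted_centers]
    weights = [w for _, w in weighted_centers]
    return k_median(points, weights, k)[0]


print(one_pass_k_median([1, 2, 3, 10, 11, 12, 1, 2, 3, 10, 11, 12], k=2, M=6))
```

In practice the brute-force subroutine would be replaced by a constant-factor approximation algorithm, and the weighted centers themselves must fit in memory, which is what motivates clustering in more than two phases.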
11 One-Pass Algorithm - First Phase (Example)
[Figure: example data points labeled 1 to 5]
12 One-Pass Algorithm - Second Phase (Example)
13 Analysis of Algorithm - First Phase
- Observation 1: The sum of the optimal solution values for the k-median problem for $S_1, \ldots, S_l$ is at most twice the cost of the optimal solution for S
[Figure: data stream points 1 to 5, with the cost for S and the costs for the individual sets annotated]
Let 1 be the point closest to optimal median 4
14 Analysis of Algorithm - Second Phase
- Observation 2: The cost of the optimal solution for $S'$ is at most a constant times the sum of the optimal costs for S and $S_1, \ldots, S_l$
[Figure: data stream points 1 to 5 and the weighted centers in $S'$ (weights w = 3 and w = 2), with the costs for S and $S'$ annotated]
15 Overall Analysis of Algorithm
- Final result: Cost of final solution is at most the sum of the costs of $S'$ and $S_1, \ldots, S_l$, which is at most a constant times the cost of S
- If a constant factor approximation algorithm is used to cluster $S_1, \ldots, S_l$, then the simple algorithm yields a constant factor approximation
- Algorithm can be extended to cluster in more than 2 phases
[Figure: data stream points 1 to 5 and the weighted centers in $S'$ (weights w = 3 and w = 2), with the costs annotated]
16 Quantile Computation
- Problem (q-quantile): Find the element in position qn when the n stream elements are sorted
  - Median is the 0.5-quantile
  - Can be used to compute equi-depth histograms
- Example
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
After sort:  1 1 2 3 4 5 5 6 7 8 9 9
[Figure labels: the median and the 0.8-quantile are marked in the sorted sequence]
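As a reference point for the streaming algorithms, the definition can be evaluated directly by sorting. This is a non-streaming baseline sketch; rounding the position qn up to the next integer is an assumption (the slides do not state a rounding rule), and the function names are illustrative.

```python
import math


def exact_quantile(values, q):
    """Element at 1-indexed position ceil(q * n) of the sorted values."""
    ordered = sorted(values)
    position = max(1, math.ceil(q * len(ordered)))
    return ordered[position - 1]


def equi_depth_boundaries(values, buckets):
    """Bucket boundaries of an equi-depth histogram, taken as i/buckets quantiles."""
    return [exact_quantile(values, i / buckets) for i in range(1, buckets)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(exact_quantile(stream, 0.5))        # median under the rounding assumption
print(exact_quantile(stream, 0.8))
print(equi_depth_boundaries(stream, 4))
```

This baseline stores all n elements, which is exactly what a bounded-memory stream algorithm must avoid.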
17 Single-Pass Algorithm [Munro, Paterson]
- Split memory M into b buffers of size k (M = bk)
- For each successive set of k elements in stream
  - If free buffer B exists
    - insert k elements into B, set level of B to 0
  - Else
    - merge two buffers B' and B'' at same level l
    - output result of merge into B', set level of B' to l+1
    - insert k elements into B'', set level of B'' to 0
- Output element in position qn after making $2^l$ copies of each element in final buffer and sorting them
- Merge operation (input buffers B' and B'' at level l)
  - Make $2^l$ copies of each element in B' and B''
  - Sort copies
  - Output elements in positions $j \cdot 2^{l+1} + 2^l$ in sorted sequence, j = 0, ..., k-1
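The buffer scheme above can be sketched as follows. Several details are assumptions rather than statements from the slides: each element at level l is given 2^l copies and the merge keeps 1-indexed positions j * 2^(l+1) + 2^l (chosen because they reproduce the deck's worked example with M = 9, b = 3, k = 3), remaining buffers are merged pairwise at equal levels once the stream ends, a leftover partial chunk is counted at weight 1, and position qn is rounded up. Function names are illustrative.

```python
import math


def merge(first, second, level):
    """Merge two level-`level` buffers of k elements into one buffer at level + 1."""
    weight = 2 ** level
    copies = sorted(x for x in first + second for _ in range(weight))
    # Keep 1-indexed positions j * 2^(level+1) + 2^level, j = 0 .. k-1.
    return [copies[j * 2 * weight + weight - 1] for j in range(len(first))]


def merge_lowest_pair(buffers):
    """Merge the two lowest buffers that share a level; return False if none do."""
    buffers.sort(key=lambda buf: buf[0])
    for i in range(len(buffers) - 1):
        if buffers[i][0] == buffers[i + 1][0]:
            level = buffers[i][0]
            merged = merge(buffers[i][1], buffers[i + 1][1], level)
            buffers[i:i + 2] = [(level + 1, merged)]
            return True
    return False


def approx_quantile(stream, q, b, k):
    buffers, chunk, n = [], [], 0     # buffers holds (level, elements) pairs
    for x in stream:
        chunk.append(x)
        n += 1
        if len(chunk) == k:
            # No free buffer: make room by merging two buffers at the same level.
            if len(buffers) == b and not merge_lowest_pair(buffers):
                raise ValueError("stream too long for b buffers of size k")
            buffers.append((0, chunk))
            chunk = []
    if n == 0:
        raise ValueError("empty stream")
    while merge_lowest_pair(buffers):  # collapse what is left at end of stream
        pass
    # Expand the survivors: 2^level copies per element, leftover chunk at weight 1.
    copies = sorted([x for level, buf in buffers for x in buf
                     for _ in range(2 ** level)] + chunk)
    position = min(len(copies), max(1, math.ceil(q * n)))
    return copies[position - 1]


print(approx_quantile([9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1], q=0.8, b=3, k=3))
# prints: 7
```

The `ValueError` reflects that b buffers of size k can only absorb a bounded number of elements before no two buffers share a level, so b and k have to be chosen with the stream length in mind.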
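18 Single-Pass Algorithm (Example)
- M = 9, b = 3, k = 3, q = 0.8
- Level 0 buffers: 9 3 5 | 2 7 1 | 6 5 8 | 4 9 1
- Merging 9 3 5 and 2 7 1 (sorted: 1 2 3 5 7 9) gives level 1 buffer 1 3 7
- Merging 6 5 8 and 4 9 1 gives level 1 buffer 1 5 8
- Merging the level 1 buffers (sorted copies: 1 1 1 1 3 3 5 5 7 7 8 8) gives level 2 buffer 1 3 7
- Copies of final buffer: 1 1 1 1 3 3 3 3 7 7 7 7
- Computed 0.8-quantile = 7 (actually the 0.7-quantile)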
19 Analysis of Algorithm
- Number of elements that are neither definitely small, nor definitely large: [formula missing]
- Algorithm returns q'-quantile, where [formula missing]
- Choose smallest b such that [condition missing] and bk = M
20 Probabilistic bounds
21 Decision Trees
- Problem: Construct a decision tree for a stream of records
[Figure: example decision tree with tests Packets > 10 and Bytes > 60K (yes/no branches) and leaves Protocol = http (twice) and Protocol = ftp]
- Recursively scan (sample of) data set to split tree nodes
- At each tree node, choose attribute with lowest entropy or Gini index to split node
22 Single-Pass Algorithm [Domingos, Hulten]
- Initialize T to root node with counts 0
- For each record in stream
  - Traverse T to determine appropriate leaf L for record
  - Update (attribute, class) counts in L and compute split function $SP(A_i)$ for each attribute $A_i$
  - If $SP(A_i) - SP(A_j) > \epsilon$ for attribute $A_i$ and every other attribute $A_j$ -- (1)
    - split L using attribute $A_i$
- Compute value for $\epsilon$ using Hoeffding Bound
  - Hoeffding Bound: If SP() takes values in range R, and L contains m records, then with probability $1 - \delta$, the computed value of SP() (using m records in L) differs from the true value by at most $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2m)}$
  - Hoeffding Bound guarantees that if (1) holds, then $A_i$ is the correct choice for split with probability $1 - \delta$
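The split test at a leaf can be illustrated with a small sketch. It assumes the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2m)), and that a larger split-function value is better (for example information gain); the function names and the SP values in the usage lines are hypothetical.

```python
import math


def hoeffding_epsilon(value_range, delta, m):
    """Deviation bound that holds with probability 1 - delta after m records."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * m))


def choose_split(sp_values, value_range, delta, m):
    """Return the attribute to split on, or None if the leaf needs more records.

    sp_values maps attribute name -> split-function value computed from the
    m records seen at this leaf (larger is assumed to be better).
    """
    if len(sp_values) < 2:
        return None
    ranked = sorted(sp_values.items(), key=lambda item: item[1], reverse=True)
    (best_attribute, best), (_, runner_up) = ranked[0], ranked[1]
    # Condition (1): the best attribute leads every other one by more than epsilon.
    if best - runner_up > hoeffding_epsilon(value_range, delta, m):
        return best_attribute
    return None


eps = hoeffding_epsilon(value_range=1.0, delta=1e-6, m=725)
print(round(eps, 4))
print(choose_split({"Bytes": 0.50, "Packets": 0.30, "Protocol": 0.10}, 1.0, 1e-6, 725))
print(choose_split({"Bytes": 0.50, "Packets": 0.45, "Protocol": 0.10}, 1.0, 1e-6, 725))
```

Comparing the best attribute only against the runner-up is enough, since every other attribute trails by at least as much.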
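23 Single-Pass Algorithm (Example)
[Figure: tree before the split, built from the data stream: root test Packets > 10 (yes/no) with a leaf Protocol = http; at the other leaf, SP(Bytes) - SP(Packets) > $\epsilon$ (the Hoeffding bound)]
[Figure: tree after the split: root test Packets > 10 (yes/no), new test Bytes > 60K, and leaves Protocol = http and Protocol = ftp]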
24 Analysis of Algorithm
- Result: Expected probability that constructed decision tree classifies a record differently from conventional tree is less than $\delta / p$
  - Here p is the probability that a record is assigned to a leaf at each level
- Even with very small p (that is, large trees), small disagreements result with few records per node
  - With only 725 records per node
    - For p = 0.01%, expected disagreement is at most 1%
    - For p = 1%, expected disagreement is at most 0.01%
25 Distinct Value Estimation
- Problem: Find the number of distinct values in a stream of values with domain {0, ..., D-1}
- Example (D = 8)
Data stream: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: 5
26 Single-Pass Algorithm [Gibbons]
- Initialize cur_level to 0, V to empty
- For each value v in stream
  - Let l = hash(v)   /* Pr(hash(v) = l) = $1/2^{l+1}$ */
  - If l >= cur_level
    - V = V U {v}
  - If |V| > M
    - delete all values in V at level cur_level
    - cur_level = cur_level + 1
- Output $|V| \cdot 2^{cur\_level}$
- Computing hash function
  - hash(v) = Number of leading zeros in binary representation of (Av + B) mod D
  - A / B chosen randomly from {1, ..., D-1} / {0, ..., D-1}
  - 0 <= hash(v) <= log D
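A minimal sketch of this sampling scheme follows. The estimate |V| * 2^cur_level and the test l >= cur_level are taken as the intended reading of the steps above; the coefficients A = 6 and B = 4 are an assumption, picked only because they reproduce the hash values in the deck's D = 8 example, and a `while` loop replaces the single `If` so that V never exceeds M. Function names are illustrative.

```python
def make_level_hash(A, B, D):
    """hash(v) = number of leading zeros of (A*v + B) mod D, written in log2(D) bits."""
    bits = max(1, (D - 1).bit_length())
    def level(v):
        x = (A * v + B) % D
        return bits - x.bit_length()   # equals `bits` when x == 0
    return level


def estimate_distinct(stream, M, level_of):
    cur_level = 0
    sample = {}                        # V: sampled value -> its level
    for v in stream:
        l = level_of(v)
        if l >= cur_level:
            sample[v] = l
        while len(sample) > M:
            # Drop the values sitting exactly at cur_level, then raise the bar.
            sample = {u: lu for u, lu in sample.items() if lu > cur_level}
            cur_level += 1
    return len(sample) * 2 ** cur_level, set(sample), cur_level


# Assumed coefficients (A = 6, B = 4) matching the example's hash values.
level_of = make_level_hash(A=6, B=4, D=8)
stream = [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]
estimate, sample, cur_level = estimate_distinct(stream, M=3, level_of=level_of)
print(estimate, sample, cur_level)
```

With M = 3 the sample ends as {1, 5} at cur_level 1, giving an estimate of 4 against a true count of 5; a larger M tightens the estimate at the cost of memory.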
27 Single-Pass Algorithm (Example)
Data stream: 3 0 5 3 0 1 7 5 1 0 3 7

Value  0 1 3 5 7
Hash   0 1 0 1 0

Data stream: 1 7 5 1 0 3 7
- V = {3, 0, 5}, cur_level = 0
- V = {1, 5}, cur_level = 1
28 Analysis of Algorithm
- Set V contains all values v such that hash(v) >= cur_level
- Expected value for |V| = num_distinct_values / $2^{cur\_level}$
  - Pr(hash(v) >= cur_level) = $1/2^{cur\_level}$
- Expected value for $|V| \cdot 2^{cur\_level}$ = num_distinct_values
29 Join Queries
- Problem: Compute size of join for two (or more) streams R and S
- Example
Data stream R: 3 0 1 3 0 3
Data stream S: 2 0 1 3 1 2

Value           0 1 2 3
Frequency in R  2 1 0 3
Frequency in S  1 2 2 1
30 Single-Pass Algorithm [Alon et al.]
- Initialize $X_R = 0$, $X_S = 0$
- For each value v in stream R/S
  - Set $X_R = X_R + \xi_v$ / $X_S = X_S + \xi_v$
- Output $X_R \cdot X_S$
- Properties of random variables $\xi_v$
  - Each $\xi_v \in \{-1, 1\}$
  - $\Pr(\xi_v = 1) = \Pr(\xi_v = -1) = 1/2$
    - expected value of each $\xi_v$ = 0
  - Variables $\xi_v$ are independent
    - expected value of product of distinct $\xi_v$ = 0
  - Each variable $\xi_v$ can be efficiently generated using a pseudo-random generator
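The sketch below illustrates the estimator on the deck's example streams. It is a simplified illustration under assumptions: each stream keeps one counter that adds a random +1/-1 sign per value, the product of the two counters is one estimate, and estimates are averaged. The signs are stored in a dictionary over the whole domain for clarity, whereas a space-efficient version would generate them from a small pseudo-random seed. All names are illustrative.

```python
import itertools
import random


def atomic_sketch(stream, signs):
    """One counter per stream: the sum of the +1/-1 signs of the values seen."""
    return sum(signs[v] for v in stream)


def estimate_join_size(R, S, domain, num_instances, seed=0):
    """Average the product of the two counters over independent sign draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_instances):
        signs = {v: rng.choice((-1, 1)) for v in domain}
        total += atomic_sketch(R, signs) * atomic_sketch(S, signs)
    return total / num_instances


def expected_estimate(R, S, domain):
    """Exact expectation: average the product over every possible sign assignment."""
    domain = list(domain)
    products = []
    for assignment in itertools.product((-1, 1), repeat=len(domain)):
        signs = dict(zip(domain, assignment))
        products.append(atomic_sketch(R, signs) * atomic_sketch(S, signs))
    return sum(products) / len(products)


R = [3, 0, 1, 3, 0, 3]
S = [2, 0, 1, 3, 1, 2]
print(expected_estimate(R, S, range(4)))            # exact join size of the example
print(estimate_join_size(R, S, range(4), 20000, seed=1))
```

`expected_estimate` enumerates all 16 sign assignments for the four-value domain, which confirms the estimator is unbiased on this example; a single instance is noisy, which is why many instances are averaged.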
31 Single-Pass Algorithm (Example)
Data stream R: 3 0 1 3 0 3
Data stream S: 2 0 1 3 1 2

Value           0 1 2 3
Frequency in R  2 1 0 3
Frequency in S  1 2 2 1
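32 Analysis of Algorithm
- $X_R = \sum_v f_R(v)\,\xi_v$ and $X_S = \sum_v f_S(v)\,\xi_v$, where $f_R(v)$ / $f_S(v)$ is the frequency of v in R/S
- Expected value of $X_R \cdot X_S = \sum_v f_R(v) f_S(v)$ (the join size)
  - Expected value of product terms $\xi_v \cdot \xi_v$ = 1
  - Expected value of product terms $\xi_u \cdot \xi_v$ ($u \neq v$) = 0
- Variance of $X_R \cdot X_S \leq 2 \cdot SJ(R) \cdot SJ(S)$
  - Self-join size of R/S: $SJ(R) = \sum_v f_R(v)^2$ / $SJ(S) = \sum_v f_S(v)^2$
- Result (Chebyshev): Averaging over $O(SJ(R) \cdot SJ(S) / (\epsilon^2 L^2))$ instantiations of $X_R \cdot X_S$ estimates join size to within a factor of $(1 \pm \epsilon)$ with high probability
  - L: lower bound on join size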
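The expectation argument can be written out step by step. The notation here is an assumption made for the derivation: $\xi_v \in \{-1, 1\}$ is the random sign for value v, $f_R(v)$ and $f_S(v)$ are the frequencies of v in R and S, and $X_R = \sum_v f_R(v)\,\xi_v$, $X_S = \sum_v f_S(v)\,\xi_v$ are the two counters.

```latex
\begin{align*}
E[X_R X_S]
  &= E\Big[\sum_u f_R(u)\,\xi_u \cdot \sum_v f_S(v)\,\xi_v\Big] \\
  &= \sum_v f_R(v) f_S(v)\, E[\xi_v^2]
     + \sum_{u \neq v} f_R(u) f_S(v)\, E[\xi_u \xi_v] \\
  &= \sum_v f_R(v) f_S(v) \cdot 1 + 0
   = \sum_v f_R(v) f_S(v)
\end{align*}
```

For the example streams R = 3 0 1 3 0 3 and S = 2 0 1 3 1 2, the frequencies over values 0 to 3 are (2, 1, 0, 3) and (1, 2, 2, 1), so the join size is 2*1 + 1*2 + 0*2 + 3*1 = 7.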
33 Data Streaming - Summary
- Growing number of applications generate data streams
  - Performance measurements in network monitoring and traffic management
  - Call detail records in telecommunications
  - Transactions in retail chains
  - ATM and credit card operations in banks
  - Financial tickers
- Need for single-pass algorithms for querying and mining streams
  - Approximate answers with deterministic bounds on error
    - Clustering, quantiles
  - Approximate answers with probabilistic bounds on error
    - Decision trees, distinct value estimation, join queries
- New and exciting research area with technically challenging problems!
34 Data Streaming - Future Research Directions
- Stream processing system architectures
- Models, algebras and languages for stream processing
- Algorithms for mining high-speed data streams
- Processing general database queries on streams
- Stream selectivity estimation methods
- Compression and approximation techniques for streams
- Stream indexing, searching and similarity matching
- Exploiting prior knowledge for stream computation
- Memory management for stream processing
- Content-based routing and filtering of XML streams
- Integration of stream processing and databases
- Novel stream processing applications
35 References
- [Guha et al.]
  - S. Guha, N. Mishra, R. Motwani and L. O'Callaghan. Clustering data streams, FOCS, 2000.
- [Munro, Paterson]
  - J. Munro and M. Paterson. Selection and sorting with limited storage, Theoretical Computer Science, vol 12, 1980.
- [Domingos, Hulten]
  - P. Domingos and G. Hulten. Mining high-speed data streams, SIGKDD, 2000.
- [Gibbons]
  - P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports, VLDB, 2001.
- [Alon et al.]
  - N. Alon, P. Gibbons, Y. Matias and M. Szegedy. Tracking join and self-join sizes in limited storage, PODS, 1999.