Single-Pass Algorithms for Querying and Mining Data Streams

1
Single-Pass Algorithms for Querying and Mining
Data Streams
  • Rajeev Rastogi
  • Internet Management Research
  • Bell Labs

2
Processing Data Streams: Motivation
  • A growing number of applications generate streams
    of data
    • Performance measurements in network monitoring
      and traffic management
    • Call detail records in telecommunications
    • Transactions in retail chains
    • ATM and credit card operations in banks
    • Financial tickers
    • Log records generated by Web servers
    • Sensor network data
  • Application characteristics
    • Massive volumes of data (several terabytes)
    • Records arrive at a rapid rate
  • Goal: Mine patterns, process queries and compute
    statistics on data streams in real time

3
Data Streams Computation Model
  • A data stream is a (massive) sequence of records
  • Stream processing requirements
    • Single pass: Each record is examined at most once
    • Bounded storage: Limited memory (M) for storing
      the synopsis
    • Real-time: Per-record processing time must be low
      (to maintain the synopsis)

[Figure: data streams enter a stream processing engine, which keeps a synopsis in memory and outputs an (approximate) answer]
4
Network Management Application
  • Network Management involves monitoring and
    configuring network hardware and software to
    ensure smooth operation
  • Monitor link bandwidth usage, estimate traffic
    demands
  • Quickly detect faults, congestion and isolate
    root cause
  • Load balancing, improve utilization of network
    resources

[Figure: Network Operations Center connected to the network; labels: Measurements, Alarms]
5
IP Network Measurement Data
  • IP session data (collected using
    NetFlow)
  • AT&T collects 100 GB of NetFlow data per day!

Source      Destination  Duration  Bytes  Protocol
10.1.0.2    16.2.3.7     12        20K    http
18.6.7.1    12.4.0.3     16        24K    http
13.9.4.3    11.6.8.2     15        20K    http
15.2.2.9    17.1.2.1     19        40K    http
12.4.3.8    14.8.7.4     26        58K    http
10.5.1.3    13.0.0.1     27        100K   ftp
11.1.0.6    10.3.4.5     32        300K   ftp
19.7.1.2    16.5.5.8     18        80K    ftp
6
Network Data Processing
  • Traffic estimation
  • How many bytes were sent between a pair of IP
    addresses?
  • What fraction of network IP addresses are active?
  • List the top 100 IP addresses in terms of
    traffic
  • Traffic analysis
  • What is the average duration of an IP session?
  • What is the median of the number of bytes in each
    IP session?
  • Fraud detection
  • List all sessions that transmitted more than 1000
    bytes
  • Identify all sessions whose duration was more
    than twice the normal
  • Security/Denial of Service
  • List all IP addresses that have witnessed a
    sudden spike in traffic
  • Identify IP addresses involved in more than 1000
    sessions

7
Data Stream Processing Algorithms
  • Accurate answers
  • Algorithms compute answer accurately (without
    errors)
  • MAX, MIN, COUNT, SUM, AVG
  • Approximate answers - Deterministic bounds
  • Algorithms only compute an approximate answer,
    but with guaranteed bounds on the error
  • Clustering, equi-depth histograms
  • Approximate answers - Probabilistic bounds
  • Algorithms compute an approximate answer with
    high probability
  • With probability at least 1 - δ, the computed answer is
    within a factor ε of the actual answer
  • Decision trees, distinct value estimation, join
    queries
  • Single-pass algorithms for processing streams can
    also be applied to (massive) terabyte databases
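
The exact aggregates listed above (MAX, MIN, COUNT, SUM, AVG) need only constant memory in a single pass. A minimal sketch, assuming numeric records; the class name `RunningStats` is hypothetical, and the input values are the session durations from the IP session example:

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Exact MAX, MIN, COUNT, SUM and AVG with O(1) memory."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Each record is examined exactly once and then discarded.
        self.count += 1
        self.total += x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else math.nan


# Session durations from the IP session example.
stats = RunningStats()
for duration in [12, 16, 15, 19, 26, 27, 32, 18]:
    stats.update(duration)
```

The synopsis here is just four numbers, which is why these aggregates fall in the "accurate answers" class, unlike medians or distinct counts.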

8
Deterministic bounds
9
Clustering
  • Problem (k-median): Find k centers in a stream S
    so as to minimize the sum of distances from data
    points in S to their closest cluster centers.
  • Example (k = 1)

[Figure: five points labeled 1 to 5, with the median marked as the center for which the cost is minimum]
10
Simple One-Pass Algorithm [Guha et al.]
  • For each successive set of M records, S_i, find
    O(k) centers in S_i
    • Assign each point in S_i to its closest center
  • Let S' be the centers for S_1, ..., S_l, with each center
    weighted by the number of points assigned to it
  • Cluster S' to find k centers
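
The two phases can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: points are 1-D numbers, each chunk is summarized by exactly k centers, and the clustering subroutine is a brute-force exact weighted k-median (`weighted_k_median`, a hypothetical helper that is only practical for tiny inputs).

```python
from itertools import combinations


def weighted_k_median(points, weights, k):
    """Exact weighted k-median by brute force; centers are input points."""
    candidates = sorted(set(points))
    best_cost, best_centers = float("inf"), ()
    for centers in combinations(candidates, min(k, len(candidates))):
        cost = sum(w * min(abs(p - c) for c in centers)
                   for p, w in zip(points, weights))
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return list(best_centers)


def summarize_chunk(chunk, k):
    """Phase 1: cluster one chunk, weight each center by its assigned points."""
    centers = weighted_k_median(chunk, [1] * len(chunk), k)
    counts = {c: 0 for c in centers}
    for p in chunk:
        counts[min(centers, key=lambda c: abs(p - c))] += 1
    return list(counts.items())


def one_pass_k_median(stream, M, k):
    summary = []   # the weighted center set, built chunk by chunk
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == M:
            summary.extend(summarize_chunk(chunk, k))
            chunk = []
    if chunk:      # last, possibly shorter, chunk
        summary.extend(summarize_chunk(chunk, k))
    # Phase 2: cluster the weighted centers to obtain the final k centers.
    points = [c for c, _ in summary]
    weights = [w for _, w in summary]
    return weighted_k_median(points, weights, k)
```

For example, `one_pass_k_median([1, 2, 4, 5, 3], M=3, k=1)` first summarizes the chunk 1, 2, 4 by center 2 with weight 3 and the chunk 5, 3 by center 3 with weight 2, then clusters those two weighted centers.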

11
One-Pass Algorithm - First Phase (Example)
  • M = 3, k = 1

[Figure: data stream of points 1, 2, 4, 5, 3]
12
One-Pass Algorithm - Second Phase (Example)
  • M = 3, k = 1, Data Stream

13
Analysis of Algorithm - First Phase
  • Observation 1: The sum of the optimal solution
    values for the k-median problem for S_1, ..., S_l is
    at most twice the cost of the optimal solution
    for S

[Figure: data stream points 1 to 5, comparing the cost for S with the costs for the individual chunks. Let 1 be the point closest to the optimal median 4.]
14
Analysis of Algorithm - Second Phase
  • Observation 2: The cost of the optimal solution for
    S' is at most a constant times the sum of the
    optimal costs for S and S_1, ..., S_l

[Figure: data stream points 1 to 5 and the weighted centers S' (weights w = 3 and w = 2), with the costs being compared]
15
Overall Analysis of Algorithm
  • Final result: The cost of the final solution is at most
    the sum of the costs of S' and S_1, ..., S_l, which is at most
    a constant times the cost of S
  • If a constant-factor approximation algorithm is used
    in each clustering step, then the simple algorithm
    yields a constant-factor approximation
  • Algorithm can be extended to cluster in more than
    2 phases

[Figure: data stream points 1 to 5 and the weighted centers S' (weights w = 3 and w = 2), with the costs being compared]
16
Quantile Computation
  • Problem (q-quantile): Find the element in position qn
    when the n stream elements are sorted
    • Median is the 0.5-quantile
    • Can be used to compute equi-depth histograms
  • Example

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
After sort:  1 1 2 3 4 5 5 6 7 8 9 9
[Figure: markers for the median and the 0.8-quantile in the sorted sequence]
17
Single-Pass Algorithm [Munro & Paterson]
  • Split memory M into b buffers of size k (M = bk)
  • For each successive set of k elements in stream
    • If free buffer B exists
      • insert k elements into B, set level of B to 0
    • Else
      • merge two buffers B' and B'' at same level l
      • output result of merge into B', set level of B'
        to l+1
      • insert k elements into B'', set level of B'' to 0
  • Output element in position qn after making 2^l
    copies of each element in the final buffer (at level l)
    and sorting them
  • Merge operation (input buffers B' and B'' at level
    l)
    • Make 2^l copies of each element in B' and B''
    • Sort copies
    • Output elements in positions j·2^(l+1) + 2^l
      in sorted sequence, j = 0, ..., k-1
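
A minimal sketch of the buffer-merging scheme, under assumptions that are not in the slides: merge output positions are j·2^(l+1) + 2^l (1-indexed), the copy count at level l is 2^l, the rank qn is rounded up, and the stream length is k times a power of two so that the final buffers can always be merged pairwise. The function names are hypothetical.

```python
import math


def merge(buf_a, buf_b, level):
    """Merge two level-`level` buffers of size k into one buffer of size k."""
    k = len(buf_a)
    copies = 2 ** level
    pooled = sorted(x for x in buf_a + buf_b for _ in range(copies))
    # 1-indexed positions j * 2^(level+1) + 2^level, for j = 0..k-1.
    return [pooled[j * 2 ** (level + 1) + 2 ** level - 1] for j in range(k)]


def munro_paterson_quantile(stream, b, k, q):
    buffers = []            # each entry is [level, elements]
    chunk, n = [], 0

    def merge_lowest_pair():
        buffers.sort(key=lambda entry: entry[0])
        for i in range(len(buffers) - 1):
            if buffers[i][0] == buffers[i + 1][0]:
                level = buffers[i][0]
                merged = merge(buffers[i][1], buffers[i + 1][1], level)
                buffers[i:i + 2] = [[level + 1, merged]]
                return
        raise ValueError("no two buffers share a level")

    for x in stream:
        chunk.append(x)
        n += 1
        if len(chunk) == k:
            if len(buffers) == b:       # no free buffer: merge to free one
                merge_lowest_pair()
            buffers.append([0, chunk])
            chunk = []
    if chunk:
        raise ValueError("sketch assumes the stream length is a multiple of k")

    while len(buffers) > 1:             # collapse to a single final buffer
        merge_lowest_pair()
    level, final = buffers[0]
    expanded = sorted(x for x in final for _ in range(2 ** level))
    return expanded[math.ceil(q * n) - 1]
```

Called as `munro_paterson_quantile([9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1], b=3, k=3, q=0.8)`, the sketch ends with a single level-2 buffer 1 3 7, expands it to 1 1 1 1 3 3 3 3 7 7 7 7, and returns 7.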

18
Single-Pass Algorithm (Example)
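  • M = 9, b = 3, k = 3, q = 0.8
  • Computed 0.8-quantile: 7 (which is the 0.7-quantile)

level 0: 9 3 5 | 2 7 1 | 6 5 8 | 4 9 1
level 1: 1 3 7 (from sorted merge 1 2 3 5 7 9) | 1 5 8
level 2: 1 3 7 (from sorted copies 1 1 1 1 3 3 5 5 7 7 8 8)
final buffer expanded: 1 1 1 1 3 3 3 3 7 7 7 7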
19
Analysis of Algorithm
b
  • Number of elements that are neither definitely
    small nor definitely large
  • Algorithm returns q-quantile, where
  • Choose smallest b such that and bk M

20
Probabilistic bounds
21
Decision Trees
  • Problem: Construct a decision tree for a stream
    of records

[Figure: example decision tree. The root tests Packets > 10; an internal node tests Bytes > 60K; yes/no branches lead to leaves Protocol = http and Protocol = ftp.]
  • Recursively scan (sample of) data set to split
    tree nodes
  • At each tree node, choose attribute with lowest
    entropy or Gini index to split node

22
Single-Pass Algorithm [Domingos & Hulten]
  • Initialize T to root node with counts 0
  • For each record in stream
    • Traverse T to determine appropriate leaf L for
      record
    • Update (attribute, class) counts in L and compute
      split function SP(A_i) for each attribute A_i
    • If SP(A_i) - SP(A_j) > ε for the best attribute A_i
      and the second-best attribute A_j  -- (1)
      • split L using attribute A_i
  • Compute value for ε using Hoeffding Bound
    • Hoeffding Bound: If SP() takes values in range R,
      and L contains m records, then with probability
      1 - δ, the computed value of SP() (using m
      records in L) differs from the true value by at
      most ε = sqrt(R^2 ln(1/δ) / (2m))
    • Hoeffding Bound guarantees that if (1) holds,
      then A_i is the correct choice for split with
      probability 1 - δ
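
The split test can be sketched numerically. This assumes the standard form of the Hoeffding bound, ε = sqrt(R² ln(1/δ) / (2m)), and assumes that a larger split-function value is better; the function names and the sample numbers are hypothetical.

```python
import math


def hoeffding_epsilon(value_range, delta, m):
    """Deviation bound that holds with probability 1 - delta after m records."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * m))


def should_split(sp_best, sp_second, value_range, delta, m):
    """True when the observed gap between the two best attributes exceeds epsilon."""
    return (sp_best - sp_second) > hoeffding_epsilon(value_range, delta, m)


# With R = 1, delta = e^-2 and m = 100 records, epsilon is sqrt(2 / 200) = 0.1.
epsilon = hoeffding_epsilon(1.0, math.exp(-2), 100)
```

Because ε shrinks as 1/sqrt(m), a leaf that cannot yet separate its two best attributes simply waits for more records before splitting.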

23
Single-Pass Algorithm (Example)
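[Figure: the tree grows as the data stream is processed. First tree: root test Packets > 10 with a leaf Protocol = http. Condition shown: SP(Bytes) - SP(Packets) > ε. Second tree: root test Packets > 10, with an added test Bytes > 60K and leaves Protocol = http and Protocol = ftp.]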
24
Analysis of Algorithm
  • Result: Expected probability that the constructed
    decision tree classifies a record differently
    from the conventional tree is less than δ/p
    • Here p is the probability that a record is assigned
      to a leaf at each level
  • Even with very small p (that is, large trees),
    small disagreements result with few records per
    node
    • With only 725 records per node
      • For p = 0.01%, expected disagreement is at most 1%
      • For p = 1%, expected disagreement is at most 0.01%

25
Distinct Value Estimation
  • Problem: Find the number of distinct values in a
    stream of values with domain {0, ..., D-1}
  • Example (D = 8)

Data stream: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: 5
26
Single-Pass Algorithm [Gibbons]
  • Initialize cur_level to 0, V to empty
  • For each value v in stream
    • Let l = hash(v)   /* Pr(hash(v) = l) = 1/2^(l+1) */
    • If l ≥ cur_level
      • V = V ∪ {v}
    • If |V| > M
      • delete all values in V at level cur_level
      • cur_level = cur_level + 1
  • Output |V| · 2^cur_level
  • Computing hash function
    • hash(v) = number of leading zeros in binary
      representation of (Av + B) mod D
    • A chosen randomly from {1, ..., D-1}, B chosen
      randomly from {0, ..., D-1}
    • 0 ≤ hash(v) ≤ log D
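
A minimal sketch of this sampling scheme, with some assumptions of its own: the sample is kept when hash(v) ≥ cur_level, the estimate is |V| · 2^cur_level, D is a power of two so that values fit in log D bits, and a `while` loop replaces the single size check so the sample can never stay above M. The names `make_hash` and `distinct_estimate` are hypothetical.

```python
import random


def make_hash(D, rng):
    """Level hash: leading zeros of (A*v + B) mod D in a (log D)-bit word."""
    bits = max(1, (D - 1).bit_length())
    A = rng.randrange(1, D)
    B = rng.randrange(0, D)

    def level(v):
        x = (A * v + B) % D
        return bits - x.bit_length()    # x = 0 gives the maximum, log D

    return level


def distinct_estimate(stream, M, level_of):
    cur_level = 0
    sample = {}                         # value -> its hash level
    for v in stream:
        lvl = level_of(v)
        if lvl >= cur_level:
            sample[v] = lvl
            while len(sample) > M:      # evict the lowest level, then raise it
                sample = {u: l for u, l in sample.items() if l > cur_level}
                cur_level += 1
    return len(sample) * 2 ** cur_level


# Hash levels for the values 0, 1, 3, 5, 7 (fixed table instead of a random hash).
example_levels = {0: 0, 1: 1, 3: 0, 5: 1, 7: 0}
estimate = distinct_estimate([3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7], 3,
                             example_levels.__getitem__)
```

With M = 3 and this hash table, the sample is {3, 0, 5} at level 0 until the value 1 arrives, after which it becomes {1, 5} at level 1, so the estimate is 2 · 2 = 4.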

27
Single-Pass Algorithm (Example)
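  • M = 3, D = 8

Data stream: 3 0 5 3 0 1 7 5 1 0 3 7

Value v:  0 1 3 5 7
hash(v):  0 1 0 1 0

Remaining data stream: 1 7 5 1 0 3 7
V = {3, 0, 5}, cur_level = 0 (before the remaining stream)
V = {1, 5}, cur_level = 1 (after the remaining stream)
  • Computed value: 4 (= |V| · 2^cur_level = 2 · 2)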

28
Analysis of Algorithm
  • Set V contains all values v such that hash(v) ≥
    cur_level
  • Expected value for |V| = num_distinct_values / 2^cur_level
    • Pr(hash(v) ≥ cur_level) = 1/2^cur_level
  • Expected value for |V| · 2^cur_level =
    num_distinct_values

29
Join Queries
  • Problem: Compute size of join for two (or more)
    streams R and S
  • Example

Data stream R: 3 0 1 3 0 3
Data stream S: 2 0 1 3 1 2

Value:          0 1 2 3
Frequency in R: 2 1 0 3
Frequency in S: 1 2 2 1
  • Join size: 7 (= 2 + 2 + 0 + 3)

30
Single-Pass Algorithm [Alon et al.]
  • Initialize X_R = 0, X_S = 0
  • For each value v in stream R/S
    • Set X_R = X_R + ξ_v / X_S = X_S + ξ_v
  • Output X_R · X_S
  • Properties of random variables ξ_v
    • Each ξ_v ∈ {-1, 1}
    • Pr(ξ_v = 1) = Pr(ξ_v = -1) = 1/2
      • expected value of each ξ_v = 0
    • Variables ξ_v are independent
      • expected value of product of distinct ξ_v = 0
    • Each variable ξ_v can be efficiently generated
      using a pseudo-random generator
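
The estimator can be sketched as below. The symbol names (one counter per stream, one ±1 variable per domain value) and the function names are assumptions for illustration, and the signs are drawn fully independently from Python's `random` module rather than from a compact pseudo-random seed.

```python
import random
from collections import Counter
from itertools import product


def exact_join_size(R, S):
    """Reference answer: sum over values of freq_R(v) * freq_S(v)."""
    freq_r, freq_s = Counter(R), Counter(S)
    return sum(freq_r[v] * freq_s[v] for v in freq_r)


def sketch(stream, signs):
    """One counter per stream: add the +1/-1 variable of each arriving value."""
    x = 0
    for v in stream:
        x += signs[v]
    return x


def estimate_join_size(R, S, domain, instances, rng):
    """Average the product of the two counters over independent instantiations."""
    total = 0
    for _ in range(instances):
        signs = {v: rng.choice((-1, 1)) for v in domain}
        total += sketch(R, signs) * sketch(S, signs)
    return total / instances


R = [3, 0, 1, 3, 0, 3]
S = [2, 0, 1, 3, 1, 2]
domain = range(4)

# Averaging over every possible sign assignment gives the expectation exactly.
products = [sketch(R, dict(zip(domain, s))) * sketch(S, dict(zip(domain, s)))
            for s in product((-1, 1), repeat=len(domain))]
expected_value = sum(products) / len(products)
```

Enumerating all 16 sign assignments is only feasible for this toy domain; it shows that the cross terms cancel and the expectation equals the exact join size, while `estimate_join_size` is what a streaming system would actually run.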

31
Single-Pass Algorithm (Example)
Data stream R: 3 0 1 3 0 3
Data stream S: 2 0 1 3 1 2

Value:          0 1 2 3
Frequency in R: 2 1 0 3
Frequency in S: 1 2 2 1
32
Analysis of Algorithm
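  • Expected value of X_R · X_S = join size
    • Expected value of product terms ξ_u · ξ_v with u ≠ v = 0
    • Expected value of product terms ξ_v · ξ_v = 1
  • Variance of X_R · X_S ≤ 2 · SJ(R) · SJ(S)
    • SJ(R), SJ(S): self-join sizes of R/S
  • Result (Chebyshev): Averaging X_R · X_S over
    O(SJ(R) · SJ(S) / (ε^2 L^2)) instantiations estimates join
    size to within a factor of 1 ± ε with high
    probability
    • L: lower bound on join size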

33
Data Streaming - Summary
  • Growing number of applications generate data
    streams
  • Performance measurements in network monitoring
    and traffic management
  • Call detail records in telecommunications
  • Transactions in retail chains
  • ATM and credit card operations in banks
  • Financial tickers
  • Need for single-pass algorithms for querying and
    mining streams
  • Approximate answers with deterministic bounds on
    error
  • Clustering, quantiles
  • Approximate answers with probabilistic bounds on
    error
  • Decision trees, distinct value estimation, join
    queries
  • New and exciting research area with technically
    challenging problems!

34
Data Streaming - Future Research Directions
  • Stream processing system architectures
  • Models, algebras and languages for stream
    processing
  • Algorithms for mining high-speed data streams
  • Processing general database queries on streams
  • Stream selectivity estimation methods
  • Compression and approximation techniques for
    streams
  • Stream indexing, searching and similarity
    matching
  • Exploiting prior knowledge for stream computation
  • Memory management for stream processing
  • Content-based routing and filtering of XML
    streams
  • Integration of stream processing and databases
  • Novel stream processing applications

35
References
  • [Guha et al.]
    • S. Guha, N. Mishra, R. Motwani and L.
      O'Callaghan. Clustering data streams, FOCS, 2000.
  • [Munro & Paterson]
    • J. Munro and M. Paterson. Selection and sorting
      with limited storage, Theoretical Computer
      Science, vol 12, 1980.
  • [Domingos & Hulten]
    • P. Domingos and G. Hulten. Mining high-speed data
      streams, SIGKDD, 2000.
  • [Gibbons]
    • P. Gibbons. Distinct sampling for highly-accurate
      answers to distinct values queries and event
      reports, VLDB, 2001.
  • [Alon et al.]
    • N. Alon, P. Gibbons, Y. Matias and M. Szegedy.
      Tracking join and self-join sizes in limited
      storage, PODS, 1999.