PPT – Mining from Data Streams - Competition between Quality and Speed PowerPoint presentation

About This Presentation

Description:

New applications: data input as continuous, ordered data streams ... Mine patterns, process queries and compute statistics on data streams in real-time ...

Slides: 59

Provided by: surajL

Transcript and Presenter's Notes

1
Mining from Data Streams - Competition between
Quality and Speed

Adapted from
Wei-Guang Teng's and
S. Muthukrishnan's presentations

2
Streaming: Finding Missing Numbers

Paul permutes the numbers 1..n and shows all but one of them to Carole, one after the other, in the permuted order.
Carole must find the missing number.
Carole cannot remember all the numbers she has been shown.

3
Streaming: Finding Missing Numbers

Carole accumulates the sum of all the numbers that she has been shown. At the end she subtracts this sum from
$n(n+1)/2$
and the difference is the missing number.
Analysis
Takes O(log n) bits to store the partial sum
Performs one addition each time a new number is shown (takes O(log n) time per number)
Performs one subtraction at the end (takes O(log n) time)
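
The running-sum trick can be sketched in a few lines of Python. This is a minimal illustration; the function name `find_missing` and the sample input are made up for the example.

```python
def find_missing(stream, n):
    """Return the one value in 1..n that never appears in `stream`."""
    running_sum = 0                          # the only state kept between items
    for x in stream:                         # one addition per number shown
        running_sum += x
    return n * (n + 1) // 2 - running_sum    # one subtraction at the end


shown = [4, 1, 6, 2, 5]                      # a permutation of 1..6 with 3 withheld
print(find_missing(shown, 6))                # 3
```

The function never stores the numbers themselves, only their sum, which is why the space bound is logarithmic in n.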

4
Data Streams (1)

Traditional DBMS: data stored in finite, persistent data sets
New applications: data input as continuous, ordered data streams
Network monitoring and traffic engineering
Telecom call detail records (CDR)
ATM operations in banks
Sensor networks
Web logs and click-streams
Transactions in retail chains
Manufacturing processes

5
Data Streams (2)

Definition
Continuous, unbounded, rapid, time-varying
streams of data elements
Application Characteristics
Massive volumes of data (can be several
terabytes)
Records arrive at a rapid rate
Goal
Mine patterns, process queries and compute
statistics on data streams in real-time

6
Data Stream Algorithms

Streaming involves
Small number of passes over data. (Typically 1?)
Sublinear space (sublinear in the universe or
number of stream items?)
Sublinear time for computing (?)
Similar to dynamic, online, approximation or
randomized algorithms, but with more constraints.

7
Data Streams Analysis Model
[Diagram: User/Application, Query/Mining Target, Results, Stream Processing Engine, Scratch Space (Memory and/or Disk)]
8
Motivation

3 billion telephone calls in the US each day
30 billion emails daily, 1 billion SMS, IMs
Scientific data: NASA's observation satellites generate billions of readings each day.
IP network traffic: up to 1 billion packets per hour per router. Each ISP has many (hundreds) of routers!
Compare to human-scale data: "only" 1 billion worldwide credit card transactions per month.

9
Network Management Application

Monitoring and configuring network hardware and
software to ensure smooth operation

[Diagram: Network Operations Center, Measurements, Alarms, Network]
10
IP Network Measurement Data

IP session data
AT&T collects 100 GB of NetFlow data each day!

11
Network Data Processing

Traffic estimation/analysis
List the top 100 IP addresses in terms of traffic
What is the average duration of an IP session?
Fraud detection
Identify all sessions whose duration was more
than twice the normal
Security/Denial of Service
List all IP addresses that have witnessed a
sudden spike in traffic
Identify IP addresses involved in more than 1000
sessions

12
Challenges in Network Apps.

One link at 2 Gb/s. Say the average packet size is 50 bytes.
Number of packets per second: 5 million.
Time per packet: 0.2 µs.
If we capture packet headers (src/dest IP, time, number of bytes, etc.), that is at least 10 bytes per packet.
Space per second is 50 MB. Space per day is about 4.3 TB per link. ISPs have hundreds of links.
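
The back-of-the-envelope figures on this slide can be re-derived with a few lines of arithmetic. The variable names below are chosen for this sketch; the inputs (2 Gb/s link, 50-byte packets, 10 captured bytes per packet) are the slide's assumptions.

```python
link_bits_per_sec = 2e9          # one 2 Gb/s link
avg_packet_bytes = 50            # assumed average packet size
header_bytes = 10                # bytes captured per packet (lower bound)

packets_per_sec = link_bits_per_sec / 8 / avg_packet_bytes
time_per_packet_us = 1e6 / packets_per_sec
capture_bytes_per_sec = packets_per_sec * header_bytes
capture_bytes_per_day = capture_bytes_per_sec * 86_400   # seconds per day

print(packets_per_sec)           # 5000000.0 packets per second
print(time_per_packet_us)        # 0.2 microseconds per packet
print(capture_bytes_per_sec)     # 50000000.0 bytes, i.e. 50 MB per second
print(capture_bytes_per_day)     # 4320000000000.0 bytes, i.e. about 4.3 TB per day
```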

13
Data Streaming Models

Input data: $a_1, a_2, a_3, \ldots$
The input stream describes a signal $A[i]$, a one-dimensional function (value vs. index)
There is a mapping from the input stream to the signal
This is the data stream model

14
Time-Series Model

Each $a_i$ equals $A[i]$, and the items arrive in increasing order of $i$.

15
Cash-Register Model

The $a_i$'s are increments to the $A[j]$'s
$a_i = (j, I_i)$, $I_i \ge 0$
$A_i[j] = A_{i-1}[j] + I_i$

16
Turnstile Model

The $a_i$'s are updates to the $A[j]$'s
$a_i = (j, U_i)$
$A_i[j] = A_{i-1}[j] + U_i$
Strict turnstile model
$A_i[j] \ge 0$ at all $i$
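
The cash-register and turnstile models differ only in which updates they allow, which a small sketch makes concrete. The function `apply_updates` and the model labels passed as strings are names invented for this example, not terms from the slides.

```python
from collections import defaultdict


def apply_updates(updates, model="turnstile"):
    """Replay (j, delta) updates and return the final signal A as a dict.

    model is "cash_register", "turnstile" or "strict_turnstile"
    (labels chosen for this sketch).
    """
    A = defaultdict(int)                       # A[j] starts at 0 for every j
    for j, delta in updates:
        if model == "cash_register" and delta < 0:
            raise ValueError("cash-register updates must be non-negative")
        A[j] += delta                          # new A[j] = old A[j] + delta
        if model == "strict_turnstile" and A[j] < 0:
            raise ValueError("strict turnstile requires A[j] >= 0 at all times")
    return dict(A)


print(apply_updates([(3, 2), (7, 1), (3, 4)], "cash_register"))   # {3: 6, 7: 1}
print(apply_updates([(3, 2), (3, -1), (7, 5)], "turnstile"))      # {3: 1, 7: 5}
```

A real streaming algorithm would not store all of A; the dictionary here only shows what the signal is, so that the update rules are easy to check.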

17
Data Stream Algorithms

Compute various functions on the signal A at
various times
Performance measures
Processing time per item $a_i$ in the stream
Space used to store the data structure on $A_t$ at time $t$
Time needed to compute the functions on A

18
Outline

Introduction & Motivation
Issues & Techniques of Processing Data Streams
Sampling
Histogram
Wavelet
Data Streaming Systems
Example Algorithms for Frequency Counting
Lossy Counting
Sticky Sampling

19
Data Stream Algorithms

Stream Processing Requirements
Single pass: each record is examined at most once
Bounded storage: limited memory for storing synopsis
Real-time: per-record processing time (to maintain synopsis) must be low
Generally, algorithms compute approximate answers
Difficult to compute answers accurately with
limited memory

20
Approximation in Data Streams

Approximate Answers - Deterministic Bounds
Algorithms compute only an approximate answer, but with guaranteed bounds on the error
Approximate Answers - Probabilistic Bounds
Algorithms compute an approximate answer with high probability
With probability at least $1 - \delta$, the computed answer is within a factor $\epsilon$ of the actual answer

21
Sliding Window Approximation
0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1
0

Why?
Approximation technique for bounded memory
Natural in applications (emphasizes recent data)
Well-specified and deterministic semantics
Issues
Extend relational algebra, SQL, query
optimization
Algorithmic work
Timestamps?

22
Timestamps

Explicit
Injected by data source
Models real-world event represented by tuple
Tuples may be out-of-order, but if near-ordered
can reorder with small buffers
Implicit
Introduced as special field by DSMS
Arrival time in system
Enables order-based querying and sliding windows
Issues
Distributed streams?
Composite tuples created by DSMS?
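
The remark that near-ordered tuples can be reordered with small buffers can be illustrated with a bounded min-heap. This is only a sketch: the function name `reorder` and the `buffer_size` parameter are assumptions of the example, and it is correct only if no tuple arrives more than `buffer_size` positions late.

```python
import heapq


def reorder(stream, buffer_size):
    """Yield timestamps in sorted order using a buffer of at most buffer_size + 1 items."""
    heap = []
    for timestamp in stream:
        heapq.heappush(heap, timestamp)
        if len(heap) > buffer_size:        # buffer full: release the oldest timestamp
            yield heapq.heappop(heap)
    while heap:                            # end of input: drain what is left
        yield heapq.heappop(heap)


print(list(reorder([1, 3, 2, 5, 4, 6], buffer_size=2)))   # [1, 2, 3, 4, 5, 6]
```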

23
Time

Easiest: global system clock
Stream elements and relation updates timestamped on entry to system
Application-defined time
Streams and relation updates contain application timestamps, may be out of order
Application generates heartbeat
Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress
Query results in application time

24
Sampling Basics

A small random sample S of the data often represents all the data well
Example: select agg from R where R.e is odd (n = 12)
If agg is avg, return the average of the odd elements in S
If agg is count, return the average over all elements e in S of
n if e is odd
0 if e is even

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
avg answer: 5
count answer: 12 · 3/4 = 9
Unbiased!
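
The two estimators from this slide can be written out directly. This is a minimal sketch using the slide's own stream and sample; the function names are made up for the example.

```python
def estimate_avg_odd(sample):
    """avg: average of the odd elements that landed in the sample."""
    odd = [e for e in sample if e % 2 == 1]
    return sum(odd) / len(odd)


def estimate_count_odd(sample, n):
    """count: average over the sample of (n if e is odd else 0)."""
    return sum(n if e % 2 == 1 else 0 for e in sample) / len(sample)


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # n = 12
sample = [9, 5, 1, 8]

print(estimate_avg_odd(sample))                  # 5.0
print(estimate_count_odd(sample, len(stream)))   # 9.0
print(sum(1 for e in stream if e % 2 == 1))      # 8, the exact count for comparison
```

The count estimate (9) differs from the exact count (8) on this particular sample; unbiasedness is a statement about the average over all possible samples, not about any single one.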
25
Histograms

Histograms approximate the frequency distribution
of element values in a stream
A histogram (typically) consists of
A partitioning of element domain values into
buckets
A count per bucket B (of the number of
elements in B)
Long history of use for selectivity estimation within a query optimizer [Koo80, PSC84, etc.]

26
Types of Histograms

Equi-Depth Histograms
Select buckets such that counts per bucket are
equal
V-Optimal Histograms [IP95, JKM98]
Select buckets to minimize frequency variance within buckets

[Figures: for each histogram type, count per bucket plotted against domain values 1 to 20]
27
Answering Queries using Histograms [IP99]

(Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation
Example: select count(*) from R where 4 <= R.e <= 15
For equi-depth histograms, maximum error

[Figure: count spread evenly among bucket values over domain values 1 to 20, with the query range 4 <= R.e <= 15 marked]
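
Spreading each bucket's count evenly over its values gives a simple range-count estimator. The sketch below assumes integer domain values and uses hypothetical bucket boundaries and counts that are not taken from the slide's figure.

```python
def estimate_range_count(buckets, a, b):
    """Estimate count(a <= R.e <= b) from (low, high, count) buckets.

    Each bucket's count is assumed to be spread evenly over the integer
    values low..high, so a partly covered bucket contributes pro rata.
    """
    total = 0.0
    for low, high, count in buckets:
        overlap = min(high, b) - max(low, a) + 1     # values shared with [a, b]
        if overlap > 0:
            total += count * overlap / (high - low + 1)
    return total


# Hypothetical equi-depth histogram over the domain 1..20 (10 elements per bucket).
buckets = [(1, 3, 10), (4, 9, 10), (10, 16, 10), (17, 20, 10)]
print(estimate_range_count(buckets, 4, 15))   # 10 from the second bucket + 6/7 of 10 from the third
```

Only buckets that the query range cuts through can contribute error, since fully covered and fully excluded buckets are counted exactly.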
28
Wavelet Basics

For hierarchical decomposition of functions/signals
Haar wavelets
Simplest wavelet basis => recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                  Detail Coefficients
3            2, 2, 0, 2, 3, 5, 4, 4    ----
2            2, 1, 4, 4                0, -1, -1, 0
1            1.5, 4                    0.5, 0
0            2.75                      -1.25

Haar wavelet decomposition: 2.75, -1.25, 0.5, 0, 0, -1, -1, 0
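
The recursive pairwise averaging and differencing can be sketched as follows. The function name is chosen for this example, the transform is left unnormalized as in the slide's numbers, and the input length is assumed to be a power of two.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition: [overall average, details coarse to fine]."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]   # half the pairwise difference
        averages = [(a + b) / 2 for a, b in pairs]        # pairwise average
        details = level_details + details                 # coarser levels go in front
    return averages + details


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```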
29
Haar Wavelet Coefficients

Hierarchical decomposition structure (error
tree)

[Figure: error tree of Haar coefficients and their supports over the original frequency distribution 2 2 0 2 3 5 4 4]
30
Wavelet-based Histograms [MVW98]

Problem: range-query selectivity estimation
Key idea: use a compact subset of Haar wavelet coefficients for approximating the frequency distribution
Steps
Compute cumulative frequency distribution C
Compute Haar wavelet transform of C
Coefficient thresholding: only m << n coefficients can be kept

31
Using Wavelet-based Histograms

Selectivity estimation: $\mathrm{count}(a \le R.e \le b) = C'[b] - C'[a-1]$
$C'$ is the (approximate) reconstructed cumulative distribution
Time: $O(\min\{m, \log N\})$, where m = size of wavelet synopsis (number of coefficients), N = size of domain
Empirical results over synthetic data show improvements over random sampling and histograms

At most $\log N + 1$ coefficients are needed to reconstruct any value of $C'$
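
A rough end-to-end sketch of the idea: transform the cumulative distribution, keep a few coefficients, and answer a range count from two reconstructed values, each touching only the log N + 1 coefficients on one root-to-leaf path. All function names are invented here, and the thresholding rule (keep the largest raw magnitudes) is a simplification, not necessarily the rule used in [MVW98].

```python
from itertools import accumulate


def haar_decompose(values):
    """Unnormalized Haar transform; len(values) must be a power of two."""
    averages, details = list(values), []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details   # coarser levels in front
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def keep_largest(coeffs, m):
    """Zero all but the m coefficients of largest magnitude (simplified thresholding)."""
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    keep = set(order[:m])
    return [c if k in keep else 0.0 for k, c in enumerate(coeffs)]


def reconstruct_value(coeffs, i):
    """Rebuild entry i (0-based) from the coefficients on its root-to-leaf path."""
    n = len(coeffs)
    value = coeffs[0]                      # overall average
    node, lo, hi = 1, 0, n                 # coeffs[node] covers positions lo..hi-1
    while node < n:
        mid = (lo + hi) // 2
        if i < mid:                        # left half: add the detail
            value += coeffs[node]
            node, hi = 2 * node, mid
        else:                              # right half: subtract the detail
            value -= coeffs[node]
            node, lo = 2 * node + 1, mid
    return value


def estimate_count(coeffs, a, b):
    """Estimate count(a <= R.e <= b) over domain values 1..N as C[b] - C[a-1]."""
    upper = reconstruct_value(coeffs, b - 1)                      # C[b] is stored at index b-1
    lower = reconstruct_value(coeffs, a - 2) if a > 1 else 0.0    # C[0] = 0
    return upper - lower


freq = [2, 2, 0, 2, 3, 5, 4, 4]            # frequencies of domain values 1..8
C = list(accumulate(freq))                 # cumulative frequency distribution
exact = haar_decompose(C)
print(estimate_count(exact, 3, 6))                    # 10.0 with every coefficient kept
print(estimate_count(keep_largest(exact, 3), 3, 6))   # approximate answer from 3 coefficients
```

With all coefficients retained the answer is exact; dropping coefficients trades accuracy for space.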
32
Data Streaming Systems

Low-level application specific approach
DBMS approach
Generic data stream management systems

33
DBMS Vs. DSMS Meta-Questions

Killer-apps
Application stream rates exceed DBMS capacity?
Can DSMS handle high rates anyway?
Motivation
Need for general-purpose DSMS?
Not ad-hoc, application-specific systems?
Non-Trivial
DSMS merely DBMS with enhanced support for
triggers, temporal constructs, data rate mgmt?

34
DBMS versus DSMS

DBMS
Persistent relations
One-time queries
Random access
Access plan determined by query processor and physical DB design

DSMS
Transient streams (and persistent relations)
Continuous queries
Sequential access
Unpredictable data characteristics and arrival patterns

35
(Simplified) Big Picture of DSMS
[Diagram: Stored Result, Streamed Result, DSMS, Scratch Store, Stored Relations]
36
(Simplified) Network Monitoring
[Diagram: Intrusion Warnings, Online Performance Metrics, Register Monitoring Queries, DSMS, Network measurements and packet traces, Scratch Store, Lookup Tables]
37
Using Conventional DBMS

Data streams as relation inserts, continuous queries as triggers or materialized views
Problems with this approach
Inserts are typically batched, high overhead
Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)
No notion of approximation, resource allocation
Current systems don't scale to a large number of triggers
Views don't provide streamed results

38
Query 1 (self-join)

Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
AND O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
39
Query 2 (join)

Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage
unless streams are near-synchronized

40
Query 3 (group-by aggregation)

Total connection time for each caller
SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
GROUP BY O1.caller
Cannot provide result in (append-only) stream
Output updates?
Provide current value on demand?

41
Data Model

Append-only
Call records
Updates
Stock tickers
Deletes
Transactional data
Meta-Data
Control signals, punctuations
System internals: probably need all of the above

42
Related Database Technology

DSMS must use ideas from these, but none is a substitute
Triggers, Materialized Views in Conventional DBMS
Main-Memory Databases
Sequence/Temporal/Timeseries Databases
Realtime Databases
Adaptive, Online, Partial Results
Novelty in DSMS
Semantics: input ordering, streaming output, ...
State: cannot store unending streams, yet need history
Performance: rate, variability, imprecision, ...

43
Outline

Introduction & Motivation
Data Stream Management System
Issues & Techniques of Processing Data Streams
Sampling
Histogram
Wavelet
Example Algorithms for Frequency Counting
Lossy Counting
Sticky Sampling

44
Problem of Frequency Counts

Identify all elements whose current frequency exceeds support threshold s = 0.1%

Stream
45
Algorithm 1: Lossy Counting

Step 1: Divide the stream into windows
Is window size a function of support s? Will fix later

46
Lossy Counting in Action ...
Empty
At window boundary, decrement all counters by 1
47
Lossy Counting (cont'd)
Frequency Counts

Next Window
At window boundary, decrement all counters by 1
48
Error Analysis

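How much do we undercount?
If the current size of the stream is $N$ and the window size is $1/\epsilon$, then the number of windows is $\epsilon N$, so the frequency error is at most $\epsilon N$ (each counter loses at most 1 per window).

Rule of thumb: set $\epsilon$ = 10% of support $s$
Example: given support frequency $s$ = 1%, set error frequency $\epsilon$ = 0.1%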
49
Analysis of Lossy Counting

Output
Elements with counter values exceeding $sN - \epsilon N$
How many counters do we need?
Worst case: $\frac{1}{\epsilon} \log(\epsilon N)$ counters

Approximation guarantees
Frequencies are underestimated by at most $\epsilon N$
No false negatives
False positives have true frequency at least $sN - \epsilon N$
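
The window-and-decrement procedure described on these slides can be sketched as below. It follows the simplified description given here (decrement every counter at each window boundary and drop those that reach zero); the function names and the synthetic stream are assumptions of the example, not the authors' code.

```python
import math


def lossy_count(stream, epsilon):
    """One pass over the stream; returns (counters, number of items seen)."""
    width = math.ceil(1 / epsilon)             # window size 1/epsilon
    counters = {}
    n = 0
    for item in stream:
        n += 1
        counters[item] = counters.get(item, 0) + 1
        if n % width == 0:                     # window boundary
            # Decrement every counter by 1 and drop the ones that hit zero.
            counters = {k: c - 1 for k, c in counters.items() if c > 1}
    return counters, n


def frequent_items(counters, n, s, epsilon):
    """Elements whose counter exceeds sN - epsilon*N."""
    return {k for k, c in counters.items() if c > (s - epsilon) * n}


# Synthetic stream of 100 items: "a" 50 times, "b" 30 times, 20 singletons.
stream = []
for i in range(10):
    stream += ["a"] * 5 + ["b"] * 3 + [f"x{i}", f"y{i}"]

counters, n = lossy_count(stream, epsilon=0.1)
print(counters)                                      # {'a': 40, 'b': 20}
print(frequent_items(counters, n, s=0.35, epsilon=0.1))   # {'a'}
```

In this run each surviving counter is low by exactly 10, which equals epsilon times N, the worst case allowed by the error analysis.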
50
Algorithm 2: Sticky Sampling

Create counters by sampling
Maintain exact counts thereafter

What rate should we sample?
51
Sticky Sampling (cont'd)

For a finite stream of length $N$
Sampling rate = $\frac{2}{N\epsilon} \log\frac{1}{s\delta}$
($\delta$ = probability of failure)
Output
Elements with counter values exceeding $sN - \epsilon N$

Same rule of thumb: set $\epsilon$ = 10% of support $s$
Example: given support threshold $s$ = 1%, set error threshold $\epsilon$ = 0.1% and failure probability $\delta$ = 0.01%
Same error guarantees as Lossy Counting, but probabilistic!
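
For the finite-stream case, the sampling rule can be sketched as follows. This is an illustration under stated assumptions: the stream length N is known in advance, the logarithm is taken to be natural, the rate is capped at 1, and the function name and the `rng` parameter are invented for the example.

```python
import math
import random
from collections import Counter


def sticky_sample(stream, n, s, epsilon, delta, rng=None):
    """Create counters by sampling, then count exactly once a counter exists."""
    rng = rng or random.Random()
    # Sampling rate 2/(N*epsilon) * log(1/(s*delta)), capped at 1 (natural log assumed).
    rate = min(1.0, 2 / (n * epsilon) * math.log(1 / (s * delta)))
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # already tracked: exact count from here on
        elif rng.random() < rate:
            counters[item] = 1             # sampled: start tracking this element
    return counters


stream = ["a"] * 600 + ["b"] * 300 + [f"rare{i}" for i in range(100)]
random.Random(7).shuffle(stream)
n, s, epsilon, delta = len(stream), 0.2, 0.02, 0.1

counters = sticky_sample(stream, n, s, epsilon, delta, rng=random.Random(42))
true_counts = Counter(stream)
print(all(c <= true_counts[k] for k, c in counters.items()))   # True: counts are never overestimated
print({k for k, c in counters.items() if c > (s - epsilon) * n})
```

Because occurrences seen before an element is first sampled are lost, counts can only be underestimates; the guarantee that the loss stays below epsilon times N holds only with probability at least 1 minus delta.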
52
Sampling rate?

Finite stream of length $N$
Sampling rate = $\frac{2}{N\epsilon} \log\frac{1}{s\delta}$
Infinite stream with unknown $N$
Gradually adjust sampling rate
In either case,
Expected number of counters = $\frac{2}{\epsilon} \log\frac{1}{s\delta}$

Independent of N!
53
New Directions

Functional approximation theory
Data structures
Computational geometry
Graph theory
Databases
Hardware
Streaming models
Data stream quality monitoring

54
References (1)

AGM99 N. Alon, P.B. Gibbons, Y. Matias, M.
Szegedy. Tracking Join and Self-Join Sizes in
Limited Storage. ACM PODS, 1999.
AMS96 N. Alon, Y. Matias, M. Szegedy. The space
complexity of approximating the frequency
moments. ACM STOC, 1996.
CIK02 G. Cormode, P. Indyk, N. Koudas, S.
Muthukrishnan. Fast mining of tabular data via
approximate distance computations. IEEE ICDE,
2002.
CMN98 S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How much is enough? ACM SIGMOD 1998.
CDI02 G. Cormode, M. Datar, P. Indyk, S.
Muthukrishnan. Comparing Data Streams Using
Hamming Norms. VLDB, 2002.
DGG02 A. Dobra, M. Garofalakis, J. Gehrke, R.
Rastogi. Processing Complex Aggregate Queries
over Data Streams. ACM SIGMOD, 2002.
DJM02 T. Dasu, T. Johnson, S. Muthukrishnan, V.
Shkapenyuk. Mining database structure or how to
build a data quality browser. ACM SIGMOD, 2002.
DH00 P. Domingos and G. Hulten. Mining
high-speed data streams. ACM SIGKDD, 2000.
EKSWX98 M. Ester, H.-P. Kriegel, J. Sander, M.
Wimmer, and X. Xu. Incremental Clustering for
Mining in a Data Warehousing Environment. VLDB
1998.
FKS99 J. Feigenbaum, S. Kannan, M. Strauss, M.
Viswanathan. An approximate L1-difference
algorithm for massive data streams. IEEE FOCS,
1999.
FM85 P. Flajolet, G.N. Martin. Probabilistic Counting Algorithms for Data Base Applications. JCSS 31(2), 1985.

55
References (2)

Gib01 P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. VLDB 2001.
GGI02 A.C. Gilbert, S. Guha, P. Indyk, Y.
Kotidis, S. Muthukrishnan, M. Strauss. Fast,
small-space algorithms for approximate histogram
maintenance. ACM STOC, 2002.
GGRL99 J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT-Optimistic Decision Tree Construction. ACM SIGMOD 1999.
GK01 M. Greenwald and S. Khanna.
Space-Efficient Online Computation of Quantile
Summaries. ACM SIGMOD 2001.
GKM01 A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB 2001.
GKM02 A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles. VLDB 2002.
GKS01b S. Guha, N. Koudas, and K. Shim. Data
Streams and Histograms. ACM STOC 2001.
GM98 P. B. Gibbons and Y. Matias. New
Sampling-Based Summary Statistics for Improving
Approximate Query Answers. ACM SIGMOD 1998.
GMP97 P. B. Gibbons, Y. Matias, and V. Poosala.
Fast Incremental Maintenance of Approximate
Histograms. VLDB 1997.
GT01 P.B. Gibbons, S. Tirthapura. Estimating
Simple Functions on the Union of Data Streams.
ACM SPAA, 2001.

56
References (3)

HHW97 J. M. Hellerstein, P. J. Haas, and H. J.
Wang. Online Aggregation. ACM SIGMOD 1997.
HSD01 G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. ACM SIGKDD 2001.
IKM00 P. Indyk, N. Koudas, S. Muthukrishnan.
Identifying representative trends in massive time
series data sets using sketches. VLDB, 2000.
Ind00 P. Indyk. Stable Distributions,
Pseudorandom Generators, Embeddings, and Data
Stream Computation. IEEE FOCS, 2000.
IP95 Y. Ioannidis and V. Poosala. Balancing
Histogram Optimality and Practicality for Query
Result Size Estimation. ACM SIGMOD 1995.
IP99 Y.E. Ioannidis and V. Poosala.
Histogram-Based Approximation of Set-Valued
Query Answers. VLDB 1999.
JKM98 H. V. Jagadish, N. Koudas, S.
Muthukrishnan, V. Poosala, K. Sevcik, and T.
Suel. Optimal Histograms with Quality
Guarantees. VLDB 1998.
JL84 W.B. Johnson, J. Lindenstrauss. Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, 26, 1984.
Koo80 R. P. Kooi. The Optimization of Queries
in Relational Databases. PhD thesis, Case
Western Reserve University, 1980.

57
References (4)

MRL98 G.S. Manku, S. Rajagopalan, and B. G.
Lindsay. Approximate Medians and other Quantiles
in One Pass and with Limited Memory. ACM SIGMOD
1998.
MRL99 G.S. Manku, S. Rajagopalan, B.G. Lindsay.
Random Sampling Techniques for Space Efficient
Online Computation of Order Statistics of Large
Datasets. ACM SIGMOD, 1999.
MVW98 Y. Matias, J.S. Vitter, and M. Wang.
Wavelet-based Histograms for Selectivity
Estimation. ACM SIGMOD 1998.
MVW00 Y. Matias, J.S. Vitter, and M. Wang.
Dynamic Maintenance of Wavelet-based
Histograms. VLDB 2000.
PIH96 V. Poosala, Y. Ioannidis, P. Haas, and E.
Shekita. Improved Histograms for Selectivity
Estimation of Range Predicates. ACM SIGMOD
1996.
PJO99 F. Provost, D. Jensen, and T. Oates. Efficient Progressive Sampling. KDD 1999.
Poo97 V. Poosala. Histogram-Based Estimation
Techniques in Database Systems. PhD Thesis,
Univ. of Wisconsin, 1997.
PSC84 G. Piatetsky-Shapiro and C. Connell.
Accurate Estimation of the Number of Tuples
Satisfying a Condition. ACM SIGMOD 1984.
SDS96 E.J. Stollnitz, T.D. DeRose, and D.H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann Publishers Inc., 1996.

58
References (5)

T96 H. Toivonen. Sampling Large Databases for
Association Rules. VLDB 1996.
TGI02 N. Thaper, S. Guha, P. Indyk, N. Koudas.
Dynamic Multidimensional Histograms. ACM SIGMOD,
2002.
U89 P. E. Utgoff. Incremental Induction of
Decision Trees. Machine Learning, 4, 1989.
U94 P. E. Utgoff. An Improved Algorithm for Incremental Induction of Decision Trees. ICML 1994.
Vit85 J. S. Vitter. Random Sampling with a
Reservoir. ACM TOMS, 1985.