Title: Mining from Data Streams - Competition between Quality and Speed
1Mining from Data Streams - Competition between
Quality and Speed
- Adapted from presentations by
  - Wei-Guang Teng
  - S. Muthukrishnan
2Streaming Finding Missing Numbers
- Paul permutes the numbers 1..n and shows all but one of them to Carole, one after the other, in the permuted order.
- Carole must find the missing number.
- Carole cannot remember all the numbers she has been shown.
3Streaming Finding Missing Numbers
- Carole accumulates the sum of all the numbers she has been shown. At the end she subtracts this sum from n(n+1)/2.
- Analysis
  - Takes O(log n) bits to store the partial sum
  - Performs one addition each time a new number is shown (O(log n) time per number)
  - Performs one subtraction at the end (O(log n) time)
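A minimal Python sketch of Carole's strategy, assuming the numbers arrive as any iterable; the function name `find_missing` is illustrative and not from the slides.

```python
def find_missing(stream, n):
    """Return the one number from 1..n that never appears in `stream`."""
    partial_sum = 0  # the only state Carole keeps: O(log n) bits
    for x in stream:
        partial_sum += x  # one addition per number shown
    return n * (n + 1) // 2 - partial_sum  # one subtraction at the end


shown = [3, 1, 5, 2]  # a permutation of 1..5 with 4 withheld
print(find_missing(shown, 5))
```

The partial sum never exceeds n(n+1)/2, which is why roughly 2 log n bits are enough to hold it.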
4Data Streams (1)
- Traditional DBMS: data stored in finite, persistent data sets
- New applications: data input as continuous, ordered data streams
  - Network monitoring and traffic engineering
  - Telecom call detail records (CDR)
  - ATM operations in banks
  - Sensor networks
  - Web logs and click-streams
  - Transactions in retail chains
  - Manufacturing processes
5Data Streams (2)
- Definition
  - Continuous, unbounded, rapid, time-varying streams of data elements
- Application characteristics
  - Massive volumes of data (can be several terabytes)
  - Records arrive at a rapid rate
- Goal
  - Mine patterns, process queries and compute statistics on data streams in real time
6Data Stream Algorithms
- Streaming involves
  - A small number of passes over the data (typically 1?)
  - Sublinear space (sublinear in the universe or in the number of stream items?)
  - Sublinear time for computing (?)
- Similar to dynamic, online, approximation or randomized algorithms, but with more constraints
7Data Streams Analysis Model
User/Application
Query/Mining Target
Results
Stream Processing Engine
Scratch Space (Memory and/or Disk)
8Motivation
- 3 billion telephone calls in the US each day
- 30 billion emails daily, 1 billion SMS, IMs
- Scientific data: NASA's observation satellites generate billions of readings each day
- IP network traffic: up to 1 billion packets per hour per router. Each ISP has many (hundreds) of routers!
- Compare to human-scale data: "only" 1 billion worldwide credit card transactions per month
9Network Management Application
- Monitoring and configuring network hardware and
software to ensure smooth operation
Network Operations Center
Measurements Alarms
Network
10IP Network Measurement Data
- IP session data
- AT&T collects 100 GB of NetFlow data each day!
11Network Data Processing
- Traffic estimation/analysis
  - List the top 100 IP addresses in terms of traffic
  - What is the average duration of an IP session?
- Fraud detection
  - Identify all sessions whose duration was more than twice the normal
- Security/Denial of Service
  - List all IP addresses that have witnessed a sudden spike in traffic
  - Identify IP addresses involved in more than 1000 sessions
12Challenges in Network Apps.
- 1 link with 2 Gb/s. Say the average packet size is 50 bytes.
  - Number of packets/sec = 5 million
  - Time per packet = 0.2 µsec
- If we capture packet headers (src/dest IP, time, number of bytes, etc.), that is at least 10 bytes per packet. Space per second is 50 MB. Space per day is about 4.3 TB per link. ISPs have hundreds of links.
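The arithmetic on this slide can be checked with a few lines of Python. The constant names are illustrative; the inputs (2 Gb/s, 50-byte packets, 10 header bytes per packet) are the slide's own.

```python
LINK_RATE_BITS_PER_SEC = 2e9   # 2 Gb/s link
AVG_PACKET_BYTES = 50
HEADER_BYTES_PER_PACKET = 10   # lower bound on the captured header fields
SECONDS_PER_DAY = 24 * 60 * 60

packets_per_sec = LINK_RATE_BITS_PER_SEC / 8 / AVG_PACKET_BYTES
time_per_packet_us = 1e6 / packets_per_sec
capture_bytes_per_sec = packets_per_sec * HEADER_BYTES_PER_PACKET
capture_bytes_per_day = capture_bytes_per_sec * SECONDS_PER_DAY

print(f"{packets_per_sec / 1e6:.1f} million packets/s")
print(f"{time_per_packet_us:.1f} microseconds per packet")
print(f"{capture_bytes_per_sec / 1e6:.0f} MB of headers per second")
print(f"{capture_bytes_per_day / 1e12:.2f} TB of headers per day")
```

With these inputs the daily volume works out to 50 MB/s times 86,400 s, or about 4.3 TB per link, the same order of magnitude as the figure quoted on the slide.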
13Data Streaming Models
- Input data: $a_1, a_2, a_3, \ldots$
- The input stream describes a signal $A[i]$, a one-dimensional function (value vs. index)
- There is a mapping from the input stream to the signal
- This is the data stream model
14Time-Series Model
15Cash-Register Model
- The $a_i$'s are increments to the $A[j]$'s
- $a_i = (j, I_i)$, $I_i > 0$
- $A_i[j] = A_{i-1}[j] + I_i$
16Turnstile Model
- The $a_i$'s are updates to the $A[j]$'s
- $a_i = (j, U_i)$
- $A_i[j] = A_{i-1}[j] + U_i$
- Strict turnstile model
  - $A_i[j] \ge 0$ at all $i$
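The cash-register and turnstile models differ only in which updates they allow. The following is a minimal sketch, assuming the signal A is kept as a dictionary from index j to its current value; the class name `Signal` and the model strings are hypothetical.

```python
from collections import defaultdict


class Signal:
    """Signal A maintained from a stream of items a_i = (j, update)."""

    MODELS = ("cash-register", "turnstile", "strict-turnstile")

    def __init__(self, model="turnstile"):
        if model not in self.MODELS:
            raise ValueError(f"unknown model: {model}")
        self.model = model
        self.A = defaultdict(int)

    def update(self, j, u):
        """Apply A_i[j] = A_{i-1}[j] + u, enforcing the model's constraint."""
        if self.model == "cash-register" and u <= 0:
            raise ValueError("cash-register increments must be positive")
        if self.model == "strict-turnstile" and self.A[j] + u < 0:
            raise ValueError("strict turnstile requires A[j] >= 0 at all times")
        self.A[j] += u


calls = Signal("cash-register")
for j, inc in [("alice", 3), ("bob", 1), ("alice", 2)]:
    calls.update(j, inc)
print(dict(calls.A))
```

Storing A explicitly takes space linear in the number of distinct indices; a real streaming algorithm would keep only a small synopsis of A.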
17Data Stream Algorithms
- Compute various functions on the signal A at various times
- Performance measures
  - Processing time per item $a_i$ in the stream
  - Space used to store the data structure on $A_t$ at time $t$
  - Time needed to compute the functions on A
18Outline
- Introduction & Motivation
- Issues & Techniques of Processing Data Streams
  - Sampling
  - Histogram
  - Wavelet
- Data Streaming Systems
- Example Algorithms for Frequency Counting
  - Lossy Counting
  - Sticky Sampling
19Data Stream Algorithms
- Stream Processing Requirements
  - Single pass: each record is examined at most once
  - Bounded storage: limited memory for storing synopsis
  - Real-time: per-record processing time (to maintain synopsis) must be low
- Generally, algorithms compute approximate answers
  - Difficult to compute answers accurately with limited memory
20Approximation in Data Streams
- Approximate Answers - Deterministic Bounds
  - Algorithms only compute an approximate answer, but with guaranteed bounds on the error
- Approximate Answers - Probabilistic Bounds
  - Algorithms compute an approximate answer with high probability
  - With probability at least $1 - \delta$, the computed answer is within a factor $\epsilon$ of the actual answer
21Sliding Window Approximation
0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1
0
- Why?
  - Approximation technique for bounded memory
  - Natural in applications (emphasizes recent data)
  - Well-specified and deterministic semantics
- Issues
  - Extend relational algebra, SQL, query optimization
  - Algorithmic work
  - Timestamps?
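As a small illustration of answering a query over recent data only, the sketch below keeps the number of 1s among the last W bits of a bit stream. The class name and the window size of 8 are assumptions for the example. It stores the window itself, so memory is bounded by W but the count over the window is exact.

```python
from collections import deque


class SlidingWindowCount:
    """Count of 1s among the last `window_size` bits (window_size >= 1)."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def add(self, bit):
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]  # the oldest bit is about to be evicted
        self.window.append(bit)
        self.ones += bit
        return self.ones


bits = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
counter = SlidingWindowCount(window_size=8)
for b in bits:
    counter.add(b)
print(counter.ones)  # number of 1s in the last 8 bits
```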
22Timestamps
- Explicit
  - Injected by data source
  - Models real-world event represented by tuple
  - Tuples may be out of order, but if near-ordered can be reordered with small buffers
- Implicit
  - Introduced as special field by DSMS
  - Arrival time in system
- Enables order-based querying and sliding windows
- Issues
  - Distributed streams?
  - Composite tuples created by DSMS?
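Reordering near-ordered tuples with a small buffer can be sketched with a min-heap. The function name `reorder` and the parameter `max_delay` (an assumed bound on how late a tuple may arrive, in timestamp units) are hypothetical; tuples that violate the bound would come out of order.

```python
import heapq
from itertools import count


def reorder(tuples, max_delay):
    """Yield (timestamp, payload) pairs in timestamp order."""
    heap = []
    seq = count()  # tie-breaker so payloads are never compared
    latest = float("-inf")
    for ts, payload in tuples:
        latest = max(latest, ts)
        heapq.heappush(heap, (ts, next(seq), payload))
        # Tuples at least max_delay older than the newest timestamp are safe to emit.
        while heap and heap[0][0] <= latest - max_delay:
            ts_out, _, payload_out = heapq.heappop(heap)
            yield ts_out, payload_out
    while heap:  # end of input: flush whatever is still buffered
        ts_out, _, payload_out = heapq.heappop(heap)
        yield ts_out, payload_out


arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d"), (6, "f")]
print(list(reorder(arrivals, max_delay=2)))
```

The buffer holds only the tuples from the last `max_delay` time units, which is what keeps it small for near-ordered streams.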
23Time
- Easiest: global system clock
  - Stream elements and relation updates timestamped on entry to system
- Application-defined time
  - Streams and relation updates contain application timestamps, may be out of order
  - Application generates heartbeat
  - Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress
  - Query results in application time
24Sampling Basics
- A small random sample S of the data often represents all the data well
- Example: select agg from R where R.e is odd (n = 12)
  - If agg is avg, return the average of the odd elements in S (answer: 5)
  - If agg is count, return the average over all elements e in S of: n if e is odd, 0 if e is even (answer: 12 * 3/4 = 9)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8

Unbiased!
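The two estimators on this slide can be reproduced directly from the slide's data stream and sample. The function names below are illustrative.

```python
def estimate_avg_odd(sample):
    """avg over odd elements: average the odd elements of the sample."""
    odd = [e for e in sample if e % 2 == 1]
    return sum(odd) / len(odd)


def estimate_count_odd(sample, n):
    """count of odd elements: average of (n if e is odd else 0) over the sample."""
    return sum(n if e % 2 == 1 else 0 for e in sample) / len(sample)


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = [9, 5, 1, 8]

print(estimate_avg_odd(sample))                 # estimate of the average
print(estimate_count_odd(sample, len(stream)))  # estimate of the count
```

For comparison, the stream actually contains 8 odd elements with average 5, so this particular sample gives the exact average and a count estimate of 9.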
25Histograms
- Histograms approximate the frequency distribution of element values in a stream
- A histogram (typically) consists of
  - A partitioning of element domain values into buckets
  - A count per bucket B (of the number of elements in B)
- Long history of use for selectivity estimation within a query optimizer ([Koo80], [PSC84], etc.)
26Types of Histograms
- Equi-Depth Histograms
  - Select buckets such that counts per bucket are equal
- V-Optimal Histograms [IP95, JKM98]
  - Select buckets to minimize frequency variance within buckets

[Figures: two histograms, each plotting count per bucket against domain values 1 to 20]
27Answering Queries using Histograms [IP99]
- (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation
- Example: select count(*) from R where 4 <= R.e <= 15
- For equi-depth histograms, the maximum error comes from the (at most two) buckets that the range covers only partially

[Figure: histogram over domain values 1 to 20 with the count spread evenly among bucket values; the range 4 <= R.e <= 15 is marked]
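A minimal sketch of answering the range-count example from a histogram, assuming each bucket's count is spread evenly over its integer domain values. The bucket boundaries and counts below are hypothetical; the slide's figure does not give them.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(lo <= R.e <= hi) from (start, end, count) buckets.

    Each bucket covers the integer values start..end inclusive, and its
    count is assumed to be spread evenly over those values.
    """
    total = 0.0
    for start, end, count in buckets:
        overlap = min(end, hi) - max(start, lo) + 1
        if overlap > 0:
            total += count * overlap / (end - start + 1)
    return total


# Hypothetical equi-depth histogram over the domain 1..20: four buckets of 10 elements.
buckets = [(1, 5, 10), (6, 8, 10), (9, 14, 10), (15, 20, 10)]
print(estimate_range_count(buckets, 4, 15))
```

Buckets that lie fully inside the range contribute their exact counts; only the first and last buckets here are partially covered and contribute an estimate.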
28Wavelet Basics
- For hierarchical decomposition of functions/signals
- Haar wavelets
  - Simplest wavelet basis => recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
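The pairwise averaging and differencing can be sketched in a few lines of Python, assuming the unnormalized Haar convention used in the slide's example (average = (a + b) / 2, detail = (a - b) / 2) and an input whose length is a power of two.

```python
def haar_decompose(values):
    """Return [overall average] + detail coefficients, coarsest level first."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details  # coarser levels go in front
    return averages + details


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
```

Running it on the slide's input reproduces the decomposition 2.75, -1.25, 0.5, 0, 0, -1, -1, 0.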
29Haar Wavelet Coefficients
- Hierarchical decomposition structure (error tree)

[Figure: error tree of Haar coefficients and their supports over the original frequency distribution 2 2 0 2 3 5 4 4]
30Wavelet-based Histograms [MVW98]
- Problem: range-query selectivity estimation
- Key idea: use a compact subset of Haar wavelet coefficients for approximating the frequency distribution
- Steps
  - Compute cumulative frequency distribution C
  - Compute Haar wavelet transform of C
  - Coefficient thresholding: only m << n coefficients can be kept
31Using Wavelet-based Histograms
- Selectivity estimation: count(a <= R.e <= b) = $C'[b] - C'[a-1]$
  - $C'$ is the (approximate) reconstructed cumulative distribution
  - Time: $O(\min\{m, \log N\})$, where m = size of wavelet synopsis (number of coefficients), N = size of domain
- Empirical results over synthetic data show improvements over random sampling and histograms
- At most $\log N + 1$ coefficients are needed to reconstruct any $C$ value
32Data Streaming Systems
- Low-level application specific approach
- DBMS approach
- Generic data stream management systems
33DBMS Vs. DSMS Meta-Questions
- Killer-apps
- Application stream rates exceed DBMS capacity?
- Can DSMS handle high rates anyway?
- Motivation
- Need for general-purpose DSMS?
- Not ad-hoc, application-specific systems?
- Non-Trivial
- DSMS merely DBMS with enhanced support for
triggers, temporal constructs, data rate mgmt?
34DBMS versus DSMS
DBMS
- Persistent relations
- One-time queries
- Random access
- Access plan determined by query processor and physical DB design

DSMS
- Transient streams (and persistent relations)
- Continuous queries
- Sequential access
- Unpredictable data characteristics and arrival patterns
35(Simplified) Big Picture of DSMS
Stored Result
Streamed Result
DSMS
Scratch Store
Stored Relations
36(Simplified) Network Monitoring
Intrusion Warnings
Online Performance Metrics
Register Monitoring Queries
DSMS
Network measurements, Packet traces
Scratch Store
Lookup Tables
37Using Conventional DBMS
- Data streams as relation inserts, continuous queries as triggers or materialized views
- Problems with this approach
  - Inserts are typically batched, high overhead
  - Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)
  - No notion of approximation, resource allocation
  - Current systems don't scale to large numbers of triggers
  - Views don't provide streamed results
38Query 1 (self-join)
- Find all outgoing calls longer than 2 minutes

    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')

- Result requires unbounded storage
- Can provide result as data stream
- Can output after 2 min, without seeing end
39Query 2 (join)
- Pair up callers and callees

    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID

- Can still provide result as data stream
- Requires unbounded temporary storage, unless streams are near-synchronized
40Query 3 (group-by aggregation)
- Total connection time for each caller

    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
    GROUP BY O1.caller

- Cannot provide result in (append-only) stream
  - Output updates?
  - Provide current value on demand?
41Data Model
- Append-only
  - Call records
- Updates
  - Stock tickers
- Deletes
  - Transactional data
- Meta-data
  - Control signals, punctuations
- System internals: probably need all of the above
42Related Database Technology
- DSMS must use ideas, but none is a substitute
  - Triggers, Materialized Views in Conventional DBMS
  - Main-Memory Databases
  - Sequence/Temporal/Timeseries Databases
  - Realtime Databases
  - Adaptive, Online, Partial Results
- Novelty in DSMS
  - Semantics: input ordering, streaming output, ...
  - State: cannot store unending streams, yet need history
  - Performance: rate, variability, imprecision, ...
43Outline
- Introduction & Motivation
- Data Stream Management System
- Issues & Techniques of Processing Data Streams
  - Sampling
  - Histogram
  - Wavelet
- Example Algorithms for Frequency Counting
  - Lossy Counting
  - Sticky Sampling
44Problem of Frequency Counts
- Identify all elements whose current frequency exceeds support threshold s = 0.1%
Stream
45Algorithm 1 Lossy Counting
- Step 1: Divide the stream into windows
- Is window size a function of support s? Will fix later
46Lossy Counting in Action ...
Empty
At window boundary, decrement all counters by 1
47Lossy Counting (cont'd)
Frequency Counts
Next Window
At window boundary, decrement all counters by 1
48Error Analysis
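- How much do we undercount?
- If the current size of the stream is $N$ and the window size is $1/\epsilon$, then the number of windows is $\epsilon N$, so the frequency error is at most $\epsilon N$ (each counter loses at most 1 per window)

Rule of thumb: set $\epsilon$ = 10% of support $s$
Example: given support frequency $s$ = 1%, set error frequency $\epsilon$ = 0.1%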
49Analysis of Lossy Counting
- Output
  - Elements with counter values exceeding $sN - \epsilon N$
- How many counters do we need?
  - Worst case: $\frac{1}{\epsilon} \log(\epsilon N)$ counters

Approximation guarantees:
- Frequencies are underestimated by at most $\epsilon N$
- No false negatives
- False positives have true frequency at least $sN - \epsilon N$
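The simplified Lossy Counting procedure described on these slides (count within a window, decrement every counter by 1 at each window boundary, report counters of at least $sN - \epsilon N$) can be sketched as follows. The function name and the example stream are illustrative, and the window size is assumed to be $\lceil 1/\epsilon \rceil$.

```python
import math
from collections import Counter


def lossy_count(stream, support, epsilon):
    """Return {item: counter} for items whose counter is >= (s - eps) * N."""
    window_size = math.ceil(1 / epsilon)
    counts = Counter()
    n = 0
    for item in stream:
        counts[item] += 1
        n += 1
        if n % window_size == 0:
            # Window boundary: decrement every counter, drop those that reach zero.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    threshold = (support - epsilon) * n
    return {item: c for item, c in counts.items() if c >= threshold}


# 100 items: every window of 10 holds five a's, three b's, one c and one d.
example = (["a"] * 5 + ["b"] * 3 + ["c", "d"]) * 10
print(lossy_count(example, support=0.25, epsilon=0.1))
```

In this example the true frequencies of a and b are 50 and 30; each counter is low by exactly 10, which is the bound $\epsilon N$, and the rare items c and d never survive a window boundary.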
50Algorithm 2 Sticky Sampling
- Create counters by sampling
- Maintain exact counts thereafter
What rate should we sample?
51Sticky Sampling (cont'd)
- For a finite stream of length $N$
  - Sampling rate = $\frac{2}{N\epsilon} \log \frac{1}{s\delta}$
  - ($\delta$ = probability of failure)
- Output
  - Elements with counter values exceeding $sN - \epsilon N$

Same rule of thumb: set $\epsilon$ = 10% of support $s$
Example: given support threshold $s$ = 1%, set error threshold $\epsilon$ = 0.1% and failure probability $\delta$ = 0.01%

Same error guarantees as Lossy Counting, but probabilistic!
52Sampling rate?
- Finite stream of length $N$
  - Sampling rate: $\frac{2}{N\epsilon} \log \frac{1}{s\delta}$
- Infinite stream with unknown $N$
  - Gradually adjust the sampling rate
- In either case
  - Expected number of counters = $\frac{2}{\epsilon} \log \frac{1}{s\delta}$

Independent of $N$!
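A minimal sketch of the finite-stream variant of Sticky Sampling, assuming the stream length N is known in advance, the logarithm is natural, and the rate is capped at 1. The function name, the `rng` parameter, and the example stream are hypothetical; the gradual rate adjustment for streams of unknown length is not shown.

```python
import math
import random


def sticky_sample(stream, n, support, epsilon, delta, rng=None):
    """Return {item: counter} for items whose counter is >= (s - eps) * N."""
    rng = rng or random.Random()
    # Sampling rate from the slide, capped at 1 for short streams.
    rate = min(1.0, 2 / (n * epsilon) * math.log(1 / (support * delta)))
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1   # maintain exact counts once a counter exists
        elif rng.random() < rate:
            counts[item] = 1    # create a counter by sampling
    threshold = (support - epsilon) * n
    return {item: c for item, c in counts.items() if c >= threshold}


example = (["a"] * 5 + ["b"] * 3 + ["c", "d"]) * 10  # 100 items
print(sticky_sample(example, n=100, support=0.25, epsilon=0.1, delta=0.01))
```

For a stream this short the capped rate is 1, so every item gets a counter and the counts are exact; on long streams the rate drops below 1 and the expected number of counters stays independent of N.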
53New Directions
- Functional approximation theory
- Data structures
- Computational geometry
- Graph theory
- Databases
- Hardware
- Streaming models
- Data stream quality monitoring
54References (1)
- [AGM99] N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. ACM PODS, 1999.
- [AMS96] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. ACM STOC, 1996.
- [CIK02] G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan. Fast mining of tabular data via approximate distance computations. IEEE ICDE, 2002.
- [CMN98] S. Chaudhuri, R. Motwani, V. Narasayya. Random Sampling for Histogram Construction: How much is enough? ACM SIGMOD, 1998.
- [CDI02] G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms. VLDB, 2002.
- [DGG02] A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams. ACM SIGMOD, 2002.
- [DJM02] T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. ACM SIGMOD, 2002.
- [DH00] P. Domingos, G. Hulten. Mining high-speed data streams. ACM SIGKDD, 2000.
- [EKSWX98] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X. Xu. Incremental Clustering for Mining in a Data Warehousing Environment. VLDB, 1998.
- [FKS99] J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan. An approximate L1-difference algorithm for massive data streams. IEEE FOCS, 1999.
- [FM85] P. Flajolet, G.N. Martin. Probabilistic Counting Algorithms for Data Base Applications. JCSS 31(2), 1985.
55References (2)
- [Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. VLDB, 2001.
- [GGI02] A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. ACM STOC, 2002.
- [GGRL99] J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh. BOAT-Optimistic Decision Tree Construction. ACM SIGMOD, 1999.
- [GK01] M. Greenwald, S. Khanna. Space-Efficient Online Computation of Quantile Summaries. ACM SIGMOD, 2001.
- [GKM01] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. Surfing Wavelets on Streams: One Pass Summaries for Approximate Aggregate Queries. VLDB, 2001.
- [GKM02] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles. VLDB, 2002.
- [GKS01b] S. Guha, N. Koudas, K. Shim. Data Streams and Histograms. ACM STOC, 2001.
- [GM98] P.B. Gibbons, Y. Matias. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. ACM SIGMOD, 1998.
- [GMP97] P.B. Gibbons, Y. Matias, V. Poosala. Fast Incremental Maintenance of Approximate Histograms. VLDB, 1997.
- [GT01] P.B. Gibbons, S. Tirthapura. Estimating Simple Functions on the Union of Data Streams. ACM SPAA, 2001.
56References (3)
- [HHW97] J.M. Hellerstein, P.J. Haas, H.J. Wang. Online Aggregation. ACM SIGMOD, 1997.
- [HSD01] G. Hulten, L. Spencer, P. Domingos. Mining Time-Changing Data Streams. ACM SIGKDD, 2001.
- [IKM00] P. Indyk, N. Koudas, S. Muthukrishnan. Identifying representative trends in massive time series data sets using sketches. VLDB, 2000.
- [Ind00] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation. IEEE FOCS, 2000.
- [IP95] Y. Ioannidis, V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. ACM SIGMOD, 1995.
- [IP99] Y.E. Ioannidis, V. Poosala. Histogram-Based Approximation of Set-Valued Query Answers. VLDB, 1999.
- [JKM98] H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, T. Suel. Optimal Histograms with Quality Guarantees. VLDB, 1998.
- [JL84] W.B. Johnson, J. Lindenstrauss. Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, 26, 1984.
- [Koo80] R.P. Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, 1980.
57References (4)
- [MRL98] G.S. Manku, S. Rajagopalan, B.G. Lindsay. Approximate Medians and other Quantiles in One Pass and with Limited Memory. ACM SIGMOD, 1998.
- [MRL99] G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. ACM SIGMOD, 1999.
- [MVW98] Y. Matias, J.S. Vitter, M. Wang. Wavelet-based Histograms for Selectivity Estimation. ACM SIGMOD, 1998.
- [MVW00] Y. Matias, J.S. Vitter, M. Wang. Dynamic Maintenance of Wavelet-based Histograms. VLDB, 2000.
- [PIH96] V. Poosala, Y. Ioannidis, P. Haas, E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. ACM SIGMOD, 1996.
- [PJO99] F. Provost, D. Jensen, T. Oates. Efficient Progressive Sampling. KDD, 1999.
- [Poo97] V. Poosala. Histogram-Based Estimation Techniques in Database Systems. PhD thesis, Univ. of Wisconsin, 1997.
- [PSC84] G. Piatetsky-Shapiro, C. Connell. Accurate Estimation of the Number of Tuples Satisfying a Condition. ACM SIGMOD, 1984.
- [SDS96] E.J. Stollnitz, T.D. DeRose, D.H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann Publishers Inc., 1996.
58References (5)
- [T96] H. Toivonen. Sampling Large Databases for Association Rules. VLDB, 1996.
- [TGI02] N. Thaper, S. Guha, P. Indyk, N. Koudas. Dynamic Multidimensional Histograms. ACM SIGMOD, 2002.
- [U89] P.E. Utgoff. Incremental Induction of Decision Trees. Machine Learning, 4, 1989.
- [U94] P.E. Utgoff. An Improved Algorithm for Incremental Induction of Decision Trees. ICML, 1994.
- [Vit85] J.S. Vitter. Random Sampling with a Reservoir. ACM TOMS, 1985.