Mining from Data Streams - Competition between Quality and Speed

Provided by: surajL
Transcript and Presenter's Notes

1
Mining from Data Streams - Competition between
Quality and Speed
  • Adapted from Wei-Guang Teng's and
    S. Muthukrishnan's presentations

2
Streaming: Finding Missing Numbers
  • Paul permutes the numbers 1..n and shows all but
    one to Carole, in the permuted order, one after
    the other.
  • Carole must find the missing number.
  • Carole cannot remember all the numbers she has
    been shown.

3
Streaming: Finding Missing Numbers
  • Carole accumulates the sum of all the numbers that
    she has been shown. At the end she subtracts
    this sum from n(n+1)/2; the difference is the
    missing number.
  • Analysis
  • Takes O(log n) bits to store the partial sum
  • Performs one addition each time a new number is
    shown (takes O(log n) time per number)
  • Performs one subtraction at the end (takes
    O(log n) time)
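
The running-sum idea above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name find_missing is an assumption chosen for the example.

```python
def find_missing(stream, n):
    """Return the one number from 1..n that never appears in the stream.

    Only a single running sum is kept, so memory does not grow with
    the length of the stream.
    """
    total = 0
    for x in stream:          # one addition per number shown
        total += x
    return n * (n + 1) // 2 - total   # one subtraction at the end


print(find_missing([3, 1, 5, 2], 5))
```

The stream can be any iterable (for example a generator reading from a socket), since each number is looked at exactly once.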

4
Data Streams (1)
  • Traditional DBMS: data stored in finite,
    persistent data sets
  • New applications: data input as continuous,
    ordered data streams
  • Network monitoring and traffic engineering
  • Telecom call detail records (CDR)
  • ATM operations in banks
  • Sensor networks
  • Web logs and click-streams
  • Transactions in retail chains
  • Manufacturing processes

5
Data Streams (2)
  • Definition
  • Continuous, unbounded, rapid, time-varying
    streams of data elements
  • Application Characteristics
  • Massive volumes of data (can be several
    terabytes)
  • Records arrive at a rapid rate
  • Goal
  • Mine patterns, process queries and compute
    statistics on data streams in real-time

6
Data Stream Algorithms
  • Streaming involves
  • Small number of passes over data. (Typically 1?)
  • Sublinear space (sublinear in the universe or
    number of stream items?)
  • Sublinear time for computing (?)
  • Similar to dynamic, online, approximation or
    randomized algorithms, but with more constraints.

7
Data Streams Analysis Model
[Figure: a User/Application sends a Query/Mining Target to the Stream Processing Engine, which uses Scratch Space (Memory and/or Disk) and returns Results]
8
Motivation
  • 3 Billion Telephone Calls in US each day
  • 30 Billion emails daily, 1 Billion SMS, IMs
  • Scientific data: NASA's observation satellites
    generate billions of readings each day.
  • IP network traffic: up to 1 billion packets per
    hour per router. Each ISP has many (hundreds) of
    routers!
  • Compare to human-scale data: "only" 1 billion
    worldwide credit card transactions per month.

9
Network Management Application
  • Monitoring and configuring network hardware and
    software to ensure smooth operation

[Figure: the Network sends Measurements and Alarms to the Network Operations Center]
10
IP Network Measurement Data
  • IP session data
  • AT&T collects 100 GB of NetFlow data each day!

11
Network Data Processing
  • Traffic estimation/analysis
  • List the top 100 IP addresses in terms of traffic
  • What is the average duration of an IP session?
  • Fraud detection
  • Identify all sessions whose duration was more
    than twice the normal
  • Security/Denial of Service
  • List all IP addresses that have witnessed a
    sudden spike in traffic
  • Identify IP addresses involved in more than 1000
    sessions

12
Challenges in Network Apps.
  • 1 link with 2 Gb/s. Say avg packet size is 50
    bytes.
  • Number of pkts/sec: 5 million.
  • Time per pkt: 0.2 µsec.
  • If we capture pkt headers per packet (src/dest
    IP, time, no. of bytes, etc.), that is at least
    10 bytes. Space per second is 50 MB. Space per
    day is about 4.3 TB per link. ISPs have hundreds
    of links.
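
The back-of-the-envelope numbers on this slide can be reproduced with a short calculation. The sketch below assumes decimal units (1 Gb = 10^9 bits, 1 TB = 10^12 bytes) and a day of 86,400 seconds; the variable names are chosen for this example only.

```python
LINK_BITS_PER_SEC = 2_000_000_000   # 2 Gb/s link
PACKET_BYTES = 50                   # assumed average packet size
HEADER_BYTES = 10                   # captured per packet (lower bound)
SECONDS_PER_DAY = 86_400

packets_per_sec = LINK_BITS_PER_SEC // (PACKET_BYTES * 8)
microsec_per_packet = 1_000_000 / packets_per_sec
capture_bytes_per_sec = packets_per_sec * HEADER_BYTES
capture_bytes_per_day = capture_bytes_per_sec * SECONDS_PER_DAY

print(packets_per_sec)                 # packets per second
print(microsec_per_packet)             # processing budget per packet, in µs
print(capture_bytes_per_sec / 1e6)     # MB captured per second
print(capture_bytes_per_day / 1e12)    # TB captured per day
```

Under these assumptions the per-day figure comes out at 4.32 TB per link.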

13
Data Streaming Models
  • Input data: $a_1, a_2, a_3, \ldots$
  • Input stream describes a signal $A[i]$, a
    one-dimensional function (value vs. index)
  • There is a mapping from the input stream to the
    signal
  • This is the data stream model

14
Time-Series Model
  • The $a_i$'s are the $A[i]$'s themselves: each $a_i$
    equals $A[i]$

15
Cash-Register Model
  • The $a_i$'s are increments to the $A[j]$'s
  • $a_i = (j, I_i)$, $I_i > 0$
  • $A_i[j] = A_{i-1}[j] + I_i$

16
Turnstile Model
  • The $a_i$'s are updates to the $A[j]$'s
  • $a_i = (j, U_i)$
  • $A_i[j] = A_{i-1}[j] + U_i$
  • Strict turnstile model
  • $A_i[j] \ge 0$ at all $i$
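
To make the difference between the update models concrete, here is a small Python sketch that replays a list of (j, update) pairs against the signal A. The function name replay and the model strings are assumptions for this illustration. Note that it stores the whole signal, which a real streaming algorithm would avoid; it only demonstrates the update rules.

```python
from collections import defaultdict


def replay(updates, model="turnstile"):
    """Apply (j, delta) updates to the signal A under the named model.

    model is one of "cash_register", "turnstile", "strict_turnstile"
    (hypothetical labels used only in this sketch).
    """
    signal = defaultdict(int)
    for j, delta in updates:
        # Cash register: only positive increments are allowed.
        if model == "cash_register" and delta <= 0:
            raise ValueError("cash-register increments must be positive")
        signal[j] += delta            # A_i[j] = A_{i-1}[j] + update
        # Strict turnstile: the signal may never go negative.
        if model == "strict_turnstile" and signal[j] < 0:
            raise ValueError("strict turnstile requires A[j] >= 0")
    return dict(signal)


print(replay([(1, 2), (3, 1), (1, 4)], "cash_register"))
print(replay([(1, 2), (1, -2), (2, 5)], "turnstile"))
```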

17
Data Stream Algorithms
  • Compute various functions on the signal A at
    various times
  • Performance measures
  • Processing time per item ai in the stream
  • Space used to store the data structure on $A_t$ at
    time t
  • Time needed to compute the functions on A

18
Outline
  • Introduction & Motivation
  • Issues & Techniques of Processing Data Streams
  • Sampling
  • Histogram
  • Wavelet
  • Data Stream Management System
  • Example Algorithms for Frequency Counting
  • Lossy Counting
  • Sticky Sampling

19
Data Stream Algorithms
  • Stream Processing Requirements
  • Single pass: each record is examined at most once
  • Bounded storage: limited memory for storing
    synopsis
  • Real-time: per-record processing time (to
    maintain synopsis) must be low
  • Generally, algorithms compute approximate answers
  • Difficult to compute answers accurately with
    limited memory

20
Approximation in Data Streams
  • Approximate Answers - Deterministic Bounds
  • Algorithms only compute an approximate answer,
    but with guaranteed bounds on the error
  • Approximate Answers - Probabilistic Bounds
  • Algorithms compute an approximate answer with
    high probability
  • With probability at least $1 - \delta$, the computed
    answer is within a factor $\epsilon$ of the actual answer
  • Data Streaming Systems System

21
Sliding Window Approximation
[Figure: bit stream 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1 0 with a sliding window]
  • Why?
  • Approximation technique for bounded memory
  • Natural in applications (emphasizes recent data)
  • Well-specified and deterministic semantics
  • Issues
  • Extend relational algebra, SQL, query
    optimization
  • Algorithmic work
  • Timestamps?

22
Timestamps
  • Explicit
  • Injected by data source
  • Models real-world event represented by tuple
  • Tuples may be out-of-order, but if near-ordered
    can reorder with small buffers
  • Implicit
  • Introduced as special field by DSMS
  • Arrival time in system
  • Enables order-based querying and sliding windows
  • Issues
  • Distributed streams?
  • Composite tuples created by DSMS?

23
Time
  • Easiest: global system clock
  • Stream elements and relation updates timestamped
    on entry to system
  • Application-defined time
  • Streams and relation updates contain application
    timestamps, may be out of order
  • Application generates heartbeat
  • Or deduce heartbeat from parameters: stream skew,
    scrambling, latency, and clock progress
  • Query results in application time

24
Sampling Basics
  • A small random sample S of the data often
    well-represents all the data
  • Example: select agg from R where R.e is odd
    (n = 12)
  • If agg is avg, return average of odd elements in
    S
  • If agg is count, return average over all elements
    e in S of
  • n if e is odd
  • 0 if e is even

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
agg = avg: answer = (9 + 5 + 1)/3 = 5
agg = count: answer = 12 * 3/4 = 9
Unbiased!
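
The two estimators on this slide can be written out directly. The sketch below uses the slide's own stream and sample; the function names are assumptions for this example, and a real system would draw the sample at random (for instance with reservoir sampling) rather than hard-code it.

```python
def estimate_avg_odd(sample):
    """avg: return the average of the odd elements in the sample."""
    odds = [e for e in sample if e % 2 == 1]
    return sum(odds) / len(odds)


def estimate_count_odd(sample, n):
    """count: average over the sample of (n if e is odd else 0)."""
    return sum(n if e % 2 == 1 else 0 for e in sample) / len(sample)


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # n = 12
sample = [9, 5, 1, 8]

print(estimate_avg_odd(sample))
print(estimate_count_odd(sample, len(stream)))
```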
25
Histograms
  • Histograms approximate the frequency distribution
    of element values in a stream
  • A histogram (typically) consists of
  • A partitioning of element domain values into
    buckets
  • A count per bucket B (of the number of
    elements in B)
  • Long history of use for selectivity estimation
    within a query optimizer (Koo80, PSC84, etc)

26
Types of Histograms
  • Equi-Depth Histograms
  • Select buckets such that counts per bucket are
    equal
  • V-Optimal Histograms (IP95, JKM98)
  • Select buckets to minimize frequency variance
    within buckets

[Figure: two bar charts, one per histogram type, each plotting count for bucket against domain values 1 to 20]
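
As an illustration of the equi-depth idea, the following sketch builds buckets with (nearly) equal counts. It is an offline version that sorts all values in memory, which is an assumption made for clarity and not how a one-pass streaming histogram would be maintained; the function name equi_depth is also hypothetical.

```python
def equi_depth(values, num_buckets):
    """Split sorted values into buckets holding (nearly) equal counts.

    Returns a list of (low, high, count) triples.
    """
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(num_buckets):
        lo = b * n // num_buckets
        hi = (b + 1) * n // num_buckets
        if lo < hi:                      # skip empty buckets
            chunk = data[lo:hi]
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


print(equi_depth([1, 1, 2, 3, 5, 8, 13, 21], 4))
```

A V-optimal histogram would instead choose the boundaries that minimize the frequency variance within buckets, which typically needs dynamic programming and is not shown here.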
27
Answering Queries using Histograms (IP99)
  • (Implicitly) map the histogram back to an
    approximate relation, apply the query to the
    approximate relation
  • Example: select count(*) from R where
    4 <= R.e <= 15
  • For equi-depth histograms, maximum error

[Figure: histogram over domain values 1 to 20 with each bucket's count spread evenly among its values; the query range 4 <= R.e <= 15 is marked]
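
A range count can be estimated from a histogram by assuming, as on the slide, that each bucket's count is spread evenly among its values. The sketch below assumes an integer domain and buckets given as (low, high, count) triples; the bucket boundaries and counts used in the example are made up for illustration.

```python
def estimate_range_count(buckets, a, b):
    """Estimate count(a <= R.e <= b) from (low, high, count) buckets."""
    total = 0.0
    for lo, hi, count in buckets:
        # Number of integer domain values shared by the bucket and the query.
        overlap = min(hi, b) - max(lo, a) + 1
        if overlap > 0:
            total += count * overlap / (hi - lo + 1)
    return total


# Hypothetical equi-depth histogram over the domain 1..20.
buckets = [(1, 5, 10), (6, 10, 10), (11, 15, 10), (16, 20, 10)]
print(estimate_range_count(buckets, 4, 15))
```

Only the buckets that the query range cuts through partially contribute error, which is why the error of an equi-depth histogram is bounded in terms of the per-bucket count.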
28
Wavelet Basics
  • For hierarchical decomposition of
    functions/signals
  • Haar wavelets
  • Simplest wavelet basis => recursive pairwise
    averaging and differencing at different
    resolutions

Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]
Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
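
The recursive averaging and differencing can be sketched as follows. This is a minimal unnormalized Haar transform that assumes the input length is a power of two; the function name haar_decompose is chosen for this example.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition by pairwise averaging/differencing.

    Returns [overall average, then detail coefficients from the coarsest
    to the finest resolution]. The input length must be a power of two.
    """
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details   # coarser levels go in front
    return averages + details


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
```

Run on the slide's data, this yields the coefficient list shown for the Haar wavelet decomposition.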
29
Haar Wavelet Coefficients
  • Hierarchical decomposition structure (error
    tree)

[Figure: error tree showing the coefficient supports over the original frequency distribution 2 2 0 2 3 5 4 4]
30
Wavelet-based Histograms (MVW98)
  • Problem: range-query selectivity estimation
  • Key idea: use a compact subset of Haar wavelet
    coefficients for approximating frequency
    distribution
  • Steps
  • Compute cumulative frequency distribution C
  • Compute Haar wavelet transform of C
  • Coefficient thresholding: only m << n coefficients
    can be kept

31
Using Wavelet-based Histograms
  • Selectivity estimation: $\mathrm{count}(a \le R.e \le b) = C[b] - C[a-1]$
  • C is the (approximate) reconstructed
    cumulative distribution
  • Time: $O(\min\{m, \log N\})$, where m = size of wavelet
    synopsis (number of coefficients), N = size of
    domain
  • Empirical results over synthetic data show
    improvements over random sampling and histograms
  • At most $\log N + 1$ coefficients are needed to
    reconstruct any C value
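
The steps (cumulative distribution, Haar transform, thresholding, range estimate) can be strung together in a short sketch. Several simplifications are assumptions of this example and not taken from MVW98: coefficients are unnormalized, thresholding simply keeps the m largest by absolute value, the domain is 0..N-1 with N a power of two, and the whole of C is reconstructed instead of touching only the log N + 1 coefficients on one path.

```python
from itertools import accumulate


def haar_decompose(values):
    """Unnormalized Haar transform (length must be a power of two)."""
    averages, details = list(values), []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    values, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        pos += len(details)
        values = [v for avg, d in zip(values, details)
                  for v in (avg + d, avg - d)]
    return values


def keep_largest(coeffs, m):
    """Thresholding: zero out all but the m largest-magnitude coefficients."""
    keep = set(sorted(range(len(coeffs)),
                      key=lambda i: abs(coeffs[i]), reverse=True)[:m])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def range_count(cumulative, a, b):
    """count(a <= R.e <= b) = C[b] - C[a-1], with C[-1] taken as 0."""
    return cumulative[b] - (cumulative[a - 1] if a > 0 else 0)


freq = [2, 2, 0, 2, 3, 5, 4, 4]              # frequencies over domain 0..7
C = list(accumulate(freq))                   # cumulative distribution
synopsis = keep_largest(haar_decompose(C), m=4)
C_approx = haar_reconstruct(synopsis)

print(range_count(C, 2, 5))                  # exact answer
print(range_count(C_approx, 2, 5))           # estimate from 4 coefficients
```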

32
Data Streaming Systems
  • Low-level application specific approach
  • DBMS approach
  • Generic data stream management systems

33
DBMS vs. DSMS: Meta-Questions
  • Killer-apps
  • Application stream rates exceed DBMS capacity?
  • Can DSMS handle high rates anyway?
  • Motivation
  • Need for general-purpose DSMS?
  • Not ad-hoc, application-specific systems?
  • Non-Trivial
  • DSMS merely DBMS with enhanced support for
    triggers, temporal constructs, data rate mgmt?

34
DBMS versus DSMS
  • DBMS
  • Persistent relations
  • One-time queries
  • Random access
  • Access plan determined by query processor and
    physical DB design
  • DSMS
  • Transient streams (and persistent relations)
  • Continuous queries
  • Sequential access
  • Unpredictable data characteristics and arrival
    patterns

35
(Simplified) Big Picture of DSMS
[Figure: DSMS block diagram with the labels Stored Result, Streamed Result, Scratch Store and Stored Relations]
36
(Simplified) Network Monitoring
[Figure: DSMS for network monitoring. Inputs: Register Monitoring Queries; Network measurements, Packet traces. Outputs: Intrusion Warnings, Online Performance Metrics. Storage: Scratch Store, Lookup Tables]
37
Using Conventional DBMS
  • Data streams as relation inserts, continuous
    queries as triggers or materialized views
  • Problems with this approach
  • Inserts are typically batched, high overhead
  • Expressiveness: simple conditions (triggers), no
    built-in notion of sequence (views)
  • No notion of approximation, resource allocation
  • Current systems don't scale to a large number of
    triggers
  • Views don't provide streamed results

38
Query 1 (self-join)
  • Find all outgoing calls longer than 2 minutes
  • SELECT O1.call_ID, O1.caller
  • FROM Outgoing O1, Outgoing O2
  • WHERE (O2.time - O1.time > 2
  • AND O1.call_ID = O2.call_ID
  • AND O1.event = start
  • AND O2.event = end)
  • Result requires unbounded storage
  • Can provide result as data stream
  • Can output after 2 min, without seeing end

39
Query 2 (join)
  • Pair up callers and callees
  • SELECT O.caller, I.callee
  • FROM Outgoing O, Incoming I
  • WHERE O.call_ID = I.call_ID
  • Can still provide result as data stream
  • Requires unbounded temporary storage
  • unless streams are near-synchronized

40
Query 3 (group-by aggregation)
  • Total connection time for each caller
  • SELECT O1.caller, sum(O2.time - O1.time)
  • FROM Outgoing O1, Outgoing O2
  • WHERE (O1.call_ID = O2.call_ID
  • AND O1.event = start
  • AND O2.event = end)
  • GROUP BY O1.caller
  • Cannot provide result in (append-only) stream
  • Output updates?
  • Provide current value on demand?

41
Data Model
  • Append-only
  • Call records
  • Updates
  • Stock tickers
  • Deletes
  • Transactional data
  • Meta-Data
  • Control signals, punctuations
  • System internals: probably need all of the above

42
Related Database Technology
  • DSMS must use ideas, but none is substitute
  • Triggers, Materialized Views in Conventional DBMS
  • Main-Memory Databases
  • Sequence/Temporal/Timeseries Databases
  • Realtime Databases
  • Adaptive, Online, Partial Results
  • Novelty in DSMS
  • Semantics: input ordering, streaming output, ...
  • State: cannot store unending streams, yet need
    history
  • Performance: rate, variability, imprecision, ...

43
Outline
  • Introduction & Motivation
  • Data Stream Management System
  • Issues & Techniques of Processing Data Streams
  • Sampling
  • Histogram
  • Wavelet
  • Example Algorithms for Frequency Counting
  • Lossy Counting
  • Sticky Sampling

44
Problem of Frequency Counts
  • Identify all elements whose current frequency
    exceeds support threshold s = 0.1%

45
Algorithm 1: Lossy Counting
  • Step 1: Divide the stream into windows
  • Is window size a function of support s? Will fix
    later

46
Lossy Counting in Action ...
[Figure: the frequency counters start out empty]
At window boundary, decrement all counters by 1
47
Lossy Counting (contd)
[Figure: the current frequency counts and the next window]
At window boundary, decrement all counters by 1
48
Error Analysis
  • How much do we undercount?
  • If current size of stream = N
  • and window size = $1/\epsilon$
  • then #windows = $\epsilon N$, so frequency error $\le \epsilon N$

Rule of thumb: set $\epsilon$ = 10% of support s.
Example: given support frequency s = 1%, set error frequency $\epsilon$ = 0.1%.
49
Analysis of Lossy Counting
  • Output
  • Elements with counter values exceeding $sN - \epsilon N$
  • How many counters do we need?
  • Worst case: $\frac{1}{\epsilon} \log(\epsilon N)$ counters

Approximation guarantees:
Frequencies are underestimated by at most $\epsilon N$.
No false negatives.
False positives have true frequency at least $sN - \epsilon N$.
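
The window-based variant described on these slides can be sketched as below: count every element, and at each window boundary decrement all counters by 1, dropping those that reach zero. The function names are assumptions for this example, and the output step uses >= so that an element with true frequency exactly sN is still reported.

```python
import math


def lossy_count(stream, epsilon):
    """Window-based Lossy Counting sketch. Returns (counters, n)."""
    width = math.ceil(1 / epsilon)        # window size 1/epsilon
    counters = {}
    n = 0
    for item in stream:
        n += 1
        counters[item] = counters.get(item, 0) + 1
        if n % width == 0:                # window boundary
            for key in list(counters):
                counters[key] -= 1        # decrement all counters by 1
                if counters[key] == 0:
                    del counters[key]     # free the counter
    return counters, n


def frequent_items(counters, n, support, epsilon):
    """Report elements whose counter reaches (support - epsilon) * n."""
    threshold = (support - epsilon) * n
    return {k for k, c in counters.items() if c >= threshold}


# Each window of 10 holds five 'a', three 'b' and two one-off items.
stream = []
for w in range(10):
    stream += ["a"] * 5 + ["b"] * 3 + [f"u{w}_0", f"u{w}_1"]

counters, n = lossy_count(stream, epsilon=0.1)
print(counters, n)
print(frequent_items(counters, n, support=0.45, epsilon=0.1))
```

In this example 'a' occurs 50 times but its counter ends at 40, an undercount of exactly epsilon * N = 10, and the one-off items never survive a window boundary.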
50
Algorithm 2: Sticky Sampling
  • Create counters by sampling
  • Maintain exact counts thereafter

What rate should we sample?
51
Sticky Sampling (contd)
  • For finite stream of length N
  • Sampling rate = $\frac{2}{N\epsilon} \log\frac{1}{s\delta}$
  • ($\delta$ = probability of failure)
  • Output
  • Elements with counter values exceeding $sN - \epsilon N$

Same rule of thumb: set $\epsilon$ = 10% of support s.
Example: given support threshold s = 1%, set error threshold $\epsilon$ = 0.1% and failure probability $\delta$ = 0.01%.
Same error guarantees as Lossy Counting, but probabilistic!
52
Sampling rate?
  • Finite stream of length N
  • Sampling rate = $\frac{2}{N\epsilon} \log\frac{1}{s\delta}$
  • Infinite stream with unknown N
  • Gradually adjust sampling rate
  • In either case,
  • Expected number of counters = $\frac{2}{\epsilon} \log\frac{1}{s\delta}$

Independent of N!
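
For the finite-stream case, Sticky Sampling can be sketched as follows: an element without a counter gets one with probability equal to the sampling rate, and once a counter exists it is maintained exactly. This sketch assumes the logarithm is the natural log, caps the rate at 1, and uses a seeded generator for repeatability; the gradual rate adjustment for streams of unknown length is not shown. The function name and parameters are hypothetical.

```python
import math
import random


def sticky_sampling(stream, n, support, epsilon, delta, seed=0):
    """Sticky Sampling sketch for a finite stream of known length n.

    Returns {element: count} for elements whose counter reaches
    (support - epsilon) * n.
    """
    rng = random.Random(seed)
    # Sampling rate 2/(n*epsilon) * log(1/(support*delta)), capped at 1.
    rate = min(1.0, 2.0 / (n * epsilon) * math.log(1.0 / (support * delta)))
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1           # exact counting once sampled
        elif rng.random() < rate:
            counters[item] = 1            # create a counter by sampling
    threshold = (support - epsilon) * n
    return {k: c for k, c in counters.items() if c >= threshold}


stream = ["a"] * 600 + ["b"] * 300 + ["c"] * 100
print(sticky_sampling(stream, n=1000, support=0.2, epsilon=0.01, delta=0.01))
```

With these small example parameters the rate is capped at 1, so every element is counted exactly; on long streams the rate drops well below 1 and counts become underestimates.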
53
New Directions
  • Functional approximation theory
  • Data structures
  • Computational geometry
  • Graph theory
  • Databases
  • Hardware
  • Streaming models
  • Data stream quality monitoring

54
References (1)
  • AGM99 N. Alon, P.B. Gibbons, Y. Matias, M.
    Szegedy. Tracking Join and Self-Join Sizes in
    Limited Storage. ACM PODS, 1999.
  • AMS96 N. Alon, Y. Matias, M. Szegedy. The space
    complexity of approximating the frequency
    moments. ACM STOC, 1996.
  • CIK02 G. Cormode, P. Indyk, N. Koudas, S.
    Muthukrishnan. Fast mining of tabular data via
    approximate distance computations. IEEE ICDE,
    2002.
  • CMN98 S. Chaudhuri, R. Motwani, and V.
    Narasayya. Random Sampling for Histogram
    Construction: How much is enough? ACM SIGMOD
    1998.
  • CDI02 G. Cormode, M. Datar, P. Indyk, S.
    Muthukrishnan. Comparing Data Streams Using
    Hamming Norms. VLDB, 2002.
  • DGG02 A. Dobra, M. Garofalakis, J. Gehrke, R.
    Rastogi. Processing Complex Aggregate Queries
    over Data Streams. ACM SIGMOD, 2002.
  • DJM02 T. Dasu, T. Johnson, S. Muthukrishnan, V.
    Shkapenyuk. Mining database structure; or, how to
    build a data quality browser. ACM SIGMOD, 2002.
  • DH00 P. Domingos and G. Hulten. Mining
    high-speed data streams. ACM SIGKDD, 2000.
  • EKSWX98 M. Ester, H.-P. Kriegel, J. Sander, M.
    Wimmer, and X. Xu. Incremental Clustering for
    Mining in a Data Warehousing Environment. VLDB
    1998.
  • FKS99 J. Feigenbaum, S. Kannan, M. Strauss, M.
    Viswanathan. An approximate L1-difference
    algorithm for massive data streams. IEEE FOCS,
    1999.
  • FM85 P. Flajolet, G.N. Martin. Probabilistic
    Counting Algorithms for Data Base Applications.
    JCSS 31(2), 1985

55
References (2)
  • Gib01 P. Gibbons. Distinct sampling for
    highly-accurate answers to distinct values
    queries and event reports, VLDB 2001.
  • GGI02 A.C. Gilbert, S. Guha, P. Indyk, Y.
    Kotidis, S. Muthukrishnan, M. Strauss. Fast,
    small-space algorithms for approximate histogram
    maintenance. ACM STOC, 2002.
  • GGRL99 J. Gehrke, V. Ganti, R. Ramakrishnan,
    and W.-Y. Loh. BOAT - Optimistic Decision Tree
    Construction. ACM SIGMOD 1999.
  • GK01 M. Greenwald and S. Khanna.
    Space-Efficient Online Computation of Quantile
    Summaries. ACM SIGMOD 2001.
  • GKM01 A.C. Gilbert, Y. Kotidis, S.
    Muthukrishnan, M. Strauss. Surfing Wavelets on
    Streams: One-Pass Summaries for Approximate
    Aggregate Queries. VLDB 2001.
  • GKM02 A.C. Gilbert, Y. Kotidis, S.
    Muthukrishnan, M. Strauss. How to Summarize the
    Universe: Dynamic Maintenance of Quantiles. VLDB
    2002.
  • GKS01b S. Guha, N. Koudas, and K. Shim. Data
    Streams and Histograms. ACM STOC 2001.
  • GM98 P. B. Gibbons and Y. Matias. New
    Sampling-Based Summary Statistics for Improving
    Approximate Query Answers. ACM SIGMOD 1998.
  • GMP97 P. B. Gibbons, Y. Matias, and V. Poosala.
    Fast Incremental Maintenance of Approximate
    Histograms. VLDB 1997.
  • GT01 P.B. Gibbons, S. Tirthapura. Estimating
    Simple Functions on the Union of Data Streams.
    ACM SPAA, 2001.

56
References (3)
  • HHW97 J. M. Hellerstein, P. J. Haas, and H. J.
    Wang. Online Aggregation. ACM SIGMOD 1997.
  • HSD01 G. Hulten, L. Spencer, and P. Domingos.
    Mining Time-Changing Data Streams. ACM SIGKDD
    2001.
  • IKM00 P. Indyk, N. Koudas, S. Muthukrishnan.
    Identifying representative trends in massive time
    series data sets using sketches. VLDB, 2000.
  • Ind00 P. Indyk. Stable Distributions,
    Pseudorandom Generators, Embeddings, and Data
    Stream Computation. IEEE FOCS, 2000.
  • IP95 Y. Ioannidis and V. Poosala. Balancing
    Histogram Optimality and Practicality for Query
    Result Size Estimation. ACM SIGMOD 1995.
  • IP99 Y.E. Ioannidis and V. Poosala.
    Histogram-Based Approximation of Set-Valued
    Query Answers. VLDB 1999.
  • JKM98 H. V. Jagadish, N. Koudas, S.
    Muthukrishnan, V. Poosala, K. Sevcik, and T.
    Suel. Optimal Histograms with Quality
    Guarantees. VLDB 1998.
  • JL84 W.B. Johnson, J. Lindenstrauss. Extensions
    of Lipschitz Mappings into a Hilbert Space.
    Contemporary Mathematics, 26, 1984.
  • Koo80 R. P. Kooi. The Optimization of Queries
    in Relational Databases. PhD thesis, Case
    Western Reserve University, 1980.

57
References (4)
  • MRL98 G.S. Manku, S. Rajagopalan, and B. G.
    Lindsay. Approximate Medians and other Quantiles
    in One Pass and with Limited Memory. ACM SIGMOD
    1998.
  • MRL99 G.S. Manku, S. Rajagopalan, B.G. Lindsay.
    Random Sampling Techniques for Space Efficient
    Online Computation of Order Statistics of Large
    Datasets. ACM SIGMOD, 1999.
  • MVW98 Y. Matias, J.S. Vitter, and M. Wang.
    Wavelet-based Histograms for Selectivity
    Estimation. ACM SIGMOD 1998.
  • MVW00 Y. Matias, J.S. Vitter, and M. Wang.
    Dynamic Maintenance of Wavelet-based
    Histograms. VLDB 2000.
  • PIH96 V. Poosala, Y. Ioannidis, P. Haas, and E.
    Shekita. Improved Histograms for Selectivity
    Estimation of Range Predicates. ACM SIGMOD
    1996.
  • PJO99 F. Provost, D. Jensen, and T. Oates.
    Efficient Progressive Sampling. KDD 1999.
  • Poo97 V. Poosala. Histogram-Based Estimation
    Techniques in Database Systems. PhD Thesis,
    Univ. of Wisconsin, 1997.
  • PSC84 G. Piatetsky-Shapiro and C. Connell.
    Accurate Estimation of the Number of Tuples
    Satisfying a Condition. ACM SIGMOD 1984.
  • SDS96 E.J. Stollnitz, T.D. DeRose, and D.H.
    Salesin. Wavelets for Computer Graphics.
    Morgan Kaufmann Publishers Inc., 1996.

58
References (5)
  • T96 H. Toivonen. Sampling Large Databases for
    Association Rules. VLDB 1996.
  • TGI02 N. Thaper, S. Guha, P. Indyk, N. Koudas.
    Dynamic Multidimensional Histograms. ACM SIGMOD,
    2002.
  • U89 P. E. Utgoff. Incremental Induction of
    Decision Trees. Machine Learning, 4, 1989.
  • U94 P. E. Utgoff. An Improved Algorithm for
    Incremental Induction of Decision Trees. ICML
    1994.
  • Vit85 J. S. Vitter. Random Sampling with a
    Reservoir. ACM TOMS, 1985.