Chap. 8 Mining Stream, Time-Series, and Sequence Data

Provided by: jiaw185

1
Chap. 8 Mining Stream, Time-Series, and Sequence Data
  • Data Mining

2
Characteristics of Data Streams
  • Data Streams
  • Traditional DBMS: data stored in finite,
    persistent data sets
  • Data streams: continuous, ordered, changing,
    fast, huge amount
  • Characteristics
  • Fast changing and requires fast, real-time
    response
  • Random access is expensive: single-scan
    algorithms (can only have one look)
  • Store only the summary of the data seen thus far
  • Most stream data are at a rather low level or
    multi-dimensional in nature, and need multi-level
    and multi-dimensional processing

3
Stream Data Applications
  • Telecommunication: calling records
  • Business: credit card transaction flows
  • Financial market: stock exchange
  • Computer network: network traffic monitoring
  • Sensor network: sensor data streams
  • Security monitoring: video streams
  • Web: click streams, Web logs

4
DBMS versus DSMS
DBMS                          DSMS
Persistent relations          Transient streams
One-time queries              Continuous queries
Random access                 Sequential access
Unbounded disk store          Bounded main memory
Only current state matters    Historical data is important
No real-time services         Real-time requirements
Relatively low update rate    Possibly multi-GB arrival rate
Data at any granularity       Data at fine granularity
Assume precise data           Data stale/imprecise


5
Stream Query Processing
[Figure: multiple streams enter a stream query processor, which uses scratch space (main memory and/or disk); a user/application issues a continuous query and receives results]
6
Challenges of Stream Data Processing
  • Multiple, continuous, rapid, time-varying,
    ordered streams
  • Main memory computations
  • Queries are often continuous
  • Evaluated continuously as stream data arrives
  • Answer updated over time
  • Queries are often complex
  • Beyond element-at-a-time processing
  • Beyond stream-at-a-time processing
  • Beyond relational queries (scientific, data
    mining, OLAP)
  • Multi-level/multi-dimensional processing and data
    mining
  • Most stream data are at low-level or
    multi-dimensional in nature

7
Methodologies for Stream Data Processing
  • Major challenges
  • Keep track of a large universe, e.g., pairs of IP
    addresses
  • Methodology
  • Use synopsis data structures, much smaller
    (O(log^k N) space) than their base data set (O(N) space)
  • Compute an approximate answer within a small
    error range (factor ε of the actual answer)
  • Major methods
  • Random sampling
  • Histograms
  • Sliding windows
  • Multi-resolution model
  • Sketches
  • Randomized algorithms
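
As one concrete instance of the random sampling method listed above, the following is a minimal sketch of reservoir sampling, a standard single-scan technique that keeps a uniform sample of fixed size from a stream of unknown length. The slides do not name a specific sampling algorithm, so the choice of reservoir sampling and the function name `reservoir_sample` are assumptions made for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items using one pass and O(k) memory."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(1000), 10, random.Random(0))
short = reservoir_sample(range(5), 10)
```

The summary never grows beyond k items, which matches the requirement that the synopsis be much smaller than the base data.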

8
Stream Data Mining
  • Stream mining: a more challenging task
  • It shares most of the difficulties with stream
    querying
  • But often requires less precision
  • Patterns are hidden and more general than
    querying
  • It may require exploratory analysis
  • Not necessarily continuous queries
  • Stream data mining tasks
  • Multi-dimensional on-line analysis of streams
  • Mining outliers and unusual patterns in stream
    data
  • Clustering data streams
  • Classification of stream data

9
Multi-Dimensional Stream Analysis: Examples
  • Analysis of Web click streams
  • Raw data at low levels: seconds, Web page
    addresses, user IP addresses, ...
  • Analysts want changes, trends, and unusual
    patterns at reasonable levels of detail
  • E.g., average clicking traffic in North America
    on sports in the last 15 minutes is 40% higher
    than that in the last 24 hours
  • Analysis of power consumption streams
  • Raw data: power consumption flow for every
    household, every minute
  • Patterns one may find: average hourly power
    consumption surged 30% for manufacturing
    companies in Chicago in the last 2 hours today
    compared with the same day a week ago

10
A Stream Cube Architecture
  • A tilted time frame
  • Different time granularities
  • second, minute, quarter, hour, day, week, ...
  • Critical layers
  • Minimum interest layer (m-layer)
  • Observation layer (o-layer)
  • User watches at the o-layer and occasionally
    needs to drill down to the m-layer
  • Partial materialization of stream cubes
  • Full materialization: too space- and time-consuming
  • No materialization: slow response at query time
  • Partial materialization: what do we mean by
    "partial"?

11
A Tilted Time Model
  • Natural tilted time frame
  • Example: minimal unit is a quarter, then
    4 quarters → 1 hour, 24 hours → 1 day, ...
  • Logarithmic tilted time frame
  • Example: minimal unit is 1 minute, then 1, 2, 4,
    8, 16, 32, ...
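
To see why a logarithmic tilted time frame is compact, the sketch below lists the doubling window sizes needed to cover a given amount of history. The function name `log_tilted_windows` and the minute-based horizon are illustrative assumptions, not part of the slides.

```python
def log_tilted_windows(horizon_minutes):
    """Return doubling window sizes (1, 2, 4, ... minutes), most recent
    first, that together cover at least `horizon_minutes` of history."""
    sizes = []
    size, covered = 1, 0
    while covered < horizon_minutes:
        sizes.append(size)
        covered += size
        size *= 2
    return sizes

hour = log_tilted_windows(60)
year = log_tilted_windows(365 * 24 * 60)
print(hour)       # window sizes covering one hour
print(len(year))  # number of windows covering one year
```

Under these assumptions, about 20 windows cover a year of minute-level data, compared with 525,600 one-minute slots at uniform granularity.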

12
Two Critical Layers
  • o-layer (observation): (*, theme, quarter)
  • m-layer (minimal interest): (user-group, URL-group, minute)
  • (primitive) stream data layer: (individual-user, URL, second)
13
Partial Materialization
  • On-line materialization
  • Materialization takes precious space and time
  • Only incremental materialization (with tilted
    time frame)
  • Only materialize cuboids of the critical
    layers?
  • Online computation may take too much time
  • Preferred solution
  • Popular-path approach: materialize the cuboids
    along the popular drilling paths
  • H-tree structure: such cuboids can be computed
    and stored efficiently using the H-tree structure

14
Stream Cube Structure: From m-layer to o-layer
15
Frequent Patterns for Stream Data
  • Frequent pattern mining is valuable in stream
    applications
  • e.g., network intrusion mining
  • Mining precise freq. patterns in stream data
  • Unrealistic even if we store them in a compressed
    form, such as an FP-tree
  • → Approximate frequent patterns
  • Mining the evolution of freq. patterns
  • Use a tilted time window frame
  • Mining evolution and dramatic changes of frequent
    patterns

16
Mining Approximate Frequent Patterns
  • Approximate answers are often sufficient
  • Example: a router is interested in all flows
    whose frequency is at least 1% (s) of the entire
    traffic stream seen so far, and feels that an
    error of 1/10 of s (ε = 0.1%) is comfortable
  • Lossy Counting Algorithm
  • Major idea: do not trace an item until it becomes
    frequent
  • Adv.: guaranteed error bound
  • Disadv.: keeps a large set of traces

17
Lossy Counting
Divide the stream into buckets (bucket size is 1/ε, e.g., 1000)
18
Lossy Counting
  • First bucket

19
Lossy Counting
  • Next bucket

20
Approximation Guarantee
  • Given (1) support threshold s, (2) error
    threshold ε, and (3) stream length N
  • Output: items with frequency counts exceeding
    (s − ε)N
  • How much do we undercount?
  • If the stream length seen so far = N
  • and bucket size = 1/ε
  • then frequency count error ≤ number of buckets = εN
  • Approximation guarantee
  • No false negatives
  • False positives have true frequency count at
    least (s − ε)N
  • Frequency count underestimated by at most εN
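
The bucket scheme and guarantee above can be sketched in a few lines of Python. This follows the usual formulation of Lossy Counting, in which each tracked item stores a count f and a maximum possible undercount Δ. The function names `lossy_count` and `frequent_items` and the toy stream are assumptions made for illustration.

```python
import math

def lossy_count(stream, epsilon):
    """Single-pass Lossy Counting. Returns ({item: [f, delta]}, N)."""
    width = math.ceil(1 / epsilon)       # bucket size 1/epsilon
    counts = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)    # id of the current bucket
        if item in counts:
            counts[item][0] += 1
        else:
            # A new item may have been missed in up to (bucket - 1) earlier buckets.
            counts[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop items that cannot be frequent so far.
            stale = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for key in stale:
                del counts[key]
    return counts, n

def frequent_items(counts, n, s, epsilon):
    """Report items whose stored count is at least (s - epsilon) * N."""
    return {k: f for k, (f, d) in counts.items() if f >= (s - epsilon) * n}

# Toy stream: "a" is 50% of the items, "b" is 30%, the rest occur once each.
stream = ["a" if i % 2 == 0 else "b" if i % 10 in (1, 3, 5) else f"x{i}"
          for i in range(100)]
counts, n = lossy_count(stream, epsilon=0.1)
result = frequent_items(counts, n, s=0.3, epsilon=0.1)
```

On this toy stream the one-off items are pruned at every bucket boundary, so only the two heavy items remain in memory.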

21
Lossy Counting Example
N = 1000, s = 0.1, ε = 0.01, bucket size = 100 → count error ≤ 10
(if the stored count is 50, the actual count is between 50 and 60)
22
Lossy Counting For Frequent Itemsets
Divide the stream into buckets as for frequent
items, but fill as many buckets as possible into
main memory at one time.
If we load 3 buckets of data into main memory at
one time, then decrease each frequency count by 3.
23

Lossy Counting For Frequent Itemsets
Itemset ( ) is deleted. Choosing a large
number of buckets deletes more itemsets.
24
Lossy Counting For Frequent Itemsets
Pruning itemsets: Apriori rule. If we find that
itemset ( ) is not a frequent itemset, then
we need not consider its supersets.
25
Classification for Data Streams
  • Decision tree induction for stream data
    classification
  • VFDT(Very Fast Decision Tree) / CVFDT
  • Other stream classification methods
  • Instead of decision-trees, consider other models
  • Naïve Bayesian
  • Ensemble
  • Tilted time framework, incremental updating,
    dynamic maintenance, and model construction
  • Comparison of models to find changes

26
Hoeffding Tree
  • Only uses small sample
  • Based on Hoeffding Bound principle
  • Hoeffding Bound (Additive Chernoff Bound)
  • r: random variable
  • R: range of r
  • n: number of independent observations
  • ε = sqrt(R² ln(1/δ) / (2n))
  • True mean of r is at least r_avg − ε, with
    probability 1 − δ
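
The bound can be evaluated directly. The sketch below uses the standard Hoeffding form ε = sqrt(R² ln(1/δ) / (2n)); the function name `hoeffding_bound` and the sample values are illustrative assumptions.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the observed
    mean of n independent observations, with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

eps_100 = hoeffding_bound(1.0, 1e-7, 100)
eps_1000 = hoeffding_bound(1.0, 1e-7, 1000)
```

The bound shrinks as more examples arrive, which is what lets the tree commit to a split after seeing only a small sample.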

27
Hoeffding Tree
  • Input
  • S: sequence of examples
  • X: attributes
  • G( ): evaluation function, e.g., information gain
  • δ: desired accuracy
  • Hoeffding Tree Algorithm
  • for each example in S
  • retrieve G(Xa) and G(Xb) // two highest G(Xi)
  • if ( G(Xa) − G(Xb) > ε )
  • split on Xa
  • recurse to next node
  • break
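
The split test at the heart of the algorithm above can be sketched as follows. This covers only the decision at a single leaf, not the full tree, and it assumes the gains G(Xi) have already been computed from the examples seen at that leaf. The function name `choose_split` and the gain values are hypothetical.

```python
import math

def choose_split(gains, value_range, delta, n):
    """Return the attribute to split on, or None if more examples are needed.

    gains: {attribute: G(attribute)} computed from the n examples at this leaf.
    """
    if len(gains) < 2:
        return None
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best, g_a), (_, g_b) = ranked[0], ranked[1]
    epsilon = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))
    # Split only when the best attribute beats the runner-up by more than epsilon.
    return best if g_a - g_b > epsilon else None

gains = {"age": 0.30, "income": 0.10}
early = choose_split(gains, value_range=1.0, delta=1e-7, n=50)
later = choose_split(gains, value_range=1.0, delta=1e-7, n=1000)
```

With 50 examples the gap of 0.2 is still inside the bound, so the leaf waits; with 1000 examples the bound has shrunk enough to split.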

28
Hoeffding Tree
29
Hoeffding Tree
  • Strengths
  • Scales better than traditional methods
  • Sublinear with sampling
  • Very small memory utilization
  • Incremental
  • Make class predictions in parallel
  • New examples are added as they come
  • Weakness
  • Could spend a lot of time with ties
  • Memory used with tree expansion
  • Number of candidate attributes

30
VFDT
  • VFDT (Very Fast DT) - Modifications to Hoeffding
    Tree
  • Near-ties broken more aggressively
  • G computed every n_min examples
  • Deactivates certain leaves to save memory
  • Poor attributes dropped
  • Initialize with traditional learner (helps
    learning curve)
  • Compare to Hoeffding Tree
  • Better time and memory
  • Compare to traditional decision tree
  • Similar accuracy
  • Better runtime with 1.61 million examples
  • 21 minutes for VFDT
  • 24 hours for C4.5
  • Still does not handle concept drift

31
CVFDT
  • Concept Drift
  • Time-changing data streams
  • Incorporate new and eliminate old
  • CVFDT (Concept-adapting VFDT)
  • Increments counts with each new example
  • Decrements counts for the old example
  • Sliding window
  • Nodes assigned monotonically increasing IDs
  • Grows alternate subtrees
  • When an alternate is more accurate → replace the
    old subtree
  • O(w) better runtime than VFDT-window

32
Ensemble of Classifiers Algorithm
  • Method
  • Train K classifiers from K chunks
  • For each subsequent chunk
  • Train a new classifier
  • Test other classifiers against the chunk
  • Assign weight to each classifier
  • Select top K classifiers
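
The loop above can be sketched in Python. The slides do not say how weights are assigned, so this sketch assumes the weight is simply accuracy on the newest chunk. It also scores the new classifier on its own training chunk, which is a simplification. The toy learner `train_majority` and the function names are hypothetical.

```python
from collections import Counter

def train_majority(chunk):
    """Toy learner (an assumption for illustration): always predicts the
    most common label in its training chunk."""
    label = Counter(y for _, y in chunk).most_common(1)[0][0]
    return lambda x: label

def update_ensemble(ensemble, chunk, train, k):
    """ensemble: list of (model, weight). Returns the top-k after one chunk."""
    candidates = [model for model, _ in ensemble] + [train(chunk)]
    # Weight each classifier by its accuracy on the newest chunk.
    scored = [(m, sum(m(x) == y for x, y in chunk) / len(chunk))
              for m in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def predict(ensemble, x):
    """Weighted vote over the ensemble members."""
    votes = Counter()
    for model, weight in ensemble:
        votes[model(x)] += weight
    return votes.most_common(1)[0][0]

# The label changes from "a" to "b" in the third chunk (concept drift).
chunks = [[(i, "a") for i in range(4)],
          [(i, "a") for i in range(4)],
          [(i, "b") for i in range(4)]]
ensemble = []
for chunk in chunks:
    ensemble = update_ensemble(ensemble, chunk, train_majority, k=2)
prediction = predict(ensemble, 0)
```

After the drift, the classifier trained on the newest chunk carries all the weight, so the ensemble follows the new concept.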

33

Clustering Data Streams
  • Based on the k-median method
  • Data stream points come from a metric space
  • Find k clusters in the stream s.t. the sum of
    distances from data points to their closest
    center is minimized
  • Constant-factor approximation algorithm
  • For each set of M records, Si, find O(k) centers
    in S1, ..., Sl
  • Local clustering: assign each point in Si to its
    closest center
  • Let S be the centers for S1, ..., Sl, with each
    center weighted by the number of points assigned to it
  • Cluster S to find k centers
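
The divide-and-conquer structure above can be sketched for one-dimensional data. The local solver here is a simple iterative heuristic standing in for the constant-factor approximation algorithm, which the slides do not spell out. All function names are assumptions, and the stream is held in a list only to keep the example short.

```python
def weighted_median(pairs):
    """Value minimizing the weighted sum of distances to (value, weight) pairs."""
    pairs = sorted(pairs)
    half = sum(w for _, w in pairs) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value

def assign(points, weights, centers):
    """Group (point, weight) pairs by nearest center."""
    groups = [[] for _ in centers]
    for p, w in zip(points, weights):
        nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        groups[nearest].append((p, w))
    return groups

def k_median_1d(points, weights, k, iterations=20):
    """Heuristic weighted k-median on a line; returns [(center, weight), ...]."""
    distinct = sorted(set(points))
    k = min(k, len(distinct))
    # Spread the initial centers evenly over the sorted distinct values.
    centers = [distinct[(2 * i + 1) * len(distinct) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = assign(points, weights, centers)
        centers = [weighted_median(g) if g else c for g, c in zip(groups, centers)]
    groups = assign(points, weights, centers)
    return [(c, sum(w for _, w in g)) for c, g in zip(centers, groups)]

def stream_k_median(stream, k, chunk_size):
    """Cluster each chunk Si, then cluster the weighted chunk centers."""
    weighted_centers = []
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        weighted_centers += k_median_1d(chunk, [1] * len(chunk), k)
    points = [c for c, _ in weighted_centers]
    weights = [w for _, w in weighted_centers]
    return sorted(c for c, _ in k_median_1d(points, weights, k))

stream = [float(v) for i in range(30) for v in (i % 3, 100 + i % 3)]
centers = stream_k_median(stream, k=2, chunk_size=20)
```

Only the weighted centers of each chunk are retained between chunks, so memory use depends on k and the number of chunks rather than on the stream length.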

34
Mining Time-Series and Sequence Data
  • Time-series database
  • Consists of sequences of values/events changing
    with time
  • Data is recorded at regular intervals
  • E.g., stock prices, power consumption,
    precipitation
  • Sequence database
  • Database of ordered items
  • E.g., Web log data: page traversal sequences
  • Mining time-series and sequence data
  • Trend analysis
  • Similarity search
  • Mining of sequential/periodic patterns

36
Trend analysis
  • A time series
  • Illustrated as a time-series graph f(t)
  • Major components
  • Long-term (trend) movements
  • General direction of movement over a long interval
  • Cyclic movements
  • Long-term oscillations of the trend curve
  • Seasonal movements
  • Identical patterns that appear annually during
    a specific period
  • Irregular movements

37
Estimation of Trend Curve
  • The least-square method
  • Find the best-fitting curve that minimizes the
    sum of squared errors
  • The moving-average method
  • Average of n consecutive data values
  • Smoothing of time series
  • E.g., stock price graph: 5-day, 20-day, 60-day
    averages
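
Both trend estimators can be sketched in plain Python. The least-squares fit below assumes a straight-line trend over time steps 0, 1, 2, ..., which is only one possible curve family. The function names and sample series are illustrative.

```python
def moving_average(series, n):
    """Moving average of order n: the mean of each run of n consecutive values."""
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]

def least_squares_line(series):
    """Slope and intercept of the line minimizing the sum of squared errors."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    slope = s_ty / s_tt
    return slope, y_mean - slope * t_mean

smoothed = moving_average([3, 7, 2, 0, 4, 5, 9, 7, 2], 3)
slope, intercept = least_squares_line([1, 3, 5, 7])
```

A larger n gives a smoother curve but loses more points at the ends of the series, which is why several window lengths are often plotted together.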

38
Similarity Search
  • Similarity search
  • Finds data sequences that differ slightly from
    the given query sequence
  • Two categories of similarity queries
  • Whole matching
  • Find a sequence that is similar to the query
    sequence
  • Subsequence matching
  • Find all pairs of similar sequences
  • Typical Applications
  • Financial market
  • Scientific databases
  • Medical diagnosis

39
Data Transformation
  • Time domain → frequency domain
  • Many techniques for signal analysis require the
    data to be in the frequency domain
  • Transformations
  • Discrete Fourier transform (DFT), discrete wavelet
    transform (DWT)
  • The Euclidean distance in the time domain equals
    the Euclidean distance in the frequency domain
  • Matching
  • Construct multidimensional index using the first
    few Fourier coefficients
  • Subsequence matching
  • Break each sequence into a set of pieces of
    window with length w, and extract the features of
    the subsequence inside the window
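
The distance-preservation property, and the reason an index on the first few coefficients is safe, can be checked numerically. The sketch uses a DFT scaled by 1/sqrt(n), since the equality holds under that normalization. The function names and the two sample sequences are assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform scaled by 1/sqrt(n) so distances are preserved."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            / math.sqrt(n)
            for f in range(n)]

def euclidean(a, b):
    """Euclidean distance; abs() also handles complex coefficients."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
y = [1.5, 2.5, 2.5, 4.5, 2.0, 2.0, 0.5, 0.5]

d_time = euclidean(x, y)
d_freq = euclidean(dft(x), dft(y))
# Distance using only the first two coefficients, as an index would.
d_prefix = euclidean(dft(x)[:2], dft(y)[:2])
```

Because the prefix distance never exceeds the true distance, filtering on the first few coefficients can return extra candidates but does not miss true matches.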

40
Enhanced Similarity Search
  • Allow gaps, differences in offsets or amplitudes
  • Normalize sequences with amplitude scaling and
    offset translation
  • Two sequences are said to be similar if they have
    enough non-overlapping time-ordered pairs of
    similar subsequences
  • Steps for a similarity search
  • Atomic matching - Find all pairs of gap-free
    windows of a small length that are similar
  • Window stitching - Stitch similar windows to form
    pairs of large similar subsequences allowing gaps
  • Subsequence ordering - Linearly order the
    subsequence matches to determine whether enough
    similar pieces exist
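
The normalization step can be sketched as below. Subtracting the mean for offset translation and dividing by the standard deviation for amplitude scaling is one common choice. It is an assumption here, since the slides do not give the formula.

```python
import math

def normalize(seq):
    """Offset translation (subtract the mean), then amplitude scaling
    (divide by the standard deviation)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)   # a constant sequence has no shape to compare
    return [(v - mean) / std for v in seq]

a = normalize([1, 2, 3, 2])
b = normalize([110, 120, 130, 120])   # same shape, different offset and amplitude
```

After normalization the two sequences coincide, so a window-level comparison treats them as similar.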

41
Sequential Pattern Mining
  • Mining of frequently occurring patterns
  • Concentrate on symbolic patterns
  • Examples
  • (Buy PC, buy memory)
  • Applications
  • Customer retention
  • Medical treatment
  • Disaster (e.g. earthquakes), market prediction
  • Weblog click stream analysis
  • Methods for sequential pattern mining
  • Variations of Apriori-like algorithms

42
Sequential Pattern Mining
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
A sequence database
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is
a sequential pattern.
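
The subsequence relation and support count can be sketched as follows, representing each element as a set of items. The full database table did not survive in this transcript, so the example uses only the two sequences quoted on the slide. The helper names `seq`, `is_subsequence`, and `support` are hypothetical.

```python
def seq(*elements):
    """Build a sequence from strings, e.g. seq("a", "bc") for <a(bc)>."""
    return [set(e) for e in elements]

def is_subsequence(pattern, sequence):
    """True if each element of pattern is contained in a distinct element
    of sequence, in the same order."""
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest element that contains this itemset.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)

database = [seq("a", "abc", "ac", "d", "cf"),
            seq("ef", "ab", "df", "c", "b")]

contained = is_subsequence(seq("a", "bc", "d", "c"), database[0])
sup = support(seq("ab", "c"), database)
```

Both quoted sequences contain <(ab)c>, which is consistent with it meeting min_sup = 2.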
43
GSP: Generalized Sequential Pattern Mining
  • Outline of the method
  • Initially, every item in DB is a candidate of
    length-1
  • for each level (i.e., sequences of length-k) do
  • scan database to collect support count for each
    candidate sequence
  • generate candidate length-(k+1) sequences from
    length-k frequent sequences using Apriori
  • repeat until no frequent sequence or no candidate
    can be found
  • Major strength
  • Candidate pruning by Apriori

44
GSP: Generalized Sequential Pattern Mining
min_sup = 2
45
Biology Fundamentals: DNA
  • DNA: helix-shaped molecule whose constituents are
    two parallel strands of nucleotides
  • DNA is usually represented by sequences of four
    nucleotides (A, C, G, T)
  • This assumes only one strand is considered; the
    second strand is always derivable from the first
    by pairing As with Ts and Cs with Gs, and
    vice versa
  • Nucleotides (bases)
  • Adenine (A)
  • Cytosine (C)
  • Guanine (G)
  • Thymine (T)
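
The pairing rule makes the second strand a simple character mapping of the first. This is a minimal sketch with an illustrative function name; it applies the base pairing only and does not model strand orientation.

```python
PAIRING = str.maketrans("ACGT", "TGCA")   # A pairs with T, C pairs with G

def complement(strand):
    """Derive the paired strand base by base."""
    return strand.upper().translate(PAIRING)

paired = complement("GATTACA")
```

Applying the mapping twice returns the original strand, which is why storing one strand is sufficient.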

46
Biology Fundamentals: Genes
  • Gene: contiguous subparts of single-strand DNA
    that are templates for producing proteins. Genes
    can appear in either of the DNA strands.
  • Chromosomes: compact chains of coiled DNA
  • Genome: the set of all genes in a given organism.

Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm
47
Biology Fundamentals: Protein
  • Genes are transcribed into RNA by a complex
    ensemble of molecules. During transcription, T is
    substituted by the letter U (for uracil).
  • Triplets of consecutive nucleotides (called
    codons) are translated one after another, each
    producing one corresponding amino acid
  • Protein: built from amino acids. It participates
    with other proteins and molecules in keeping the
    cell alive and interacting with its environment

Source: fajerpc.magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm
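
At the string level described on this slide, transcription and codon splitting can be sketched as below. The function names and the sample sequence are illustrative assumptions, and the mapping from codons to amino acids is left out.

```python
def transcribe(dna):
    """Write the gene as RNA by substituting U for T."""
    return dna.upper().replace("T", "U")

def codons(rna):
    """Split RNA into consecutive triplets, ignoring a trailing partial triplet."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]

rna = transcribe("ATGGCCTTAA")
triplets = codons(rna)
```

Each triplet in the result corresponds to one amino acid in the protein built from this stretch of RNA.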
48
Data Mining & Bioinformatics
  • Many biological processes are not well-understood
  • Biological data is abundant and information-rich
  • Genomics & proteomics data (sequences),
    microarrays and protein arrays, protein database
    (PDB), bio-testing data
  • Huge data banks, rich literature, openly
    accessible
  • Largest and richest scientific data sets in the
    world
  • Mining: gain biological insight (data →
    knowledge)
  • Mining for correlations, linkages between disease
    and gene sequences, protein networks,
    classification, clustering, outliers, ...
  • Find correlations among linkages in literature
    and heterogeneous databases

49
Data Mining & Bioinformatics
  • Research and development of new tools for
    bioinformatics
  • Similarity search and comparison between classes
    of genes by finding and comparing frequent
    patterns
  • Identify sequential patterns that play roles in
    various diseases
  • New clustering and classification methods for
    micro-array data and protein-array data analysis
  • Mining, indexing and similarity search in
    sequential and structured (e.g., graph and
    network) data sets
  • Path analysis: linking genes/proteins to
    different disease development stages
  • High-dimensional analysis and OLAP mining
  • Visualization tools and genetic/proteomic data
    analysis
