1
Clustering of Streaming Time Series is Meaningless
  • presentation by Rafal Ladysz
  • after the original paper by
  • Eamonn Keogh
  • Jessica Lin
  • Wagner Truppel
  • Computer Science & Engineering,
  • University of California, Riverside

2
interesting and important topic
  • the foreword of the original paper reads
  • Clustering is perhaps the most frequently used
    data mining algorithm, being useful in its own
    right as an exploratory technique, and also as a
    subroutine in more complex data mining algorithms
    such as rule discovery, indexing, summarization,
    anomaly detection, and classification
  • Time series data is perhaps the most frequently
    encountered type of data examined by the data
    mining community
  • thus there is a lot of interest, work, papers, and conferences on these two; nevertheless
  • what the title claims has never before appeared in the literature

3
QUIZ questions (asked upfront)
  • what are the two main ways of clustering time series data? (name and describe each in one sentence)
  • one can convert hierarchical clustering into k-means clustering; which of these two is deterministic (if any)?
  • what method can help make subsequence clustering of time series work?

4
time series (TS) mini-primer
  • intuitive definition: a sequence of real numbers (usually acquired at equal time intervals)
  • examples of experimental time series
  • meteorological observations
  • EEG, EKG, patient's temperature (medical)
  • laser light intensity measured
  • stock market indices
  • predator-prey population recorded
  • possible division
  • periodic/non-periodic
  • stochastic (random)/chaotic (deterministic)

5
possible TS hierarchy tree
(figure: the leaf nodes refer to the actual representation, and the internal nodes to the classification of the approach; credit: Keogh et al.)
6
TS illustration
(figure: example TS: S&P 500, laser, Lorenz, earthquake, chaotic)
7
mining TS
  • general examples
  • anomaly detection (deviation from some mean
    value, e.g. monitoring functioning of space
    shuttle)
  • classification/ forecasting
  • rule discovery (surprising/interesting patterns)
  • particular example (of my current interest)
  • detecting chaos in dynamic TS data streams
  • getting insight into the underlying system's dynamics
  • computing some crucial parameter(s)
  • possible applications of the above
  • EEG
  • stock market
  • weather-related catastrophes (extremely complex)

8
TS similarity issue
  • in many (though not all) cases similarity is
    necessary to investigate TS data
  • we need some measure of similarity to mine TS
  • classification, e.g. ECG patterns of new patients as indicators of heart diseases with known ECG patterns
  • clustering, e.g. grouping websites with similar traffic patterns
  • association, e.g. a plateau followed by a sudden decrease in EEG may mean an epileptic seizure is about to happen
  • we need it for searching for a particular pattern
  • (once we have it, we can use techniques/tools to mine TS)

9
TS similarity possible measures
  • in general there are many, and which to use depends on the application
  • an obvious similarity measure is one based on Euclidean distance (with its pros and cons)
  • each sequence is a point in n-dimensional Euclidean space, where n = the length of the TS
  • then the Lp distance between TS sequences X, Y is
  • $L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
  • (a minimal sketch follows this list)
  • the old problem of the curse of dimensionality persists
  • thus scalability is desired, forcing a trade-off between accuracy and efficiency
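As a minimal sketch (my illustration, not from the original slides), the Lp distance above is a few lines of Python; p = 2 gives the familiar Euclidean distance:

```python
import numpy as np

def lp_distance(x, y, p=2):
    """L_p distance between two equal-length time series; p=2 is Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# toy series of length n = 4: sqrt(0 + 0 + 1 + 4) ~ 2.236
print(lp_distance([1, 2, 3, 4], [1, 2, 4, 6]))
```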

10
Euclidean distance for TS in action
  • (figure credit: A. K. Singh)

11
similarity of TS: when we use it
  • Indexing problem
  • find all lakes whose water level fluctuations are
    similar to X
  • Subsequence Similarity problem
  • find other days on which stock X had movements similar to today's
  • Clustering problem
  • group regions that have similar sales patterns
  • Rule Discovery problem
  • find rules such as if stock X goes up and Y
    remains the same, then Z will go down soon

12
clustering algorithms: a quick look at three of them
  • the well-known k-means
  • choose k, the number of clusters to generate
  • initialize k centers of the clusters to be generated
  • keep re-estimating the k cluster centers
  • ... greedy
  • ... converges, but not (necessarily) to the global minimum
  • ... depends on the initialization in step 2
  • stops when nothing changes (in cluster membership); see the sketch after this list
  • hierarchical clustering
  • density-based clustering
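A minimal k-means sketch (my illustration, not the authors' code), following the steps above for an (n, d) array of points; note how the result depends on the random initialization in step 2:

```python
import numpy as np

def k_means(points, k, seed=0):
    """Minimal k-means: greedy, converges to a local (not necessarily global)
    minimum, and the result depends on the initialization in step 2."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # step 2
    labels = None
    while True:
        # assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, labels          # stop: no membership changes
        labels = new_labels
        for j in range(k):                  # re-estimate the k cluster centers
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
```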

13
hierarchical clustering step by step
  1. compute distances between all pairs of objects and put them into a distance matrix
  2. search the distance matrix for the two closest (i.e. most similar) objects (clusters in later iterations)
  3. join the two to get a cluster of at least two objects
  4. update the distance matrix (new clusters generated)
  5. repeat from step 2 until there is one cluster containing all objects (from step 1)
  • Q: is it bottom-up (agglomerative) or top-down? (a sketch follows this list)
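A naive agglomerative sketch of the five steps (my illustration; single linkage assumed for the cluster-to-cluster distance). Unlike k-means, it is deterministic: the same input always yields the same merge history:

```python
import numpy as np

def agglomerative(points):
    """Bottom-up clustering following the five steps above (single linkage)."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]       # step 1: one object each
    merges = []
    while len(clusters) > 1:
        best = None                                    # step 2: closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]             # step 3: join the two
        merges.append((clusters[a], clusters[b], d))
        clusters = [c for i, c in enumerate(clusters)  # step 4: update
                    if i not in (a, b)] + [merged]
    return merges                                      # step 5: down to 1 cluster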

14
hierarchical clustering illustration
  • TS being clustered hierarchically, starting with 10 sequences
  • sliding the cut-off line either way along the green line determines k (the number of clusters)
  • and thus determines the bottom-up or top-down direction
  • so we can convert hierarchical clustering into k-means clustering

15
hierarchical clustering summary
  • it produces the same results every time with a
    given set of data (unlike k-means clustering)
  • cons
  • splitting or merging irreversible in next
    iterations (i.e. no element redistribution among
    clusters)
  • poor scaling (quadratic in input size)
  • pros
  • no input parameters (like the number of clusters k)
  • simplicity
  • can be integrated with other clustering methods

16
density-based clustering (DBC)
  • based on density - local cluster criterion
  • recognizes clusters as dense regions
  • major features
  • discover clusters of arbitrary shape
  • handle noise
  • one scan
  • need density parameters as termination condition
  • sources and algorithms
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & Keim (KDD'98)
  • CLIQUE: Agrawal et al. (SIGMOD'98)

17
TS and its subsequences
  • formally, a TS can be expressed as an ordered set of m variables, or a point in m-dimensional space
  • $TS = (t_1, t_2, \ldots, t_m)$
  • this formality enables applying clustering to a
    set of TS sequences as if they were such points
  • $C_p$ denotes a subsequence of length w of a TS, where w < m
  • $C_p = (t_p, t_{p+1}, \ldots, t_{p+w-1}), \quad 1 \le p \le m - w + 1$
  • a technique of sliding window (of size w) is a
    useful concept here

18
subsequences via sliding window
  • sliding window extracts all subsequences Cp
    described earlier from a given TS
  • a matrix S of all such subsequences can be built
    by moving the sliding window across a given TS
  • and placing subsequence Cp in the p-th row of S, whose size is (m − w + 1) × w

(figure: far left, the first eight subsequences Cp, each of length 16; middle, C67 of the same length)
19
sliding window and its matrix
  • denoting all possible subsequences Cp (here with w = 10)
  • $C_1 = (t_1, t_2, \ldots, t_{10})$
  • $C_2 = (t_2, t_3, \ldots, t_{11})$
  • $\ldots$
  • $C_{m-w+1} = (t_{m-9}, t_{m-8}, \ldots, t_m)$
  • and their corresponding matrix S (a sketch of building S follows)
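Building the matrix S is a one-liner in Python; a small sketch (my illustration):

```python
import numpy as np

def sliding_window_matrix(ts, w):
    """The (m - w + 1) x w matrix S whose p-th row is subsequence C_p."""
    ts = np.asarray(ts, dtype=float)
    return np.stack([ts[p:p + w] for p in range(len(ts) - w + 1)])

S = sliding_window_matrix(np.arange(10), w=4)
print(S.shape)   # (7, 4): m - w + 1 = 7 rows, each of length w = 4
```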

20
meaninglessness of STS clustering
  • to demonstrate meaninglessness of STS clustering
    two algorithms have been used
  • k-means
  • hierarchical clustering
  • important remark
  • to minimize any methodological bias, whole clustering (in addition to STS sliding-window clustering) has been performed to provide control results for comparison

21
variability of k-means one data set
  • let A and B denote the sets of cluster centers derived from two different runs of the k-means algorithm over the same data set (different results are expected)
  • cluster_distance(A, B) defines the distance between the two sets of cluster centers A and B
  • remark: the definition enforces pairing closest pairs from A and B (a hedged sketch follows this list)
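The slide does not spell the formula out, so the sketch below is a hypothetical reading of the "closest pairs" remark: each center in A is paired with its nearest center in B and those distances are summed:

```python
import numpy as np

def cluster_distance(A, B):
    """Hypothetical sketch: pair each center in A with its closest center
    in B and sum those distances (the slide's 'closest pairs' remark)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return sum(min(float(np.linalg.norm(a - b)) for b in B) for a in A)
```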

22
variability of k-means two data sets
  • applying this approach to two different data sets
  • experiment: performing 3 random restarts of k-means (applying the sliding window) on a stock market dataset
  • set X = the 3 resulting sets of cluster centers
  • similarly, 3 random runs of k-means on a random-walk dataset
  • set Y = the resulting sets of cluster centers

23
more definitions
  • denote the average cluster_distance between each set of cluster centers in X and each other set of cluster centers in X (as was done for one data set) by
  • within_set_X_distance
  • denote the average cluster_distance between each set of cluster centers in X and each set of cluster centers in Y by
  • between_set_X_and_Y_distance

24
a brief analysis of cluster meaningfulness(X, Y), i.e. the ratio within_set_X_distance / between_set_X_and_Y_distance
  • the numerator (within_set_X_distance) measures the clustering algorithm's sensitivity to initial conditions (seeds)
  • briefly, it approaches zero when repeated runs give the same results
  • on the other hand, there is no reason to expect similar clustering results for two different (and unrelated) data sets
  • briefly, the denominator (between_set_X_and_Y_distance) should be (relatively) large
  • overall tendency
  • cluster meaningfulness(X, Y) → 0 if clustering is meaningful and X, Y differ (a sketch follows this list)
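Putting the definitions together, a hedged sketch (my illustration, reusing the closest-pair cluster_distance guess above); X and Y are lists of cluster-center sets, one set per k-means run:

```python
import numpy as np
from itertools import combinations

def cluster_distance(A, B):
    # closest-pair guess, as sketched for the earlier slide
    return sum(min(float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
                   for b in B) for a in A)

def within_set_distance(X):
    """Average cluster_distance over all pairs of center sets from X."""
    return np.mean([cluster_distance(a, b) for a, b in combinations(X, 2)])

def between_set_distance(X, Y):
    """Average cluster_distance from every center set in X to every one in Y."""
    return np.mean([cluster_distance(a, b) for a in X for b in Y])

def cluster_meaningfulness(X, Y):
    # tends to 0 when runs on X agree with each other but differ from runs on Y
    return within_set_distance(X) / between_set_distance(X, Y)
```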

25
experiment STS vs whole clustering
  • to obtain a control set of results (for comparison)
  • the same experiment has been repeated with k-means
  • for the same data
  • using the whole clustering method (i.e. randomly extracted subsequences)
  • the entire process has been repeated 100 times for every combination of parameters k and w
  • k = 3, 5, 7, 11
  • w = 8, 16, 32
  • results: first surprise!

(figure: comparison of whole (yellow) vs. STS clustering; Z-axis: meaningfulness value)
26
the same experiment with hierarchical clustering
  • having shown the meaninglessness of k-means clustering of STS,
  • the experiment has been repeated using hierarchical clustering
  • new challenge: defining the distance between two clusters
  • linkage methods, applicable for bottom-up clustering

clustering of objects can be based on different linkage methods: Single Linkage, the minimum distance between the clusters (nearest-neighbour rule); Complete Linkage, the maximum distance between them (furthest-neighbour rule); Average Linkage, the average distance between all pairs of objects (one member of each pair from a different cluster)
(figure: cluster meaningfulness comparison, whole vs. STS clustering using the hierarchical approach; data used: S&P 500; again, no significant difference!)
27
why it is really surprising: dissimilarity of data sets
  • the two TS below are very dissimilar
  • nevertheless, the experimental results obtained for the buoy sensor and ocean TS (using k-means) continue to show the meaninglessness of STS clustering

28
preliminary conclusions
  • the authors reported similar results
  • using other clustering algorithms, e.g. EM, SOMs (self-organizing feature maps)
  • applied to more than 40 data sets
  • using Euclidean, other Lp-norm, Mahalanobis and time-warping distances
  • and normalization techniques
  • and for all of those combinations they observed
  • whole clustering of TS is usually meaningful
  • sliding-window clustering of STS never is

29
looking for explanation
  • another comparison of both methods
  • using cylinder, bell and funnel data sets
  • 30 instances generated for each pattern (90
    total)
  • k-means applied (k = 3)
  • all (three) clusters have been recognized
  • close resemblance found

30
more results, more surprises
  • the 90 generated TS have been concatenated into one long TS
  • sliding window w = 128, k-means with k = 3 (as expected)
  • the graph above illustrates the obtained result, i.e. the cluster centers found by subsequence clustering (using the sliding window described above)
  • a big surprise: the lines are sinusoids, with no resemblance to any patterns in the data sets, unlike the case of whole clustering
  • summarizing: regardless of the clustering algorithm, the number of clusters, or the datasets used, if w << m, then STS clustering yields sinusoids

31
summarizing once again
  • the authors conclude
  • obtained approximate sinusoids with STS
    clustering regardless of the clustering
    algorithm, the number of clusters, or the dataset
    used
  • if sinusoids appear as cluster centers for every dataset, then clearly it will be impossible to distinguish one dataset's clusters from another's
  • this is all the more true as the joint phase of the sinusoids is arbitrary and does not depend on any input-related parameters
  • recall that independence of such parameters was defined as meaninglessness

32
another concept: the Hidden Constraint
  • let's agree with the following theorem
  • for any TS dataset,
  • if the TS is clustered using sliding windows with w << m,
  • then the mean of all the data (i.e. the case k = 1)
  • will be approximately constant
  • (I'm not sure why they use the term "vector" here; a small numerical demo follows the figure note below)

(figure: visual proof of the theorem, w = 32, k = 1; left: 10 dissimilar datasets (space shuttle, flutter, speech, power data, Koski ECG, earthquake, chaotic, cylinder, random walk, balloon); right: the resulting cluster centers, no rescaling applied)
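A small numerical demo of the theorem (my illustration, on a synthetic random walk): the k = 1 "cluster center", i.e. the mean of all sliding windows, comes out nearly flat:

```python
import numpy as np

rng = np.random.default_rng(0)
ts = np.cumsum(rng.standard_normal(10_000))   # random-walk TS, m = 10000
w = 32                                        # w << m

S = np.stack([ts[p:p + w] for p in range(len(ts) - w + 1)])
center = S.mean(axis=0)                       # the k = 1 "cluster center"

# variation across the w positions of the center is tiny relative to the TS
print(center.std() / ts.std())
```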
33
(more) intuitive proof of the theorem
  • consider a time series TS and a single datapoint ti, where w ≤ i ≤ m − w + 1
  • as the sliding window passes by, ti appears exactly once in every possible position within the window
  • ti's contribution to the overall shape is therefore the same everywhere and must be a horizontal line
  • the average of many horizontal lines is just
    another horizontal line

34
trivial match: the main idea
  • consider a TS subsequence Cp that is a member of a cluster
  • when searching for similar subsequences, where can one expect to find them?
  • in the closest proximity! thus
  • ..., Cp-2, Cp-1, Cp+1, Cp+2, ...

35
trivial match definition
  • trivial match: C and M are subsequences beginning at p and q, respectively, and R is a distance
  • M is a trivial match to C of order R
  • if either p = q
  • or there does not exist a subsequence M' beginning at q'
  • such that D(C, M') > R,
  • and
  • either q < q' < p
  • or p < q' < q
  • (a sketch follows the figure note below)

(figure: C, M and M', illustrating the cases p = q and p < q' < q)
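The definition translates directly into a predicate; a sketch (my illustration, with Euclidean distance assumed for D):

```python
import numpy as np

def D(a, b):
    """Distance between two subsequences (Euclidean assumed for the sketch)."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def is_trivial_match(ts, p, q, w, R):
    """M (window at q) is a trivial match to C (window at p) of order R if
    p == q, or if every window starting strictly between p and q stays
    within distance R of C (C never 'escapes' on the way to M)."""
    C = ts[p:p + w]
    if p == q:
        return True
    lo, hi = min(p, q), max(p, q)
    return all(D(C, ts[qp:qp + w]) <= R for qp in range(lo + 1, hi))
```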
36
trivial match observation
  • smooth, slowly changing subsequences tend to have
    many trivial matches
  • rapidly changing subsequences (i.e. their
    features) tend to have very few trivial matches
  • a smooth pattern is surrounded by many trivial matches, making it sort of compelling as a cluster center
  • a highly featured, noisy pattern has few trivial matches and is often ignored as a cluster-center candidate

(figure illustrating the observation: (A) a TS with a cluster of 3 square waves, w = 64; (B) the number of trivial matches)
37
tentative conclusions
  • smooth patterns are surrounded by many trivial matches, i.e. subsequences with D(C, M) < R
  • they are therefore extremely promising cluster centers for clustering algorithms
  • in the 1920s, Evgeny Slutsky demonstrated that any noisy time series will converge to a sine wave after repeated applications of moving-window smoothing (a small demo follows this list)
  • STS clustering, though not exactly such a smoothing, is closely related
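A quick demo of the Slutsky effect (my illustration): repeatedly smoothing pure noise with a moving window leaves only smooth, near-sinusoidal low-frequency components:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(512)              # pure noise, no structure at all
kernel = np.ones(8) / 8.0                 # moving-average window of width 8

for _ in range(50):                       # repeated applications of smoothing
    x = np.convolve(x, kernel, mode="same")

# x is now dominated by a few low-frequency, near-sinusoidal components;
# inspect e.g. np.abs(np.fft.rfft(x)) to see the sharp low-frequency peak
```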

38
sine qua non for STS clustering
  • the weighted mean of the k patterns must sum to a horizontal (constant) line
  • each of the k patterns must have an approximately equal number of trivial matches
  • the chances of both conditions being met are essentially zero

39
a tentative solution
  • a method proposed only as an existence proof, i.e. to show that such an algorithm exists at all (conceptually)
  • presented below is motif-based clustering
  • definition of K-motifs
  • given a TS, a subsequence length n, and a distance range R
  • the most significant motif in TS, called the 1-Motif, is the subsequence C1 with the highest count of non-trivial matches
  • each subsequent K-motif in TS is the subsequence CK which differs from C1 in that, additionally, D(CK, Ci) > 2R for all 1 ≤ i < K (a sketch of 1-Motif discovery follows the figure note below)

(figure: the motif (red) occurs 4 times; winding(4) dataset used)
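A naive sketch of 1-Motif discovery (my illustration; the exclusion of trivial matches is deliberately simplified to "collapse runs of overlapping hits"):

```python
import numpy as np

def count_nontrivial_matches(S, i, R, w):
    """Count matches of row i within R, collapsing runs of overlapping
    (trivial) neighbours to one representative; simplified on purpose."""
    d = np.linalg.norm(S - S[i], axis=1)
    hits = np.flatnonzero(d <= R)          # includes i itself (d == 0)
    kept, last = 0, -w
    for h in hits:
        if h - last >= w:                  # at least w apart: not overlapping
            kept += 1
            last = h
    return kept - 1                        # do not count the subsequence itself

def one_motif(ts, w, R):
    """Start index of the 1-Motif: the subsequence with the most matches."""
    S = np.stack([ts[p:p + w] for p in range(len(ts) - w + 1)])
    counts = [count_nontrivial_matches(S, i, R, w) for i in range(len(S))]
    return int(np.argmax(counts))
```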
40
motif vs. cluster
  • when mining motifs, we must specify an additional
    parameter R
  • assuming the distance R is defined as Euclidean,
    motifs always define circular regions in space,
    whereas clusters may have arbitrary shapes
  • motifs generally define a small subset of the
    data, and not the entire dataset
  • the definition of motifs explicitly eliminates
    trivial matches

41
algorithm for motif-based clustering
  1. decide on a value for k
  2. discover the K-motifs in the data, for K = k × c (c is some constant, about 2 to 30)
  3. run k-means, partitional or hierarchical clustering, or any other clustering algorithm on the subsequences covered by the K motifs (an end-to-end sketch follows this list)
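An end-to-end sketch of the three steps (my illustration: greedy motif seeding, trivial matches handled only through the 2R separation, and scikit-learn's KMeans standing in for step 3):

```python
import numpy as np
from sklearn.cluster import KMeans  # any clustering algorithm works for step 3

def motif_based_clustering(ts, k, w, R, c=10):
    """Cluster only the subsequences covered by the K = k*c motifs."""
    S = np.stack([ts[p:p + w] for p in range(len(ts) - w + 1)])
    K = k * c                                         # step 2: number of motifs
    counts = [(np.linalg.norm(S - S[i], axis=1) <= R).sum()
              for i in range(len(S))]
    seeds, covered = [], np.zeros(len(S), dtype=bool)
    for i in np.argsort(counts)[::-1]:                # greedy, by match count
        if all(np.linalg.norm(S[i] - S[j]) > 2 * R for j in seeds):
            seeds.append(i)
            covered |= np.linalg.norm(S - S[i], axis=1) <= R
            if len(seeds) == K:
                break
    # step 3: cluster the covered subsequences only, not the whole stream
    return KMeans(n_clusters=k, n_init=10).fit(S[covered])
```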

42
experimental results
  • the experiment searching for cluster centers for the cylinder-bell-funnel trio has been repeated
  • the results obtained are good, i.e. they resemble the original patterns (see right) of the three TS data sets (as do the results obtained using the whole clustering approach)

43
side remark: another point of view
  • by Anne Denton
  • needless to say, her Ph.D. thesis was entitled "Fast kernel-density-based classification and clustering using P-trees", a good motivation to defend the meaningfulness of STS clustering
  • experimental setup
  • data sets halved before clustering
  • comparing the cluster centers derived from both halves, using the meaningfulness measure (within/between) and a similar cluster-distance measure
  • claim: such a test is stricter than those reported so far (based on separate runs of k-means on the same data)
  • conclusion: kernel-density-based clustering shows meaningful results for subsequence clustering

44
references
  • Keogh, Lin, Truppel: "Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research"
  • Han, Kamber: "Data Mining: Concepts and Techniques"
  • Lin, Keogh, Lonardi, Chiu: "A Symbolic Representation of Time Series..."
  • Denton: "Density-based Clustering of Time Series Subsequences"
  • Sprott: "Chaos and Time-Series Analysis"
  • the references within the above, and many pertinent web pages
  • THANK YOU!