Transcript and Presenter's Notes

Title: ON STATISTICAL MODEL OF CLUSTER STABILITY


1
ON STATISTICAL MODEL OF CLUSTER STABILITY
  • Z. Volkovich
  • Software Engineering Department, ORT Braude
    College of Engineering, Karmiel 21982, Israel
    Department of Mathematics and Statistics, the
    University of Maryland, Baltimore County, USA
  • Z. Barzily
  • Software Engineering Department, ORT Braude
    College of Engineering, Karmiel 21982, Israel

2
Concept
  • We propose a general statistical approach for
    the study of the cluster validation problem.
  • Our concept suggests that the partitions obtained by rerunning a clustering
    algorithm can be considered as realizations of a random variable, such that
    the most stable random variable infers the true number of clusters.
  • We propose to measure stability by means of probability distances between
    the observed realizations.
  • Our method suggests the application of
    probability metrics between random variables.

3
Motivating works
  • T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based
    validation of clustering solutions. Neural Computation, 16(6):1299-1323,
    2004.
  • V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to
    cluster validation. In COMPSTAT, 2002.

4
Clustering
Goal partition a set S ? Rd by means of a
clustering algorithm CL such that CL(x) CL(y)
if x and y are similar CL(x) ? CL(y) if x
and y are dissimilar
5
Clustering (cont)
CL(x) = CL(y)
CL(x) ≠ CL(y)
An important question: how many clusters are there?
6
Example: a three-cluster set partitioned into 2 and 4 clusters
7
Implication
  • It is observed that when the number of clusters is not chosen correctly,
    non-consistent clusters can be formed. Such a cluster is a union of several
    heterogeneous parts having different distributions.

8
Concept
DATA
SAMPLE: S
CLUSTERING MACHINE: CL
INITIAL PARTITION
CLUSTERED SAMPLE: CL(S)
9
Concept (cont. 1)
S1 → CL(S1)

These clustered samples can be considered as estimates of the sought true
partition. Accordingly, our concept views these partitions as instances of a
random variable, such that the steadiest random variable is associated with
the true number of clusters.
DATA
10
Concept (cont. 2)
  • Outliers in the samples and limitations of clustering algorithms
    significantly add to the noise level and increase the model's volatility.
    Thus, the clustering procedure has to be iterated many times to obtain
    meaningful results.
  • The stability of random variables is measured
    via probability distances. According to our
    principle, it is natural to assume that the true
    number of clusters corresponds to the distance
    distribution having the minimal inner variation.

11
Some probability metrics
  • Probability metrics are introduced on a space X of real-valued random
    variables defined on the same probability space.
  • A functional dis: X × X → R is called a probability metric if it satisfies
    the following properties:
  • Identity
  • Symmetry
  • Triangle inequality
  • The last two properties are not always required. If a probability metric
    identifies the distribution, i.e., dis(X, Y) = 0 is equivalent to the
    equality of the distributions of X and Y, then the metric is called simple;
    otherwise the metric is called compound (a compact restatement follows
    below).
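In LaTeX notation, the properties and the simple/compound distinction can be
summarized as follows (a restatement added for clarity; the symbols dis, X, Y,
Z are generic):

  \[
  \begin{aligned}
  &\text{Identity (simple metric):} && \mathrm{dis}(X,Y)=0 \iff F_X = F_Y,\\
  &\text{Identity (compound metric):} && \mathrm{dis}(X,Y)=0 \iff \Pr(X=Y)=1,\\
  &\text{Symmetry:} && \mathrm{dis}(X,Y)=\mathrm{dis}(Y,X),\\
  &\text{Triangle inequality:} && \mathrm{dis}(X,Z)\le \mathrm{dis}(X,Y)+\mathrm{dis}(Y,Z).
  \end{aligned}
  \]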

12
Ky Fan metrics
  • K1(X, Y) = inf{ε > 0 : P(|X − Y| > ε) ≤ ε}

Birnbaum-Orlicz metrics
13
Examples
  • Lp-metric
  • Ky Fan metrics
  • Hellinger distance
  • Relative entropy (or Kullback-Leibler divergence)
  • Kolmogorov (or Uniform) metric
  • Levy metric
  • Prokhorov metric
  • Total variation distance
  • Wasserstein (or Kantorovich) metric
  • Chi-2 distance
  • And so on.

14
Concentration measure index
  • The Lp and the Ky Fan metrics are compound; the other metrics listed above
    are simple. A difference between the simple and compound metrics is that,
    unlike the compound ones, simple metrics equal zero for two independent
    identically distributed random variables. Moreover, if dis is a compound
    metric, dis(X′, X″) = 0, and X′, X″ are independent realizations of the
    same random variable X, then X is a constant almost surely.
  • A compound distance can therefore be used as a measure of uncertainty. In
    particular, dis(X′, X″) = d(X) is called the concentration measure index
    derived from the compound distance dis. The stability of a random variable
    can be estimated by the average value of this index (a small numerical
    sketch follows below).
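A minimal numerical sketch of this index (our illustration; the distribution,
the L1 metric dis(x, y) = |x − y|, and all names are assumptions chosen for
the example):

  import numpy as np

  def concentration_index(draw, dis, n_pairs=5000, seed=0):
      """Estimate d(X) = average of dis(X', X'') over independent realizations."""
      rng = np.random.default_rng(seed)
      values = [dis(draw(rng), draw(rng)) for _ in range(n_pairs)]
      return float(np.mean(values))

  # Example: X ~ N(0, 1) with the compound L1 metric dis(x, y) = |x - y|.
  draw = lambda rng: rng.normal(0.0, 1.0)
  dis = lambda x, y: abs(x - y)
  print(concentration_index(draw, dis))  # about 2/sqrt(pi) ~ 1.13

A more concentrated (less volatile) random variable yields a smaller value of
this index.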

15
Simple and Compound Metrics
Simple metrics → Geometrical Algorithms
Compound metrics → Membership Algorithms
16
Geometrical Algorithm
  • Several algorithms can be offered here. The first type considers
    Geometrical Stability. The basic idea is that if one properly clusters two
    independent samples then, under the assumption of a consistent clustering
    algorithm, the clustered samples can be classified as samples drawn from
    the same population.

17
Example: Cluster stability via samples
Cluster 2
Cluster 1
The samples reveal the same clusters. The partition is stable.
18
General algorithm. Given a probability metric dis(·, ·):
  • For each tested number of clusters k, execute steps 2 to 6 as follows
  • Draw pairs of disjoint samples (S_t, S_t′), t = 1, 2, ...
  • Cluster the united sample S = CL(S_t ∪ S_t′) and separate S into the two
    clustered samples (S_t, S_t′)
  • Calculate the distance values d_t = dis(S_t, S_t′)
  • Average each consecutive T distances
  • Normalize the set of averages and obtain the set DIS_k
  • The number of clusters k which yields the most concentrated normalized
    distribution DIS_k is chosen as the estimate of the true number of clusters
    (a sketch of this procedure follows below).
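A minimal Python sketch of this geometrical procedure (our illustration, not
the authors' implementation): it assumes scikit-learn's KMeans as the
clustering machine CL, the alpha = 1 case of the N-distance (introduced on a
later slide) applied per cluster and then averaged as dis, normalization by
the maximum, and the lower quartile as the concentration measure; all function
and parameter names are our own.

  import numpy as np
  from scipy.spatial.distance import cdist
  from sklearn.cluster import KMeans

  def energy_distance(A, B):
      """Two-sample energy distance (the alpha = 1 case of the N-distance)."""
      return 2 * cdist(A, B).mean() - cdist(A, A).mean() - cdist(B, B).mean()

  def stability_score(data, k, n_pairs=40, sample_size=70, T=4, seed=0):
      """Steps 2-6 of the general algorithm for one candidate number of clusters k."""
      rng = np.random.default_rng(seed)
      dists = []
      for _ in range(n_pairs):
          idx = rng.choice(len(data), size=2 * sample_size, replace=False)
          S1, S2 = data[idx[:sample_size]], data[idx[sample_size:]]
          labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.vstack([S1, S2]))
          l1, l2 = labels[:sample_size], labels[sample_size:]
          # distance between the two clustered samples: per-cluster distance, averaged
          per_cluster = [energy_distance(S1[l1 == j], S2[l2 == j])
                         for j in range(k) if (l1 == j).any() and (l2 == j).any()]
          if per_cluster:
              dists.append(float(np.mean(per_cluster)))
      # average each consecutive block of T distances, then normalize the averages
      means = np.array(dists[:len(dists) // T * T]).reshape(-1, T).mean(axis=1)
      DIS_k = means / means.max()                 # normalization choice is an assumption
      return float(np.percentile(DIS_k, 25))      # lower quartile as concentration measure

Evaluating stability_score(data, k) for k = 2, ..., k_max and choosing the k
with the smallest (most concentrated near zero) score implements the selection
rule of the last step.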

19
How is the concentration measured?
  • The distances are inherently positive, and we
    are interested in distributions concentrated near
    zero. Thus, we can use as concentration measures
    the sample mean, the sample standard deviation
    and the lower quartile.

20
Simple distances
  • Many distances applicable in two-sample tests are simple metrics. For
    instance, the well-known Kolmogorov-Smirnov test is based on the maximal
    distance between the empirical distribution functions in the
    one-dimensional case. In the multivariate case, the following tests can be
    mentioned in this context:
  • the Friedman-Rafsky test, 1979 (minimal spanning tree based)
  • the Nearest Neighbors test of Henze 1988.
  • the Energy test, Zech and Aslan 2005.
  • the N-distances test of Klebanov 2005.
  • Such metrics can be used in the Geometrical
    Algorithm.

21
Simple distances (cont)
  • Recall that the classical two-sample problem is intended for testing the
    hypothesis
  • H0: F(x) = G(x)
  • against the general alternative
  • H1: F(x) ≠ G(x),
  • when the distributions F and G are unknown.
  • Here we consider applications of the N-distances test of Klebanov and the
    Friedman-Rafsky test.

22
Klebanovs N-distances
  • N-distances are defined as follows
  • where X1, X1 and Y1, Y1 are independent
  • realizations of X and Y respectively. N is a
  • negative definite kernel. We use
  • N(x,y)x-ya , 0lta?2.
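An empirical estimate of this distance for two samples (a sketch; the
V-statistic form below includes the zero diagonal terms, and all names are
ours):

  import numpy as np
  from scipy.spatial.distance import cdist

  def n_distance(X, Y, alpha=1.0):
      """Empirical N-distance between samples X, Y with kernel N(x, y) = ||x - y||**alpha."""
      dxy = cdist(X, Y) ** alpha   # estimates E N(X1, Y1)
      dxx = cdist(X, X) ** alpha   # estimates E N(X1, X1')
      dyy = cdist(Y, Y) ** alpha   # estimates E N(Y1, Y1')
      return 2.0 * dxy.mean() - dxx.mean() - dyy.mean()

  # Identically distributed samples give values near zero; shifted samples do not.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 2))
  print(n_distance(X, rng.normal(size=(200, 2))))           # close to 0
  print(n_distance(X, rng.normal(loc=3.0, size=(200, 2))))  # clearly positive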

23
Graphical illustration
  • Let us consider two samples S1 and S2 partitioned
    into clusters

S1
S2
24
Graphical illustration (cont. 1). Distances between points belonging to
different samples
Cluster C1
Cluster C2
25
Graphical illustration (cont. 2)
Cluster C2
Cluster C1
26
Remark
  • Note that the metric is actually the average distance between the samples
    in the clusters. If this value is close to zero, then it can be concluded
    that the partition is stable. Namely, the elements of the samples are
    closely located inside the clusters and can be deemed a consistent set
    within each of the clusters.

27
Distances calculations
28
Histograms of the distance values
Two cases are presented: the case when the quantity T of the averaged distance
values is large and the case when T is small, respectively. In the second
case, measuring the concentration via the lower quartile appears to be more
appropriate.
29
Example: The Iris Flower Dataset
  • The kernel N(x, y) = ||x − y||
  • Sample size: 70
  • Number of samples: 2080
  • Number of averaged samples: 80

30
Graph of the normalized mean value
31
Graph of the normalized quartile value
32
Euclidean Minimal Spanning Tree
  • The Euclidean minimum spanning tree (EMST) is a minimum spanning tree of a
    set S of points in a Euclidean space R^d, where the weight of the edge
    between each pair of points is the distance between those two points. In
    simpler terms, an EMST connects a set of dots using lines such that the
    total length of all the lines is minimized and any dot can be reached from
    any other by following the lines (see Wikipedia).

33
Euclidean minimal spanning tree (cont.)
  • A tree on S is a graph which has no loops and whose vertices are elements
    of S.
  • If all distances between the items of S are distinct, then the set S is
    called nice and it has a unique EMST.
  • An EMST can be built in O(n²) time (including distance calculations) using
    Prim's, Kruskal's, Borůvka's or Dijkstra's algorithms (a construction
    sketch follows below).
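In practice the EMST can be obtained from the full distance matrix with an
off-the-shelf sparse-graph routine (a sketch using SciPy rather than one of
the classical algorithms named above; function names are ours):

  import numpy as np
  from scipy.spatial.distance import pdist, squareform
  from scipy.sparse.csgraph import minimum_spanning_tree

  def emst_edges(points):
      """Return the EMST of `points` as a list of (i, j, length) edges."""
      dist = squareform(pdist(points))            # full Euclidean distance matrix
      mst = minimum_spanning_tree(dist).tocoo()   # sparse matrix of the n - 1 tree edges
      return list(zip(mst.row, mst.col, mst.data))

  # Example: EMST of 60 random points in the plane, as on the next slide.
  rng = np.random.default_rng(0)
  edges = emst_edges(rng.random((60, 2)))
  print(len(edges))  # 59 edges for 60 points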

34
An EMST of 60 random points
35
How can an EMST be used in the cluster validation
problem?
We draw two samples S1 and S2 and determine
S S1?S2- pooled sample S CL(S)-clustered
pooled sample We expect that, in the case of a
stable clustering, the two samples items are
closely located inside the clusters. This fact
is characterized by the number of edges
connecting points from different samples. These
edges are marked by red in the diagrams in the
examples.
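Counting the red edges, i.e. the EMST edges whose endpoints come from
different samples, can be done directly on the pooled sample (a sketch under
the same SciPy assumption as above; names are ours):

  import numpy as np
  from scipy.spatial.distance import pdist, squareform
  from scipy.sparse.csgraph import minimum_spanning_tree

  def cross_sample_edge_count(points, sample_labels):
      """Number of EMST edges joining points from different samples."""
      mst = minimum_spanning_tree(squareform(pdist(points))).tocoo()
      sample_labels = np.asarray(sample_labels)
      return int(np.sum(sample_labels[mst.row] != sample_labels[mst.col]))

  # Pooled sample: S1 labeled 0, S2 labeled 1.
  rng = np.random.default_rng(1)
  S1, S2 = rng.normal(size=(50, 2)), rng.normal(size=(50, 2))
  S = np.vstack([S1, S2])
  labels = np.r_[np.zeros(50, int), np.ones(50, int)]
  print(cross_sample_edge_count(S, labels))  # large when the two samples intermix

This count is the Friedman-Rafsky statistic discussed on the following slides.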
36
Graphical illustration. Stable clustering
37
Graphical illustration. Non-stable clustering
38
The two-sample MST-test
  • The two-sample test tests the hypothesis
  • H0: F(x) = G(x)
  • against the general alternative
  • H1: F(x) ≠ G(x),
  • when the distributions F and G are unknown.
  • The number of edges connecting points from different samples has been
    considered in Friedman-Rafsky's MST test.
  • Particularly, let X = {x_i, i = 1, ..., n} and Y = {y_j, j = 1, ..., m} be
    two samples of independent random elements, distributed according to F and
    G, respectively.

39
The two-sample MST-test (cont. 1)
  • Suppose that the set Z = X ∪ Y is nice.
  • Friedman and Rafsky's test statistic R_mn can be defined as the number of
    edges of the EMST of Z which connect a point of X to a point of Y.
  • Friedman and Rafsky actually introduced the statistic 1 + R_mn, which
    expresses the number of disjoint sub-trees resulting from removing all
    edges uniting vertices of different samples.

40
The two-sample MST-test (cont. 2)
  • Henze and Penrose (1999) considered the asymptotic behavior of R_mn.
    Suppose that m → ∞ and n → ∞ such that m/(m+n) → p ∈ (0, 1). Introducing
    q = 1 − p and r = 2pq, they obtained convergence of the suitably normalized
    statistic R_mn to N(0, σ²), where the convergence is in distribution and
    N(0, σ²) denotes the normal distribution with zero expectation and variance
  • σ² = r (r + C_d(1 − 2r))
  • for some constant C_d depending only on the dimension d of the space.

41
The two-sample MST-test (cont. 3)
  • This result coincides with our intuition since, if two sets are close, then
    there are many edges which unite points from different samples. In the
    spirit of the Central Limit Theorem, this quantity is expected to be
    asymptotically normally distributed.

42
Theorem's application
  • To use this theorem, for each possible number of clusters k = 2, ..., k*,
    we draw many pairs of disjoint samples S1 and S2 having the same,
    sufficiently big, size n and calculate S = S1 ∪ S2, St = CL(S).
  • We consider sub-samples
  • S1,j = S1 ∩ πj(St),
  • S2,j = S2 ∩ πj(St),
  • where πj(St), j = 1, ..., k is the j-th cluster in the partition of St
    obtained by means of CL.

43
Theorem's application (cont. 1)
  • For each j = 1, ..., k we compute a value of the two-sample MST-test
    statistic Rnn(S1,j, S2,j).
  • In the case where all clusters are homogeneous sets, this statistic is
    normally distributed. We collect many such values and look at the distance
    between their standardized distribution and the standard normal
    distribution.
  • The number of clusters k which yields the minimal distance is chosen as
    the estimate of the true number of clusters.

44
Example: Calculation of Rnn(S1, S2)
45
Distances from normality
  • We can consider the following distances from normality:
  • The Kolmogorov-Smirnov test statistic
  • Friedman's Pursuit Index
  • Entropy-based Pursuit Indexes
  • The BCM function
  • Generally, any simple metric can be used here.

46
The Kolmogorov-Smirnov Distance
  • The Kolmogorov-Smirnov distance is based on the
    empirical distribution function.
  • Given N ordered data items R1 R2 R3 ...
    RN, a function FN(i) is defined as the fraction
    of items less or equal to Ri.
  • The distance to the standard normal
    distribution is given by
  • Dmax(FN(i) G(i), G(i)- FN(i)),
  • where G(i) is the value of standard normal
  • cumulative distribution function evaluated at
    the
  • point i.
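A direct computation (a sketch; it implements the standard two-sided
Kolmogorov-Smirnov statistic for a standardized sample, and scipy.stats.kstest
is shown only as a cross-check):

  import numpy as np
  from scipy.stats import norm, kstest

  def ks_distance_to_normal(values):
      """Kolmogorov-Smirnov distance between the standardized sample and N(0, 1)."""
      z = np.sort((values - np.mean(values)) / np.std(values))
      n = len(z)
      Fn = np.arange(1, n + 1) / n      # empirical CDF immediately after each point
      G = norm.cdf(z)                   # standard normal CDF at each ordered point
      return float(np.max(np.maximum(Fn - G, G - (Fn - 1.0 / n))))

  rng = np.random.default_rng(0)
  sample = rng.normal(size=500)
  print(ks_distance_to_normal(sample))
  print(kstest((sample - sample.mean()) / sample.std(), 'norm').statistic)  # same value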

47
The Kolmogorov-Smirnov Distance
48
Example: synthetic data
Mixture of three Gaussian clusters
49
Example: synthetic data (cont. 1)
50
Another application
  • Another approach, by Jain, Xu, Ho and Xiao and by Smith and Jain, proposes
    to carry out uniformity testing for cluster detection.
  • The main idea here is to locate an inconsistent edge whose length is
    significantly larger than the average length of nearby edges. The
    distribution of these lengths must be normal under the null hypothesis
    assumption.
  • This method is unable to assess the true number of clusters in the data.

51
(No Transcript)
52
Membership Stability Algorithm
  • Such an algorithm uses a simulation of random variables, obtained by
    repeated clusterings of the same sample.
  • The obtained results are compared by means of a compound (non-simple)
    metric, which is a function of variables defined on the set
    Ck = {1, ..., k}, where k is the number of clusters considered.

53
Lp-metric
  • The Lp metric is the best-known example of a compound metric. For every
    p ≥ 1 the Lp metric is defined by
  • Lp(X, Y) = (E|X − Y|^p)^(1/p).
  • In particular, the family includes (as the case p = 0) the indicator
    metric L0(X, Y) = E[1{X ≠ Y}] = P(X ≠ Y).
  • The last metric has, de facto, been applied in the following works; an
    empirical sketch follows below:
  • T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based
    validation of clustering solutions. Neural Computation, 16(6):1299-1323,
    2004.
  • V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to
    cluster validation. In COMPSTAT, 2002.
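An empirical version of this indicator metric for two labelings of the same
sample (a sketch; it assumes the label-correspondence problem of the following
slides has already been resolved, and all names are ours):

  import numpy as np

  def misclassification_distance(labels_a, labels_b):
      """Fraction of points on which two clusterings of the same sample disagree."""
      labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
      return float(np.mean(labels_a != labels_b))

  # Two labelings of the same six points, already brought to a common labeling.
  print(misclassification_distance([0, 0, 1, 1, 2, 2],
                                   [0, 0, 1, 2, 2, 2]))  # -> 1/6 ~ 0.167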

54
A family of clustering algorithms
In contrast with the geometrical algorithm, we
use here a family of clustering algorithms. It
can be the same algorithm starting from
different initial points.
55
Clusters correspondence problem
Let us consider clustering the same set twice.
Cluster 1
S
Cluster 2
The same cluster can be labeled differently
Cluster 2
Cluster 1
56
Correspondence between labelings α and β obtained for a sample S
  • This task is solved by finding the permutation ψ of the set Ck which
    achieves the smallest misclassification between α and the permuted β,
    i.e.,
  • ψ* = argmin_ψ P(α ≠ ψ(β)).
  • The computational complexity of solving this problem by the well-known
    Hungarian method is O(k³). This technique has been used in Lange T. et al.,
    2004, and is also known in the clusters-combining area (see, for example,
    Topchy A. et al., 2003). A matching sketch follows below.
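A sketch of this matching step (our illustration; SciPy's
linear_sum_assignment, a Hungarian-type solver, is used in place of a
hand-written Hungarian algorithm, and all names are ours):

  import numpy as np
  from scipy.optimize import linear_sum_assignment

  def align_labels(alpha, beta, k):
      """Relabel beta by the permutation that minimizes disagreement with alpha."""
      alpha, beta = np.asarray(alpha), np.asarray(beta)
      # agreement[i, j] = number of points labeled i by alpha and j by beta
      agreement = np.array([[np.sum((alpha == i) & (beta == j)) for j in range(k)]
                            for i in range(k)])
      row, col = linear_sum_assignment(-agreement)   # maximize total agreement
      mapping = dict(zip(col, row))                  # beta label -> alpha label
      return np.array([mapping[b] for b in beta])

  alpha = np.array([0, 0, 1, 1, 2, 2])
  beta  = np.array([2, 2, 0, 0, 1, 1])   # the same partition with permuted labels
  print(align_labels(alpha, beta, k=3))  # -> [0 0 1 1 2 2]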