Cluster Validation PowerPoint PPT Presentation

Title: Cluster Validation

1
Cluster Validation

Cluster validation
Assess the quality and reliability of clustering
results.
Why validation?
To avoid finding clusters formed by chance
To compare clustering algorithms
To choose clustering parameters
e.g., the number of clusters in the K-means
algorithm

2
Clusters found in Random Data
Random Points
3
Aspects of Cluster Validation

Comparing the clustering results to ground truth
(externally known results).
External Index
Evaluating the quality of clusters without
reference to external information.
Use only the data
Internal Index
Determining the reliability of clusters.
At what confidence level can we say the clusters
are not formed by chance?
Statistical framework

4
Comparing to Ground Truth

Notation
N: number of objects in the data set
P = {P1, ..., Pm}: the set of ground truth clusters
C = {C1, ..., Cn}: the set of clusters reported by a
clustering algorithm.
The incidence matrix
N × N (both rows and columns correspond to
objects).
Pij = 1 if Oi and Oj belong to the same ground
truth cluster in P; Pij = 0 otherwise.
Cij = 1 if Oi and Oj belong to the same cluster
in C; Cij = 0 otherwise.
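
As a minimal sketch of the incidence matrix described above, the following builds the N × N 0/1 matrix from a list of cluster labels. The function name `incidence_matrix` and the example labels are hypothetical, chosen only for illustration.

```python
def incidence_matrix(labels):
    """Return the N x N matrix whose (i, j) entry is 1 when objects i and j
    carry the same cluster label, and 0 otherwise."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical ground-truth labels for five objects.
ground_truth = ["a", "a", "b", "b", "b"]
P = incidence_matrix(ground_truth)
for row in P:
    print(row)
```

The same function applied to the labels reported by a clustering algorithm gives the matrix C.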

5
External Index

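A pair of data objects (Oi, Oj) falls into one of
the following categories
SS: Cij = 1 and Pij = 1 (agree)
DD: Cij = 0 and Pij = 0 (agree)
SD: Cij = 1 and Pij = 0 (disagree)
DS: Cij = 0 and Pij = 1 (disagree)
Rand index
\[ \mathrm{Rand} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|} \]
may be dominated by DD
Jaccard Coefficient
\[ \mathrm{Jaccard} = \frac{|SS|}{|SS| + |SD| + |DS|} \]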
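
The pair categories above can be counted directly from two label lists. The sketch below assumes the standard definitions, Rand = (SS + DD) / (all pairs) and Jaccard = SS / (SS + SD + DS); the function names and the example labels are illustrative only.

```python
from itertools import combinations


def pair_counts(truth, found):
    """Count object pairs in the SS, DD, SD and DS categories."""
    counts = {"SS": 0, "DD": 0, "SD": 0, "DS": 0}
    for i, j in combinations(range(len(truth)), 2):
        same_c = found[i] == found[j]  # Cij = 1
        same_p = truth[i] == truth[j]  # Pij = 1
        if same_c and same_p:
            counts["SS"] += 1
        elif not same_c and not same_p:
            counts["DD"] += 1
        elif same_c:
            counts["SD"] += 1
        else:
            counts["DS"] += 1
    return counts


def rand_index(counts):
    """Fraction of pairs on which the two partitions agree."""
    return (counts["SS"] + counts["DD"]) / sum(counts.values())


def jaccard_coefficient(counts):
    """Like the Rand index, but ignoring the DD pairs."""
    denom = counts["SS"] + counts["SD"] + counts["DS"]
    return counts["SS"] / denom if denom else 0.0


# Hypothetical labels for six objects.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 2, 2]
counts = pair_counts(truth, found)
print(counts)
print(rand_index(counts), jaccard_coefficient(counts))
```

In this example 8 of the 15 pairs are DD, so the Rand index (10/15) is much higher than the Jaccard coefficient (2/7), which illustrates how DD can dominate.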

6
Internal Index

Ground truth may be unavailable
Use only the data to measure cluster quality
Measure the homogeneity and separation of
clusters.
SSE Sum of squared errors.
Calculate the correlation between clustering
results and distance matrix.

7
Sum of Squared Error

Homogeneity is measured by the within cluster sum
of squares
\[ WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2 \]
where m_i is the centroid of cluster i.
Exactly the objective function of K-means.
Separation is measured by the between cluster sum
of squares
\[ BSS = \sum_{i} |C_i| \, (m - m_i)^2 \]
Where |C_i| is the size of cluster i,
m is the centroid of the whole data set.
BSS + WSS = constant
A larger number of clusters tends to result in
smaller WSS.
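
A small numeric check of the statement that BSS + WSS is constant, written as a sketch with the usual definitions (WSS sums squared distances to each cluster centroid; BSS weights each centroid's squared distance to the overall centroid by cluster size). The four one-dimensional points are made up for illustration.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wss(clusters):
    """Within-cluster sum of squares."""
    return sum(sq_dist(p, centroid(pts)) for pts in clusters for p in pts)


def bss(clusters):
    """Between-cluster sum of squares."""
    m = centroid([p for pts in clusters for p in pts])
    return sum(len(pts) * sq_dist(centroid(pts), m) for pts in clusters)


# Illustrative data: four points on a line.
one_cluster = [[(1.0,), (2.0,), (4.0,), (5.0,)]]
two_clusters = [[(1.0,), (2.0,)], [(4.0,), (5.0,)]]

for clusters in (one_cluster, two_clusters):
    print(len(clusters), wss(clusters), bss(clusters),
          wss(clusters) + bss(clusters))
```

With one cluster WSS is 10 and BSS is 0; with two clusters WSS drops to 1 and BSS rises to 9, so the total stays at 10.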

8
Sum of Squared Error
[Figure panels: K = 1, K = 2, K = 4]
9
Sum of Squared Error

Can also be used to estimate the number of
clusters.

10
Internal Measures SSE

SSE curve for a more complicated data set

SSE of clusters found using K-means
11
Correlation with Distance Matrix

Distance Matrix
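Dij is the distance between objects Oi and Oj.
Incidence Matrix
Cij = 1 if Oi and Oj belong to the same cluster,
Cij = 0 otherwise
Compute the correlation between the two matrices
Both matrices are symmetric, so only n(n-1)/2
entries need to be calculated.
A strong correlation indicates good clustering.
With distances the correlation is negative, since
objects in the same cluster (Cij = 1) are close
together.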

12
Correlation with Distance Matrix

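Given Distance Matrix D = {d11, d12, ..., dnn} and
Incidence Matrix C = {c11, c12, ..., cnn},
the correlation r between D and C is given by
\[ r = \frac{\sum_{i<j} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2} \, \sqrt{\sum_{i<j} (c_{ij} - \bar{c})^2}} \]
where \bar{d} and \bar{c} are the means of d_ij and c_ij over the same pairs i < j.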
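
The correlation r between D and C can be sketched as an ordinary Pearson correlation over the n(n-1)/2 distinct object pairs. This assumes Euclidean distance; the points and labels are invented for illustration.

```python
from itertools import combinations
from math import dist, sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def incidence_distance_correlation(points, labels):
    """Correlate pairwise distances with same-cluster indicators,
    using each unordered pair (i < j) once."""
    pairs = list(combinations(range(len(points)), 2))
    d = [dist(points[i], points[j]) for i, j in pairs]
    c = [1 if labels[i] == labels[j] else 0 for i, j in pairs]
    return pearson(d, c)


# Two well-separated groups of three points (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible structure
bad_labels = [0, 1, 0, 1, 0, 1]    # ignores it

r_good = incidence_distance_correlation(points, good_labels)
r_bad = incidence_distance_correlation(points, bad_labels)
print(r_good, r_bad)
```

Because small distances go with Cij = 1, the labelling that matches the structure gives a correlation close to -1, while the arbitrary labelling gives a value near 0.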

13
Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices
for the K-means clusterings of the following two
data sets.

Corr = -0.9235
Corr = -0.5810
14
Clusters found in Random Data
Random Points
15
Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to
cluster labels and inspect visually.
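
A rough sketch of the reordering step: sort the objects by cluster label, then compute a similarity matrix in that order so that good clusters show up as bright diagonal blocks. The conversion from distance to similarity, s = 1 - d / d_max, is an assumption made here for illustration, as are the function name and the data.

```python
from math import dist


def ordered_similarity_matrix(points, labels):
    """Return (order, sim), where order lists object indices sorted by
    cluster label and sim is the similarity matrix in that order."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    d = [[dist(points[i], points[j]) for j in order] for i in order]
    d_max = max(max(row) for row in d) or 1.0  # avoid dividing by zero
    sim = [[1 - v / d_max for v in row] for row in d]  # assumed mapping
    return order, sim


# Illustrative data: two groups whose members are interleaved in the input.
points = [(0, 0), (5, 5), (0, 1), (5, 6)]
labels = [0, 1, 0, 1]

order, sim = ordered_similarity_matrix(points, labels)
print(order)
for row in sim:
    print([round(v, 2) for v in row])
```

After reordering, the high similarities sit in two 2 × 2 blocks on the diagonal. Passing the matrix to a heat-map plotting routine would give the visual inspection the slide refers to.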

16
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

K-means
17
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

Complete Link
18
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

DBSCAN
19
Reliability of Clusters

Need a framework to interpret any measure.
For example, if our measure of evaluation has
the value 10, is that good, fair, or poor?
Statistics provides a framework for cluster
validity
The more atypical a clustering result is, the
more likely it represents valid structure in the
data.

20
Statistical Framework for SSE

Example
Compare SSE of 0.005 against three clusters in
random data
SSE histogram of 500 sets of random data points
of size 100 distributed over the range 0.2 to 0.8
for x and y values

SSE = 0.005
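
The comparison described above can be sketched as a small Monte Carlo experiment: cluster many random data sets, collect their SSE values, and ask how often random data does as well as the observed value. The K-means below is a bare-bones Lloyd-style loop written for this sketch, not the implementation behind the slide. The absolute SSE scale depends on how the data are scaled or normalized, so the printed range need not match the histogram.

```python
import random
from math import dist


def kmeans_sse(points, k, rng, iters=10):
    """Run a basic K-means once; return the sum of squared distances
    from each point to its nearest final center."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old center if a cluster is empty.
        centers = [
            tuple(sum(v) / len(pts) for v in zip(*pts)) if pts else centers[c]
            for c, pts in enumerate(clusters)
        ]
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)


def random_sse_distribution(n_sets, n_points, k, seed=0):
    """SSE values for K-means on uniform random points in [0.2, 0.8]^2."""
    rng = random.Random(seed)
    sses = []
    for _ in range(n_sets):
        pts = [(rng.uniform(0.2, 0.8), rng.uniform(0.2, 0.8))
               for _ in range(n_points)]
        sses.append(kmeans_sse(pts, k, rng))
    return sses


def empirical_p_value(observed, null_values):
    """Fraction of null values at least as small as the observed one."""
    return sum(v <= observed for v in null_values) / len(null_values)


null_sses = random_sse_distribution(n_sets=500, n_points=100, k=3)
print(min(null_sses), max(null_sses))
print(empirical_p_value(0.005, null_sses))
```

An empirical p-value near zero says that random data almost never produces an SSE this small, which supports treating the observed clusters as real structure.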
21
Statistical Framework for Correlation

Correlation of incidence and distance matrices
for the K-means clusterings of the following two
data sets.

Correlation histogram of random data
Corr = -0.5810
Corr = -0.9235
22
Hyper-geometric Distribution

Given that M of the N genes in the data set are
associated with term T, if we randomly draw n
genes from the data set, what is the probability
that m of the n selected genes are associated
with T?
\[ P(X = m) = \frac{\binom{M}{m} \binom{N - M}{n - m}}{\binom{N}{n}} \]

23
P-Value

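Based on the hyper-geometric distribution, the
probability of having fewer than m genes
associated with T among the n selected genes can
be calculated by summing the probabilities of a
random list of n genes having 0, 1, ..., m-1 genes
associated with T. So the p-value of
over-representation, the probability of m or more
associated genes, is as follows
\[ p = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i} \binom{N - M}{n - i}}{\binom{N}{n}} \]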
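
A minimal sketch of this calculation using only the standard library. It takes the p-value of over-representation to be the probability of observing m or more associated genes among the n drawn, which is the usual convention and is an assumption about the definition intended here. The function names and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_pmf(m, N, M, n):
    """P(exactly m of n genes drawn without replacement are associated
    with T), when M of the N genes in the data set are associated with T."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)


def over_representation_p_value(m, N, M, n):
    """P(m or more of the n drawn genes are associated with T)."""
    return 1.0 - sum(hypergeom_pmf(i, N, M, n) for i in range(m))


# Hypothetical numbers: 20 genes in total, 5 associated with T,
# a cluster of 4 genes containing 2 associated genes.
N, M, n, m = 20, 5, 4, 2
print(hypergeom_pmf(m, N, M, n))
print(over_representation_p_value(m, N, M, n))
```

A small p-value means that a randomly drawn gene list of the same size would rarely contain as many genes associated with T, so the term is over-represented in the cluster.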
