Cluster Validation

Slides: 24
Provided by: djiang3
Learn more at: https://cse.buffalo.edu
Transcript and Presenter's Notes

Title: Cluster Validation


1
Cluster Validation
  • Cluster validation
  • Assess the quality and reliability of clustering
    results.
  • Why validation?
  • To avoid finding clusters formed by chance
  • To compare clustering algorithms
  • To choose clustering parameters
  • e.g., the number of clusters in the K-means
    algorithm

2
Clusters found in Random Data
(Figure: random points.)
3
Aspects of Cluster Validation
  • Comparing the clustering results to ground truth
    (externally known results).
  • External Index
  • Evaluating the quality of clusters without
    reference to external information.
  • Use only the data
  • Internal Index
  • Determining the reliability of clusters.
  • With what confidence can we say that the clusters
    are not formed by chance?
  • Statistical framework

4
Comparing to Ground Truth
  • Notation
  • $N$: number of objects in the data set
  • $P = \{P_1, \dots, P_m\}$: the set of ground truth clusters
  • $C = \{C_1, \dots, C_n\}$: the set of clusters reported by a
    clustering algorithm.
  • The incidence matrix
  • $N \times N$ (both rows and columns correspond to
    objects).
  • $P_{ij} = 1$ if $O_i$ and $O_j$ belong to the same ground
    truth cluster in $P$; $P_{ij} = 0$ otherwise.
  • $C_{ij} = 1$ if $O_i$ and $O_j$ belong to the same cluster
    in $C$; $C_{ij} = 0$ otherwise.

5
External Index
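  • A pair of data objects $(O_i, O_j)$ falls into one of
    the following categories
  • SS: $C_{ij} = 1$ and $P_{ij} = 1$ (agree)
  • DD: $C_{ij} = 0$ and $P_{ij} = 0$ (agree)
  • SD: $C_{ij} = 1$ and $P_{ij} = 0$ (disagree)
  • DS: $C_{ij} = 0$ and $P_{ij} = 1$ (disagree)
  • Rand index: $\mathrm{Rand} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|}$
  • may be dominated by DD
  • Jaccard coefficient: $\mathrm{Jaccard} = \frac{|SS|}{|SS| + |SD| + |DS|}$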
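
The pair counting behind both external indices can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual definitions (Rand = agreeing pairs over all pairs, Jaccard = SS over all pairs except DD). The function names and the toy label vectors are made up for the example and are not part of the original slides.

```python
from itertools import combinations

def pair_counts(truth, pred):
    """Count object pairs in each category: returns (SS, DD, SD, DS)."""
    ss = dd = sd = ds = 0
    for i, j in combinations(range(len(truth)), 2):
        same_c = pred[i] == pred[j]    # C_ij = 1
        same_p = truth[i] == truth[j]  # P_ij = 1
        if same_c and same_p:
            ss += 1
        elif not same_c and not same_p:
            dd += 1
        elif same_c:
            sd += 1
        else:
            ds += 1
    return ss, dd, sd, ds

def rand_index(truth, pred):
    ss, dd, sd, ds = pair_counts(truth, pred)
    return (ss + dd) / (ss + dd + sd + ds)

def jaccard_coefficient(truth, pred):
    # Undefined (division by zero) if every pair falls into DD.
    ss, dd, sd, ds = pair_counts(truth, pred)
    return ss / (ss + sd + ds)

# Hypothetical toy labels: ground truth vs. a clustering result.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2]
print(pair_counts(truth, pred))                    # (2, 8, 1, 4)
print(f"{rand_index(truth, pred):.4f}")            # 0.6667
print(f"{jaccard_coefficient(truth, pred):.4f}")   # 0.2857
```

In this toy case DD accounts for 8 of the 15 pairs, which is why the Rand index looks much better than the Jaccard coefficient.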

6
Internal Index
  • Ground truth may be unavailable
  • Use only the data to measure cluster quality
  • Measure the homogeneity and separation of
    clusters.
  • SSE: sum of squared errors.
  • Calculate the correlation between clustering
    results and distance matrix.

7
Sum of Squared Error
  • Homogeneity is measured by the within-cluster sum
    of squares: $WSS = \sum_i \sum_{x \in C_i} \lVert x - m_i \rVert^2$
  • Exactly the objective function of K-means.
  • Separation is measured by the between-cluster sum
    of squares: $BSS = \sum_i |C_i| \, \lVert m - m_i \rVert^2$
  • where $|C_i|$ is the size of cluster $i$, $m_i$ is its centroid, and
    $m$ is the centroid of the whole data set.
  • BSS + WSS = constant
  • A larger number of clusters tends to result in
    a smaller WSS.
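
A small worked example may help make the BSS + WSS identity concrete. The sketch below assumes squared Euclidean distance and uses four hypothetical one-dimensional points; the helper names are illustrative only.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wss_bss(points, labels):
    """Return (WSS, BSS) for a labelled data set."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    m = centroid(points)                      # centroid of the whole data set
    wss = bss = 0.0
    for members in clusters.values():
        m_i = centroid(members)               # centroid of cluster i
        wss += sum(sq_dist(x, m_i) for x in members)
        bss += len(members) * sq_dist(m, m_i)
    return wss, bss

# Hypothetical 1-D data: overall centroid is 3.
points = [(1.0,), (2.0,), (4.0,), (5.0,)]
print(wss_bss(points, [0, 0, 0, 0]))  # K=1: (10.0, 0.0)
print(wss_bss(points, [0, 0, 1, 1]))  # K=2: (1.0, 9.0)
```

In both cases WSS + BSS equals 10, the total sum of squares around the overall centroid; only the split between the two terms changes with the clustering.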

8
Sum of Squared Error
(Figure panels: K = 1, K = 2, K = 4)
9
Sum of Squared Error
  • Can also be used to estimate the number of
    clusters.
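
One common way to use SSE for this purpose is to plot it against K and look for the point where the curve stops dropping sharply. The following is a rough sketch of that idea, not the procedure from the slides: it assumes a plain Lloyd's-algorithm K-means with random restarts, and the synthetic three-blob data, seed, and function names are all invented for the example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(points, k, rng, n_init=40, n_iter=100):
    """Lowest SSE found over n_init random restarts of Lloyd's algorithm."""
    best = float("inf")
    for _ in range(n_init):
        centers = rng.sample(points, k)  # k distinct data points as seeds
        for _ in range(n_iter):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
                groups[nearest].append(p)
            # Recompute centroids; an empty cluster keeps its old center.
            new_centers = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
                for c, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best = min(best, sse)
    return best

# Hypothetical data: three well-separated Gaussian blobs of 30 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

sse_by_k = {k: kmeans_sse(points, k, rng) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(f"K={k}: SSE={sse:.2f}")
```

On data like this the SSE should fall steeply up to K = 3 and flatten afterwards, which suggests three clusters. Real data rarely gives such a clean bend.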

10
Internal Measures: SSE
  • SSE curve for a more complicated data set

SSE of clusters found using K-means
11
Correlation with Distance Matrix
  • Distance Matrix
  • $D_{ij}$ is the distance between objects $O_i$ and $O_j$.
  • Incidence Matrix
  • $C_{ij} = 1$ if $O_i$ and $O_j$ belong to the same cluster,
    $C_{ij} = 0$ otherwise
  • Compute the correlation between the two matrices
  • Both matrices are symmetric, so only n(n-1)/2 entries
    need to be calculated.
  • A correlation of large magnitude indicates good clustering;
    with distances, it is strongly negative.

12
Correlation with Distance Matrix
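  • Given the distance matrix $D = \{d_{11}, d_{12}, \dots, d_{nn}\}$ and
    the incidence matrix $C = \{c_{11}, c_{12}, \dots, c_{nn}\}$,

  • the correlation $r$ between $D$ and $C$ is given by
    $r = \frac{\sum_{i,j} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j} (d_{ij} - \bar{d})^2 \, \sum_{i,j} (c_{ij} - \bar{c})^2}}$,
    where $\bar{d}$ and $\bar{c}$ are the means of the entries of $D$ and $C$.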
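
As a rough illustration of this measure, the sketch below computes the Pearson correlation between pairwise Euclidean distances and incidence values over the n(n-1)/2 distinct pairs. It assumes Python 3.8+ for `math.dist`; the function name and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt

def incidence_distance_correlation(points, labels):
    """Pearson correlation between pairwise distances and incidence values.

    Only the n(n-1)/2 pairs with i < j are used, since both matrices
    are symmetric. Undefined if all objects share one cluster label.
    """
    d_vals, c_vals = [], []
    for i, j in combinations(range(len(points)), 2):
        d_vals.append(dist(points[i], points[j]))
        c_vals.append(1.0 if labels[i] == labels[j] else 0.0)
    d_mean = sum(d_vals) / len(d_vals)
    c_mean = sum(c_vals) / len(c_vals)
    cov = sum((d - d_mean) * (c - c_mean) for d, c in zip(d_vals, c_vals))
    var_d = sum((d - d_mean) ** 2 for d in d_vals)
    var_c = sum((c - c_mean) ** 2 for c in c_vals)
    return cov / sqrt(var_d * var_c)

# Hypothetical data: two tight groups far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
r_good = incidence_distance_correlation(points, [0, 0, 0, 1, 1, 1])
r_bad = incidence_distance_correlation(points, [0, 1, 0, 1, 0, 1])
print(f"matching labels: {r_good:.3f}, scrambled labels: {r_bad:.3f}")
```

Because the matrix holds distances rather than similarities, the labelling that matches the two groups yields a correlation close to -1, while the scrambled labelling stays near zero.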

13
Measuring Cluster Validity Via Correlation
  • Correlation of incidence and proximity matrices
    for the K-means clusterings of the following two
    data sets.

Corr = -0.9235
Corr = -0.5810
14
Clusters found in Random Data
(Figure: random points.)
15
Using Similarity Matrix for Cluster Validation
  • Order the similarity matrix with respect to
    cluster labels and inspect visually.
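
The reordering step can be sketched as follows. The slides do not say how similarity is derived, so this example assumes a min-max rescaling of Euclidean distance, s = 1 - (d - d_min) / (d_max - d_min); the function name, the toy points, and the text "heat map" are illustrative only.

```python
from math import dist

def ordered_similarity(points, labels):
    """Return (order, S): S is the similarity matrix with rows and
    columns sorted by cluster label."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    d = [[dist(points[i], points[j]) for j in order] for i in order]
    d_max = max(max(row) for row in d)
    d_min = min(min(row) for row in d)      # 0, from the diagonal
    span = (d_max - d_min) or 1.0           # guard against identical points
    sim = [[1 - (v - d_min) / span for v in row] for row in d]
    return order, sim

# Hypothetical data: two groups whose members are interleaved in the input.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels = [0, 1, 0, 1, 0, 1]
order, sim = ordered_similarity(points, labels)

# Crude text rendering: darker characters mean higher similarity.
shades = " .:*#"
for row in sim:
    print("".join(shades[min(int(v * len(shades)), len(shades) - 1)] for v in row))
```

With well-separated clusters the sorted matrix shows dark square blocks along the diagonal. With random data the blocks are far less distinct.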

16
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

K-means
17
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

Complete Link
18
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

DBSCAN
19
Reliability of Clusters
  • Need a framework to interpret any measure.
  • For example, if our measure of evaluation has
    the value 10, is that good, fair, or poor?
  • Statistics provide a framework for cluster
    validity
  • The more atypical a clustering result is, the
    more likely it represents valid structure in the
    data.

20
Statistical Framework for SSE
  • Example
  • Compare an SSE of 0.005 for three clusters against the
    SSE of three clusters found in random data
  • SSE histogram of 500 sets of random data points
    of size 100, distributed over the range 0.2 to 0.8
    for x and y values

SSE = 0.005
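
The comparison described on this slide amounts to a Monte Carlo test, which might be sketched as below. This is an illustration under several assumptions rather than a reproduction of the slide's experiment: K-means is a simple Lloyd's algorithm with random restarts, the "observed" data are synthetic blobs, only 100 random data sets are drawn (the slide uses 500) to keep the run short, and the SSE scale will not necessarily match the values shown in the slide.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(points, k, rng, n_init, n_iter=100):
    """Lowest SSE found over n_init random restarts of Lloyd's algorithm."""
    best = float("inf")
    for _ in range(n_init):
        centers = rng.sample(points, k)
        for _ in range(n_iter):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
                groups[nearest].append(p)
            # Recompute centroids; an empty cluster keeps its old center.
            new_centers = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
                for c, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        best = min(best, sum(min(sq_dist(p, c) for c in centers) for p in points))
    return best

def random_points(n, rng, lo=0.2, hi=0.8):
    """n points drawn uniformly from [lo, hi] x [lo, hi]."""
    return [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(n)]

rng = random.Random(1)

# Hypothetical "observed" data: 100 points in three tight blobs.
blob_centers = [(0.3, 0.3), (0.7, 0.3), (0.5, 0.7)]
observed = [(rng.gauss(cx, 0.02), rng.gauss(cy, 0.02))
            for (cx, cy), size in zip(blob_centers, (34, 33, 33))
            for _ in range(size)]
observed_sse = kmeans_sse(observed, 3, rng, n_init=40)

# Null distribution: SSE of three clusters found in structureless data.
n_trials = 100
null_sses = [kmeans_sse(random_points(100, rng), 3, rng, n_init=3)
             for _ in range(n_trials)]

# Empirical p-value: share of random data sets with SSE at least as low.
p_value = sum(s <= observed_sse for s in null_sses) / n_trials
print(f"observed SSE={observed_sse:.4f}, "
      f"null range=[{min(null_sses):.4f}, {max(null_sses):.4f}], p={p_value}")
```

An empirical p-value of 0 here only means "below the resolution of 1/n_trials"; more random data sets are needed to bound it more tightly.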
21
Statistical Framework for Correlation
  • Correlation of incidence and distance matrices
    for the K-means clusterings of the following two
    data sets.

Correlation histogram of random data
Corr = -0.5810
Corr = -0.9235
22
Hyper-geometric Distribution
  • Suppose M of the N genes in the data set are
    associated with term T. If n genes are drawn at
    random from the data set, what is the probability
    that m of the selected n genes are associated
    with T?
    $P(X = m) = \binom{M}{m} \binom{N - M}{n - m} \Big/ \binom{N}{n}$

23
P-Value
  • Based on the hypergeometric distribution, the
    probability of having fewer than m genes associated
    with T among the n selected genes can be calculated by
    summing the probabilities of a random list of n genes
    having 0, 1, ..., m - 1 genes associated with T. The p-value
    of over-representation is one minus that sum:
    $p = 1 - \sum_{i=0}^{m-1} \binom{M}{i} \binom{N - M}{n - i} \Big/ \binom{N}{n}$
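
For small inputs the hypergeometric probability and the p-value can be computed exactly with integer binomial coefficients. The sketch below assumes Python 3.8+ for `math.comb` and takes the over-representation p-value to be the upper tail P(X >= m), the usual convention. The function names and the example numbers (N = 20, M = 5, n = 4, m = 2) are hypothetical.

```python
from math import comb

def hypergeom_pmf(m, N, M, n):
    """P(exactly m of the n drawn genes are associated with T),
    given that M of the N genes in the data set are associated with T."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)

def overrepresentation_p_value(m, N, M, n):
    """P(at least m associated genes among the n drawn).

    Equal to 1 - sum_{i=0}^{m-1} pmf(i); summing the upper tail
    directly avoids subtracting two nearly equal numbers.
    """
    return sum(hypergeom_pmf(i, N, M, n) for i in range(m, min(M, n) + 1))

# Hypothetical example: 20 genes, 5 annotated with T, 4 drawn, 2 annotated.
N, M, n, m = 20, 5, 4, 2
print(f"P(X = {m})  = {hypergeom_pmf(m, N, M, n):.4f}")               # 0.2167
print(f"P(X >= {m}) = {overrepresentation_p_value(m, N, M, n):.4f}")  # 0.2487
```

A small p-value indicates that seeing m or more genes associated with T in the selected set would be unlikely under random selection.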