1
Cluster Analysis
2
Midterm: Monday, Oct 29, 4PM
  • Lecture notes from Sept 5, 2007 until Oct 15,
    2007, plus the chapters from the textbook and the
    papers discussed in class (detailed list below)
  • Specific Readings
  • Textbook
  • Chapter 1
  • Chapter 2: 2.1-2.4
  • Chapter 3: 3.1-3.4
  • Chapter 4: 4.1.1-4.1.2, 4.2.1
  • Chapter 5
  • Chapter 6: 6.1-6.5, 6.9.1, 6.12, 6.13, 6.14
  • Chapter 7: 7.1-7.4
  • Papers
  • Apriori paper: R. Agrawal, R. Srikant. Fast
    Algorithms for Mining Association Rules. VLDB
    1994
  • MaxMiner paper: R. J. Bayardo Jr. Efficiently
    Mining Long Patterns from Databases. SIGMOD 1998
  • SLIQ paper: M. Mehta, R. Agrawal, J. Rissanen.
    SLIQ: A Fast Scalable Classifier for Data Mining.
    EDBT 1996

3
Cluster Analysis
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods

4
Clustering High-Dimensional Data
  • Clustering high-dimensional data
  • Many applications: text documents, DNA
    microarray data
  • Major challenges
  • Many irrelevant dimensions may mask clusters
  • Distance measures become meaningless due to
    equi-distance
  • Clusters may exist only in some subspaces
  • Methods
  • Feature transformation: only effective if most
    dimensions are relevant
  • PCA and SVD are useful only when features are
    highly correlated/redundant
  • Feature selection: wrapper or filter approaches
  • useful to find a subspace where the data have
    nice clusters
  • Subspace clustering: find clusters in all
    possible subspaces
  • CLIQUE, ProClus, and frequent pattern-based
    clustering

5
The Curse of Dimensionality (graphs adapted from
Parsons et al. KDD Explorations 2004)
  • Data in only one dimension is relatively packed
  • Adding a dimension stretches the points across
    that dimension, pushing them further apart
  • Adding more dimensions spreads the points even
    further apart: high-dimensional data is extremely
    sparse
  • Distance measures become meaningless due to
    equi-distance (see the sketch below)
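
A minimal Python sketch (ours, not from the slides) of this equi-distance
effect: with points drawn uniformly in [0,1]^d, the relative contrast between
a query's farthest and nearest neighbor shrinks as the dimensionality grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        points = rng.random((500, d))      # 500 uniform points in [0,1]^d
        query = rng.random(d)
        dists = np.linalg.norm(points - query, axis=1)
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"d={d:4d}  relative contrast = {contrast:.3f}")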

6
Why Subspace Clustering? (adapted from Parsons et
al. SIGKDD Explorations 2004)
  • Clusters may exist only in some subspaces
  • Subspace clustering: find clusters in some of the
    subspaces

7
CLIQUE (Clustering In QUEst)
  • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
  • Automatically identifies subspaces of a
    high-dimensional data space that allow better
    clustering than the original space
  • CLIQUE can be considered both density-based
    and grid-based
  • It partitions each dimension into the same number
    of equal-length intervals
  • It partitions an m-dimensional data space into
    non-overlapping rectangular units
  • A unit is dense if the fraction of total data
    points contained in the unit exceeds an input
    model parameter
  • A cluster is a maximal set of connected dense
    units within a subspace

8
CLIQUE: The Major Steps
  • Partition the data space and find the number of
    points that lie inside each cell of the
    partition
  • Identify the subspaces that contain clusters
    using the Apriori principle
  • Identify clusters
  • Determine dense units in all subspaces of
    interest
  • Determine connected dense units in all subspaces
    of interest
  • Generate minimal descriptions for the clusters
  • Determine maximal regions that cover a cluster of
    connected dense units for each cluster
  • Determine the minimal cover for each cluster
    (a sketch of the first two steps follows)
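
A simplified Python sketch of the first two steps; this is our own
illustration (the helper name and the parameters xi and tau are ours),
assuming the data are scaled to [0,1]. By the Apriori principle, a 2-D unit
can be dense only if both of its 1-D projections are dense.

    from itertools import combinations
    import numpy as np

    def dense_units(data, xi=10, tau=0.05):
        """Find dense 1-D units, then candidate dense 2-D units."""
        n, m = data.shape
        bins = np.clip((data * xi).astype(int), 0, xi - 1)
        # 1-D dense units: (dimension, interval) pairs holding > tau of the points
        dense1 = {(d, i) for d in range(m) for i in range(xi)
                  if np.mean(bins[:, d] == i) > tau}
        # join step: check only pairs whose 1-D projections are both dense
        dense2 = set()
        for (d1, i1), (d2, i2) in combinations(sorted(dense1), 2):
            if d1 != d2:
                inside = (bins[:, d1] == i1) & (bins[:, d2] == i2)
                if np.mean(inside) > tau:
                    dense2.add(((d1, i1), (d2, i2)))
        return dense1, dense2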

9
[Figure: grid over salary (10,000), 0-7, versus age, 20-60, illustrating
dense units and clusters found by CLIQUE]
10
Strengths and Weaknesses of CLIQUE
  • Strengths
  • automatically finds subspaces of the highest
    dimensionality such that high-density clusters
    exist in those subspaces
  • insensitive to the order of records in input and
    does not presume any canonical data distribution
  • scales linearly with the size of input and has
    good scalability as the number of dimensions in
    the data increases
  • Weakness
  • The accuracy of the clustering result may be
    degraded for the sake of the method's simplicity

11
Frequent Pattern-Based Approach
  • Clustering high-dimensional data (e.g.,
    clustering text documents, microarray data)
  • Projected subspace clustering: which dimensions
    should be projected on?
  • CLIQUE, ProClus
  • Feature extraction: costly and may not be
    effective
  • Using frequent patterns as features
  • Frequent patterns are inherent features
  • Mining frequent patterns may not be so expensive
  • Typical methods
  • Frequent-term-based document clustering
  • Clustering by pattern similarity in microarray
    data (p-Clustering)

12
Clustering by Pattern Similarity (p-Clustering)
  • Right: the raw microarray data shows 3 genes
    and their values in a multi-dimensional space
  • Difficult to find their patterns
  • Bottom: some subsets of dimensions form nice
    shift and scaling patterns

13
Why p-Clustering?
  • Microarray data analysis may need to
  • Cluster on thousands of dimensions (attributes)
  • Discover both shift and scaling patterns
  • Clustering with a Euclidean distance measure?
    It cannot find shift patterns
  • Clustering on the derived attributes
    A_ij = a_i - a_j? It introduces N(N-1) dimensions
  • Bi-cluster using the transformed mean-squared
    residue score of a submatrix (I, J):
    H(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J}
    (d_ij - d_iJ - d_Ij + d_IJ)²
    (computed in the sketch after this list)
  • where d_iJ is the mean of row i, d_Ij the mean of
    column j, and d_IJ the mean of the whole submatrix
  • A submatrix is a δ-cluster if H(I, J) ≤ δ for
    some δ > 0
  • Problems with bi-clusters
  • No downward closure property
  • Due to averaging, a submatrix may contain outliers
    but still stay within the δ-threshold
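
A minimal Python sketch of the residue score H(I, J) defined above (the
function and variable names are ours). A perfect shift pattern has residue 0,
which is why shifted rows form a bi-cluster.

    import numpy as np

    def msr(D):
        """Mean-squared residue H(I, J) of an |I| x |J| submatrix D."""
        row_mean = D.mean(axis=1, keepdims=True)   # d_iJ
        col_mean = D.mean(axis=0, keepdims=True)   # d_Ij
        all_mean = D.mean()                        # d_IJ
        residue = D - row_mean - col_mean + all_mean
        return np.mean(residue ** 2)

    D = np.array([[1.0, 4.0, 2.0],
                  [3.0, 6.0, 4.0]])   # row 2 = row 1 + 2 (a shift pattern)
    print(msr(D))                     # 0.0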

14
p-Clustering
  • Given objects x, y in O and features a, b in T,
    a pCluster is a 2-by-2 matrix, scored by
    pScore([[d_xa, d_xb], [d_ya, d_yb]]) =
    |(d_xa - d_xb) - (d_ya - d_yb)|
  • A pair (O, T) is a δ-pCluster if for any 2-by-2
    matrix X in (O, T), pScore(X) ≤ δ for some δ > 0
  • Properties of δ-pClusters
  • Downward closure
  • Clusters are more homogeneous than bi-clusters
    (thus the name pair-wise Cluster)
  • A pattern-growth algorithm has been developed for
    efficient mining
  • For scaling patterns, observe that taking the
    logarithm of the values turns scaling into
    shifting, leading back to the pScore form (see
    the sketch below)
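
A minimal Python sketch of pScore as defined above (our own code). The last
line shows the logarithm trick: after taking logs, a pure scaling pattern
scores 0 exactly like a shift pattern.

    import math

    def pscore(dxa, dxb, dya, dyb):
        """pScore of the 2x2 submatrix [[dxa, dxb], [dya, dyb]]."""
        return abs((dxa - dxb) - (dya - dyb))

    print(pscore(1.0, 4.0, 3.0, 6.0))   # 0.0: x and y shift by the same amount
    print(pscore(*map(math.log, (2.0, 8.0, 3.0, 12.0))))  # 0.0: both rows scale by 4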

15
Cluster Analysis
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods

16
Model-Based Clustering
  • Assume data are generated from K probability
    distributions
  • Typically Gaussian distributions: a soft or
    probabilistic version of K-means clustering
  • Need to find the distribution parameters
  • EM Algorithm

17
EM Algorithm
  • Initialize K cluster centers
  • Iterate between two steps
  • Expectation step: assign points to clusters
  • Maximization step: estimate model parameters
    (a minimal sketch follows)
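
A minimal EM sketch in Python for a 1-D mixture of two Gaussians (our own
illustration of the two steps; variances are fixed at 1 for brevity, so only
the means and mixing weights are estimated).

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

    mu = np.array([x.min(), x.max()])   # initialize K=2 cluster centers
    pi = np.array([0.5, 0.5])           # mixing weights
    for _ in range(50):
        # E-step: soft-assign points to clusters (posterior responsibilities)
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        pi = resp.mean(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    print(mu)   # approximately [0, 5]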

18
Cluster Analysis
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods
  • Cluster Validity

19
Cluster Validity
  • For supervised classification we have a variety
    of measures to evaluate how good our model is
  • Accuracy, precision, recall
  • For cluster analysis, the analogous question is:
    how do we evaluate the goodness of the resulting
    clusters?
  • But clusters are in the eye of the beholder!
  • Then why do we want to evaluate them?
  • To avoid finding patterns in noise
  • To compare clustering algorithms
  • To compare two sets of clusters
  • To compare two clusters

20
Clusters found in Random Data
[Figure: random points and the clusters that different algorithms find in
them]
21
Different Aspects of Cluster Validation
  1. Determining the clustering tendency of a set of
     data, i.e., distinguishing whether non-random
     structure actually exists in the data
  2. Comparing the results of a cluster analysis to
     externally known results, e.g., to externally
     given class labels
  3. Evaluating how well the results of a cluster
     analysis fit the data without reference to
     external information
     - Use only the data
  4. Comparing the results of two different sets of
     cluster analyses to determine which is better
  5. Determining the correct number of clusters
  • For 2, 3, and 4, we can further distinguish
    whether we want to evaluate the entire clustering
    or just individual clusters

22
Measures of Cluster Validity
  • Numerical measures that are applied to judge
    various aspects of cluster validity are
    classified into the following three types
  • External Index: used to measure the extent to
    which cluster labels match externally supplied
    class labels
  • Entropy
  • Internal Index: used to measure the goodness of
    a clustering structure without respect to
    external information
  • Sum of Squared Error (SSE)
  • Relative Index: used to compare two different
    clusterings or clusters
  • Often an external or internal index is used for
    this function, e.g., SSE or entropy
  • Sometimes these are referred to as criteria
    instead of indices
  • However, sometimes "criterion" is the general
    strategy and "index" is the numerical measure that
    implements the criterion

23
Measuring Cluster Validity via Correlation
  • Two matrices
  • Proximity matrix
  • Incidence matrix
  • One row and one column for each data point
  • An entry is 1 if the associated pair of points
    belongs to the same cluster
  • An entry is 0 if the associated pair of points
    belongs to different clusters
  • Compute the correlation between the two matrices
  • Since the matrices are symmetric, only the
    correlation between n(n-1)/2 entries needs to
    be calculated (see the sketch below)
  • High correlation indicates that points that
    belong to the same cluster are close to each
    other
  • Not a good measure for some density- or
    contiguity-based clusters
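
A minimal Python sketch of this measure (our own code, using SciPy's condensed
distance format, which holds exactly those n(n-1)/2 entries). With a
distance-based proximity matrix, a good clustering gives a strongly negative
correlation, as on the next slide.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def validity_correlation(X, labels):
        proximity = pdist(X)   # the n(n-1)/2 pairwise distances
        same = (labels[:, None] == labels[None, :]).astype(float)
        incidence = squareform(same, checks=False)   # matching condensed form
        return np.corrcoef(proximity, incidence)[0, 1]

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
    labels = np.repeat(np.array([0, 1]), 50)
    print(validity_correlation(X, labels))   # strongly negative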

24
Measuring Cluster Validity Via Correlation
  • Correlation of incidence and proximity matrices
    for the K-means clusterings of the following two
    data sets.

[Figures: Corr = -0.9235 and Corr = -0.5810 for the two data sets]
25
Using Similarity Matrix for Cluster Validation
  • Order the similarity matrix with respect to
    cluster labels and inspect visually (a plotting
    sketch follows)
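
A minimal Python sketch of this visual check (our own code): reorder the
points by cluster label, convert distances to similarities, and plot. Crisp
diagonal blocks suggest well-separated clusters.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import cdist

    def plot_ordered_similarity(X, labels):
        order = np.argsort(labels)        # group points by cluster label
        D = cdist(X[order], X[order])
        S = 1 - D / D.max()               # simple distance-to-similarity map
        plt.imshow(S, cmap="viridis")
        plt.title("Similarity matrix ordered by cluster label")
        plt.colorbar()
        plt.show()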

26
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

[Figure: similarity matrix for DBSCAN clusters of random data]
27
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

[Figure: similarity matrix for K-means clusters of random data]
28
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

[Figure: similarity matrix for complete-link clusters of random data]
29
Using Similarity Matrix for Cluster Validation
[Figure: similarity matrix for DBSCAN clusters]
30
Internal Measures: SSE
  • Clusters in more complicated figures aren't well
    separated
  • Internal index: used to measure the goodness of
    a clustering structure without respect to
    external information
  • SSE = Σ_i Σ_{x∈C_i} dist(x, m_i)², where m_i is
    the centroid of cluster C_i
  • SSE is good for comparing two clusterings or two
    clusters (average SSE)
  • Can also be used to estimate the number of
    clusters (see the sketch below)
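
A minimal Python sketch (ours) of estimating the number of clusters from SSE:
run K-means for several K and look for the "elbow" where SSE stops dropping
sharply. Here the data have 3 true clusters, so the elbow appears at K = 3.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.4, (100, 2)) for c in (0, 4, 8)])
    for k in range(1, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))   # inertia_ is the SSE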

31
Internal Measures: SSE
  • SSE curve for a more complicated data set

[Figure: SSE of clusters found using K-means]
32
Framework for Cluster Validity
  • Need a framework to interpret any measure
  • For example, if our measure of evaluation has the
    value 10, is that good, fair, or poor?
  • Statistics provides a framework for cluster
    validity
  • The more atypical a clustering result is, the
    more likely it represents valid structure in the
    data
  • Can compare the values of an index that result
    from random data or random clusterings to those
    of a clustering result
  • If the value of the index is unlikely under that
    random baseline, then the cluster results are
    valid
  • These approaches are more complicated and harder
    to understand
  • For comparing the results of two different sets
    of cluster analyses, a framework is less
    necessary
  • However, there is the question of whether the
    difference between two index values is
    significant

33
Internal Measures: Cohesion and Separation
  • Cluster cohesion: measures how closely related
    the objects in a cluster are
  • Example: SSE
  • Cluster separation: measures how distinct or
    well-separated a cluster is from other clusters
  • Example: squared error
  • Cohesion is measured by the within-cluster sum of
    squares: WSS = Σ_i Σ_{x∈C_i} (x - m_i)²
  • Separation is measured by the between-cluster sum
    of squares: BSS = Σ_i |C_i| (m - m_i)²
  • where |C_i| is the size of cluster i, m_i its
    centroid, and m the overall mean (see the sketch
    below)
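
A minimal Python sketch of WSS and BSS as defined above (the function name is
ours). Note that WSS + BSS equals the total sum of squares, so decreasing one
increases the other.

    import numpy as np

    def wss_bss(X, labels):
        m = X.mean(axis=0)                      # overall mean
        wss = bss = 0.0
        for c in np.unique(labels):
            cluster = X[labels == c]
            mi = cluster.mean(axis=0)           # cluster centroid
            wss += np.sum((cluster - mi) ** 2)  # within-cluster squared error
            bss += len(cluster) * np.sum((m - mi) ** 2)  # |Ci| * ||m - mi||^2
        return wss, bss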

34
Internal Measures Cohesion and Separation
  • A proximity-graph-based approach can also be used
    for cohesion and separation
  • Cluster cohesion is the sum of the weights of all
    links within a cluster
  • Cluster separation is the sum of the weights of
    links between nodes in the cluster and nodes
    outside the cluster

[Figure: proximity graph illustrating cohesion (edge weights within a
cluster) and separation (edge weights between clusters)]
35
Internal Measures: Silhouette Coefficient
  • The silhouette coefficient combines ideas of both
    cohesion and separation, for individual points as
    well as clusters and clusterings
  • For an individual point i
  • Calculate a = average distance of i to the points
    in its cluster
  • Calculate b = min (average distance of i to the
    points in another cluster)
  • The silhouette coefficient for a point is then
    given by s = 1 - a/b if a < b (or s = b/a - 1
    if a ≥ b, not the usual case)
  • Typically between 0 and 1
  • The closer to 1 the better
  • Can calculate the average silhouette width for a
    cluster or a clustering (see the sketch below)
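
A minimal Python sketch of the silhouette coefficient for every point (our
own code, assuming each cluster has at least two points). The
(b - a) / max(a, b) form below equals 1 - a/b when a < b and b/a - 1
otherwise, matching the definition above.

    import numpy as np
    from scipy.spatial.distance import cdist

    def silhouette(X, labels):
        D = cdist(X, X)
        n = len(X)
        s = np.empty(n)
        for i in range(n):
            own = (labels == labels[i]) & (np.arange(n) != i)
            a = D[i, own].mean()               # avg distance within own cluster
            b = min(D[i, labels == c].mean()   # closest other cluster
                    for c in np.unique(labels) if c != labels[i])
            s[i] = (b - a) / max(a, b)
        return s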

36
Final Comment on Cluster Validity
  • "The validation of clustering structures is
    the most difficult and frustrating part of
    cluster analysis."
  • "Without a strong effort in this direction,
    cluster analysis will remain a black art
    accessible only to those true believers who have
    experience and great courage."
  • Algorithms for Clustering Data, Jain and Dubes