Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach

Xiaoli Zhang Fern, Carla E. Brodley. ICML 2003. Presented by Dehong Liu.

1
Random Projection for High Dimensional Data
Clustering: A Cluster Ensemble Approach
  • Xiaoli Zhang Fern, Carla E. Brodley
  • ICML 2003
  • Presented by Dehong Liu

2
Contents
  • Motivation
  • Random projection and the cluster ensemble
    approach
  • Experimental results
  • Conclusion

3
Motivation
  • High dimensionality poses two challenges for
    unsupervised learning:
    • The presence of irrelevant and noisy features can
      mislead the clustering algorithm.
    • In high dimensions, data may be sparse, making it
      difficult to find any structure in the data.
  • Two basic approaches to reducing the dimensionality:
    • Feature subset selection
    • Feature transformation, e.g. PCA or random projection

4
Motivation
  • Random projection
  • Advantages:
    • A general data reduction technique
    • Has been shown to have special promise for high
      dimensional data clustering.
  • Disadvantage:
    • Highly unstable: different random projections may
      lead to radically different clustering results.

5
Idea
  • Aggregate multiple runs of clusterings to achieve
    better clustering performance.
  • A single run of clustering consists of applying
    random projection to the high dimensional data
    and clustering the reduced data using EM.
  • Multiple runs of clustering are performed and the
    results are aggregated to form an n × n similarity
    matrix.
  • An agglomerative clustering algorithm is then
    applied to the matrix to produce the final
    clusters.

6
A single run
  • Random projection: X′ = X R
  • X′ is n × d′, the reduced-dimension data set
  • X is n × d, the high-dimensional data set
  • R is d × d′, generated by first setting
    each entry of the matrix to a value drawn from an
    i.i.d. N(0,1) distribution and then normalizing
    the columns to unit length.
  • EM clustering
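
The projection step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the toy data, and the choice of 5 target dimensions are all assumptions, and the EM clustering step that follows the projection is not shown.

```python
import numpy as np

def random_projection_matrix(d, d_reduced, rng):
    """Return a d x d_reduced matrix with N(0,1) entries and unit-length columns."""
    R = rng.standard_normal((d, d_reduced))
    # Normalize every column to unit Euclidean length.
    R /= np.linalg.norm(R, axis=0, keepdims=True)
    return R

def project(X, d_reduced, rng):
    """Project the n x d data set X down to n x d_reduced."""
    R = random_projection_matrix(X.shape[1], d_reduced, rng)
    return X @ R

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))   # toy high-dimensional data (assumed)
X_reduced = project(X, 5, rng)         # 5 is an arbitrary illustrative choice
```

Each call draws a fresh R, which is the source of the instability noted earlier: two runs on the same X generally give different reduced data sets.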

7
Aggregating multiple clustering results
  • The probability that data point i belongs to each
    cluster under the model θ
  • The probability that data points i and j belong
    to the same cluster under the model θ
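
The formulas on this slide did not survive in the transcript. A minimal sketch of the aggregation, assuming that the co-membership probability under one model θ is the sum over clusters l of P(l | i, θ) · P(l | j, θ), and that the runs are combined by a plain average. The function name and the toy membership matrices are hypothetical.

```python
import numpy as np

def aggregate_similarity(memberships):
    """Average the per-run co-membership matrices over all runs.

    memberships: list of (n x k_t) arrays; row i of each array holds the
    soft cluster memberships P(l | i, theta_t) of point i in run t.
    """
    n = memberships[0].shape[0]
    P = np.zeros((n, n))
    for M in memberships:
        # (M @ M.T)[i, j] = sum over l of P(l | i) * P(l | j)
        P += M @ M.T
    return P / len(memberships)

# Two toy runs over three data points (assumed values).
run_a = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
run_b = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])
P = aggregate_similarity([run_a, run_b])
```

The runs need not use the same number of clusters, since only the n × n product of each membership matrix with its transpose enters the average.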

8
The values P_ij form an n × n similarity matrix.
9
Producing final clusters
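
The transcript keeps only the slide title here. As a sketch of how agglomerative clustering can run directly on a similarity matrix, the code below assumes a complete-link rule (the similarity of two clusters is that of their least similar cross pair); the slide does not state the linkage, and the function name and toy matrix are hypothetical.

```python
import numpy as np

def agglomerate(P, k):
    """Merge the most similar clusters until k remain.

    Returns (labels, merge_sims), where merge_sims[t] is the similarity
    of the two clusters joined at merge t.
    """
    clusters = [[i] for i in range(P.shape[0])]
    merge_sims = []
    while len(clusters) > k:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: least similar pair across the two clusters.
                sim = P[np.ix_(clusters[a], clusters[b])].min()
                if sim > best:
                    best, best_pair = sim, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]            # b > a, so index a is unaffected
        merge_sims.append(float(best))
    labels = np.empty(P.shape[0], dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels, merge_sims

# Toy similarity matrix with two obvious groups (assumed values).
P_toy = np.array([[1.0, 0.9, 0.1, 0.2],
                  [0.9, 1.0, 0.2, 0.1],
                  [0.1, 0.2, 1.0, 0.8],
                  [0.2, 0.1, 0.8, 1.0]])
labels, merge_sims = agglomerate(P_toy, 2)
```

The nested loop is quadratic in the number of clusters at every merge, which is fine for illustration but not for large n.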
10
How to decide k?
A sudden drop in the similarity of the clusters being
merged can serve as a heuristic for choosing k.
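
One simple reading of this heuristic, sketched under assumptions: run the agglomeration all the way down to a single cluster, record the similarity at each merge, and stop just before the merge that follows the largest single drop. The "largest single drop" rule and the function name are illustrative choices, not taken from the slides.

```python
import numpy as np

def choose_k(merge_sims, n_points):
    """Pick k from the similarities recorded at successive merges.

    merge_sims[t] is the similarity of the clusters joined at merge t,
    for a full agglomeration from n_points clusters down to one.
    """
    sims = np.asarray(merge_sims, dtype=float)
    drops = sims[:-1] - sims[1:]       # drop between consecutive merges
    t = int(np.argmax(drops)) + 1      # first merge after the largest drop
    return n_points - t                # clusters present just before it

# Four points merged 4 -> 3 -> 2 -> 1 (assumed similarities).
k = choose_k([0.9, 0.8, 0.1], n_points=4)
```

In this toy case the similarity falls from 0.8 to 0.1 at the last merge, so the sketch keeps the two clusters that existed before it.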
11
Experimental results
  • Evaluation Criteria
  • Conditional Entropy (CE) measures the
    uncertainty of the class labels given a
    clustering solution.
  • Normalized Mutual Information (NMI) between the
    distribution of class labels and the distribution
    of cluster labels.
  • CE: the smaller the better. NMI: the larger the
    better.
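
Both criteria can be computed from the class-by-cluster contingency table. A minimal sketch, assuming base-2 logarithms and the geometric-mean normalization NMI = MI / sqrt(H(class) · H(cluster)); the slide does not say which normalization is used, and the function names are hypothetical.

```python
import numpy as np

def _joint(classes, clusters):
    """Joint distribution of class labels and cluster labels."""
    _, ci = np.unique(np.asarray(classes).ravel(), return_inverse=True)
    _, ki = np.unique(np.asarray(clusters).ravel(), return_inverse=True)
    table = np.zeros((ci.max() + 1, ki.max() + 1))
    np.add.at(table, (ci, ki), 1)      # count co-occurrences
    return table / table.sum()

def _entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(classes, clusters):
    """H(class | cluster) = H(class, cluster) - H(cluster)."""
    joint = _joint(classes, clusters)
    return _entropy(joint.ravel()) - _entropy(joint.sum(axis=0))

def nmi(classes, clusters):
    """Mutual information normalized by sqrt(H(class) * H(cluster))."""
    joint = _joint(classes, clusters)
    h_c = _entropy(joint.sum(axis=1))
    h_k = _entropy(joint.sum(axis=0))
    if h_c == 0 or h_k == 0:
        return 0.0                     # convention chosen for this sketch
    mi = h_c + h_k - _entropy(joint.ravel())
    return mi / np.sqrt(h_c * h_k)

classes = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]     # same partition, different label names
unrelated = [0, 1, 0, 1]   # independent of the classes
```

With these toy labels, the perfect clustering has zero conditional entropy and an NMI of one, while the unrelated clustering leaves one full bit of uncertainty about the class and has an NMI of zero.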

12
Experimental results
  • Cluster ensemble versus single RP+EM (random
    projection followed by EM)

13
Experimental results
  • Cluster ensemble versus PCA+EM (PCA followed by EM)

14
Experimental results
  • Cluster ensemble versus PCA+EM (PCA followed by EM)

15
Analysis of Diversity for Cluster Ensembles
  • Diversity: the NMI between each pair of
    clustering solutions.
  • Quality: the average of the NMI values between
    each of the solutions and the class labels.
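
A sketch of the two ensemble statistics, under assumptions: diversity is summarized here as the mean NMI over all pairs of solutions (so a lower value means a more diverse ensemble), and NMI uses the geometric-mean normalization with base-2 logarithms. The averaging over pairs, the normalization, and all names are illustrative choices.

```python
from itertools import combinations

import numpy as np

def nmi(a, b):
    """NMI of two labelings, normalized by sqrt(H(a) * H(b))."""
    _, ai = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(b).ravel(), return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1)
    joint /= joint.sum()

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    h_a, h_b = entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0))
    if h_a == 0 or h_b == 0:
        return 0.0                     # convention chosen for this sketch
    return float((h_a + h_b - entropy(joint.ravel())) / np.sqrt(h_a * h_b))

def ensemble_diversity(solutions):
    """Mean pairwise NMI between solutions (lower = more diverse)."""
    pairs = list(combinations(solutions, 2))
    return sum(nmi(a, b) for a, b in pairs) / len(pairs)

def ensemble_quality(solutions, classes):
    """Mean NMI between each solution and the class labels."""
    return sum(nmi(s, classes) for s in solutions) / len(solutions)

classes = [0, 0, 1, 1]
solutions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]   # toy ensemble
diversity = ensemble_diversity(solutions)
quality = ensemble_quality(solutions, classes)
```

Quality needs the true class labels, so it is an analysis tool for labeled benchmarks and cannot guide clustering on unlabeled data.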

16
(No Transcript)
17
Conclusion
  • Techniques have been investigated to produce and
    combine multiple clusterings in order to achieve
    an improved final clustering.
  • The major contributions of this paper:
    1) examined random projection for high dimensional
       data clustering and identified its instability
       problem;
    2) formed a novel cluster ensemble framework based
       on random projection and demonstrated its
       effectiveness for high dimensional data
       clustering;
    3) identified the importance of the quality and
       diversity of individual clustering solutions and
       illustrated their influence on the ensemble
       performance with empirical results.