Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Clustering
  • Gilad Lerman
  • Math Department, UMN

Slides/figures stolen from M.-A. Dillies, E.
Keogh, A. Moore
2
What is Clustering?
  • Partitioning data into classes with
  • high intra-class similarity
  • low inter-class similarity
  • Is it well-defined?

3
What is Similarity?
  • Clearly a subjective, problem-dependent measure

4
How Similar Are the Clusters?
  • Ex. 1: Two clusters or one cluster?

5
How Similar Are the Clusters?
  • Ex. 2: One cluster, or a cluster plus outliers?

6
Sum-of-Squares Intra-class Similarity
  • Given a cluster S = {x_1, ..., x_n}
  • Mean: c = (1/n) (x_1 + ... + x_n)
  • Within-Cluster Sum of Squares:
    WCSS(S) = sum_i ||x_i - c||^2
  • Note that the mean c minimizes
    sum_i ||x_i - m||^2 over all centers m
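The slide's definitions can be sketched in a few lines of Python (a toy illustration of mine; the points and function names are not from the slides):

```python
def mean(points):
    # Coordinatewise mean of a list of points (tuples of equal length).
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def wcss(points, center):
    # Within-Cluster Sum of Squares: squared Euclidean distances to the center.
    return sum(sum((p[i] - c) ** 2 for i, c in enumerate(center)) for p in points)

S = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = mean(S)                  # (1.0, 1.0)
print(wcss(S, c))            # 8.0
print(wcss(S, (0.0, 0.0)))   # 14.0
```

The last two lines illustrate the "note" on the slide: moving the center away from the mean can only increase the sum of squares.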

7
Within-Clusters Sum of Squares
  • For a set of clusters S_1, ..., S_K:
    WCSS(S_1, ..., S_K) = sum_k sum_{x in S_k} ||x - c_k||^2
  • Can use the l1 (Manhattan) distance instead of the
    squared Euclidean one
  • So get the Within-Clusters Manhattan Distance (WCMD)
  • Question: how to compute/estimate the centers c_k?
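For the Manhattan criterion the natural per-cluster center is the coordinatewise median. A minimal sketch of mine (function names and data are illustrative, not from the slides):

```python
import statistics

def coordwise_median(points):
    # Coordinatewise median -- the l1-optimal center of a cluster.
    return tuple(statistics.median(p[i] for p in points)
                 for i in range(len(points[0])))

def wcmd(clusters):
    # Within-Clusters Manhattan Distance, summed over all clusters S_1, ..., S_K.
    total = 0.0
    for S in clusters:
        c = coordwise_median(S)
        total += sum(sum(abs(a - b) for a, b in zip(x, c)) for x in S)
    return total

clusters = [[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],   # S_1, center (0.0, 0.0)
            [(5.0, 5.0), (6.0, 5.0)]]               # S_2, center (5.5, 5.0)
print(wcmd(clusters))   # 3.0
```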

8
Minimizing WCSS
  • Precise minimization is NP-hard
  • Approximate minimization of WCSS by K-means
  • Approximate minimization of WCMD by K-medians

9
The K-means Algorithm
  • Input: data, number of clusters (K)
  • Randomly guess locations of the K cluster centers
  • Assign each data point to its nearest center
  • Recompute each center as the mean of its assigned
    points
  • Repeat till convergence
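The loop above fits in a few lines of pure Python. This is a minimal sketch of mine (not the applet's implementation); the data and seed are illustrative:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # random initial centers among the data
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, _ = kmeans(pts, 2)
print(sorted(centers))
```

On these two well-separated triplets the algorithm recovers the centers (1/3, 1/3) and (28/3, 28/3).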

10
Demonstration: K-means/medians
  • Applet

11
K-means Pros and Cons
  • Pros
  • Often fast
  • Often terminates at a local minimum
  • Cons
  • May not obtain the global minimum
  • Depends on initialization
  • Need to specify K
  • Sensitive to outliers
  • Sensitive to variations in sizes and densities of
    clusters
  • Not suitable for non-convex shapes
  • Does not apply directly to categorical data

12
Spectral Clustering
  • Idea: embed the data so that clustering becomes easy
  • Construct weights W_ij based on the proximity of
    points x_i and x_j
  • (Normalize W)
  • Embed using eigenvectors of W
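A tiny self-contained sketch of this pipeline (entirely my own construction, not the lecture's code): Gaussian proximity weights, symmetric normalization M = D^(-1/2) W D^(-1/2), and the second eigenvector of M obtained by power iteration with deflation. For two well-separated groups, the sign pattern of that eigenvector already gives the partition:

```python
import math

def spectral_split(points, sigma=1.0, iters=200):
    n = len(points)
    # Proximity weights W_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)) for 1-D points.
    W = [[math.exp(-abs(points[i] - points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    d = [sum(row) for row in W]                  # degrees
    # Symmetric normalization M = D^{-1/2} W D^{-1/2}.
    M = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Top eigenvector of M is known in closed form: proportional to sqrt(d),
    # with eigenvalue 1 (check: M sqrt(d) = sqrt(d)).
    v1 = [math.sqrt(x) for x in d]
    norm = math.sqrt(sum(x * x for x in v1))
    v1 = [x / norm for x in v1]
    # Power iteration, deflating v1, converges to the second eigenvector.
    v = [1.0 if i % 2 else -1.0 for i in range(n)]   # arbitrary start
    for _ in range(iters):
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # The sign of the 1-D embedding splits the two groups.
    return [x >= 0 for x in v]

labels = spectral_split([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
print(labels)
```

This is the simplest two-cluster case; with K clusters one would keep several eigenvectors and run K-means on the embedded points.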

13
Clustering vs. Classification
  • Clustering: find classes in an unsupervised way
    (often K is given, though)
  • Classification: labels of clusters are given for
    some data points (supervised learning)

14
Data 1: Face images
  • Facial images (e.g., of persons 5, 8, 10) live on
    different planes in the image space
  • They are often well-separated, so simple
    clustering can apply to them (but not always)
  • Question: What is the high-dimensional image
    space?
  • Question: How can we present high-dim. data in 3D?


15
Data 2: Iris Data Set
Setosa
Versicolor
Virginica
  • 50 samples from each of 3 species
  • 4 features per sample
  • length & width of sepal and petal

16
Data 2: Iris Data Set
17
Data 2: Iris Data Set
  • Setosa is clearly separated from the 2 others
  • Can't separate Virginica and Versicolor
  • (need a training set, as done by Fisher in 1936)
  • Question: What are other ways to visualize?

18
Data 3: Color-Based Compression of Images
  • Applet
  • Question: What are the actual data points?
  • Question: What does the error mean?
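One way to read the two questions (my interpretation, with made-up numbers): the data points are the pixels' RGB values, K-means picks a K-color palette, and the error is the squared distance from each pixel to its nearest palette color:

```python
def nearest(color, palette):
    # Nearest palette color in squared Euclidean distance.
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(color, c)))

pixels = [(250, 10, 10), (245, 5, 0), (10, 10, 240), (0, 0, 255)]
palette = [(248, 8, 5), (5, 5, 248)]   # e.g. centers found by K-means with K = 2
compressed = [nearest(p, palette) for p in pixels]
error = sum(sum((a - b) ** 2 for a, b in zip(p, q))
            for p, q in zip(pixels, compressed))
print(compressed, error)
```

Storing only palette indices per pixel (here 1 bit each) plus the palette is the compression; the error above is the total squared distortion it introduces.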

19
Some Methods for Estimating the Number of Clusters (with online code)
  • Gap statistics
  • Model-based clustering
  • G-means
  • X-means
  • Data-spectroscopic clustering
  • Self-tuning clustering

20
Your mission
  • Learn about clustering (theoretical results,
    algorithms, code)
  • Focus: methods for determining the number of
    clusters
  • Understand the details
  • Compare the methods using artificial and real data
  • Conclude good/bad scenarios for each (prove?)
  • Come up with new/improved methods
  • Summarize: a literature survey and possibly
    new/improved demos/applets
  • We can suggest additional questions tailored to
    your interest