Overview of Clustering - PowerPoint PPT Presentation

1
Overview of Clustering
  • Rong Jin

2
Outline
  • K-means for clustering
  • Expectation Maximization algorithm for clustering
  • Spectral clustering (time permitting)

3
Clustering
  • Find the underlying structure of the given data points


(Figure: data points plotted against an age axis)
4
Application (I): Search Result Clustering
5
Application (II): Navigation
6
Application (III): Google News
7
Application (IV): Visualization
Islands of Music (Pampalk et al., KDD '03)
8
Application (V): Image Compression
http://www.ece.neu.edu/groups/rpl/kmeans/
9
How to Find a Good Clustering?
  • Minimize the sum of distances within clusters

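Written out (the objective is presumably shown as a figure on the original slide), the quantity being minimized is the within-cluster sum of squared distances, with μk denoting the center of cluster Ck:

$$ J(C_1, \ldots, C_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i $$
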
10
How to Efficiently Cluster Data?
11
K-means for Clustering
  • K-means
  • Start with a random guess of cluster centers
  • Determine the membership of each data point
  • Adjust the cluster centers

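A minimal Python sketch of these three steps (NumPy only; the initialization scheme and stopping test are assumptions, not from the slides):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: random centers, assign points, re-center."""
    rng = np.random.default_rng(seed)
    # Start with a random guess of cluster centers (k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Determine the membership of each data point:
        # the index of its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Adjust the cluster centers: centroid of the points each one owns
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
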
14
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)

15
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations

16
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations
  • Each data point finds out which center it's closest to. (Thus each center owns a set of data points)

17
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations
  • Each data point finds out which center it's closest to.
  • Each center finds the centroid of the points it owns

18
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations
  • Each data point finds out which center it's closest to.
  • Each center finds the centroid of the points it owns

Any Computational Problem?
19
Improve K-means
  • Group points by region
  • KD tree
  • SR tree
  • Key difference
  • Find the closest center for each rectangle
  • Assign all the points within a rectangle to one
    cluster

20
Improved K-means
  • Find the closest center for each rectangle
  • Assign all the points within a rectangle to one
    cluster

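A simplified Python sketch of the rectangle test. This is not the full KD-tree traversal the slides refer to; the min/max distance bounds are standard geometry, and the function names are illustrative:

import numpy as np

def min_dist_to_rect(c, lo, hi):
    """Smallest distance from center c to the axis-aligned box [lo, hi]."""
    return np.linalg.norm(c - np.clip(c, lo, hi))

def max_dist_to_rect(c, lo, hi):
    """Largest distance from center c to any point of the box."""
    farthest = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return np.linalg.norm(c - farthest)

def owner_of_rect(centers, lo, hi):
    """Index of the center owning the whole box, or None if no center
    dominates and the box has to be split further."""
    best = np.array([min_dist_to_rect(c, lo, hi) for c in centers])
    worst = np.array([max_dist_to_rect(c, lo, hi) for c in centers])
    cand = worst.argmin()
    # `cand` owns the box if even its farthest point is closer than the
    # nearest possible point of every other center.
    others = np.delete(best, cand)
    if others.size == 0 or worst[cand] <= others.min():
        return cand
    return None

Rectangles for which owner_of_rect returns None are split into their children and tested again; individual points are only touched at the leaves.
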
21-29
Improved K-means
30
A Gaussian Mixture Model for Clustering
  • Assume that data are generated from a mixture of
    Gaussian distributions
  • For each Gaussian distribution
  • Center μi
  • Variance σi (ignored here; covariance assumed known)
  • For each data point
  • Determine membership

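The generative model sketched above can be written out as a mixture density; with πj for the (unshown) prior of the j-th Gaussian and a shared, known variance σ², a standard form is:

$$ p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \sigma^2 I), \qquad \mathcal{N}(x \mid \mu_j, \sigma^2 I) \propto \exp\!\Big(-\frac{\lVert x - \mu_j \rVert^2}{2\sigma^2}\Big) $$
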
31
Learning a Gaussian Mixture (with known covariance)
  • Probability

33
Learning a Gaussian Mixture (with known covariance)
E-Step
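
The E-step formula on this slide is an image; for known, equal covariances and equal mixing weights (an assumption here, the simplest version of this model), the standard posterior membership is:

$$ E[z_{ij}] = \frac{\exp\!\big(-\lVert x_i - \mu_j \rVert^2 / 2\sigma^2\big)}{\sum_{l=1}^{k} \exp\!\big(-\lVert x_i - \mu_l \rVert^2 / 2\sigma^2\big)} $$
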
34
Learning a Gaussian Mixture (with known covariance)
M-Step
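
The M-step formula is likewise an image; the standard update re-estimates each center as the membership-weighted mean of the data (a reconstruction, not copied from the slide):

$$ \mu_j \leftarrow \frac{\sum_{i=1}^{n} E[z_{ij}] \, x_i}{\sum_{i=1}^{n} E[z_{ij}]} $$
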
35
Gaussian Mixture Example: Start
36
After First Iteration
37
After 2nd Iteration
38
After 3rd Iteration
39
After 4th Iteration
40
After 5th Iteration
41
After 6th Iteration
42
After 20th Iteration
43
Mixture Model for Doc Clustering
  • A set of language models

44
Mixture Model for Doc Clustering
  • A set of language models
  • Probability

46
Mixture Model for Doc Clustering
  • A set of language models

Introduce hidden variable zij: zij = 1 if document di is generated by the j-th language model θj, and 0 otherwise.
  • Probability

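The probability on this slide is an image; a standard unigram mixture over documents, with c(w, di) the count of word w in document di, would read:

$$ p(d_i) = \sum_{j=1}^{K} p(\theta_j) \prod_{w} p(w \mid \theta_j)^{c(w, d_i)} $$
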
47
Learning a Mixture Model
K: the number of language models
48
Learning a Mixture Model
M-Step
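
The M-step on the slide is an image; the usual updates (a reconstruction under the model above) re-estimate each language model and its prior from membership-weighted counts:

$$ p(w \mid \theta_j) \leftarrow \frac{\sum_i E[z_{ij}] \, c(w, d_i)}{\sum_{w'} \sum_i E[z_{ij}] \, c(w', d_i)}, \qquad p(\theta_j) \leftarrow \frac{1}{n} \sum_i E[z_{ij}] $$
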
49
Examples of Mixture Models
50
Other Mixture Models
  • Probabilistic latent semantic indexing (PLSI)
  • Latent Dirichlet Allocation (LDA)

51
Problems (I)
  • Both k-means and mixture models need to compute cluster centers and an explicit distance measure
  • Given a strange distance measure, the centers of clusters can be hard to compute
  • E.g.,

52
Problems (II)
  • Both k-means and mixture models look for compact
    clustering structures
  • In some cases, connected clustering structures
    are more desirable

53
Graph Partition
  • MinCut: bipartition the graph with the minimal number of cut edges

CutSize = 2
54
2-way Spectral Graph Partitioning
  • Weight matrix W
  • wi,j: the weight between vertices i and j
  • Membership vector q

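Putting the two together: with qi ∈ {−1, +1} marking which side vertex i is on, and D the diagonal degree matrix with dii = Σj wi,j, the cut size has the matrix form

$$ \text{CutSize} = \frac{1}{8} \sum_{i,j} w_{i,j} (q_i - q_j)^2 = \frac{1}{4}\, q^{T} (D - W)\, q, $$

so minimizing the cut means minimizing a quadratic form in q (the constant factor does not affect the minimizer).
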
55
Solving the Optimization Problem
  • Directly solving the above problem requires combinatorial search → exponential complexity
  • How to reduce the computational complexity?

56
Relaxation Approach
  • Key difficulty: qi has to be either −1 or +1
  • Relax qi to be any real number
  • Impose constraint

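The constraint on this slide is an image; presumably it is the usual normalization that keeps q at the same scale as a ±1 vector, giving the relaxed problem

$$ \min_{q \in \mathbb{R}^n} \; q^{T} (D - W)\, q \quad \text{s.t.} \quad q^{T} q = n, $$

whose solution (after excluding the trivial constant vector) is an eigenvector of D − W.
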
57
Relaxation Approach
58
Relaxation Approach
  • Solution: the second smallest eigenvector of D − W

59
Graph Laplacian
  • L = D − W is a positive semi-definite matrix
  • For any x, we have xᵀLx ≥ 0. Why? (See the identity below.)
  • Minimum eigenvalue λ1 = 0 (what is the eigenvector?)
  • The second smallest eigenvalue λ2 gives the best bipartition

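The "why" has a one-line answer: the quadratic form is a sum of squares, and the same identity shows that the all-ones vector 1 is the eigenvector for λ1 = 0:

$$ x^{T} (D - W)\, x = \frac{1}{2} \sum_{i,j} w_{i,j} (x_i - x_j)^2 \;\ge\; 0, \qquad (D - W)\,\mathbf{1} = \mathbf{0} $$
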
60
Recovering Partitions
  • Due to the relaxation, q can take any real values (not just −1 and +1)
  • How to construct partition based on the
    eigenvector?

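One common recipe (a standard answer, not given on the slide) is to threshold the eigenvector, either at zero or at its median; a minimal sketch:

import numpy as np

def recover_partition(q, balanced=False):
    """Map the relaxed real-valued eigenvector q back to a 2-way partition."""
    # Threshold at 0, or at the median if equally sized parts are preferred.
    t = np.median(q) if balanced else 0.0
    return q >= t  # boolean membership: True = one side, False = the other
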
61
Spectral Clustering
  • Minimum cut does not balance the sizes of the two parts

62
Normalized Cut (Shi & Malik, 1997)
  • Minimize the similarity between clusters while maximizing the similarity within clusters

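The criterion from Shi & Malik, with cut(A, B) the total edge weight between the two parts and assoc(A, V) the total weight from part A to all vertices:

$$ \text{NCut}(A, B) = \frac{\text{cut}(A, B)}{\text{assoc}(A, V)} + \frac{\text{cut}(A, B)}{\text{assoc}(B, V)} $$
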
63
Normalized Cut
64
Normalized Cut
  • Relax q to real values under the constraint (see below)

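The constraint itself is an image on the slide (in Shi & Malik it is qᵀD1 = 0); under that relaxation, minimizing the normalized cut reduces to a generalized eigenvalue problem, solved by the second smallest generalized eigenvector:

$$ (D - W)\, q = \lambda\, D\, q $$
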
65
Image Segmentation
66
Non-negative Matrix Factorization