Overview of Clustering - PowerPoint PPT Presentation

1
Overview of Clustering
  • Rong Jin

2
Outline
  • K-means for clustering
  • Expectation Maximization algorithm for clustering
  • Spectral clustering (time permitting)

3
Clustering
  • Find the underlying structure of the given data points


(Figure: data points plotted against an age axis)
4
Application (I): Search Result Clustering
5
Application (II): Navigation
6
Application (III): Google News
7
Application (IV): Visualization
Islands of Music (Pampalk et al., KDD '03)
8
Application (V): Image Compression
http://www.ece.neu.edu/groups/rpl/kmeans/
9
How to Find a Good Clustering?
  • Minimize the sum of distances within clusters

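Written out (the objective is presumably shown as a figure on the original slide), the quantity being minimized is the within-cluster sum of squared distances, with μk denoting the center of cluster Ck:

$$ J(C_1, \ldots, C_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i $$
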
10
How to Efficiently Cluster Data?
11
K-means for Clustering
  • K-means
  • Start with a random guess of cluster centers
  • Determine the membership of each data point
  • Adjust the cluster centers

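A minimal Python sketch of these three steps (NumPy only; the initialization scheme and stopping test are assumptions, not from the slides):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: random centers, assign points, re-center."""
    rng = np.random.default_rng(seed)
    # Start with a random guess of cluster centers (k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Determine the membership of each data point:
        # the index of its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Adjust the cluster centers: centroid of the points each one owns
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
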
14
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)

15
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations

16
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations
  • Each data point finds out which center it's closest to. (Thus each center owns a set of data points)

17
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations
  • Each data point finds out which center it's closest to.
  • Each center finds the centroid of the points it owns

18
K-means
  • Ask user how many clusters they'd like. (e.g., k = 5)
  • Randomly guess k cluster center locations
  • Each data point finds out which center it's closest to.
  • Each center finds the centroid of the points it owns

Any Computational Problem?
19
Improve K-means
  • Group points by region
  • KD tree
  • SR tree
  • Key difference
  • Find the closest center for each rectangle
  • Assign all the points within a rectangle to one
    cluster

20
Improved K-means
  • Find the closest center for each rectangle
  • Assign all the points within a rectangle to one
    cluster

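A simplified Python sketch of the rectangle test. This is not the full KD-tree traversal the slides refer to; the min/max distance bounds are standard geometry, and the function names are illustrative:

import numpy as np

def min_dist_to_rect(c, lo, hi):
    """Smallest distance from center c to the axis-aligned box [lo, hi]."""
    return np.linalg.norm(c - np.clip(c, lo, hi))

def max_dist_to_rect(c, lo, hi):
    """Largest distance from center c to any point of the box."""
    farthest = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return np.linalg.norm(c - farthest)

def owner_of_rect(centers, lo, hi):
    """Index of the center owning the whole box, or None if no center
    dominates and the box has to be split further."""
    best = np.array([min_dist_to_rect(c, lo, hi) for c in centers])
    worst = np.array([max_dist_to_rect(c, lo, hi) for c in centers])
    cand = worst.argmin()
    # `cand` owns the box if even its farthest point is closer than the
    # nearest possible point of every other center.
    others = np.delete(best, cand)
    if others.size == 0 or worst[cand] <= others.min():
        return cand
    return None

Rectangles for which owner_of_rect returns None are split into their children and tested again; individual points are only touched at the leaves.
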
21-29
Improved K-means
30
A Gaussian Mixture Model for Clustering
  • Assume that data are generated from a mixture of
    Gaussian distributions
  • For each Gaussian distribution
  • Center μi
  • Variance σi (ignored here; covariance assumed known)
  • For each data point
  • Determine membership

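The generative model sketched above can be written out as a mixture density; with πj for the (unshown) prior of the j-th Gaussian and a shared, known variance σ², a standard form is:

$$ p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \sigma^2 I), \qquad \mathcal{N}(x \mid \mu_j, \sigma^2 I) \propto \exp\!\Big(-\frac{\lVert x - \mu_j \rVert^2}{2\sigma^2}\Big) $$
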
31
Learning a Gaussian Mixture (with known covariance)
  • Probability

33
Learning a Gaussian Mixture (with known covariance)
E-Step
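
The E-step formula on this slide is an image; for known, equal covariances and equal mixing weights (an assumption here, the simplest version of this model), the standard posterior membership is:

$$ E[z_{ij}] = \frac{\exp\!\big(-\lVert x_i - \mu_j \rVert^2 / 2\sigma^2\big)}{\sum_{l=1}^{k} \exp\!\big(-\lVert x_i - \mu_l \rVert^2 / 2\sigma^2\big)} $$
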
34
Learning a Gaussian Mixture (with known covariance)
M-Step
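
The M-step formula is likewise an image; the standard update re-estimates each center as the membership-weighted mean of the data (a reconstruction, not copied from the slide):

$$ \mu_j \leftarrow \frac{\sum_{i=1}^{n} E[z_{ij}] \, x_i}{\sum_{i=1}^{n} E[z_{ij}]} $$
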
35
Gaussian Mixture Example: Start
36
After First Iteration
37
After 2nd Iteration
38
After 3rd Iteration
39
After 4th Iteration
40
After 5th Iteration
41
After 6th Iteration
42
After 20th Iteration
43
Mixture Model for Doc Clustering
  • A set of language models

44
Mixture Model for Doc Clustering
  • A set of language models
  • Probability

46
Mixture Model for Doc Clustering
  • A set of language models

Introduce hidden variable zij: zij = 1 if document di is generated by the j-th language model θj, and 0 otherwise.
  • Probability

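The probability on this slide is an image; a standard unigram mixture over documents, with c(w, di) the count of word w in document di, would read:

$$ p(d_i) = \sum_{j=1}^{K} p(\theta_j) \prod_{w} p(w \mid \theta_j)^{c(w, d_i)} $$
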
47
Learning a Mixture Model
K: the number of language models
48
Learning a Mixture Model
M-Step
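
The M-step on the slide is an image; the usual updates (a reconstruction under the model above) re-estimate each language model and its prior from membership-weighted counts:

$$ p(w \mid \theta_j) \leftarrow \frac{\sum_i E[z_{ij}] \, c(w, d_i)}{\sum_{w'} \sum_i E[z_{ij}] \, c(w', d_i)}, \qquad p(\theta_j) \leftarrow \frac{1}{n} \sum_i E[z_{ij}] $$
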
49
Examples of Mixture Models
50
Other Mixture Models
  • Probabilistic latent semantic indexing (PLSI)
  • Latent Dirichlet Allocation (LDA)

51
Problems (I)
  • Both k-means and mixture models need to compute cluster centers and an explicit distance measure
  • Given a strange distance measure, the centers of clusters can be hard to compute
  • E.g.,

52
Problems (II)
  • Both k-means and mixture models look for compact
    clustering structures
  • In some cases, connected clustering structures
    are more desirable

53
Graph Partition
  • MinCut: bipartition the graph with the minimal number of cut edges

CutSize = 2
54
2-way Spectral Graph Partitioning
  • Weight matrix W
  • wi,j: the weight between vertices i and j
  • Membership vector q

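Putting the two together: with qi ∈ {−1, +1} marking which side vertex i is on, and D the diagonal degree matrix with dii = Σj wi,j, the cut size has the matrix form

$$ \text{CutSize} = \frac{1}{8} \sum_{i,j} w_{i,j} (q_i - q_j)^2 = \frac{1}{4}\, q^{T} (D - W)\, q, $$

so minimizing the cut means minimizing a quadratic form in q (the constant factor does not affect the minimizer).
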
55
Solving the Optimization Problem
  • Directly solving the above problem requires combinatorial search → exponential complexity
  • How to reduce the computational complexity?

56
Relaxation Approach
  • Key difficulty: qi has to be either −1 or +1
  • Relax qi to be any real number
  • Impose constraint

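The constraint on this slide is an image; presumably it is the usual normalization that keeps q at the same scale as a ±1 vector, giving the relaxed problem

$$ \min_{q \in \mathbb{R}^n} \; q^{T} (D - W)\, q \quad \text{s.t.} \quad q^{T} q = n, $$

whose solution (after excluding the trivial constant vector) is an eigenvector of D − W.
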
57
Relaxation Approach
58
Relaxation Approach
  • Solution: the second smallest eigenvector of D − W

59
Graph Laplacian
  • L = D − W is a positive semi-definite matrix
  • For any x, we have xᵀLx ≥ 0. Why? (See the identity below.)
  • Minimum eigenvalue λ1 = 0 (what is the eigenvector?)
  • The second smallest eigenvalue λ2 gives the best bipartition

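The "why" has a one-line answer: the quadratic form is a sum of squares, and the same identity shows that the all-ones vector 1 is the eigenvector for λ1 = 0:

$$ x^{T} (D - W)\, x = \frac{1}{2} \sum_{i,j} w_{i,j} (x_i - x_j)^2 \;\ge\; 0, \qquad (D - W)\,\mathbf{1} = \mathbf{0} $$
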
60
Recovering Partitions
  • Due to the relaxation, q can take any real values (not just −1 and +1)
  • How to construct partition based on the
    eigenvector?

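One common recipe (a standard answer, not given on the slide) is to threshold the eigenvector, either at zero or at its median; a minimal sketch:

import numpy as np

def recover_partition(q, balanced=False):
    """Map the relaxed real-valued eigenvector q back to a 2-way partition."""
    # Threshold at 0, or at the median if equally sized parts are preferred.
    t = np.median(q) if balanced else 0.0
    return q >= t  # boolean membership: True = one side, False = the other
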
61
Spectral Clustering
  • Minimum cut does not balance the sizes of the two parts

62
Normalized Cut (Shi & Malik, 1997)
  • Minimize the similarity between clusters while maximizing the similarity within clusters

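The criterion from Shi & Malik, with cut(A, B) the total edge weight between the two parts and assoc(A, V) the total weight from part A to all vertices:

$$ \text{NCut}(A, B) = \frac{\text{cut}(A, B)}{\text{assoc}(A, V)} + \frac{\text{cut}(A, B)}{\text{assoc}(B, V)} $$
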
63
Normalized Cut
64
Normalized Cut
  • Relax q to real values under the constraint (see below)

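The constraint itself is an image on the slide (in Shi & Malik it is qᵀD1 = 0); under that relaxation, minimizing the normalized cut reduces to a generalized eigenvalue problem, solved by the second smallest generalized eigenvector:

$$ (D - W)\, q = \lambda\, D\, q $$
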
65
Image Segmentation
66
Non-negative Matrix Factorization