Clustering

About This Presentation

Title:

Clustering

Description:

k-Means, hierarchical clustering, Self-Organizing Maps. Self Organizing Map: neighborhood function to preserve topological properties of the input space. Neighbors share ...

Slides: 39

Provided by: Piete6

Transcript and Presenter's Notes

Title: Clustering

1
Clustering

k-Means,
hierarchical clustering,
Self-Organizing Maps

2
Outline

k-means clustering
Hierarchical clustering
Self-Organizing Maps

3
Classification vs. Clustering
Classification: supervised learning
4
Classification vs. Clustering
Clustering: unsupervised learning. Labels are unknown; find the
natural grouping of instances
5
Many Clustering Applications

Basically, everywhere labels are unknown, uncertain,
or too expensive to obtain
Marketing: find groups of similar customers
Astronomy: find groups of similar stars, galaxies
Earthquake studies: cluster earthquake
epicenters along continental faults
Genomics: find groups of genes with similar
expression

6
Clustering Methods Terminology
Non-overlapping
Overlapping
7
Clustering Methods Terminology
Bottom-up (agglomerative)
Top-down
8
Clustering Methods Terminology
Hierarchical
(vs flat)
9
Clustering Methods Terminology
Deterministic
Probabilistic
10
k-Means Clustering
11
K-means clustering (k = 3)
Pick k random points as initial cluster centers
12
K-means clustering (k = 3)
Assign each point to nearest cluster center
13
K-means clustering (k = 3)
Move cluster centers to mean of each cluster
14
K-means clustering (k = 3)
Reassign points to nearest cluster center
15
K-means clustering (k = 3)
Repeat steps 3-4 until cluster centers converge
(don't move, or hardly move)
16
K-means

Works with numeric data only
1. Pick k random points as initial cluster centers
2. Assign every item to its nearest cluster center
(e.g. using Euclidean distance)
3. Move each cluster center to the mean of its
assigned items
4. Repeat steps 2 and 3 until convergence (change in
cluster assignments less than a threshold)
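
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's code: the function names, the toy data, the fixed seed, and the choice to stop when no assignment changes at all (a threshold of zero) are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Pick k random data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Converged: no item changed cluster since the last pass.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_point(members)
    return centers, assignment

# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

Which cluster gets index 0 depends on the random start, so only the grouping itself is meaningful, not the label values.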

17
K-means clustering another example
http://www.youtube.com/watch?feature=player_embedded&v=BVFG7fd1H30
18
Discussion

Result can vary significantly depending on
initial choice of centers
Can get trapped in local minimum
Example
To increase the chance of finding the global optimum:
restart with different random seeds
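
A rough sketch of the restart idea: run k-means several times from different random starts and keep the run with the lowest cost. The one-dimensional data, the cost function (sum of squared distances to the nearest center), and the number of restarts are assumptions chosen to keep the example short.

```python
import random

def k_means_1d(values, k, rng, max_iter=100):
    """Plain k-means on a list of numbers; returns (sorted centers, cost)."""
    centers = rng.sample(values, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(members) / len(members) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Cost: sum of squared distances to the nearest center.
    cost = sum(min((v - c) ** 2 for c in centers) for v in values)
    return sorted(centers), cost

def best_of_restarts(values, k, n_restarts=20, seed=0):
    """Run k-means from several random starts and keep the cheapest result."""
    rng = random.Random(seed)
    runs = [k_means_1d(values, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

# Hypothetical data with three obvious groups.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]
centers, cost = best_of_restarts(values, k=3)
print(centers, cost)
```

On this data a start such as 1, 2, 3 gets stuck with two centers in the first group and one center covering the other two groups, which is the kind of local minimum the restarts are meant to escape.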

19
Discussion, circular data

Arbitrary results
Prototypes not on data

20
K-means clustering summary

Advantages
Simple, understandable
Instances automatically assigned to clusters
Fast

Disadvantages
Must pick number of clusters beforehand
Every instance is forced into exactly one cluster
Sensitive to outliers
Random algorithm
Random results
Not always intuitive
Higher dimensions

21
K-means variations

k-medoids: instead of the mean, use the medoid (the most
central instance, analogous to a median) of each cluster
Mean of 1, 3, 5, 7, 1009 is 205
Median of 1, 3, 5, 7, 1009 is 5
For large databases, use sampling
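
The outlier effect in this example can be checked directly. The `medoid` helper below is a hypothetical name introduced for illustration; it picks the instance with the smallest total distance to all others, which for numbers on a line coincides with the median.

```python
from statistics import mean, median

def medoid(items):
    """The item with the smallest total distance to all other items."""
    return min(items, key=lambda c: sum(abs(c - x) for x in items))

values = [1, 3, 5, 7, 1009]
print(mean(values))    # 205, pulled far away by the outlier 1009
print(median(values))  # 5
print(medoid(values))  # 5, always an actual data point
```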
22
How to choose k?

One important parameter: k, but how to choose it?
Domain dependent: we simply want k clusters
Alternative: repeat for several values of k and
choose the best
Example
cluster mammals by their properties
each value of k leads to a different clustering
use an MDL-based encoding for the data in
clusters
each additional cluster introduces a penalty
optimal for k = 6

23
Clustering Evaluation

Manual inspection
Benchmarking on existing labels
Classification through clustering
Is this fair?
Cluster quality measures
distance measures
high similarity within a cluster, low across
clusters
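
One simple way to put a number on "high similarity within a cluster, low across clusters" is to compare average pairwise distances. This is only a sketch of one possible measure; the function name and the toy points are assumptions, and it expects at least one within-cluster and one between-cluster pair.

```python
from itertools import combinations
from math import dist

def mean_pairwise_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

print(mean_pairwise_distances(points, good_labels))
print(mean_pairwise_distances(points, bad_labels))
```

A good clustering shows a small first value and a large second value; the mixed-up labelling narrows or reverses that gap.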

24
Hierarchical Clustering
25
Hierarchical clustering

Hierarchical clustering is represented in a
dendrogram
tree structure containing hierarchical clusters
individual clusters in leaves, union of child
clusters in nodes

26
Bottom-up vs top-down clustering

Bottom up/Agglomerative
Start with single-instance clusters
At each step, join two closest clusters
Top down
Start with one universal cluster
Split in two clusters
Proceed recursively on each subset
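
A minimal sketch of the bottom-up variant on one-dimensional data. It assumes "closest" means the smallest distance between any two members of the two clusters; the function name, the stopping rule (a target number of clusters), and the data are illustrative. The recorded merges are a flat stand-in for the tree.

```python
def agglomerative(values, n_clusters):
    """Bottom-up clustering of numbers; returns (clusters, merge history)."""
    clusters = [[v] for v in values]  # start with single-instance clusters
    merges = []
    while len(clusters) > n_clusters:
        # Find the two closest clusters (smallest member-to-member distance).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Join the pair and keep every other cluster unchanged.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

clusters, merges = agglomerative([1.0, 2.0, 6.0, 7.0, 20.0], n_clusters=2)
print(clusters)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```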

27
Distance Between Clusters

Centroid: distance between centroids
Sometimes hard to compute (e.g. mean of
molecules?)
Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
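
The four definitions can be written side by side for clusters of numeric points. This is a sketch assuming Euclidean distance; the function names and the two example clusters are made up for illustration.

```python
from itertools import product
from math import dist

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))

def single_link(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Largest distance between a point of a and a point of b."""
    return max(dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Average distance over all pairs (one point from each cluster)."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(centroid_distance(a, b))  # 3.5
print(single_link(a, b))        # 2.0
print(complete_link(a, b))      # 5.0
print(average_link(a, b))       # 3.5
```

Only the centroid variant needs a mean of the instances; the three link variants need nothing but pairwise distances.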

28
Clustering dendrogram
29
How many clusters?
30
Probability-based Clustering

Given k clusters, each instance belongs to all
clusters (instead of a single one), with a
certain probability
mixture model: set of k distributions (one per
cluster)
each cluster also has a prior likelihood
If the correct clustering is known, we know the parameters
and P(C_i) for each cluster: calculate P(C_i | x)
using Bayes' rule
How to estimate the unknown parameters?
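
For the case where the parameters are known, the Bayes' rule step can be sketched as follows. The slides only say "k distributions"; using one-dimensional Gaussians here is an assumption, as are the function names and the parameter values.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

def cluster_posteriors(x, priors, means, stds):
    """P(C_i | x) for every cluster i, via Bayes' rule."""
    # Numerator of Bayes' rule: P(x | C_i) * P(C_i) for each cluster.
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    # Denominator: P(x), the sum of the numerators over all clusters.
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical mixture: two equally likely clusters centered at 0 and 4.
priors = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

print(cluster_posteriors(2.0, priors, means, stds))  # halfway: equal shares
print(cluster_posteriors(0.0, priors, means, stds))  # almost surely cluster 0
```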

31
Self-Organising Maps
32
Self Organizing Map

Group similar data together
Dimensionality reduction
Data visualization technique
Similar to neural networks
Neurons try to mimic the input vectors
The winning neuron (and its neighborhood) wins
Topology preserving, using a neighborhood function

33
Self Organizing Map

Input: high-dimensional input space
Output: low-dimensional (typically 2 or 3)
network topology
Training:
Starting with a large learning rate and
neighborhood size, both are gradually decreased
to facilitate convergence
After learning, neurons with similar weights
tend to cluster on the map

34
Learning the SOM

Determine the winner (the neuron whose
weight vector has the smallest distance to the
input vector)
Move the weight vector w of the winning neuron
towards the input i
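
A single learning step for the winner alone might look like this. It is a sketch assuming Euclidean distance and a fixed learning rate of 0.5; the function names and numbers are illustrative.

```python
from math import dist

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to input x."""
    return min(range(len(weights)), key=lambda n: dist(weights[n], x))

def move_towards(w, x, learning_rate):
    """Move weight vector w a fraction learning_rate of the way to x."""
    return [wi + learning_rate * (xi - wi) for wi, xi in zip(w, x)]

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
x = [2.0, 1.0]

winner = best_matching_unit(weights, x)
weights[winner] = move_towards(weights[winner], x, learning_rate=0.5)
print(winner, weights[winner])  # 1 [1.5, 1.0]
```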

35
SOM Learning Algorithm

Initialise SOM (random, or such that dissimilar
input is mapped far apart)
for t from 0 to N
  Randomly select a training instance
  Get the best matching neuron
    calculate distance, e.g.
  Scale neighbors
    Which? decrease over time: hexagons, squares,
    Gaussian,
  Update neighbors towards the training instance
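
The loop above could be filled in as follows. This is a sketch under several assumptions that the slides leave open: a square grid, random initialisation, Euclidean distance, a Gaussian neighborhood, and a linear decay of both learning rate and radius. All names and default values are illustrative.

```python
import random
from math import dist, exp

def train_som(data, rows, cols, n_steps=1000, lr0=0.5, radius0=None, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 if radius0 is not None else max(rows, cols) / 2
    # Initialise the SOM with random weights, one vector per grid cell.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(n_steps):
        # Learning rate and neighborhood radius both shrink over time.
        decay = 1.0 - t / n_steps
        lr = lr0 * decay
        radius = max(radius0 * decay, 1e-3)
        # Randomly select a training instance.
        x = rng.choice(data)
        # Best matching neuron: smallest distance between weights and input.
        bmu = min(weights, key=lambda cell: dist(weights[cell], x))
        for cell, w in weights.items():
            # Gaussian neighborhood on the grid, centered on the winner.
            grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
            influence = exp(-grid_d2 / (2 * radius ** 2))
            # Move this neuron towards the instance, scaled by its influence.
            for i in range(dim):
                w[i] += lr * influence * (x[i] - w[i])
    return weights

# Hypothetical data: two groups of 2-D points inside the unit square.
data = [[0.1, 0.1], [0.15, 0.2], [0.9, 0.9], [0.85, 0.8]]
som = train_som(data, rows=3, cols=3)
for cell in sorted(som):
    print(cell, [round(v, 2) for v in som[cell]])
```

Early on the wide neighborhood drags the whole map towards each instance; as the radius shrinks, updates become local and individual neurons settle near particular inputs.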

36
Self Organizing Map

Neighborhood function to preserve topological
properties of the input space
Neighbors share the prize (postcode lottery
principle)

37
SOM of hand-written numerals
38
SOM of countries (poverty)
39
Clustering Summary

Unsupervised
Many approaches
k-means: simple, sometimes useful
k-medoids is less sensitive to outliers
Hierarchical clustering works for symbolic
attributes
Self-Organizing Maps
Evaluation is a problem
