Clustering

- k-Means
- Hierarchical clustering
- Self-Organizing Maps

Outline

- k-means clustering
- Hierarchical clustering
- Self-Organizing Maps

Classification vs. Clustering

- Classification: supervised learning, labels known
- Clustering: unsupervised learning; labels unknown, find a natural grouping of instances

Many Clustering Applications

- Basically, everywhere where labels are unknown/uncertain/too expensive
- Marketing: find groups of similar customers
- Astronomy: find groups of similar stars, galaxies
- Earthquake studies: cluster earthquake epicenters along continental faults
- Genomics: find groups of genes with similar expression

Clustering Methods Terminology

- Non-overlapping vs. overlapping
- Bottom-up (agglomerative) vs. top-down
- Hierarchical vs. flat
- Deterministic vs. probabilistic

k-Means Clustering

K-means clustering (k=3):

1. Pick k random points as initial cluster centers
2. Assign each point to the nearest cluster center
3. Move each cluster center to the mean of its cluster
4. Reassign points to the nearest cluster center
5. Repeat steps 3-4 until the cluster centers converge (don't/hardly move)

K-means

- Works with numeric data only
- Pick k random points as initial cluster centers
- Assign every item to its nearest cluster center (e.g. using Euclidean distance)
- Move each cluster center to the mean of its assigned items
- Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold); see the sketch below
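A minimal NumPy sketch of this loop, assuming numeric data in an (n, d) array; the function name and default values are illustrative, not from the slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means on numeric data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # 1. random initial centers
    for _ in range(max_iter):
        # 2. assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:          # centers hardly move: stop
            break
        centers = new_centers
    return labels, centers
```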

K-means clustering: another example

http://www.youtube.com/watch?feature=player_embedded&v=BVFG7fd1H30

Discussion

- Results can vary significantly depending on the initial choice of centers
- Can get trapped in a local minimum
- Example
- To increase the chance of finding the global optimum: restart with different random seeds

Discussion: circular data

- Arbitrary results
- Prototypes (cluster centers) not on the data

K-means clustering summary

- Advantages
  - Simple, understandable
  - Instances automatically assigned to clusters
  - Fast
- Disadvantages
  - Must pick the number of clusters beforehand
  - All instances forced into a single cluster
  - Sensitive to outliers
  - Random algorithm, hence random results
  - Not always intuitive, especially in higher dimensions

K-means variations

- k-medoids: instead of the mean, use the median of each cluster
  - Mean of 1, 3, 5, 7, 1009 is 205
  - Median of 1, 3, 5, 7, 1009 is 5 (checked in the snippet below)
- For large databases, use sampling
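A quick check of the mean/median example above, using only the standard library:

```python
from statistics import mean, median

values = [1, 3, 5, 7, 1009]
print(mean(values))    # -> 205: the single outlier 1009 drags the mean far from the bulk of the data
print(median(values))  # -> 5: the median is unaffected by the outlier
```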

How to choose k?

- One important parameter: k, but how to choose it?
- Domain dependent: we simply want k clusters
- Alternative: repeat for several values of k and choose the best (see the sketch below)
- Example:
  - cluster mammals by their properties
  - each value of k leads to a different clustering
  - use an MDL-based encoding for the data in the clusters
  - each additional cluster introduces a penalty
  - optimal for k = 6
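The slides score each clustering with an MDL-based encoding; as a simpler stand-in, the sketch below uses the within-cluster sum of squares and the common "elbow" heuristic, reusing the kmeans() function from the earlier sketch. The data and range of k are placeholders:

```python
import numpy as np

def within_cluster_ss(X, labels, centers):
    """Total squared distance of every point to its assigned cluster center."""
    return sum(float(np.sum((X[labels == j] - centers[j]) ** 2)) for j in range(len(centers)))

X = np.random.default_rng(0).normal(size=(300, 2))      # placeholder data
for k in range(2, 10):
    labels, centers = kmeans(X, k)                       # kmeans() from the sketch above
    print(k, round(within_cluster_ss(X, labels, centers), 1))
# The score always drops as k grows; look for the 'elbow' where it stops dropping sharply.
```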

Clustering Evaluation

- Manual inspection
- Benchmarking on existing labels
  - Classification through clustering
  - Is this fair?
- Cluster quality measures
  - distance measures
  - high similarity within a cluster, low across clusters (see the sketch below)
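One simple way to make "high similarity within, low across" concrete; this generic within/between measure is an illustration, not a measure prescribed by the slides:

```python
import numpy as np

def within_between(X, labels):
    """Mean pairwise distance within clusters vs. between clusters:
    a good clustering has a small 'within' and a large 'between'."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise distances
    same = labels[:, None] == labels[None, :]                   # pairs in the same cluster
    off_diag = ~np.eye(len(X), dtype=bool)                      # ignore self-distances
    return d[same & off_diag].mean(), d[~same].mean()
```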

Hierarchical Clustering

Hierarchical clustering

- Hierarchical clustering is represented in a dendrogram
  - a tree structure containing hierarchical clusters
  - individual clusters in the leaves, unions of child clusters in the internal nodes

Bottom-up vs. top-down clustering

- Bottom-up / agglomerative
  - Start with single-instance clusters
  - At each step, join the two closest clusters (sketched after the next list)
- Top-down
  - Start with one universal cluster
  - Split it into two clusters
  - Proceed recursively on each subset

Distance Between Clusters

- Centroid distance between centroids
- Sometimes hard to compute (e.g. mean of

molecules?) - Single Link smallest distance between points
- Complete Link largest distance between points
- Average Link average distance between points

Clustering dendrogram

How many clusters?

Probability-based Clustering

- Given k clusters, each instance belongs to all clusters (instead of a single one), with a certain probability
- Mixture model: a set of k distributions (one per cluster)
- Each cluster also has a prior likelihood P(Ci)
- If the correct clustering were known, we would know the parameters and P(Ci) for each cluster, and could calculate P(Ci|x) using Bayes' rule (see the sketch below)
- How to estimate the unknown parameters?
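A sketch of the Bayes-rule step for a one-dimensional Gaussian mixture, assuming the per-cluster means, standard deviations and priors are already known (estimating them from unlabelled data is the open question above). All parameter values are illustrative:

```python
import numpy as np

# Assumed parameters of k = 2 one-dimensional Gaussian clusters:
# prior likelihoods P(Ci), means and standard deviations.
priors = np.array([0.6, 0.4])
means  = np.array([0.0, 5.0])
stds   = np.array([1.0, 2.0])

def gaussian_pdf(x, mu, sigma):
    """Density P(x | Ci) of a one-dimensional Gaussian."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posterior(x):
    """P(Ci | x) = P(x | Ci) P(Ci) / sum_j P(x | Cj) P(Cj)  (Bayes' rule)."""
    joint = gaussian_pdf(x, means, stds) * priors
    return joint / joint.sum()

print(posterior(1.0))   # membership probabilities of x = 1.0, one entry per cluster
```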

Self-Organising Maps

Self Organizing Map

- Groups similar data together
- Dimensionality reduction
- Data visualization technique
- Similar to neural networks
- Neurons try to mimic the input vectors
- The winning neuron (and its neighborhood) wins
- Topology preserving, using a neighborhood function

Self Organizing Map

- Input: high-dimensional input space
- Output: low-dimensional (typically 2 or 3) network topology
- Training
  - Starting with a large learning rate and neighborhood size, both are gradually decreased to facilitate convergence
- After learning, neurons with similar weights tend to cluster on the map

Learning the SOM

- Determine the winner (the neuron whose weight vector has the smallest distance to the input vector)
- Move the weight vector w of the winning neuron towards the input i

SOM Learning Algorithm

- Initialise the SOM (randomly, or such that dissimilar input is mapped far apart)
- for t from 0 to N:
  - Randomly select a training instance
  - Get the best matching neuron (calculate the distance, e.g. Euclidean distance)
  - Scale its neighbors towards the training instance
    - Which neighbors? The neighborhood (hexagons, squares, Gaussian, ...) decreases over time
    - Update the neighbors towards the training instance (sketched in the code below)
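A minimal NumPy sketch of this loop; the grid size, linear decay schedules and Gaussian neighborhood are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM: a grid of neurons whose weight vectors are pulled towards
    randomly chosen training instances; learning rate and neighborhood radius
    both decrease over time."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(size=(rows, cols, X.shape[1]))            # random initial neuron weights
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                              # learning rate decreases over time
        sigma = sigma0 * (1.0 - frac) + 0.5                  # neighborhood radius decreases over time
        x = X[rng.integers(len(X))]                          # randomly select a training instance
        # Winner: neuron whose weight vector is closest (Euclidean) to the input.
        dists = np.linalg.norm(W - x, axis=2)
        winner = np.unravel_index(dists.argmin(), dists.shape)
        # Gaussian neighborhood around the winner on the map grid.
        grid_dist2 = np.sum((coords - np.array(winner)) ** 2, axis=2)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        # Move the winner and its neighbors towards the training instance.
        W += lr * h[:, :, None] * (x - W)
    return W
```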

Self Organizing Map

- Neighborhood function to preserve the topological properties of the input space
- Neighbors share the prize (postcode lottery principle)

SOM of hand-written numerals

SOM of countries (poverty)

Clustering Summary

- Unsupervised
- Many approaches
  - k-means: simple, sometimes useful
  - k-medoids is less sensitive to outliers
  - Hierarchical clustering works for symbolic attributes
  - Self-Organizing Maps
- Evaluation is a problem