1
Clustering
  • Clustering is the unsupervised classification of
    patterns (observations, data items, or feature
    vectors) into groups (clusters) [ACM CS '99]
  • Instances within a cluster are very similar
  • Instances in different clusters are very different

2
Example
3
Applications
  • Faster retrieval
  • Faster and better browsing
  • Structuring of search results
  • Revealing classes and other data regularities
  • Directory construction
  • Better data organization in general

4
Cluster Searching
  • Similar instances tend to be relevant to the same
    requests
  • The query is mapped to the closest cluster by
    comparing it with the cluster centroids
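A minimal NumPy sketch of this centroid comparison; the unit-normalization assumption and all names (nearest_cluster, centroids) are illustrative, not from the slides:

import numpy as np

def nearest_cluster(query_vec, centroids):
    # Rows of `centroids` and `query_vec` are assumed unit-normalized,
    # so a dot product equals the cosine similarity.
    sims = centroids @ query_vec           # similarity of the query to each centroid
    return int(np.argmax(sims))

# Illustrative usage: 3 cluster centroids over a 4-term vocabulary
centroids = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 0.6, 0.8]])
query = np.array([0.0, 0.0, 0.8, 0.6])
print(nearest_cluster(query, centroids))   # -> 2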

5
Notation
  • N: number of elements
  • Class: real-world grouping (ground truth)
  • Cluster: grouping produced by the algorithm
  • The ideal clustering algorithm will produce
    clusters equivalent to real world classes with
    exactly the same members

6
Problems
  • How many clusters?
  • Complexity? N is usually large
  • Quality of clustering
  • When is one method better than another?
  • Overlapping clusters
  • Sensitivity to outliers

7
Example
8
Clustering Approaches
  • Divisive: build clusters top-down, starting from
    the entire data set
  • K-means, Bisecting K-means
  • Hierarchical or flat clustering
  • Agglomerative: build clusters bottom-up, starting
    with individual instances and iteratively
    combining them to form larger clusters at higher
    levels
  • Hierarchical clustering
  • Combinations of the above
  • Buckshot algorithm

9
Hierarchical vs. Flat Clustering
  • Flat: all clusters at the same level
  • K-means, Buckshot
  • Hierarchical: nested sequence of clusters
  • A single cluster with all data at the top;
    singleton clusters at the bottom
  • Intermediate levels are more useful
  • Every intermediate level combines two clusters
    from the next lower level
  • Agglomerative, Bisecting K-means

10
Flat Clustering
11
Hierarchical Clustering
12
Text Clustering
  • Finds overall similarities among documents or
    groups of documents
  • Faster searching, browsing, etc.
  • Requires a way to compute the similarity (or
    equivalently the distance) between documents

13
Query Document Similarity
  • Similarity is defined as the cosine of the angle
    between document and query vectors
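The formula itself did not survive transcription; the standard cosine measure it refers to, for document vector d and query vector q, is:

\mathrm{sim}(d, q) = \cos\theta = \frac{d \cdot q}{\|d\| \, \|q\|}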

14
Document Distance
  • Consider documents d1, d2 with vectors u1, u2
  • Their distance is defined as the length AB
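The accompanying figure is missing from the transcript; assuming A and B are the endpoints of the unit-normalized vectors u1/||u1|| and u2/||u2|| (consistent with the normalization on the next slide), the length AB is the chord between them:

|AB| = \left\| \frac{u_1}{\|u_1\|} - \frac{u_2}{\|u_2\|} \right\| = \sqrt{2 - 2\cos\theta}

where θ is the angle between u1 and u2.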

15
Normalization by Document Length
  • The longer the document is, the more likely it is
    for a given term to appear in it
  • Normalize the term weights by document length
    (so terms in long documents are not given more
    weight)
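A common form of this normalization (an assumption here, since the slide's formula is not in the transcript) divides each term weight by the document's Euclidean length, yielding unit-length vectors:

w'_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^2}}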

16
Evaluation of Cluster Quality
  • Clusters can be evaluated using internal or
    external knowledge
  • Internal measures: intra-cluster cohesion and
    cluster separability
  • intra-cluster similarity
  • inter-cluster similarity
  • External measures: quality of clusters compared
    to real classes
  • Entropy (E), Harmonic Mean (F)

17
Intra Cluster Similarity
  • A measure of cluster cohesion
  • Defined as the average pair-wise similarity of
    the documents in a cluster:
    \mathrm{sim}(C) = \frac{1}{|C|^2} \sum_{d, d' \in C} \cos(d, d') = \|c\|^2
  • where c = \frac{1}{|C|} \sum_{d \in C} d is the cluster centroid
  • Documents (not centroids) have unit length

18
Inter Cluster Similarity
  • Single Link: similarity of the two most similar
    members
  • Complete Link: similarity of the two least
    similar members
  • Group Average: average similarity between members
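The slide's formulas did not survive transcription; the standard definitions these names refer to, for clusters C_i and C_j, are:

\mathrm{sim}_{\mathrm{single}}(C_i, C_j) = \max_{d \in C_i,\, d' \in C_j} \cos(d, d')

\mathrm{sim}_{\mathrm{complete}}(C_i, C_j) = \min_{d \in C_i,\, d' \in C_j} \cos(d, d')

\mathrm{sim}_{\mathrm{group}}(C_i, C_j) = \frac{1}{|C_i| \, |C_j|} \sum_{d \in C_i} \sum_{d' \in C_j} \cos(d, d')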

19
Example
20
Entropy
  • Measures the quality of flat clusters using
    external knowledge
  • Pre-existing classification
  • Assessment by experts
  • P_ij: probability that a member of cluster j
    belongs to class i
  • The entropy of cluster j is defined as
    E_j = -\sum_i P_{ij} \log P_{ij}

21
Entropy (cont)
  • Total entropy over all clusters:
    E = \sum_{j=1}^{m} \frac{n_j}{N} E_j
  • where n_j is the size of cluster j
  • m is the number of clusters
  • N is the number of instances
  • The smaller the value of E is the better the
    quality of the algorithm is
  • The best entropy is obtained when each cluster
    contains exactly one instance

22
Harmonic Mean (F)
  • Treats each cluster as a query result
  • F combines precision (P) and recall (R)
  • F_ij for cluster j and class i is defined as
    F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}},
    where P_{ij} = n_{ij}/n_j and R_{ij} = n_{ij}/n_i
  • n_ij: number of instances of class i in cluster j
  • n_i: number of instances of class i
  • n_j: number of instances in cluster j

23
Harmonic Mean (cont)
  • The F value of any class i is the maximum value
    it achieves over all clusters j:
    F_i = \max_j F_{ij}
  • The F value of a clustering solution is computed
    as the weighted average over all classes:
    F = \sum_i \frac{n_i}{N} F_i
  • where N is the number of data instances
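A small sketch computing both external measures from label arrays; the function and variable names are illustrative, not from the slides:

import numpy as np

def entropy_and_f(classes, clusters):
    # Total entropy E and F measure, given ground-truth class labels
    # and cluster labels of equal length.
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    N = len(classes)
    class_ids, cluster_ids = np.unique(classes), np.unique(clusters)
    # n[i, j] = number of instances of class i in cluster j
    n = np.array([[np.sum((classes == ci) & (clusters == cj))
                   for cj in cluster_ids] for ci in class_ids])
    n_i = n.sum(axis=1)                  # class sizes
    n_j = n.sum(axis=0)                  # cluster sizes
    P = n / n_j                          # precision P_ij = n_ij / n_j
    R = n / n_i[:, None]                 # recall    R_ij = n_ij / n_i
    with np.errstate(divide="ignore", invalid="ignore"):
        E_j = -np.sum(np.where(P > 0, P * np.log(P), 0.0), axis=0)
        F_ij = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    E = np.sum((n_j / N) * E_j)                  # weighted total entropy
    F = np.sum((n_i / N) * F_ij.max(axis=1))     # weighted best F per class
    return E, F

print(entropy_and_f([0, 0, 1, 1], [0, 0, 1, 1]))  # perfect clustering: (0.0, 1.0)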

24
Quality of Clustering
  • A good clustering method
  • Maximizes intra-cluster similarity
  • Minimizes inter-cluster similarity
  • Minimizes Entropy
  • Maximizes Harmonic Mean
  • Difficult to achieve all of these simultaneously
  • Maximize some objective function of the above
  • An algorithm is better than another if it has
    better values on most of these measures

25
K-means Algorithm
  • Select K centroids
  • Repeat I times or until the centroids no longer
    change:
  • Assign each instance to the cluster represented
    by its nearest centroid
  • Compute new centroids
  • Reassign instances
  • Compute new centroids
  • ...
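A minimal NumPy sketch of this loop, assuming unit-length document vectors and cosine similarity; all names are illustrative:

import numpy as np

def kmeans(X, K, iters=10, seed=0):
    # Flat K-means over the rows of X (unit-length document vectors).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = None
    for _ in range(iters):
        # cosine similarity of every instance to every (normalized) centroid
        norm = np.linalg.norm(centroids, axis=1, keepdims=True)
        sims = X @ (centroids / norm).T
        new_assign = sims.argmax(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                              # assignments stable: stop early
        assign = new_assign
        for k in range(K):                     # recompute centroids as means
            members = X[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return assign, centroids

# Illustrative usage on 4 two-term "documents"
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(kmeans(X, K=2)[0])    # e.g. [0 0 1 1] (cluster labels may be permuted)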

26
K-Means demo (1/7) http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
27
K-Means demo (2/7)
28
K-Means demo (3/7)
29
K-Means demo (4/7)
30
K-Means demo (5/7)
31
K-Means demo (6/7)
32
K-Means demo (7/7)
33
Comments on K-Means (1)
  • Generates a flat partition of K clusters
  • K is the desired number of clusters and must be
    known in advance
  • Starts with K random cluster centroids
  • A centroid is the mean or the median of a group
    of instances
  • The mean rarely corresponds to a real instance

34
Comments on K-Means (2)
  • Up to I = 10 iterations
  • Keep the clustering that resulted in the best
    inter/intra-cluster similarity, or the final
    clusters after I iterations
  • Complexity: O(IKN)
  • Repeated application of K-Means for K = 2, 4, ...
    can produce a hierarchical clustering

35
Choosing Centroids for K-means
  • Quality of clustering depends on the selection of
    initial centroids
  • Random selection may result in poor convergence
    rate, or convergence to sub-optimal clusterings.
  • Select good initial centroids using a heuristic
    or the results of another method
  • Buckshot algorithm

36
Incremental K-Means
  • Update centroids as each point is assigned to a
    cluster, rather than at the end of each iteration
  • Reassign instances to clusters at the end of each
    iteration
  • Converges faster than simple K-means
  • Usually 2-5 iterations

37
Bisecting K-Means
  • Starts with a single cluster with all instances
  • Select a cluster to split: the larger cluster,
    or the cluster with the lowest intra-cluster
    similarity
  • The selected cluster is split into 2 partitions
    using K-means (K = 2)
  • Repeat up to the desired depth h
  • Hierarchical clustering
  • Complexity: O(2hN)
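A compact sketch of the splitting loop; it uses scikit-learn's KMeans for the two-way splits (a library choice of mine, not the slides') and always splits the largest remaining cluster, one of the two criteria above:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_clusters):
    # Split the largest cluster in two until num_clusters remain.
    # Assumes every cluster chosen for splitting has at least 2 members.
    clusters = [np.arange(len(X))]          # start: one cluster with everything
    while len(clusters) < num_clusters:
        clusters.sort(key=len)
        target = clusters.pop()             # largest cluster
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=0).fit_predict(X[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

# Illustrative usage
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 0], [9, 1]], dtype=float)
for c in bisecting_kmeans(X, 3):
    print(c)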

38
Agglomerative Clustering
  • Compute the similarity matrix between all pairs
    of instances
  • Start from singleton clusters
  • Repeat until a single cluster remains:
  • Merge the two most similar clusters
  • Replace them with a single cluster
  • Replace the merged clusters in the matrix and
    update the similarity matrix
  • Complexity: O(N²)
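A naive sketch of this loop for single link, maintaining the similarity matrix directly as the slides describe; the np.maximum update is the single-link rule (complete link would use np.minimum), and no attempt is made at optimized bookkeeping:

import numpy as np

def agglomerative_single_link(S, num_clusters=1):
    # Merge clusters bottom-up given a precomputed similarity matrix S.
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)            # never merge a cluster with itself
    clusters = {i: [i] for i in range(len(S))}
    active = list(clusters)
    while len(active) > num_clusters:
        # find the two most similar active clusters
        sub = S[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmax(sub), sub.shape)
        i, j = active[a], active[b]
        # single link: similarity to the merged cluster is the max of the two
        S[i, :] = np.maximum(S[i, :], S[j, :])
        S[:, i] = S[i, :]
        S[i, i] = -np.inf
        clusters[i].extend(clusters[j])     # cluster i absorbs cluster j
        del clusters[j]
        active.remove(j)
    return list(clusters.values())

# Illustrative usage with the 3x3 top-left corner of the slide's matrix
S = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.6],
              [0.3, 0.6, 1.0]])
print(agglomerative_single_link(S, num_clusters=2))  # -> [[0, 1], [2]]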

39
Similarity Matrix
           C1=d1   C2=d2   ...   CN=dN
  C1=d1     1       0.8    ...    0.3
  C2=d2     0.8     1      ...    0.6
  ...                      1
  CN=dN     0.3     0.6    ...    1
40
Update Similarity Matrix
(C1 and C2, the most similar pair at 0.8, are marked for merging)

           C1=d1   C2=d2   ...   CN=dN
  C1=d1     1       0.8    ...    0.3     ← merged
  C2=d2     0.8     1      ...    0.6     ← merged
  ...                      1
  CN=dN     0.3     0.6    ...    1
41
New Similarity Matrix
               C12=d1∪d2   ...   CN=dN
  C12=d1∪d2        1       ...    0.4
  ...                      1
  CN=dN           0.4      ...    1
42
Single Link
  • Selecting the most similar clusters for merging
    using single link
  • Can result in long and thin clusters due to the
    chaining effect
  • Appropriate in some domains, such as clustering
    islands

43
Complete Link
  • Selecting the most similar clusters for merging
    using complete link
  • Results in compact, roughly spherical clusters,
    which are often preferable

44
Group Average
  • Selecting the most similar clusters for merging
    using group average
  • A fast compromise between single and complete link

45
Example
46
Inter Cluster Similarity
  • A new cluster is represented by its centroid
  • The document-to-cluster similarity is computed as
    \mathrm{sim}(d, C) = \cos(d, c) = \frac{d \cdot c}{\|d\| \, \|c\|}
  • The cluster-to-cluster similarity can be computed
    as single, complete, or group-average similarity

47
Buckshot K-Means
  • Combines Agglomerative and K-Means
  • Agglomerative clustering produces a good solution
    but has O(N²) complexity
  • Randomly select a sample of √N instances
  • Applying Agglomerative on the sample takes
    O((√N)²) = O(N) time
  • Take the centroids of the resulting clusters as
    input to K-Means
  • Overall complexity is O(N)
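A sketch of the pipeline under the assumptions above, using SciPy's hierarchical clustering on the sample and seeding scikit-learn's KMeans with the resulting centroids; the library choices and names are illustrative:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

def buckshot(X, K, seed=0):
    # Agglomerative clustering on a sqrt(N) sample seeds K-means on all data.
    rng = np.random.default_rng(seed)
    n_sample = max(K, int(np.sqrt(len(X))))
    sample = X[rng.choice(len(X), size=n_sample, replace=False)]
    Z = linkage(sample, method="average")
    labels = fcluster(Z, t=K, criterion="maxclust")
    seeds = np.array([sample[labels == k].mean(axis=0)
                      for k in np.unique(labels)])
    km = KMeans(n_clusters=len(seeds), init=seeds, n_init=1).fit(X)
    return km.labels_, km.cluster_centers_

# Illustrative usage: three well-separated blobs
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(30, 2))
               for c in (0, 5, 10)])
print(np.bincount(buckshot(X, K=3)[0]))   # roughly 30 instances per cluster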

48
Example
initial centroids for K-Means
49
More on Clustering
  • Sound methods based on the document-to-document
    similarity matrix
  • graph-theoretic methods
  • O(N²) time
  • Iterative methods operating directly on the
    document vectors
  • O(N log N), O(N²/log N), O(mN) time

50
Soft Clustering
  • Hard clustering: each instance belongs to exactly
    one cluster
  • Does not allow for uncertainty
  • An instance may belong to two or more clusters
  • Soft clustering is based on probabilities that an
    instance belongs to each of a set of clusters
  • The probabilities over all clusters must sum to 1
  • Expectation Maximization (EM) is the most popular
    approach
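A short soft-clustering sketch using a Gaussian mixture fitted by EM (scikit-learn's GaussianMixture; the library choice and data are illustrative, not from the slides):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two overlapping 1-D blobs
X = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(3, 1, 100)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
probs = gm.predict_proba(X)        # soft memberships, one row per instance
print(probs[0], probs[0].sum())    # each row sums to 1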

51
More Methods
  • Two documents with similarity > T (threshold) are
    connected with an edge [Duda & Hart '73]
  • Clusters: the connected components (maximal
    cliques) of the resulting graph
  • Problem: selection of an appropriate threshold T
  • Zahn's method [Zahn '71]

52
Zahn's Method [Zahn '71]
(figure: the dashed edge is inconsistent and is deleted)
  • Find the minimum spanning tree
  • For each doc, delete incident edges with length
    l > l_avg
  • l_avg: the average length of the doc's incident
    edges
  • Clusters: the connected components of the
    remaining graph
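A sketch of this procedure with SciPy; the inconsistency test implements my reading of the bullets (an edge is dropped when it is longer than the average incident-edge length at either endpoint), and all names are illustrative:

import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def zahn_clusters(X):
    # Cluster by deleting "inconsistent" MST edges.
    D = squareform(pdist(X))                 # pairwise distances
    mst = minimum_spanning_tree(D).toarray()
    mst = mst + mst.T                        # symmetric adjacency of the MST
    # average incident-edge length per node
    avg = np.array([row[row > 0].mean() if (row > 0).any() else 0.0
                    for row in mst])
    keep = mst.copy()
    for i, j in zip(*np.nonzero(mst)):
        if mst[i, j] > avg[i] or mst[i, j] > avg[j]:
            keep[i, j] = 0.0                 # delete inconsistent edge
    _, labels = connected_components(keep, directed=False)
    return labels

# Illustrative usage: two well-separated groups
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
print(zahn_clusters(X))    # -> [0 0 0 1 1 1]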

53
References
  • "Searching Multimedia Databases by Content",
    Christos Faloutsos, Kluwer Academic Publishers,
    1996
  • A Comparison of Document Clustering Techniques,
    M. Steinbach, G. Karypis, V. Kumar, In KDD
    Workshop on Text Mining,2000
  • Data Clustering A Review, A.K. Jain, M.N.
    Murphy, P.J. Flynn, ACM Comp. Surveys, Vol. 31,
    No. 3, Sept. 99.
  • Algorithms for Clustering Data A.K. Jain, R.C.
    Dubes Prentice-Hall , 1988, ISBN 0-13-022278-X
  • Automatic Text Processing The Transformation,
    Analysis, and Retrieval of Information by
    Computer, G. Salton, Addison-Wesley, 1989