
Clustering

- Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters) [Jain, Murty & Flynn, ACM Computing Surveys 1999]
- Instances within a cluster are very similar
- Instances in different clusters are very different

Example

Applications

- Faster retrieval
- Faster and better browsing
- Structuring of search results
- Revealing classes and other data regularities
- Directory construction
- Better data organization in general

Cluster Searching

- Similar instances tend to be relevant to the same requests
- The query is mapped to the closest cluster by comparison with the cluster centroids
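As an illustration (not part of the original slides), a minimal sketch of mapping a query to its nearest cluster centroid by cosine similarity; all names here are invented for the example:

```python
import numpy as np

def nearest_cluster(query, centroids):
    """Index of the centroid most similar to the query (cosine similarity)."""
    q = query / np.linalg.norm(query)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Toy example: three cluster centroids over a 4-term vocabulary.
centroids = np.array([[1.0, 0.0, 0.0, 0.1],
                      [0.0, 1.0, 0.2, 0.0],
                      [0.1, 0.0, 1.0, 1.0]])
query = np.array([0.0, 0.9, 0.1, 0.0])
print(nearest_cluster(query, centroids))  # maps to cluster 1
```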

Notation

- N: number of elements
- Class: real-world grouping (ground truth)
- Cluster: grouping produced by the algorithm
- The ideal clustering algorithm produces clusters equivalent to the real-world classes, with exactly the same members

Problems

- How many clusters?
- Complexity? N is usually large
- Quality of clustering
- When is one method better than another?
- Overlapping clusters
- Sensitivity to outliers

Example

Clustering Approaches

- Divisive: build clusters top-down, starting from the entire data set
  - K-means, Bisecting K-means
  - Hierarchical or flat clustering
- Agglomerative: build clusters bottom-up, starting with individual instances and iteratively combining them to form larger clusters at higher levels
  - Hierarchical clustering
- Combinations of the above
  - Buckshot algorithm

Hierarchical vs. Flat Clustering

- Flat: all clusters at the same level
  - K-means, Buckshot
- Hierarchical: nested sequence of clusters
  - Single cluster with all data at the top, singleton clusters at the bottom
  - Intermediate levels are more useful
  - Every intermediate level combines two clusters from the next lower level
  - Agglomerative, Bisecting K-means

Flat Clustering

Hierarchical Clustering

Text Clustering

- Finds overall similarities among documents or groups of documents
- Faster searching, browsing, etc.
- Needs to know how to compute the similarity (or, equivalently, the distance) between documents

Query Document Similarity

- Similarity is defined as the cosine of the angle between the document and query vectors
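The formula itself did not survive extraction; presumably it is the standard cosine measure:

```latex
\mathrm{sim}(q,d) \;=\; \cos\theta
\;=\; \frac{\vec{q}\cdot\vec{d}}{\|\vec{q}\|\,\|\vec{d}\|}
\;=\; \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}
```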

Document Distance

- Consider documents d1, d2 with vectors u1, u2
- Their distance is defined as the length AB
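The figure defining the segment AB is missing; assuming A and B are the endpoints of u1 and u2 (unit-normalized, as these slides state later), the length AB relates directly to the cosine similarity:

```latex
|AB| \;=\; \|u_1 - u_2\|
\;=\; \sqrt{\|u_1\|^2 + \|u_2\|^2 - 2\,u_1\cdot u_2}
\;=\; \sqrt{2 - 2\cos\theta} \quad \text{for unit vectors}
```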

Normalization by Document Length

- The longer the document is, the more likely it is for a given term to appear in it
- Normalize the term weights by document length (so that terms in long documents are not given more weight)
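A minimal sketch of length normalization, assuming raw term-weight vectors; names are illustrative:

```python
import numpy as np

def length_normalize(doc_vectors):
    """Divide each document vector by its Euclidean length, so long
    documents do not receive more weight in similarity computations."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return doc_vectors / np.where(norms == 0.0, 1.0, norms)

docs = np.array([[3.0, 0.0, 1.0],     # short document
                 [30.0, 0.0, 10.0]])  # 10x longer, same term profile
print(length_normalize(docs))         # both rows become identical
```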

Evaluation of Cluster Quality

- Clusters can be evaluated using internal or external knowledge
- Internal measures: intra-cluster cohesion and cluster separability
  - intra-cluster similarity
  - inter-cluster similarity
- External measures: quality of clusters compared to real classes
  - Entropy (E), Harmonic Mean (F)

Intra Cluster Similarity

- A measure of cluster cohesion
- Defined as the average pair-wise similarity of the documents in a cluster:
  \frac{1}{n^2} \sum_{d_i \in C} \sum_{d_j \in C} d_i \cdot d_j = \|c\|^2, \quad \text{where } c = \frac{1}{n} \sum_{d \in C} d \text{ is the cluster centroid}
- Documents (not centroids) have unit length
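A quick numeric check of this identity (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 3))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length documents

avg_pairwise = (docs @ docs.T).mean()                 # (1/n^2) sum_i sum_j d_i . d_j
centroid = docs.mean(axis=0)
print(np.isclose(avg_pairwise, centroid @ centroid))  # True: equals ||c||^2
```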

Inter Cluster Similarity

- Single Link: similarity of the two most similar members
- Complete Link: similarity of the two least similar members
- Group Average: average similarity between members
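A sketch of the three definitions, using cosine similarity between the members of two small example clusters (data invented for illustration):

```python
import numpy as np

def pairwise_cos(A, B):
    """Cosine similarity of every pair (a in A, b in B)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def single_link(A, B):   return pairwise_cos(A, B).max()   # most similar pair
def complete_link(A, B): return pairwise_cos(A, B).min()   # least similar pair
def group_average(A, B): return pairwise_cos(A, B).mean()  # average over pairs

A = np.array([[1.0, 0.1], [0.9, 0.2]])
B = np.array([[0.1, 1.0], [0.3, 0.9]])
print(single_link(A, B), complete_link(A, B), group_average(A, B))
```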

Example

Entropy

- Measures the quality of flat clusters using external knowledge
  - Pre-existing classification
  - Assessment by experts
- P_ij: the probability that a member of cluster j belongs to class i
- The entropy of cluster j is defined as
  E_j = -\sum_i P_{ij} \log P_{ij}

Entropy (cont)

- Total entropy over all clusters:
  E = \sum_{j=1}^{m} \frac{n_j}{N} E_j
- where n_j is the size of cluster j, m is the number of clusters, and N is the number of instances
- The smaller the value of E, the better the quality of the clustering algorithm
- The best entropy is obtained when each cluster contains exactly one instance
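A sketch of the computation, assuming non-negative integer class and cluster labels (names illustrative):

```python
import numpy as np

def total_entropy(classes, clusters):
    """E = sum_j (n_j / N) * E_j, with E_j = -sum_i P_ij log P_ij."""
    N, E = len(classes), 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        p = np.bincount(members) / len(members)  # P_ij over the classes i
        p = p[p > 0]                             # 0 log 0 is taken as 0
        E += (len(members) / N) * -(p * np.log(p)).sum()
    return E

classes  = np.array([0, 0, 0, 1, 1, 1])   # ground-truth classes
clusters = np.array([0, 0, 1, 1, 1, 1])   # one pure, one impure cluster
print(total_entropy(classes, clusters))   # 0 only for pure clusters
```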

Harmonic Mean (F)

- Treats each cluster as a query result
- F combines precision (P) and recall (R)
- F_ij for cluster j and class i is defined as
  F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}}, \quad P_{ij} = \frac{n_{ij}}{n_j}, \quad R_{ij} = \frac{n_{ij}}{n_i}
- n_ij: number of instances of class i in cluster j
- n_i: number of instances of class i
- n_j: number of instances in cluster j

Harmonic Mean (cont)

- The F value of a class i is the maximum value it achieves over all clusters j:
  F_i = \max_j F_{ij}
- The F value of a clustering solution is computed as the weighted average over all classes:
  F = \sum_i \frac{n_i}{N} F_i
- where N is the number of data instances
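A sketch of the F computation from the counts defined above (names illustrative):

```python
import numpy as np

def f_measure(classes, clusters):
    """F = sum_i (n_i / N) * max_j F_ij, with F_ij = 2PR / (P + R)."""
    N, F = len(classes), 0.0
    for i in np.unique(classes):
        n_i, best = (classes == i).sum(), 0.0
        for j in np.unique(clusters):
            n_j = (clusters == j).sum()
            n_ij = ((classes == i) & (clusters == j)).sum()
            if n_ij:                      # precision and recall for cluster j
                P, R = n_ij / n_j, n_ij / n_i
                best = max(best, 2 * P * R / (P + R))
        F += (n_i / N) * best
    return F

classes  = np.array([0, 0, 0, 1, 1, 1])
clusters = np.array([0, 0, 1, 1, 1, 1])
print(f_measure(classes, clusters))       # 1.0 only for a perfect clustering
```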

Quality of Clustering

- A good clustering method
  - Maximizes intra-cluster similarity
  - Minimizes inter-cluster similarity
  - Minimizes entropy (E)
  - Maximizes the harmonic mean (F)
- It is difficult to achieve all of these simultaneously
- Maximize some objective function of the above
- An algorithm is better than another if it has better values on most of these measures

K-means Algorithm

- Select K centroids
- Repeat I times, or until the centroids do not change:
  - Assign each instance to the cluster represented by its nearest centroid
  - Compute new centroids
  - Reassign instances
  - Compute new centroids
  - ...
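A minimal K-means sketch following these steps (random initial centroids, at most I iterations); variable names are illustrative:

```python
import numpy as np

def kmeans(X, K, I=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # select K centroids
    for _ in range(I):
        # assign each instance to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # compute new centroids; stop early if they do not change
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
print(kmeans(X, K=2)[0])   # two well-separated clusters
```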

K-Means demo (1/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

K-Means demo (2/7)

K-Means demo (3/7)

K-Means demo (4/7)

K-Means demo (5/7)

K-Means demo (6/7)

K-Means demo (7/7)

Comments on K-Means (1)

- Generates a flat partition of K clusters
- K is the desired number of clusters and must be known in advance
- Starts with K random cluster centroids
- A centroid is the mean or the median of a group of instances
- The mean rarely corresponds to a real instance

Comments on K-Means (2)

- Up to I = 10 iterations
- Keep the clustering that resulted in the best inter/intra-cluster similarity, or the final clusters after I iterations
- Complexity: O(IKN)
- A repeated application of K-means for K = 2, 4, ... can produce a hierarchical clustering

Choosing Centroids for K-means

- The quality of the clustering depends on the selection of the initial centroids
- Random selection may result in a poor convergence rate, or in convergence to sub-optimal clusterings
- Select good initial centroids using a heuristic or the results of another method
  - Buckshot algorithm

Incremental K-Means

- Update each centroid during each iteration, after each point is assigned to a cluster, rather than at the end of the iteration
- Reassign instances to clusters at the end of each iteration
- Converges faster than simple K-means
- Usually 2-5 iterations
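A sketch of the incremental idea in the spirit of MacQueen's online update (the centroid moves as soon as each point is assigned); this is one interpretation of the slide, not its exact algorithm:

```python
import numpy as np

def incremental_kmeans(X, K, I=5, seed=0):
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), K, replace=False)].copy()
    counts = np.ones(K)
    for _ in range(I):
        for x in X:                          # update centroid after each point
            k = np.linalg.norm(c - x, axis=1).argmin()
            counts[k] += 1
            c[k] += (x - c[k]) / counts[k]   # running-mean update
    labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
    return labels, c

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
print(incremental_kmeans(X, K=2)[0])
```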

Bisecting K-Means

- Starts with a single cluster containing all instances
- Select a cluster to split: the largest cluster, or the one with the least intra-cluster similarity
- The selected cluster is split into 2 partitions using K-means (K = 2)
- Repeat up to the desired depth h
- Produces a hierarchical clustering
- Complexity: O(2hN)
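A sketch of bisecting K-means that always splits the largest cluster; the inline two_means is a bare-bones K = 2 K-means kept here for self-containment:

```python
import numpy as np

def two_means(X, I=10, seed=0):
    """Bare-bones K-means with K = 2."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(I):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                      else c[k] for k in range(2)])
    return labels

def bisecting_kmeans(X, depth):
    clusters = [np.arange(len(X))]           # single cluster with everything
    for _ in range(depth):
        i = max(range(len(clusters)), key=lambda t: len(clusters[t]))
        idx = clusters.pop(i)                # split the largest cluster
        labels = two_means(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]], dtype=float)
print(bisecting_kmeans(X, depth=2))          # 3 clusters after 2 splits
```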

Agglomerative Clustering

- Compute the similarity matrix between all pairs of instances
- Start from singleton clusters
- Repeat until a single cluster remains:
  - Merge the two most similar clusters
  - Replace them with a single cluster
  - Replace the merged clusters in the matrix and update the similarity matrix
- Complexity: O(N2)
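A sketch of this loop with single-link updates on a similarity matrix; note that a naive implementation like this one costs more than O(N^2) overall:

```python
import numpy as np

def agglomerative(S, target=1):
    """Merge until `target` clusters remain. S is an N x N similarity matrix.
    Single link: the merged row keeps the max of the two old rows."""
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)             # never merge a cluster with itself
    clusters = [[i] for i in range(len(S))]
    active = list(range(len(S)))
    while len(active) > target:
        sub = S[np.ix_(active, active)]      # most similar pair of clusters
        a, b = np.unravel_index(sub.argmax(), sub.shape)
        i, j = active[a], active[b]
        clusters[i] += clusters[j]           # merge j into i
        S[i, :] = np.maximum(S[i, :], S[j, :])
        S[:, i] = S[i, :]
        S[i, i] = -np.inf
        active.remove(j)
    return [clusters[i] for i in active]

S = np.array([[1.0, 0.8, 0.3],               # the similarity matrix from the
              [0.8, 1.0, 0.6],               # example slides below
              [0.3, 0.6, 1.0]])
print(agglomerative(S, target=2))            # [[0, 1], [2]]
```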

Similarity Matrix

          C1={d1}   C2={d2}   ...   CN={dN}
C1={d1}      1        0.8     ...     0.3
C2={d2}     0.8        1      ...     0.6
  ...                          1
CN={dN}     0.3       0.6     ...      1

Update Similarity Matrix

          C1={d1}   C2={d2}   ...   CN={dN}
C1={d1}      1        0.8     ...     0.3    <- merged
C2={d2}     0.8        1      ...     0.6    <- merged
  ...                          1
CN={dN}     0.3       0.6     ...      1

C1 and C2, the most similar pair (similarity 0.8), are merged.

New Similarity Matrix

              C12={d1,d2}   ...   CN={dN}
C12={d1,d2}       1         ...     0.4
    ...                      1
CN={dN}          0.4        ...      1

Single Link

- Select the most similar clusters for merging using single link
- Can result in long and thin clusters due to the chaining effect
- Appropriate in some domains, such as clustering islands

Complete Link

- Select the most similar clusters for merging using complete link
- Results in compact, spherical clusters that are often preferable

Group Average

- Select the most similar clusters for merging using group average
- A fast compromise between single and complete link

Example

Inter Cluster Similarity

- A new cluster is represented by its centroid
- The document-to-cluster similarity is computed as the similarity between the document vector and the cluster centroid
- The cluster-to-cluster similarity can be computed as single link, complete link, or group average similarity

Buckshot K-Means

- Combines Agglomerative and K-Means
- Agglomerative clustering produces a good clustering solution but has O(N2) complexity
- Randomly select a sample of √N instances
- Apply Agglomerative clustering on the sample, which takes O(N) time
- Take the centroids of the resulting clusters as the initial centroids for K-Means
- Overall complexity is O(N)
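A sketch of the Buckshot initialization using scipy's hierarchical clustering on a √N sample; the helper name is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot_init(X, K, seed=0):
    """Initial K-means centroids from agglomerative clustering
    of a random sample of about sqrt(N) instances."""
    rng = np.random.default_rng(seed)
    n = max(K, int(np.sqrt(len(X))))
    sample = X[rng.choice(len(X), n, replace=False)]
    labels = fcluster(linkage(sample, method='average'),
                      t=K, criterion='maxclust')
    return np.array([sample[labels == k].mean(axis=0)
                     for k in range(1, K + 1)])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 5, 10)])
print(buckshot_init(X, K=3))   # three centroids near 0, 5 and 10
```

These centroids would then seed the K-means pass over the full data set.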

Example

initial centroids for K-Means

More on Clustering

- Sound methods based on the document-to-document similarity matrix
  - graph-theoretic methods
  - O(N2) time
- Iterative methods operating directly on the document vectors
  - O(N log N), O(N2 / log N), O(mN) time

Soft Clustering

- Hard clustering: each instance belongs to exactly one cluster
  - Does not allow for uncertainty
  - Yet an instance may belong to two or more clusters
- Soft clustering is based on the probabilities that an instance belongs to each of a set of clusters
  - The probabilities over all categories must sum to 1
- Expectation Maximization (EM) is the most popular approach
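For illustration, scikit-learn's Gaussian mixture model (fit with EM) yields exactly such soft memberships; this is one common realization, not necessarily the one the slides had in mind:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # two overlapping groups
               rng.normal(4, 1, (50, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)                  # soft cluster memberships
print(probs[:3].round(3))                    # each row sums to 1
print(probs.sum(axis=1)[:3])
```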

More Methods

- Two documents with similarity > T (a threshold) are connected with an edge [Duda & Hart, 1973]
  - The clusters are the connected components (maximal cliques) of the resulting graph
  - Problem: the selection of an appropriate threshold T
- Zahn's method [Zahn, 1971]

Zahn's Method [Zahn, 1971]

the dashed edge is inconsistent and is deleted

- Find the minimum spanning tree
- For each document, delete incident edges with length l > l_avg
  - l_avg: the average length of its incident edges
- The clusters are the connected components of the resulting graph
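A sketch of the idea using scipy's minimum spanning tree; simplified in that an edge is deleted when it is much longer than the global average MST edge length, rather than the per-document incident average described above:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def zahn_clusters(X, factor=2.0):
    """MST-based clustering: delete unusually long ('inconsistent') edges,
    then return the connected components as clusters."""
    D = squareform(pdist(X))                     # pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()
    l_avg = mst[mst > 0].mean()                  # average MST edge length
    mst[mst > factor * l_avg] = 0.0              # delete inconsistent edges
    _, labels = connected_components(mst, directed=False)
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
print(zahn_clusters(X))   # two components: {0, 1, 2} and {3, 4}
```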

References

- "Searching Multimedia Databases by Content",

Christos Faloutsos, Kluwer Academic Publishers,

1996 - A Comparison of Document Clustering Techniques,

M. Steinbach, G. Karypis, V. Kumar, In KDD

Workshop on Text Mining,2000 - Data Clustering A Review, A.K. Jain, M.N.

Murphy, P.J. Flynn, ACM Comp. Surveys, Vol. 31,

No. 3, Sept. 99. - Algorithms for Clustering Data A.K. Jain, R.C.

Dubes Prentice-Hall , 1988, ISBN 0-13-022278-X - Automatic Text Processing The Transformation,

Analysis, and Retrieval of Information by

Computer, G. Salton, Addison-Wesley, 1989