CLUSTERING

Overview

- Definition of Clustering
- Existing clustering methods
- Clustering examples

Definition

- Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.
- Clustering is the process of organizing objects into groups whose members are similar in some way.
- A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

Why clustering?

- A few good reasons ...
- Simplifications
- Pattern detection
- Useful in data concept construction
- Unsupervised learning process

Where to use clustering?

- Data mining
- Information retrieval
- Text mining
- Web analysis
- Medical diagnostics

Major existing clustering methods

- Distance-based
- Hierarchical
- Partitioning
- Probabilistic

Measuring Similarity

- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
- There is a separate quality function that measures the goodness of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
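As an illustration of the points above, here is a minimal Python sketch of a few common distance functions. The function names, the per-variable weights and the simple-matching measure for categorical data are assumptions chosen for this example; the slides do not prescribe specific formulas.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_euclidean(a, b, weights):
    """Euclidean distance with one weight per variable (weights are illustrative)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def simple_matching(a, b):
    """Dissimilarity for boolean/categorical vectors: fraction of mismatches."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
```

Setting a weight to zero removes that variable from the comparison, which is one simple way to encode application-specific semantics.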

Hierarchical clustering

- Agglomerative (bottom up)
  - Start with each point as its own cluster (singleton).
  - Recursively merge the two (or more) most appropriate clusters.
  - Stop when k clusters are reached.

- Divisive (top down)
  - Start with one big cluster.
  - Recursively divide it into smaller clusters.
  - Stop when k clusters are reached.

General steps of hierarchical clustering

- Given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into K clusters.
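The four steps can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `agglomerate`, the `linkage` parameter and the toy distance matrix are assumptions made for this example. Cluster distances are recomputed from the item matrix on every pass, which keeps the code short at the cost of speed.

```python
def agglomerate(dist, k, linkage=min):
    """Merge the closest pair of clusters until only k clusters remain.

    dist    -- symmetric N x N matrix (list of lists) of item distances
    k       -- number of clusters to stop at
    linkage -- combines item distances into a cluster distance
               (min = single linkage, max = complete linkage)
    """
    if not 1 <= k <= len(dist):
        raise ValueError("k must be between 1 and the number of items")

    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        return linkage(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        # Step 2: find the closest pair of clusters.
        pairs = [(cluster_distance(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        # Merge the pair. Step 3 (distances to the new cluster) happens
        # implicitly: cluster_distance is evaluated from the item matrix.
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        # Step 4: the loop repeats until k clusters are left.
    return clusters

# Hypothetical data: five items on a line at positions 0, 1, 5, 6 and 20.
positions = [0, 1, 5, 6, 20]
dist = [[abs(p - q) for q in positions] for p in positions]
print(sorted(sorted(c) for c in agglomerate(dist, 3)))  # [[0, 1], [2, 3], [4]]
```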

- Exclusive vs. non-exclusive clustering
  - In the first case, data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster, it cannot be included in another cluster. A simple example is shown in the figure below, where the points are separated by a straight line in a two-dimensional plane.
  - In the second case, overlapping clustering, fuzzy sets are used to cluster the data, so that each point may belong to two or more clusters with different degrees of membership.
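The difference between the two cases can be sketched for a single point and a set of cluster centers. The slides do not name a particular fuzzy method, so this example assumes the membership formula of fuzzy c-means, with the fuzzifier `m` as a hypothetical parameter; the function names are also illustrative.

```python
import math

def hard_assignment(point, centers):
    """Exclusive clustering: index of the single nearest center."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def fuzzy_memberships(point, centers, m=2.0):
    """Overlapping clustering: a degree of membership in every cluster.

    Assumes the fuzzy c-means membership formula; m > 1 is the fuzzifier.
    The returned degrees sum to 1.
    """
    d = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs only to those.
    zeros = [x == 0.0 for x in d]
    if any(zeros):
        return [z / sum(zeros) for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d]

centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
print(hard_assignment(point, centers))    # 0
print(fuzzy_memberships(point, centers))  # roughly [0.9, 0.1]
```

The exclusive assignment discards how close the point is to the second center, while the fuzzy memberships keep that information as a degree between 0 and 1.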

Partitioning clustering

- Divide the data into proper subsets.
- Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering).
- This recursive relocation yields higher-quality clusters.

Probabilistic clustering

- The data are assumed to be drawn from a mixture of probability distributions.
- The mean and variance of each distribution are used as the parameters of a cluster.
- Single cluster membership
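A minimal sketch of this idea for one-dimensional data, assuming each cluster is a normal distribution described by its mean and variance. The mixing weights and all function names are assumptions added for the example; estimating the parameters from data (for instance with expectation-maximization) is outside the scope of this sketch.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def posteriors(x, components):
    """Probability that x was drawn from each mixture component.

    components -- list of (weight, mean, variance) tuples; the weights
                  are an assumption of this sketch.
    """
    joint = [w * gaussian_pdf(x, mu, var) for w, mu, var in components]
    total = sum(joint)
    return [j / total for j in joint]

def assign(x, components):
    """Single cluster membership: index of the most probable component."""
    p = posteriors(x, components)
    return max(range(len(p)), key=p.__getitem__)

# Two hypothetical clusters centered at 0 and 10, both with variance 1.
mixture = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
print(assign(1.0, mixture))  # 0
print(assign(9.0, mixture))  # 1
```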

Single-Linkage Clustering (hierarchical)

- The N×N proximity matrix is D = [d(i,j)].
- The clusterings are assigned sequence numbers 0, 1, ..., (n-1).
- L(k) is the level of the kth clustering.
- A cluster with sequence number m is denoted (m).
- The proximity between clusters (r) and (s) is denoted d[(r),(s)].

The algorithm is composed of the following steps

1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
5. If all objects are in one cluster, stop. Else, go to step 2.
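The algorithm above can be sketched in Python with the proximity matrix stored as a dictionary keyed by pairs of clusters. The function name, the data layout and the four-object distance matrix are hypothetical and chosen only to keep the example small.

```python
def single_linkage(names, matrix):
    """Single-linkage clustering that records every merge.

    names  -- labels of the N objects
    matrix -- symmetric N x N list of lists of distances
    Returns a list of (m, merged_cluster, level) tuples.
    """
    # Step 1: disjoint clustering with L(0) = 0 and m = 0.
    clusters = [(n,) for n in names]
    d = {}
    for i, a in enumerate(clusters):
        for j, b in enumerate(clusters):
            if i < j:
                d[frozenset((a, b))] = matrix[i][j]
    m, history = 0, []
    while len(clusters) > 1:
        # Step 2: least dissimilar pair (r), (s).
        pair = min(d, key=d.get)
        r, s = pair
        level = d[pair]
        # Step 3: m = m + 1, merge, and set L(m) = d[(r),(s)].
        m += 1
        merged = tuple(sorted(r + s))
        history.append((m, merged, level))
        # Step 4: drop the entries for (r) and (s), add the new cluster
        # using d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
        others = [k for k in clusters if k not in (r, s)]
        for k in others:
            d[frozenset((k, merged))] = min(d.pop(frozenset((k, r))),
                                            d.pop(frozenset((k, s))))
        del d[pair]
        clusters = others + [merged]
        # Step 5: the loop stops once everything is in one cluster.
    return history

# Hypothetical distances between four objects A, B, C and D.
names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 5, 9],
          [6, 5, 0, 4],
          [10, 9, 4, 0]]
for step in single_linkage(names, matrix):
    print(step)
# (1, ('A', 'B'), 2)
# (2, ('C', 'D'), 4)
# (3, ('A', 'B', 'C', 'D'), 5)
```

The recorded levels are exactly the heights at which branches join in the hierarchical tree.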

Hierarchical clustering example

- Let's now look at a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.
- Input distance matrix (L = 0 for all the clusters):

- The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.

- After merging MI with TO we obtain the following matrix:

- min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM; L(NA/RM) = 219, m = 2

- min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM; L(BA/NA/RM) = 255, m = 3

- min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM; L(BA/FI/NA/RM) = 268, m = 4

- Finally, we merge the last two clusters at level 295.
- The process is summarized by the following hierarchical tree:

K-means algorithm

1. It accepts as input the number of clusters to group the data into and the dataset to cluster.
2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset. Each of the 3 initial clusters will have just one row of data.

3. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, and the arithmetic mean of a cluster with one record is the set of values that make up that record. For example, if the dataset is a set of Height, Weight and Age measurements for students in a university, a record P in the dataset S is represented as P = {Age, Height, Weight}. A record containing the measurements of a student John would then be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 centimeters (1.70 meters) and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster with only the record for John as a member is {20, 170, 80}.

4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster to which it is most similar) using a measure of distance or similarity such as the Euclidean distance or the Manhattan (city-block) distance.
5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean is represented as Pmean = {Agemean, Heightmean, Weightmean}, where Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean = (80 + 120)/2. The arithmetic mean of this cluster is {25, 165, 100}. This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.

6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster to which it is most similar) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations of the K-Means clustering algorithm do not create new clusters, because the cluster center (arithmetic mean) of each cluster formed is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed or when the K-Means clustering procedure is completed.
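The procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the Euclidean distance is used, an empty cluster simply keeps its previous center, and the names `k_means`, `max_iter` and `seed` are hypothetical.

```python
import math
import random

def mean(records):
    """Arithmetic mean of a non-empty list of equal-length numeric records."""
    return tuple(sum(column) / len(records) for column in zip(*records))

def k_means(data, k, max_iter=100, seed=0):
    """Return (centers, clusters) after running K-Means on a list of tuples."""
    rng = random.Random(seed)
    # Pick K rows at random as one-record clusters; the mean of a
    # one-record cluster is the record itself.
    centers = [tuple(row) for row in rng.sample(data, k)]
    clusters = []
    for _ in range(max_iter):
        # Assign each record to the nearest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for row in data:
            nearest = min(range(k), key=lambda j: math.dist(row, centers[j]))
            clusters[nearest].append(row)
        # Recompute each center as the mean of its records.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        # Stop when the centers no longer change (stable clusters).
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# The John/Henry example from the text.
print(mean([(20, 170, 80), (30, 160, 120)]))  # (25.0, 165.0, 100.0)

# Two well-separated hypothetical groups of points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(data, 2)
print(sorted(sorted(c) for c in clusters))
```

Comparing old and new centers for equality is only one possible stopping rule; stopping when the centers move less than a small tolerance, or after a fixed number of iterations, is a common alternative.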