Clustering

- Quan Zou
- Ph.D., Assistant Professor
- http://datamining.xmu.edu.cn/zq/

Outline

- Introduction of Clustering
- K-means Clustering
- Hierarchical Clustering

What is Clustering

- Cluster: a collection of data objects that are
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (or clustering, data segmentation, ...)
  - Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (i.e., learning by observations, as opposed to supervised learning by examples)

Clustering for Data Understanding and Applications

- Biology: taxonomy of living things (kingdom, phylum, class, order, family, genus, and species)
- Information retrieval: document clustering
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Clustering as a Preprocessing Tool (Utility)

- Summarization
  - Preprocessing for regression, PCA, classification, and association analysis
- Compression
  - Image processing: vector quantization
- Finding K-nearest neighbors
  - Localizing search to one or a small number of clusters
- Outlier detection
  - Outliers are often viewed as those far away from any cluster
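
The outlier-detection bullet can be illustrated with a minimal sketch. It assumes cluster centroids are already available and flags a point when its distance to the nearest centroid exceeds a cutoff. The centroids, sample points, threshold value, and function names are illustrative assumptions and do not come from the slides.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

def flag_outliers(points, centroids, threshold):
    """Return the points lying farther than `threshold` from every centroid."""
    return [p for p in points
            if nearest_centroid_distance(p, centroids) > threshold]

# Hypothetical cluster centres and observations, chosen only for illustration.
centroids = [(1.0, 1.0), (5.0, 5.0)]
points = [(1.2, 0.9), (4.8, 5.1), (9.0, 0.0)]

outliers = flag_outliers(points, centroids, threshold=2.0)
print(outliers)
```

The threshold is a free parameter here; in practice it would have to be chosen from the data, for example relative to the typical within-cluster distance.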

?????????

- ????
- ?????????
- ??????????
- ????????(??)
- ????
- ??????
- ??????
- ???????
- ????????????

?????????

???

?????

?? ?? ???

1 255106106 1.2

2 25511486 1.0

3 255239219 0.5

1 2 3

1 0

2 0

3 0

The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change
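
The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the round-robin initial partition, the `max_iter` safety cap, the function names, and the sample points are all assumptions made for the example.

```python
import math

def centroid(members):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(members)
    return tuple(sum(axis) / n for axis in zip(*members))

def k_means(points, k, max_iter=100):
    # Step 1: partition the objects into k nonempty subsets
    # (assumption: a simple round-robin split; requires len(points) >= k).
    labels = [i % k for i in range(len(points))]
    seeds = [None] * k
    for _ in range(max_iter):
        # Step 2: seed points are the centroids of the current clusters.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous seed if a cluster became empty
                seeds[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest seed.
        new_labels = [min(range(k), key=lambda c: math.dist(p, seeds[c]))
                      for p in points]
        # Step 4: stop when the assignment does not change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, seeds

# Hypothetical 2-D data with two visually separate groups.
data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
labels, seeds = k_means(data, k=2)
print(labels, seeds)
```

The result depends on the initial partition; a different starting split can converge to a different local optimum.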

An Example of K-Means Clustering

- K = 2: arbitrarily partition the objects of the initial data set into k groups
- Update the cluster centroids
- Reassign objects
- Loop if needed: update the cluster centroids and reassign objects until there is no change

Example

- Problem

Suppose we have 4 types of medicines and each has two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.

Medicine   Weight index   pH
A          1              1
B          2              1
C          4              3
D          5              4

Example

- Step 1: Use initial seed points for partitioning

Assign each medicine to the nearest seed point, measured by Euclidean distance.
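
As a sketch of this step, the snippet below computes the Euclidean distance from each medicine to two seed points and assigns each medicine to the nearer one. The transcript does not say which seed points were used, so taking A and B as seeds is an assumption made purely for illustration.

```python
import math

# (weight index, pH) for the four medicines from the problem statement.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# Assumption: the seed points are not given in the text; A and B are used here.
seeds = [medicines["A"], medicines["B"]]

assignment = {}
for name, point in medicines.items():
    distances = [math.dist(point, s) for s in seeds]
    assignment[name] = distances.index(min(distances))  # 0-based cluster index
    print(name, [round(d, 2) for d in distances], "-> cluster", assignment[name] + 1)
```

With these assumed seeds, only A stays with the first seed; B, C, and D are all closer to the second.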

Example

- Step 2: Compute new centroids of the current partition

Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.
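
The centroid update is just a per-attribute mean over the members of each cluster. The sketch below assumes the memberships {A} and {B, C, D}, which is what a first assignment with seeds A and B would give; that starting membership is an assumption, since the transcript does not show it.

```python
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# Assumption: memberships after a first assignment with seeds A and B.
clusters = [["A"], ["B", "C", "D"]]

def centroid(names):
    """Mean of each attribute over the named cluster members."""
    pts = [medicines[n] for n in names]
    return tuple(sum(axis) / len(pts) for axis in zip(*pts))

centroids = [centroid(c) for c in clusters]
print(centroids)
```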

Example

- Step 2 (continued): Renew membership based on the new centroids

Compute the distance of all objects to the new centroids, then assign the membership to objects.

Example

- Step 3: Repeat the first two steps until convergence

Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.

Example

- Step 3 (continued): Repeat the first two steps until convergence

Compute the distance of all objects to the new centroids. Stop because there is no new assignment.
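
The whole example can be traced end to end with a short loop that alternates assignment and centroid update until the memberships stop changing. As before, starting from A and B as seeds is an assumption; the `history` list is only there to make the iterations visible.

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# Assumption: the text does not state the initial seeds; A and B are used here.
centroids = [points["A"], points["B"]]
labels = None
history = []  # one membership snapshot per iteration that changed something

while True:
    # Assign every medicine to its nearest centroid.
    new_labels = {
        name: min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for name, p in points.items()
    }
    if new_labels == labels:  # no new assignment: converged
        break
    labels = new_labels
    history.append(dict(labels))
    # Recompute each centroid from its current members.
    for c in range(len(centroids)):
        members = [points[n] for n, lab in labels.items() if lab == c]
        if members:
            centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))

print(history)
print(labels, centroids)
```

Under this assumption the loop ends with A and B in one group and C and D in the other.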

K-means Demo

- The user sets the number of clusters they'd like (e.g., K = 5)
- Randomly guess K cluster centre locations
- Each data point finds out which centre it's closest to (thus each centre owns a set of data points)
- Each centre finds the centroid of the points it owns
- ... and jumps there
- Repeat until terminated!

K-means Algorithm

Variations of the K-Means Method

- Most variants of k-means differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
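
The three k-modes ingredients listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the dissimilarity is taken to be simple matching (the count of attributes that differ), the initial modes are picked by hand, and the records and function names are hypothetical.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(records):
    """Frequency-based update: most common value in each attribute column."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def k_modes(records, initial_modes, max_iter=100):
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign each record to the cluster whose mode it mismatches least.
        new_labels = [min(range(len(modes)), key=lambda c: mismatch(r, modes[c]))
                      for r in records]
        if new_labels == labels:
            break
        labels = new_labels
        # Replace each cluster "mean" by the mode of its members.
        for c in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:
                modes[c] = mode_of(members)
    return labels, modes

# Hypothetical categorical records: (colour, size, shape).
records = [
    ("red", "small", "round"), ("red", "small", "oval"),
    ("blue", "large", "square"), ("blue", "large", "round"),
    ("red", "medium", "round"), ("blue", "large", "oval"),
]
labels, modes = k_modes(records, initial_modes=[records[0], records[2]])
print(labels, modes)
```

When several values tie for most frequent in a column, this sketch simply keeps whichever one `Counter` reports first; a real implementation would need an explicit tie-breaking rule.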

??

- ??????
- ???? (????????)
- ??? (????????????)
- K???? (Xmeans)
- ??? (????)

????

- ??????

- ??????

??

- ??????
- ???? (????????)
- ??? (????????????)
- K???? (Xmeans,???????)
- ??? (????)

???

Hierarchical Clustering

- Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
- Termination condition: number of clusters

????????

- ????????????????????,?????????????????
- ????????????????,????????????,?????????

???????,???????????? - ????????????????,??????????????,???????

??????,???????????? - ????????AGNES???????????DIANA???

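AGNES Algorithm

- AGNES (AGglomerative NESting) initially treats each object as its own cluster and then merges clusters step by step according to some criterion. The similarity between two clusters is determined by the closest pair of data points drawn from the two clusters. Merging is repeated until the required number of clusters is reached.

Algorithm: AGNES (bottom-up agglomerative algorithm)
Input: a database containing n objects; the termination condition, i.e., the number of clusters k.
Output: k clusters, satisfying the termination condition.
(1) Treat each object as an initial cluster
(2) REPEAT
(3)   Find the two closest clusters, based on the closest data points in the two clusters
(4)   Merge the two clusters, producing a new set of clusters
(5) UNTIL the defined number of clusters is reached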

AGNES Example

ID   Attribute 1   Attribute 2
1    1             1
2    1             2
3    2             1
4    2             2
5    3             4
6    3             5
7    4             4
8    4             5

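Step 1: Compute the distances between the initial clusters and merge two clusters at minimum distance (chosen arbitrarily among ties). The minimum distance is 1; after merging, points 1 and 2 form one cluster.
Step 2: Recompute the distances between the clusters left by the previous merge and merge the two closest clusters; points 3 and 4 become one cluster.
Step 3: Repeat Step 2; points 5 and 6 become one cluster.
Step 4: Repeat Step 2; points 7 and 8 become one cluster.
Step 5: Merge {1,2} and {3,4} into one cluster containing four points.
Step 6: Merge {5,6} and {7,8}. The number of clusters now meets the termination condition given by the user, so the procedure ends.

Step   Closest cluster distance   Two closest clusters   Clusters after merging
1      1                          {1}, {2}               {1,2}, {3}, {4}, {5}, {6}, {7}, {8}
2      1                          {3}, {4}               {1,2}, {3,4}, {5}, {6}, {7}, {8}
3      1                          {5}, {6}               {1,2}, {3,4}, {5,6}, {7}, {8}
4      1                          {7}, {8}               {1,2}, {3,4}, {5,6}, {7,8}
5      1                          {1,2}, {3,4}           {1,2,3,4}, {5,6}, {7,8}
6      1                          {5,6}, {7,8}           {1,2,3,4}, {5,6,7,8} (end)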
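
The merge sequence of this example can be reproduced with a short agglomerative sketch. It assumes the eight two-attribute points of the example, Euclidean distance, single linkage (cluster distance = distance of the closest pair of points), and k = 2 as the termination condition. Function names and the tie-breaking rule (first pair found wins) are assumptions of this sketch.

```python
import math

# The eight example points: ID -> (attribute 1, attribute 2).
points = {1: (1, 1), 2: (1, 2), 3: (2, 1), 4: (2, 2),
          5: (3, 4), 6: (3, 5), 7: (4, 4), 8: (4, 5)}

def single_link(a, b):
    """Cluster distance = distance between the closest pair of points."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agnes(k):
    clusters = [[i] for i in points]  # every object starts as its own cluster
    merges = []                       # (distance, cluster, cluster) per step
    while len(clusters) > k:
        # Find the closest pair of clusters; ties go to the first pair found.
        pairs = [(single_link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        merges.append((d, clusters[i], clusters[j]))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

clusters, merges = agnes(k=2)
for step, (d, a, b) in enumerate(merges, start=1):
    print(step, d, a, b)
print(sorted(clusters))
```

Recomputing every pairwise cluster distance at each step keeps the sketch short but is wasteful; it is meant to mirror the procedure, not to be efficient.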

AGNES????

- AGNES??????,??????????????????????????,???????????

??????????????,??????????????????????????????,????

?????????? - ?????????n??,???????1??,????????n???,??i????,?????

n-i1?????????????????????????????????,???????????

O(n²),?????n???????????

DIANA??

- DIANA(DIvisive ANAlysis)?????????????
- ????,????????????????????????,?????????
- ??????????????????????????

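Algorithm: DIANA (top-down divisive algorithm)
Input: a database containing n objects; the termination condition, i.e., the number of clusters k.
Output: k clusters, satisfying the termination condition.
(1) Treat all objects together as one initial cluster
(2) FOR (i = 1; i ≠ k; i++) DO BEGIN
(3)   Among all clusters, pick the cluster C with the largest diameter
(4)   Find the point p in C with the largest average dissimilarity to the other points; put p into the splinter group and the remaining points into the old party
(5)   REPEAT
(6)     In the old party, find a point whose distance to the nearest point in the splinter group is no greater than its distance to the nearest point in the old party, and add that point to the splinter group
(7)   UNTIL no point of the old party is newly assigned to the splinter group
(8)   The splinter group and the old party are the two clusters into which the selected cluster is split; together with the other clusters they form the new set of clusters
(9) END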

DIANA Example

ID   Attribute 1   Attribute 2
1    1             1
2    1             2
3    2             1
4    2             2
5    3             4
6    3             5
7    4             4
8    4             5

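Step 1: Find the cluster with the largest diameter and compute the average dissimilarity of each point in it (assuming Euclidean distance).
Average distance of point 1: (1 + 1 + 1.414 + 3.6 + 4.24 + 4.47 + 5) / 7 = 2.96
Similarly, point 2: 2.526; point 3: 2.68; point 4: 2.18; point 5: 2.18; point 6: 2.68; point 7: 2.526; point 8: 2.96.
Pick point 1, which has the largest average dissimilarity, and put it into the splinter group; the remaining points stay in the old party.
Step 2: In the old party, find a point whose distance to the nearest point in the splinter group is no greater than its distance to the nearest point in the old party, and put it into the splinter group; this point is 2.
Step 3: Repeat Step 2; point 3 joins the splinter group.
Step 4: Repeat Step 2; point 4 joins the splinter group.
Step 5: No point of the old party can be moved to the splinter group and the termination condition (k = 2) is met, so the procedure ends. If the termination condition were not met, the next round would continue by splitting one of the resulting clusters.

Step   Cluster with largest diameter   Splinter group   Old party
1      {1,2,3,4,5,6,7,8}               {1}              {2,3,4,5,6,7,8}
2      {1,2,3,4,5,6,7,8}               {1,2}            {3,4,5,6,7,8}
3      {1,2,3,4,5,6,7,8}               {1,2,3}          {4,5,6,7,8}
4      {1,2,3,4,5,6,7,8}               {1,2,3,4}        {5,6,7,8}
5      {1,2,3,4,5,6,7,8}               {1,2,3,4}        {5,6,7,8} (end)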
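
The splitting procedure of this example can be sketched as below. The sketch assumes the eight example points, Euclidean distance, and a nearest-point move rule: a point leaves the old party when its closest splinter-group point is no farther than its closest remaining old-party point. That rule, the function names, and the tie-breaking (first qualifying point moves) are assumptions of this sketch; the classical DIANA formulation compares average dissimilarities instead.

```python
import math

# The eight example points: ID -> (attribute 1, attribute 2).
points = {1: (1, 1), 2: (1, 2), 3: (2, 1), 4: (2, 2),
          5: (3, 4), 6: (3, 5), 7: (4, 4), 8: (4, 5)}

def dist(i, j):
    return math.dist(points[i], points[j])

def diameter(cluster):
    """Largest pairwise distance inside a cluster (0 for a singleton)."""
    return max(dist(i, j) for i in cluster for j in cluster)

def avg_dissimilarity(i, cluster):
    others = [j for j in cluster if j != i]
    return sum(dist(i, j) for j in others) / len(others)

def split(cluster):
    """Split one cluster (2+ points) into (splinter group, old party)."""
    seed = max(cluster, key=lambda i: avg_dissimilarity(i, cluster))
    splinter, old = [seed], [i for i in cluster if i != seed]
    moved = True
    while moved and len(old) > 1:
        moved = False
        for i in old:
            to_splinter = min(dist(i, j) for j in splinter)
            to_old = min(dist(i, j) for j in old if j != i)
            if to_splinter <= to_old:   # assumed move rule (see lead-in)
                old.remove(i)
                splinter.append(i)
                moved = True
                break                   # restart the scan after each move
    return sorted(splinter), sorted(old)

def diana(k):
    # Assumes k <= number of points and that all points are distinct.
    clusters = [sorted(points)]
    while len(clusters) < k:
        target = max(clusters, key=diameter)
        clusters.remove(target)
        clusters.extend(split(target))
    return clusters

clusters = diana(k=2)
print(sorted(clusters))
```

Points 1 and 8 have the same average dissimilarity up to rounding, so which of them seeds the splinter group may vary; either way the two resulting groups are the same.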

?????????

- ????
- ??????,???????????
- ??????,?????????
- ????????
- ????????????

?????????

- ????
- ????????
- ????
- ?????????????
- ??????????????????
- ?????,????????

?????

- ???????????,??????????,??????????,??????????????

???????,??string?text,????????cluster????

Thanks for your patience

- Email: zouquan_at_xmu.edu.cn

??

- ???K-means?????,?????????
- ?????????????????????
- ???????????????