Last modified by: zouquan. Created: 10/31/2011 1:32:54 AM.


Title: Clustering


1
Clustering
  • Quan Zou
  • Ph.D., Assistant Professor
  • http://datamining.xmu.edu.cn/zq/

2
Outline
  • Introduction of Clustering
  • K-means Clustering
  • Hierarchical Clustering

3
What is Clustering
  • Cluster: a collection of data objects that are
  • similar (or related) to one another within the
    same group
  • dissimilar (or unrelated) to the objects in other
    groups
  • Cluster analysis (or clustering, data
    segmentation, ...):
  • Finding similarities between data according to
    the characteristics found in the data and
    grouping similar data objects into clusters
  • Unsupervised learning: no predefined classes
    (i.e., learning by observations, as opposed to
    supervised learning by examples)

4
(No Transcript)
5
Clustering for Data Understanding and Applications
  • Biology: taxonomy of living things (kingdom,
    phylum, class, order, family, genus and species)
  • Information retrieval: document clustering
  • Marketing: help marketers discover distinct
    groups in their customer bases, and then use this
    knowledge to develop targeted marketing programs
  • Typical applications:
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

6
Clustering as a Preprocessing Tool (Utility)
  • Summarization: preprocessing for regression, PCA,
    classification, and association analysis
  • Compression: image processing (vector
    quantization)
  • Finding K-nearest neighbors: localizing search to
    one or a small number of clusters
  • Outlier detection: outliers are often viewed as
    those far away from any cluster

8
1    255106106    1.2
2    25511486     1.0
3    255239219    0.5

     1    2    3
1    0
2         0
3              0
9
The K-Means Clustering Method
  • Given k, the k-means algorithm is implemented in
    four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the
     clusters of the current partitioning (the
     centroid is the center, i.e., mean point, of the
     cluster)
  3. Assign each object to the cluster with the
     nearest seed point
  4. Go back to Step 2; stop when the assignment does
     not change
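
The four steps can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's code: the names `k_means` and `centroid`, the `max_iter` safeguard, and the rule of keeping the previous centroid when a cluster becomes empty are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(members):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def k_means(points, labels, k, max_iter=100):
    """labels is the initial partition (Step 1): one cluster index per point."""
    labels = list(labels)
    if set(labels) != set(range(k)):
        raise ValueError("the initial partition must use all k clusters")
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: recompute the centroid of every cluster.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[j] = centroid(members)
        # Step 3: assign each object to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Step 4: stop when the assignment does not change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

For instance, `k_means([(0, 0), (0, 1), (10, 10), (10, 11)], [0, 1, 0, 1], 2)` starts from a deliberately poor partition and still separates the two lower-left points from the two upper-right ones.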

10
An Example of K-Means Clustering
K = 2
  • The initial data set
  • Arbitrarily partition objects into k groups
  • Update the cluster centroids
  • Reassign objects
  • Loop if needed: update the cluster centroids and
    reassign objects until no change
11
Example
  • Problem

Suppose we have 4 types of medicine, each with two
attributes (weight index and pH). Our goal is to
group these objects into K = 2 groups of medicines.

Medicine  Weight index  pH
A         1             1
B         2             1
C         4             3
D         5             4
12
Example
  • Step 1: Use initial seed points for partitioning

Euclidean distance
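
As a small illustration of this step, the sketch below computes the Euclidean distance from every medicine to two seed points and assigns each medicine to the nearer seed. The transcript does not say which seeds the slide used, so taking medicines A and B as the initial seeds is an assumption, as are the variable names.

```python
from math import dist  # Euclidean distance, Python 3.8+

# (weight index, pH) for the four medicines in the problem statement
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# Assumption: A and B serve as the two initial seed points.
seeds = [medicines["A"], medicines["B"]]

assignment = {}
for name, point in medicines.items():
    distances = [dist(point, seed) for seed in seeds]
    assignment[name] = distances.index(min(distances))
    print(name, [round(d, 2) for d in distances], "-> cluster", assignment[name] + 1)
```

Under that assumption, A stays alone in the first cluster while B, C and D fall into the second.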
13
Example
  • Step 2: Compute new centroids of the current
    partition

Knowing the members of each cluster, we now
compute the new centroid of each group based on
these new memberships.
14
Example
  • Step 2: Renew membership based on new centroids

Compute the distance of all objects to the new
centroids.
Assign the membership to objects.
15
Example
  • Step 3: Repeat the first two steps until
    convergence

Knowing the members of each cluster, we now
compute the new centroid of each group based on
these new memberships.
16
Example
  • Step 3: Repeat the first two steps until
    convergence

Compute the distance of all objects to the new
centroids.
Stop, because no assignment changes.
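
The whole worked example can be replayed with a short loop. This is only a sketch under stated assumptions: the seeds (medicines A and B) are not given in the transcript and are chosen here for illustration, and the variable names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [points["A"], points["B"]]  # assumption: A and B are the seeds

assignment = {}
iterations = 0
while True:
    # Renew membership: each medicine joins its nearest centroid.
    new_assignment = {
        name: min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for name, p in points.items()
    }
    if new_assignment == assignment:
        break  # no new assignment, so the procedure has converged
    assignment = new_assignment
    iterations += 1
    # Compute the new centroid of each group from its members.
    for j in range(len(centroids)):
        members = [points[n] for n, c in assignment.items() if c == j]
        centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))

print(assignment, centroids, iterations)
```

With these seeds the loop settles after two reassignments: A and B form one group with centroid (1.5, 1.0), and C and D form the other with centroid (4.5, 3.5).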
22
K-means Demo
  • The user sets the number of clusters they'd like
    (e.g., K = 5)
  • Randomly guess K cluster centre locations
  • Each data point finds out which centre it's
    closest to (thus each centre owns a set of
    data points)
  • Each centre finds the centroid of the points it
    owns
  • and jumps there
  • Repeat until terminated!

23
K-means Algorithm
24
Variations of the K-Means Method
  • Most variants of k-means differ in:
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
  • Handling categorical data: k-modes
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with
    categorical objects
  • Using a frequency-based method to update modes of
    clusters
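
To make the k-modes idea concrete, here is a minimal sketch. It assumes the simple matching dissimilarity (the number of attributes on which two records differ), which is one common choice for categorical data, and updates each mode by taking the most frequent value per attribute. The function names and the toy records are invented for the example.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Frequency-based update: most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def k_modes(records, modes, max_iter=100):
    modes = list(modes)
    labels = None
    for _ in range(max_iter):
        # Assign every record to the mode it differs from least.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Replace each cluster's mode (the analogue of the mean in k-means).
        for j in range(len(modes)):
            members = [r for r, l in zip(records, labels) if l == j]
            if members:
                modes[j] = cluster_mode(members)
    return labels, modes


records = [
    ("red", "round"), ("red", "round"), ("red", "oval"),
    ("blue", "square"), ("blue", "square"), ("green", "square"),
]
labels, modes = k_modes(records, [records[0], records[3]])
```

The structure mirrors k-means; only the dissimilarity measure and the centre update change.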

25
  • Choosing K (X-means)

32
Hierarchical Clustering
  • Uses a distance matrix as the clustering
    criterion. This method does not require the
    number of clusters k as an input, but needs a
    termination condition
  • Termination condition: number of clusters

33
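Overview of Hierarchical Clustering Methods
  • Hierarchical clustering methods decompose a given
    data set hierarchically until some condition is
    satisfied. They fall into two types:
  • Agglomerative hierarchical clustering: a bottom-up
    strategy that first treats each object as its own
    cluster and then merges these atomic clusters into
    larger and larger clusters until a termination
    condition is satisfied
  • Divisive hierarchical clustering: a top-down
    strategy that first places all objects in one
    cluster and then gradually subdivides it into
    smaller and smaller clusters until a termination
    condition is reached
  • AGNES is the representative agglomerative
    algorithm; DIANA is the representative divisive
    algorithm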

34
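The AGNES Algorithm
  • AGNES (Agglomerative NESting) initially treats
    each object as a cluster; the clusters are then
    merged step by step according to some criterion.
    The similarity between two clusters is determined
    by the similarity of the closest pair of data
    points from the two different clusters. The
    merging process repeats until the required number
    of clusters is reached.

Algorithm AGNES (bottom-up agglomerative algorithm)
Input: a database containing n objects; termination condition: the number of clusters k.
Output: k clusters, meeting the termination condition.
(1) Treat each object as an initial cluster
(2) REPEAT
(3)   Find the two closest clusters, based on the closest pair of data points in the two clusters
(4)   Merge the two clusters to produce a new set of clusters
(5) UNTIL the defined number of clusters is reached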
35
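AGNES Example

ID  Attribute 1  Attribute 2
1   1            1
2   1            2
3   2            1
4   2            2
5   3            4
6   3            5
7   4            4
8   4            5

Step 1: Compute the distances between the initial clusters and merge the two closest ones (ties broken at random); the minimum distance is 1, and points 1 and 2 are merged into one cluster.
Step 2: Recompute the distances between clusters and merge the closest pair; points 3 and 4 become one cluster.
Step 3: Repeat Step 2; points 5 and 6 become one cluster.
Step 4: Repeat Step 2; points 7 and 8 become one cluster.
Step 5: Merge {1,2} and {3,4} into one cluster of four points.
Step 6: Merge {5,6} and {7,8}; the number of clusters now meets the termination condition, so the algorithm stops.

Step  Closest distance  Closest two clusters  New clusters after merging
1     1                 {1}, {2}              {1,2}, {3}, {4}, {5}, {6}, {7}, {8}
2     1                 {3}, {4}              {1,2}, {3,4}, {5}, {6}, {7}, {8}
3     1                 {5}, {6}              {1,2}, {3,4}, {5,6}, {7}, {8}
4     1                 {7}, {8}              {1,2}, {3,4}, {5,6}, {7,8}
5     1                 {1,2}, {3,4}          {1,2,3,4}, {5,6}, {7,8}
6     1                 {5,6}, {7,8}          {1,2,3,4}, {5,6,7,8} (end)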
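
The merging procedure can be sketched in Python on the eight example points. This is an illustrative single-link implementation, not the presenter's code: the function names are hypothetical, and ties between equally close pairs are broken by taking the first pair found rather than at random.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

points = {1: (1, 1), 2: (1, 2), 3: (2, 1), 4: (2, 2),
          5: (3, 4), 6: (3, 5), 7: (4, 4), 8: (4, 5)}


def single_link(a, b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agnes(ids, k):
    clusters = [[i] for i in ids]  # every object starts as its own cluster
    merges = []
    while len(clusters) > k:  # termination condition: k clusters remain
        a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        merges.append((a, b, single_link(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return clusters, merges


clusters, merges = agnes(list(points), k=2)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

On this data every merge happens at distance 1, and the two remaining clusters are {1,2,3,4} and {5,6,7,8}.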

36
(No Transcript)
37
(No Transcript)
38
(No Transcript)
39
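Performance Analysis of AGNES
  • AGNES is fairly simple, but it often runs into
    difficulty choosing merge points. Once a group of
    objects has been merged, the next step operates on
    the newly formed clusters. Completed steps cannot
    be undone, and objects cannot be exchanged between
    clusters. A poor merge decision at some step can
    lead to a low-quality clustering result.
  • Suppose there are n clusters at the start and 1
    cluster at the end, so the main loop runs n
    iterations; in the i-th iteration, the two closest
    clusters must be found among n - i + 1 clusters.
    The algorithm must also compute the distance
    between every pair of objects, so its complexity
    is O(n²), which makes it unsuitable when n is
    large.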

40
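The DIANA Algorithm
  • DIANA (DIvisive ANAlysis) is a typical divisive
    clustering method.
  • In clustering, the user can specify the desired
    number of clusters as a termination condition.
  • Cluster diameter: the maximum distance between any
    two data points in a cluster.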

41
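Algorithm DIANA (top-down divisive algorithm)
Input: a database containing n objects; termination condition: the number of clusters k.
Output: k clusters, meeting the termination condition.
(1) Treat all objects as one initial cluster
(2) FOR (i = 1; i ≠ k; i++) DO BEGIN
(3)   Among all clusters, pick the cluster C with the largest diameter
(4)   Find the point p in C with the largest average dissimilarity to the other points; put p into the splinter group and the remaining points into the old party
(5)   REPEAT
(6)     In the old party, find a point whose distance to the nearest point in the splinter group is no greater than its distance to the nearest point in the old party, and add it to the splinter group
(7)   UNTIL no point in the old party is reassigned to the splinter group
(8)   The splinter group and the old party are the two clusters into which the selected cluster is split; together with the other clusters they form the new set of clusters
(9) END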
42
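DIANA Example

ID  Attribute 1  Attribute 2
1   1            1
2   1            2
3   2            1
4   2            2
5   3            4
6   3            5
7   4            4
8   4            5

Step 1: Find the cluster with the largest diameter and compute the average dissimilarity of each point in it (assuming Euclidean distance).
Average distance of point 1: (1 + 1 + 1.414 + 3.6 + 4.24 + 4.47 + 5) / 7 = 2.96
Similarly, the average distance is 2.526 for point 2, 2.68 for point 3, 2.18 for point 4, 2.18 for point 5, 2.68 for point 6, 2.526 for point 7, and 2.96 for point 8.
Point 1, which has the largest average dissimilarity, is put into the splinter group; the remaining points stay in the old party.
Step 2: In the old party, find a point whose distance to the nearest splinter-group point is no greater than its distance to the nearest old-party point, and move it into the splinter group; that point is 2.
Step 3: Repeat Step 2; point 3 joins the splinter group.
Step 4: Repeat Step 2; point 4 joins the splinter group.
Step 5: No point in the old party moves to the splinter group, and the termination condition (k = 2) is reached, so the algorithm stops. If the termination condition had not been reached, the cluster with the largest diameter among the resulting clusters would be split next.

Step  Cluster with largest diameter  Splinter group  Old party
1     {1,2,3,4,5,6,7,8}              {1}             {2,3,4,5,6,7,8}
2     {1,2,3,4,5,6,7,8}              {1,2}           {3,4,5,6,7,8}
3     {1,2,3,4,5,6,7,8}              {1,2,3}         {4,5,6,7,8}
4     {1,2,3,4,5,6,7,8}              {1,2,3,4}       {5,6,7,8}
5     {1,2,3,4,5,6,7,8}              {1,2,3,4}       {5,6,7,8} (stop)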
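
A single DIANA split of the eight example points can be sketched as below. It is an illustration under assumptions rather than the presenter's code: the function names are hypothetical, one point is moved per pass, and ties are resolved by list order. Points 1 and 8 have the same average dissimilarity, so either may seed the splinter group; the resulting two clusters are the same in both cases.

```python
from math import dist  # Euclidean distance, Python 3.8+

points = {1: (1, 1), 2: (1, 2), 3: (2, 1), 4: (2, 2),
          5: (3, 4), 6: (3, 5), 7: (4, 4), 8: (4, 5)}


def average_dissimilarity(i, cluster):
    """Mean distance from point i to every other point of the cluster."""
    others = [j for j in cluster if j != i]
    return sum(dist(points[i], points[j]) for j in others) / len(others)


def split(cluster):
    """Divide one cluster into a splinter group and an old party."""
    seed = max(cluster, key=lambda i: average_dissimilarity(i, cluster))
    splinter = [seed]
    old = [i for i in cluster if i != seed]
    moved = True
    while moved and len(old) > 1:
        moved = False
        for i in old:
            to_splinter = min(dist(points[i], points[j]) for j in splinter)
            to_old = min(dist(points[i], points[j]) for j in old if j != i)
            if to_splinter <= to_old:  # no farther from the splinter group
                old.remove(i)
                splinter.append(i)
                moved = True
                break  # rescan the old party after every move
    return splinter, old


splinter, old = split(list(points))
print(sorted(splinter), sorted(old))
```

Reaching k clusters for k greater than 2 would repeat `split` on whichever current cluster has the largest diameter.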
43
(No Transcript)
47
Thanks for your patience
  • Email: zouquan_at_xmu.edu.cn
