Cluster Analysis - PowerPoint PPT Presentation

Title: Cluster Analysis
Provided by: halwhi
Learn more at: http://www.math.utah.edu

Transcript and Presenter's Notes

Title: Cluster Analysis

1
Cluster Analysis
  • Hal Whitehead
  • BIOL4062/5062

2
  • What is cluster analysis?
  • Non-hierarchical cluster analysis
  • K-means
  • Hierarchical divisive cluster analysis
  • Hierarchical agglomerative cluster analysis
  • Linkage: single, complete, average, ...
  • Cophenetic correlation coefficient
  • Additive trees
  • Problems with cluster analyses

3
Cluster Analysis
  • Classification
  • Maximize within cluster homogeneity
  • (similar individuals within cluster)
  • The Search for Discontinuities
  • Discontinuities: places to put divisions between
    clusters

4
Discontinuities
  • Discontinuities generally present
  • taxonomy
  • social organization
  • community ecology??

5
Types of cluster analysis
  • Uses a data, dissimilarity, or similarity matrix
  • Non-hierarchical
  • K-means
  • Hierarchical
  • Hierarchical divisive (repeated K-means)
  • Hierarchical agglomerative
  • single linkage, average linkage, ...
  • Additive trees

6
Non-hierarchical Clustering Techniques: K-Means
  • Uses data matrix with Euclidean distances
  • Maximizes between-cluster variance for given
    number of clusters
  • i.e. Choose clusters to maximize F-ratio in 1-way
    MANOVA

7
K-Means
  • Works iteratively
  • 1. Choose the number of clusters
  • 2. Assign points to clusters
  • randomly or by some other clustering technique
  • 3. Move each point to each other cluster in
    turn: does between-cluster variance increase?
  • 4. Repeat step 3 until no improvement is possible
  • (a minimal sketch of this loop follows)
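
A minimal Python sketch of these steps (not the slides' code; it uses the common batch "Lloyd" variant, which reassigns all points each pass rather than moving one point at a time, and assumes numpy):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # Steps 1-2: choose k and assign points to clusters at random
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, k, size=len(X))
        for _ in range(max_iter):
            # centroid of each cluster (empty clusters are not handled here)
            centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Step 3: reassign every point to its nearest centroid, which
            # cannot increase the within-cluster sum of squares
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 4: stop when no reassignment improves the solution
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels, centroids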

8
K-means with three clusters
9
K-means with three clusters
  Variable   Between SS   df   Within SS   df   F-ratio
  X          0.536        2    0.007       7    256.163
  Y          0.541        2    0.050       7    37.566
  TOTAL      1.078        4    0.058       14

10
K-means with three clusters
  Cluster 1 of 3 contains 4 cases
  Members             Statistics
  Case     Distance   Variable   Minimum   Mean   Maximum   St.Dev.
  Case 1   0.02       X          0.41      0.45   0.49      0.04
  Case 2   0.11       Y          0.03      0.19   0.27      0.11
  Case 3   0.06
  Case 4   0.05

  Cluster 2 of 3 contains 4 cases
  Members             Statistics
  Case     Distance   Variable   Minimum   Mean   Maximum   St.Dev.
  Case 7   0.06       X          0.11      0.15   0.19      0.03
  Case 8   0.03       Y          0.61      0.70   0.77      0.07
  Case 9   0.02
  Case 10  0.06

  Cluster 3 of 3 contains 2 cases
  Members             Statistics
  Case     Distance   Variable   Minimum   Mean   Maximum   St.Dev.
  Case 5   0.01       X          0.77      0.77   0.78      0.01
  Case 6   0.01       Y          0.33      0.35   0.36      0.02

11
Disadvantages of K-means
  • Reaches an optimum, but not necessarily the global one
  • Must choose the number of clusters before the analysis
  • How many clusters? (one common workaround is sketched below)
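
One common workaround, not from the slides: rerun K-means from many random starts and compare the within-cluster sum of squares across candidate numbers of clusters. A sketch using scikit-learn (an assumption; any K-means implementation would do):

    from sklearn.cluster import KMeans

    # X: an (n_cases x n_variables) data matrix
    for k in range(2, 11):
        fit = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)
        # inertia_ is the total within-cluster sum of squares; plotting it
        # against k and looking for an "elbow" is a common, if arbitrary,
        # way to pick the number of clusters
        print(k, fit.inertia_)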

12
Example: Sperm whale codas
  • Patterned series of clicks
  • ic1 ic2 ic3 ic4 (inter-click intervals)
  • For 5-click codas: a 681 x 4 data set

13
5-click codas
ic1 ic2 ic3 ic4
93% of variance in 2 PCs
14
5-click codas: K-means with 10 clusters
15
Hierarchical Cluster Analysis
  • Usually represented by a dendrogram or tree diagram

16
Hierarchical Cluster Analysis
  • Hierarchical Divisive Cluster Analysis
  • Hierarchical Agglomerative Cluster Analysis

17
Hierarchical Divisive Cluster Analysis
  • Starts with all units in one cluster and
    successively splits them
  • Successive use of K-means, or some other divisive
    technique, with n = 2
  • Either: each time, split the cluster with the
    greatest sum of squared distances
  • Or: split every cluster each time
  • Hierarchical divisive methods are good techniques,
    but rarely used (the first variant is sketched below)
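
A sketch of the first variant (the function name is mine; it reuses scikit-learn's K-means, an assumption, for the n = 2 splits):

    import numpy as np
    from sklearn.cluster import KMeans

    def divisive(X, n_clusters):
        # start with all units in one cluster (stored as index arrays)
        clusters = [np.arange(len(X))]
        while len(clusters) < n_clusters:
            # pick the cluster with the greatest sum of squared distances
            # from its members to its centroid
            ss = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
            worst = clusters.pop(int(np.argmax(ss)))
            # split it in two with K-means, n = 2
            half = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[worst])
            clusters += [worst[half == 0], worst[half == 1]]
        return clusters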

18
Hierarchical Agglomerative Cluster Analysis
  • Start with each individual unit occupying its
    own cluster
  • The clusters are then gradually merged until just
    one is left
  • The most common cluster analyses

19
Hierarchical Agglomerative Cluster Analysis
  • Works on a dissimilarity matrix
  • or a negative similarity matrix
  • may be Euclidean, Penrose, ... distances
  • At each step:
  • 1. There is a symmetric matrix of
    dissimilarities between clusters
  • 2. The two clusters with least dissimilarity are
    merged
  • 3. The dissimilarity between the new (merged)
    cluster and all others is calculated
  • Different techniques do step 3 in different
    ways (a minimal sketch of the loop follows)
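
A minimal sketch of this loop (names are assumptions; the link argument implements step 3: pass min for single linkage, max for complete, or lambda p: sum(p) / 2 for a WPGMA-style average):

    import numpy as np

    def agglomerate(D, link=min):
        D = D.astype(float).copy()
        clusters = [[i] for i in range(len(D))]
        merges = []
        while len(clusters) > 1:
            np.fill_diagonal(D, np.inf)   # never merge a cluster with itself
            # step 2: find and merge the two least-dissimilar clusters
            i, j = np.unravel_index(np.argmin(D), D.shape)
            i, j = min(i, j), max(i, j)
            merges.append((list(clusters[i]), list(clusters[j]), float(D[i, j])))
            # step 3: dissimilarity of the merged cluster to all others
            # (a plain mean of the two old values ignores cluster sizes;
            # true average linkage weights by size)
            new_row = np.array([link((D[i, k], D[j, k])) for k in range(len(D))])
            D[i, :] = new_row
            D[:, i] = new_row
            D = np.delete(np.delete(D, j, axis=0), j, axis=1)
            clusters[i] += clusters.pop(j)
        return merges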

20
Hierarchical Agglomerative Cluster Analysis
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B ? 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

How to calculate the new dissimilarities?
21
Hierarchical Agglomerative Cluster Analysis: Single Linkage
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B 0.35 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

d(AD,B) = Min{d(A,B), d(D,B)}
22
Hierarchical Agglomerative Cluster Analysis: Complete Linkage
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B 0.45 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

d(AD,B) = Max{d(A,B), d(D,B)}
23
Hierarchical Agglomerative Cluster Analysis: Average Linkage
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B 0.40 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

d(AD,B) = Mean{d(A,B), d(D,B)} (all three update rules are checked in the sketch below)
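
Checking the three update rules against the numbers above, and handing the whole matrix to scipy (scipy.cluster.hierarchy.linkage and scipy.spatial.distance.squareform are real functions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    d_AB, d_DB = 0.35, 0.45
    print(min(d_AB, d_DB))        # 0.35 -- single linkage
    print(max(d_AB, d_DB))        # 0.45 -- complete linkage
    print((d_AB + d_DB) / 2)      # 0.40 -- average linkage

    # the full 5 x 5 dissimilarity matrix from the slides
    D = np.array([[0.00, 0.35, 0.45, 0.11, 0.22],
                  [0.35, 0.00, 0.67, 0.45, 0.56],
                  [0.45, 0.67, 0.00, 0.57, 0.78],
                  [0.11, 0.45, 0.57, 0.00, 0.19],
                  [0.22, 0.56, 0.78, 0.19, 0.00]])
    # squareform condenses the symmetric matrix; linkage builds the dendrogram
    Z = linkage(squareform(D), method='single')   # or 'complete', 'average'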
24
Hierarchical Agglomerative Cluster Analysis: Centroid Clustering
(uses a data matrix, or a true distance matrix)
  • V1 V2 V3
  • A 0.11 0.75 0.33
  • B 0.35 0.99 0.41
  • C 0.45 0.67 0.22
  • D 0.11 0.71 0.37
  • E 0.22 0.56 0.78
  • F 0.13 0.14 0.55
  • G 0.55 0.90 0.21
  • V1 V2 V3
  • AD 0.11 0.73 0.35
  • B 0.35 0.99 0.41
  • C 0.45 0.67 0.22
  • E 0.22 0.56 0.78
  • F 0.13 0.14 0.55
  • G 0.55 0.90 0.21

V1(AD) = Mean{V1(A), V1(D)} (and likewise for V2, V3; see the one-liner below)
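
The same update in numpy, using the rows for A and D from the slide (both are single-unit clusters here; for larger clusters the centroid is the size-weighted mean of the members):

    import numpy as np

    A = np.array([0.11, 0.75, 0.33])   # row A of the data matrix
    D = np.array([0.11, 0.71, 0.37])   # row D
    print((A + D) / 2)                 # [0.11, 0.73, 0.35] -- the AD row above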
25
Hierarchical Agglomerative Cluster Analysis: Ward's Method
  • Minimizes the within-cluster sum of squares
  • Similar to centroid clustering

26
        1     2     4     5     9     11    12    14    15    19    20
  1     1.00
  2     0.00  1.00
  4     0.53  0.00  1.00
  5     0.18  0.05  0.00  1.00
  9     0.22  0.09  0.13  0.25  1.00
  11    0.36  0.00  0.17  0.40  0.33  1.00
  12    0.00  0.37  0.18  0.00  0.13  0.00  1.00
  14    0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
  15    0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
  19    0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
  20    0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00

27
(No Transcript)
28
Hierarchical Agglomerative Clustering Techniques
  • Single Linkage
  • Produces straggly clusters
  • Not recommended if much experimental error
  • Used in taxonomy
  • Invariant to transformations
  • Complete Linkage
  • Produces tight clusters
  • Not recommended if much experimental error
  • Invariant to transformations
  • Average Linkage, Centroid, Ward's
  • Most likely to mimic input clusters
  • Not invariant to transformations of the
    dissimilarity measure

29
Cophenetic Correlation Coefficient (CCC)
  • Correlation between the original dissimilarity matrix
    and the dissimilarity inferred from the cluster analysis
  • CCC > 0.8 indicates a good match
  • CCC < 0.8: the dendrogram is not a good representation
  • and probably should not be displayed
  • Use the CCC to choose the best linkage method (highest
    coefficient); see the sketch below
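
A sketch with scipy's cophenet (a real function; X is assumed to be the data matrix):

    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    d = pdist(X)                      # original (condensed) dissimilarity matrix
    for method in ('single', 'complete', 'average'):
        Z = linkage(d, method=method)
        ccc, _ = cophenet(Z, d)       # correlation of d with cophenetic distances
        print(method, round(ccc, 2))  # choose the linkage with the highest CCC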

30
(Four dendrograms: CCC = 0.83, CCC = 0.77, CCC = 0.75, CCC = 0.80)
31
Additive trees
  • Dendrogram in which path lengths represent
    dissimilarities
  • Computation quite complex (cross between
    agglomerative techniques and multidimensional
    scaling)
  • Good when data are measured as similarities and
    dissimilarities
  • Often used in taxonomy and genetics

      A    B    C    D    E
  A   .    .    .    .    .
  B   14   .    .    .    .
  C   6    12   .    .    .
  D   81   7    13   .    .
  E   17   1    6    16   .
32
Problems with Cluster Analysis
  • Are there really biologically meaningful clusters
    in the data?
  • Does the dendrogram represent biological reality
    (web-of-life versus tree-of-life)?
  • How many clusters to use?
  • stopping rules are arbitrary
  • Which method to use?
  • best technique is data-dependent
  • Dendrograms become messy with many units

33
Social Structure of 160 northern bottlenose whales
34
Clustering Techniques
  Type                         Technique            Use
  Non-hierarchical             K-means              Dividing data sets
  Hierarchical divisive        Repeated K-means     Good technique on small data sets
  Hierarchical agglomerative   Single linkage       Taxonomy
                               Complete linkage     Tighter clusters
                               Average linkage,
                               Centroid, Ward's     Usually preferred
  Hierarchical                 Additive trees       Excellent for displaying
                                                    similarity/dissimilarity;
                                                    taxonomy, genetics

35
(No Transcript)