Cluster Analysis - PowerPoint PPT Presentation

Title: Cluster Analysis
Provided by: halwhi
Learn more at: http://www.math.utah.edu

Transcript and Presenter's Notes

Title: Cluster Analysis

1
Cluster Analysis
  • Hal Whitehead
  • BIOL4062/5062

2
  • What is cluster analysis?
  • Non-hierarchical cluster analysis
  • K-means
  • Hierarchical divisive cluster analysis
  • Hierarchical agglomerative cluster analysis
  • Linkage: single, complete, average, ...
  • Cophenetic correlation coefficient
  • Additive trees
  • Problems with cluster analyses

3
Cluster Analysis
  • Classification
  • Maximize within cluster homogeneity
  • (similar individuals within cluster)
  • The Search for Discontinuities
  • Discontinuities: places to put divisions between
    clusters

4
Discontinuities
  • Discontinuities generally present
  • taxonomy
  • social organization
  • community ecology??

5
Types of cluster analysis
  • Uses a data, dissimilarity, or similarity matrix
  • Non-hierarchical
  • K-means
  • Hierarchical
  • Hierarchical divisive (repeated K-means)
  • Hierarchical agglomerative
  • single linkage, average linkage, ...
  • Additive trees

6
Non-hierarchical Clustering Techniques: K-Means
  • Uses data matrix with Euclidean distances
  • Maximizes between-cluster variance for given
    number of clusters
  • i.e. Choose clusters to maximize F-ratio in 1-way
    MANOVA

7
K-Means
  • Works iteratively
  • 1. Choose the number of clusters
  • 2. Assign points to clusters
  • randomly or by some other clustering technique
  • 3. Move each point to each other cluster in
    turn: does between-cluster variance increase?
  • 4. Repeat step 3 until no improvement is possible
  • (a minimal sketch of this loop follows)
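
A minimal Python sketch of these steps (not the slides' code; it uses the common batch "Lloyd" variant, which reassigns all points each pass rather than moving one point at a time, and assumes numpy):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # Steps 1-2: choose k and assign points to clusters at random
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, k, size=len(X))
        for _ in range(max_iter):
            # centroid of each cluster (empty clusters are not handled here)
            centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Step 3: reassign every point to its nearest centroid, which
            # cannot increase the within-cluster sum of squares
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 4: stop when no reassignment improves the solution
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels, centroids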

8
K-means with three clusters
9
K-means with three clusters
  Variable   Between SS   df   Within SS   df   F-ratio
  X          0.536        2    0.007       7    256.163
  Y          0.541        2    0.050       7    37.566
  TOTAL      1.078        4    0.058       14

10
K-means with three clusters
  Cluster 1 of 3 contains 4 cases
  Members             Statistics
  Case     Distance   Variable   Minimum   Mean   Maximum   St.Dev.
  Case 1   0.02       X          0.41      0.45   0.49      0.04
  Case 2   0.11       Y          0.03      0.19   0.27      0.11
  Case 3   0.06
  Case 4   0.05

  Cluster 2 of 3 contains 4 cases
  Members             Statistics
  Case     Distance   Variable   Minimum   Mean   Maximum   St.Dev.
  Case 7   0.06       X          0.11      0.15   0.19      0.03
  Case 8   0.03       Y          0.61      0.70   0.77      0.07
  Case 9   0.02
  Case 10  0.06

  Cluster 3 of 3 contains 2 cases
  Members             Statistics
  Case     Distance   Variable   Minimum   Mean   Maximum   St.Dev.
  Case 5   0.01       X          0.77      0.77   0.78      0.01
  Case 6   0.01       Y          0.33      0.35   0.36      0.02

11
Disadvantages of K-means
  • Reaches an optimum, but not necessarily the global one
  • Must choose the number of clusters before the analysis
  • How many clusters? (one common workaround is sketched below)
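
One common workaround, not from the slides: rerun K-means from many random starts and compare the within-cluster sum of squares across candidate numbers of clusters. A sketch using scikit-learn (an assumption; any K-means implementation would do):

    from sklearn.cluster import KMeans

    # X: an (n_cases x n_variables) data matrix
    for k in range(2, 11):
        fit = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)
        # inertia_ is the total within-cluster sum of squares; plotting it
        # against k and looking for an "elbow" is a common, if arbitrary,
        # way to pick the number of clusters
        print(k, fit.inertia_)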

12
Example: Sperm whale codas
  • Patterned series of clicks
  • ic1 ic2 ic3 ic4 (inter-click intervals)
  • For 5-click codas: a 681 x 4 data set

13
5-click codas
ic1 ic2 ic3 ic4
93% of variance in 2 PCs
14
5-click codas: K-means with 10 clusters
15
Hierarchical Cluster Analysis
  • Usually represented by a dendrogram or tree diagram

16
Hierarchical Cluster Analysis
  • Hierarchical Divisive Cluster Analysis
  • Hierarchical Agglomerative Cluster Analysis

17
Hierarchical Divisive Cluster Analysis
  • Starts with all units in one cluster and
    successively splits them
  • Successive use of K-means, or some other divisive
    technique, with n = 2
  • Either: each time, split the cluster with the
    greatest sum of squared distances
  • Or: split every cluster each time
  • Hierarchical divisive methods are good techniques,
    but rarely used (the first variant is sketched below)
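
A sketch of the first variant (the function name is mine; it reuses scikit-learn's K-means, an assumption, for the n = 2 splits):

    import numpy as np
    from sklearn.cluster import KMeans

    def divisive(X, n_clusters):
        # start with all units in one cluster (stored as index arrays)
        clusters = [np.arange(len(X))]
        while len(clusters) < n_clusters:
            # pick the cluster with the greatest sum of squared distances
            # from its members to its centroid
            ss = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
            worst = clusters.pop(int(np.argmax(ss)))
            # split it in two with K-means, n = 2
            half = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[worst])
            clusters += [worst[half == 0], worst[half == 1]]
        return clusters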

18
Hierarchical Agglomerative Cluster Analysis
  • Start with each individual unit occupying its
    own cluster
  • The clusters are then gradually merged until just
    one is left
  • The most common cluster analyses

19
Hierarchical Agglomerative Cluster Analysis
  • Works on a dissimilarity matrix
  • or a negative similarity matrix
  • may be Euclidean, Penrose, ... distances
  • At each step:
  • 1. There is a symmetric matrix of
    dissimilarities between clusters
  • 2. The two clusters with least dissimilarity are
    merged
  • 3. The dissimilarity between the new (merged)
    cluster and all others is calculated
  • Different techniques do step 3 in different
    ways (a minimal sketch of the loop follows)
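
A minimal sketch of this loop (names are assumptions; the link argument implements step 3: pass min for single linkage, max for complete, or lambda p: sum(p) / 2 for a WPGMA-style average):

    import numpy as np

    def agglomerate(D, link=min):
        D = D.astype(float).copy()
        clusters = [[i] for i in range(len(D))]
        merges = []
        while len(clusters) > 1:
            np.fill_diagonal(D, np.inf)   # never merge a cluster with itself
            # step 2: find and merge the two least-dissimilar clusters
            i, j = np.unravel_index(np.argmin(D), D.shape)
            i, j = min(i, j), max(i, j)
            merges.append((list(clusters[i]), list(clusters[j]), float(D[i, j])))
            # step 3: dissimilarity of the merged cluster to all others
            # (a plain mean of the two old values ignores cluster sizes;
            # true average linkage weights by size)
            new_row = np.array([link((D[i, k], D[j, k])) for k in range(len(D))])
            D[i, :] = new_row
            D[:, i] = new_row
            D = np.delete(np.delete(D, j, axis=0), j, axis=1)
            clusters[i] += clusters.pop(j)
        return merges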

20
Hierarchical Agglomerative Cluster Analysis
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B ? 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

How to calculate the new dissimilarities?
21
Hierarchical Agglomerative Cluster Analysis: Single Linkage
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B 0.35 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

d(AD,B) = Min{d(A,B), d(D,B)}
22
Hierarchical Agglomerative Cluster Analysis: Complete Linkage
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B 0.45 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

d(AD,B) = Max{d(A,B), d(D,B)}
23
Hierarchical Agglomerative Cluster Analysis: Average Linkage
  • A B C D E
  • A 0 . . . .
  • B 0.35 0 . . .
  • C 0.45 0.67 0 . .
  • D 0.11 0.45 0.57 0 .
  • E 0.22 0.56 0.78 0.19 0
  • AD B C E
  • AD 0 . . .
  • B 0.40 0 . .
  • C ? 0.67 0 .
  • E ? 0.56 0.78 0

d(AD,B) = Mean{d(A,B), d(D,B)} (all three update rules are checked in the sketch below)
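
Checking the three update rules against the numbers above, and handing the whole matrix to scipy (scipy.cluster.hierarchy.linkage and scipy.spatial.distance.squareform are real functions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    d_AB, d_DB = 0.35, 0.45
    print(min(d_AB, d_DB))        # 0.35 -- single linkage
    print(max(d_AB, d_DB))        # 0.45 -- complete linkage
    print((d_AB + d_DB) / 2)      # 0.40 -- average linkage

    # the full 5 x 5 dissimilarity matrix from the slides
    D = np.array([[0.00, 0.35, 0.45, 0.11, 0.22],
                  [0.35, 0.00, 0.67, 0.45, 0.56],
                  [0.45, 0.67, 0.00, 0.57, 0.78],
                  [0.11, 0.45, 0.57, 0.00, 0.19],
                  [0.22, 0.56, 0.78, 0.19, 0.00]])
    # squareform condenses the symmetric matrix; linkage builds the dendrogram
    Z = linkage(squareform(D), method='single')   # or 'complete', 'average'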
24
Hierarchical Agglomerative Cluster Analysis: Centroid Clustering
(uses a data matrix, or a true distance matrix)
  • V1 V2 V3
  • A 0.11 0.75 0.33
  • B 0.35 0.99 0.41
  • C 0.45 0.67 0.22
  • D 0.11 0.71 0.37
  • E 0.22 0.56 0.78
  • F 0.13 0.14 0.55
  • G 0.55 0.90 0.21
  • V1 V2 V3
  • AD 0.11 0.73 0.35
  • B 0.35 0.99 0.41
  • C 0.45 0.67 0.22
  • E 0.22 0.56 0.78
  • F 0.13 0.14 0.55
  • G 0.55 0.90 0.21

V1(AD) = Mean{V1(A), V1(D)} (and likewise for V2, V3; see the one-liner below)
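
The same update in numpy, using the rows for A and D from the slide (both are single-unit clusters here; for larger clusters the centroid is the size-weighted mean of the members):

    import numpy as np

    A = np.array([0.11, 0.75, 0.33])   # row A of the data matrix
    D = np.array([0.11, 0.71, 0.37])   # row D
    print((A + D) / 2)                 # [0.11, 0.73, 0.35] -- the AD row above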
25
Hierarchical Agglomerative Cluster Analysis: Ward's Method
  • Minimizes the within-cluster sum of squares
  • Similar to centroid clustering

26
        1     2     4     5     9     11    12    14    15    19    20
  1     1.00
  2     0.00  1.00
  4     0.53  0.00  1.00
  5     0.18  0.05  0.00  1.00
  9     0.22  0.09  0.13  0.25  1.00
  11    0.36  0.00  0.17  0.40  0.33  1.00
  12    0.00  0.37  0.18  0.00  0.13  0.00  1.00
  14    0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
  15    0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
  19    0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
  20    0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00

27
(No Transcript)
28
Hierarchical Agglomerative Clustering Techniques
  • Single Linkage
  • Produces straggly clusters
  • Not recommended if much experimental error
  • Used in taxonomy
  • Invariant to transformations
  • Complete Linkage
  • Produces tight clusters
  • Not recommended if much experimental error
  • Invariant to transformations
  • Average Linkage, Centroid, Ward's
  • Most likely to mimic input clusters
  • Not invariant to transformations of the
    dissimilarity measure

29
Cophenetic Correlation Coefficient (CCC)
  • Correlation between the original dissimilarity matrix
    and the dissimilarity inferred from the cluster analysis
  • CCC > 0.8 indicates a good match
  • CCC < 0.8: the dendrogram is not a good representation
  • and probably should not be displayed
  • Use the CCC to choose the best linkage method (highest
    coefficient); see the sketch below
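
A sketch with scipy's cophenet (a real function; X is assumed to be the data matrix):

    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    d = pdist(X)                      # original (condensed) dissimilarity matrix
    for method in ('single', 'complete', 'average'):
        Z = linkage(d, method=method)
        ccc, _ = cophenet(Z, d)       # correlation of d with cophenetic distances
        print(method, round(ccc, 2))  # choose the linkage with the highest CCC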

30
(Four dendrograms: CCC = 0.83, CCC = 0.77, CCC = 0.75, CCC = 0.80)
31
Additive trees
  • Dendrogram in which path lengths represent
    dissimilarities
  • Computation quite complex (cross between
    agglomerative techniques and multidimensional
    scaling)
  • Good when data are measured as similarities and
    dissimilarities
  • Often used in taxonomy and genetics

      A    B    C    D    E
  A   .    .    .    .    .
  B   14   .    .    .    .
  C   6    12   .    .    .
  D   81   7    13   .    .
  E   17   1    6    16   .
32
Problems with Cluster Analysis
  • Are there really biologically meaningful clusters
    in the data?
  • Does the dendrogram represent biological reality
    (web-of-life versus tree-of-life)?
  • How many clusters to use?
  • stopping rules are arbitrary
  • Which method to use?
  • best technique is data-dependent
  • Dendrograms become messy with many units

33
Social Structure of 160 northern bottlenose whales
34
Clustering Techniques
  Type                         Technique            Use
  Non-hierarchical             K-means              Dividing data sets
  Hierarchical divisive        Repeated K-means     Good technique on small data sets
  Hierarchical agglomerative   Single linkage       Taxonomy
                               Complete linkage     Tighter clusters
                               Average linkage,
                               Centroid, Ward's     Usually preferred
  Hierarchical                 Additive trees       Excellent for displaying
                                                    similarity/dissimilarity;
                                                    taxonomy, genetics

35
(No Transcript)