# Clustering Techniques for Finding Patterns in Large Amounts of Biological Data


1
Clustering Techniques for Finding Patterns in
Large Amounts of Biological Data
• Michael Steinbach
• Department of Computer Science
• steinbac@cs.umn.edu  www.cs.umn.edu/~kumar

2
Clustering
• Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups

3
Applications of Clustering
• Applications
• Gene expression clustering
• Clustering of patients based on phenotypic and
genotypic factors for efficient disease diagnosis
• Market Segmentation
• Document Clustering
• Finding groups of driver behaviors based upon
patterns of automobile motions (normal, drunken,
sleepy, rush hour driving, etc.)

Courtesy Michael Eisen
4
Notion of a Cluster can be Ambiguous
5
Similarity and Dissimilarity Measures
• Similarity measure
• Numerical measure of how alike two data objects are
• Higher when objects are more alike
• Often falls in the range [0, 1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
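A dissimilarity can be converted into a similarity when an algorithm expects one. A minimal sketch; the reciprocal mapping used here is one common choice, not the only one (e.g., exp(-d) is also used):

```python
def to_similarity(d):
    """Map a dissimilarity d in [0, inf) to a similarity in (0, 1].
    Identical objects (d = 0) get the maximum similarity of 1."""
    return 1.0 / (1.0 + d)

print(to_similarity(0.0))  # 1.0 -- identical objects
print(to_similarity(4.0))  # 0.2 -- less alike, lower similarity
```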

6
Euclidean Distance
• Euclidean Distance: dist(x, y) = sqrt( sum_{k=1}^{n} (x_k - y_k)^2 )
• where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
• Correlation
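Both measures can be sketched in a few lines of Python (a minimal illustration, not code from the slides; the example points are made up):

```python
import math

def euclidean(x, y):
    # dist(x, y) = sqrt(sum over the n attributes of (x_k - y_k)^2)
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def correlation(x, y):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations (here as unnormalized sums, which cancel)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xk - mx) * (yk - my) for xk, yk in zip(x, y))
    sx = math.sqrt(sum((xk - mx) ** 2 for xk in x))
    sy = math.sqrt(sum((yk - my) ** 2 for yk in y))
    return cov / (sx * sy)

print(euclidean((0, 0), (3, 4)))          # 5.0
print(correlation((1, 2, 3), (2, 4, 6)))  # 1.0 -- perfectly correlated
```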

7
Density
• Measures the degree to which data objects are
close to each other in a specified area
• The notion of density is closely related to that
of proximity
• Concept of density is typically used for
clustering and anomaly detection
• Examples
• Euclidean density
• Euclidean density = number of points per unit volume
• Probability density
• Estimate what the distribution of the data looks
like
• Graph-based density
• Connectivity

8
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
• A set of nested clusters organized as a
hierarchical tree

9
Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
• In non-exclusive clusterings, points may belong
to multiple clusters.
• Can represent multiple classes or border points
• Fuzzy versus non-fuzzy
• In fuzzy clustering, a point belongs to every
cluster with some weight between 0 and 1
• Weights must sum to 1
• Probabilistic clustering has similar
characteristics
• Partial versus complete
• In some cases, we only want to cluster some of
the data
• Heterogeneous versus homogeneous
• Clusters of widely different sizes, shapes, and
densities

10
Types of Clusters: Well-Separated
• Well-Separated Clusters
• A cluster is a set of points such that any point
in a cluster is closer (or more similar) to every
other point in the cluster than to any point not
in the cluster.

3 well-separated clusters
11
Types of Clusters: Center-Based
• Center-based
• A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its cluster than to the center of any other cluster
• The center of a cluster is often a centroid, the
average of all the points in the cluster, or a
medoid, the most representative point of a
cluster

4 center-based clusters
12
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or
Transitive)
• A cluster is a set of points such that a point in
a cluster is closer (or more similar) to one or
more other points in the cluster than to any
point not in the cluster.

8 contiguous clusters
13
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points, which is
separated by low-density regions, from other
regions of high density.
• Used when the clusters are irregular or
intertwined, and when noise and outliers are
present.

6 density-based clusters
14
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Other types of clustering

15
K-means Clustering
• Partitional clustering approach
• Number of clusters, K, must be specified
• Each cluster is associated with a centroid
(center point)
• Each point is assigned to the cluster with the
closest centroid
• The basic algorithm is very simple
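The basic algorithm can be sketched in plain Python (a minimal illustration, not the lecture's code; the dataset and seed are made up for the example):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means: pick k random initial centroids, then alternate
    (1) assigning each point to its closest centroid and
    (2) recomputing each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # step 1: assign each point to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # step 2: recompute centroids (keep the old one if a cluster is empty)
        new = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:   # assignments are stable -> converged
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)  # separates the two blobs
```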

16
Example of K-means Clustering
17
K-means Clustering Details
• The centroid is (typically) the mean of the
points in the cluster
• Initial centroids are often chosen randomly
• Clusters produced vary from one run to another
• Closeness is measured by Euclidean distance,
cosine similarity, correlation, etc
• Complexity is O(n · K · I · d)
• n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

18
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster center
• To get SSE, we square these errors and sum them: SSE = sum_{i=1}^{K} sum_{x in C_i} dist(m_i, x)^2
• x is a data point in cluster C_i and m_i is the representative point for cluster C_i
• Given two sets of clusters, we prefer the one
with the smallest error
• One easy way to reduce SSE is to increase K, the
number of clusters
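The SSE computation can be sketched as follows (a minimal illustration; the two example clusterings are invented to show that the clustering with tighter clusters gets the smaller error):

```python
def sse(clusters):
    """Sum of squared error: for each point, the squared Euclidean
    distance to its cluster's representative point (the centroid)."""
    total = 0.0
    for cluster in clusters:
        # centroid m_i = mean of the points in cluster C_i
        m = tuple(sum(vals) / len(cluster) for vals in zip(*cluster))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in cluster)
    return total

good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]   # tight clusters
bad  = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]   # spread-out clusters
print(sse(good))  # 4.0 -- smaller error, preferred clustering
print(sse(bad))   # 100.0
```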

19
Two different K-means Clusterings
Original Points
Sub-optimal Clustering
Optimal Clustering
20
Limitations of K-means
• K-means has problems when clusters are of
differing
• Sizes
• Densities
• Non-globular shapes
• K-means has problems when the data contains
outliers.

21
Limitations of K-means: Differing Sizes
K-means (3 Clusters)
Original Points
22
Limitations of K-means: Differing Density
K-means (3 Clusters)
Original Points
23
Limitations of K-means: Non-globular Shapes
Original Points
K-means (2 Clusters)
24
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
• A tree-like diagram that records the sequences of merges or splits

25
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of
clusters
• Any desired number of clusters can be obtained by
cutting the dendrogram at the proper level
• They may correspond to meaningful taxonomies
• Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction)

26
Hierarchical Clustering
• Two main types of hierarchical clustering
• Agglomerative
• At each step, merge the closest pair of clusters
until only one cluster (or k clusters) left
• Divisive
• At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a
similarity or distance matrix
• Merge or split one cluster at a time

27
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
• Compute the proximity matrix
• Let each data point be a cluster
• Repeat
• Merge the two closest clusters
• Update the proximity matrix
• Until only a single cluster remains
• Key operation is the computation of the proximity
of two clusters
• Different approaches to defining the distance
between clusters distinguish the different
algorithms
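The basic agglomerative algorithm above can be sketched in plain Python, here using the single-link (MIN) definition of cluster proximity (a naive illustration, not production code; the points are made up):

```python
def agglomerative(points, k, dist):
    """Naive agglomerative clustering: let each data point be a cluster,
    then repeatedly merge the two closest clusters until k remain.
    Proximity of two clusters = smallest pairwise distance (single link)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link proximity
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

d2 = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(pts, 2, d2))  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

A real implementation would update a proximity matrix incrementally rather than rescanning all pairs, which is exactly the "update the proximity matrix" step on the following slides.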

28
Starting Situation
(Figure: the initial proximity matrix, with each data point as its own cluster)
29
Intermediate Situation
• After some merging steps, we have some clusters

(Figure: five clusters C1–C5 and their proximity matrix)
30
Intermediate Situation
• We want to merge the two closest clusters (C2 and
C5) and update the proximity matrix.

(Figure: clusters C1–C5, with C2 and C5 about to be merged, and their proximity matrix)
31
After Merging
• The question is: how do we update the proximity matrix?

(Figure: the proximity matrix after merging C2 and C5; the row and column for the new cluster C2 ∪ C5 are marked "?")
32
How to Define Inter-Cluster Distance/Similarity?
• MIN (single link)
• MAX (complete link)
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
• Ward's Method uses squared error
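The first four proximity definitions can be sketched as functions on two clusters of points (a minimal illustration with invented clusters; Ward's method is omitted because it depends on the change in squared error after a merge):

```python
import math

def dist(p, q):
    return math.dist(p, q)  # Euclidean distance (Python 3.8+)

def single_link(A, B):    # MIN: closest pair across the two clusters
    return min(dist(p, q) for p in A for q in B)

def complete_link(A, B):  # MAX: farthest pair across the two clusters
    return max(dist(p, q) for p in A for q in B)

def group_average(A, B):  # mean of all pairwise distances
    return sum(dist(p, q) for p in A for q in B) / (len(A) * len(B))

def centroid_dist(A, B):  # distance between the cluster centroids
    ca = tuple(sum(vals) / len(A) for vals in zip(*A))
    cb = tuple(sum(vals) / len(B) for vals in zip(*B))
    return dist(ca, cb)

A, B = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
print(single_link(A, B))    # 4.0
print(complete_link(A, B))  # ~4.47
print(group_average(A, B))  # between MIN and MAX
print(centroid_dist(A, B))  # 4.0
```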
37
Strength of MIN
Original Points
Six Clusters
• Can handle non-elliptical shapes

38
Limitations of MIN
Two Clusters
Original Points
• Sensitive to noise and outliers

Three Clusters
39
Strength of MAX
Original Points
Two Clusters
• Less susceptible to noise and outliers

40
Limitations of MAX
Original Points
Two Clusters
• Tends to break large clusters
• Biased towards globular clusters

41
Other Types of Clustering Algorithms
• Hundreds of clustering algorithms
• Some clustering algorithms
• K-means
• Hierarchical
• Statistically based clustering algorithms
• Mixture model based clustering
• Fuzzy clustering
• Self-organizing Maps (SOM)
• Density-based (DBSCAN)
• Proper choice of algorithms depends on the type
of clusters to be found, the type of data, and
the objective

42
Cluster Validity
• For supervised classification we have a variety
of measures to evaluate how good our model is
• Accuracy, precision, recall
• For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters?
• But clusters are in the eye of the beholder!
• Then why do we want to evaluate them?
• To avoid finding patterns in noise
• To compare clustering algorithms
• To compare two sets of clusters
• To compare two clusters

43
Clusters found in Random Data
Random Points
44
Different Aspects of Cluster Validation
• Distinguishing whether non-random structure
actually exists in the data
• Comparing the results of a cluster analysis to
externally known results, e.g., to externally
given class labels
• Evaluating how well the results of a cluster
analysis fit the data without reference to
external information
• Comparing the results of two different sets of
cluster analyses to determine which is better
• Determining the correct number of clusters

45
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to
cluster labels and inspect visually.

46
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp

DBSCAN
47
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp

K-means
49
Using Similarity Matrix for Cluster Validation
DBSCAN
50
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types of indices:
• External Index: Used to measure the extent to which cluster labels match externally supplied class labels.
• Entropy
• Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
• Sum of Squared Error (SSE)
• Relative Index: Used to compare two different clusterings or clusters.
• Often an external or internal index is used for this function, e.g., SSE or entropy

51
Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related the objects in a cluster are
• Example: SSE
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
• Cohesion is measured by the within-cluster sum of squares: SSE = sum_i sum_{x in C_i} (x - m_i)^2
• Separation is measured by the between-cluster sum of squares: BSS = sum_i |C_i| (m - m_i)^2
• where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean

52
Internal Measures: Silhouette Coefficient
• The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
• For an individual point, i:
• Calculate a = average distance of i to the points in its cluster
• Calculate b = min (average distance of i to points in another cluster)
• The silhouette coefficient for the point is then given by s = (b − a) / max(a, b)
• Typically between 0 and 1
• The closer to 1 the better
• Can calculate the average silhouette coefficient for a cluster or a clustering
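The per-point computation can be sketched as follows (a minimal illustration; the points and labels are made up, and each cluster is assumed to have at least two points):

```python
def silhouette(point_idx, labels, points, dist):
    """Silhouette coefficient s = (b - a) / max(a, b) for one point:
    a = average distance to the other points in its own cluster,
    b = min over other clusters of the average distance to that cluster."""
    own = labels[point_idx]

    def avg_dist(cluster):
        members = [points[j] for j, l in enumerate(labels)
                   if l == cluster and j != point_idx]
        return sum(dist(points[point_idx], m) for m in members) / len(members)

    a = avg_dist(own)
    b = min(avg_dist(c) for c in set(labels) if c != own)
    return (b - a) / max(a, b)

d = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5  # Euclidean
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
print(silhouette(0, labels, pts, d))  # close to 1: a well-clustered point
```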

53
External Measures of Cluster Validity: Entropy and Purity
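Both external measures can be sketched given, for each cluster, the true class labels of its members (a minimal illustration with invented labels; entropy here is the weighted average of per-cluster class entropies, where lower is better, while purity is higher-is-better):

```python
import math
from collections import Counter

def purity(clusters):
    """Each cluster is a list of true class labels; purity is the
    fraction of points assigned to their cluster's majority class."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def entropy(clusters):
    """Weighted average over clusters of the class-label entropy
    within each cluster (0 means every cluster is pure)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((k / len(c)) * math.log2(k / len(c))
                 for k in Counter(c).values())
        total += (len(c) / n) * h
    return total

clusters = [["a", "a", "a", "b"], ["b", "b", "c", "c"]]
print(purity(clusters))   # 0.625
print(entropy(clusters))  # ~0.906
```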
54
Clustering of ESTs in Protein Coding Database
• Researchers: John Carlis, John Riedl, Ernest Retzel, Elizabeth Shoop
(Figure: flowchart — a new protein is compared by similarity match against known proteins and clusters of short segments of protein-coding sequences (ESTs); laboratory experiments establish the functionality of the protein)
55
Expressed Sequence Tags (EST)
• Generate short segments of protein-coding
sequences (EST).
• Match ESTs against known proteins using
similarity matching algorithms.
• Find clusters of ESTs that have the same functionality.
• Match new protein against the EST clusters.
• Experimentally verify only the functionality of
the proteins represented by the matching EST
clusters

56
EST Clusters by Hypergraph-Based Scheme
• 662 different items corresponding to ESTs.
• 11,986 variables corresponding to known proteins
• Found 39 clusters
• 12 clean clusters, each corresponding to a single protein family (113 ESTs)
• 6 clusters with two protein families
• 7 clusters with three protein families
• 3 clusters with four protein families
• 6 clusters with five protein families
• Runtime was less than 5 minutes.

57
Clustering Microarray Data
• Microarray analysis allows the monitoring of the
activities of many genes over many different
conditions
• Data: Expression profiles of approximately 3606 genes of E. coli are recorded for 30 experimental conditions
• The SAM (Significance Analysis of Microarrays) package from Stanford University is used for the analysis of the data and to identify the genes that are substantially differentially upregulated in the dataset; 17 such genes are identified for study purposes
• Hierarchical clustering is performed and plotted
using TreeView

58
Clustering Microarray Data
59
CLUTO for Clustering Microarray Data
• CLUTO (Clustering Toolkit), George Karypis (UofM): http://glaros.dtc.umn.edu/gkhome/views/cluto/
• CLUTO can also be used for clustering microarray
data

60
Issues in Clustering Expression Data
• Similarity uses all the conditions
• We are typically interested in sets of genes that
are similar for a relatively small set of
conditions
• Most clustering approaches assume that an object
can only be in one cluster
• A gene may belong to more than one functional
group
• Thus, overlapping groups are needed
• Can either use clustering that takes these
factors into account or use other techniques
• For example, association analysis

61
Clustering Packages
• Mathematical and Statistical Packages
• MATLAB
• SAS
• SPSS
• R
• CLUTO (Clustering Toolkit), George Karypis (UM): http://glaros.dtc.umn.edu/gkhome/views/cluto/
• Cluster, Michael Eisen (LBNL/UCB) (microarray): http://rana.lbl.gov/EisenSoftware.htm
• http://genome-www5.stanford.edu/resources/restech.shtml (more microarray clustering algorithms)
• Many others
• KDnuggets: http://www.kdnuggets.com/software/clustering.html

62
Data Mining Book
For further details and sample chapters,
see www.cs.umn.edu/~kumar/dmbook