Clustering Techniques for Finding Patterns in Large Amounts of Biological Data - PowerPoint PPT Presentation
1
Clustering Techniques for Finding Patterns in
Large Amounts of Biological Data
  • Michael Steinbach
  • Department of Computer Science
  • steinbac@cs.umn.edu www.cs.umn.edu/~kumar

2
Clustering
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

3
Applications of Clustering
  • Applications
  • Gene expression clustering
  • Clustering of patients based on phenotypic and
    genotypic factors for efficient disease diagnosis
  • Market Segmentation
  • Document Clustering
  • Finding groups of driver behaviors based upon
    patterns of automobile motion (normal, drunk,
    sleepy, rush-hour driving, etc.)

Courtesy Michael Eisen
4
Notion of a Cluster can be Ambiguous
5
Similarity and Dissimilarity Measures
  • Similarity measure
  • Numerical measure of how alike two data objects
    are.
  • Is higher when objects are more alike.
  • Often falls in the range [0, 1]
  • Dissimilarity measure
  • Numerical measure of how different two data
    objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
  • Proximity refers to a similarity or dissimilarity

6
Euclidean Distance
  • Euclidean Distance
    dist(x, y) = sqrt( sum_k (x_k - y_k)^2 ),
    k = 1, ..., n
  • where n is the number of dimensions (attributes)
    and x_k and y_k are, respectively, the kth
    attributes (components) of data objects x and y.
  • Correlation
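As a concrete sketch, the two proximity measures above can be computed in a few lines of pure Python (the helper names euclidean_distance and correlation are illustrative, not from the slides):

```python
import math

def euclidean_distance(x, y):
    # dist(x, y) = sqrt(sum_k (x_k - y_k)^2)
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def correlation(x, y):
    # Pearson correlation: covariance divided by the
    # product of the standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xk - mx) * (yk - my) for xk, yk in zip(x, y))
    sx = math.sqrt(sum((xk - mx) ** 2 for xk in x))
    sy = math.sqrt(sum((yk - my) ** 2 for yk in y))
    return cov / (sx * sy)
```

Note that correlation is a similarity (higher = more alike), while Euclidean distance is a dissimilarity (lower = more alike).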

7
Density
  • Measures the degree to which data objects are
    close to each other in a specified area
  • The notion of density is closely related to that
    of proximity
  • Concept of density is typically used for
    clustering and anomaly detection
  • Examples
  • Euclidean density
  • Euclidean density = number of points per unit
    volume
  • Probability density
  • Estimate what the distribution of the data looks
    like
  • Graph-based density
  • Connectivity

8
Types of Clusterings
  • A clustering is a set of clusters
  • Important distinction between hierarchical and
    partitional sets of clusters
  • Partitional Clustering
  • A division of data objects into non-overlapping
    subsets (clusters) such that each data object is
    in exactly one subset
  • Hierarchical clustering
  • A set of nested clusters organized as a
    hierarchical tree

9
Other Distinctions Between Sets of Clusters
  • Exclusive versus non-exclusive
  • In non-exclusive clusterings, points may belong
    to multiple clusters.
  • Can represent multiple classes or border points
  • Fuzzy versus non-fuzzy
  • In fuzzy clustering, a point belongs to every
    cluster with some weight between 0 and 1
  • Weights must sum to 1
  • Probabilistic clustering has similar
    characteristics
  • Partial versus complete
  • In some cases, we only want to cluster some of
    the data
  • Heterogeneous versus homogeneous
  • Clusters of widely different sizes, shapes, and
    densities

10
Types of Clusters: Well-Separated
  • Well-Separated Clusters
  • A cluster is a set of points such that any point
    in a cluster is closer (or more similar) to every
    other point in the cluster than to any point not
    in the cluster.

3 well-separated clusters
11
Types of Clusters: Center-Based
  • Center-based
  • A cluster is a set of objects such that an
    object in a cluster is closer (more similar) to
    the center of a cluster, than to the center of
    any other cluster
  • The center of a cluster is often a centroid, the
    average of all the points in the cluster, or a
    medoid, the most representative point of a
    cluster

4 center-based clusters
12
Types of Clusters: Contiguity-Based
  • Contiguous Cluster (Nearest neighbor or
    Transitive)
  • A cluster is a set of points such that a point in
    a cluster is closer (or more similar) to one or
    more other points in the cluster than to any
    point not in the cluster.

8 contiguous clusters
13
Types of Clusters: Density-Based
  • Density-based
  • A cluster is a dense region of points, which is
    separated by low-density regions, from other
    regions of high density.
  • Used when the clusters are irregular or
    intertwined, and when noise and outliers are
    present.

6 density-based clusters
14
Clustering Algorithms
  • K-means and its variants
  • Hierarchical clustering
  • Other types of clustering

15
K-means Clustering
  • Partitional clustering approach
  • Number of clusters, K, must be specified
  • Each cluster is associated with a centroid
    (center point)
  • Each point is assigned to the cluster with the
    closest centroid
  • The basic algorithm is very simple
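The basic algorithm above might be sketched as follows in pure Python, assuming points are tuples of floats (the optional init parameter for supplying fixed starting centroids is an addition for reproducibility, not part of the slide's algorithm):

```python
import random

def kmeans(points, k, iters=100, seed=0, init=None):
    """Basic K-means: assign each point to the nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((pj - cj) ** 2
                                      for pj, cj in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: centroid = mean of assigned points
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                d = len(members[0])
                new_centroids.append(
                    tuple(sum(m[j] for m in members) / len(members)
                          for j in range(d)))
            else:
                new_centroids.append(tuple(c))  # keep empty cluster's centroid
        if new_centroids == centroids:          # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

With random initialization, different runs can converge to different clusterings, as the later slides on sub-optimal clusterings illustrate.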

16
Example of K-means Clustering
17
K-means Clustering Details
  • The centroid is (typically) the mean of the
    points in the cluster
  • Initial centroids are often chosen randomly
  • Clusters produced vary from one run to another
  • Closeness is measured by Euclidean distance,
    cosine similarity, correlation, etc
  • Complexity is O(n * K * I * d)
  • n = number of points, K = number of clusters,
    I = number of iterations, d = number of attributes

18
Evaluating K-means Clusters
  • Most common measure is Sum of Squared Error (SSE)
  • For each point, the error is the distance to its
    cluster's representative point
  • To get SSE, we square these errors and sum them
    SSE = sum_i sum_{x in C_i} dist(m_i, x)^2
  • where x is a data point in cluster C_i and m_i is
    the representative point for cluster C_i
  • Given two sets of clusters, we prefer the one
    with the smallest error
  • One easy way to reduce SSE is to increase K, the
    number of clusters
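Given the clusters and their representative points, the SSE above could be computed as follows (a sketch; the function name sse is illustrative):

```python
def sse(clusters, centroids):
    """Sum of squared error: for each point, the squared distance
    to its cluster's representative point, summed over all clusters."""
    total = 0.0
    for members, m in zip(clusters, centroids):
        for x in members:
            total += sum((xj - mj) ** 2 for xj, mj in zip(x, m))
    return total
```

Because adding clusters can only shrink each point's distance to its nearest representative, increasing K trivially reduces SSE, which is why SSE alone cannot choose K.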

19
Two different K-means Clusterings
Original Points
Sub-optimal Clustering
Optimal Clustering
20
Limitations of K-means
  • K-means has problems when clusters are of
    differing
  • Sizes
  • Densities
  • Non-globular shapes
  • K-means has problems when the data contains
    outliers.

21
Limitations of K-means: Differing Sizes
K-means (3 Clusters)
Original Points
22
Limitations of K-means: Differing Density
K-means (3 Clusters)
Original Points
23
Limitations of K-means: Non-globular Shapes
Original Points
K-means (2 Clusters)
24
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of
    merges or splits

25
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
  • Any desired number of clusters can be obtained by
    cutting the dendrogram at the proper level
  • They may correspond to meaningful taxonomies
  • Example in biological sciences (e.g., animal
    kingdom, phylogeny reconstruction, ...)

26
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) left
  • Divisive
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time

27
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
  • Merge the two closest clusters
  • Update the proximity matrix
  • Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
  • Different approaches to defining the distance
    between clusters distinguish the different
    algorithms
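The basic algorithm above might be sketched in pure Python with single (MIN) and complete (MAX) linkage as the proximity definitions; this is a naive illustration that recomputes proximities each step, not an efficient implementation:

```python
def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until k clusters remain.
    linkage='single' uses MIN distance, 'complete' uses MAX."""
    clusters = [[p] for p in points]  # each point starts as its own cluster

    def dist2(a, b):
        # squared Euclidean distance; min/max ordering is unchanged
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def cluster_dist(ca, cb):
        pair_dists = [dist2(a, b) for a in ca for b in cb]
        return min(pair_dists) if linkage == "single" else max(pair_dists)

    while len(clusters) > k:
        # find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the pair
        del clusters[j]
    return clusters
```

Swapping cluster_dist for a group-average or centroid computation yields the other linkage variants discussed below.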

28
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

Proximity Matrix
29
Intermediate Situation
  • After some merging steps, we have some clusters

(Figure: current clusters C1, C2, C3, C4, C5 and
their proximity matrix)
30
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

(Figure: clusters C1-C5, with the closest pair C2
and C5 highlighted, and the proximity matrix)
31
After Merging
  • The question is: how do we update the proximity
    matrix?

After merging, the rows and columns for C2 and C5 in
the proximity matrix are replaced by a single row and
column for the new cluster C2 U C5; its proximities
to C1, C3, and C4 (the "?" entries) must be
recomputed.

Proximity Matrix
          C1   C2 U C5   C3   C4
C1              ?
C2 U C5    ?    ?         ?    ?
C3              ?
C4              ?
32
How to Define Inter-Cluster Similarity?
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
  • Ward's Method uses squared error

Proximity Matrix
37
Strength of MIN
Original Points
Six Clusters
  • Can handle non-elliptical shapes

38
Limitations of MIN
Two Clusters
Original Points
  • Sensitive to noise and outliers

Three Clusters
39
Strength of MAX
Original Points
Two Clusters
  • Less susceptible to noise and outliers

40
Limitations of MAX
Original Points
Two Clusters
  • Tends to break large clusters
  • Biased towards globular clusters

41
Other Types of Clustering Algorithms
  • Hundreds of clustering algorithms
  • Some clustering algorithms
  • K-means
  • Hierarchical
  • Statistically based clustering algorithms
  • Mixture model based clustering
  • Fuzzy clustering
  • Self-organizing Maps (SOM)
  • Density-based (DBSCAN)
  • Proper choice of algorithms depends on the type
    of clusters to be found, the type of data, and
    the objective

42
Cluster Validity
  • For supervised classification we have a variety
    of measures to evaluate how good our model is
  • Accuracy, precision, recall
  • For cluster analysis, the analogous question is
    how to evaluate the goodness of the resulting
    clusters?
  • But clusters are in the eye of the beholder!
  • Then why do we want to evaluate them?
  • To avoid finding patterns in noise
  • To compare clustering algorithms
  • To compare two sets of clusters
  • To compare two clusters

43
Clusters found in Random Data
Random Points
44
Different Aspects of Cluster Validation
  • Distinguishing whether non-random structure
    actually exists in the data
  • Comparing the results of a cluster analysis to
    externally known results, e.g., to externally
    given class labels
  • Evaluating how well the results of a cluster
    analysis fit the data without reference to
    external information
  • Comparing the results of two different sets of
    cluster analyses to determine which is better
  • Determining the correct number of clusters

45
Using Similarity Matrix for Cluster Validation
  • Order the similarity matrix with respect to
    cluster labels and inspect visually.
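One way to sketch this check in code: reorder the points by cluster label and build the pairwise similarity matrix, which should look block-diagonal for a good clustering (the inverse-distance similarity used here is an illustrative choice, not the slides' specific measure):

```python
def ordered_similarity_matrix(points, labels):
    """Reorder points by cluster label, then build the pairwise
    similarity matrix; good clusterings show a block-diagonal
    pattern of high similarities along the diagonal."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    sim = []
    for i in order:
        row = []
        for j in order:
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            row.append(1.0 / (1.0 + d2))  # map distance into (0, 1]
        sim.append(row)
    return sim
```

For random data the sorted matrix shows no such blocks, which is the point of the "not so crisp" slides that follow.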

46
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

DBSCAN
47
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

K-means
48
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

Complete Link
49
Using Similarity Matrix for Cluster Validation
DBSCAN
50
Measures of Cluster Validity
  • Numerical measures applied to judge various
    aspects of cluster validity are classified into
    the following three types of indices
  • External Index: used to measure the extent to
    which cluster labels match externally supplied
    class labels.
  • Example: Entropy
  • Internal Index: used to measure the goodness of
    a clustering structure without respect to
    external information.
  • Example: Sum of Squared Error (SSE)
  • Relative Index: used to compare two different
    clusterings or clusters.
  • Often an external or internal index is used for
    this function, e.g., SSE or entropy

51
Internal Measures: Cohesion and Separation
  • Cluster Cohesion: measures how closely related
    the objects in a cluster are
  • Example: SSE
  • Cluster Separation: measures how distinct or
    well-separated a cluster is from other clusters
  • Example: squared error
  • Cohesion is measured by the within-cluster sum of
    squares (SSE)
    WSS = sum_i sum_{x in C_i} (x - m_i)^2
  • Separation is measured by the between-cluster sum
    of squares
    BSS = sum_i |C_i| (m - m_i)^2
  • where |C_i| is the size of cluster i, m_i is its
    centroid, and m is the overall mean of the data
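A small sketch of the two quantities, assuming clusters are lists of point tuples (a useful sanity check: WSS and BSS sum to the total sum of squares about the overall mean):

```python
def wss_bss(clusters):
    """Within-cluster (cohesion) and between-cluster (separation)
    sums of squares for points grouped into clusters."""
    all_pts = [p for c in clusters for p in c]
    dims = len(all_pts[0])
    # overall mean of the data
    mean = tuple(sum(p[d] for p in all_pts) / len(all_pts)
                 for d in range(dims))
    wss = bss = 0.0
    for c in clusters:
        # centroid of this cluster
        m = tuple(sum(p[d] for p in c) / len(c) for d in range(dims))
        wss += sum(sum((p[d] - m[d]) ** 2 for d in range(dims)) for p in c)
        bss += len(c) * sum((mean[d] - m[d]) ** 2 for d in range(dims))
    return wss, bss
```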

52
Internal Measures: Silhouette Coefficient
  • The Silhouette Coefficient combines ideas of both
    cohesion and separation, but for individual
    points, as well as clusters and clusterings
  • For an individual point, i
  • Calculate a = average distance of i to the points
    in its cluster
  • Calculate b = min (average distance of i to
    points in another cluster)
  • The silhouette coefficient for a point is then
    given by s = (b - a) / max(a, b)
  • Typically between 0 and 1.
  • The closer to 1 the better.
  • Can calculate the average silhouette coefficient
    for a cluster or a clustering
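The per-point computation above could be sketched as follows (pure Python; the function name silhouette is illustrative):

```python
import math

def silhouette(i, labels, points):
    """Silhouette of point i: s = (b - a) / max(a, b), where a is the
    mean distance to the other points of its own cluster and b is the
    minimum, over other clusters, of the mean distance to that cluster."""
    def dist(p, q):
        return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

    own = labels[i]
    by_label = {}
    for j, lab in enumerate(labels):
        if j != i:
            by_label.setdefault(lab, []).append(points[j])
    a = sum(dist(points[i], q) for q in by_label[own]) / len(by_label[own])
    b = min(sum(dist(points[i], q) for q in qs) / len(qs)
            for lab, qs in by_label.items() if lab != own)
    return (b - a) / max(a, b)
```

Averaging s over all points of a cluster (or all points of the data) gives the cluster-level and clustering-level scores mentioned above.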

53
External Measures of Cluster Validity: Entropy and Purity
54
Clustering of ESTs in Protein Coding Database
(Diagram: laboratory experiments yield a new protein,
which is matched by similarity against clusters of
short segments of protein-coding sequences (ESTs)
derived from known proteins, to infer the
functionality of the protein)
Researchers: John Carlis, John Riedl, Ernest Retzel,
Elizabeth Shoop
55
Expressed Sequence Tags (EST)
  • Generate short segments of protein-coding
    sequences (EST).
  • Match ESTs against known proteins using
    similarity matching algorithms.
  • Find clusters of ESTs that have the same
    functionality.
  • Match new protein against the EST clusters.
  • Experimentally verify only the functionality of
    the proteins represented by the matching EST
    clusters

56
EST Clusters by Hypergraph-Based Scheme
  • 662 different items corresponding to ESTs.
  • 11,986 variables corresponding to known proteins
  • Found 39 clusters
  • 12 clean clusters, each corresponding to a single
    protein family (113 ESTs)
  • 6 clusters with two protein families
  • 7 clusters with three protein families
  • 3 clusters with four protein families
  • 6 clusters with five protein families
  • Runtime was less than 5 minutes.

57
Clustering Microarray Data
  • Microarray analysis allows the monitoring of the
    activities of many genes over many different
    conditions
  • Data: expression profiles of approximately 3606
    genes of E. coli are recorded for 30 experimental
    conditions
  • The SAM (Significance Analysis of Microarrays)
    package from Stanford University is used to
    analyze the data and to identify the genes that
    are substantially differentially upregulated in
    the dataset; 17 such genes are identified for
    study purposes
  • Hierarchical clustering is performed and plotted
    using TreeView

58
Clustering Microarray Data
59
CLUTO for Clustering for Microarray Data
  • CLUTO (Clustering Toolkit), George Karypis (UofM):
    http://glaros.dtc.umn.edu/gkhome/views/cluto/
  • CLUTO can also be used for clustering microarray
    data

60
Issues in Clustering Expression Data
  • Similarity is computed using all the conditions
  • We are typically interested in sets of genes that
    are similar for a relatively small set of
    conditions
  • Most clustering approaches assume that an object
    can only be in one cluster
  • A gene may belong to more than one functional
    group
  • Thus, overlapping groups are needed
  • Can either use clustering that takes these
    factors into account or use other techniques
  • For example, association analysis

61
Clustering Packages
  • Mathematical and Statistical Packages
  • MATLAB
  • SAS
  • SPSS
  • R
  • CLUTO (Clustering Toolkit), George Karypis (UM):
    http://glaros.dtc.umn.edu/gkhome/views/cluto/
  • Cluster, Michael Eisen (LBNL/UCB) (microarray):
    http://rana.lbl.gov/EisenSoftware.htm
    http://genome-www5.stanford.edu/resources/restech.shtml
    (more microarray clustering algorithms)
  • Many others
  • KDNuggets:
    http://www.kdnuggets.com/software/clustering.html

62
Data Mining Book
For further details and sample chapters,
see www.cs.umn.edu/~kumar/dmbook