Clustering Techniques for Finding Patterns in Large Amounts of Biological Data - PowerPoint PPT Presentation
1
Clustering Techniques for Finding Patterns in
Large Amounts of Biological Data
  • Michael Steinbach
  • Department of Computer Science
  • steinbac@cs.umn.edu www.cs.umn.edu/~kumar

2
Clustering
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

3
Applications of Clustering
  • Applications
  • Gene expression clustering
  • Clustering of patients based on phenotypic and
    genotypic factors for efficient disease diagnosis
  • Market Segmentation
  • Document Clustering
  • Finding groups of driver behaviors based upon
    patterns of automobile motion (normal, drunk,
    sleepy, rush-hour driving, etc.)

Courtesy Michael Eisen
4
Notion of a Cluster can be Ambiguous
5
Similarity and Dissimilarity Measures
  • Similarity measure
  • Numerical measure of how alike two data objects
    are.
  • Is higher when objects are more alike.
  • Often falls in the range [0, 1]
  • Dissimilarity measure
  • Numerical measure of how different two data
    objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
  • Proximity refers to a similarity or dissimilarity

6
Euclidean Distance
  • Euclidean Distance
    dist(x, y) = sqrt( sum_k (x_k - y_k)^2 ),
    k = 1, ..., n
  • where n is the number of dimensions (attributes)
    and x_k and y_k are, respectively, the kth
    attributes (components) of data objects x and y.
  • Correlation
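As a concrete sketch, the two proximity measures above can be computed in a few lines of pure Python (the helper names euclidean_distance and correlation are illustrative, not from the slides):

```python
import math

def euclidean_distance(x, y):
    # dist(x, y) = sqrt(sum_k (x_k - y_k)^2)
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def correlation(x, y):
    # Pearson correlation: covariance divided by the
    # product of the standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xk - mx) * (yk - my) for xk, yk in zip(x, y))
    sx = math.sqrt(sum((xk - mx) ** 2 for xk in x))
    sy = math.sqrt(sum((yk - my) ** 2 for yk in y))
    return cov / (sx * sy)
```

Note that correlation is a similarity (higher = more alike), while Euclidean distance is a dissimilarity (lower = more alike).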

7
Density
  • Measures the degree to which data objects are
    close to each other in a specified area
  • The notion of density is closely related to that
    of proximity
  • Concept of density is typically used for
    clustering and anomaly detection
  • Examples
  • Euclidean density
  • Euclidean density = number of points per unit
    volume
  • Probability density
  • Estimate what the distribution of the data looks
    like
  • Graph-based density
  • Connectivity

8
Types of Clusterings
  • A clustering is a set of clusters
  • Important distinction between hierarchical and
    partitional sets of clusters
  • Partitional Clustering
  • A division of data objects into non-overlapping
    subsets (clusters) such that each data object is
    in exactly one subset
  • Hierarchical clustering
  • A set of nested clusters organized as a
    hierarchical tree

9
Other Distinctions Between Sets of Clusters
  • Exclusive versus non-exclusive
  • In non-exclusive clusterings, points may belong
    to multiple clusters.
  • Can represent multiple classes or border points
  • Fuzzy versus non-fuzzy
  • In fuzzy clustering, a point belongs to every
    cluster with some weight between 0 and 1
  • Weights must sum to 1
  • Probabilistic clustering has similar
    characteristics
  • Partial versus complete
  • In some cases, we only want to cluster some of
    the data
  • Heterogeneous versus homogeneous
  • Clusters of widely different sizes, shapes, and
    densities

10
Types of Clusters: Well-Separated
  • Well-Separated Clusters
  • A cluster is a set of points such that any point
    in a cluster is closer (or more similar) to every
    other point in the cluster than to any point not
    in the cluster.

3 well-separated clusters
11
Types of Clusters: Center-Based
  • Center-based
  • A cluster is a set of objects such that an
    object in a cluster is closer (more similar) to
    the center of a cluster, than to the center of
    any other cluster
  • The center of a cluster is often a centroid, the
    average of all the points in the cluster, or a
    medoid, the most representative point of a
    cluster

4 center-based clusters
12
Types of Clusters: Contiguity-Based
  • Contiguous Cluster (Nearest neighbor or
    Transitive)
  • A cluster is a set of points such that a point in
    a cluster is closer (or more similar) to one or
    more other points in the cluster than to any
    point not in the cluster.

8 contiguous clusters
13
Types of Clusters: Density-Based
  • Density-based
  • A cluster is a dense region of points, which is
    separated by low-density regions, from other
    regions of high density.
  • Used when the clusters are irregular or
    intertwined, and when noise and outliers are
    present.

6 density-based clusters
14
Clustering Algorithms
  • K-means and its variants
  • Hierarchical clustering
  • Other types of clustering

15
K-means Clustering
  • Partitional clustering approach
  • Number of clusters, K, must be specified
  • Each cluster is associated with a centroid
    (center point)
  • Each point is assigned to the cluster with the
    closest centroid
  • The basic algorithm is very simple
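The basic algorithm above might be sketched as follows in pure Python, assuming points are tuples of floats (the optional init parameter for supplying fixed starting centroids is an addition for reproducibility, not part of the slide's algorithm):

```python
import random

def kmeans(points, k, iters=100, seed=0, init=None):
    """Basic K-means: assign each point to the nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((pj - cj) ** 2
                                      for pj, cj in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: centroid = mean of assigned points
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                d = len(members[0])
                new_centroids.append(
                    tuple(sum(m[j] for m in members) / len(members)
                          for j in range(d)))
            else:
                new_centroids.append(tuple(c))  # keep empty cluster's centroid
        if new_centroids == centroids:          # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

With random initialization, different runs can converge to different clusterings, as the later slides on sub-optimal clusterings illustrate.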

16
Example of K-means Clustering
17
K-means Clustering Details
  • The centroid is (typically) the mean of the
    points in the cluster
  • Initial centroids are often chosen randomly
  • Clusters produced vary from one run to another
  • Closeness is measured by Euclidean distance,
    cosine similarity, correlation, etc
  • Complexity is O(n * K * I * d)
  • n = number of points, K = number of clusters,
    I = number of iterations, d = number of attributes

18
Evaluating K-means Clusters
  • Most common measure is Sum of Squared Error (SSE)
  • For each point, the error is the distance to its
    cluster's representative point
  • To get SSE, we square these errors and sum them
    SSE = sum_i sum_{x in C_i} dist(m_i, x)^2
  • where x is a data point in cluster C_i and m_i is
    the representative point for cluster C_i
  • Given two sets of clusters, we prefer the one
    with the smallest error
  • One easy way to reduce SSE is to increase K, the
    number of clusters
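Given the clusters and their representative points, the SSE above could be computed as follows (a sketch; the function name sse is illustrative):

```python
def sse(clusters, centroids):
    """Sum of squared error: for each point, the squared distance
    to its cluster's representative point, summed over all clusters."""
    total = 0.0
    for members, m in zip(clusters, centroids):
        for x in members:
            total += sum((xj - mj) ** 2 for xj, mj in zip(x, m))
    return total
```

Because adding clusters can only shrink each point's distance to its nearest representative, increasing K trivially reduces SSE, which is why SSE alone cannot choose K.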

19
Two different K-means Clusterings
Original Points
Sub-optimal Clustering
Optimal Clustering
20
Limitations of K-means
  • K-means has problems when clusters are of
    differing
  • Sizes
  • Densities
  • Non-globular shapes
  • K-means has problems when the data contains
    outliers.

21
Limitations of K-means: Differing Sizes
K-means (3 Clusters)
Original Points
22
Limitations of K-means: Differing Density
K-means (3 Clusters)
Original Points
23
Limitations of K-means: Non-globular Shapes
Original Points
K-means (2 Clusters)
24
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of
    merges or splits

25
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
  • Any desired number of clusters can be obtained by
    cutting the dendrogram at the proper level
  • They may correspond to meaningful taxonomies
  • Example in biological sciences (e.g., animal
    kingdom, phylogeny reconstruction, ...)

26
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) left
  • Divisive
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time

27
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
  • Merge the two closest clusters
  • Update the proximity matrix
  • Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
  • Different approaches to defining the distance
    between clusters distinguish the different
    algorithms
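The basic algorithm above might be sketched in pure Python with single (MIN) and complete (MAX) linkage as the proximity definitions; this is a naive illustration that recomputes proximities each step, not an efficient implementation:

```python
def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until k clusters remain.
    linkage='single' uses MIN distance, 'complete' uses MAX."""
    clusters = [[p] for p in points]  # each point starts as its own cluster

    def dist2(a, b):
        # squared Euclidean distance; min/max ordering is unchanged
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def cluster_dist(ca, cb):
        pair_dists = [dist2(a, b) for a in ca for b in cb]
        return min(pair_dists) if linkage == "single" else max(pair_dists)

    while len(clusters) > k:
        # find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the pair
        del clusters[j]
    return clusters
```

Swapping cluster_dist for a group-average or centroid computation yields the other linkage variants discussed below.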

28
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

Proximity Matrix
29
Intermediate Situation
  • After some merging steps, we have some clusters

(Figure: current clusters C1, C2, C3, C4, C5 and
their proximity matrix)
30
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

(Figure: clusters C1-C5, with the closest pair C2
and C5 highlighted, and the proximity matrix)
31
After Merging
  • The question is: how do we update the proximity
    matrix?

After merging, the rows and columns for C2 and C5 in
the proximity matrix are replaced by a single row and
column for the new cluster C2 U C5; its proximities
to C1, C3, and C4 (the "?" entries) must be
recomputed.

Proximity Matrix
          C1   C2 U C5   C3   C4
C1              ?
C2 U C5    ?    ?         ?    ?
C3              ?
C4              ?
32
How to Define Inter-Cluster Similarity?
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
  • Ward's Method uses squared error

Proximity Matrix
37
Strength of MIN
Original Points
Six Clusters
  • Can handle non-elliptical shapes

38
Limitations of MIN
Two Clusters
Original Points
  • Sensitive to noise and outliers

Three Clusters
39
Strength of MAX
Original Points
Two Clusters
  • Less susceptible to noise and outliers

40
Limitations of MAX
Original Points
Two Clusters
  • Tends to break large clusters
  • Biased towards globular clusters

41
Other Types of Clustering Algorithms
  • Hundreds of clustering algorithms
  • Some clustering algorithms
  • K-means
  • Hierarchical
  • Statistically based clustering algorithms
  • Mixture model based clustering
  • Fuzzy clustering
  • Self-organizing Maps (SOM)
  • Density-based (DBSCAN)
  • Proper choice of algorithms depends on the type
    of clusters to be found, the type of data, and
    the objective

42
Cluster Validity
  • For supervised classification we have a variety
    of measures to evaluate how good our model is
  • Accuracy, precision, recall
  • For cluster analysis, the analogous question is
    how to evaluate the goodness of the resulting
    clusters?
  • But clusters are in the eye of the beholder!
  • Then why do we want to evaluate them?
  • To avoid finding patterns in noise
  • To compare clustering algorithms
  • To compare two sets of clusters
  • To compare two clusters

43
Clusters found in Random Data
Random Points
44
Different Aspects of Cluster Validation
  • Distinguishing whether non-random structure
    actually exists in the data
  • Comparing the results of a cluster analysis to
    externally known results, e.g., to externally
    given class labels
  • Evaluating how well the results of a cluster
    analysis fit the data without reference to
    external information
  • Comparing the results of two different sets of
    cluster analyses to determine which is better
  • Determining the correct number of clusters

45
Using Similarity Matrix for Cluster Validation
  • Order the similarity matrix with respect to
    cluster labels and inspect visually.
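One way to sketch this check in code: reorder the points by cluster label and build the pairwise similarity matrix, which should look block-diagonal for a good clustering (the inverse-distance similarity used here is an illustrative choice, not the slides' specific measure):

```python
def ordered_similarity_matrix(points, labels):
    """Reorder points by cluster label, then build the pairwise
    similarity matrix; good clusterings show a block-diagonal
    pattern of high similarities along the diagonal."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    sim = []
    for i in order:
        row = []
        for j in order:
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            row.append(1.0 / (1.0 + d2))  # map distance into (0, 1]
        sim.append(row)
    return sim
```

For random data the sorted matrix shows no such blocks, which is the point of the "not so crisp" slides that follow.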

46
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

DBSCAN
47
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

K-means
48
Using Similarity Matrix for Cluster Validation
  • Clusters in random data are not so crisp

Complete Link
49
Using Similarity Matrix for Cluster Validation
DBSCAN
50
Measures of Cluster Validity
  • Numerical measures applied to judge various
    aspects of cluster validity are classified into
    the following three types of indices
  • External Index: used to measure the extent to
    which cluster labels match externally supplied
    class labels.
  • Example: Entropy
  • Internal Index: used to measure the goodness of
    a clustering structure without respect to
    external information.
  • Example: Sum of Squared Error (SSE)
  • Relative Index: used to compare two different
    clusterings or clusters.
  • Often an external or internal index is used for
    this function, e.g., SSE or entropy

51
Internal Measures: Cohesion and Separation
  • Cluster Cohesion: measures how closely related
    the objects in a cluster are
  • Example: SSE
  • Cluster Separation: measures how distinct or
    well-separated a cluster is from other clusters
  • Example: squared error
  • Cohesion is measured by the within-cluster sum of
    squares (SSE)
    WSS = sum_i sum_{x in C_i} (x - m_i)^2
  • Separation is measured by the between-cluster sum
    of squares
    BSS = sum_i |C_i| (m - m_i)^2
  • where |C_i| is the size of cluster i, m_i is its
    centroid, and m is the overall mean of the data
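A small sketch of the two quantities, assuming clusters are lists of point tuples (a useful sanity check: WSS and BSS sum to the total sum of squares about the overall mean):

```python
def wss_bss(clusters):
    """Within-cluster (cohesion) and between-cluster (separation)
    sums of squares for points grouped into clusters."""
    all_pts = [p for c in clusters for p in c]
    dims = len(all_pts[0])
    # overall mean of the data
    mean = tuple(sum(p[d] for p in all_pts) / len(all_pts)
                 for d in range(dims))
    wss = bss = 0.0
    for c in clusters:
        # centroid of this cluster
        m = tuple(sum(p[d] for p in c) / len(c) for d in range(dims))
        wss += sum(sum((p[d] - m[d]) ** 2 for d in range(dims)) for p in c)
        bss += len(c) * sum((mean[d] - m[d]) ** 2 for d in range(dims))
    return wss, bss
```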

52
Internal Measures: Silhouette Coefficient
  • The Silhouette Coefficient combines ideas of both
    cohesion and separation, but for individual
    points, as well as clusters and clusterings
  • For an individual point, i
  • Calculate a = average distance of i to the points
    in its cluster
  • Calculate b = min (average distance of i to
    points in another cluster)
  • The silhouette coefficient for a point is then
    given by s = (b - a) / max(a, b)
  • Typically between 0 and 1.
  • The closer to 1 the better.
  • Can calculate the average silhouette coefficient
    for a cluster or a clustering
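The per-point computation above could be sketched as follows (pure Python; the function name silhouette is illustrative):

```python
import math

def silhouette(i, labels, points):
    """Silhouette of point i: s = (b - a) / max(a, b), where a is the
    mean distance to the other points of its own cluster and b is the
    minimum, over other clusters, of the mean distance to that cluster."""
    def dist(p, q):
        return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

    own = labels[i]
    by_label = {}
    for j, lab in enumerate(labels):
        if j != i:
            by_label.setdefault(lab, []).append(points[j])
    a = sum(dist(points[i], q) for q in by_label[own]) / len(by_label[own])
    b = min(sum(dist(points[i], q) for q in qs) / len(qs)
            for lab, qs in by_label.items() if lab != own)
    return (b - a) / max(a, b)
```

Averaging s over all points of a cluster (or all points of the data) gives the cluster-level and clustering-level scores mentioned above.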

53
External Measures of Cluster Validity: Entropy and Purity
54
Clustering of ESTs in Protein Coding Database
(Diagram: laboratory experiments yield a new protein,
which is matched by similarity against clusters of
short segments of protein-coding sequences (ESTs)
derived from known proteins, to infer the
functionality of the protein)
Researchers: John Carlis, John Riedl, Ernest Retzel,
Elizabeth Shoop
55
Expressed Sequence Tags (EST)
  • Generate short segments of protein-coding
    sequences (EST).
  • Match ESTs against known proteins using
    similarity matching algorithms.
  • Find clusters of ESTs that have the same
    functionality.
  • Match new protein against the EST clusters.
  • Experimentally verify only the functionality of
    the proteins represented by the matching EST
    clusters

56
EST Clusters by Hypergraph-Based Scheme
  • 662 different items corresponding to ESTs.
  • 11,986 variables corresponding to known proteins
  • Found 39 clusters
  • 12 clean clusters, each corresponding to a single
    protein family (113 ESTs)
  • 6 clusters with two protein families
  • 7 clusters with three protein families
  • 3 clusters with four protein families
  • 6 clusters with five protein families
  • Runtime was less than 5 minutes.

57
Clustering Microarray Data
  • Microarray analysis allows the monitoring of the
    activities of many genes over many different
    conditions
  • Data: expression profiles of approximately 3606
    genes of E. coli are recorded for 30 experimental
    conditions
  • The SAM (Significance Analysis of Microarrays)
    package from Stanford University is used to
    analyze the data and to identify the genes that
    are substantially differentially upregulated in
    the dataset; 17 such genes are identified for
    study purposes
  • Hierarchical clustering is performed and plotted
    using TreeView

58
Clustering Microarray Data
59
CLUTO for Clustering for Microarray Data
  • CLUTO (Clustering Toolkit), George Karypis (UofM):
    http://glaros.dtc.umn.edu/gkhome/views/cluto/
  • CLUTO can also be used for clustering microarray
    data

60
Issues in Clustering Expression Data
  • Similarity is computed using all the conditions
  • We are typically interested in sets of genes that
    are similar for a relatively small set of
    conditions
  • Most clustering approaches assume that an object
    can only be in one cluster
  • A gene may belong to more than one functional
    group
  • Thus, overlapping groups are needed
  • Can either use clustering that takes these
    factors into account or use other techniques
  • For example, association analysis

61
Clustering Packages
  • Mathematical and Statistical Packages
  • MATLAB
  • SAS
  • SPSS
  • R
  • CLUTO (Clustering Toolkit), George Karypis (UM):
    http://glaros.dtc.umn.edu/gkhome/views/cluto/
  • Cluster, Michael Eisen (LBNL/UCB) (microarray):
    http://rana.lbl.gov/EisenSoftware.htm
    http://genome-www5.stanford.edu/resources/restech.shtml
    (more microarray clustering algorithms)
  • Many others
  • KDNuggets:
    http://www.kdnuggets.com/software/clustering.html

62
Data Mining Book
For further details and sample chapters,
see www.cs.umn.edu/~kumar/dmbook