Bioinformatics Pattern recognition - Clustering
1
Bioinformatics Pattern recognition - Clustering
  • Ulf Schmitz
  • ulf.schmitz_at_informatik.uni-rostock.de
  • Bioinformatics and Systems Biology Group
  • www.sbi.informatik.uni-rostock.de

2
Outline
  • Introduction
  • Hierarchical clustering
  • Partitional clustering
  • k-means and derivatives
  • Fuzzy Clustering

3
Introduction to Clustering algorithms
  • Clustering is the classification of similar
    objects into separate groups,
  • or the partitioning of a data set into subsets
    (clusters),
  • so that the data in each subset (ideally) share
    some common trait
  • Machine learning typically regards clustering as
    a form of unsupervised learning
  • we distinguish
  • Hierarchical Clustering (finds successive
    clusters using previously established clusters)
  • Partitional Clustering (determines all clusters
    at once)

4
Introduction to Clustering algorithms
Applications
  • gene expression data analysis
  • identification of regulatory binding sites
  • phylogenetic tree clustering
  • (for inference of horizontally transferred
    genes)
  • protein domain identification
  • identification of structural motifs

5
Introduction to Clustering algorithms
Data matrix
  • the data matrix X collects observations of n objects,
    described by m measurements
  • rows refer to objects, characterised by the values in
    the columns

if the units of measurement associated with the
columns of X differ, it is necessary to normalise each
  • column vector x_j, using its
  • mean x̄_j = (1/n) Σ_i x_ij
  • standard deviation s_j, to obtain z_ij = (x_ij - x̄_j) / s_j
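A minimal sketch of this column-wise normalisation in Python (the matrix values below are invented for illustration; the function name is my own):

```python
from statistics import mean, stdev

def normalise(X):
    """Column-wise z-score normalisation of a data matrix.

    Each entry x_ij is replaced by (x_ij - mean_j) / s_j, where mean_j
    and s_j are the mean and standard deviation of column j.
    """
    cols = list(zip(*X))                      # column vectors of X
    means = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)]
            for row in X]

# three objects, two measurements on very different scales
X = [[1.0, 1000.0],
     [3.0, 3000.0],
     [2.0, 2000.0]]
Z = normalise(X)   # each column of Z now has mean 0, std dev 1
```

After normalisation, both columns contribute comparably to any distance measure, which is the point of the step above.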

6
Hierarchical clustering
produces a sequence of nested partitions, the
steps are
  • find dis/similarity between every pair of objects
    in the data set by evaluating a distance measure
  • group the objects into a hierarchical cluster
    tree (dendrogram) by linking newly formed
    clusters
  • obtain a partition of the data set into clusters
    by selecting a suitable cut-level of the
    cluster tree

7
Hierarchical clustering
Agglomerative Hierarchical clustering
  1. start with n clusters, each containing one object,
    and calculate the distance matrix D1
  2. determine from D1 which of the objects are least
    distant (e.g. I and J)
  3. merge these objects into one cluster and form a
    new distance matrix by deleting the entries for
    the clustered objects and adding distances for the
    new cluster
  4. repeat steps 2 and 3 a total of n-1 times, until a
    single cluster is formed
  • record which clusters are merged at each step
  • record the distances between the clusters that
    are merged in that step
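The merge loop above can be sketched in plain Python. This is a minimal single-linkage version (the slide does not fix a linkage method at this point), and the five 2-D points are made up for illustration, not the slide's data:

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def agglomerative(points):
    """Agglomerative clustering: start with n singleton clusters and
    merge the least distant pair n-1 times, recording each merge."""
    clusters = [[i] for i in range(len(points))]
    merges = []                       # (cluster I, cluster J, distance)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]               # replace the two clusters by one
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerative(points)       # n-1 = 4 merges, closest pairs first
```

The recorded merge list is exactly what a dendrogram plots: which clusters fused, and at what distance.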

8
Hierarchical clustering
calculating the distances
  • one treats the data matrix X as a set of n (row)
    vectors with m elements

Euclidean distance
d(x_i, x_j) = sqrt( Σ_k (x_ik - x_jk)² )
where x_i, x_j are row vectors of X
City block distance
d(x_i, x_j) = Σ_k | x_ik - x_jk |
9
Hierarchical clustering
an example
Euclidean distance
City block distance
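Both distance measures in a short Python sketch (the example vectors are my own, since the slide's worked numbers did not survive the transcript):

```python
from math import sqrt

def euclidean(x, y):
    # d(x, y) = sqrt( sum_k (x_k - y_k)^2 )
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    # d(x, y) = sum_k |x_k - y_k|  (Manhattan distance)
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0), (4.0, 6.0)
euclidean(x, y)    # sqrt(9 + 16) = 5.0
city_block(x, y)   # 3 + 4 = 7.0
```

Note that city block distance is always at least as large as Euclidean distance for the same pair of vectors.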
10
Hierarchical clustering
Agglomerative Hierarchical clustering
  (the algorithm steps are repeated from slide 7)

11
Hierarchical clustering
distance matrix
      x1      x2      x3      x4      x5
x1    0       2.9155  1.0000  3.0414  3.0414
x2    2.9155  0       2.5495  3.3541  2.5000
x3    1.0000  2.5495  0       2.0616  2.0616
x4    3.0414  3.3541  2.0616  0       1.0000
x5    3.0414  2.5000  2.0616  1.0000  0

after merging the least distant pairs x1, x3 and x4, x5:
        x1,x3   x2      x4,x5
x1,x3   0       2.9155  2.0616
x2      2.9155  0       2.5000
x4,x5   2.0616  2.5000  0
12
Hierarchical clustering
Methods to define a distance between clusters
  • single linkage: d_IJ is the smallest distance between
    a member of cluster I and a member of cluster J
  • complete linkage: d_IJ is the largest distance between
    a member of cluster I and a member of cluster J
  • group average: d_IJ is the average of all pairwise
    distances between members of I and members of J,
    where N is the number of members in a cluster
  • centroid linkage: d_IJ is the distance between the
    centroids of the two clusters
(the slide illustrates group average for clusters {1, 2}
and {3, 4, 5} with the pairwise distances d13, d14, d15,
d23, d24, d25)
13
Hierarchical clustering
14
Hierarchical clustering
Agglomerative Hierarchical clustering
  (the algorithm steps are repeated from slide 7)

15
Hierarchical clustering
Limits of hierarchical clustering
  • the choice of distance measure is important
  • there is no provision for reassigning objects
    that have been incorrectly grouped
  • errors are not handled explicitly in the
    procedure
  • no method of calculating intercluster distances
    is universally the best
  • but, single-linkage clustering tends to be the
    least successful
  • and, group average clustering tends to perform
    fairly well

16
Partitional clustering K means
  • Involves prior specification of the number of
    clusters, k
  • no pairwise distance matrix is required
  • The relevant distance is the distance from the
    object to the cluster center (centroid)

17
Partitional clustering K means
  1. partition the objects into k clusters (can be done
    by random partitioning or by arbitrarily
    clustering around two or more objects)
  2. calculate the centroids of the clusters
  3. assign or reassign each object to that cluster
    whose centroid is closest (distance is calculated
    as Euclidean distance)
  4. recalculate the centroids of the new clusters
    formed after the gain or loss of objects to or
    from the previous clusters
  5. repeat steps 3 and 4 for a predetermined number
    of iterations or until membership of the groups
    no longer changes
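A minimal sketch of these steps in Python, run on the five objects A-E from the worked example on the next slide (the function names and the fixed iteration cap are my own choices; empty clusters are not handled):

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(cluster):
    """Componentwise mean of a list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def k_means(points, assignment, max_iter=100):
    """K-means from an initial partition (one cluster index per point);
    iterates steps 3 and 4 until membership no longer changes."""
    for _ in range(max_iter):
        k = max(assignment) + 1
        # step 2/4: recalculate the centroids of the current clusters
        centroids = [centroid([p for p, a in zip(points, assignment) if a == i])
                     for i in range(k)]
        # step 3: reassign each object to the closest centroid
        new = [min(range(k), key=lambda i: dist(p, centroids[i]))
               for p in points]
        if new == assignment:          # membership stable: stop
            return new, centroids
        assignment = new
    return assignment, centroids

# objects A..E, initial partition {A, B, C} and {D, E}
points = [(1, 1), (3, 1), (4, 8), (8, 10), (9, 6)]
assignment, centroids = k_means(points, [0, 0, 0, 1, 1])
```

Starting from the slide's initial partition, one reassignment (C moves to cluster 2) is enough to reach a stable partition with centroids (2, 1) and (7, 8).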

18
Partitional clustering K means
object x1 x2
A 1 1
B 3 1
C 4 8
D 8 10
E 9 6
step 1 make an arbitrary partition of the
objects into two clusters, e.g. A, B
and C in Cluster 1, and D and E in Cluster 2
step 2 calculate the centroids of the
clusters: cluster 1 c1 = (2.67, 3.33), cluster 2 c2 = (8.5, 8)
step 3 calculate the Euclidean distance between each
object and each of the two cluster
centroids
object  d(x, c1)  d(x, c2)
A        2.87     10.26
B        2.35      8.90
C        4.86      4.50
D        8.54      2.06
E        6.87      2.06
19
Partitional clustering K means
  (the algorithm steps 1-5 are repeated from slide 17)

20
Partitional clustering K means
step 4 C turns out to be closer to Cluster 2 and
has to be reassigned; repeat steps 2 and 3
object  d(x, c1)  d(x, c2)
A        1.00      9.22
B        1.00      8.06
C        7.28      3.00
D       10.82      2.24
E        8.60      2.83
cluster 1 c1 = (2, 1), cluster 2 c2 = (7, 8)
no further reassignments are necessary
21
Partitional clustering K means
22
Fuzzy clustering
  • is an extension of k-means clustering
  • an object belongs to a cluster to a certain
    degree
  • for each object the degrees of membership in the
    k clusters add up to one
  • a fuzzy weight m is introduced, which determines
    the fuzziness of the resulting clusters
  • for m → 1, the clustering becomes a hard partition
  • for m → ∞, the degrees of membership approach
    1/k
  • typical values are m = 1.25 and m = 2

23
Fuzzy clustering
fix k, 2 ≤ k < n, and choose a distance measure
(Euclidean, city block, etc.), a termination
tolerance d > 0 (e.g. 0.01 or 0.001), and fix m, 1 <
m < ∞. Initialize the first partition matrix U randomly.
step 1 compute cluster centers
c_i = Σ_j (u_ij)^m x_j / Σ_j (u_ij)^m
step 2 compute distances between objects and
cluster centers, d_ij = || x_j - c_i ||
24
Fuzzy clustering
step 3 update partition matrix
u_ij = 1 / Σ_l (d_ij / d_lj)^(2/(m-1))
until || U_new - U_old || < d, i.e.
the algorithm is terminated if changes in the
partition matrix are negligible
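The three steps can be sketched as the standard fuzzy c-means updates (Bezdek's formulas are assumed here, since the slide's equations did not survive the transcript; the random initialisation, seed, and tolerance values are illustrative, and the sketch does not guard against an object coinciding exactly with a center):

```python
import random
from math import dist

def fuzzy_c_means(points, k, m=2.0, tol=0.001, max_iter=100, seed=0):
    """Fuzzy c-means sketch: u[i][j] is the degree of membership of
    object j in cluster i; each column of u sums to one."""
    rng = random.Random(seed)
    n, ndim = len(points), len(points[0])
    # random initial partition matrix, columns normalised to sum to one
    u = [[rng.random() for _ in range(n)] for _ in range(k)]
    for j in range(n):
        s = sum(u[i][j] for i in range(k))
        for i in range(k):
            u[i][j] /= s
    for _ in range(max_iter):
        # step 1: cluster centers as membership-weighted means
        centers = []
        for i in range(k):
            w = [u[i][j] ** m for j in range(n)]
            centers.append(tuple(
                sum(wj * p[d] for wj, p in zip(w, points)) / sum(w)
                for d in range(ndim)))
        # step 2: distances between objects and cluster centers
        dists = [[dist(points[j], centers[i]) for j in range(n)]
                 for i in range(k)]
        # step 3: update partition matrix
        new_u = [[1.0 / sum((dists[i][j] / dists[l][j]) ** (2.0 / (m - 1.0))
                            for l in range(k))
                  for j in range(n)] for i in range(k)]
        # terminate when changes in the partition matrix are negligible
        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(k) for j in range(n))
        u = new_u
        if change < tol:
            break
    return u, centers

# the five objects from the k-means example, now clustered fuzzily
u, centers = fuzzy_c_means([(1, 1), (3, 1), (4, 8), (8, 10), (9, 6)], k=2)
```

Unlike k-means, each object ends up with a graded membership in both clusters rather than a hard label.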
25
Clustering Software
  • Cluster 3.0 (for gene expression data analysis )
  • PyCluster (Python Module)
  • Algorithm::Cluster (Perl package)
  • C clustering library

http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
26
Outlook
  • Bioperl

27
Thanx for your attention!!!