Clustering microarray data

Provided by: gcyuan

1
Clustering microarray data

09/26/07

2
Sub-classes of lung cancer types have signature
genes
(Bhattacharjee 2001)
3
Promoter analysis of commonly regulated genes
David J. Lockhart and Elizabeth A. Winzeler, Nature,
vol. 405, 15 June 2000, p. 827
4
Discovery of new cancer subtype
These classes are unknown at the time of study.
5
Overview

Clustering is an unsupervised learning method. It
is used to build groups of genes with related
expression patterns.
The classes are not known in advance.
The aim is to discover new patterns from microarray
data.
In contrast, supervised learning refers to the
learning process where classes are known. The aim
is to define classification rules to separate the
classes. Supervised learning will be discussed in
the next lecture.

6
Dissimilarity function

To identify clusters, we first need to define
what "close" means. There are many choices of
distance:
Euclidean distance
1 - Pearson correlation
Manhattan distance
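
The three dissimilarities listed above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the two toy expression profiles (`gene_a`, `gene_b`) are made up for the example and do not come from the slides.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_dissimilarity(x, y):
    """1 - Pearson correlation; undefined for a constant profile (zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Hypothetical profiles: gene_b has the same shape as gene_a at twice the scale.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

d_euclid = euclidean(gene_a, gene_b)
d_manhattan = manhattan(gene_a, gene_b)
d_pearson = pearson_dissimilarity(gene_a, gene_b)
```

The two profiles are far apart under the Euclidean and Manhattan distances but essentially identical under 1 - Pearson correlation, which only compares the shape of the profiles.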

7
(No Transcript)
8
Where is the truth?

In the context of unsupervised learning, there
is no such direct measure of success. It is
difficult to ascertain the validity of inference
drawn from the output of most unsupervised
learning algorithms. One must often resort to
heuristic arguments not only for motivating the
algorithm, but also for judgments as to the
quality of results. This uncomfortable situation
has led to heavy proliferation of proposed
methods, since effectiveness is a matter of
opinion and cannot be verified directly.

Hastie et al. 2001 ESL
9
Clustering Methods

Partitioning methods
Seek to optimally divide objects into a fixed
number of clusters.
Hierarchical methods
Produce a nested sequence of clusters

(Speed, Chapter 4)
10
Methods

k-means
Hierarchical clustering
Self-organizing maps (SOM)

11
k-means

Divide objects into k clusters.
Goal is to minimize total intra-cluster variance
Global minimum is difficult to obtain.

12
Algorithm for k-means clustering

Step 1 (initialization): randomly select k
centroids.
Step 2: for each object, find its closest
centroid and assign the object to the corresponding
cluster.
Step 3: for each cluster, update its centroid to
the mean position of all objects in that cluster.
Repeat Steps 2 and 3 until convergence.
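
The three steps can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy points are assumptions made for the example. Initial centroids are drawn from the objects themselves, which is one common reading of "randomly select k centroids".

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct objects as the initial centroids (an assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned objects.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Cluster labels are arbitrary integers, so only the grouping is meaningful, not which group is called 0 or 1.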

13
Shows the initial randomized centers and a number
of points.
14
Centers have been associated with the points and
have been moved to the respective centroids.
15
Now, the association is shown in more detail,
once the centroids have been moved.
16
Again, the centers are moved to the centroids of
the corresponding associated points.
17
Properties of k-means

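Achieves a local minimum of the total intra-cluster variance,
$\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$, where $\mu_j$ is the centroid of cluster $C_j$.
Very fast.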

18
Practical issues with k-means

k must be known in advance
Results are dependent on initial assignment of
centroids.

19
How to choose k?
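Milligan and Cooper (1985) compared 30 published
rules.
1. Calinski and Harabasz (1974): choose k to maximize
$CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}$.
2. Hartigan (1975): $H(k) = \left( \frac{W(k)}{W(k+1)} - 1 \right)(n-k-1)$. Stop when
$H(k) < 10$.
W(k): total sum of squares within clusters. B(k):
sum of squares between cluster means. n: number of objects.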
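
A small sketch of how W(k), B(k) and the two indices might be computed for a given cluster assignment. It assumes the standard forms of the Calinski-Harabasz and Hartigan indices and weights B(k) by cluster size. The function names and the toy data are illustrative only.

```python
def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def group(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters

def within_ss(points, labels):
    """W(k): squared distances of objects to their own cluster centroid."""
    total = 0.0
    for members in group(points, labels).values():
        mu = centroid(members)
        total += sum(sq_dist(p, mu) for p in members)
    return total

def between_ss(points, labels):
    """B(k): size-weighted squared distances of cluster means to the grand mean."""
    grand = centroid(points)
    return sum(len(m) * sq_dist(centroid(m), grand)
               for m in group(points, labels).values())

def calinski_harabasz(points, labels):
    n, k = len(points), len(set(labels))
    return (between_ss(points, labels) / (k - 1)) / (within_ss(points, labels) / (n - k))

def hartigan(w_k, w_k_plus_1, n, k):
    return (w_k / w_k_plus_1 - 1.0) * (n - k - 1)

# Hypothetical toy data with an obvious two-cluster structure.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
one_cluster = [0] * 6
two_clusters = [0, 0, 0, 1, 1, 1]

ch_2 = calinski_harabasz(points, two_clusters)
h_1 = hartigan(within_ss(points, one_cluster),
               within_ss(points, two_clusters), n=len(points), k=1)
```

For this toy data H(1) is far above 10, so Hartigan's rule would not stop at k = 1.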
20
How to choose k (continued)?
(Tibshirani 2001) Estimate $\log W_k$ for random
data (uniformly distributed in a
rectangle). Choose k so that the Gap is largest.
(Figure: observed and random $\log W_k$ plotted against k; the Gap is the difference between the two curves.)
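
The Gap computation can be sketched as below, under several simplifying assumptions: the reference data are drawn uniformly from the bounding rectangle of the observed data, k-means is restarted a few times and the best W_k is kept, and k is chosen simply where the Gap is largest, as the slide states. All names (`kmeans_wss`, `gap_statistic`, `n_ref`, `restarts`) and the toy data are hypothetical.

```python
import math
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_wss(points, k, rng, restarts=5, max_iter=50):
    """Smallest within-cluster sum of squares W_k over several random starts."""
    best = float("inf")
    for _ in range(restarts):
        centroids = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                          for p in points]
            if new_labels == labels:
                break
            labels = new_labels
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        best = min(best, wss)
    return best

def gap_statistic(points, k_values, n_ref=10, seed=0):
    rng = random.Random(seed)
    lows = [min(d) for d in zip(*points)]
    highs = [max(d) for d in zip(*points)]
    gaps = {}
    for k in k_values:
        log_w_obs = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Reference data: uniform in the bounding rectangle of the observations.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _ in points]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        # Gap(k) = mean reference log W_k minus observed log W_k.
        gaps[k] = sum(ref_logs) / n_ref - log_w_obs
    return gaps

# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
gaps = gap_statistic(points, k_values=[1, 2, 3])
best_k = max(gaps, key=gaps.get)
```

With so few points the estimate is noisy; in practice more reference data sets and larger samples would be used.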
21
How to select initial centroids

Repeat the procedure many times with randomly
chosen initial centroids.
Alternatively, initialize centroids smartly,
e.g. by hierarchical clustering

22
K-means requires good initial values.
Hierarchical Clustering could be used but
sometimes performs poorly.
Within-cluster sum of squares: X = 965.32, O = 305.09
23
Hierarchical clustering
Hierarchical clustering builds a hierarchy of
clusters, represented by a tree (called a
dendrogram). Close clusters are joined together.
Height of a branch represents the dissimilarity
between the two clusters joined by it.
24
How to construct a dendrogram

Bottom-up approach
Initialization: each cluster contains a single
object.
Iteration: merge the closest clusters.
Stop when all objects are included in a single
cluster.
Top-down approach
Starting from a single cluster containing all
objects, iteratively partition into smaller
clusters.
Truncate the dendrogram at a similarity threshold
level, e.g., correlation > 0.6, or require each
cluster to contain at least a minimum number of
objects.
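
The bottom-up approach can be sketched as follows. This is a naive illustration that recomputes all pairwise distances at every step. The function name `agglomerative` and its `linkage` parameter are assumptions for the example: passing `min` takes the smallest object-to-object distance between two clusters (single linkage) and `max` the largest (complete linkage).

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(points, linkage=min):
    """Return the merge history as (cluster_a, cluster_b, height) tuples."""
    # Initialization: each cluster contains a single object (stored by index).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Iteration: find the closest pair of clusters under the chosen linkage.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(euclidean(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional toy data.
points = [(0.0,), (1.0,), (3.0,), (10.0,)]
single_heights = [h for _, _, h in agglomerative(points, linkage=min)]
complete_heights = [h for _, _, h in agglomerative(points, linkage=max)]
```

The recorded heights are the branch heights of the dendrogram; cutting the tree at a chosen height yields a flat set of clusters.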

25
Hierarchical Clustering
(Figure: six objects labeled 1 to 6)
26
Dendrogram can be reordered
27
Ordered dendrograms

$2^{n-1}$ linear orderings of n elements
(n genes or conditions)
Maximizing adjacent similarity is impractical.
So order by
Average expression level,
Time of max induction, or
Chromosome positioning

(Eisen 1998)
28
Properties of Hierarchical Clustering

Top-down approach is more favorable when only a
few clusters are desired.
Single linkage tends to produce long chains of
clusters.
Complete linkage tends to produce compact
clusters.

29
(No Transcript)
30
Partitioning clustering vs hierarchical clustering
(Figure: six objects labeled 1 to 6, k = 4)
31
Partitioning clustering vs hierarchical clustering
(Figure: six objects labeled 1 to 6, k = 3)
32
Partitioning clustering vs hierarchical clustering
(Figure: six objects labeled 1 to 6, k = 2)
33
Self-organizing Map

Imposes partial structure on the clusters (in
contrast to the rigid structure of hierarchical
clustering, the strong prior hypotheses used in
Bayesian clustering, and the nonstructure of
k-means clustering).
Easy visualization and interpretation.

34
SOM Algorithm

Initialize prototypes $m_j$ on a lattice of p x q
nodes. Each prototype is a weight vector whose
dimension is the same as that of the input data.
Iteration: for each observation $x_i$, find the
closest prototype $m_j$, and for all neighbors $m_k$
of $m_j$ (including $m_j$ itself), move $m_k$ toward $x_i$ by
$m_k \leftarrow m_k + \alpha (x_i - m_k)$.
During iterations, reduce the learning rate $\alpha$ and
the neighborhood size r gradually.
May take many iterations before convergence.
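
A minimal sketch of the online SOM update described above. Several choices here are assumptions rather than part of the slides: prototypes start at randomly chosen observations, neighbors are lattice nodes within Euclidean lattice distance r of the winning node, and both the learning rate and r decay linearly. The names `train_som`, `alpha0` and `r0` are hypothetical.

```python
import math
import random

def train_som(data, p, q, n_iter=500, alpha0=0.5, r0=None, seed=0):
    rng = random.Random(seed)
    nodes = [(a, b) for a in range(p) for b in range(q)]
    # Initialise each prototype at a randomly chosen observation (an assumption).
    protos = {node: list(rng.choice(data)) for node in nodes}
    if r0 is None:
        r0 = max(p, q)
    for t in range(n_iter):
        frac = 1.0 - t / n_iter      # decays linearly from 1 towards 0
        alpha = alpha0 * frac        # learning rate
        r = r0 * frac                # neighbourhood radius on the lattice
        x = rng.choice(data)         # case-by-case presentation
        # Closest prototype in data space.
        winner = min(nodes, key=lambda nd: sum((xi - mi) ** 2
                                               for xi, mi in zip(x, protos[nd])))
        for nd in nodes:
            # Neighbours are defined on the lattice, not in data space.
            if math.hypot(nd[0] - winner[0], nd[1] - winner[1]) <= r:
                protos[nd] = [mi + alpha * (xi - mi)
                              for xi, mi in zip(x, protos[nd])]
    return protos

# Hypothetical toy data: two groups of two-dimensional observations.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
prototypes = train_som(data, p=2, q=2)
```

Each observation can then be assigned to the lattice node whose prototype is closest, which gives the clusters.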

35
(No Transcript)
36
(Hastie 2001)
37
(Hastie 2001)
38
(Hastie 2001)
39
SOM clustering of periodic genes
40
Applications to microarray data
41

With only a few nodes, one tends not to see
distinct patterns and there is large
within-cluster scatter. As nodes are added,
distinctive and tight clusters emerge.
SOM is an incremental learning algorithm
involving case-by-case presentation rather than
batch presentation.
As with all exploratory data analysis tools, the
use of SOMs involves inspection of the data to
extract insights.

42
Other Clustering Methods

Gene Shaving
MDS
Affinity Propagation
Spectral Clustering
Two-way clustering

Algorithms for unsupervised classification or
cluster analysis abound. Unfortunately however,
algorithm development seems to be a preferred
activity to algorithm evaluation among
methodologists.
No consensus or clear guidelines exist to guide
these decisions. Cluster analysis always produces
clustering, but whether a pattern observed in the
sample data characterizes a pattern present in
the population remains an open question.
Resampling-based methods can address this last
point, but results indicate that most clusterings
in microarray data sets are unlikely to reflect
reproducible patterns or patterns in the overall
population.
-Allison et al. (2006)

44
Stability of a cluster

Motivation: real clusters should be reproducible
under perturbation (adding noise, omission of
data, etc.).
Procedure:
Perturb the observed data by adding noise.
Apply the clustering procedure to the
perturbed data.
Repeat the above steps to generate a sample of
clusterings.
Global test
Cluster-specific tests: R-index, D-index.

(McShane et al. 2002)
45
(Figure: six objects labeled 1 to 6)
46
Global test

Null hypothesis: data come from a multivariate
Gaussian distribution.
Procedure:
Consider a subspace spanned by the top principal
components.
Estimate the distribution of nearest-neighbor
distances.
Compare observed with simulated data.

47
R-index

If cluster i contains $n_i$ objects, then it
contains $m_i = n_i(n_i - 1)/2$ pairs.
Let $c_i$ be the number of those pairs that fall in the
same cluster for the re-clustered perturbed data.
$r_i = c_i/m_i$ measures the robustness of
cluster i.
R-index $= \sum_i c_i / \sum_i m_i$ measures the overall
stability of a clustering algorithm.
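
The pair counting can be sketched as below for a single perturbed data set; in a full analysis the values would presumably be averaged over many perturbations. The function name `r_index` and the example labels are made up for illustration.

```python
from itertools import combinations

def r_index(original_labels, perturbed_labels):
    """Return per-cluster robustness r_i and the overall R-index."""
    clusters = {}
    for idx, lab in enumerate(original_labels):
        clusters.setdefault(lab, []).append(idx)
    r = {}
    total_c = total_m = 0
    for lab, members in clusters.items():
        pairs = list(combinations(members, 2))   # m_i = n_i (n_i - 1) / 2 pairs
        m_i = len(pairs)
        # c_i: pairs still together after re-clustering the perturbed data.
        c_i = sum(1 for a, b in pairs
                  if perturbed_labels[a] == perturbed_labels[b])
        r[lab] = c_i / m_i if m_i else float("nan")  # singletons have no pairs
        total_c += c_i
        total_m += m_i
    overall = total_c / total_m if total_m else float("nan")
    return r, overall

# Hypothetical labels: cluster "A" loses one member after perturbation,
# cluster "B" stays intact.
original = ["A", "A", "A", "B", "B", "B", "B"]
perturbed = [0, 0, 1, 1, 1, 1, 1]
per_cluster, overall = r_index(original, perturbed)
```

Here cluster "A" keeps only one of its three pairs, while cluster "B" keeps all six, so the overall index is 7/9.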

48
D-index

For each cluster, determine the closest cluster
for the perturbed data.
Calculate the average discrepancy between the
clusters for the original and perturbed data
(omissions vs. additions).
The D-index is the sum of all cluster-specific
discrepancies.

49
Applications