1
Cluster analysis for microarray data
  • Anja von Heydebreck

2
Aim of clustering: group objects according to
their similarity
Cluster: a set of objects that are similar to
each other and separated from the
other objects. Example: the green/red data
points were generated from two different normal
distributions.
3
Clustering microarray data: the gene expression
data matrix
  • Genes and experiments/samples are given as the
    row and column vectors of a gene expression data
    matrix.
  • Clustering may be applied either to genes or
    experiments (regarded as vectors in R^p or R^n).

(Figure: the expression data matrix, with p genes as rows and n experiments as columns.)
4
Why cluster genes?
  • Identify groups of possibly co-regulated genes
    (e.g. in conjunction with sequence data).
  • Identify typical temporal or spatial gene
    expression patterns (e.g. cell cycle data).
  • Arrange a set of genes in a linear order that is
    at least not totally meaningless.

5
Why cluster experiments/samples?
  • Quality control: detect experimental
    artifacts/bad hybridizations
  • Check whether samples are grouped according to
    known categories (though this might be better
    addressed using a supervised approach:
    statistical tests, classification)
  • Identify new classes of biological samples (e.g.
    tumor subtypes)

6
Alizadeh et al., Nature 403:503-11, 2000
7
Cluster analysis
  • Generally, cluster analysis is based on two
    ingredients:
  • Distance measure: quantification of the
    (dis-)similarity of objects.
  • Cluster algorithm: a procedure to group objects.
    Aim: small within-cluster distances, large
    between-cluster distances.

8
Some distance measures
  • Given vectors x = (x_1, ..., x_n), y = (y_1, ..., y_n):
  • Euclidean distance: d_E(x, y) = sqrt(Σ_i (x_i - y_i)^2)
  • Manhattan distance: d_M(x, y) = Σ_i |x_i - y_i|
  • Correlation distance: d_c(x, y) = 1 - cor(x, y)

9
Which distance measure to use?
  • The choice of distance measure should be based
    on the application area. What sort of
    similarities would you like to detect?
  • Correlation distance d_c measures trends/relative
    differences:
  • d_c(x, y) = d_c(ax + b, y) if a > 0.

x = (1, 1, 1.5, 1.5), y = (2.5, 2.5, 3.5, 3.5) =
2x + 0.5, z = (1.5, 1.5, 1, 1).
d_c(x, y) = 0, d_c(x, z) = 2; d_E(x, z) = 1,
d_E(x, y) = 3.54.
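These numbers can be checked in R; a minimal sketch, assuming the correlation distance is d_c = 1 - cor (consistent with d_c(x, z) = 2 for perfectly anti-correlated vectors):

  x <- c(1, 1, 1.5, 1.5)
  y <- c(2.5, 2.5, 3.5, 3.5)                 # y = 2x + 0.5
  z <- c(1.5, 1.5, 1, 1)
  dE <- function(a, b) sqrt(sum((a - b)^2))  # Euclidean distance
  dC <- function(a, b) 1 - cor(a, b)         # correlation distance
  dC(x, y)  # 0: x and y follow the same trend
  dC(x, z)  # 2: x and z are perfectly anti-correlated
  dE(x, z)  # 1
  dE(x, y)  # 3.54: large absolute difference despite the same trend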
10
Which distance measure to use?
  • Euclidean and Manhattan distance both measure
    absolute differences between vectors. Manhattan
    distance is more robust against outliers.
  • One may standardize the observations:
  • Subtract the mean and divide by the standard
    deviation.
  • After standardization, Euclidean and correlation
    distance are equivalent (d_E is a monotone
    function of d_c; see the sketch below)
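A minimal sketch of this equivalence: for vectors standardized with R's scale() (which uses the sample standard deviation), the squared Euclidean distance equals 2(n - 1)(1 - cor):

  set.seed(1)
  a <- rnorm(10)
  b <- rnorm(10)
  sa <- as.vector(scale(a))   # mean 0, standard deviation 1
  sb <- as.vector(scale(b))
  sum((sa - sb)^2)                        # squared Euclidean distance
  2 * (length(a) - 1) * (1 - cor(a, b))   # the same value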

11
K-means clustering
  • Input: N objects given as data points in R^p.
  • Specify the number k of clusters.
  • Initialize k cluster centers. Iterate until
    convergence:
  • - Assign each object to the cluster with the
    closest center (w.r.t. Euclidean distance).
  • - The centroids/mean vectors of the obtained
    clusters are taken as new cluster centers.
  • K-means can be seen as an optimization problem:
  • Minimize the sum of squared within-cluster
    distances, W(C) = Σ_k Σ_{C(i)=k} ||x_i - m_k||^2,
    where m_k is the mean vector of cluster k.
  • Results depend on the initialization. Use several
    starting points and choose the best solution
    (with minimal W(C)).
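A minimal sketch in R, with a toy matrix standing in for a real genes-by-samples expression matrix; the nstart argument of kmeans runs several random initializations and keeps the solution with the smallest W(C):

  set.seed(1)
  x <- matrix(rnorm(100 * 6), nrow = 100)    # toy data: 100 "genes", 6 "samples"
  km <- kmeans(x, centers = 4, nstart = 25)  # 25 starting points, best one kept
  km$tot.withinss                            # the minimized criterion W(C)
  table(km$cluster)                          # cluster sizes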

12
Partitioning around medoids (PAM)
  • K-means clustering is based on Euclidean
    distance.
  • Partitioning around medoids (PAM) generalizes the
    idea and can be used with any distance measure d
    (the objects x_i need not be vectors).
  • The cluster centers/prototypes are required to be
    observations.
  • PAM tries to minimize the sum of the distances of
    the objects to their cluster centers,
    Σ_i d(x_i, m_C(i)),
    using an iterative procedure analogous to the
    one in K-means clustering.
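A sketch with the pam function from the cluster library (listed on the last slide), here given a precomputed correlation-distance matrix, which K-means could not use:

  library(cluster)
  set.seed(1)
  x <- matrix(rnorm(100 * 6), nrow = 100)  # toy genes-by-samples matrix
  d.cor <- as.dist(1 - cor(t(x)))          # correlation distance between genes
  pm <- pam(d.cor, k = 4)                  # medoids minimize the sum of distances
  pm$medoids                               # the observations serving as centers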

13
K-means/PAM: how to choose K (the number of
clusters)?
  • There is no easy answer.
  • Many heuristic approaches try to compare the
    quality of clustering results for different
    values of K (for an overview see Dudoit/Fridlyand
    2002).
  • The problem can be better addressed in
    model-based clustering, where each cluster
    represents a probability distribution, and a
    likelihood-based framework can be used.
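One simple heuristic of this kind (a sketch only, not the Dudoit/Fridlyand method): compute the minimized criterion W(C) for a range of K and look for an "elbow" in the curve:

  set.seed(1)
  x <- matrix(rnorm(100 * 6), nrow = 100)  # toy genes-by-samples matrix
  wss <- sapply(2:10, function(k)
    kmeans(x, centers = k, nstart = 25)$tot.withinss)
  plot(2:10, wss, type = "b", xlab = "K", ylab = "W(C)")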

14
Self-organizing maps
  • K = r·s clusters are arranged as the nodes of a
    two-dimensional r × s grid. Nodes represent cluster
    centers/prototype vectors.
  • This makes it possible to represent similarity
    between clusters.
  • Algorithm:
  • Initialize the nodes at random positions.
    Iterate:
  • - Randomly pick one data point (gene) x.
  • - Move the nodes towards x, the closest node
    most, remote nodes (in terms of the grid) less.
    Decrease the amount of movement with the number
    of iterations.

from Tamayo et al. 1999
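A brief sketch with the SOM function from the class library (named on the final slide); the 3 x 4 grid size is an arbitrary choice:

  library(class)
  set.seed(1)
  x <- matrix(rnorm(100 * 6), nrow = 100)   # toy genes-by-samples matrix
  gr <- somgrid(xdim = 3, ydim = 4, topo = "rectangular")
  sm <- SOM(x, grid = gr)                   # K = 3 * 4 = 12 nodes
  dim(sm$codes)                             # one prototype vector per grid node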
15
Self-organizing maps
from Tamayo et al. 1999 (yeast cell cycle data)
16
Hierarchical clustering
  • Similarity of objects is represented in a tree
    structure (dendrogram).
  • Advantage: no need to specify the number of
    clusters in advance. Nested clusters can be
    represented.

Golub data: different types of leukemia.
Clustering based on the 150 genes with the highest
variance across all samples.
17
Agglomerative hierarchical clustering
  • Bottom-up algorithm (top-down (divisive) methods
    are less common).
  • Start with each object as a separate cluster.
  • In each iteration, merge the two clusters with
    the minimal distance from each other - until you
    are left with a single cluster comprising all
    objects.
  • But what is the distance between two clusters?

18
Distances between clusters used for hierarchical
clustering
  • Calculation of the distance between two
    clusters is based on the pairwise distances
    between members of the clusters.
  • Complete linkage: largest pairwise distance
  • Average linkage: average pairwise distance
  • Single linkage: smallest pairwise distance

Complete linkage gives preference to
compact/spherical clusters. Single linkage can
produce long, stretched clusters.
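In R, the linkage is chosen via the method argument of hclust (from the mva library named on the last slide); a minimal sketch:

  set.seed(1)
  x <- matrix(rnorm(100 * 6), nrow = 100)  # toy genes-by-samples matrix
  d <- dist(x)                             # Euclidean distances between genes
  hc <- hclust(d, method = "complete")     # or "average", "single"
  plot(hc, labels = FALSE)                 # draw the dendrogram
  groups <- cutree(hc, k = 4)              # cut the tree into 4 clusters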
19
Hierarchical clustering
  • The height of a node in the dendrogram represents
    the distance between its two children clusters.
  • Loss of information: n objects have n(n-1)/2
    pairwise distances, but the tree has only n-1
    inner nodes.
  • The ordering of the leaves is not uniquely
    defined by the dendrogram: 2^(n-2) possible
    choices.

Golub data: different types of leukemia.
Clustering based on the 150 genes with the highest
variance across all samples.
20
Alternative: direct visualization of the
similarity/distance matrix
Useful if one wants to investigate a specific
factor (advantage: no loss of information). Sort
the experiments according to that factor.
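A sketch assuming a hypothetical sample annotation "group"; the samples are sorted by the factor and the full distance matrix is displayed with image:

  set.seed(1)
  x <- matrix(rnorm(100 * 6), nrow = 100)      # toy genes-by-samples matrix
  group <- factor(rep(c("A", "B"), each = 3))  # hypothetical factor of interest
  ord <- order(group)
  D <- 1 - cor(x[, ord])                       # sample-sample correlation distance
  image(D, main = "pairwise sample distances") # full matrix, nothing discarded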
21
The role of feature selection
  • Sometimes, people first select genes that appear
    to be differentially expressed between groups of
    samples. Then they cluster the samples based on
    the expression levels of these genes. Is it
    remarkable if the samples then cluster into the
    two groups?
  • No, this doesn't prove anything, because the
    genes were selected with respect to the two
    groups! Such effects can even be obtained with a
    matrix of i.i.d. random numbers (see the sketch
    below).
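A sketch of this effect on pure noise (all names and sizes below are arbitrary): select the "most differential" genes for a random split of the samples, then cluster the samples on those genes:

  set.seed(1)
  r <- matrix(rnorm(5000 * 20), nrow = 5000)  # i.i.d. noise: 5000 genes, 20 samples
  lab <- rep(1:2, each = 10)                  # an arbitrary split of the samples
  pval <- apply(r, 1, function(g) t.test(g[lab == 1], g[lab == 2])$p.value)
  top <- r[order(pval)[1:50], ]               # 50 most "differential" genes
  hc <- hclust(dist(t(top)))                  # cluster samples on selected genes
  table(cutree(hc, k = 2), lab)               # clusters recover lab - on pure noise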

22
Clustering of time course data
  • Suppose we have expression data from different
    time points t_1, ..., t_n, and want to identify
    typical temporal expression profiles by
    clustering the genes.
  • Usual clustering methods/distance measures don't
    take the ordering of the time points into account:
    the result would be the same if the time points
    were permuted.
  • Simple modification: consider the difference
    y_ij = x_i(j+1) - x_ij between consecutive time
    points as an additional observation. Then apply a
    clustering algorithm such as K-means to the
    augmented data matrix.
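A sketch of the augmentation in R, assuming rows are genes and columns are time points in temporal order:

  set.seed(1)
  x <- matrix(rnorm(100 * 6), nrow = 100)  # toy data: 100 genes, 6 time points
  diffs <- t(apply(x, 1, diff))            # y_ij = x_i(j+1) - x_ij, per gene
  aug <- cbind(x, diffs)                   # the augmented data matrix
  km <- kmeans(aug, centers = 4, nstart = 25)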

23
Biclustering
  • Usual clustering algorithms are based on global
    similarities of rows or columns of an expression
    data matrix.
  • But the similarity of the expression profiles of
    a group of genes may be restricted to certain
    experimental conditions.
  • Goal of biclustering: identify homogeneous
    submatrices.
  • Difficulties: computational complexity, assessing
    the statistical significance of the results.
  • Example: Tanay et al. 2002.

24
Conclusions
  • Clustering has been very popular in the analysis
    of microarray data. However, many typical
    questions in microarray studies (e.g. identifying
    expression signatures of tumor types) are better
    addressed by other methods (classification,
    statistical tests).
  • Clustering algorithms are easy to apply (you
    cannot really do it wrong), and they are useful
    for exploratory analysis. But it is difficult to
    assess the validity/significance of the results.
    Even random data with no structure can yield
    clusters or exhibit interesting-looking patterns.

25
Clustering in R
  • library(mva)
  • - Hierarchical clustering: hclust, heatmap
  • - K-means: kmeans
  • library(class)
  • - Self-organizing maps: SOM
  • library(cluster)
  • - pam and other functions
  • (In current versions of R, the mva functions are
    part of the standard stats package.)

26
References
  • T. Hastie, R. Tibshirani, J. Friedman: The
    Elements of Statistical Learning (Chapter 14).
    Springer, 2001.
  • M. Eisen et al.: Cluster analysis and display of
    genome-wide expression patterns. Proc. Natl.
    Acad. Sci. USA 95:14863-8, 1998.
  • P. Tamayo et al.: Interpreting patterns of gene
    expression with self-organizing maps: methods and
    application to hematopoietic differentiation.
    Proc. Natl. Acad. Sci. USA 96:2907-12, 1999.
  • S. Dudoit, J. Fridlyand: A prediction-based
    resampling method for estimating the number of
    clusters in a dataset. Genome Biology 3(7), 2002.
  • A. Tanay, R. Sharan, R. Shamir: Discovering
    statistically significant biclusters in gene
    expression data. Bioinformatics 18, Suppl. 1,
    S136-44, 2002.