1
Clustering of DNA microarray data
Joaquín Dopazo. Bioinformatics Unit,
CNIO. http://bioinfo.cnio.es
2
Supervised vs unsupervised clustering
Sample annotation information (additional rows),
e.g. cell type, treatment, disease state, time
course info
Gene annotation information (additional columns),
e.g. gene name, function, genome location
Gene expression levels
3
Unsupervised clustering
Look for structure within the gene expression matrix;
annotation information is used later.
Identify co-expressing genes... What do they have in
common?
Genes of a class... What profile(s) do they display?
And are there more such genes?
Molecular classification of samples
4
Analysis of genes with correlated expression
  • Genes with correlated expression
  • markers
  • functionally related genes

5
Genes of the same functional class have
correlated expression patterns
25 out of 40 ORFs belong to the functional class
"cytoplasmic degradation" (MIPS); most of them
are proteasome subunits.
6
Molecular classification of samples
[Diagram: samples A-D grouped into classes by clustering]
7
Taxonomic relationships between normal and
malignant lymphoid populations
Alizadeh et al., Nature 2000 (96 samples)
8
The data
A, B, C: different classes of experimental conditions,
e.g. cancer types, tissues, drug treatments, time,
survival, etc.
  • Characteristics of the data
  • We have many more variables than experiments
  • Low signal-to-noise ratio
  • High redundancy and intra-gene correlations
  • Most of the genes are not informative with
    respect to the trait we are studying (they account
    for unrelated physiological conditions, etc.)
  • Many genes have no annotation!!

Rows: genes (thousands); a row is the expression
profile of a gene across the experimental conditions.
Columns: experimental conditions (from tens up to no
more than a few hundred); a column is the expression
profile of all the genes for one experimental
condition (array).
9
Be familiar with the data in your gene expression
matrix
  • Absolute vs relative gene expression values (i.e.
    ratios)
  • Relative expression: ratios or log-transformed
    ratios?
  • Normalization between samples. Are the columns
    comparable?
  • Set of related hybridizations with a common
    reference sample, or amalgamated data?
  • Gene replicates: duplicates of the same probes or
    different probes?
  • Sample replicates: the same number for each
    experimental condition?

10
Gene expression matrix: points of caution for
unsupervised clustering
  • Many analytical methods are based on log2 ratio
    expression values and not absolute values
  • Check for missing values: some analysis methods
    cannot handle matrices with missing values
  • Either delete the suspect row or column, or
    interpolate from known values
  • Reduce the size of your matrix by only
    considering genes that undergo a specified
    fold-change in at least one of the samples, or
    whose levels change significantly over the samples
    being compared (i.e. remove genes with flat
    patterns; see the filtering sketch below)

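A minimal sketch of this kind of pre-filtering, assuming a pandas DataFrame of log2 ratios with genes as rows (the function name and threshold are illustrative, not from the slides):

```python
import pandas as pd

def filter_expression_matrix(expr: pd.DataFrame,
                             min_abs_log2fc: float = 1.0) -> pd.DataFrame:
    """Drop genes with missing values or flat expression patterns.

    expr: genes x samples matrix of log2 ratios (illustrative layout).
    min_abs_log2fc: keep a gene only if |log2 ratio| reaches this value
    in at least one sample (1.0 corresponds to a 2-fold change).
    """
    # 1. Remove rows (genes) with missing values; many clustering
    #    implementations cannot handle NaNs.
    complete = expr.dropna(axis=0)
    # 2. Remove genes with flat patterns, i.e. genes whose log2 ratio
    #    never reaches the required fold-change in any sample.
    return complete[(complete.abs() >= min_abs_log2fc).any(axis=1)]
```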
11
Unsupervised clustering: distance
  • You do not have external information on how
    the data are arranged
  • You only have the values measured in the
    experiment
  • You need to be able to measure the distance
    between the profiles of expression values of
    two genes, or the distance between the gene
    expression values in two samples
  • The distance measure should be a quantitative
    and non-subjective measure of the closeness of
    a pair of data points.

12
Euclidean distance
gene   t1   t2
A      x1   x2
B      y1   y2
[Plot: A and B as points in expression space, separated by distance d]
Euclidean distance: $d(A,B) = \sqrt{\sum_i (x_i - y_i)^2}$
Squared Euclidean distance: $d(A,B) = \sum_i (x_i - y_i)^2$
Manhattan distance: $d(A,B) = \sum_i |x_i - y_i|$
Minkowski distance (generalized): $d(A,B) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$
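A minimal sketch of these pairwise distances for two expression profiles (NumPy; the example values are illustrative):

```python
import numpy as np

def minkowski(a: np.ndarray, b: np.ndarray, p: float = 2.0) -> float:
    """Generalized Minkowski distance; p=2 gives Euclidean, p=1 Manhattan."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.16, 0.25, 0.40])   # illustrative expression profiles
b = np.array([0.24, 0.30, -0.38])

euclidean = minkowski(a, b, p=2)       # square root of the sum of squared differences
squared_euclidean = euclidean ** 2     # sum of squared differences
manhattan = minkowski(a, b, p=1)       # sum of absolute differences
```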
13
Linear correlation
The correlation coefficient between n pairs of
observations, whose values are (x_i, y_i), is
$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$
The linear correlation coefficient measures the
strength of the linear relationship between the
paired x and y values in a sample.
[Scatterplots of y against x illustrating r = -1, 0 and 1]
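A minimal sketch of the correlation coefficient, together with the correlation-based distance d = (1 - r)/2 used in the exercise later on (function names are illustrative):

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Linear (Pearson) correlation coefficient between two profiles."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def correlation_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Distance derived from correlation: 0 when r = 1, 1 when r = -1."""
    return (1.0 - pearson(x, y)) / 2.0
```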
14
Distance types
Differences (Euclidean): B and C are closest.
Correlation: A and B are closest.
15
Different distances account for different
properties
[Profiles A, B, C clustered by correlation vs. by
Euclidean distance]
Correlation captures tendencies (the shape of the
profile); Euclidean distance captures global
similarity (the magnitude of the values).
16
Unsupervised clustering: other important choices
  • Measurement of pair-wise distances between gene
    expression values
  • NEXT STEP
  • Measurement of pair-wise distances between
    clusters (see the sketch after this list)
  • Single linkage (or nearest neighbour)
  • Complete linkage (or furthest neighbour, or
    maximum distance)
  • Average linkage
  • i) average distance between each point in a
    cluster and every point in the other cluster
  • weighted methods compensate for the size of the
    cluster (WPGMA)
  • unweighted methods treat clusters of different
    sizes equally (UPGMA)
  • ii) from the mean centroid of each cluster
  • weighted (WPGMC)
17
Unsupervised clustering: which computational
algorithm should I use?
  • The aim of clustering is to group together genes
    or samples that have similar expression profiles
  • There are many different computational algorithms
    for doing this
  • You can have
  • hierarchical clustering
  • agglomerative clustering
  • divisive clustering (SOTA)
  • flat (or non-hierarchical clustering) (K-means,
    SOM)

18
Unsupervised clustering methods
[Diagram: methods arranged by type]
Non-hierarchical: K-means, PCA, SOM (quick and robust)
Hierarchical: agglomerative clustering, SOTA
(provide different levels of information)
19
Aggregative hierarchical clustering
Relationships among profiles are represented by
branch lengths. The closest pair of profiles is
recursively linked until the complete hierarchy
is reconstructed. This allows exploring the
relationships among groups of related genes at
higher levels.
CLUSTER
20
Aggregative hierarchical clustering
[Diagram: profiles c1-c5 joined step by step into a tree]
The pair of closest profiles is recursively
joined until a complete hierarchy is
constructed. Branch lengths are proportional to
the differences between profiles.
21
Different aggregative criteria
minimum (single linkage)
maximum (complete linkage)
22
Exercise
  • Using real data, try to build a tree with average
    linkage. (A worked sketch in code follows this
    slide.)
  • Steps
  • Construct the distance matrices d_{x,y} (Euclidean
    and correlation; for correlation use d = (1 - r)/2)
  • Use the algorithm: select the closest pair, and
    collapse its column and row, joining entries as
    d_{xy,z} = (d_{x,z} + d_{y,z})/2

ORF       R1     R2     R3     R4     R5
YHR007C   0.16   0.25   0.40  -0.19  -0.25
YBR218C   0.24   0.30  -0.38  -0.43  -0.33
YAL051W  -0.04   0.40   0.41   0.24   0.17
YAL053W   0.19   0.41   0.23  -0.01  -0.31
YAL054C  -0.67  -0.19   0.00  -0.19  -0.30
YAL055W  -0.56   0.00  -0.13  -0.06  -0.31
YAL056W   0.01   0.65   0.24  -0.00  -0.09
YAL058W   0.04   0.30   0.20   0.05  -0.20
YOL109W   0.63   0.65   0.91   0.55   0.17
YAL065C  -0.13  -0.62   0.18  -0.05  -0.35
YAL066W  -0.58  -0.22   0.03  -0.26  -0.19
YAL067C  -1.12  -0.99  -0.41  -1.03  -0.89
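A hedged sketch of the exercise with SciPy, using the data from the table above (pdist's "correlation" metric returns 1 - r, which is halved to obtain d = (1 - r)/2):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

names = ["YHR007C", "YBR218C", "YAL051W", "YAL053W", "YAL054C", "YAL055W",
         "YAL056W", "YAL058W", "YOL109W", "YAL065C", "YAL066W", "YAL067C"]
profiles = np.array([
    [ 0.16,  0.25,  0.40, -0.19, -0.25],
    [ 0.24,  0.30, -0.38, -0.43, -0.33],
    [-0.04,  0.40,  0.41,  0.24,  0.17],
    [ 0.19,  0.41,  0.23, -0.01, -0.31],
    [-0.67, -0.19,  0.00, -0.19, -0.30],
    [-0.56,  0.00, -0.13, -0.06, -0.31],
    [ 0.01,  0.65,  0.24, -0.00, -0.09],
    [ 0.04,  0.30,  0.20,  0.05, -0.20],
    [ 0.63,  0.65,  0.91,  0.55,  0.17],
    [-0.13, -0.62,  0.18, -0.05, -0.35],
    [-0.58, -0.22,  0.03, -0.26, -0.19],
    [-1.12, -0.99, -0.41, -1.03, -0.89],
])

# Distance matrices: d = (1 - r) / 2 for correlation, and plain Euclidean.
d_corr = pdist(profiles, metric="correlation") / 2.0
d_eucl = pdist(profiles, metric="euclidean")

# Build the trees. Note: the simple collapsing rule from the slide,
# d_{xy,z} = (d_{x,z} + d_{y,z}) / 2, is SciPy's method="weighted" (WPGMA);
# method="average" (UPGMA) weights by cluster size instead.
tree_corr = linkage(d_corr, method="weighted")
tree_eucl = linkage(d_eucl, method="weighted")

# dendrogram(tree_corr, labels=names)  # plot with matplotlib, if desired
```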
23
Differences in clustering of experiments
Euclidean
Correlation
24
Results
Correlation: the best correlated is not the most
similar...
Euclidean: ...and the most similar is not the best
correlated.
25
Aggregative hierarchical clustering
  • Problems
  • lack of robustness
  • difficult interpretation
  • subjective cluster definition

26
Clustering methods
[Diagram comparing method types and properties]
                 Non-hierarchical   Hierarchical
Deterministic    K-means, PCA       UPGMA
Neural network   SOM                SOTA
Properties: robust (neural-network methods);
provides different levels of information
(hierarchical methods)
27
K-Means clustering
The idea is to find the best division of N
samples into K clusters C_i such that the total
distance between the clustered samples and their
respective centers (that is, the total variance)
is minimized. This criterion is expressed as
$J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
where $\mu_i$ is the center of class i. An analogy to
linear regression can be seen here: there, the
residuals are the distances from each point to the
regression line; in clustering, the residuals are
the distances between each point and its cluster
center. The k-means algorithm starts by randomly
assigning instances to the classes, computes the
centers according to
$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
then reassigns the instances to the nearest
cluster centers, recalculates the centers, reassigns
the instances, etc., until J stops decreasing (or
the centers stop moving). A two-dimensional example
of the clustering is shown on the slide.
28
K-Means clustering
K-means clustering algorithm
  • 1. Partition the items randomly into k initial
    clusters
  • Decide which distance measure to use
  • Determine the centroid (or mean) for each cluster
  • 2. Then, for each item in turn:
  • a) Calculate the distance between the item and
    all the means
  • b) Re-assign the item to the cluster with the
    closest mean (or centroid)
  • c) Recalculate the centroids for the cluster
    gaining and the cluster losing the item
  • 3. Repeat step 2 until no more reassignments
    take place (a minimal sketch in code follows
    this list)
29
Self-organising maps (SOM)
Bidimensional hexagonal or rectangular network of
output nodes
Input: gene expression matrix

         exp1  exp2  ...  expp
gene1    a11   a12   ...  a1p
gene2    a21   a22   ...  a2p
...
genen    an1   an2   ...  anp
30
SOM: the algorithm
Step 1. Initialize nodes to random values. Set
the initial radius of the neighborhood.
Step 2. Present a new input. Compute distances to
all nodes; Euclidean distances are commonly used.
Step 3. Select the output node j with minimum
distance d_j. Update node j and its neighbors:
nodes in the neighborhood NE_j(t) are updated as
$w_{ij}(t+1) = w_{ij}(t) + \eta(t)(x_i(t) - w_{ij}(t))$ for $j \in NE_j(t)$,
where $\eta(t)$ is a gain term that decreases in time.
Step 4. Repeat by going to Step 2 until convergence.
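A minimal sketch of one training step on a rectangular output grid (the shapes, learning-rate and radius schedules are illustrative, not those of any particular SOM package):

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, p = 4, 5, 10                      # 4x5 output grid, profiles of length p
weights = rng.normal(size=(grid_h, grid_w, p))    # Step 1: nodes initialized to random values

def train_step(x: np.ndarray, t: int, n_steps: int, radius0: float = 2.0) -> None:
    """Present one input profile x and update the winning node and its neighbors."""
    # Step 2: compute (Euclidean) distances from x to all nodes.
    d = np.linalg.norm(weights - x, axis=2)
    # Step 3: select the output node with minimum distance.
    wi, wj = np.unravel_index(d.argmin(), d.shape)
    # Gain term eta(t) and neighborhood radius both decrease in time.
    eta = 0.5 * (1.0 - t / n_steps)
    radius = max(1.0, radius0 * (1.0 - t / n_steps))
    for i in range(grid_h):
        for j in range(grid_w):
            if np.hypot(i - wi, j - wj) <= radius:            # node in the neighborhood NE(t)
                weights[i, j] += eta * (x - weights[i, j])    # w(t+1) = w(t) + eta(t)(x - w(t))
```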
31
SOM results
DeRisi et al. (1997) Exploring the Metabolic and
Genetic Control of Gene Expression on a Genomic
Scale. Science, 278, 680-686
32
SOM: example
Response of human fibroblasts to serum. Iyer et
al., 1999, Science 283:83-87
If a given class is overrepresented, it takes
over many neurons
33
Clustering methods
[Diagram comparing method types and properties]
                 Non-hierarchical   Hierarchical
Deterministic    K-means, PCA       UPGMA
Neural network   SOM                SOTA
Properties: robust (neural-network methods);
provides different levels of information
(hierarchical methods)
34
SOTA clustering
[SOTA tree with clusters A-F]
Interactive, Web-based, configurable
35
SOTA: the algorithm
Step 1. Initialize nodes to random values.
Step 2. Present a new input. Compute distances to
all terminal nodes.
Step 3. Select the output node j with minimum
distance d_j. Update node j and its neighbors:
nodes in the neighborhood NE_j(t) are updated as
$w_{ij}(t+1) = w_{ij}(t) + \eta(t)(x_i(t) - w_{ij}(t))$ for $j \in NE_j(t)$,
where $\eta(t)$ is a gain term that decreases in time.
Step 4. Repeat by going to Step 2 until convergence.
Step 5. Reproduce the node with the highest
variability.
The Self-Organising Tree Algorithm (SOTA) is a
hierarchical divisive method based on a neural
network.
SOTA, unlike other hierarchical methods, grows
from top to bottom until an appropriate level of
variability is reached.
Dopazo, Carazo (1997); Herrero, Valencia, Dopazo (2001)
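To illustrate only the divisive, grow-from-the-top idea (this toy sketch is not the SOTA neural-network update): repeatedly split the cluster with the highest variability using 2-means, stopping at a chosen variability level.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def divisive_split(data: np.ndarray, max_variability: float = 0.5) -> list:
    """Toy divisive clustering (illustration only, not the real SOTA):
    keep splitting the most variable cluster with 2-means until every
    cluster's mean within-cluster variance drops below max_variability."""
    clusters = [np.arange(len(data))]          # start from a single root cluster

    def variability(idx: np.ndarray) -> float:
        return float(data[idx].var(axis=0).mean()) if len(idx) > 1 else 0.0

    while True:
        # "Reproduce" the node (cluster) with the highest variability.
        i = max(range(len(clusters)), key=lambda k: variability(clusters[k]))
        if variability(clusters[i]) <= max_variability:
            break                              # desired level of variability reached
        worst = clusters[i]
        _, labels = kmeans2(data[worst], 2, minit="++", seed=0)
        clusters[i:i + 1] = [worst[labels == 0], worst[labels == 1]]
    return clusters
```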
36
Advantages of SOTA
Robustness against noise
Divisive algorithm: SOTA grows from top to bottom;
growing can be stopped at any desired level of
variability.
Clusters and patterns: each node of the tree has an
associated pattern which corresponds to the
cluster beneath it.
Distribution preserving: the number of clusters
depends on the variability of the data.
37
SOTA/SOM vs classical clustering (UPGMA)
38
What have we learned? Lessons from the
first-generation algorithms and specific demands
for clustering microarray data
  • Number of clusters. K-means, SOM and hierarchical
    methods do not provide any method for defining
    the true number of clusters
  • The wish list
  • Methods must be fast
  • Robustness and noise tolerance
  • Deterministic
  • Able to decide the number of clusters
    automatically