1CZ5211 Topics in Computational Biology
Lecture 3: Clustering Analysis for Microarray Data I
Prof. Chen Yu Zong
Tel: 6874-6877
Email: yzchen@cz3.nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, Level 7, SOC1, NUS
2Clustering Algorithms
- Be wary: confounding computational artifacts are associated with all clustering algorithms.
- You should always understand the basic concepts behind an algorithm before using it.
- Anything will cluster! Garbage In means Garbage Out.
3Supervised vs. Unsupervised Learning
- Supervised: there is a teacher, class labels are known
  - Support vector machines
  - Backpropagation neural networks
- Unsupervised: no teacher, class labels are unknown
  - Clustering
  - Self-organizing maps
4Gene Expression Data
- Gene expression data on p genes for n samples:

Gene   sample1  sample2  sample3  sample4  sample5
  1      0.46     0.30     0.80     1.51     0.90   ...
  2     -0.10     0.49     0.24     0.06     0.46   ...
  3      0.15     0.74     0.04     0.10     0.20   ...
  4     -0.45    -1.03    -0.79    -0.56    -0.32   ...
  5     -0.06     1.06     1.35     1.09    -1.09   ...

- Entry (i, j) is the expression level of gene i in mRNA sample j:
  - two-color cDNA arrays: log(Red intensity / Green intensity)
  - oligonucleotide arrays: log(Avg. PM - Avg. MM)
5Expression Vectors
- Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
- Example vector: (1.5, -0.8, 1.8, 0.5, -1.3, -0.4, 1.5, 0.8)

[Figure: the same expression vector displayed three ways: as a numeric vector, as a line graph, and as a heatmap.]
6Expression Vectors As Points in Expression Space
Gene    t1     t2     t3
G1     -0.8   -0.3   -0.7
G2     -0.8   -0.7   -0.4
G3     -0.4   -0.6   -0.8
G4      0.9    1.2    1.3
G5      1.3    0.9   -0.6

[Figure: the five genes plotted as points in the 3-D space spanned by Experiments 1-3; G1, G2, and G3 show similar expression and lie close together.]
7Cluster Analysis
- Group a collection of objects into subsets or clusters such that objects within a cluster are more closely related to one another than objects assigned to different clusters.
8How can we do this?
- What does "closely related" mean?
  - A distance or similarity metric
- How do we form the groups?
  - A clustering algorithm
- How do we minimize distance between objects in a group while maximizing distances between groups?
9Distance Metrics
- Euclidean distance measures the straight-line distance between points
- Manhattan (city block) distance measures the distance traveled along each dimension
- Correlation measures difference with respect to linear trends

[Figure: two points, (3.5, 4) and (5.5, 6), plotted against Gene Expression 1 (x-axis) and Gene Expression 2 (y-axis).]
10Clustering Gene Expression Data
- Cluster across the rows: group together genes that behave similarly across different conditions.
- Cluster across the columns: group together conditions that behave similarly across most genes.

[Figure: matrix of expression measurements with genes as rows (gene i) and conditions as columns (condition j).]
11Clustering Time Series Data
- Measure gene expression on consecutive days
- Gene measurement matrix:

Gene   Day 1  Day 2  Day 3  Day 4
G1     1.2    4.0    5.0    1.0
G2     2.0    2.5    5.5    6.0
G3     4.5    3.0    2.5    1.0
G4     3.5    1.5    1.2    1.5
12Euclidean Distance
- Distance is the square root of the sum of the squared differences between coordinates:

  d(x, y) = sqrt( Σᵢ (xᵢ − yᵢ)² )

- Example: d(G1, G2) = sqrt( (1.2−2.0)² + (4.0−2.5)² + (5.0−5.5)² + (1.0−6.0)² ) ≈ 5.30
13City Block or Manhattan Distance
- Distance is the sum of the absolute differences between coordinates:

  d(x, y) = Σᵢ |xᵢ − yᵢ|

- Example: d(G1, G2) = |1.2−2.0| + |4.0−2.5| + |5.0−5.5| + |1.0−6.0| = 7.8
14Correlation Distance
- Pearson correlation measures the degree of linear relationship between two variables; range [−1, 1]
- Distance is 1 − (Pearson correlation); range [0, 2]
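The three metrics above can be sketched in a few lines of plain Python (a minimal illustration, not tied to any particular package; the G1/G2 profiles are taken from the time-series matrix on the earlier slide):

```python
import math

# Profiles G1 and G2 from the gene measurement matrix.
G1 = [1.2, 4.0, 5.0, 1.0]
G2 = [2.0, 2.5, 5.5, 6.0]

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of the absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    # Degree of linear relationship, in [-1, 1].
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    # 1 - Pearson correlation, in [0, 2].
    return 1 - pearson(x, y)
```

Note that Euclidean and Manhattan distance depend on the magnitude of the profiles, while correlation distance only depends on their shape.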
15Similarity Measurements
Two profiles (vectors) X = (x₁, …, xₙ) and Y = (y₁, …, yₙ):

  r(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( sqrt(Σᵢ (xᵢ − x̄)²) · sqrt(Σᵢ (yᵢ − ȳ)²) )

−1 ≤ Pearson correlation ≤ 1
16Similarity Measurements
  cos(X, Y) = Σᵢ xᵢyᵢ / ( sqrt(Σᵢ xᵢ²) · sqrt(Σᵢ yᵢ²) )

−1 ≤ Cosine correlation ≤ 1
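The two similarity measures are closely related: Pearson correlation is simply the cosine similarity of the mean-centered profiles. A minimal sketch:

```python
import math

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms; in [-1, 1].
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def pearson_correlation(x, y):
    # Pearson correlation = cosine similarity of the mean-centered vectors.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])
```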
17Hierarchical Clustering
- IDEA: iteratively combine genes into groups based on similar patterns of observed expression.
- By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
- Display the data as a heatmap and dendrogram.
- Cluster genes, samples, or both.
(HCL-1)
18Hierarchical Clustering
[Figure: Venn diagram of clustered data alongside the corresponding dendrogram.]
19Hierarchical clustering
- Merging (agglomerative): start with every measurement as a separate cluster, then combine.
- Splitting (divisive): start with one large cluster, then split into smaller pieces.
- What is the distance between two clusters?
20Distance between clusters
- Single link: distance is the shortest distance from any member of one cluster to any member of the other cluster.
- Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster.
- Average: distance between the averages of all points in each cluster.
- Ward: merge the two clusters that minimize the increase in the within-cluster sum of squares.
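The four between-cluster distances can be sketched as follows (a minimal illustration; clusters are lists of points and `dist` is plain Euclidean distance):

```python
import math

def dist(x, y):
    # Euclidean distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(A, B):
    # Shortest distance from any member of A to any member of B.
    return min(dist(a, b) for a in A for b in B)

def complete_link(A, B):
    # Longest distance from any member of A to any member of B.
    return max(dist(a, b) for a in A for b in B)

def average_link(A, B):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def ward_increase(A, B):
    # Increase in the total within-cluster sum of squares if A and B merge.
    def mean(C):
        return [sum(col) / len(C) for col in zip(*C)]
    return (len(A) * len(B) / (len(A) + len(B))) * dist(mean(A), mean(B)) ** 2
```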
21Hierarchical Clustering-Merging
- Euclidean distance
- Average linkage

[Figure: gene expression time series, showing the distance between clusters when combined.]
22Manhattan Distance
[Figure: the same gene expression time series clustered with Manhattan distance, showing the distance between clusters when combined.]
23Correlation Distance
24Data Standardization
- Data points are normalized with respect to mean and variance ("sphering" the data).
- After sphering, Euclidean distance and correlation distance are equivalent.
- Standardization makes sense if you are not interested in the size of the effects, but in the effect itself.
- Results can be misleading for noisy data.
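The equivalence can be checked numerically: after standardizing each profile to mean 0 and (population) standard deviation 1, the squared Euclidean distance equals 2n(1 − r), so Euclidean and correlation distance rank pairs of profiles identically. A sketch (the x/y profiles are made up for illustration):

```python
import math

def standardize(x):
    # "Sphere" a profile: subtract the mean, divide by the population SD.
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]

def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pearson(x, y):
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

# After sphering, ||zx - zy||^2 = 2n(1 - r), since ||zx||^2 = ||zy||^2 = n.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
lhs = sq_euclidean(standardize(x), standardize(y))
rhs = 2 * len(x) * (1 - pearson(x, y))
```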
25Distance Comments
- Every clustering method is based SOLELY on the measure of distance or similarity.
- E.g., correlation measures linear association between two genes.
  - What if the data are not properly transformed?
  - What about outliers?
  - What about saturation effects?
- Even good data can be ruined by the wrong choice of distance metric.
26Hierarchical Clustering

[Slides 26-35 step through single-linkage clustering on a worked example: starting from the initial data items and their distance matrix, at each step the two closest current clusters are merged and the distance matrix is updated, until the final result is reached.]
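The merge loop illustrated on these slides can be sketched in a few lines (a naive O(n³) version for clarity; real implementations update the distance matrix incrementally rather than rescanning all pairs):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(points):
    # Start with each point as its own cluster; repeatedly merge the pair
    # of clusters with the smallest member-to-member distance.
    clusters = [[p] for p in points]
    merge_heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(d)
        clusters[i] = clusters[i] + clusters.pop(j)
    return merge_heights  # dendrogram heights, in merge order
```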
36Hierarchical Clustering

[Slides 36-44: figure sequence showing the dendrogram and heatmap being assembled step by step; the heatmap color scale runs from H (high) to L (low).]
45Hierarchical Clustering
[Figure: clustered heatmap with genes as rows and samples as columns.]

The Leaf Ordering Problem
- Find the optimal layout of branches for a given dendrogram architecture.
- There are 2^(N−1) possible orderings of the branches.
- For a small microarray dataset of 500 genes, there are about 1.6E150 branch configurations.
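The count follows from the fact that each of the N − 1 internal nodes of a binary dendrogram can be flipped independently; a quick check of the 1.6E150 figure:

```python
# A binary dendrogram on N leaves has N - 1 internal nodes, and each can
# be flipped independently, giving 2**(N - 1) possible leaf orderings.
def leaf_orderings(n_leaves):
    return 2 ** (n_leaves - 1)

# For 500 genes this is 2**499, roughly 1.6e150 branch configurations.
```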
46Hierarchical Clustering
The Leaf Ordering Problem
47Hierarchical Clustering
- Pros
  - Commonly used algorithm
  - Simple and quick to calculate
- Cons
  - Real genes probably do not have a hierarchical organization
48Using Hierarchical Clustering
- Choose what samples and genes to use in your analysis
- Choose a similarity/distance metric
- Choose the clustering direction
- Choose the linkage method
- Calculate the dendrogram
- Choose the height/number of clusters for interpretation
- Assess the results
- Interpret the cluster structure
49Choose what samples/genes to include
- Very important step
- Do you want to include housekeeping genes or genes that didn't change in your results?
- How do you handle replicates from the same sample?
- Noisy samples?
- The dendrogram is a mess if everything is included in large datasets
- Gene screening
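One common screen, sketched here, is to keep only the most variable genes, so flat housekeeping-like profiles are dropped before clustering (the gene names and profiles below are hypothetical):

```python
def variance(x):
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / len(x)

def top_variable_genes(expression, k):
    # Rank genes by variance across samples and keep the top k.
    ranked = sorted(expression, key=lambda g: variance(expression[g]),
                    reverse=True)
    return ranked[:k]

# Hypothetical profiles: "flat" barely changes and is screened out.
profiles = {
    "geneA": [1.2, 4.0, 5.0, 1.0],
    "geneB": [2.0, 2.5, 5.5, 6.0],
    "flat":  [0.1, 0.1, 0.1, 0.1],
}
```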
50No Filtering
51Filtering 100 relevant genes
522. Choose distance metric
- The metric should be a valid measure of the distance/similarity of genes.
- Examples:
  - Applying Euclidean distance to categorical data is invalid.
  - A correlation metric applied to highly skewed data will give misleading results.
533. Choose clustering direction
- Merging (agglomerative): bottom up.
- Divisive (top down): split so that the genes within each of the two resulting clusters are the most similar, maximizing the distance between the clusters.
54Nearest Neighbor Algorithm
- The Nearest Neighbor Algorithm is an agglomerative (bottom-up) approach.
- It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
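A minimal sketch of that loop, stopping when k clusters remain (nearest-member distance between clusters; the sample points in the test are made up):

```python
import math

def nearest_neighbor_clusters(points, k):
    # Agglomerative: start with n singleton clusters and merge the two
    # closest clusters (nearest members) until only k clusters remain.
    clusters = [[p] for p in points]
    def d(A, B):
        return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
                   for x in A for y in B)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```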
55Nearest Neighbor, Level 3, k = 6 clusters.
56Nearest Neighbor, Level 4, k = 5 clusters.
57Nearest Neighbor, Level 5, k = 4 clusters.
58Nearest Neighbor, Level 6, k = 3 clusters.
59Nearest Neighbor, Level 7, k = 2 clusters.
60Nearest Neighbor, Level 8, k = 1 cluster.
61Hierarchical Clustering
- Keys: similarity and clustering.
1. Calculate the similarity between all possible combinations of two profiles.
2. Group the two most similar clusters together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters; repeat.
62Hierarchical Clustering
[Figure: three clusters C1, C2, C3 in expression space.]

Merge which pair of clusters?
63Hierarchical Clustering
Single Linkage

Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.

[Figure: clusters C1 and C2 joined by their closest pair of members.]

Tends to generate long chains.
64Hierarchical Clustering
Complete Linkage

Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.

[Figure: clusters C1 and C2 joined by their farthest pair of members.]

Tends to generate compact clumps.
65Hierarchical Clustering
Average Linkage

Dissimilarity between two clusters = averaged distance over all pairs of objects (one from each cluster).

[Figure: clusters C1 and C2 with all cross-cluster pairs averaged.]
66Hierarchical Clustering
Average Group Linkage

Dissimilarity between two clusters = distance between the two cluster means.

[Figure: clusters C1 and C2 joined through their centroids.]
67Which one?
- Both methods are "step-wise optimal": at each step the optimal split or merge is performed.
- This does not mean that the final result is optimal.
- Merging (agglomerative):
  - Computationally simple.
  - Precise at the bottom of the tree.
  - Good for many small clusters.
- Divisive:
  - More complex, but more precise at the top of the tree.
  - Good for looking at large and/or few clusters.
- For gene expression applications, divisive clustering makes more sense.