1CZ5211 Topics in Computational Biology
Lecture 3: Clustering Analysis for Microarray Data I
Prof. Chen Yu Zong
Tel: 6874-6877
Email: yzchen@cz3.nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, Level 7, SOC1, NUS
2Clustering Algorithms
- Be wary: confounding computational artifacts are associated with all clustering algorithms.
- You should always understand the basic concepts behind an algorithm before using it.
- Anything will cluster! Garbage In means Garbage Out.
3Supervised vs. Unsupervised Learning
- Supervised: there is a teacher, class labels are known
  - Support vector machines
  - Backpropagation neural networks
- Unsupervised: no teacher, class labels are unknown
  - Clustering
  - Self-organizing maps
4Gene Expression Data
- Gene expression data on p genes for n samples:

Gene   sample1  sample2  sample3  sample4  sample5
  1      0.46     0.30     0.80     1.51     0.90   ...
  2     -0.10     0.49     0.24     0.06     0.46   ...
  3      0.15     0.74     0.04     0.10     0.20   ...
  4     -0.45    -1.03    -0.79    -0.56    -0.32   ...
  5     -0.06     1.06     1.35     1.09    -1.09   ...

- Entry (i, j) is the expression level of gene i in mRNA sample j:
  - two-color cDNA arrays: log(Red intensity / Green intensity)
  - oligonucleotide arrays: log(Avg. PM - Avg. MM)
5Expression Vectors
- Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
- Example vector: (1.5, -0.8, 1.8, 0.5, -1.3, -0.4, 1.5, 0.8)

[Figure: the same expression vector displayed three ways: as a numeric vector, as a line graph, and as a heatmap.]
6Expression Vectors As Points in Expression Space
Gene    t1     t2     t3
G1     -0.8   -0.3   -0.7
G2     -0.8   -0.7   -0.4
G3     -0.4   -0.6   -0.8
G4      0.9    1.2    1.3
G5      1.3    0.9   -0.6

[Figure: the five genes plotted as points in the 3-D space spanned by Experiments 1-3; G1, G2, and G3 show similar expression and lie close together.]
7Cluster Analysis
- Group a collection of objects into subsets or clusters such that objects within a cluster are more closely related to one another than objects assigned to different clusters.
8How can we do this?
- What does "closely related" mean?
  - A distance or similarity metric
- How do we form the groups?
  - A clustering algorithm
- How do we minimize distance between objects in a group while maximizing distances between groups?
9Distance Metrics
- Euclidean distance measures the straight-line distance between points
- Manhattan (city block) distance measures the distance traveled along each dimension
- Correlation measures difference with respect to linear trends

[Figure: two points, (3.5, 4) and (5.5, 6), plotted against Gene Expression 1 (x-axis) and Gene Expression 2 (y-axis).]
10Clustering Gene Expression Data
- Cluster across the rows: group together genes that behave similarly across different conditions.
- Cluster across the columns: group together conditions that behave similarly across most genes.

[Figure: matrix of expression measurements with genes as rows (gene i) and conditions as columns (condition j).]
11Clustering Time Series Data
- Measure gene expression on consecutive days
- Gene measurement matrix:

Gene   Day 1  Day 2  Day 3  Day 4
G1     1.2    4.0    5.0    1.0
G2     2.0    2.5    5.5    6.0
G3     4.5    3.0    2.5    1.0
G4     3.5    1.5    1.2    1.5
12Euclidean Distance
- Distance is the square root of the sum of the squared differences between coordinates:

  d(x, y) = sqrt( Σᵢ (xᵢ − yᵢ)² )

- Example: d(G1, G2) = sqrt( (1.2−2.0)² + (4.0−2.5)² + (5.0−5.5)² + (1.0−6.0)² ) ≈ 5.30
13City Block or Manhattan Distance
- Distance is the sum of the absolute differences between coordinates:

  d(x, y) = Σᵢ |xᵢ − yᵢ|

- Example: d(G1, G2) = |1.2−2.0| + |4.0−2.5| + |5.0−5.5| + |1.0−6.0| = 7.8
14Correlation Distance
- Pearson correlation measures the degree of linear relationship between two variables; range [−1, 1]
- Distance is 1 − (Pearson correlation); range [0, 2]
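The three metrics above can be sketched in a few lines of plain Python (a minimal illustration, not tied to any particular package; the G1/G2 profiles are taken from the time-series matrix on the earlier slide):

```python
import math

# Profiles G1 and G2 from the gene measurement matrix.
G1 = [1.2, 4.0, 5.0, 1.0]
G2 = [2.0, 2.5, 5.5, 6.0]

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of the absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    # Degree of linear relationship, in [-1, 1].
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    # 1 - Pearson correlation, in [0, 2].
    return 1 - pearson(x, y)
```

Note that Euclidean and Manhattan distance depend on the magnitude of the profiles, while correlation distance only depends on their shape.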
15Similarity Measurements
Two profiles (vectors) X = (x₁, …, xₙ) and Y = (y₁, …, yₙ):

  r(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( sqrt(Σᵢ (xᵢ − x̄)²) · sqrt(Σᵢ (yᵢ − ȳ)²) )

−1 ≤ Pearson correlation ≤ 1
16Similarity Measurements
  cos(X, Y) = Σᵢ xᵢyᵢ / ( sqrt(Σᵢ xᵢ²) · sqrt(Σᵢ yᵢ²) )

−1 ≤ Cosine correlation ≤ 1
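The two similarity measures are closely related: Pearson correlation is simply the cosine similarity of the mean-centered profiles. A minimal sketch:

```python
import math

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms; in [-1, 1].
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def pearson_correlation(x, y):
    # Pearson correlation = cosine similarity of the mean-centered vectors.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])
```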
17Hierarchical Clustering
- IDEA: iteratively combine genes into groups based on similar patterns of observed expression.
- By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
- Display the data as a heatmap and dendrogram.
- Cluster genes, samples, or both.
(HCL-1)
18Hierarchical Clustering
[Figure: Venn diagram of clustered data alongside the corresponding dendrogram.]
19Hierarchical clustering
- Merging (agglomerative): start with every measurement as a separate cluster, then combine.
- Splitting (divisive): start with one large cluster, then split into smaller pieces.
- What is the distance between two clusters?
20Distance between clusters
- Single link: distance is the shortest distance from any member of one cluster to any member of the other cluster.
- Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster.
- Average: distance between the averages of all points in each cluster.
- Ward: merge the two clusters that minimize the increase in the within-cluster sum of squares.
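The four between-cluster distances can be sketched as follows (a minimal illustration; clusters are lists of points and `dist` is plain Euclidean distance):

```python
import math

def dist(x, y):
    # Euclidean distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(A, B):
    # Shortest distance from any member of A to any member of B.
    return min(dist(a, b) for a in A for b in B)

def complete_link(A, B):
    # Longest distance from any member of A to any member of B.
    return max(dist(a, b) for a in A for b in B)

def average_link(A, B):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def ward_increase(A, B):
    # Increase in the total within-cluster sum of squares if A and B merge.
    def mean(C):
        return [sum(col) / len(C) for col in zip(*C)]
    return (len(A) * len(B) / (len(A) + len(B))) * dist(mean(A), mean(B)) ** 2
```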
21Hierarchical Clustering-Merging
- Euclidean distance
- Average linkage

[Figure: gene expression time series, showing the distance between clusters when combined.]
22Manhattan Distance
[Figure: the same gene expression time series clustered with Manhattan distance, showing the distance between clusters when combined.]
23Correlation Distance
24Data Standardization
- Data points are normalized with respect to mean and variance ("sphering" the data).
- After sphering, Euclidean distance and correlation distance are equivalent.
- Standardization makes sense if you are not interested in the size of the effects, but in the effect itself.
- Results can be misleading for noisy data.
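The equivalence can be checked numerically: after standardizing each profile to mean 0 and (population) standard deviation 1, the squared Euclidean distance equals 2n(1 − r), so Euclidean and correlation distance rank pairs of profiles identically. A sketch (the x/y profiles are made up for illustration):

```python
import math

def standardize(x):
    # "Sphere" a profile: subtract the mean, divide by the population SD.
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]

def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pearson(x, y):
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

# After sphering, ||zx - zy||^2 = 2n(1 - r), since ||zx||^2 = ||zy||^2 = n.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
lhs = sq_euclidean(standardize(x), standardize(y))
rhs = 2 * len(x) * (1 - pearson(x, y))
```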
25Distance Comments
- Every clustering method is based SOLELY on the measure of distance or similarity.
- E.g., correlation measures linear association between two genes.
  - What if the data are not properly transformed?
  - What about outliers?
  - What about saturation effects?
- Even good data can be ruined by the wrong choice of distance metric.
26Hierarchical Clustering

[Slides 26-35 step through single-linkage clustering on a worked example: starting from the initial data items and their distance matrix, at each step the two closest current clusters are merged and the distance matrix is updated, until the final result is reached.]
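The merge loop illustrated on these slides can be sketched in a few lines (a naive O(n³) version for clarity; real implementations update the distance matrix incrementally rather than rescanning all pairs):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(points):
    # Start with each point as its own cluster; repeatedly merge the pair
    # of clusters with the smallest member-to-member distance.
    clusters = [[p] for p in points]
    merge_heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(d)
        clusters[i] = clusters[i] + clusters.pop(j)
    return merge_heights  # dendrogram heights, in merge order
```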
36Hierarchical Clustering

[Slides 36-44: figure sequence showing the dendrogram and heatmap being assembled step by step; the heatmap color scale runs from H (high) to L (low).]
45Hierarchical Clustering
[Figure: clustered heatmap with genes as rows and samples as columns.]

The Leaf Ordering Problem
- Find the optimal layout of branches for a given dendrogram architecture.
- There are 2^(N−1) possible orderings of the branches.
- For a small microarray dataset of 500 genes, there are about 1.6E150 branch configurations.
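The count follows from the fact that each of the N − 1 internal nodes of a binary dendrogram can be flipped independently; a quick check of the 1.6E150 figure:

```python
# A binary dendrogram on N leaves has N - 1 internal nodes, and each can
# be flipped independently, giving 2**(N - 1) possible leaf orderings.
def leaf_orderings(n_leaves):
    return 2 ** (n_leaves - 1)

# For 500 genes this is 2**499, roughly 1.6e150 branch configurations.
```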
46Hierarchical Clustering
The Leaf Ordering Problem
47Hierarchical Clustering
- Pros
  - Commonly used algorithm
  - Simple and quick to calculate
- Cons
  - Real genes probably do not have a hierarchical organization
48Using Hierarchical Clustering
- Choose what samples and genes to use in your analysis
- Choose a similarity/distance metric
- Choose the clustering direction
- Choose the linkage method
- Calculate the dendrogram
- Choose the height/number of clusters for interpretation
- Assess the results
- Interpret the cluster structure
49Choose what samples/genes to include
- Very important step
- Do you want to include housekeeping genes or genes that didn't change in your results?
- How do you handle replicates from the same sample?
- Noisy samples?
- The dendrogram is a mess if everything is included in large datasets
- Gene screening
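One common screen, sketched here, is to keep only the most variable genes, so flat housekeeping-like profiles are dropped before clustering (the gene names and profiles below are hypothetical):

```python
def variance(x):
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / len(x)

def top_variable_genes(expression, k):
    # Rank genes by variance across samples and keep the top k.
    ranked = sorted(expression, key=lambda g: variance(expression[g]),
                    reverse=True)
    return ranked[:k]

# Hypothetical profiles: "flat" barely changes and is screened out.
profiles = {
    "geneA": [1.2, 4.0, 5.0, 1.0],
    "geneB": [2.0, 2.5, 5.5, 6.0],
    "flat":  [0.1, 0.1, 0.1, 0.1],
}
```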
50No Filtering
51Filtering 100 relevant genes
522. Choose distance metric
- The metric should be a valid measure of the distance/similarity of genes.
- Examples:
  - Applying Euclidean distance to categorical data is invalid.
  - A correlation metric applied to highly skewed data will give misleading results.
533. Choose clustering direction
- Merging (agglomerative): bottom up.
- Divisive (top down): split so that the genes within each of the two resulting clusters are the most similar, maximizing the distance between the clusters.
54Nearest Neighbor Algorithm
- The Nearest Neighbor Algorithm is an agglomerative (bottom-up) approach.
- It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
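A minimal sketch of that loop, stopping when k clusters remain (nearest-member distance between clusters; the sample points in the test are made up):

```python
import math

def nearest_neighbor_clusters(points, k):
    # Agglomerative: start with n singleton clusters and merge the two
    # closest clusters (nearest members) until only k clusters remain.
    clusters = [[p] for p in points]
    def d(A, B):
        return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
                   for x in A for y in B)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```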
55Nearest Neighbor, Level 3, k = 6 clusters.
56Nearest Neighbor, Level 4, k = 5 clusters.
57Nearest Neighbor, Level 5, k = 4 clusters.
58Nearest Neighbor, Level 6, k = 3 clusters.
59Nearest Neighbor, Level 7, k = 2 clusters.
60Nearest Neighbor, Level 8, k = 1 cluster.
61Hierarchical Clustering
- Keys: similarity and clustering.
1. Calculate the similarity between all possible combinations of two profiles.
2. Group the two most similar clusters together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters; repeat.
62Hierarchical Clustering
[Figure: three clusters C1, C2, C3 in expression space.]

Merge which pair of clusters?
63Hierarchical Clustering
Single Linkage

Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.

[Figure: clusters C1 and C2 joined by their closest pair of members.]

Tends to generate long chains.
64Hierarchical Clustering
Complete Linkage

Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.

[Figure: clusters C1 and C2 joined by their farthest pair of members.]

Tends to generate compact clumps.
65Hierarchical Clustering
Average Linkage

Dissimilarity between two clusters = averaged distance over all pairs of objects (one from each cluster).

[Figure: clusters C1 and C2 with all cross-cluster pairs averaged.]
66Hierarchical Clustering
Average Group Linkage

Dissimilarity between two clusters = distance between the two cluster means.

[Figure: clusters C1 and C2 joined through their centroids.]
67Which one?
- Both methods are "step-wise optimal": at each step the optimal split or merge is performed.
- This does not mean that the final result is optimal.
- Merging (agglomerative):
  - Computationally simple.
  - Precise at the bottom of the tree.
  - Good for many small clusters.
- Divisive:
  - More complex, but more precise at the top of the tree.
  - Good for looking at large and/or few clusters.
- For gene expression applications, divisive clustering makes more sense.