CZ5211 Topics in Computational Biology
Lecture 3: Clustering Analysis for Microarray Data I

Transcript and Presenter's Notes



1
CZ5211 Topics in Computational Biology
Lecture 3: Clustering Analysis for Microarray Data I
Prof. Chen Yu Zong
Tel: 6874-6877
Email: yzchen@cz3.nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, Level 7, SOC1, NUS
2
Clustering Algorithms
  • Be wary: confounding computational artifacts
    are associated with all clustering algorithms.
    You should always understand the basic concepts
    behind an algorithm before using it.
  • Anything will cluster! Garbage in means garbage
    out.

3
Supervised vs. Unsupervised Learning
  • Supervised: there is a teacher; class labels are
    known
  • Support vector machines
  • Backpropagation neural networks
  • Unsupervised: no teacher; class labels are
    unknown
  • Clustering
  • Self-organizing maps

4
Gene Expression Data
  • Gene expression data on p genes for n samples

                        mRNA samples
             sample1  sample2  sample3  sample4  sample5
  Genes  1     0.46     0.30     0.80     1.51     0.90   ...
         2    -0.10     0.49     0.24     0.06     0.46   ...
         3     0.15     0.74     0.04     0.10     0.20   ...
         4    -0.45    -1.03    -0.79    -0.56    -0.32   ...
         5    -0.06     1.06     1.35     1.09    -1.09   ...

  Entry (i, j) is the gene expression level of gene i in mRNA sample j:
  Log(Red intensity / Green intensity) for two-color cDNA arrays, or
  Log(Avg. PM - Avg. MM) for Affymetrix arrays.
5
Expression Vectors
  • Gene Expression Vectors encapsulate the
    expression of a gene over a set of experimental
    conditions or sample types.

Example numeric vector: (1.5, -0.8, 1.8, 0.5, -1.3, -0.4, 1.5, 0.8)
[Figure: the same vector rendered as a numeric vector, a line graph, and a heatmap]
6
Expression Vectors As Points in Expression Space
         t1     t2     t3
  G1   -0.8   -0.3   -0.7
  G2   -0.8   -0.7   -0.4
  G3   -0.4   -0.6   -0.8
  G4    0.9    1.2    1.3
  G5    1.3    0.9   -0.6

  G1 and G2 show similar expression.
  [Figure: genes plotted as points in 3-D expression space, with axes
  Experiment 1, Experiment 2, and Experiment 3]
7
Cluster Analysis
  • Group a collection of objects into subsets or
    clusters such that objects within a cluster are
    more closely related to one another than to
    objects assigned to different clusters.

8
How can we do this?
  • What is closely related?
  • Distance or similarity metric
  • What is close?
  • Clustering algorithm
  • How do we minimize distance between objects in a
    group while maximizing distances between groups?

9
Distance Metrics
  • Euclidean distance: straight-line distance
    between two points
  • Manhattan (City Block): sum of the absolute
    differences in each dimension
  • Correlation: measures difference with respect to
    linear trends

  [Figure: points (3.5, 4) and (5.5, 6) plotted against axes
  Gene Expression 1 and Gene Expression 2]
10
Clustering Gene Expression Data
Expression Measurements
  • Cluster across the rows: group together genes
    that behave similarly across different
    conditions.
  • Cluster across the columns: group together
    conditions that behave similarly across most
    genes.

  [Figure: expression matrix with genes as rows (index i) and
  measurements as columns (index j)]
11
Clustering Time Series Data
  • Measure gene expression on consecutive days
  • Gene measurement matrix:

         day 1  day 2  day 3  day 4
    G1    1.2    4.0    5.0    1.0
    G2    2.0    2.5    5.5    6.0
    G3    4.5    3.0    2.5    1.0
    G4    3.5    1.5    1.2    1.5

12
Euclidean Distance
  • Distance is the square root of the sum of the
    squared differences between coordinates:

    d(x, y) = sqrt( Σ_i (x_i - y_i)² )
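As a minimal sketch in plain Python (using the G1 and G2 profiles from the gene measurement matrix on slide 11), Euclidean distance can be computed as:

```python
import math

# Profiles G1 and G2 from the gene measurement matrix (slide 11)
g1 = [1.2, 4.0, 5.0, 1.0]
g2 = [2.0, 2.5, 5.5, 6.0]

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(round(euclidean(g1, g2), 2))  # 5.3
```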

13
City Block or Manhattan Distance
  • G1 1.2 4.0 5.0 1.0
  • G2 2.0 2.5 5.5 6.0
  • G3 4.5 3.0 2.5 1.0
  • G4 3.5 1.5 1.2 1.5
  • Distance is the sum of the absolute differences
    between coordinates:

    d(x, y) = Σ_i |x_i - y_i|
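The same profiles give a simple Manhattan-distance sketch (plain Python, illustrative only):

```python
# Manhattan (city block) distance for the same two profiles
g1 = [1.2, 4.0, 5.0, 1.0]
g2 = [2.0, 2.5, 5.5, 6.0]

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(round(manhattan(g1, g2), 2))  # 0.8 + 1.5 + 0.5 + 5.0 = 7.8
```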

14
Correlation Distance
  • Pearson correlation measures the degree of linear
    relationship between two variables; range [-1, 1]
  • Distance is 1 - (Pearson correlation); range
    [0, 2]
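A minimal pure-Python sketch of correlation distance (the profile names are the toy data from slide 11; a scaled, shifted copy is used to show that identical linear trends give distance zero):

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient, in [-1, 1]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    # 1 - r, in [0, 2]: 0 for perfectly correlated profiles,
    # 2 for perfectly anti-correlated ones
    return 1 - pearson(x, y)

# A profile and a shifted, scaled copy share the same linear trend,
# so their correlation distance is (numerically) zero
g1 = [1.2, 4.0, 5.0, 1.0]
g1_scaled = [2 * v + 3 for v in g1]
print(round(correlation_distance(g1, g1_scaled), 6))  # 0.0
```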

15
Similarity Measurements
  • Pearson Correlation

Two profiles (vectors) x = (x_1, ..., x_n) and y = (y_1, ..., y_n):

    r(x, y) = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt( Σ_i (x_i - x̄)² · Σ_i (y_i - ȳ)² )

-1 ≤ Pearson Correlation ≤ 1
16
Similarity Measurements
  • Cosine Correlation

    cos(x, y) = Σ_i x_i y_i / ( ||x|| · ||y|| )

-1 ≤ Cosine Correlation ≤ 1
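A short sketch of cosine similarity in plain Python; note that, unlike Pearson correlation, the vectors are not mean-centered first, so the two measures coincide only for centered data:

```python
import math

def cosine_similarity(x, y):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0 (same direction)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0 (orthogonal)
```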
17
Hierarchical Clustering
  • IDEA: iteratively combine genes into groups
    based on similar patterns of observed expression
  • By combining genes with genes OR genes with
    groups, the algorithm produces a dendrogram of
    the hierarchy of relationships.
  • Display the data as a heatmap and dendrogram
  • Cluster genes, samples, or both

18
Hierarchical Clustering
[Figure: Venn diagram of clustered data and the corresponding dendrogram]
19
Hierarchical clustering
  • Merging (agglomerative): start with every
    measurement as a separate cluster, then combine
  • Splitting (divisive): start with one large
    cluster, then split into smaller pieces
  • What is the distance between two clusters?

20
Distance between clusters
  • Single link: distance is the shortest distance
    from any member of one cluster to any member of
    the other cluster
  • Complete link: distance is the longest distance
    from any member of one cluster to any member of
    the other cluster
  • Average: distance between the averages of all
    points in each cluster
  • Ward: merge the two clusters that minimize the
    increase in the total within-cluster sum of
    squares
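The first three cluster-distance rules above can be sketched in plain Python (a hedged illustration; Ward's rule is omitted, and `dist` can be any point-level metric such as Euclidean):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(c1, c2, dist=euclidean):
    # shortest distance from any member of one cluster to any
    # member of the other
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, dist=euclidean):
    # longest distance between members of the two clusters
    return max(dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2, dist=euclidean):
    # mean over all cross-cluster pairs of points
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

# Two small 1-D clusters
c1 = [(0.0,), (1.0,)]
c2 = [(4.0,), (6.0,)]
print(single_link(c1, c2))    # 3.0
print(complete_link(c1, c2))  # 6.0
print(average_link(c1, c2))   # 4.5
```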

21
Hierarchical Clustering - Merging
  • Euclidean distance
  • Average linking

  [Figure: distance between clusters when combined, for a gene
  expression time series]
22
Manhattan Distance
  • Average linking

  [Figure: distance between clusters when combined, for a gene
  expression time series]
23
Correlation Distance
24
Data Standardization
  • Data points are normalized with respect to mean
    and variance ("sphering" the data)
  • After sphering, Euclidean distance and
    correlation distance are equivalent
  • Standardization makes sense if you are interested
    in the effect itself, not in the size of the
    effect
  • Results can be misleading for noisy data
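The claimed equivalence can be checked numerically: for profiles z-scored with the population standard deviation (an assumption of this sketch), the squared Euclidean distance equals 2n(1 - r), where r is the Pearson correlation:

```python
import math

def standardize(v):
    # z-score using the population standard deviation ("sphering")
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in v) / n)
    return [(a - m) / sd for a in v]

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

g1 = [1.2, 4.0, 5.0, 1.0]
g2 = [2.0, 2.5, 5.5, 6.0]
lhs = squared_euclidean(standardize(g1), standardize(g2))
rhs = 2 * len(g1) * (1 - pearson(g1, g2))
print(abs(lhs - rhs) < 1e-9)  # True
```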

25
Distance Comments
  • Every clustering method is based SOLELY on the
    measure of distance or similarity
  • e.g., correlation measures the linear association
    between two genes
  • What if the data are not properly transformed?
  • What about outliers?
  • What about saturation effects?
  • Even good data can be ruined by the wrong choice
    of distance metric

26
Hierarchical Clustering
Distance Matrix
Initial Data Items
  [Slides 26-35: step-by-step single-linkage animation. Starting from
  the initial data items and their distance matrix, the two clusters
  with the smallest single-link distance are merged at each step and
  the distance matrix is updated, until all items are joined in the
  final result.]
36
Hierarchical Clustering
  [Slides 36-44: figures only, building up the clustered heatmap and
  dendrogram, with expression colored from low (L) to high (H).]
45
Hierarchical Clustering
  [Figure: clustered heatmap with genes as rows and samples as columns]
The Leaf Ordering Problem
  • Find the optimal layout of branches for a given
    dendrogram architecture
  • 2^(N-1) possible orderings of the branches
  • For a small microarray dataset of 500 genes,
    there are about 1.6E150 branch configurations
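The count follows because each of the N - 1 internal nodes of a binary dendrogram can independently swap its two subtrees. A quick check of the 500-gene figure:

```python
def leaf_orderings(n):
    # each of the n - 1 internal nodes can flip its two subtrees
    return 2 ** (n - 1)

print(f"{float(leaf_orderings(500)):.1e}")  # 1.6e+150
```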

46
Hierarchical Clustering
The Leaf Ordering Problem
47
Hierarchical Clustering
  • Pros
  • Commonly used algorithm
  • Simple and quick to calculate
  • Cons
  • Real genes probably do not have a hierarchical
    organization

48
Using Hierarchical Clustering
  1. Choose what samples and genes to use in your
     analysis
  2. Choose similarity/distance metric
  3. Choose clustering direction
  4. Choose linkage method
  5. Calculate the dendrogram
  6. Choose height/number of clusters for
     interpretation
  7. Assess results
  8. Interpret cluster structure
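The steps above can be sketched end-to-end with SciPy (an assumption of this sketch: SciPy and NumPy are available; the gene names and 4x4 matrix are the toy example from slide 11, not real microarray data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

genes = ["G1", "G2", "G3", "G4"]
expr = np.array([
    [1.2, 4.0, 5.0, 1.0],
    [2.0, 2.5, 5.5, 6.0],
    [4.5, 3.0, 2.5, 1.0],
    [3.5, 1.5, 1.2, 1.5],
])

# steps 2-5: Euclidean metric, cluster rows (genes), average linkage
Z = linkage(expr, method="average", metric="euclidean")

# step 6: cut the tree into a chosen number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
for gene, lab in zip(genes, labels):
    print(gene, lab)
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree for visual assessment (steps 7-8).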

49
1. Choose what samples/genes to include
  • Very important step
  • Do you want to include housekeeping genes or
    genes that didn't change in your results?
  • How do you handle replicates from the same
    sample?
  • Noisy samples?
  • Dendrogram is a mess if everything is included in
    large datasets
  • Gene screening

50
No Filtering
51
Filtering: 100 relevant genes
52
2. Choose distance metric
  • Metric should be a valid measure of the
    distance/similarity of genes
  • Examples
  • Applying Euclidean distance to categorical data
    is invalid
  • Correlation metric applied to highly skewed data
    will give misleading results

53
3. Choose clustering direction
  • Merging (agglomerative, bottom-up)
  • Divisive (top-down):
  • split so that genes within each of the two
    clusters are the most similar; maximize distance
    between clusters

54
Nearest Neighbor Algorithm
  • Nearest Neighbor Algorithm is an agglomerative
    approach (bottom-up).
  • Starts with n nodes (n is the size of our
    sample), merges the 2 most similar nodes at each
    step, and stops when the desired number of
    clusters is reached.
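A minimal pure-Python sketch of this bottom-up loop (single-linkage merging, stopping at k clusters; the point data are illustrative):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor_cluster(points, k):
    # start with every point as its own cluster (n nodes)
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the two clusters with the closest pair of members
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the two most similar nodes
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(nearest_neighbor_cluster(pts, 3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```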

55
Nearest Neighbor, Level 3, k 6 clusters.
56
Nearest Neighbor, Level 4, k 5 clusters.
57
Nearest Neighbor, Level 5, k 4 clusters.
58
Nearest Neighbor, Level 6, k 3 clusters.
59
Nearest Neighbor, Level 7, k 2 clusters.
60
Nearest Neighbor, Level 8, k 1 cluster.
61
Hierarchical Clustering
  • Keys: similarity measure and clustering rule
  1. Calculate the similarity between all possible
     pairs of profiles.
  2. Group the two most similar clusters together to
     form a new cluster.
  3. Calculate the similarity between the new cluster
     and all remaining clusters, and repeat from
     step 2.
62
Hierarchical Clustering
  [Figure: three clusters C1, C2, C3 - merge which pair of clusters?]
63
Hierarchical Clustering
Single Linkage
Dissimilarity between two clusters: minimum
dissimilarity between the members of the two
clusters

  [Figure: clusters C1 and C2 joined by their closest pair of points]

Tends to generate long chains
64
Hierarchical Clustering
Complete Linkage
Dissimilarity between two clusters: maximum
dissimilarity between the members of the two
clusters

  [Figure: clusters C1 and C2 joined by their farthest pair of points]

Tends to generate compact clumps
65
Hierarchical Clustering
Average Linkage
Dissimilarity between two clusters: average of the
distances over all pairs of objects (one from each
cluster)

  [Figure: clusters C1 and C2 with all cross-cluster pairs averaged]
66
Hierarchical Clustering
Average Group Linkage
Dissimilarity between two clusters: distance
between the two cluster means

  [Figure: clusters C1 and C2 joined at their centroids]
67
Which one?
  • Both methods are step-wise optimal: at each
    step the optimal split or merge is performed
  • This doesn't mean that the final result is
    optimal
  • Merging:
  • Computationally simple
  • Precise at the bottom of the tree
  • Good for many small clusters
  • Divisive:
  • More complex, but more precise at the top of the
    tree
  • Good for looking at large and/or few clusters
  • For gene expression applications, divisive makes
    more sense