DNA Chips and Their Analysis Comp. Genomics: Lecture 13 based on many sources, primarily Zohar Yakhini

Provided by: SteveH149

1
DNA Chips and Their Analysis
Comp. Genomics: Lecture 13
based on many sources, primarily Zohar Yakhini
2
DNA Microarrays: Basics

What are they.
Types of arrays (cDNA arrays, oligo arrays).
What is measured using DNA microarrays.
How are the measurements done?

3
DNA Microarrays: Computational Questions

Design of arrays.
Techniques for analyzing experiments.
Detecting differential expression.
Similar expression: clustering.
Other analysis techniques (many, many).
Machine learning techniques, and applications for
advanced diagnosis.

4
What is a DNA Microarray (I)

A surface (nylon, glass, or plastic).
Containing hundreds to thousands of pixels.
Each pixel has copies of a sequence
of single stranded DNA (ssDNA).
Each such sequence is called a probe.

5
What is a DNA Microarray (II)

An experiment with 500-10k elements.
Way to concurrently explore the function of
multiple genes.
A snapshot of the expression level of 500-10k
genes under given test conditions

6
Some Microarray Terminology

Probe: ssDNA printed on the solid substrate
(nylon or glass). Probes are
short substrings of the genes we are going to
be testing.
Target: cDNA which has been labeled and is to be
washed over the probes.

7
Back to Basics: Watson and Crick
James Watson and Francis Crick discovered, in
1953, the double helix structure of DNA.
From Zohar Yakhini
8
Watson-Crick Complementarity
A binds to T; C binds to G
From Zohar Yakhini
9
Array Based Hybridization Assays (DNA Chips)

Array of probes
Thousands to millions of different probe
sequences per array.

Unknown sequence or mixture (target). Many copies.
From Zohar Yakhini
10
Array Based Hyb Assays

Target hybridizes to Watson-Crick complementary probes only
Therefore the fluorescence pattern is
indicative of the target sequence.

From Zohar Yakhini
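The pairing rule behind hybridization can be sketched in a few lines of Python. This is only an idealized illustration: the function names `reverse_complement` and `hybridizes` are hypothetical, and exact-match lookup ignores the thermodynamics and partial matches that matter on a real array.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs with `seq`, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def hybridizes(probe: str, target: str) -> bool:
    """Idealized test: the target contains the exact complement of the probe."""
    return reverse_complement(probe) in target.upper()


print(reverse_complement("ATTGC"))       # GCAAT
print(hybridizes("ATTGC", "TTGCAATCC"))  # True
```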
11
DNA Sequencing: Sanger Method

Generate all A-, C-, G-, and T-terminated prefixes of the
sequence, by a polymerase reaction with
the corresponding terminating bases.
Run in four different gel lanes.
Reconstruct the sequence from the information on the
lengths of all A-, C-, G-, and T-terminated prefixes.
The need for 4 different reactions is avoided by
using differentially dye-labeled terminating
bases.

From Zohar Yakhini
12
Central Dogma of Molecular Biology (reminder)
Cells express different subsets of the genes in
different tissues and under different conditions
[Figure: central dogma diagram, starting from Gene (DNA)]
From Zohar Yakhini
13
Expression Profiling on MicroArrays

Differentially label the query sample and the
control (1-3).
Mix and hybridize to an array.
Analyze the image to obtain expression levels
information.

From Zohar Yakhini
14
Microarrays: 2 Types of Fabrication

cDNA Arrays: Deposition of DNA fragments
- Deposition of PCR-amplified cDNA clones
- Printing of already synthesized oligonucleotides
Oligo Arrays: In situ synthesis
- Photolithography
- Ink Jet Printing
- Electrochemical Synthesis

From Steve Hookway's lecture and Sorin Draghici's
book 'Data Analysis Tools for DNA Microarrays'
15
cDNA Microarrays vs. Oligonucleotide Probes and
Cost

cDNA Arrays               | Oligonucleotide Arrays
Long sequences            | Short sequences
Spot unknown sequences    | Spot known sequences
More variability          | More reliable data
Arrays cheaper            | Arrays typically more expensive

From Steve Hookway's lecture and Sorin Draghici's
book 'Data Analysis Tools for DNA Microarrays'
16
Photolithography (Affymetrix)
Photodeprotection

Similar to process used to generate VLSI circuits
Photolithographic masks are used to add each base
If base is present, there will be a hole in the
corresponding mask
Can create high density arrays, but sequence
length is limited

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
17
Photolithography (Affymetrix)
From Zohar Yakhini
18
Ink Jet Printing

Four cartridges are loaded with the four
nucleotides A, G, C, T
As the printer head moves across the array, the
nucleotides are deposited in pixels where they
are needed.
This way (many copies of) a 20-60 base long oligo
is deposited in each pixel.

From Steve Hookway's lecture and Sorin Draghici's
book 'Data Analysis Tools for DNA Microarrays'
19
Ink Jet Printing (Agilent)
The array is a stack of images in the colors A,
C, G, T.

From Zohar Yakhini
20
Inkjet Printed Microarrays
Inkjet head, squirting phosphoramidites
From Zohar Yakhini
21
Electrochemical Synthesis

Electrodes are embedded in the substrate to
manage individual reaction sites
Electrodes are activated in necessary positions
in a predetermined sequence that allows the
sequences to be constructed base by base
Solutions containing specific bases are washed
over the substrate while the electrodes are
activated

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
22
Preparation of Samples

Use oligo(dT) on a separation column to extract
mRNA from total cell populations.
Use oligo(dT)-initiated polymerase to reverse
transcribe RNA into fluorescence-labeled cDNA.
RNA is unstable because of RNA-digesting enzymes
in the environment.
Alternatively, use random priming for this
purpose, generating a population of transcript
subsequences.

From Zohar Yakhini
23
Expression Profiling on MicroArrays

Differentially label the query sample and the
control (1-3).
Mix and hybridize to an array.
Analyze the image to obtain expression levels
information.

From Zohar Yakhini
24
Expression Profiling: a FLASH Demo
URL: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
25
Expression Profiling: Probe Design Issues

Probe specificity and sensitivity.
Special designs for splice variations or other
custom purposes.
Flat thermodynamics.
Generic and universal systems

From Zohar Yakhini
26
Hybridization Probes

Sensitivity: Strong interaction between the probe
and its intended target, under the assay's
conditions. How much target is needed for the
reaction to be detectable or quantifiable?
Specificity: No potential cross hybridization.

From Zohar Yakhini
27
Specificity

Symbolic specificity
Statistical protection in the unknown part of the
genome.

Methods, software and application in
collaboration with Peter Webb, Doron Lipson.
From Zohar Yakhini
28
Reading Results: Color Coding
Campbell & Heyer, 2003

Numeric tables are difficult to read
Data is presented with a color scale
Coding scheme:
Green = repressed (less mRNA) gene in experiment
Red = induced (more mRNA) gene in experiment
Black = no change (1:1 ratio)
Or:
Green = control condition (e.g. aerobic)
Red = experimental condition (e.g. anaerobic)
We usually use the ratio

29
Thermal Ink Jet Arrays, by Agilent Technologies
In-Situ synthesized oligonucleotide array. 25-60
mers.
cDNA array, Inkjet deposition
30
Application of Microarrays

We only know the function of about 30% of the
30,000 genes in the human genome
Gene exploration
Functional Genomics
First among many high
throughput genomic devices

http://www.gene-chips.com/sample1.html
From Steve Hookway's lecture and Sorin Draghici's
book 'Data Analysis Tools for DNA Microarrays'
31
A Data Mining Problem

On a given microarray, we test on the order of
10k elements at one time
The number of microarrays used in a typical
experiment is no more than 100.
Insufficient sampling.
Data is obtained faster than it can be processed.
High noise.
Algorithmic approaches to work through this large
data set and make sense of the data are desired.

32
Informative Genes in a Two-Class Experiment

Differentially expressed in the two classes.
Identifying (statistically significant)
informative genes
- Provides biological insight
- Indicate promising research directions
- Reduce data dimensionality
- Diagnostic assay

From Zohar Yakhini
33
Scoring Genes
Expression pattern and pathological diagnosis
information (annotation), for a single gene
[Figure: expression values a1, a2, ..., a15 of one gene, each annotated with a class label (+ or -)]
Permute the annotation by
sorting the expression pattern (ascending, say).
From Zohar Yakhini
34
Separation Score

Compute a Gaussian fit for each class -> $(\mu_1, \sigma_1)$, $(\mu_2, \sigma_2)$.
The Separation Score is $(\mu_1 - \mu_2)/(\sigma_1 + \sigma_2)$
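A minimal sketch of the separation score for one gene, assuming the Gaussian fit is simply the sample mean and sample standard deviation of each class. The function name and the toy expression values are illustrative only; the signed form follows the slide, and taking the absolute value is a common variant.

```python
from statistics import mean, stdev


def separation_score(class1, class2):
    """(mu1 - mu2) / (sigma1 + sigma2) for one gene's expression values."""
    mu1, mu2 = mean(class1), mean(class2)
    sigma1, sigma2 = stdev(class1), stdev(class2)  # sample standard deviations
    return (mu1 - mu2) / (sigma1 + sigma2)


# Hypothetical expression levels of one gene in two classes of samples.
tumors = [8.1, 7.9, 8.4, 8.0]
normals = [5.0, 5.3, 4.9, 5.2]
print(separation_score(tumors, normals))  # large positive value: well separated
```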

35
Threshold Error Rate (TNoM) Score
Find the threshold that best separates tumors
from normals, count the number of errors
committed there.
[Figure: Ex. 1, class labels (+/-) listed in sorted expression order]
From Zohar Yakhini
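The threshold error count can be sketched as follows. This is an illustrative implementation, not the one used in the lecture: it assumes distinct expression values, labels given as "+" and "-", and it tries both orientations of the decision rule at every threshold.

```python
def tnom_score(values, labels):
    """Fewest misclassifications achievable by any single threshold."""
    sorted_labels = [lab for _, lab in sorted(zip(values, labels))]
    n_plus = sorted_labels.count("+")
    n_minus = len(sorted_labels) - n_plus
    best = min(n_plus, n_minus)  # threshold below every sample
    plus_left = minus_left = 0
    for lab in sorted_labels:
        if lab == "+":
            plus_left += 1
        else:
            minus_left += 1
        # Rule A: call everything left of the threshold '-', right of it '+'.
        errors_a = plus_left + (n_minus - minus_left)
        # Rule B: call everything left of the threshold '+', right of it '-'.
        errors_b = minus_left + (n_plus - plus_left)
        best = min(best, errors_a, errors_b)
    return best


print(tnom_score([1, 2, 3, 4, 5, 6, 7], ["-", "-", "+", "-", "+", "+", "+"]))  # 1
```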
36
p-Values

Relevance scores are more useful when we can
compute their significance
p-value: The probability of finding a gene with a
given score if the labeling is random
p-Values allow for higher level statistical
assessment of data quality.
p-Values provide a uniform platform for comparing
relevance, across data sets.
p-Values enable class discovery

From Zohar Yakhini
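One way to estimate such a p-value is to shuffle the labels many times and count how often a random labeling scores at least as well. The sketch below assumes that approach with an absolute difference of class means as a stand-in relevance score; the function names, the number of permutations, and the toy data are all illustrative.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Stand-in relevance score: absolute difference of the class means."""
    plus = [v for v, lab in zip(values, labels) if lab == "+"]
    minus = [v for v, lab in zip(values, labels) if lab == "-"]
    return abs(mean(plus) - mean(minus))


def permutation_p_value(values, labels, score=mean_difference, n_perm=2000, seed=0):
    """Fraction of random relabelings scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = score(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
labels = ["-", "-", "-", "-", "+", "+", "+", "+"]
p = permutation_p_value(values, labels)
print(p)  # small: a random labeling rarely separates the classes this well
```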
37
BRCA1 Differential Expression
Genes over-expressed in BRCA1 wildtype
Genes over-expressed in BRCA1 mutants
Collaboration with NIH, NEJM 2001
Sporadic sample s14321 with BRCA1-mutant
expression profile
BRCA1 Wildtype
BRCA1 mutants
From Zohar Yakhini
38
Data Analysis: Leave One Out Cross Validation
(LOOCV)

Repeat, for each tissue (tumor/normal)
Hide the label of the test tissue
Diagnose the test tissue based on the remaining
data
Compare the diagnosis to the hidden label

From Zohar Yakhini
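The LOOCV loop can be sketched as below. The classifier is a hypothetical one-dimensional nearest-class-mean rule chosen only to keep the example short; any diagnosis procedure could be passed in its place, and the data are made up.

```python
from statistics import mean


def nearest_mean_predict(train_x, train_y, x):
    """Hypothetical classifier: choose the class whose mean is closest to x."""
    class_means = {
        label: mean(v for v, lab in zip(train_x, train_y) if lab == label)
        for label in set(train_y)
    }
    return min(class_means, key=lambda label: abs(x - class_means[label]))


def loocv_accuracy(xs, ys, predict=nearest_mean_predict):
    """Hide each sample's label in turn and diagnose it from the rest."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


xs = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9]
ys = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
print(loocv_accuracy(xs, ys))  # 1.0
```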
39
BRCA1 LOOCV Results
From Zohar Yakhini
40
Lung Cancer: Informative Genes
Data from Naftali Kaminski's lab, at Sheba.

24 tumors (various types and origins)
10 normals (normal edges and normal lung pools)

From Zohar Yakhini
41
And Now: Global Analysis of Gene Expression Data

First (but not least): Clustering,
either of genes or of experiments

42
Example data: fold change (ratios)
What is the pattern?
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 1 8 12 16 12 8
Gene D 1 3 4 4 3 2
Gene E 1 4 8 8 8 8
Gene F 1 1 1 0.25 0.25 0.1
Gene G 1 2 3 4 3 2
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene I 1 4 8 4 1 0.5
Gene J 1 2 1 2 1 2
Gene K 1 1 1 1 3 3
Gene L 1 2 3 4 3 2
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Campbell & Heyer, 2003
43
Example data 2: log2 of the fold changes
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 0 3 3.58 4 3.58 3
Gene D 0 1.58 2 2 1.58 1
Gene E 0 2 3 3 3 3
Gene F 0 0 0 -2 -2 -3.32
Gene G 0 1 1.58 2 1.58 1
Gene H 0 -1 -1.60 -2 -1.60 -1
Gene I 0 2 3 2 0 -1
Gene J 0 1 0 1 0 1
Gene K 0 0 0 0 1.58 1.58
Gene L 0 1 1.58 2 1.58 1
Gene M 0 -1.60 -2 -2 -1.60 -1
Gene N 0 -3 -3.59 -4 -3.59 -3
Campbell & Heyer, 2003
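The second table is the base-2 logarithm of the fold changes in the first one, so a doubling becomes +1 and a halving becomes -1. A short sketch of that transform, with a hypothetical helper name `log2_profile` and rounding to two decimals:

```python
import math


def log2_profile(ratios):
    """Convert fold-change ratios to log2 values, rounded to 2 decimals."""
    return [round(math.log2(r), 2) for r in ratios]


gene_c = [1, 8, 12, 16, 12, 8]
gene_f = [1, 1, 1, 0.25, 0.25, 0.1]
print(log2_profile(gene_c))  # [0.0, 3.0, 3.58, 4.0, 3.58, 3.0]
print(log2_profile(gene_f))  # [0.0, 0.0, 0.0, -2.0, -2.0, -3.32]
```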
44
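Pearson Correlation Coefficient, r (values
in the interval [-1, 1])

Gene expression over d experiments is a vector in
$\mathbb{R}^d$, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)
Given two vectors X and Y that contain N
elements, we calculate r as follows:
$$r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$$

Cho & Won, 2003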
45
Example: Pearson Correlation Coefficient, r

X = Gene C = (0, 3.00, 3.58, 4, 3.58, 3)
Y = Gene D = (0, 1.58, 2.00, 2, 1.58, 1)
$\sum XY = (0)(0) + (3)(1.58) + (3.58)(2) + (4)(2) + (3.58)(1.58) + (3)(1) = 28.5564$
$\sum X = 3 + 3.58 + 4 + 3.58 + 3 = 17.16$
$\sum X^2 = 3^2 + 3.58^2 + 4^2 + 3.58^2 + 3^2 = 59.6328$
$\sum Y = 1.58 + 2 + 2 + 1.58 + 1 = 8.16$
$\sum Y^2 = 1.58^2 + 2^2 + 2^2 + 1.58^2 + 1^2 = 13.9928$
$N = 6$
$\sum XY - \sum X \sum Y / N = 28.5564 - (17.16)(8.16)/6 = 5.2188$
$\sum X^2 - (\sum X)^2 / N = 59.6328 - (17.16)^2/6 = 10.5552$
$\sum Y^2 - (\sum Y)^2 / N = 13.9928 - (8.16)^2/6 = 2.8952$
$r = 5.2188 / \sqrt{(10.5552)(2.8952)} = 0.944$
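The same arithmetic can be checked with a short Python sketch. The function name `pearson_r` is illustrative; it mirrors the sums used in the worked example rather than calling a statistics library.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation computed from the raw sums of the worked example."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


gene_c = [0, 3.00, 3.58, 4, 3.58, 3]
gene_d = [0, 1.58, 2.00, 2, 1.58, 1]
print(round(pearson_r(gene_c, gene_d), 3))  # 0.944
```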

46
Example data Pearson correlation coefficients
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 1 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.94 1 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E 0.96 0.84 1 -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.40 -0.10 -0.57 1 -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.95 -0.94 -0.89 0.35 -1 1 -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0.41 0.68 0.21 0.60 0.48 -0.48 1 0 -0.75 0.48 -0.68 -0.41
Gene J 0.36 0.24 0.30 -0.43 0.22 -0.21 0 1 0 0.22 -0.24 -0.36
Gene K 0.23 -0.07 0.43 -0.79 0.11 -0.11 -0.75 0 1 0.11 0.07 -0.23
Gene L 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene M -0.94 -1 -0.84 0.10 -0.94 0.94 -0.68 -0.24 0.07 -0.94 1 0.94
Gene N -1 -0.94 -0.96 0.40 -0.95 0.95 -0.41 -0.36 -0.23 -0.95 0.94 1
Campbell & Heyer, 2003
47
Example Reorganization of data
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene K 1 1 1 1 3 3
Gene J 1 2 1 2 1 2
Gene E 1 4 8 8 8 8
Gene C 1 8 12 16 12 8
Gene L 1 2 3 4 3 2
Gene G 1 2 3 4 3 2
Gene D 1 3 4 4 3 2
Gene I 1 4 8 4 1 0.5
Gene F 1 1 1 0.25 0.25 0.1
Campbell & Heyer, 2003
48
Spearman Rank Order Coefficient

Replace each entry $x_i$ by its rank in vector x.
Then compute Pearson correlation coefficients of
rank vectors.
Example:
X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01)
Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1)
Ranks(X) = (1,2,4,6,5,3)
Ranks(Y) = (1,3,5,6,4,2)
Ties should be taken care of: (1) they are rare;
(2) randomize (small effect)
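A small sketch of the rank transform followed by Pearson correlation on the ranks, using the example vectors above. It assumes there are no ties, and the helper names `ranks`, `pearson_r` and `spearman_rho` are illustrative.

```python
from math import sqrt


def ranks(values):
    """Rank of each entry (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson_r(ranks(x), ranks(y))


x = [0, 3.00, 3.41, 4, 3.58, 3.01]
y = [0, 1.51, 2.00, 2.32, 1.58, 1]
print(ranks(x))            # [1, 2, 4, 6, 5, 3]
print(ranks(y))            # [1, 3, 5, 6, 4, 2]
print(spearman_rho(x, y))  # about 0.886
```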

49
Grouping and Reduction

Grouping: Partition items into groups. Items in
the same group should be similar.
Items in different groups should be
dissimilar.
Grouping may help discover patterns in the data.
Reduction: reduce the complexity of data by
removing redundant probes (genes).

50
Unsupervised Grouping: Clustering

Pattern discovery via clustering
similarly expressed genes together
Techniques most often used

k-Means Clustering
Hierarchical Clustering
Biclustering
Alternative Methods: Self Organizing Maps
(SOMs),
plaid models, singular value decomposition
(SVD),
order preserving submatrices (OPSM), ...

51
Clustering Overview

Different similarity measures in use
Pearson Correlation Coefficient
Cosine Coefficient
Euclidean Distance
Information Gain
Mutual Information
Signal to noise ratio
Simple Matching for Nominal

52
Clustering Overview (cont.)

Different Clustering Methods
Unsupervised
k-means Clustering
Hierarchical Clustering
Self-organizing map
Supervised
Support vector machine
Ensemble classifier
Data Mining

53
Clustering Limitations

Any data can be clustered, therefore we must be
careful what conclusions we draw from our results
Clustering is often randomized and can and will
produce different results for different runs on
same data

54
K-means Clustering

Given a set of m data points in
n-dimensional space and an integer k.
We want to find the set of k centers in
n-dimensional space that minimizes the
mean squared Euclidean distance from each data
point to its nearest center.
No exact polynomial-time algorithms are
known for this problem (no wonder: it is NP-hard!).

'A Local Search Approximation Algorithm for
k-Means Clustering' by Kanungo et al.
55
K-means Heuristic (Lloyd's Algorithm)

Has been shown to converge to a locally optimal
solution
But can converge to a solution arbitrarily bad
compared to the optimal solution

[Figure: data points, optimal centers, and heuristic centers for k = 3]

'K-means-type algorithms: A generalized
convergence theorem and characterization of local
optimality' by Selim and Ismail
'A Local Search Approximation Algorithm for
k-Means Clustering' by Kanungo et al.

56
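Euclidean Distance
For two points p and q in n dimensions: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
Now to find the distance between two points, say
the origin and the point (3,4): $\sqrt{3^2 + 4^2} = 5$
Simple and fast! Remember this when we consider
the complexity!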
57
Finding a Centroid

We use the following equation to find the
n-dimensional centroid point (center of mass) amid
k (n-dimensional) points $x_1, \dots, x_k$:
$$c = \frac{1}{k} \sum_{i=1}^{k} x_i$$

Example: Let's find the midpoint between three 2D
points, say (2,4) (5,2) (8,9): $c = ((2+5+8)/3, (4+2+9)/3) = (5, 5)$
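Both building blocks are one-liners in Python. The sketch below uses `math.dist` from the standard library (Python 3.8 or later) for the Euclidean distance; the `centroid` helper is an illustrative name.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def centroid(points):
    """Coordinate-wise mean (center of mass) of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


print(dist((0, 0), (3, 4)))                # 5.0
print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```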
58
K-means Iterative Heuristic

1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or other
distance metric)
3. Calculate new center points for each cluster,
using only points within the cluster
4. Re-cluster all data using the new center points
(this step could cause some data points to be
placed in a different cluster)
5. Repeat steps 3 and 4 until no data points are moved
from one cluster to another (stabilization), or
until some other convergence criterion is met

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
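The iterative heuristic can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the book's implementation: initial centers are drawn at random from the data points, a fixed seed makes the run repeatable, an empty cluster simply keeps its old center, and the names `kmeans`, `max_iter` and `seed` are illustrative.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (centers, cluster index of each point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # step 5: nothing moved, so stop
            break
        assignment = new_assignment
        # Step 3: recompute each center from the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment


points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centers, labels = kmeans(points, k=2)
print(labels)  # the first three points share one label, the last three the other
```

Because the starting centers are random, different seeds can give different labelings on less clearly separated data.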
59
An example with 2 clusters

We Pick 2 centers at random
We cluster our data around these center points

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
60
K-means example with k = 2

We recalculate centers based on our current
clusters

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
61
K-means example with k = 2

We re-cluster our data around our new center
points

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
62
K-means example with k = 2
5. We repeat the last two steps until no more
data points are moved into a different cluster
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
63
Choosing k

Run algorithm on data with several different
values of k
Use advance knowledge about the characteristics
of your test
(e.g. Cancerous vs Non-Cancerous Tissues,
in case the experiments are being clustered)

64
Cluster Quality

Since any data can be clustered, how do we know
our clusters are meaningful?
The size (diameter) of the cluster
vs. the inter-cluster distance
Distance between the members of a cluster and the
cluster's center
Diameter of the smallest sphere containing the
cluster

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
65
Cluster Quality Continued
[Figure: two cases, each with cluster diameter = 5; the distance to the nearest cluster is 5 in one case and 20 in the other]
Quality of cluster assessed by ratio of distance
to nearest cluster and cluster diameter
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
66
Cluster Quality Continued
Quality can be assessed simply by looking at the
diameter of a cluster (alone????)
A cluster can be formed by the heuristic even
when there is no similarity between clustered
patterns. This occurs because the algorithm
forces k clusters to be created.
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
67
Characteristics of k-means Clustering

The random selection of initial center points
creates the following properties
Non-Determinism
May produce clusters without patterns
One solution is to choose the centers randomly
from existing patterns

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
68
Heuristic's Complexity

Linear in the number of data points, N
Can be shown to have run time cN, where c does
not depend on N, but rather on the number of
clusters, k
(each iteration compares every point with every center,
so one iteration costs O(Nkn) in dimension n)
=> the heuristic is efficient

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
69
Hierarchical Clustering

a different clustering paradigm

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
70
Hierarchical Clustering (cont.)
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0 -0.75 0.48 -0.68 -0.41
Gene J 0 0.22 -0.24 -0.36
Gene K 0.11 0.07 -0.23
Gene L -0.94 -0.95
Gene M 0.94
Gene N
Campbell & Heyer, 2003
71
Hierarchical Clustering (cont.)
Gene C Gene D Gene E Gene F Gene G
Gene C 0.94 0.96 -0.40 0.95
Gene D 0.84 -0.10 0.94
Gene E -0.57 0.89
Gene F -0.35
Gene G
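The most similar pair, C and E (0.96), is merged into cluster 1:
1 Gene D Gene F Gene G
1 0.89 -0.485 0.92
Gene D -0.10 0.94
Gene F -0.35
Gene G

Average similarity of cluster 1 to
Gene D: (0.94 + 0.84)/2 = 0.89
Gene F: (-0.40 + (-0.57))/2 = -0.485
Gene G: (0.95 + 0.89)/2 = 0.92

[Dendrogram: cluster 1 joins C and E; D, F, and G are still single]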
72
Hierarchical Clustering (cont.)
1 Gene D Gene F Gene G
1 0.89 -0.485 0.92
Gene D -0.10 0.94
Gene F -0.35
Gene G
[Dendrogram: cluster 1 = (C, E); cluster 2 = (G, D), the most similar remaining pair (0.94); F is still single]
73
Hierarchical Clustering (cont.)
1 2 Gene F
1 0.905 -0.485
2 -0.225
Gene F
[Dendrogram: cluster 3 joins clusters 1 and 2 (0.905); F is still single]
74
Hierarchical Clustering (cont.)
3 Gene F
3 -0.355
Gene F
[Dendrogram: cluster 4 joins cluster 3 and F (-0.355)]
75
Hierarchical Clustering (cont.)
Algorithm looks familiar?
Remember Neighbor-Joining!
[Dendrogram: 4 = (3, F); 3 = (1, 2); 1 = (C, E); 2 = (G, D)]
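The merging procedure walked through on the preceding slides can be sketched as below. It assumes, as the slides do, that the similarity of a merged cluster to any other cluster is the plain average of its two parts' similarities; the function name `hierarchical_cluster` and the nested-tuple representation of clusters are illustrative choices.

```python
from itertools import combinations


def hierarchical_cluster(names, similarity):
    """Repeatedly merge the most similar pair; returns the list of merges."""
    sim = {frozenset(pair): s for pair, s in similarity.items()}
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: sim[frozenset(p)])
        merges.append((a, b, sim[frozenset((a, b))]))
        merged = (a, b)  # a merged cluster is a nested tuple of its parts
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            # Similarity to the new cluster: average over the two merged parts.
            sim[frozenset((merged, c))] = (
                sim[frozenset((a, c))] + sim[frozenset((b, c))]
            ) / 2
        clusters = rest + [merged]
    return merges


similarity = {
    ("C", "D"): 0.94, ("C", "E"): 0.96, ("C", "F"): -0.40, ("C", "G"): 0.95,
    ("D", "E"): 0.84, ("D", "F"): -0.10, ("D", "G"): 0.94,
    ("E", "F"): -0.57, ("E", "G"): 0.89, ("F", "G"): -0.35,
}
merges = hierarchical_cluster(["C", "D", "E", "F", "G"], similarity)
for a, b, s in merges:
    print(a, b, round(s, 3))
# Merge similarities in order: 0.96, 0.94, 0.905, -0.355
```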
76
Clustering of entire yeast genome
Campbell & Heyer, 2003
77
Hierarchical Clustering: Yeast Gene Expression
Data
Eisen et al., 1998
78
A SOFM Example With Yeast
'Interpreting patterns of gene expression with
self-organizing maps: Methods and application to
hematopoietic differentiation' by Tamayo et al.
79
SOM Description

Each unit of the SOM has a weighted connection to
all inputs
As the algorithm progresses, neighboring units
are grouped by similarity

Output Layer
Input Layer
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
80
An Example Using Color
Each color in the map is associated with a weight
From http://davis.wpi.edu/matt/courses/soms/
81
Cluster Analysis of Microarray Expression Data
Matrices

Application of cluster analysis techniques in the
elucidation of gene expression data

82
Function of Genes
The features of a living organism are governed
principally by its genes. If we want to fully
understand living systems we must know the
function of each gene. Once we know a gene's
sequence we can design experiments to find its
function.
The Classical Approach of Assigning a Function to
a Gene
[Figure: delete Gene X and observe the organism]
Conclusion: Gene X = left eye gene.
However, this approach is too slow to handle all
the gene sequence information we have today
(HGSP).
83
Microarray Analysis
Microarray analysis allows the monitoring of the
activities of many genes over many different
conditions. Experiments are carried out on a
Physical Matrix like the one below
To facilitate computational analysis the physical
matrix which may contain 1000s of genes is
converted into a numerical matrix using image
analysis equipment.
Possible inference: If Gene X's activity
(expression) is affected by Condition Y (Extreme
Heat), then Gene X may be involved in protecting
the cellular components from extreme heat. Each
Gene has its corresponding Expression Profile for
a set of conditions. This Expression Profile may
be thought of as a feature profile for that gene
for that set of conditions (A condition feature
profile).
84
Cluster Analysis

Cluster Analysis is an unsupervised procedure
which involves grouping of objects based on their
similarity in feature space.
In the Gene Expression context Genes are grouped
based on the similarity of their Condition
feature profile.
Cluster analysis was first applied to gene
expression data from brewer's yeast
(Saccharomyces cerevisiae) by Eisen et al. (1998).

Clustering

Two general conclusions can be drawn from these
clusters
Genes clustered together may be related within a
biological module/system.
If there are genes of known function within a
cluster these may help to classify this
biological module/system.

85
From Data to Biological Hypothesis
Gene Expression Microarray
Cluster Set
Cluster C with four genes may represent System
C. Relating these genes aids in elucidation of
this System C.
Conditions (A-Z)
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7
System C
External Stimulus( Condition X)
Cell Membrane
Regulator Protein
86
Some Drawbacks of Clustering Biological Data

Clustering works well over small numbers of
conditions but a typical Microarray may have
hundreds of experimental conditions. A global
clustering may not offer sufficient resolution
with so many features.
As with other clustering applications, it may be
difficult to cluster noisy expression data.
Biological systems tend to be inter-related and
may share numerous factors (genes). Clustering
enforces partitions which may not accurately
represent these relationships.
Clustering Genes over all Conditions only finds
the strongest signals in the dataset as a whole.
More local signals within the data matrix may
be missed.

87
How do we better model more complex systems?

One technique that allows detection of all
signals in the data is biclustering.
Instead of clustering genes over all conditions
biclustering clusters genes with respect to
subsets of conditions.

This enables better representation of
88
Biclustering
Conditions: A B C D E F G H
Genes: Gene 1, Gene 2, ..., Gene 9
Clustering misses the local signal (conditions B, E, F;
genes 1, 4, 6, 7, 9) present over a subset of conditions.
Biclustering discovers local coherences over a
subset of conditions

Technique first described by J.A. Hartigan in
1972 and termed Direct Clustering.
First introduced to microarray expression data by
Cheng and Church (2000)

89
Approaches to Biclustering Microarray Gene
Expression

First applied to gene expression data by Cheng
and Church (2000).
Used a sub-matrix scoring technique to locate
biclusters.
Tanay et al. (2002)
Modelled the expression data on bipartite graphs
and used graph techniques to find complete
bipartite sub-graphs, i.e. biclusters.
Lazzeroni and Owen
Used matrix reordering to represent different
layers of signals (biclusters): plaid models
represent multiple signals within data.
Ben-Dor et al. (2002)
Biclusters depending on order relations (OPSM).

90
Bipartite Graph Modelling

First proposed in 'Discovering statistically
significant biclusters in gene expression data',
Tanay et al., Bioinformatics 2002

Within the graph modelling paradigm biclusters
are equivalent to complete bipartite
sub-graphs. Tanay and colleagues used
probabilistic models to determine the least
probable sub-graphs (those showing most order and
consequently most surprising) to identify
biclusters.
91
The Cheng and Church Approach
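The core element in this approach is the
development of a scoring to prioritise
sub-matrices. This scoring is based on the
concept of the residue of an entry in a
matrix. In the matrix (I,J) the residue score of
element $a_{ij}$ is given by
$$R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$
where $a_{iJ}$ is the mean of row i, $a_{Ij}$ is the mean of
column j, and $a_{IJ}$ is the mean of the whole matrix.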
92
The Cheng and Church Approach (2)
The mean squared residue score (H) for a matrix
(I,J) is then calculated:
$$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} R(a_{ij})^2$$
This global H score gives an indication of how
the data fits together within that matrix,
whether it has some coherence or is random.
A high H value signifies that the data is
uncorrelated: a matrix of equally spread
random values over the range [a,b] has an
expected H score of $(b-a)^2/12$. For the range [0,800],
H(I,J) = 53,333.
A low H score means that there is a correlation
in the matrix: a score of H(I,J) = 0 would
mean that the data in the matrix fluctuates in
unison, i.e. the sub-matrix is a bicluster.
93
Worked example of H score
Matrix (M): Avg. = 6.5
Row Avg.: 2, 5, 8, 11
Col Avg.: 5.4, 6.4, 7.4
R(1) = 1 - 2 - 5.4 + 6.5 = 0.1
R(2) = 2 - 2 - 6.4 + 6.5 = 0.1
...
R(12) = 12 - 11 - 7.4 + 6.5 = 0.1
H(M) = (0.01 x 12)/12 = 0.01
If 5 was replaced with 3 then the score would
change to H(M2) = 2.06
If the matrix was reshuffled randomly the score would be
around H(M3) = $(12-1)^2/12$ = 10.08
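A simple H score program might look like the sketch below. The matrix itself did not survive in this transcript, so the 4 x 3 matrix holding 1 to 12 row by row is an assumption inferred from the row averages (2, 5, 8, 11) and the overall average (6.5). Under that assumption the exact column averages are 5.5, 6.5 and 7.5, every residue is 0 and H(M) is exactly 0, so the 0.1 residues quoted above look like the effect of rounded column averages; with 5 replaced by 3 the mean squared residue comes out as 1/6, which differs from the 2.06 quoted.

```python
def mean_squared_residue(matrix):
    """Cheng-and-Church-style H score of a full matrix (list of rows)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(matrix):
        for j, a in enumerate(row):
            residue = a - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Assumed matrix: the values 1..12 laid out row by row.
m = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# Same matrix with the entry 5 replaced by 3.
m2 = [[1, 2, 3], [4, 3, 6], [7, 8, 9], [10, 11, 12]]

print(mean_squared_residue(m))   # 0.0: rows differ only by a constant shift
print(mean_squared_residue(m2))  # about 0.167: one entry breaks the pattern
```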
94
The Cheng and Church Approach Node Deletion
Biclustering Algorithm
In order to find all possible biclusters in an
Expression Matrix all sub-matrices must be tested
using the H score.
Node Deletion
In a node deletion algorithm all columns and rows
are tested for deletion. If removing a row or
column decreases the H score of the matrix then
it is removed.
This continues until it is not possible to
decrease the H score further. This low H score
coherent sub-matrix (bicluster) is then returned.
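The deletion loop described above can be sketched as a brute-force greedy search. This is an illustration of the idea on the slide, not Cheng and Church's published procedure: the stopping threshold `delta`, the minimum size of two rows and two columns, and the function names are assumptions made to keep the example small.

```python
def msr(matrix, rows, cols):
    """Mean squared residue of the sub-matrix given by row/column indices."""
    row_means = {r: sum(matrix[r][c] for c in cols) / len(cols) for r in rows}
    col_means = {c: sum(matrix[r][c] for r in rows) / len(rows) for c in cols}
    overall = sum(row_means.values()) / len(rows)
    total = sum(
        (matrix[r][c] - row_means[r] - col_means[c] + overall) ** 2
        for r in rows
        for c in cols
    )
    return total / (len(rows) * len(cols))


def node_deletion(matrix, delta):
    """Greedily drop the row or column whose removal lowers H the most."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    h = msr(matrix, rows, cols)
    while h > delta and (len(rows) > 2 or len(cols) > 2):
        candidates = []
        if len(rows) > 2:
            for r in rows:
                kept = [x for x in rows if x != r]
                candidates.append((msr(matrix, kept, cols), kept, cols))
        if len(cols) > 2:
            for c in cols:
                kept = [x for x in cols if x != c]
                candidates.append((msr(matrix, rows, kept), rows, kept))
        best_h, best_rows, best_cols = min(candidates, key=lambda t: t[0])
        if best_h >= h:  # no single deletion improves the score
            break
        h, rows, cols = best_h, best_rows, best_cols
    return rows, cols, h


# Three coherent rows plus one row (index 3) that does not follow the pattern.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [9, 1, 5]]
print(node_deletion(data, delta=0.1))  # ([0, 1, 2], [0, 1, 2], 0.0)
```

Recomputing H for every candidate keeps the code short but is slow on large matrices.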
95
The Cheng and Church Approach
Some results on lymphoma data (4026 x 96)
No. of genes, no. of conditions No. of genes, no. of conditions No. of genes, no. of conditions
4, 96 10, 29 11, 25
103, 25 127, 13 13, 21
10, 57 2, 96 25, 12
9, 51 3, 96 2, 96
96

Conclusions
High throughput Functional Genomics (Microarrays)
requires Data Mining Applications.
Biclustering resolves Expression Data more
effectively than single dimensional Cluster
Analysis.
Cheng and Church Approach offers good base for
future work.
Future Research/Questions
Implement a simple H score program to facilitate
study of the H score concept.
Are there other alternative scorings which would
better apply to gene expression data?
Have unbiclustered genes any significance?
Horizontally transferred genes?
Implement full scale biclustering program and
look at better adaptation to expression data sets
and the biological context.

97
References

'Basic microarray analysis: grouping and feature
reduction' by Soumya Raychaudhuri, Patrick D.
Sutphin, Jeffery T. Chang and Russ B. Altman,
Trends in Biotechnology Vol. 19 No. 5, May 2001
'Self Organizing Maps', Tom Germano,
http://davis.wpi.edu/matt/courses/soms
'Data Analysis Tools for DNA Microarrays' by
Sorin Draghici, Chapman & Hall/CRC 2003
'Self-Organizing-Feature-Maps versus Statistical
Clustering Methods: A Benchmark' by A. Ultsch, C.
Vetter, FG Neuroinformatik & Künstliche
Intelligenz, Research Report 0994

98
References

'Interpreting patterns of gene expression with
self-organizing maps: Methods and application to
hematopoietic differentiation' by Tamayo et al.
'A Local Search Approximation Algorithm for
k-Means Clustering' by Kanungo et al.
'K-means-type algorithms: A generalized
convergence theorem and characterization of local
optimality' by Selim and Ismail
