Title: DNA Chips and Their Analysis Comp. Genomics: Lecture 13 based on many sources, primarily Zohar Yakhini
2DNA Microarrays: Basics
- What are they.
- Types of arrays (cDNA arrays, oligo arrays).
- What is measured using DNA microarrays.
- How are the measurements done?
3DNA Microarrays: Computational Questions
- Design of arrays.
- Techniques for analyzing experiments.
- Detecting differential expression.
- Similar expression: Clustering.
- Other analysis techniques (many, many).
- Machine learning techniques, and applications for advanced diagnosis.
4What is a DNA Microarray (I)
- A surface (nylon, glass, or plastic).
- Containing hundreds to thousands of pixels.
- Each pixel has copies of a sequence of single-stranded DNA (ssDNA).
- Each such sequence is called a probe.
5What is a DNA Microarray (II)
- An experiment with 500-10k elements.
- A way to concurrently explore the function of multiple genes.
- A snapshot of the expression level of 500-10k genes under given test conditions.
6Some Microarray Terminology
- Probe: ssDNA printed on the solid substrate (nylon or glass). These are short substrings of the genes we are going to be testing.
- Target: cDNA which has been labeled and is to be washed over the probe.
7Back to Basics Watson and Crick
James Watson and Francis Crick discovered, in
1953, the double helix structure of DNA.
From Zohar Yakhini
8Watson-Crick Complementarity
A binds to T; C binds to G.
From Zohar Yakhini
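As a rough illustration of Watson-Crick complementarity, the sketch below computes the reverse complement of a probe and checks whether a target contains the perfectly complementary stretch. The function names and the example sequences are made up for illustration, and the perfect-match test is an idealization: real hybridization tolerates some mismatches and depends on thermodynamics.

```python
# Watson-Crick base pairing: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridizes(probe, target):
    """Idealized test: the target contains the exact reverse complement of the probe."""
    return reverse_complement(probe) in target


probe = "ACCGT"                          # hypothetical probe sequence
print(reverse_complement(probe))         # ACGGT
print(hybridizes(probe, "TTACGGTAA"))    # True
```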
9Array Based Hybridization Assays (DNA Chips)
- Array of probes
- Thousands to millions of different probe
sequences per array.
Unknown sequence or mixture (target). Many copies.
From Zohar Yakhini
10Array Based Hyb Assays
- Target hybridizes to Watson-Crick complementary probes only
- Therefore the fluorescence pattern is
indicative of the target sequence.
From Zohar Yakhini
11DNA Sequencing Sanger Method
- Generate all A-, C-, G-, and T-terminated prefixes of the sequence, by a polymerase reaction with the corresponding terminating bases.
- Run in four different gel lanes.
- Reconstruct the sequence from the information on the lengths of all A-, C-, G-, and T-terminated prefixes.
- The need for 4 different reactions is avoided by using differentially dye-labeled terminating bases.
From Zohar Yakhini
12Central Dogma of Molecular Biology(reminder)
Cells express different subsets of the genes in different tissues and under different conditions.
Gene (DNA)
From Zohar Yakhini
13Expression Profiling on MicroArrays
- Differentially label the query sample and the control (1-3).
- Mix and hybridize to an array.
- Analyze the image to obtain expression level information.
From Zohar Yakhini
14Microarray 2 Types of Fabrication
- cDNA Arrays: Deposition of DNA fragments
  - Deposition of PCR-amplified cDNA clones
  - Printing of already synthesized oligonucleotides
- Oligo Arrays: In situ synthesis
  - Photolithography
  - Ink Jet Printing
  - Electrochemical Synthesis
From Steve Hookway's lecture and Sorin Draghici's book Data Analysis Tools for DNA Microarrays
15cDNA Microarrays vs. Oligonucleotide Probes and
Cost
- cDNA Arrays: long sequences; spot unknown sequences; more variability; arrays cheaper.
- Oligonucleotide Arrays: short sequences; spot known sequences; more reliable data; arrays typically more expensive.
From Steve Hookway's lecture and Sorin Draghici's book Data Analysis Tools for DNA Microarrays
16Photolithography (Affymetrix)
Photodeprotection
- Similar to process used to generate VLSI circuits
- Photolithographic masks are used to add each base
- If a base is present, there will be a hole in the corresponding mask
- Can create high density arrays, but sequence length is limited
[Figure labels: mask, C]
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
17Photolithography (Affymetrix)
From Zohar Yakhini
18Ink Jet Printing
- Four cartridges are loaded with the four nucleotides A, G, C, T
- As the printer head moves across the array, the nucleotides are deposited in pixels where they are needed.
- This way (many copies of) a 20-60 base long oligo is deposited in each pixel.
From Steve Hookway's lecture and Sorin Draghici's book Data Analysis Tools for DNA Microarrays
19Ink Jet Printing (Agilent)
The array is a stack of images in the colors A,
C, G, T.
From Zohar Yakhini
20Inkjet Printed Microarrays
Inkjet head, squirting phosphoramidites
From Zohar Yakhini
21Electrochemical Synthesis
- Electrodes are embedded in the substrate to manage individual reaction sites
- Electrodes are activated in necessary positions in a predetermined sequence that allows the sequences to be constructed base by base
- Solutions containing specific bases are washed over the substrate while the electrodes are activated
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
22Preparation of Samples
- Use oligo(dT) on a separation column to extract mRNA from total cell populations.
- Use oligo(dT)-initiated polymerase to reverse transcribe RNA into fluorescence-labeled cDNA. RNA is unstable because of environmental RNA-digesting enzymes.
- Alternatively, use random priming for this purpose, generating a population of transcript subsequences.
From Zohar Yakhini
23Expression Profiling on MicroArrays
- Differentially label the query sample and the control (1-3).
- Mix and hybridize to an array.
- Analyze the image to obtain expression level information.
From Zohar Yakhini
24Expression Profiling: a FLASH Demo
URL: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
25Expression Profiling Probe Design Issues
- Probe specificity and sensitivity.
- Special designs for splice variations or other custom purposes.
- Flat thermodynamics.
- Generic and universal systems.
From Zohar Yakhini
26Hybridization Probes
- Sensitivity: Strong interaction between the probe and its intended target, under the assay's conditions. How much target is needed for the reaction to be detectable or quantifiable?
- Specificity: No potential cross hybridization.
From Zohar Yakhini
27Specificity
- Symbolic specificity
- Statistical protection in the unknown part of the
genome.
Methods, software and application in
collaboration with Peter Webb, Doron Lipson.
From Zohar Yakhini
28Reading Results Color Coding
Campbell Heyer, 2003
- Numeric tables are difficult to read
- Data is presented with a color scale
- Coding scheme:
  - Green = repressed (less mRNA) gene in experiment
  - Red = induced (more mRNA) gene in experiment
  - Black = no change (1:1 ratio)
- Or:
  - Green = control condition (e.g. aerobic)
  - Red = experimental condition (e.g. anaerobic)
- We usually use the ratio
29Thermal Ink Jet Arrays, by Agilent Technologies
In-Situ synthesized oligonucleotide array. 25-60
mers.
cDNA array, Inkjet deposition
30Application of Microarrays
- We only know the function of about 30% of the 30,000 genes in the Human Genome
- Gene exploration
- Functional Genomics
- First among many high-throughput genomic devices
http://www.gene-chips.com/sample1.html
From Steve Hookway's lecture and Sorin Draghici's book Data Analysis Tools for DNA Microarrays
31A Data Mining Problem
- On a given microarray, we test on the order of 10k elements at one time
- Number of microarrays used in a typical experiment is no more than 100.
- Insufficient sampling.
- Data is obtained faster than it can be processed.
- High noise.
- Algorithmic approaches to work through this large
data set and make sense of the data are desired.
32Informative Genes in a Two-Class Experiment
- Differentially expressed in the two classes.
- Identifying (statistically significant) informative genes:
  - Provides biological insight
  - Indicates promising research directions
  - Reduces data dimensionality
  - Diagnostic assay
From Zohar Yakhini
33Scoring Genes
Expression pattern and pathological diagnosis information (annotation), for a single gene:
[Figure: expression values a1, a2, ..., a15, each with its class label]
Permute the annotation by sorting the expression pattern (ascending, say).
From Zohar Yakhini
34Separation Score
- Compute a Gaussian fit for each class: $(\mu_1, \sigma_1)$, $(\mu_2, \sigma_2)$.
- The Separation Score is $(\mu_1 - \mu_2)/(\sigma_1 + \sigma_2)$.
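A minimal sketch of the separation score for one gene, assuming the Gaussian fit is simply the sample mean and sample standard deviation of each class. The function name and the expression values are hypothetical. The score is signed as in the formula above; in practice its absolute value is often used for ranking.

```python
from statistics import mean, stdev


def separation_score(class1, class2):
    """(mu1 - mu2) / (sigma1 + sigma2) for the expression values of one gene."""
    mu1, mu2 = mean(class1), mean(class2)
    sigma1, sigma2 = stdev(class1), stdev(class2)  # sample standard deviations
    return (mu1 - mu2) / (sigma1 + sigma2)


# Hypothetical expression levels of one gene in tumor and normal tissues.
tumor = [5.1, 4.8, 5.5, 5.0]
normal = [2.0, 2.4, 1.9, 2.2]
print(separation_score(tumor, normal))
```

A large absolute score means the two class distributions are far apart relative to their spread.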
35Threshold Error Rate (TNoM) Score
Find the threshold that best separates tumors
from normals, count the number of errors
committed there.
Example 1: [figure: class labels ordered by expression value]
From Zohar Yakhini
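The threshold score can be sketched as follows: sort the samples by expression, try every split point in both orientations, and keep the smallest number of misclassified samples. The function name and the '+'/'-' label encoding are assumptions for illustration, and the sketch assumes distinct expression values (ties would need extra care).

```python
def tnom_score(values, labels):
    """Minimum number of errors over all thresholds; labels are '+' or '-'."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    n_plus = ordered.count("+")
    n_minus = len(ordered) - n_plus
    # Threshold outside the data range: every sample falls on one side.
    best = min(n_plus, n_minus)
    plus_left = minus_left = 0
    for label in ordered:
        if label == "+":
            plus_left += 1
        else:
            minus_left += 1
        # Orientation A: left of the threshold is called '-', right is called '+'.
        errors_a = plus_left + (n_minus - minus_left)
        # Orientation B: left is called '+', right is called '-'.
        errors_b = minus_left + (n_plus - plus_left)
        best = min(best, errors_a, errors_b)
    return best


# Hypothetical gene: one normal sample sits among the tumors.
print(tnom_score([1, 2, 3, 4, 5, 6], ["-", "-", "+", "-", "+", "+"]))  # 1
```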
36p-Values
- Relevance scores are more useful when we can compute their significance
- p-value: the probability of finding a gene with a given score (or better) if the labeling is random
- p-Values allow for higher level statistical assessment of data quality.
- p-Values provide a uniform platform for comparing relevance, across data sets.
- p-Values enable class discovery
From Zohar Yakhini
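One generic way to estimate such a p-value is a permutation test: shuffle the class labels many times and count how often the shuffled score is at least as good as the observed one. This is a sketch under assumptions, not necessarily the method used in the lecture: the score (absolute difference of class means), the function names, and the number of permutations are all illustrative choices.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Illustrative relevance score: |mean of '+' samples - mean of '-' samples|."""
    plus = [v for v, lab in zip(values, labels) if lab == "+"]
    minus = [v for v, lab in zip(values, labels) if lab == "-"]
    return abs(mean(plus) - mean(minus))


def permutation_p_value(values, labels, score=mean_difference, n_perm=2000, seed=0):
    """Estimate P(score >= observed) when the labeling is random."""
    rng = random.Random(seed)
    observed = score(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(values, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


values = [1, 2, 3, 4, 10, 11, 12, 13]            # hypothetical expression levels
labels = ["-", "-", "-", "-", "+", "+", "+", "+"]
print(permutation_p_value(values, labels))
```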
37BRCA1 Differential Expression
Genes over-expressed in BRCA1 wildtype
Genes over-expressed in BRCA1 mutants
Collaboration with NIH; NEJM 2001
Sporadic sample s14321 with BRCA1-mutant expression profile
BRCA1 Wildtype
BRCA1 mutants
From Zohar Yakhini
38Data Analysis Leave One Out Cross Validation
(LOOCV)
- Repeat, for each tissue (tumor/normal):
  - Hide the label of the test tissue
  - Diagnose the test tissue based on the remaining data
  - Compare the diagnosis to the hidden label
From Zohar Yakhini
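The LOOCV loop can be sketched as below. The classifier is a hypothetical stand-in (nearest class mean); the lecture does not specify which classifier was used, and the sample data are invented.

```python
def nearest_mean_classifier(train_x, train_y, sample):
    """Stand-in classifier: return the label whose class mean is closest to the sample."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        center = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, center))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_errors(samples, labels, classify=nearest_mean_classifier):
    """Count the tissues whose diagnosis disagrees with their hidden label."""
    errors = 0
    for i in range(len(samples)):
        # Hide sample i and train on everything else.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors


# Hypothetical two-gene expression profiles.
samples = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [5.0, 5.2], [4.8, 5.1], [5.3, 4.9]]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
print(loocv_errors(samples, labels))  # 0
```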
39BRCA1 LOOCV Results
From Zohar Yakhini
40Lung Cancer Informative Genes
Data from Naftali Kaminski's lab, at Sheba.
- 24 tumors (various types and origins)
- 10 normals (normal edges and normal lung pools)
From Zohar Yakhini
41And Now: Global Analysis of Gene Expression Data
- First (but not least): Clustering
  - either of genes, or of experiments
42Example data fold change (ratios)
What is the pattern?
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 1 8 12 16 12 8
Gene D 1 3 4 4 3 2
Gene E 1 4 8 8 8 8
Gene F 1 1 1 0.25 0.25 0.1
Gene G 1 2 3 4 3 2
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene I 1 4 8 4 1 0.5
Gene J 1 2 1 2 1 2
Gene K 1 1 1 1 3 3
Gene L 1 2 3 4 3 2
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Campbell Heyer, 2003
43Example data 2 (log2 of the fold changes)
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 0 3 3.58 4 3.58 3
Gene D 0 1.58 2 2 1.58 1
Gene E 0 2 3 3 3 3
Gene F 0 0 0 -2 -2 -3.32
Gene G 0 1 1.58 2 1.58 1
Gene H 0 -1 -1.60 -2 -1.60 -1
Gene I 0 2 3 2 0 -1
Gene J 0 1 0 1 0 1
Gene K 0 0 0 0 1.58 1.58
Gene L 0 1 1.58 2 1.58 1
Gene M 0 -1.60 -2 -2 -1.60 -1
Gene N 0 -3 -3.59 -4 -3.59 -3
Campbell Heyer, 2003
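The second table appears to be the base-2 logarithm of the fold changes in the first one, kept to two decimals. A minimal sketch of that transformation for three of the genes (the function name is an assumption; rounding may differ from the table in the last digit for some entries):

```python
import math

# Fold-change rows copied from the first example table.
fold_change = {
    "Gene C": [1, 8, 12, 16, 12, 8],
    "Gene F": [1, 1, 1, 0.25, 0.25, 0.1],
    "Gene N": [1, 0.125, 0.0833, 0.0625, 0.0833, 0.125],
}


def log2_profile(ratios):
    """Convert fold changes to log2 ratios, rounded to two decimals."""
    return [round(math.log2(r), 2) for r in ratios]


for gene, ratios in fold_change.items():
    print(gene, log2_profile(ratios))
```

On the log scale a 2-fold induction (+1) and a 2-fold repression (-1) are symmetric around zero, which is why correlations are computed on the transformed values.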
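44Pearson Correlation Coefficient, r (values in the [-1, 1] interval)
- Gene expression over d experiments is a vector in $\mathbb{R}^d$, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)
- Given two vectors X and Y that contain N elements, we calculate r as follows:
$$ r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}} $$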
Cho Won, 2003
45Example: Pearson Correlation Coefficient, r
- X = Gene C = (0, 3.00, 3.58, 4, 3.58, 3); Y = Gene D = (0, 1.58, 2.00, 2, 1.58, 1)
- $\sum XY = (0)(0) + (3)(1.58) + (3.58)(2) + (4)(2) + (3.58)(1.58) + (3)(1) = 28.5564$
- $\sum X = 3 + 3.58 + 4 + 3.58 + 3 = 17.16$
- $\sum X^2 = 3^2 + 3.58^2 + 4^2 + 3.58^2 + 3^2 = 59.6328$
- $\sum Y = 1.58 + 2 + 2 + 1.58 + 1 = 8.16$
- $\sum Y^2 = 1.58^2 + 2^2 + 2^2 + 1.58^2 + 1^2 = 13.9928$
- $N = 6$
- $\sum XY - \sum X \sum Y / N = 28.5564 - (17.16)(8.16)/6 = 5.2188$
- $\sum X^2 - (\sum X)^2 / N = 59.6328 - (17.16)^2/6 = 10.5552$
- $\sum Y^2 - (\sum Y)^2 / N = 13.9928 - (8.16)^2/6 = 2.8952$
- $r = 5.2188 / \sqrt{(10.5552)(2.8952)} = 0.944$
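The same computation as a short sketch, using the sums from the worked example (the function name is an assumption):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient from the sums of X, Y, XY, X^2 and Y^2."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)


gene_c = [0, 3.00, 3.58, 4, 3.58, 3]
gene_d = [0, 1.58, 2.00, 2, 1.58, 1]
print(round(pearson(gene_c, gene_d), 3))  # 0.944
```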
46Example data Pearson correlation coefficients
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 1 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.94 1 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E 0.96 0.84 1 -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.40 -0.10 -0.57 1 -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.95 -0.94 -0.89 0.35 -1 1 -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0.41 0.68 0.21 0.60 0.48 -0.48 1 0 -0.75 0.48 -0.68 -0.41
Gene J 0.36 0.24 0.30 -0.43 0.22 -0.21 0 1 0 0.22 -0.24 -0.36
Gene K 0.23 -0.07 0.43 -0.79 0.11 -0.11 -0.75 0 1 0.11 0.07 -0.23
Gene L 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene M -0.94 -1 -0.84 0.10 -0.94 0.94 -0.68 -0.24 0.07 -0.94 1 0.94
Gene N -1 -0.94 -0.96 0.40 -0.95 0.95 -0.41 -0.36 -0.23 -0.95 0.94 1
Campbell Heyer, 2003
47Example Reorganization of data
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene K 1 1 1 1 3 3
Gene J 1 2 1 2 1 2
Gene E 1 4 8 8 8 8
Gene C 1 8 12 16 12 8
Gene L 1 2 3 4 3 2
Gene G 1 2 3 4 3 2
Gene D 1 3 4 4 3 2
Gene I 1 4 8 4 1 0.5
Gene F 1 1 1 0.25 0.25 0.1
Campbell Heyer, 2003
48Spearman Rank Order Coefficient
- Replace each entry xi by its rank in vector x.
- Then compute Pearson correlation coefficients of rank vectors.
- Example: X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01); Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1)
  - Ranks(X) = (1,2,4,6,5,3)
  - Ranks(Y) = (1,3,5,6,4,2)
- Ties should be taken care of: (1) they are rare; (2) randomize (small effect)
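A sketch of the rank-then-correlate recipe on the example vectors. The function names are assumptions, and this version assumes no ties; tied values would need one of the treatments mentioned above.

```python
import math


def ranks(values):
    """Rank of each entry (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(x, y):
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank-order coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


x = [0, 3.00, 3.41, 4, 3.58, 3.01]
y = [0, 1.51, 2.00, 2.32, 1.58, 1]
print(ranks(x))  # [1, 2, 4, 6, 5, 3]
print(ranks(y))  # [1, 3, 5, 6, 4, 2]
print(spearman(x, y))
```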
49Grouping and Reduction
- Grouping: Partition items into groups. Items in the same group should be similar. Items in different groups should be dissimilar.
- Grouping may help discover patterns in the data.
- Reduction: reduce the complexity of data by removing redundant probes (genes).
50Unsupervised Grouping Clustering
- Pattern discovery via clustering similarly expressed genes together
- Techniques most often used:
  - k-Means Clustering
  - Hierarchical Clustering
  - Biclustering
- Alternative methods: Self Organizing Maps (SOMs), plaid models, singular value decomposition (SVD), order preserving submatrices (OPSM), ...
51Clustering Overview
- Different similarity measures in use
- Pearson Correlation Coefficient
- Cosine Coefficient
- Euclidean Distance
- Information Gain
- Mutual Information
- Signal to noise ratio
- Simple Matching for Nominal
52Clustering Overview (cont.)
- Different Clustering Methods
- Unsupervised
- k-means Clustering (not the same as k-nearest neighbors, which is a supervised classifier)
- Hierarchical Clustering
- Self-organizing map
- Supervised
- Support vector machine
- Ensemble classifier
- Data Mining
53Clustering Limitations
- Any data can be clustered, therefore we must be careful what conclusions we draw from our results
- Clustering is often randomized and can and will produce different results for different runs on the same data
54K-means Clustering
- Given a set of m data points in n-dimensional space and an integer k.
- We want to find the set of k centers in n-dimensional space that minimizes the mean squared Euclidean distance from each data point to its nearest center.
- No exact polynomial-time algorithms are known for this problem (no wonder: it is NP-hard!).
"A Local Search Approximation Algorithm for k-Means Clustering" by Kanungo et al.
55K-means Heuristic (Lloyd's Algorithm)
- Has been shown to converge to a locally optimal solution
- But can converge to a solution arbitrarily bad compared to the optimal solution
[Figure: data points, optimal centers, and heuristic centers for k = 3]
- "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality" by Selim and Ismail
- "A Local Search Approximation Algorithm for k-Means Clustering" by Kanungo et al.
56Euclidean Distance
To find the distance between two points, say the origin and the point (3,4):
$$ d = \sqrt{(3-0)^2 + (4-0)^2} = \sqrt{25} = 5 $$
Simple and fast! Remember this when we consider the complexity!
57Finding a Centroid
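- We use the following equation to find the n-dimensional centroid point (center of mass) amid k (n-dimensional) points $p_1, \dots, p_k$:
$$ c = \frac{1}{k} \sum_{i=1}^{k} p_i \quad \text{(computed coordinate by coordinate)} $$
Example: Let's find the midpoint between three 2D points, say (2,4), (5,2), (8,9):
$$ c = \left( \frac{2+5+8}{3}, \frac{4+2+9}{3} \right) = (5, 5) $$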
58K-means Iterative Heuristic
1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or other distance metric)
3. Calculate new center points for each cluster, using only points within the cluster
4. Re-cluster all data using the new center points (this step could cause some data points to be placed in a different cluster)
5. Repeat steps 3 and 4 until no data points are moved from one cluster to another (stabilization), or until some other convergence criterion is met
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
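The iterative heuristic can be sketched in a few lines of plain Python. This is an illustrative version under assumptions: the function names, the seed, and the toy data are invented, initial centers are drawn from the data points, and an empty cluster simply keeps its old center.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # choose k initial centers
    assignment = None
    for _ in range(max_iter):
        # (Re-)cluster: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:         # stabilization: nothing moved
            break
        assignment = new_assignment
        # Recompute each center from the points currently in its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return centers, assignment


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

Different seeds can give different cluster numbering, and on harder data different clusterings, which is the non-determinism discussed later in the lecture.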
59An example with 2 clusters
- We Pick 2 centers at random
- We cluster our data around these center points
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
60K-means example with k=2
- We recalculate centers based on our current
clusters
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
61K-means example with k=2
- We re-cluster our data around our new center
points
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
62K-means example with k=2
5. We repeat the last two steps until no more
data points are moved into a different cluster
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
63Choosing k
- Run algorithm on data with several different values of k
- Use advance knowledge about the characteristics of your test (e.g. cancerous vs. non-cancerous tissues, in case the experiments are being clustered)
64Cluster Quality
- Since any data can be clustered, how do we know our clusters are meaningful?
- The size (diameter) of the cluster vs. the inter-cluster distance
- Distance between the members of a cluster and the cluster's center
- Diameter of the smallest sphere containing the cluster
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
65Cluster Quality Continued
Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter.
[Figure: two cases, one with distance = 20 and diameter = 5, the other with distance = 5 and diameter = 5]
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
66Cluster Quality Continued
Can quality be assessed simply by looking at the diameter of a cluster alone?
A cluster can be formed by the heuristic even
when there is no similarity between clustered
patterns. This occurs because the algorithm
forces k clusters to be created.
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
67Characteristics of k-means Clustering
- The random selection of initial center points creates the following properties:
  - Non-determinism
  - May produce clusters without patterns
- One solution is to choose the centers randomly from existing patterns
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
68Heuristic's Complexity
- Linear in the number of data points, N
- Can be shown to have run time cN, where c does not depend on N, but rather on the number of clusters, k
- Each iteration computes N·k distances in n dimensions, so the cost per iteration is also linear in the dimension n
- ⇒ the heuristic is efficient
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
69Hierarchical Clustering
- a different clustering paradigm
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
70Hierarchical Clustering (cont.)
(Upper triangle only; "." marks an omitted entry.)
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C . 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D . . 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E . . . -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F . . . . -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G . . . . . -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H . . . . . . -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I . . . . . . . 0 -0.75 0.48 -0.68 -0.41
Gene J . . . . . . . . 0 0.22 -0.24 -0.36
Gene K . . . . . . . . . 0.11 0.07 -0.23
Gene L . . . . . . . . . . -0.94 -0.95
Gene M . . . . . . . . . . . 0.94
Gene N . . . . . . . . . . . .
Campbell Heyer, 2003
71Hierarchical Clustering (cont.)
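Gene C Gene D Gene E Gene F Gene G
Gene C . 0.94 0.96 -0.40 0.95
Gene D . . 0.84 -0.10 0.94
Gene E . . . -0.57 0.89
Gene F . . . . -0.35
Gene G . . . . .
The most similar pair, C and E (0.96), is merged into cluster 1:
1 Gene D Gene F Gene G
1 . 0.89 -0.485 0.92
Gene D . . -0.10 0.94
Gene F . . . -0.35
Gene G . . . .
- Average similarity of cluster 1 to:
  - Gene D: (0.94 + 0.84)/2 = 0.89
  - Gene F: (-0.40 + (-0.57))/2 = -0.485
  - Gene G: (0.95 + 0.89)/2 = 0.92
[Figure: dendrogram with node 1 joining C and E; D, F, G still separate]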
72Hierarchical Clustering (cont.)
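1 Gene D Gene F Gene G
1 . 0.89 -0.485 0.92
Gene D . . -0.10 0.94
Gene F . . . -0.35
Gene G . . . .
The most similar remaining pair, D and G (0.94), is merged into cluster 2.
[Figure: dendrogram with node 1 joining C and E, node 2 joining G and D; F still separate]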
73Hierarchical Clustering (cont.)
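1 2 Gene F
1 . 0.905 -0.485
2 . . -0.225
Gene F . . .
Clusters 1 and 2 (similarity 0.905) are merged into cluster 3.
[Figure: dendrogram with node 3 joining node 1 (C, E) and node 2 (G, D); F still separate]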
74Hierarchical Clustering (cont.)
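3 Gene F
3 . -0.355
Gene F . .
Finally, cluster 3 and Gene F (similarity -0.355) are merged into cluster 4.
[Figure: complete dendrogram: node 4 joins F with node 3, which joins node 1 (C, E) and node 2 (G, D)]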
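The merge sequence above can be reproduced with a short sketch. It follows the update used on these slides, where the similarity of a new cluster to any other item is the plain average of the two merged rows (a WPGMA-style update; averaging weighted by cluster size would be a different variant). Function and variable names are assumptions.

```python
from itertools import combinations


def average_linkage(sim):
    """sim maps frozenset({a, b}) -> similarity. Returns a list of merges
    as tuples (item_a, item_b, new_cluster_name, similarity_at_merge)."""
    sim = dict(sim)
    items = sorted({x for pair in sim for x in pair})
    merges = []
    next_id = 1
    while len(items) > 1:
        # Pick the most similar pair among the current items.
        a, b = max(combinations(items, 2), key=lambda p: sim[frozenset(p)])
        best = sim[frozenset((a, b))]
        new = str(next_id)
        next_id += 1
        rest = [x for x in items if x not in (a, b)]
        # Similarity of the new cluster to x: average of the two merged rows.
        for x in rest:
            sim[frozenset((new, x))] = (sim[frozenset((a, x))] + sim[frozenset((b, x))]) / 2
        items = rest + [new]
        merges.append((a, b, new, best))
    return merges


# Pearson correlations among genes C, D, E, F, G from the table above.
pairs = {"CD": 0.94, "CE": 0.96, "CF": -0.40, "CG": 0.95, "DE": 0.84,
         "DF": -0.10, "DG": 0.94, "EF": -0.57, "EG": 0.89, "FG": -0.35}
sim = {frozenset(key): value for key, value in pairs.items()}
for a, b, new, s in average_linkage(sim):
    print(f"merge {a} + {b} -> cluster {new} (similarity {s:.3f})")
```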
75Hierarchical Clustering (cont.)
Does the algorithm look familiar? Remember Neighbor-Joining!
[Figure: the complete dendrogram, nodes 1-4 over leaves F, C, E, G, D]
76Clustering of entire yeast genome
Campbell Heyer, 2003
77Hierarchical Clustering: Yeast Gene Expression Data
Eisen et al., 1998
78A SOFM Example With Yeast
"Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation" by Tamayo et al.
79SOM Description
- Each unit of the SOM has a weighted connection to all inputs
- As the algorithm progresses, neighboring units are grouped by similarity
Output Layer
Input Layer
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
80An Example Using Color
Each color in the map is associated with a weight
From http://davis.wpi.edu/matt/courses/soms/
81Cluster Analysis of Microarray Expression Data
Matrices
- Application of cluster analysis techniques in the elucidation of gene expression data
82Function of Genes
The features of a living organism are governed principally by its genes. If we want to fully understand living systems we must know the function of each gene. Once we know a gene's sequence we can design experiments to find its function.
The classical approach to assigning a function to a gene:
Delete Gene X
Gene X
Conclusion: Gene X = left eye gene.
However, this approach is too slow to handle all the gene sequence information we have today (HGSP).
83Microarray Analysis
Microarray analysis allows the monitoring of the
activities of many genes over many different
conditions. Experiments are carried out on a
Physical Matrix like the one below
To facilitate computational analysis the physical
matrix which may contain 1000s of genes is
converted into a numerical matrix using image
analysis equipment.
Possible inference: If Gene X's activity (expression) is affected by Condition Y (extreme heat), then Gene X may be involved in protecting the cellular components from extreme heat. Each gene has its corresponding expression profile for a set of conditions. This expression profile may be thought of as a feature profile for that gene for that set of conditions (a condition feature profile).
84Cluster Analysis
- Cluster Analysis is an unsupervised procedure which involves grouping of objects based on their similarity in feature space.
- In the gene expression context, genes are grouped based on the similarity of their condition feature profile.
- Cluster analysis was first applied to gene expression data from Brewer's Yeast (Saccharomyces cerevisiae) by Eisen et al. (1998).
Clustering
- Two general conclusions can be drawn from these clusters:
  - Genes clustered together may be related within a biological module/system.
  - If there are genes of known function within a cluster, these may help to classify this biological module/system.
85From Data to Biological Hypothesis
[Figure: Gene Expression Microarray (Genes 1-7 over Conditions A-Z), Cluster Set, and System C (External Stimulus (Condition X), Cell Membrane, Regulator Protein)]
Cluster C with four genes may represent System C. Relating these genes aids in elucidation of this System C.
86Some Drawbacks of Clustering Biological Data
- Clustering works well over small numbers of conditions, but a typical microarray may have hundreds of experimental conditions. A global clustering may not offer sufficient resolution with so many features.
- As with other clustering applications, it may be difficult to cluster noisy expression data.
- Biological systems tend to be inter-related and may share numerous factors (genes). Clustering enforces partitions which may not accurately represent these intimacies.
- Clustering genes over all conditions only finds the strongest signals in the dataset as a whole. More local signals within the data matrix may be missed.
87How do we better model more complex systems?
- One technique that allows detection of all signals in the data is biclustering.
- Instead of clustering genes over all conditions, biclustering clusters genes with respect to subsets of conditions.
This enables better representation of
88Biclustering
[Figure: expression matrix of Genes 1-9 over Conditions A-H]
Clustering misses the local signal {(B,E,F),(1,4,6,7,9)} present over a subset of conditions.
Biclustering discovers local coherences over a subset of conditions.
- Technique first described by J.A. Hartigan in 1972 and termed Direct Clustering.
- First introduced to microarray expression data by Cheng and Church (2000).
89Approaches to Biclustering Microarray Gene
Expression
- First applied to Gene Expression Data by Cheng and Church (2000).
  - Used a sub-matrix scoring technique to locate biclusters.
- Tanay et al. (2002)
  - Modelled the expression data on bipartite graphs and used graph techniques to find complete graphs, or biclusters.
- Lazzeroni and Owen
  - Used matrix reordering to represent different layers of signals (biclusters): Plaid Models, to represent multiple signals within data.
- Ben-Dor et al. (2002)
  - Biclusters depending on order relations (OPSM).
90Bipartite Graph Modelling
- First proposed in "Discovering statistically significant biclusters in gene expression data", Tanay et al., Bioinformatics 2002
Within the graph modelling paradigm, biclusters are equivalent to complete bipartite sub-graphs. Tanay and colleagues used probabilistic models to determine the least probable sub-graphs (those showing most order and consequently most surprising) to identify biclusters.
91The Cheng and Church Approach
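The core element in this approach is the development of a scoring to prioritise sub-matrices. This scoring is based on the concept of the residue of an entry in a matrix. In the matrix (I,J), the residue score of element $a_{ij}$ is given by
$$ R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} $$
where $a_{iJ}$ is the mean of row i, $a_{Ij}$ is the mean of column j, and $a_{IJ}$ is the mean of the whole matrix.
[Figure: element a at row i and column j of the matrix with row set I and column set J]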
92The Cheng and Church Approach(2)
The mean squared residue score (H) for a matrix (I,J) is then calculated:
$$ H(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} R(a_{ij})^2 $$
This global H score gives an indication of how the data fits together within that matrix: whether it has some coherence or is random.
A high H value signifies that the data is uncorrelated: a matrix of equally spread random values over the range [a, b] has an expected H score of $(b-a)^2/12$. For the range [0, 800], H(I,J) = 53,333.
A low H score means that there is a correlation in the matrix: a score of H(I,J) = 0 would mean that the data in the matrix fluctuates in unison, i.e. the sub-matrix is a bicluster.
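A small sketch of the H score, assuming the matrix is given as a list of rows. The function name and both example matrices are invented for illustration. The first matrix is perfectly additive (every entry is a row effect plus a column effect), so its rows fluctuate in unison and H is 0; perturbing one entry raises the score.

```python
def mean_squared_residue(matrix):
    """H score: mean of the squared residues a_ij - rowmean_i - colmean_j + overall mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(matrix):
        for j, a in enumerate(row):
            residue = a - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


coherent = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # hypothetical additive matrix
perturbed = [[1, 2, 3], [4, 9, 6], [7, 8, 9]]    # one entry changed
print(mean_squared_residue(coherent))            # 0.0
print(mean_squared_residue(perturbed))
```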
93Worked example of H score
Matrix (M): Avg. = 6.5
Row Avg.: 2, 5, 8, 11
Col Avg.: 5.4, 6.4, 7.4
R(1) = 1 - 2 - 5.4 + 6.5 = 0.1
R(2) = 2 - 2 - 6.4 + 6.5 = 0.1
R(12) = 12 - 11 - 7.4 + 6.5 = 0.1
H(M) = (0.01 x 12)/12 = 0.01
If 5 was replaced with 3 then the score would change to H(M2) = 2.06.
If the matrix was reshuffled randomly the score would be around H(M3) = (12-1)^2/12 = 10.08.
94The Cheng and Church Approach Node Deletion
Biclustering Algorithm
In order to find all possible biclusters in an
Expression Matrix all sub-matrices must be tested
using the H score.
Node Deletion
In a node deletion algorithm all columns and rows are tested for deletion. If removing a row or column decreases the H score of the matrix, then it is removed.
This continues until it is not possible to
decrease the H score further. This low H score
coherent sub-matrix (bicluster) is then returned.
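A brute-force sketch of greedy node deletion as described above: at each step, remove the single row or column whose removal lowers H the most, and stop once H falls below a threshold delta or no removal helps. The names, the threshold parameter, and the minimum size of 2 rows and 2 columns (which keeps the search from shrinking to a trivial matrix) are assumptions of this sketch; Cheng and Church use more efficient deletion rules than re-scoring every candidate.

```python
def mean_squared_residue(matrix):
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    overall = sum(row_means) / n_rows
    total = sum(
        (a - row_means[i] - col_means[j] + overall) ** 2
        for i, row in enumerate(matrix)
        for j, a in enumerate(row)
    )
    return total / (n_rows * n_cols)


def node_deletion(matrix, delta, min_size=2):
    """Return (row indices, column indices, H score) of the remaining sub-matrix."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))

    def score(rs, cs):
        return mean_squared_residue([[matrix[i][j] for j in cs] for i in rs])

    current = score(rows, cols)
    while current > delta:
        candidates = []
        if len(rows) > min_size:
            for r in rows:
                candidates.append((score([i for i in rows if i != r], cols), "row", r))
        if len(cols) > min_size:
            for c in cols:
                candidates.append((score(rows, [j for j in cols if j != c]), "col", c))
        if not candidates:
            break
        best_score, kind, index = min(candidates)
        if best_score >= current:        # no deletion decreases H any further
            break
        if kind == "row":
            rows.remove(index)
        else:
            cols.remove(index)
        current = best_score
    return rows, cols, current


# Hypothetical matrix: three coherent rows plus one noisy row (index 3).
noisy = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [9, 1, 7, 2]]
print(node_deletion(noisy, delta=0.01))
```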
95The Cheng and Church Approach
Some results on lymphoma data (4026 × 96)
No. of genes, no. of conditions No. of genes, no. of conditions No. of genes, no. of conditions
4, 96 10, 29 11, 25
103, 25 127, 13 13, 21
10, 57 2, 96 25, 12
9, 51 3, 96 2, 96
96Conclusions
- High throughput Functional Genomics (Microarrays) requires Data Mining Applications.
- Biclustering resolves Expression Data more effectively than single dimensional Cluster Analysis.
- Cheng and Church Approach offers a good base for future work.
- Future Research/Questions:
  - Implement a simple H score program to facilitate study of the H score concept.
  - Are there other alternative scorings which would better apply to gene expression data?
  - Have unbiclustered genes any significance? Horizontally transferred genes?
  - Implement a full scale biclustering program and look at better adaptation to expression data sets and the biological context.
97References
- "Basic microarray analysis: grouping and feature reduction" by Soumya Raychaudhuri, Patrick D. Sutphin, Jeffery T. Chang and Russ B. Altman, Trends in Biotechnology Vol. 19 No. 5, May 2001
- "Self Organizing Maps", Tom Germano, http://davis.wpi.edu/matt/courses/soms
- "Data Analysis Tools for DNA Microarrays" by Sorin Draghici, Chapman & Hall/CRC, 2003
- "Self-Organizing-Feature-Maps versus Statistical Clustering Methods: A Benchmark" by A. Ultsch, C. Vetter, FG Neuroinformatik & Künstliche Intelligenz, Research Report 0994
98References
- "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation" by Tamayo et al.
- "A Local Search Approximation Algorithm for k-Means Clustering" by Kanungo et al.
- "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality" by Selim and Ismail