Analysis of Multiple Experiments TIGR Multiple Experiment Viewer (MeV) - PowerPoint PPT Presentation

Title: PowerPoint Presentation Author: nbhagaba Last modified by: nbhagaba Created Date: 7/8/2002 2:59:22 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 131
Provided by: nbhagaba


Analysis of Multiple Experiments
TIGR Multiple Experiment Viewer (MeV)
Advanced Course Coverage
  • Introduction
  • -fundamental concepts, expression vectors and
    distance metrics
  • -fundamental statistical concepts encountered in
    MeV analysis modules
  • Algorithm Coverage
  • -Lecture / Hands on Exercises
  • (refer to algorithm handout for order)

Microarray Data Flow
[Flow diagram with the following components: Scheduler (Machine Scheduling), SliTrack (Machine Control), PCR Score, MABCOS (Barcode System), Exp Designer, .tiff Image File, Spotfinder (Image Analysis), MADAM (Data Manager), Expression Data, Raw .tav File, Miner (.tav File Creator), MIDAS (Normalization), GenePix Converter, Normalized .tav File, Query Window, MeV (Data Analysis)]
The Expression Matrix is a representation of data
from multiple microarray experiments.
Each element is a log ratio (usually log2(Cy5 / Cy3)).
Black indicates a log ratio of zero, i.e., Cy5
and Cy3 are very close in value.
Green indicates a negative log ratio, i.e., Cy5 < Cy3.
Gray indicates missing data.
Red indicates a positive log ratio, i.e., Cy5 > Cy3.
Expression Vectors
  • Gene Expression Vectors encapsulate the expression of a gene over a set
    of experimental conditions or sample types.

Expression Vectors As Points in Expression Space
[Figure: expression vectors plotted as points on three axes (Experiment 1, Experiment 2, Experiment 3); a group of nearby points is labeled "Similar Expression"]
Distance and Similarity
-the ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms
-distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression
-selection of a distance metric defines the concept of distance
Distance: a measure of similarity between genes.
  • Some distances (MeV provides 11 metrics)
  • Euclidean: $\sqrt{\sum_{i=1}^{n} (x_{iA} - x_{iB})^2}$, where n is the number of experiments

3. Pearson correlation
Distance is Defined by a Metric
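To make two of the listed metrics concrete, here is a minimal sketch in plain Python. The function names and the three example vectors are hypothetical, and converting the Pearson correlation into a distance as 1 - r is an assumption (one common convention), not necessarily how MeV does it.

```python
import math

def euclidean_distance(a, b):
    # Square root of the summed squared differences across experiments.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_r(a, b):
    # Pearson correlation coefficient between two expression vectors.
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_distance(a, b):
    # Assumed convention: distance = 1 - r, so it ranges from 0 to 2.
    return 1.0 - pearson_r(a, b)

# Hypothetical log-ratio vectors over three experiments.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # same shape as gene_a, twice the magnitude
gene_c = [3.0, 2.0, 1.0]   # opposite trend to gene_a

print(euclidean_distance(gene_a, gene_b))
print(pearson_distance(gene_a, gene_b))
print(pearson_distance(gene_a, gene_c))
```

The two metrics disagree on gene_a and gene_b: they are far apart in Euclidean terms but identical in pattern under the correlation-based distance, which is why the choice of metric defines what "similar" means.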
Statistical Concepts
Probability distributions
The probability of an event is the likelihood of
its occurring. It is sometimes computed as a
relative frequency (rf), where
rf = (the number of favorable outcomes for an event) / (the total number of possible outcomes for that event).
The probability of an event can sometimes be
inferred from a theoretical probability
distribution, such as a normal distribution.
Normal distribution
σ = std. deviation of the distribution
X = µ (mean of the distribution)
Less than a 5% chance that the sample with mean s
came from population 1, i.e., s is significantly
different from mean 1 at the p < 0.05
significance level. But we cannot reject the
hypothesis that the sample came from population 2.
Many biological variables, such as height and
weight, can reasonably be assumed to approximate
the normal distribution. But expression
measurements? Probably not. Fortunately, many
statistical tests are considered to be fairly
robust to violations of the normality
assumption, and other assumptions used in these
tests. Randomization / resampling based tests
can be used to get around the violation of the
normality assumption. Even when parametric
statistical tests (the ones that make use of
normal and other distributions) are valid,
randomization tests are still useful.
Outline of a randomization test - 1
  1. Compute the value of interest (i.e., the
    test-statistic s) from your data set.
  2. Make fake data sets from your original data, by
    taking a random sub-sample of the data, or by
    re-arranging the data in a random fashion.
  3. Re-compute s from the fake data set.

[Figure: the original data set yields s; each randomized data set yields a fake s]
Outline of a randomization test - 2
4. Repeat steps 2 and 3 many times (often several
hundred to several thousand times). Keep a
record of the fake s values from step 3.
5. Draw inferences about the significance of your
original s value by comparing it with the
distribution of the randomized (fake) s values.
[Figure: range of randomized s values; the original s value could be significant, as it exceeds most of the randomized s values]
Outline of a randomization test - 3
Rationale: Ideally, we want to know the
behavior of the larger population from which
the sample is drawn, in order to make
statistical inferences. Here, we don't know
that the larger population behaves like a
normal distribution, or some other idealized
distribution. All we have to work with are the
data in hand. Our fake data sets are our best
guess about this behavior (i.e., if we had been
pulling data at random from an infinitely large
population, we might expect to get a
distribution similar to what we get by pulling
random sub-samples, or by reshuffling the order
of the data in our sample).
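The five steps of the randomization test can be sketched as follows. This is an illustration under stated assumptions, not MeV code: the test statistic s is taken to be the absolute difference between two group means, the fake data sets are made by reshuffling group membership, and the data values are made up.

```python
import random

def mean(values):
    return sum(values) / len(values)

def randomization_test(group_a, group_b, n_shuffles=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: test statistic s from the original data.
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    fake_stats = []
    for _ in range(n_shuffles):
        # Step 2: make a fake data set by re-arranging the data at random.
        rng.shuffle(pooled)
        # Step 3: re-compute s from the fake data set.
        fake_stats.append(abs(mean(pooled[:n_a]) - mean(pooled[n_a:])))
    # Steps 4-5: compare the original s with the distribution of fake s values.
    n_as_extreme = sum(1 for s in fake_stats if s >= observed)
    return observed, n_as_extreme / n_shuffles

# Hypothetical measurements for two groups.
observed_s, p_value = randomization_test(
    [2.1, 2.4, 1.9, 2.6, 2.2], [0.1, -0.3, 0.2, 0.0, -0.1]
)
print(observed_s, p_value)
```

The returned fraction is the share of fake s values at least as large as the original one; a small fraction means the original s lies in the tail of the randomized distribution.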
  • The problem of multiple testing
  • (adapted from presentation by Anja von
    Heydebreck, Max Planck Institute for Molecular
  • Dept. Computational Molecular Biology, Berlin,
  • http//
  • Let's imagine there are 10,000 genes on a chip, and
    none of them is differentially expressed.
  • Suppose we use a statistical test for differential
    expression, where we consider a gene to be
    differentially expressed if it meets the
    criterion at a p-value of p < 0.05.

  • The problem of multiple testing 2
  • Let's say that applying this test to gene G1
    yields a p-value of p = 0.01.
  • Remember that a p-value of 0.01 means that, if the gene
    were not differentially expressed, there would be only a 1%
    chance of obtaining a result at least this extreme.
  • So even though we conclude that the gene is
    differentially expressed (because p < 0.05),
    there is still a chance that our conclusion is wrong.
  • We might be willing to live with such a low
    probability of being wrong
  • BUT .....

  • The problem of multiple testing 3
  • We are testing 10,000 genes, not just one!!!
  • Even though none of the genes is differentially
    expressed, about 5% of the genes (i.e., 500
    genes) will be erroneously concluded to be
    differentially expressed, because we have decided
    to live with a p-value of 0.05.
  • If only one gene were being studied, a 5% margin
    of error might not be a big deal, but 500 false
    conclusions in one study? That doesn't sound too good.
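The 500-false-positives arithmetic can be checked with a small simulation. This is a sketch under assumptions that are not in the slides: each of the 10,000 null genes gets 8 replicate log ratios drawn from a standard normal distribution, and significance is assessed with a two-sided z-test with known unit variance.

```python
import math
import random

def two_sided_p_from_z(z):
    # Two-sided p-value for a standard normal test statistic.
    return math.erfc(abs(z) / math.sqrt(2.0))

def count_false_positives(n_genes=10000, n_reps=8, alpha=0.05, seed=1):
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_genes):
        # No gene is differentially expressed: the true mean log ratio is 0.
        values = [rng.gauss(0.0, 1.0) for _ in range(n_reps)]
        z = (sum(values) / n_reps) * math.sqrt(n_reps)
        if two_sided_p_from_z(z) < alpha:
            false_positives += 1
    return false_positives

n_false = count_false_positives()
print(n_false)  # expected to land near 0.05 * 10000 = 500
```

Every gene counted here is a false conclusion, since by construction none of them is differentially expressed.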

  • The problem of multiple testing - 4
  • There are tricks we can use to reduce the
    severity of this problem.
  • They all involve slashing the p-value for each
    test (i.e., gene), so that while the critical p-value
    for the entire data set might still equal 0.05,
    each gene will be evaluated at a lower p-value.
  • We'll go into some of these techniques later.

  • Don't get too hung up on p-values.
  • Ultimately, what matters is biological
  • P-values should help you evaluate the strength of
  • evidence, rather than being used as an absolute
  • of significance. Statistical significance is not
  • the same as biological significance.

  • i.e., you don't want to belong to that group of
    people whose aim in life is to be wrong 5% of the time.

Kempthorne, O., and T.E. Doerfler. 1969. The
behaviour of some significance tests under
experimental randomization. Biometrika
56:231-248, as cited in Manly, B.F.J. 1997.
Randomization, bootstrap and Monte Carlo methods
in biology, pg. 1. Chapman and Hall / CRC.
  • Pearson correlation coefficient r
  • Indicates the degree to which a linear
    relationship can be approximated between two variables.
  • Can range from (-1.0) to (+1.0).
  • Positive r between two variables X and Y: as X
    increases, so does Y on the whole.
  • Negative r: as X increases, Y generally decreases.
  • The higher the magnitude of r (in the positive
    or negative direction), the more linear the relationship.

  • Pearson correlation - 2
  • Sometimes, a p-value is associated with the
    correlation coefficient r.
  • This p-value is computed from a theoretical
    distribution of the correlation coefficient,
    similar to the normal distribution.

This is the p-value for the null hypothesis
that the X and Y data for our sample come from a
population in which their correlation is zero,
i.e., the null hypothesis is that there is no
linear relationship between X and Y. If p is
sufficiently small (often p < 0.05), we can
reject the null hypothesis, i.e., we conclude
that there is indeed a linear relationship
between X and Y.
Pearson correlation - 3
The square of the Pearson correlation, r², also known as the
coefficient of determination, is a measure of the
strength of the linear relationship between X
and Y. It is the proportion of the total
variation in Y that is explained by a
linear relationship with X.
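The "proportion of variation explained" reading of the squared correlation can be verified numerically. The sketch below uses made-up data and hypothetical function names: it computes r directly, then fits a least-squares line and computes 1 - SS_residual / SS_total. The two quantities agree.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_squared_from_fit(x, y):
    # Fit y = intercept + slope * x by least squares, then measure
    # how much of the variation in y the fitted line accounts for.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_residual / ss_total

# Hypothetical paired measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
print(pearson_r(x, y) ** 2, r_squared_from_fit(x, y))
```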
Hierarchical Clustering (HCL)
HCL is an agglomerative clustering method which
joins similar genes into groups. The iterative
process continues with the joining of resulting
groups based on their similarity until all groups
are connected in a hierarchical tree.

Hierarchical Clustering
g1 is most like g8
g4 is most like g1, g8
Hierarchical Clustering
g5 is most like g7
g5,g7 is most like g1, g4, g8
Hierarchical Tree
Hierarchical Clustering
During construction of the hierarchy, decisions
must be made to determine which clusters should
be joined. The distance or similarity between
clusters must be calculated. The rules that
govern this calculation are linkage methods.
Agglomerative Linkage Methods
  • Linkage methods are rules or metrics that return
    a value that can be used to determine which
    elements (clusters) should be linked.
  • Three linkage methods that are commonly used are
  • Single Linkage
  • Average Linkage
  • Complete Linkage

Single Linkage
Cluster-to-cluster distance is defined as the
minimum distance between members of one cluster
and members of another cluster. Single
linkage tends to create elongated clusters with
individual genes chained onto clusters.
$D_{AB} = \min \, d(u_i, v_j)$, where $u \in A$ and $v \in B$, for all
$i = 1$ to $N_A$ and $j = 1$ to $N_B$
Average Linkage
Cluster-to-cluster distance is defined as the
average distance between all members of one
cluster and all members of another cluster.
Average linkage has a slight tendency to produce
clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i} \sum_{j} d(u_i, v_j)$, where $u \in A$ and $v \in B$, for all
$i = 1$ to $N_A$ and $j = 1$ to $N_B$
Complete Linkage
Cluster-to-cluster distance is defined as the
maximum distance between members of one cluster
and members of another cluster. Complete
linkage tends to create clusters of similar size
and variability.
$D_{AB} = \max \, d(u_i, v_j)$, where $u \in A$ and $v \in B$, for all
$i = 1$ to $N_A$ and $j = 1$ to $N_B$
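The three linkage rules differ only in how they summarize the set of pairwise member-to-member distances. A minimal sketch, assuming Euclidean distance between members and two made-up clusters of 2-D points (all names are hypothetical):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_distances(cluster_a, cluster_b, dist=euclidean):
    # d(u_i, v_j) for every member u_i of A and every member v_j of B.
    return [dist(u, v) for u in cluster_a for v in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)   # divides by N_A * N_B

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))

# Two hypothetical clusters.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
```

For any pair of clusters, the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the complete-linkage distance.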
Comparison of Linkage Methods
Bootstrapping (ST)
Bootstrapping: resampling with replacement
[Figure: original expression matrix (Gene 1 to Gene 6) and various bootstrapped matrices (resampled by experiments)]
Jackknifing (ST)
Jackknifing: resampling without replacement
[Figure: original expression matrix and various jackknifed matrices (resampled by experiments)]
Analysis of Bootstrapped and Jackknifed Support
  • Bootstrapped or jackknifed expression matrices
    are created many times by randomly resampling the
    original expression matrix, using either the
    bootstrap or jackknife procedure.
  • Each time, hierarchical trees are created from
    the resampled matrices.
  • The trees are compared to the tree obtained from
    the original data set.
  • The more frequently a given cluster from the
    original tree is found in the resampled trees,
    the stronger the support for the cluster.
  • As each resampled matrix lacks some of the
    original data, high support for a cluster means
    that the clustering is not biased by a small
    subset of the data.
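The resampling step alone (not the tree building or tree comparison) can be sketched as below. Assumptions that go beyond the slides: rows are genes and columns are experiments, the bootstrap draws as many experiments as the original matrix has, and the jackknife leaves out exactly one randomly chosen experiment. The function names and the tiny matrix are hypothetical.

```python
import random

def bootstrap_experiments(matrix, rng):
    # Draw experiment (column) indices WITH replacement; some experiments
    # may appear more than once and others not at all.
    n_exps = len(matrix[0])
    columns = [rng.randrange(n_exps) for _ in range(n_exps)]
    return [[row[c] for c in columns] for row in matrix]

def jackknife_experiments(matrix, rng):
    # Leave out one randomly chosen experiment (no column is repeated).
    n_exps = len(matrix[0])
    left_out = rng.randrange(n_exps)
    return [[v for c, v in enumerate(row) if c != left_out] for row in matrix]

# Hypothetical 2-gene x 4-experiment matrix.
matrix = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
rng = random.Random(0)
boot = bootstrap_experiments(matrix, rng)
jack = jackknife_experiments(matrix, rng)
print(boot)
print(jack)
```

In a support analysis, each resampled matrix would then be clustered and the resulting tree compared with the tree from the original matrix.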

K-Means / K-Medians Clustering (KMC) 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
K-Means Clustering 2
3. Calculate mean / median expression profile of
each cluster.
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean / median
expression profile (calculated in step 3) is the
closest to that gene's expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
K-Means / K-Medians is most useful when the user
has an a-priori hypothesis about the number of
clusters the genes should group into.
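Steps 1 to 5 can be sketched for the K-Means case as follows. This is an illustration, not MeV's implementation: squared Euclidean distance is assumed, and re-seeding an empty cluster with a randomly chosen gene is an added assumption the slides do not mention. The six gene vectors are made up.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(vectors):
    # Mean expression profile of a cluster, experiment by experiment.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(genes, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly assign genes to clusters.
    labels = [rng.randrange(k) for _ in genes]
    for _ in range(max_iterations):
        # Step 3: calculate the mean profile of each cluster.
        centroids = []
        for cluster in range(k):
            members = [g for g, label in zip(genes, labels) if label == cluster]
            if members:
                centroids.append(mean_profile(members))
            else:
                # Assumption: an empty cluster is re-seeded with a random gene.
                centroids.append(list(rng.choice(genes)))
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(gene, centroids[c]))
            for gene in genes
        ]
        # Step 5: stop when no gene changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical genes: three over-expressed, three under-expressed.
genes = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0],
         [-5.0, -5.0], [-6.0, -6.0], [-7.0, -7.0]]
labels = k_means(genes, k=2)
print(labels)
```

Which cluster receives label 0 or 1 depends on the random start, but the over-expressed and under-expressed genes end up in separate clusters for this data.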
Principal Components (PCAG and PCAE) 1
  • PCA simplifies the views of the data.
  • Suppose we have measurements for each gene on
    multiple experiments.
  • Suppose some of the experiments are correlated.
  • PCA will ignore the redundant experiments, and
    will take a weighted average of some of the experiments, thus
    possibly making the trends in the data more interpretable.
  • The components can be thought of as axes in n-dimensional
    space, where n is the number of components. Each
    axis represents a different trend in the data.

PCAG and PCAE - 2
In this example:
x-axis could mean a continuum from over- to under-expression (blue and green genes over-expressed, yellow genes under-expressed);
y-axis could mean that gray genes are over-expressed in the first five expts and under-expressed in the remaining expts, while brown genes are under-expressed in the first five expts, and over-expressed in the remaining expts;
z-axis might represent different cyclic patterns, e.g., red genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for purple genes.
Interpretation of components is somewhat
Cluster Affinity Search Technique (CAST)
-uses an iterative approach to segregate elements with high affinity into a cluster
-the process iterates through two phases
-addition of high affinity elements to the cluster being created
-removal or clean-up of low affinity elements from the cluster being created
Cluster Affinity Search Technique (CAST) - 1
Affinity: a measure of similarity between a gene, and all the genes in a cluster.
Threshold affinity: user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.
1. Create a new empty cluster C1.
2. Set initial affinity of all genes to zero.
3. Move the two most similar genes into the new cluster.
4. Update the affinities of all the genes (new
affinity of a gene = its previous affinity + its
similarity to the gene(s) newly added to the
cluster C1).
5. While there exists an unassigned gene whose
affinity to the cluster C1 exceeds
the user-specified threshold affinity, pick the
unassigned gene whose affinity is the
highest, and add it to cluster C1. Update the
affinities of all the genes accordingly.
6. When there are no more unassigned
high-affinity genes, check to see if cluster C1
contains any elements whose affinity is lower
than the current threshold. If so, remove the
lowest-affinity gene from C1. Update the
affinities of all genes by subtracting from each
gene's affinity its similarity to the removed gene.
7. Repeat step 6 while C1 contains a low-affinity gene.
[Figure: current cluster C1 and unassigned genes]
8. Repeat steps 5-7 as long as changes occur to
the cluster C1.
9. Form a new cluster with the genes that were
not assigned to cluster C1, repeating steps 1-8.
10. Keep forming new clusters following steps
1-9, until all genes have been assigned to a cluster.
QT-Clust (from Heyer et al. 1999) (HJC) - 1
1. Compute a jackknifed distance between all pairs of genes.
(Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outliers.)

2. Choose a gene as the seed for a new cluster.
Add the gene which increases cluster diameter
the least. Continue adding genes until
additional genes will exceed the specified
cluster diameter limit.
3. Repeat step 2 for every gene, so that each
gene has the chance to be the seed of a new
cluster. All clusters are provisional at this point.
QT-Clust 2
4. Choose the largest cluster obtained from steps
2 and 3. In case of a tie, pick one of the
largest clusters at random.
Seed gene
Pick this cluster
5. All genes that are not in the cluster selected
above are treated as currently unassigned.
Repeat steps 2-4 on these unassigned genes.
6. Stop when the last cluster thus formed has
fewer genes than a user-specified number. All
genes that are not in a cluster at this point are
treated as unassigned.
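The jackknifed distance from step 1 can be illustrated in a few lines. This sketch assumes the underlying distance is 1 minus the Pearson correlation (an assumption; the slides do not fix the metric), and the two gene vectors are made up so that a single outlier experiment dominates the ordinary distance.

```python
import math

def pearson_distance(a, b):
    # Assumed base metric: 1 - Pearson correlation.
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def jackknifed_distance(a, b, dist=pearson_distance):
    # Exclude each experiment in turn from both genes and keep the
    # maximum (most conservative) distance over all exclusions.
    distances = []
    for left_out in range(len(a)):
        a_sub = a[:left_out] + a[left_out + 1:]
        b_sub = b[:left_out] + b[left_out + 1:]
        distances.append(dist(a_sub, b_sub))
    return max(distances)

# Hypothetical genes that look alike only because of the last experiment.
gene_a = [0.1, 0.2, 0.15, 0.1, 5.0]
gene_b = [0.2, 0.1, 0.1, 0.2, 5.0]
print(pearson_distance(gene_a, gene_b))     # very small: the outlier dominates
print(jackknifed_distance(gene_a, gene_b))  # large once the outlier is excluded
```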
Self Organizing Tree Algorithm
SOTA - 1
  • Dopazo, J., and J.M. Carazo. Phylogenetic
    reconstruction using an unsupervised growing
    neural network that adopts the topology of a
    phylogenetic tree. J. Mol. Evol. 44:226-233,
  • Herrero, J., A. Valencia, and J. Dopazo. A
    hierarchical unsupervised growing neural network
    for clustering gene expression patterns.
    Bioinformatics, 17(2):126-136, 2001.

SOTA Characteristics
SOTA - 2
  • Divisive clustering, allowing high level
    hierarchical structure to be revealed without
    having to completely partition the data set down
    to single gene vectors
  • Data set is reduced to clusters arranged in a
    binary tree topology
  • The number of resulting clusters is not fixed
    before clustering
  • Neural network approach which has advantages
    similar to SOMs such as handling large data sets
    that have large amounts of noise

SOTA Topology
SOTA - 3
[Figure labels: Centroid Vector, Parent Node, Winning Cell, Sister Cell]
α = migration factor (α_s < α_p < α_w)
Adaptation Overview
SOTA - 4
-each gene vector associated with the parent is compared to the centroid vector of its offspring cells.
-the most similar cell's centroid and its neighboring cells are adapted using the appropriate migration weights.
SOTA - 5
-following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.
-if the system diversity improves (decreases) then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.
SOTA - 6
The most diverse cell is selected for division
at the start of the next training cycle.
Growth Termination
SOTA - 7
Expansion stops when the most diverse cell's
diversity falls below a threshold.
SOTA - 8
Each training cycle ends when the overall tree
diversity stabilizes. This triggers a cell
division and possibly a new training cycle.
Self-organizing maps (SOMs) 1
1. Specify the number of nodes (clusters)
desired, and also specify a 2-D geometry for the
nodes, e.g., rectangular or hexagonal
N = Nodes, G = Genes
SOMs 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The
node closest to G9 (N2) is moved the most, and
the other nodes are moved by smaller varying
amounts. The further away the node is from N2,
the less it is moved.
SOM Neighborhood Options
Bubble Neighborhood: some nodes move, alpha is constant.
Gaussian Neighborhood: all nodes move, alpha is scaled.
SOMs 3
4. Steps 2 and 3 (i.e., choosing a random gene
and moving the nodes towards it) are repeated
many (usually several thousand) times. However,
with each iteration, the amount that the nodes
are allowed to move is decreased.
5. Finally, each node will nestle among a
cluster of genes, and a gene will be considered
to be in the cluster if its distance to the node
in that cluster is less than its distance to any
other node
Template Matching
-template matching allows one to find expression vectors which match a provided template
-a template can be derived from
 - a gene known to be central to the area of study
 - a sample or set of samples of a particular type
 - a cluster with a mean pattern of interest
 - a pattern constructed to reveal trends based on knowledge of the experimental design
-Sometimes it is useful to identify elements that have complementary patterns by selecting to use the absolute value of r.
K-Means / K-Medians Support (KMS)
  • Because of the random initialization of K-Means / K-Medians,
    clustering results may vary somewhat between successive runs on
    the same dataset. KMS helps us validate the clustering results
    obtained from K-Means / K-Medians.
  • 1. Run K-Means / K-Medians multiple times.
  • 2. The KMS module generates clusters in which the member genes
    frequently group together in the same clusters (consensus clusters)
    across multiple runs of K-Means / K-Medians.
  • 3. The consensus clusters consist of genes that clustered together
    in at least x% of the K-Means / K-Medians runs, where x is the
    threshold percentage input by the user.

Gene Shaving
Results in a series of nested clusters
Choose cluster of appropriate size as determined
by gap statistic calculation
Repeat until only one gene remains
Orthogonalize expression matrix with respect to
the average gene in the cluster and repeat
shaving procedure
Gene Shaving
Gap statistic calculation (choosing cluster size)
Quality measure for clusters:
between variance: variance of the mean gene across experiments
within variance: variance of each gene about the cluster average
Large R² implies a tight cluster of coherent genes.
The final cluster contains a set of genes that
are greatly affected by the experimental
conditions in a similar way.
Create random permutations of the expression
matrix and calculate R² for each.
Compare R² of each cluster to that of the entire
expression matrix.
Choose the cluster whose R² is furthest from the
average R² of the permuted expression matrices.
Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes
Genes with low entropy (least variable across
experiments) are excluded from analysis.
Relevance Networks
Tmin = 0.50
Tmax = 0.90
The expression pattern of each gene is compared to
that of every other gene.
The ability of each gene to predict the
expression of each other gene is assigned a
correlation coefficient.
Correlation coefficients outside the boundaries
defined by the minimum and maximum thresholds are eliminated.
The remaining relationships between genes define
the subnets.
T-Tests (TTEST): Between subjects (or unpaired) - 1
1. Assign experiments to two groups, e.g., in the expression matrix
below, assign Experiments 1, 2 and 5 to group A, and
experiments 3, 4 and 6 to group B.

2. Question: Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
TTEST: Between subjects - 2
3. Calculate t-statistic for each gene.
4. Calculate probability value of the t-statistic
for each gene either from:
A. Theoretical t-distribution, OR
B. Permutation tests.
TTEST - Between subjects - 3
Permutation tests
i) For each gene, compute t-statistic.
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
[Figure: original grouping vs. randomized grouping]
TTEST - Between subjects - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized gene.
iv) Repeat steps ii-iii n times (where n is
specified by the user).
v) Let x = the number of times the absolute value of the original
t-statistic exceeds the absolute values of the
randomized t-statistic over n randomizations.
vi) Then, the p-value associated with the gene = 1 - (x/n).
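Steps i) to vi) for a single gene can be sketched as follows. Assumptions beyond the slides: the t-statistic uses the unequal-variance (Welch) form, since the slides do not say which form is used, and the example uses five made-up values per group rather than the three-versus-three split in the slide.

```python
import math
import random

def t_statistic(a, b):
    # Two-sample t-statistic, unequal-variance (Welch) form (an assumption).
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

def permutation_p_value(a, b, n_permutations=1000, seed=0):
    rng = random.Random(seed)
    t_original = abs(t_statistic(a, b))            # step i
    pooled = list(a) + list(b)
    x = 0
    for _ in range(n_permutations):                # step iv
        rng.shuffle(pooled)                        # step ii: same group sizes
        t_random = abs(t_statistic(pooled[:len(a)], pooled[len(a):]))  # step iii
        if t_original > t_random:                  # step v
            x += 1
    return 1.0 - x / n_permutations                # step vi: p = 1 - (x/n)

# Hypothetical log ratios of one gene in groups A and B.
group_a = [2.1, 2.4, 1.9, 2.6, 2.2]
group_b = [0.1, -0.3, 0.2, 0.0, -0.1]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

In a full analysis this would be repeated for every gene in the expression matrix.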
TTEST - Between subjects - 5
  • 5. Determine whether a gene's expression levels are significantly
    different between the two groups by one of three methods:
  • A) Just alpha: If the calculated p-value for a gene is less than
    or equal to the user-input alpha (critical p-value), the gene is
    considered significant.
  • OR
  • Use Bonferroni corrections to reduce the probability of
    erroneously classifying non-significant genes as significant.
  • B) Standard Bonferroni correction: The user-input alpha is divided
    by the total number of genes to give a critical p-value that is used
    as above.

TTEST - Between subjects 6
5C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes (alpha / N), where N is
the total number of genes; for the gene with the
second-highest t-value, the critical p-value will
be (alpha / (N - 1)), and so on.
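The two corrections can be compared on a toy example. This sketch follows the description above literally; ranking by the absolute value of t is an assumption (the slide just says "t-values"), and the t-values and p-values are made up for illustration.

```python
def standard_bonferroni(p_values, alpha=0.05):
    # Every gene is judged against the same cutoff, alpha / N.
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

def adjusted_bonferroni(t_values, p_values, alpha=0.05):
    # Rank genes by |t| in descending order (assumption: absolute value).
    # Cutoffs are alpha / N, alpha / (N - 1), ... down the ranking.
    n = len(p_values)
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    significant = [False] * n
    for rank, gene in enumerate(order):
        cutoff = alpha / (n - rank)
        significant[gene] = p_values[gene] <= cutoff
    return significant

# Hypothetical results for four genes.
t_values = [5.2, 1.0, 3.9, 3.1]
p_values = [0.001, 0.30, 0.014, 0.03]
print(standard_bonferroni(p_values))            # cutoff 0.0125 for every gene
print(adjusted_bonferroni(t_values, p_values))  # cutoffs 0.0125, 0.0167, 0.025, 0.05
```

The third gene (p = 0.014) fails the standard cutoff of 0.0125 but passes its adjusted cutoff of 0.05 / 3, which shows why the adjusted version is less conservative.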
TTEST 1-class (or One-sample t-test) - 1
  • 1. Used to test if the mean expression of a gene
    over all experiments is
    different from a hypothesized mean.

[Figure: expression matrix with gene vectors 1 to 3 (Gene 1 to Gene 3) as rows and Exp 1 to Exp 6 as columns]
2. Question: Is the mean of the values of a given
gene vector significantly different from a
hypothesized mean?
TTEST- 1 Class - 2
3. Often, the hypothesized mean in gene
expression studies is zero, meaning that we are
looking for genes whose mean log2 ratio across
all experiments is significantly different from
zero.
4. Using 1-sample t-tests, we can
select genes which, on average, show
differential expression across all experiments
(since genes with no differential expression
should have a mean log2 ratio of zero across all
expts).
5. Calculate t-value, where
$t = \frac{\text{Observed mean of gene vector} - \text{Hypothesized mean of gene vector}}{\text{Standard error of the mean of the gene vector}}$
TTEST 1 class - 3
6. Calculate p-value from a theoretical
t-distribution, OR
7. By permutation:
7a. Randomly pick some elements of the gene vector,
and change their values, such that the new value
of the changed element is
new value = original value - 2 x (original value - hypothesized mean)
(i.e., flip the element's deviation around the
hypothesized mean). Thus, if the hypothesized mean is
zero, the randomly chosen elements are flipped
around zero, i.e., their signs are reversed.
TTEST 1 class - 4
7b. Calculate t-value from the randomized gene.
7c. Repeat 7a and 7b as many times as
desired. If all permutations are chosen, then
every possible combination of elements in the
gene vector is chosen for flipping.
7d. The p-value = 1 - (the proportion of times that the
original absolute t-value exceeds the randomized
absolute t-value over all the permutations
conducted).
8. If a gene's p-value is less than
or equal to the user-specified critical
p-value, the gene's mean expression over all
experiments is significantly different from the
hypothesized mean.
9. Bonferroni and adjusted
Bonferroni corrections may be applied just as in
the two-sample t-test.
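The one-class permutation procedure (steps 5 and 7a to 7d) can be sketched for one gene vector as below. Assumptions: each element is flipped independently with probability 0.5 to realize "randomly pick some elements", and the eight log ratios are made up.

```python
import math
import random

def one_sample_t(values, hypothesized_mean=0.0):
    # t = (observed mean - hypothesized mean) / standard error of the mean
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (mean - hypothesized_mean) / (sd / math.sqrt(n))

def flip(value, hypothesized_mean):
    # Step 7a: reflect the element's deviation around the hypothesized mean.
    return value - 2.0 * (value - hypothesized_mean)

def sign_flip_p_value(values, hypothesized_mean=0.0, n_permutations=1000, seed=0):
    rng = random.Random(seed)
    t_original = abs(one_sample_t(values, hypothesized_mean))
    exceeded = 0
    for _ in range(n_permutations):                       # step 7c
        randomized = [
            flip(v, hypothesized_mean) if rng.random() < 0.5 else v
            for v in values
        ]
        t_random = abs(one_sample_t(randomized, hypothesized_mean))  # step 7b
        if t_original > t_random:
            exceeded += 1
    return 1.0 - exceeded / n_permutations                # step 7d

# Hypothetical log2 ratios of one gene across eight experiments.
gene_vector = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0]
p_value = sign_flip_p_value(gene_vector)
print(p_value)
```

With a hypothesized mean of zero, `flip` simply reverses the sign of the chosen elements.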
One Way Analysis of Variance (ANOVA)
  1. Assign experiments to > 2 groups

Group 2
Group 3
2. Question: Is mean expression level of a gene
the same across all groups?
3. Calculate an F-ratio for each gene, where
$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$, which is a measure of
$\frac{\text{Between groups variability}}{\text{Within groups variability}}$
The larger the value
of F, the greater the difference among the group
means relative to the sampling error variability
(which is the within groups variability), i.e.,
the larger the value of F, the more likely it is
that the differences among the group means
reflect real differences among the means of the
populations they are drawn from, rather than
being due to random sampling error.
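A worked example of the F-ratio for one gene, using the usual one-way ANOVA degrees of freedom (k - 1 for groups, N - k for error). The function name and the three groups of values are made up for illustration.

```python
def f_ratio(groups):
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups variability: group means around the grand mean.
    ss_groups = sum(len(g) * (m - grand_mean) ** 2
                    for g, m in zip(groups, group_means))
    # Within-groups variability: values around their own group mean.
    ss_error = sum((x - m) ** 2
                   for g, m in zip(groups, group_means) for x in g)
    mean_square_groups = ss_groups / (k - 1)
    mean_square_error = ss_error / (n_total - k)
    return mean_square_groups / mean_square_error

# Hypothetical expression values of one gene in three groups of experiments.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = f_ratio(groups)
print(f_value)
```

Here the group means are 2, 3 and 6, the sum of squares between groups is 26 (mean square 13) and the sum of squares within groups is 6 (mean square 1), so F = 13.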
ANOVA - 3
4. The p-value associated with an
F-value is the probability that an F-value that
large would be obtained if there were no
differences among group means (i.e., given the
null hypothesis). Therefore, the smaller the
p-value, the less likely it is that the null
hypothesis is valid, i.e., the differences among
group means are more likely to reflect real
population differences as p-values decrease.
ANOVA - 4
5. P-values can be obtained for the F-values from
a theoretical F-distribution, assuming that the
populations from which the data are obtained
  • are normally distributed, and
  • have homogeneous variances.

The test is considered robust to violations of
these assumptions, provided sample sizes are
relatively large and similar across groups.
ANOVA - 5
6. P-values can be obtained from
permutation tests (just like in t-tests), if one
does not want to rely on the assumptions needed
for using the F-distribution. P-values can also
be corrected for multiple comparisons (using
Bonferroni or other procedures). These features
will soon be implemented in MeV.
Two-factor ANOVA (TFA)
  • Can be used to find genes whose expression is
    different over two factors (e.g., sex and
    strain), as well as to look for genes with a
    significant interaction for these two factors.

Strain B
Strain C
Strain A
TFA - 2
TFA - 3
  • Ideally, design should be balanced, i.e., equal
    numbers of samples in each factor A x factor B combination.
  • If unbalanced, the analysis can still be
    conducted, but F-tests will
    be somewhat biased. May need to use smaller
  • Can have balanced designs with no replication
    (see below). In this case, interaction cannot be tested.

Significance analysis of microarrays (SAM)
  • SAM can be used to pick out significant genes
    based on differential expression between sets of samples.
  • Currently implemented for the following designs:
  • - two-class unpaired
  • - two-class paired
  • - multi-class
  • - censored survival
  • - one-class

SAM -2
  • SAM gives estimates of the False Discovery Rate
    (FDR), which is the proportion of genes likely to
    have been wrongly identified by chance as being significant.
  • It is a very interactive algorithm: it allows users
    to dynamically change thresholds for significance
    (through the tuning parameter delta) after
    looking at the distribution of the test statistic.
  • The ability to dynamically alter the input
    parameters based on immediate visual feedback,
    even before completing the analysis, should make
    the data-mining process more sensitive.

SAM designs
  • Two-class unpaired: to pick out genes whose mean
    expression level is significantly different
    between two groups of samples (analogous to
    between subjects t-test).
  • Two-class paired: samples are split into two
    groups, and there is a 1-to-1 correspondence
    between a sample in group A and one in group B
    (analogous to paired t-test).

SAM designs - 2
  • Multi-class: picks up genes whose mean expression
    is different across > 2 groups of samples
    (analogous to one-way ANOVA).
  • Censored survival: picks up genes whose
    expression levels are correlated with duration of survival.
  • One-class: picks up genes whose mean expression
    across experiments is different from a
    user-specified mean.

SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in the
expression matrix below, assign Experiments 1, 2 and 5 to group A, and
experiments 3, 4 and 6 to group B.

2. Question: Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
SAM Two-Class Unpaired 2
Permutation tests
i) For each gene, compute d-value (analogous to
t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.

iii) Randomly shuffle the values of the genes
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Compute the d-value for each randomized gene
Original grouping
Randomized grouping
SAM Two-Class Unpaired - 3
iv) Rank the permuted d-values of the genes in
ascending order
v) Repeat steps iii) and iv) many times, so that
each gene has many randomized d-values
corresponding to its rank from the
observed (unpermuted) d-value. Take the average
of the randomized d-values for each gene. This
is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.
SAM Two-Class Unpaired 4
SAM Two-Class Unpaired 5
  • For each permutation of the data, compute the
    number of positive and negative significant genes
    for a given delta as explained in the previous
    slide. The median number of significant genes
    from these permutations is the median number of
    falsely significant genes; dividing it by the
    number of genes called significant in the
    observed data gives the median False Discovery
    Rate.
  • The rationale behind this is that any genes
    designated as significant from the randomized
    data are being picked up purely by chance (i.e.,
    falsely discovered). Therefore, the median
    number picked up over many randomizations is a
    good estimate of the number of false discoveries.

SAM Two-Class Paired
  • Samples fall into two groups.
  • Each member of group A is associated with a
    member of group B in a 1-to-1 relationship.

A-B pair
SAM Two-Class Paired - 2
  • e.g., groups A and B could respectively represent
    before and after a drug treatment, and each
    A-B pair of samples could come from the same
    patient before and after the treatment.
  • or, groups A and B could represent two strains
    for which samples were collected at several
    time points over a time course study. A sample
    collected from each of strains A and B at the same
    time point could form an A-B pair.
  • The rest of the analysis is similar to two-class
    unpaired SAM. Positive significant genes are
    those for which Mean(Group B) is significantly
    larger than Mean(Group A), and the reverse is true
    for negative significant genes.
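The role of the A-B pairing can be illustrated with a short Python sketch. It rests on assumptions not stated in the slides: the paired d-value is taken to be the mean of the within-pair differences (B minus A) over their standard error plus a small constant s0, and randomization is assumed to swap the two members within a pair rather than mix samples across pairs. Both function names are hypothetical.

```python
import random
from statistics import mean, stdev

def paired_d_value(a, b, s0=0.1):
    """Assumed paired d-value: mean of (B - A) differences over (standard error + s0)."""
    diffs = [y - x for x, y in zip(a, b)]
    se = stdev(diffs) / len(diffs) ** 0.5
    return mean(diffs) / (se + s0)

def shuffle_within_pairs(a, b, rng):
    """Randomly swap the two members of each A-B pair, keeping every pair intact."""
    new_a, new_b = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        new_a.append(x)
        new_b.append(y)
    return new_a, new_b

# Hypothetical before/after log ratios for one gene in four patients
before = [1.0, 2.0, 3.0, 4.0]
after = [2.1, 2.9, 4.2, 5.0]
print(paired_d_value(before, after))  # positive: Mean(B) > Mean(A)
```

Because the statistic depends only on within-pair differences, large patient-to-patient variation does not mask a consistent treatment effect.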

SAM Multi-Class
  • Extension of SAM two-class unpaired to more
    than 2 groups
  • Experiments belong to one of at least three
    groups
  • Analogous to one-way between-subjects ANOVA

Group 2
Group 3
SAM Multi-Class - 2
  • This analysis yields only positive significant
    genes.
  • These are genes whose means are significantly
    different across some combination of the groups
    of experiments.

SAM Censored Survival
  • Each experiment (sample) is associated with a
    time, and a state at the time of observation.
  • The state is either dead or censored.
  • Censored means that the subject survived
    beyond the time point at which the sample was
    taken.
  • A positive score means that a higher expression
    level for that gene implies shorter survival
    (i.e., higher risk), whereas a negative score
    means that higher expression implies longer
    survival.

SAM One-Class
  • Used to pick up genes whose mean expression
    across experiments is different from a
    user-specified mean.
  • Analogous to a one-sample t-test.
  • Positive genes are those whose means are greater
    than the specified mean, while negative genes
    have means smaller than the specified mean.

Support Vector Machines (SVM)
  • supervised learning technique
  • uses supplied information such as presumptive
    biological relationships between a set of
    elements, and the expression profiles of elements
    to produce a binary classification of elements.

Supervised Learning
-begins with the definition of a class which
specifies in advance which elements should
cluster together.
-e.g., genes for enzymes in a common pathway or
part of a regulatory system, or samples of a
tissue type or from a particular strain.
-this information is used to train the
SVM to discriminate members from non-members.
SVM Process Overview
SVM Training
SVM Classification
Elements In Classification
Elements Out of Classification
SVM Classification
  • SVM attempts to find an optimal separating
    hyperplane between members of the two initial
    classes.

Separating hyperplane
Separation Problem
-an optimal hyperplane partitions the initial
classification correctly and maximizes distance
from the plane to elements on either side,
positive and negative examples.
-when the training examples (initial classification)
consist of very diverse expression patterns,
finding an optimal hyperplane can be impossible.
SVM Kernel Construction
  • The expression data can be transformed to a
    higher dimensional space (feature space) by
    applying a kernel function. This transformation
    can have the effect of allowing a separating
    hyperplane to be found.
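A toy Python sketch shows why the transformation helps. It assumes a polynomial kernel of the form (x·y + constant)^power, which is a common choice but is not confirmed here as MeV's exact kernel; the function names and the four XOR-like points are made up for illustration. The points cannot be split by a straight line in two dimensions, but in the degree-2 feature space a single coordinate separates them.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, constant=1.0, power=2):
    """Assumed polynomial kernel: (x . y + constant) ** power."""
    return (dot(x, y) + constant) ** power

def feature_map(x):
    """Explicit feature space of poly_kernel for 2-D input, constant=1, power=2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

positives = [(1, 1), (-1, -1)]   # hypothetical class members
negatives = [(1, -1), (-1, 1)]   # hypothetical non-members

# The kernel equals a plain dot product in the higher-dimensional space
for x in positives + negatives:
    for y in positives + negatives:
        assert abs(poly_kernel(x, y) - dot(feature_map(x), feature_map(y))) < 1e-9

# The third feature (sqrt(2) * x1 * x2) alone separates the two classes
print([feature_map(p)[2] > 0 for p in positives])
print([feature_map(n)[2] > 0 for n in negatives])
```

Raising the power adds more such product features, which is also why a high-degree kernel can separate the training data artificially.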

Practical SVM Issues
  • Results depend heavily on the input parameters.
  • Using a high degree kernel function risks
    artificial separation of the data.
  • An iterative approach to increasing the kernel
    power is advisable.

SVM Results
  • Two classes are produced
  • Positive Class contains elements with expression
    patterns similar to those in the positive
    examples in the training set.
  • Negative Class contains all other members of the
    input set.
  • Each of these classes has elements that fall in
    two groups
  • Those initially in the class (true positives and
    true negatives)
  • Those recruited into the class (false positives
    and false negatives)

K-Nearest Neighbor Classification KNNC - 1
  • supervised classification scheme
  • user specifies the number of expected classes
  • a training set of vectors is provided as input
  • user specifies classes of training vectors
  • training set should contain examples of each
    class

KNNC 2 pre-classification filters
  • Prior to classification, variance filtering can
    optionally be applied to all vectors (training
    set and vectors to be classified). This will
    filter out genes with low variance across
    experiments. Note that this might filter out
    some genes in the training set as well.
  • Correlation filtering can also be applied on the
    vectors to be classified. This would filter out
    those vectors in the set to be classified that
    are not significantly correlated with any gene
    in the training set.
  • Significance for correlation filtering is
    determined by a permutation test.

KNNC 3 - correlation filtering randomization
1. The Pearson correlation coefficient r is
computed between a given vector to be classified
and each member of the training set.
2. The maximum such r is called the rmax for that
vector.
3. The vector is randomized a user-specified number
of times, and each time, an rmax is calculated
using the randomized vector (call it r*max), just
as in steps 1 and 2.
4. The proportion of times r*max exceeds rmax over
all randomizations is the p-value for that vector.
5. If the p-value for a vector < the user-specified
p-value, that vector is retained for further
analysis.
6. Steps 1-5 are repeated for every vector in the
set to be classified.
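The randomization test for one vector can be sketched in Python as follows. This is a minimal illustration under assumptions: "randomizing" a vector is taken to mean shuffling the order of its values, and the function names and defaults are hypothetical rather than MeV's.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient r (0.0 if either vector is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def r_max(vector, training_set):
    """Largest correlation between the vector and any training vector."""
    return max(pearson(vector, t) for t in training_set)

def correlation_p_value(vector, training_set, n_random=1000, seed=0):
    """Proportion of randomizations whose rmax exceeds the observed rmax."""
    rng = random.Random(seed)
    observed = r_max(vector, training_set)
    shuffled = list(vector)
    exceed = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)  # assumed randomization: permute the vector's values
        if r_max(shuffled, training_set) > observed:
            exceed += 1
    return exceed / n_random
```

A vector would then be retained when `correlation_p_value(...)` is below the user-specified threshold.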
KNNC 4 - Classification parameters
  • Let v be a vector that needs to be classified,
    and T = {t1, t2, ..., t10} be the set of training
    vectors.
  • The user specifies the classes of each element
    of T. Say, there are 4 classes.
  • The user also specifies the number of neighbors
    k. Say, k = 5.

KNNC 5 - Classification
  • Suppose v's 5 nearest neighbors in set T (by
    Euclidean distance) are t1, t4, t8, t2, and t5.
  • Since class 1 is most frequently represented in
    v's nearest neighbors, v is assigned to class 1.
  • If there is a tie in frequency of classes
    represented among nearest neighbors, the vector
    remains unassigned.
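The voting rule, including the tie case, can be written as a short Python sketch. The function name and the example data are hypothetical; distances are Euclidean as on the slide.

```python
import math
from collections import Counter

def knn_classify(v, training, k):
    """training: list of (vector, class_label) pairs.
    Returns the majority class among the k nearest neighbors, or None on a tie."""
    neighbors = sorted(training, key=lambda item: math.dist(v, item[0]))[:k]
    counts = Counter(label for _, label in neighbors).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the most frequent classes: leave unassigned
    return counts[0][0]

# Hypothetical two-experiment training vectors with their classes
training = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1),
            ((5, 5), 2), ((5, 6), 2), ((6, 5), 2)]
print(knn_classify((0.2, 0.2), training, k=3))
```

Choosing an odd k reduces ties when there are only two classes, but with more classes ties can still occur, so the unassigned outcome has to be handled.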

EASE (Expression Analysis Systematic Explorer)
EASE analysis identifies prevalent biological
themes within gene clusters. The significance of
each identified theme is determined by its
prevalence in the cluster and in the population
of genes from which the cluster was drawn.
Diverse Biological Roles
Consider a population of genes representing a
diverse set of biological roles or themes shown
below as different colors.
Many algorithms can be applied to expression data
to partition genes based on expression profiles
over multiple conditions. Many of these
techniques work solely on expression data and
disregard biological information.
Consider a particular cluster
-What are some of the predominant biological
themes represented in the cluster and how should
significance be assigned to a discovered
biological theme?
Example: Population size 40 genes. Cluster
size 12 genes. 10 genes, shown in green, have a
common biological theme and 8 occur within the
cluster.
Consider the Outcome
80% of the genes related to the theme in the
population ended up within the relatively small
cluster.
Contingency Matrix
A 2x2 contingency matrix is typically used to
capture the relationships between cluster
membership and membership to a biological theme.
Assigning Significance to the Findings
Fisher's Exact Test permits us to determine
if there are non-random associations between the
two variables, expression-based cluster
membership and membership to a particular
biological theme.
                 In cluster   Not in cluster
In theme              8              2
Not in theme          4             26
p ≈ .0002
( 2x2 contingency matrix )
Hypergeometric Distribution
a b
c d
The probability of any particular matrix
occurring by random selection, given no
association between the two variables, is
given by the hypergeometric rule.
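For reference, the standard hypergeometric probability of the matrix with entries a, b, c, d, writing N = a + b + c + d (N is notation introduced here, not taken from the slide), is:

```latex
p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{N}{a+c}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}
```

The row and column totals are held fixed, so only one cell of the matrix is free to vary.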
Probability Computation
For our matrix,
8  2
4 26
we are not only interested in the probability of
getting exactly 8 annotation hits in the cluster
but rather the probability of having 8 or more
hits. In this case the probabilities of each of
the possible matrices are summed.
8  2      9  1      10  0
4 26      3 27       2 28
.0002207 + 7.27x10^-6 + 7.79x10^-8 ≈ .000228
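The sum can be checked with a short Python sketch of the one-sided test. The function names are hypothetical, and this is not EASE's own code; it only evaluates the hypergeometric probability of each matrix with the row and column totals held fixed.

```python
from math import comb

def hypergeom_prob(a, b, c, d):
    """Probability of the 2x2 matrix [[a, b], [c, d]] with its margins fixed."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def one_sided_fisher(a, b, c, d):
    """Sum the probabilities of the observed matrix and every more extreme one
    (more theme genes inside the cluster)."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += hypergeom_prob(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1  # shift one hit into the cluster
    return total

print(hypergeom_prob(8, 2, 4, 26))    # exactly 8 hits
print(one_sided_fisher(8, 2, 4, 26))  # 8 or more hits
```

Working with exact binomial coefficients avoids the overflow that a direct factorial formula would cause for larger gene populations.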
EASE Results
  • Consider all of the Results
  • EASE reports all themes represented in a cluster
    and although some themes may not meet statistical
    significance it may still be important to note
    that particular biological roles or pathways are
    represented in the cluster.
  • Independently Verify Roles
  • Once found, biological themes should be
    independently verified using annotation resources.

Basic EASE Requirements
Annotation keys: identifiers for each gene must
be loaded with the data into MeV.
EASE file system: EASE uses a file system to link
annotation keys to biological themes.
EASE File System
EASE (Expression Analysis Systematic Explorer)
Hosack et al. Identifying biological themes
within lists of genes with EASE. Genome Biol.,
4:R70-R70.8, 2003.
NIAID graciously provided the foundation Java
classes upon which the MeV version was built.
Coming Attractions
  • Algorithm scripting
  • Discriminant analysis
  • Chromosome Viewers
  • etc.