Analysis of Multiple Experiments

TIGR Multiple Experiment Viewer (MeV)

Advanced Course Coverage

- Introduction
  - fundamental concepts, expression vectors and distance metrics
  - fundamental statistical concepts encountered in MeV analysis modules
- Algorithm Coverage
  - Lecture / Hands-on Exercises
  - (refer to algorithm handout for order)

Microarray Data Flow

Scheduler (Machine Scheduling)

SliTrack (Machine Control)

PCR Score

MABCOS (Barcode System)

Exp Designer

.tiff Image File

Spotfinder (Image Analysis)

MADAM (Data Manager)

Expression Data

Raw .tav File

Miner (.tav File Creator)

Raw .tav File

MIDAS (Normalization)

GenePix Converter

Normalized .tav File

Query Window

MeV (Data Analysis)

Interpretation

The Expression Matrix is a representation of data from multiple microarray experiments.

Each element is a log ratio (usually log2(Cy5 / Cy3)).

- Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value.
- Green indicates a negative log ratio, i.e., Cy5 < Cy3.
- Gray indicates missing data.
- Red indicates a positive log ratio, i.e., Cy5 > Cy3.

Expression Vectors

- Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

[Figure: a gene expression vector; each value is log2(Cy5/Cy3).]

Expression Vectors As Points in Expression Space

        Exp 1   Exp 2   Exp 3
G1      -0.8    -0.3    -0.7
G2      -0.8    -0.7    -0.4
G3      -0.4    -0.6    -0.8
G4       0.9     1.2     1.3
G5       1.3     0.9    -0.6

[Figure: the genes plotted as points in a space with axes Experiment 1, Experiment 2 and Experiment 3; a group of nearby points is labelled "Similar Expression".]

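Distance and Similarity

- The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.
- Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.
- Selection of a distance metric defines the concept of distance.

Distance: a measure of similarity between genes (the smaller the distance, the more similar the genes).

- Some distances (MeV provides 11 metrics):
  - Euclidean: $\sqrt{\sum_{i=1}^{n} (x_{iA} - x_{iB})^2}$, where $x_{iA}$ and $x_{iB}$ are the values of genes A and B in experiment $i$, and $n$ is the number of experiments
  - Pearson correlation

[Figure: two points, p0 and p1, and the distance between them.]

Distance is Defined by a Metric

[Figure: pairs of expression profiles with their distances under different metrics; values shown: 1.4, -0.90, 4.2, -1.00.]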
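The two metrics named above can be sketched in a few lines of Python, applied to the five example genes from the expression-space table. This is an illustrative sketch rather than MeV's implementation; in particular, turning Pearson's r into a distance as 1 - r is an assumption made here for the example.

```python
import math

# Expression vectors from the example table (log2 ratios for Exp 1..3).
genes = {
    "G1": [-0.8, -0.3, -0.7],
    "G2": [-0.8, -0.7, -0.4],
    "G3": [-0.4, -0.6, -0.8],
    "G4": [0.9, 1.2, 1.3],
    "G5": [1.3, 0.9, -0.6],
}

def euclidean(a, b):
    """Square root of the summed squared differences over experiments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient r between two vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

def pearson_distance(a, b):
    """Assumed conversion of r into a distance: 0 when r = 1, 2 when r = -1."""
    return 1.0 - pearson(a, b)

# Compare G1 against every other gene under both metrics.
for name in ("G2", "G3", "G4", "G5"):
    d_e = euclidean(genes["G1"], genes[name])
    d_p = pearson_distance(genes["G1"], genes[name])
    print(f"G1 vs {name}: Euclidean = {d_e:.2f}, Pearson distance = {d_p:.2f}")
```

The two metrics can rank the same pairs differently: Euclidean distance responds to the magnitude of the values, whereas the correlation-based distance responds only to the shape of the profile.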

Statistical Concepts

Probability distributions

The probability of an event is the likelihood of its occurring. It is sometimes computed as a relative frequency (rf), where

$rf = \frac{\text{the number of favorable outcomes for an event}}{\text{the total number of possible outcomes for that event}}$

The probability of an event can sometimes be

inferred from a theoretical probability

distribution, such as a normal distribution.

Normal distribution

σ = std. deviation of the distribution

X = µ (mean of the distribution)

Less than a 5% chance that the sample with mean s came from population 1, i.e., s is significantly different from mean 1 at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population 2.

Many biological variables, such as height and

weight, can reasonably be assumed to approximate

the normal distribution. But expression

measurements? Probably not. Fortunately, many

statistical tests are considered to be fairly

robust to violations of the normality

assumption, and other assumptions used in these

tests. Randomization / resampling based tests

can be used to get around the violation of the

normality assumption. Even when parametric

statistical tests (the ones that make use of

normal and other distributions) are valid,

randomization tests are still useful.

Outline of a randomization test - 1

1. Compute the value of interest (i.e., the test-statistic s) from your data set.

2. Make "fake" data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion.

3. Re-compute s from the "fake" data set.

[Figure: the original data set yields s; each randomized data set yields a "fake" s.]

Outline of a randomization test - 2

4. Repeat steps 2 and 3 many times (often several hundred to several thousand times). Keep a record of the "fake" s values from step 3.

5. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized ("fake") s values.

[Figure: the range of randomized s values; the original s value could be significant as it exceeds most of the randomized s values.]

Outline of a randomization test - 3

Rationale: Ideally, we want to know the behavior of the larger population from which the sample is drawn, in order to make statistical inferences. Here, we don't know that the larger population behaves like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand. Our "fake" data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample).
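The five steps of the randomization test can be sketched as follows. The choice of test statistic (Pearson's r between two variables), the data values, the number of shuffles and the function names are all assumptions made for this illustration; any statistic s could be substituted.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient, used here as the test statistic s."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

def randomization_test(x, y, n_shuffles=1000, seed=0):
    rng = random.Random(seed)
    s = pearson(x, y)                        # step 1: statistic from the real data
    fake_values = []
    for _ in range(n_shuffles):              # step 4: repeat many times
        shuffled = y[:]
        rng.shuffle(shuffled)                # step 2: re-arrange the data at random
        fake_values.append(pearson(x, shuffled))  # step 3: re-compute s
    # Step 5: how often is a "fake" s at least as extreme as the original s?
    extreme = sum(1 for f in fake_values if abs(f) >= abs(s))
    return s, extreme / n_shuffles

# Hypothetical data with a strong linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]
s, p = randomization_test(x, y)
print(f"s = {s:.3f}, proportion of fake s values at least as extreme = {p:.3f}")
```

The returned proportion plays the role of a p-value: the smaller it is, the less plausible it is that the original s arose from a random arrangement of the data.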

The problem of multiple testing

(adapted from presentation by Anja von Heydebreck, Max-Planck-Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany: http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)

- Let's imagine there are 10,000 genes on a chip, AND
- None of them is differentially expressed.
- Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.

The problem of multiple testing - 2

- Let's say that applying this test to gene G1 yields a p-value of p = 0.01.
- Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, a result at least this extreme would arise by chance only about 1% of the time, i.e.,
- Even though we conclude that the gene is differentially expressed (because p < 0.05), we accept a small chance (here about 1%) of seeing such a result when the gene is in fact not differentially expressed.
- We might be willing to live with such a low probability of being wrong
- BUT .....

The problem of multiple testing - 3

- We are testing 10,000 genes, not just one!!!
- Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to live with a p-value of 0.05.
- If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn't sound too good.
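The arithmetic behind the 500 false conclusions can be checked with a small simulation. The sketch below assumes that, when no gene is differentially expressed, each gene's p-value behaves like a uniform random number between 0 and 1; the variable names and the random seed are illustrative.

```python
import random

n_genes = 10_000
alpha = 0.05

# Expected number of genes wrongly called significant when none truly differs.
expected_false_positives = n_genes * alpha

# Simulate one chip: every gene is null, so its p-value is uniform on [0, 1].
rng = random.Random(0)
p_values = [rng.random() for _ in range(n_genes)]
false_positives = sum(1 for p in p_values if p < alpha)

# Evaluating each gene at a much lower critical p-value shrinks the problem.
per_gene_cutoff = alpha / n_genes
false_positives_strict = sum(1 for p in p_values if p < per_gene_cutoff)

print(f"expected false positives:            {expected_false_positives:.0f}")
print(f"simulated false positives at 0.05:   {false_positives}")
print(f"simulated false positives at {per_gene_cutoff:g}: {false_positives_strict}")
```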

The problem of multiple testing - 4

- There are tricks we can use to reduce the severity of this problem.
- They all involve slashing the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.
- We'll go into some of these techniques later.

- Don't get too hung up on p-values.
- Ultimately, what matters is biological relevance.
- P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. Statistical significance is not necessarily the same as biological significance.
- i.e., you don't want to belong to "that group of people whose aim in life is to be wrong 5% of the time"!!!

Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental randomization. Biometrika 56:231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology, pg. 1. Chapman and Hall / CRC

Pearson correlation coefficient r

- Indicates the degree to which a linear relationship can be approximated between two variables.
- Can range from (-1.0) to (+1.0).
- Positive r between two variables X and Y: as X increases, so does Y on the whole.
- Negative r: as X increases, Y generally decreases.
- The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.

Pearson correlation - 2

- Sometimes, a p-value is associated with the correlation coefficient r.
- This p-value is computed from a theoretical distribution of the correlation coefficient, similar to the normal distribution.

This is the p-value for the null hypothesis that the X and Y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between X and Y. If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between X and Y.

Pearson correlation - 3

The square of the Pearson correlation, $r^2$, also known as the coefficient of determination, is a measure of the strength of the linear relationship between X and Y. It is the proportion of the total variation in X and Y that is explained by a linear relationship.

Algorithms

Hierarchical Clustering (HCL)

HCL is an agglomerative clustering method which

joins similar genes into groups. The iterative

process continues with the joining of resulting

groups based on their similarity until all groups

are connected in a hierarchical tree.

(HCL-1)

Hierarchical Clustering

g1 is most like g8

g4 is most like g1, g8

(HCL-2)

Hierarchical Clustering

g5 is most like g7

g5,g7 is most like g1, g4, g8

(HCL-3)

Hierarchical Tree

(HCL-4)

Hierarchical Clustering

During construction of the hierarchy, decisions

must be made to determine which clusters should

be joined. The distance or similarity between

clusters must be calculated. The rules that

govern this calculation are linkage methods.

(HCL-5)

Agglomerative Linkage Methods

- Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
- Three linkage methods that are commonly used are:
  - Single Linkage
  - Average Linkage
  - Complete Linkage

(HCL-6)

Single Linkage

Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create elongated clusters with individual genes chained onto clusters.

$D_{AB} = \min\, d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$

(HCL-7)

Average Linkage

Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.

$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$

(HCL-8)

Complete Linkage

Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.

$D_{AB} = \max\, d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
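The three linkage rules differ only in how they summarize the pairwise distances d(u, v) between members of clusters A and B. The sketch below assumes Euclidean distance for d and uses two small hypothetical clusters; the function names are illustrative and not taken from MeV.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_distances(cluster_a, cluster_b, d=euclidean):
    """All d(u, v) with u in A and v in B."""
    return [d(u, v) for u in cluster_a for v in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)  # len = N_A * N_B

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))

# Two hypothetical clusters of expression vectors (two experiments each).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print("single:  ", single_linkage(cluster_a, cluster_b))
print("average: ", average_linkage(cluster_a, cluster_b))
print("complete:", complete_linkage(cluster_a, cluster_b))
```

For any pair of clusters the single-linkage distance is never larger than the average, which in turn is never larger than the complete-linkage distance.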

(HCL-9)

Comparison of Linkage Methods

(HCL-10)

Bootstrapping (ST)

Bootstrapping: resampling with replacement

[Figure: the original expression matrix (Gene 1 to Gene 6) and various bootstrapped matrices (by experiments).]

Jackknifing (ST)

Jackknifing: resampling without replacement

[Figure: the original expression matrix and various jackknifed matrices (by experiments).]

Analysis of Bootstrapped and Jackknifed Support Trees

- Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure.
- Each time, hierarchical trees are created from the resampled matrices.
- The trees are compared to the tree obtained from the original data set.
- The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster.
- As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
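Resampling an expression matrix by experiments can be sketched as below. The matrix values are hypothetical, and the jackknife is written here as leaving out one randomly chosen experiment, which is an assumption about the procedure; the tree building and tree comparison steps are not shown.

```python
import random

# Hypothetical expression matrix: rows are genes, columns are 4 experiments.
matrix = {
    "Gene 1": [0.5, -0.2, 1.1, 0.3],
    "Gene 2": [-0.7, 0.4, 0.0, -1.2],
    "Gene 3": [1.3, 0.9, -0.6, 0.2],
}

def n_experiments(matrix):
    return len(next(iter(matrix.values())))

def select_columns(matrix, columns):
    """Build a new matrix from the given experiment (column) indices."""
    return {gene: [row[c] for c in columns] for gene, row in matrix.items()}

def bootstrap_by_experiments(matrix, rng):
    """Resample experiments WITH replacement: some repeat, some drop out."""
    n = n_experiments(matrix)
    columns = [rng.randrange(n) for _ in range(n)]
    return columns, select_columns(matrix, columns)

def jackknife_by_experiments(matrix, rng):
    """Resample WITHOUT replacement: leave out one randomly chosen experiment."""
    n = n_experiments(matrix)
    left_out = rng.randrange(n)
    columns = [c for c in range(n) if c != left_out]
    return columns, select_columns(matrix, columns)

rng = random.Random(0)
boot_columns, boot_matrix = bootstrap_by_experiments(matrix, rng)
jack_columns, jack_matrix = jackknife_by_experiments(matrix, rng)
print("bootstrap uses experiments:", boot_columns)
print("jackknife uses experiments:", jack_columns)
```

In a support-tree analysis each resampled matrix would then be clustered, and the resulting tree compared with the tree from the original matrix.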

K-Means / K-Medians Clustering (KMC) 1

1. Specify number of clusters, e.g., 5.

2. Randomly assign genes to clusters.

K-Means Clustering 2

3. Calculate mean / median expression profile of

each cluster.

4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene's expression profile.

5. Repeat steps 3 and 4 until genes cannot be

shuffled around any more, OR a user-specified

number of iterations has been reached.

K-Means / K-Medians is most useful when the user

has an a-priori hypothesis about the number of

clusters the genes should group into.
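Steps 1 to 5 of K-Means can be sketched as follows, using the mean profile and squared Euclidean distance. The handling of an empty cluster (re-seeding it from a randomly chosen gene) is an assumption added so the sketch always runs; the slides do not specify it, and this is not MeV's implementation.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_profile(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(genes, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    names = list(genes)
    # Step 2: randomly assign genes to clusters.
    assignment = {name: rng.randrange(k) for name in names}
    for _ in range(max_iterations):            # step 5: iteration limit
        # Step 3: mean expression profile of each cluster.
        centroids = []
        for cluster in range(k):
            members = [genes[n] for n in names if assignment[n] == cluster]
            if members:
                centroids.append(mean_profile(members))
            else:
                # Assumption: re-seed an empty cluster from a random gene.
                centroids.append(list(genes[rng.choice(names)]))
        # Step 4: move each gene to the cluster with the closest profile.
        new_assignment = {
            n: min(range(k), key=lambda c: squared_distance(genes[n], centroids[c]))
            for n in names
        }
        if new_assignment == assignment:       # step 5: nothing moved
            break
        assignment = new_assignment
    return assignment

# Step 1: specify the number of clusters (here 2) for five example genes.
genes = {
    "G1": [-0.8, -0.3, -0.7],
    "G2": [-0.8, -0.7, -0.4],
    "G3": [-0.4, -0.6, -0.8],
    "G4": [0.9, 1.2, 1.3],
    "G5": [1.3, 0.9, -0.6],
}
print(k_means(genes, k=2))
```

Because the initial assignment is random, different seeds can give different final clusters, which is the behavior that motivates running the algorithm several times.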

Principal Components (PCAG and PCAE) - 1

1. PCA simplifies the views of the data.

2. Suppose we have measurements for each gene on multiple experiments.

3. Suppose some of the experiments are correlated.

4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.

5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

PCAG and PCAE - 2

In this example:

- x-axis could mean a continuum from over- to under-expression (blue and green genes over-expressed, yellow genes under-expressed).
- y-axis could mean that gray genes are over-expressed in the first five expts and under-expressed in the remaining expts, while brown genes are under-expressed in the first five expts, and over-expressed in the remaining expts.
- z-axis might represent different cyclic patterns, e.g., red genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for purple genes.

Interpretation of components is somewhat subjective.

Cluster Affinity Search Technique (CAST)

- uses an iterative approach to segregate elements with high affinity into a cluster
- the process iterates through two phases:
  - addition of high affinity elements to the cluster being created
  - removal or clean-up of low affinity elements from the cluster being created

Clustering Affinity Search Technique (CAST) - 1

Affinity: a measure of similarity between a gene and all the genes in a cluster.

Threshold affinity: user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.

1. Create a new empty cluster C1.

2. Set initial affinity of all genes to zero

3. Move the two most similar genes into the new

cluster.

4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1)

ADD GENES

5. While there exists an unassigned gene whose

affinity to the cluster C1 exceeds

the user-specified threshold affinity, pick the

unassigned gene whose affinity is the

highest, and add it to cluster C1. Update the

affinities of all the genes accordingly.

CAST 2

REMOVE GENES

6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene's affinity its similarity to the removed gene.

7. Repeat step 6 while C1 contains a low-affinity gene.

[Figure: genes G1 to G15 divided between the current cluster C1 and the pool of unassigned genes.]

8. Repeat steps 5-7 as long as changes occur to

the cluster C1.

9. Form a new cluster with the genes that were

not assigned to cluster C1, repeating steps 1-8.

10. Keep forming new clusters following steps

1-9, until all genes have been assigned to a

cluster

QT-Clust (from Heyer et al. 1999) (HJC) - 1

1. Compute a jackknifed distance between all pairs of genes.

(Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outlier experiments.)

2. Choose a gene as the seed for a new cluster.

Add the gene which increases cluster diameter

the least. Continue adding genes until

additional genes will exceed the specified

cluster diameter limit.

3. Repeat step 2 for every gene, so that each

gene has the chance to be the seed of a new

cluster. All clusters are provisional at this

point.
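The jackknifed distance from step 1 can be sketched as below. The underlying distance is assumed here to be 1 - r (Pearson correlation), and the two gene vectors are hypothetical; they are chosen so that a single outlier experiment makes the genes look similar when all experiments are used.

```python
import math

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

def correlation_distance(a, b):
    """Assumed base distance: 0 for identical shape, 2 for opposite shape."""
    return 1.0 - pearson(a, b)

def jackknifed_distance(a, b, distance=correlation_distance):
    """Exclude each experiment in turn and keep the maximum distance."""
    distances = []
    for left_out in range(len(a)):
        a_rest = a[:left_out] + a[left_out + 1:]
        b_rest = b[:left_out] + b[left_out + 1:]
        distances.append(distance(a_rest, b_rest))
    return max(distances)

# Hypothetical genes: opposite patterns in experiments 1-4, but a shared
# outlier in experiment 5 that dominates the correlation.
gene_a = [0.1, -0.2, 0.1, 0.0, 3.0]
gene_b = [-0.1, 0.2, -0.1, 0.0, 3.0]

full = correlation_distance(gene_a, gene_b)
jackknifed = jackknifed_distance(gene_a, gene_b)
print(f"distance using all experiments: {full:.3f}")
print(f"jackknifed distance:            {jackknifed:.3f}")
```

With all five experiments the genes appear close; once the outlier experiment is excluded they turn out to be far apart, and the jackknifed distance reports that larger value.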

QT-Clust 2

4. Choose the largest cluster obtained from steps

2 and 3. In case of a tie, pick one of the

largest clusters at random.

[Figure: provisional clusters grown from seed genes (e.g., G3, G4, G9); the largest cluster is marked "Pick this cluster".]

5. All genes that are not in the cluster selected

above are treated as currently unassigned.

Repeat steps 2-4 on these unassigned genes.

6. Stop when the last cluster thus formed has

fewer genes than a user-specified number. All

genes that are not in a cluster at this point are

treated as unassigned.

Self Organizing Tree Algorithm

SOTA - 1

- Dopazo, J., and J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997.
- Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.

SOTA Characteristics

SOTA - 2

- Divisive clustering, allowing high level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors
- Data set is reduced to clusters arranged in a binary tree topology
- The number of resulting clusters is not fixed before clustering
- Neural network approach which has advantages similar to SOMs, such as handling large data sets that have large amounts of noise

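SOTA Topology

SOTA - 3

[Figure: SOTA topology showing a parent node, a winning cell and a sister cell, with labels for the centroid vector, the members, and the migration factors αp, αw and αs.]

α = migration factor (αs < αp < αw)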

Adaptation Overview

SOTA - 4

- each gene vector associated with the parent is compared to the centroid vector of its offspring cells.
- the most similar cell's centroid and its neighboring cells are adapted using the appropriate migration weights.

SOTA - 5

- following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.
- if the system diversity improves (decreases), then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.

SOTA - 6

The most diverse cell is selected for division

at the start of the next training cycle.

Growth Termination

SOTA - 7

Expansion stops when the most diverse cell's diversity falls below a threshold.

SOTA - 8

Each training cycle ends when the overall tree

diversity stabilizes. This triggers a cell

division and possibly a new training cycle.

Self-organizing maps (SOMs) 1

1. Specify the number of nodes (clusters)

desired, and also specify a 2-D geometry for the

nodes, e.g., rectangular or hexagonal

N Nodes G Genes

SOMs 2

2. Choose a random gene, e.g., G9

3. Move the nodes in the direction of G9. The

node closest to G9 (N2) is moved the most, and

the other nodes are moved by smaller varying

amounts. The further away the node is from N2,

the less it is moved.

SOM Neighborhood Options

- Gaussian Neighborhood: all nodes move, alpha is scaled.
- Bubble Neighborhood: some nodes move (those within the radius), alpha is constant.

[Figure: genes G7 to G11 and nodes N1 to N6 shown under each neighborhood option.]

SOMs 3

4. Steps 2 and 3 (i.e., choosing a random gene

and moving the nodes towards it) are repeated

many (usually several thousand) times. However,

with each iteration, the amount that the nodes

are allowed to move is decreased.

5. Finally, each node will nestle among a

cluster of genes, and a gene will be considered

to be in the cluster if its distance to the node

in that cluster is less than its distance to any

other node

[Figure: genes G1 to G29 grouped around the nodes N1 to N6.]

Template Matching

- template matching allows one to find expression vectors which match a provided template
- a template can be derived from:
  - a gene known to be central to the area of study
  - a sample or set of samples of a particular type
  - a cluster with a mean pattern of interest
  - a pattern constructed to reveal trends based on knowledge of the experimental design

PTM-2

- Sometimes it is useful to identify elements that have complementary patterns by selecting to use the absolute value of r.

K-Means / K-Medians Support (KMS)

- Because of the random initialization of K-Means / K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means / K-Medians.

1. Run K-Means / K-Medians multiple times.

2. The KMS module generates clusters in which the member genes frequently group together in the same clusters (consensus clusters) across multiple runs of K-Means / K-Medians.

3. The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.

Gene Shaving

Results in a series of nested clusters

Choose cluster of appropriate size as determined

by gap statistic calculation

Repeat until only one gene remains

Orthogonalize expression matrix with respect to

the average gene in the cluster and repeat

shaving procedure

Gene Shaving

Gap statistic calculation (choosing cluster size)

Quality measure for clusters:

- Between variance: variance of the mean gene across experiments
- Within variance: variance of each gene about the cluster average

Large $R^2$ implies a tight cluster of coherent genes.

- Create random permutations of the expression matrix and calculate $R^2$ for each.
- Compare $R^2$ of each cluster to that of the entire expression matrix.
- Choose the cluster whose $R^2$ is furthest from the average $R^2$ of the permuted expression matrices.

The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way.

Relevance Networks

Set of genes whose expression profiles are

predictive of one another.

Can be used to identify negative correlations

between genes

Genes with low entropy (least variable across

experiments) are excluded from analysis.

Relevance Networks

1. The expression pattern of each gene is compared to that of every other gene.

2. The ability of each gene to predict the expression of each other gene is assigned a correlation coefficient.

3. Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds (e.g., Tmin = 0.50, Tmax = 0.90) are eliminated.

4. The remaining relationships between genes define the subnets.

T-Tests (TTEST): Between subjects (or unpaired) - 1

1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

TTEST Between subjects - 2

3. Calculate t-statistic for each gene

4. Calculate probability value of the t-statistic

for each gene either from A. Theoretical

t-distribution OR B. Permutation tests.

TTEST - Between subjects - 3

Permutation tests

i) For each gene, compute t-statistic

ii) Randomly shuffle the values of the gene

between groups A and B, such that the reshuffled

groups A and B respectively have the same number

of elements as the original groups A and B.

Original grouping

Randomized grouping

TTEST - Between subjects - 4

Permutation tests - continued

iii) Compute t-statistic for the randomized gene.

iv) Repeat steps ii-iii n times (where n is specified by the user).

v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute value of the randomized t-statistic over n randomizations.

vi) Then, the p-value associated with the gene = 1 - (x/n)
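Steps i to vi of the permutation test can be sketched for a single gene as follows. The slides do not say which form of the t-statistic is used, so the unequal-variance (Welch) form is an assumption here, as are the gene values and function names.

```python
import math
import random

def t_statistic(a, b):
    """Two-sample t-statistic (assumed unequal-variance form)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)

def permutation_p_value(values, group_a_indices, n=1000, seed=0):
    rng = random.Random(seed)
    a = [values[i] for i in group_a_indices]
    b = [v for i, v in enumerate(values) if i not in group_a_indices]
    t_original = abs(t_statistic(a, b))              # step i
    x = 0
    for _ in range(n):                               # step iv
        shuffled = values[:]
        rng.shuffle(shuffled)                        # step ii: keep group sizes
        random_a, random_b = shuffled[:len(a)], shuffled[len(a):]
        t_random = abs(t_statistic(random_a, random_b))  # step iii
        if t_original > t_random:                    # step v
            x += 1
    return 1 - x / n                                 # step vi

# Hypothetical gene measured in Experiments 1..6; experiments 1, 2 and 5
# (indices 0, 1, 4) form group A, and experiments 3, 4 and 6 form group B.
gene = [2.1, 2.4, -0.3, 0.1, 1.9, -0.2]
p = permutation_p_value(gene, group_a_indices={0, 1, 4})
print(f"permutation p-value: {p:.3f}")
```

With only three experiments per group there are just 20 distinct ways to split the six values, so the p-value cannot get much below 0.1 however large the difference between groups; more replicates are needed for smaller p-values.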

TTEST - Between subjects - 5

5. Determine whether a gene's expression levels are significantly different between the two groups by one of three methods:

A) Just alpha: If the calculated p-value for a gene is less than or equal to the user-input alpha (critical p-value), the gene is considered significant.

OR

Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.

B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above.

TTEST - Between subjects - 6

5C) Adjusted Bonferroni:

i) The t-values for all the genes are ranked in descending order.

ii) For the gene with the highest t-value, the critical p-value becomes (alpha / N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (alpha / (N-1)), and so on.
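The two corrections can be compared on a handful of genes. The sketch below ranks genes by ascending p-value, which is assumed to be equivalent to ranking by descending t-value; the p-values themselves are hypothetical.

```python
def standard_bonferroni(p_values, alpha):
    """Every gene is tested against the same critical p-value alpha / N."""
    cutoff = alpha / len(p_values)
    return {gene: p <= cutoff for gene, p in p_values.items()}

def adjusted_bonferroni(p_values, alpha):
    """Critical p-value is alpha / N for the top-ranked gene,
    alpha / (N - 1) for the next, and so on."""
    n = len(p_values)
    ranked = sorted(p_values, key=p_values.get)  # smallest p-value first
    return {
        gene: p_values[gene] <= alpha / (n - rank)
        for rank, gene in enumerate(ranked)
    }

# Hypothetical p-values for four genes.
p_values = {"G1": 0.001, "G2": 0.015, "G3": 0.03, "G4": 0.2}
alpha = 0.05

standard = standard_bonferroni(p_values, alpha)
adjusted = adjusted_bonferroni(p_values, alpha)
for gene in p_values:
    print(gene, "standard:", standard[gene], "adjusted:", adjusted[gene])
```

In this example G2 fails the standard cutoff of 0.05 / 4 = 0.0125 but passes the adjusted cutoff of 0.05 / 3 for the second-ranked gene, which shows that the adjusted procedure is the less strict of the two.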

TTEST 1-class (or One-sample t-test) - 1

1. Used to test if the mean expression of a gene over all experiments is different from a hypothesized mean.

[Figure: expression matrix with columns Exp 1 to Exp 6; each row is a gene vector (Vector 1 = Gene 1, Vector 2 = Gene 2, Vector 3 = Gene 3).]

2. Question: Is the mean of the values of a given gene vector significantly different from a hypothesized mean?

TTEST - 1 Class - 2

3. Often, the hypothesized mean in gene expression studies is zero, meaning that we are looking for genes whose mean log2 ratio across all experiments is significantly different from zero.

4. Using 1-sample t-tests, we can select genes which, on average, show differential expression across all experiments (since genes with no differential expression should have a mean log2 ratio of zero across all expts).

5. Calculate t-value, where

$t = \frac{\text{Observed mean of gene vector} - \text{Hypothesized mean of gene vector}}{\text{Standard error of the mean of the gene vector}}$

TTEST 1 class - 3

6. Calculate p-value from a theoretical t-distribution, OR

7. By permutation:

7a. Randomly pick some elements of the gene vector, and change their values, such that the new value of the changed element is

new value = original value - 2 x (original value - hypothesized mean)

(i.e., flip the element's deviation around the hypothesized mean).

[Figure: an example gene vector before and after randomization with a hypothesized mean of zero; the randomly chosen elements are flipped around zero, the hypothesized mean.]

TTEST 1 class - 4

7b. Calculate t-value from the randomized gene.

7c. Repeat 7a and 7b as many times as desired. If all permutations are chosen, then every possible combination of elements in the gene vector is chosen for flipping.

7d. The p-value = 1 - (the proportion of times that the original absolute t-value exceeds the randomized absolute t-value over all the permutations conducted).

8. If a gene's p-value is less than or equal to the user-specified critical p-value, the gene's mean expression over all experiments is significantly different from the hypothesized mean.

9. Bonferroni and adjusted Bonferroni corrections may be applied just as in the two-sample t-test.
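Steps 5 and 7a to 7d of the one-class test can be sketched with the "all permutations" option, in which every possible combination of elements is flipped once. The gene values are hypothetical and the function names are illustrative.

```python
import itertools
import math

def one_sample_t(values, hypothesized_mean):
    """(observed mean - hypothesized mean) / standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (mean - hypothesized_mean) / math.sqrt(variance / n)

def flip(values, mask, hypothesized_mean):
    """Step 7a: flip the selected elements around the hypothesized mean."""
    return [
        v - 2 * (v - hypothesized_mean) if selected else v
        for v, selected in zip(values, mask)
    ]

def all_permutations_p_value(values, hypothesized_mean=0.0):
    t_original = abs(one_sample_t(values, hypothesized_mean))
    # Step 7c: every possible combination of elements chosen for flipping.
    masks = list(itertools.product([False, True], repeat=len(values)))
    exceeded = 0
    for mask in masks:
        flipped = flip(values, mask, hypothesized_mean)
        t_random = abs(one_sample_t(flipped, hypothesized_mean))  # step 7b
        if t_original > t_random:
            exceeded += 1
    return 1 - exceeded / len(masks)                              # step 7d

# Hypothetical gene vector over six experiments, all log2 ratios above zero.
gene = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0]
p = all_permutations_p_value(gene, hypothesized_mean=0.0)
print(f"p-value over all {2 ** len(gene)} sign combinations: {p}")
```

For this gene only the unflipped vector and the fully flipped vector match the original absolute t-value, so the p-value is 2 / 64.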

One Way Analysis of Variance (ANOVA)

1. Assign experiments to > 2 groups.

[Figure: expression matrix with the experiments assigned to groups (e.g., Group 2, Group 3).]

2. Question: Is mean expression level of a gene the same across all groups?

ANOVA - 2

3. Calculate an F-ratio for each gene, where

$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$, which is a measure of $\frac{\text{Between groups variability}}{\text{Within groups variability}}$

The larger the value of F, the greater the difference among the group means relative to the sampling error variability (which is the within groups variability), i.e., the larger the value of F, the more likely it is that the differences among the group means reflect real differences among the means of the populations they are drawn from, rather than being due to random sampling error.
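The F-ratio for one gene can be computed directly from its definition. The group values below are hypothetical, chosen so that the arithmetic is easy to follow by hand; the function name is illustrative.

```python
def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error)."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)           # number of groups
    n_total = len(all_values)

    # Between groups variability: group means around the grand mean.
    ss_groups = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within groups variability: values around their own group mean.
    ss_error = sum(
        sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups
    )
    ms_groups = ss_groups / (k - 1)
    ms_error = ss_error / (n_total - k)
    return ms_groups / ms_error

# Hypothetical gene measured in three groups of three experiments each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
F = f_ratio(groups)
print(f"F = {F:.2f}")
```

Here the group means are 2, 3 and 4 around a grand mean of 3, giving a mean square for groups of 6 / 2 = 3 and a mean square for error of 6 / 6 = 1.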

ANOVA - 3

4. The p-value associated with an F-value is the probability that an F-value that large would be obtained if there were no differences among group means (i.e., given the null hypothesis). Therefore, the smaller the p-value, the less likely it is that the null hypothesis is valid, i.e., the differences among group means are more likely to reflect real population differences as p-values decrease in magnitude.

ANOVA - 4

5. P-values can be obtained for the F-values from a theoretical F-distribution, assuming that the populations from which the data are obtained

- are normally distributed, and
- have homogeneous variances.

The test is considered robust to violations of these assumptions, provided sample sizes are relatively large and similar across groups.

ANOVA - 5

6. P-values can be obtained from permutation tests (just like in t-tests), if one does not want to rely on the assumptions needed for using the F-distribution. P-values can also be corrected for multiple comparisons (using Bonferroni or other procedures). These features will soon be implemented in MeV.

Two-factor ANOVA (TFA)

- Can be used to find genes whose expression is significantly different over two factors (e.g., sex and strain), as well as to look for genes with a significant interaction for these two factors.

[Figure: a two-factor design with strain (Strain A, Strain B, Strain C) as one factor and sex (Male, Female) as the other.]

TFA - 2

TFA - 3

- Ideally, design should be balanced, i.e., equal numbers of samples in each factor A x factor B combination.
- If unbalanced, the analysis can still be conducted, but F-tests will be somewhat biased. May need to use smaller p-values.
- Can have balanced designs with no replication (see below). In this case, interaction cannot be tested.

Significance analysis of microarrays (SAM)

- SAM can be used to pick out significant genes based on differential expression between sets of samples.
- Currently implemented for the following designs:
  - two-class unpaired
  - two-class paired
  - multi-class
  - censored survival
  - one-class

SAM - 2

- SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant.
- It is a very interactive algorithm: it allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.
- The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.

SAM designs

- Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test).
- Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between a sample in group A and one in group B (analogous to paired t-test).

SAM designs - 2

- Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA).
- Censored survival: picks up genes whose expression levels are correlated with duration of survival.
- One-class: picks up genes whose mean expression across experiments is different from a user-specified mean.

SAM Two-Class Unpaired

1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

SAM Two-Class Unpaired 2

Permutation tests

i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.

ii) Rank the genes in ascending order of their d-values.

iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.

[Figure: original grouping vs. randomized grouping]

SAM Two-Class Unpaired - 3

iv) Rank the permuted d-values of the genes in ascending order.

v) Repeat steps iii) and iv) many times, so that each gene accumulates many randomized d-values, one per permutation, taken at the rank that the gene holds in the observed (unpermuted) ordering. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.

vi) Plot the observed d-values vs. the expected d-values.
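Steps i) to vi) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MeV's implementation: the slides do not give the d-value formula, so the sketch assumes the usual form (difference of group means divided by a pooled standard error plus a small constant `s0`), and it shuffles sample labels jointly for all genes. The names `d_value`, `observed_and_expected`, `s0` and the toy matrix are hypothetical.

```python
import random
from statistics import mean


def d_value(values, group_a, group_b, s0=0.1):
    """Assumed d-value: (mean B - mean A) / (pooled standard error + s0)."""
    a = [values[i] for i in group_a]
    b = [values[i] for i in group_b]
    mean_a, mean_b = mean(a), mean(b)
    sum_sq = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    pooled_se = ((1 / len(a) + 1 / len(b)) * sum_sq / (len(a) + len(b) - 2)) ** 0.5
    return (mean_b - mean_a) / (pooled_se + s0)


def observed_and_expected(matrix, group_a, group_b, n_perm=200, seed=0):
    """Return ranked observed d-values and the rank-wise average of permuted ones."""
    rng = random.Random(seed)
    # Steps i) and ii): observed d-values, ranked in ascending order.
    observed = sorted(d_value(row, group_a, group_b) for row in matrix)
    columns = list(group_a) + list(group_b)
    rank_sums = [0.0] * len(matrix)
    for _ in range(n_perm):
        # Step iii): reshuffle the experiments, keeping the group sizes.
        shuffled = columns[:]
        rng.shuffle(shuffled)
        perm_a, perm_b = shuffled[:len(group_a)], shuffled[len(group_a):]
        # Step iv): rank the permuted d-values.
        permuted = sorted(d_value(row, perm_a, perm_b) for row in matrix)
        for rank, d in enumerate(permuted):
            rank_sums[rank] += d
    # Step v): average per rank gives the expected d-values.
    expected = [total / n_perm for total in rank_sums]
    return observed, expected


# Toy matrix: rows are genes, columns are Experiments 1 to 6 (0-based indices).
matrix = [
    [0.1, 0.2, 1.5, 1.4, 0.0, 1.6],
    [1.2, 1.1, -0.3, -0.2, 1.3, -0.1],
    [0.0, 0.1, -0.1, 0.05, 0.02, -0.03],
    [0.5, -0.4, 0.3, -0.2, 0.1, 0.0],
]
# Experiments 1, 2 and 5 in group A; Experiments 3, 4 and 6 in group B.
observed, expected = observed_and_expected(matrix, [0, 1, 4], [2, 3, 5])
```

Plotting `observed` against `expected` (step vi) would then show which genes deviate from the diagonal.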

SAM Two-Class Unpaired 4

SAM Two-Class Unpaired 5

- For each permutation of the data, compute the number of positive and negative significant genes for a given delta as explained in the previous slide. The median number of significant genes from these permutations is the median number of falsely significant genes; expressed as a proportion of the genes called significant in the observed data, it gives the median False Discovery Rate.
- The rationale behind this is that any genes designated as significant from the randomized data are being picked up purely by chance (i.e., falsely discovered). Therefore, the median number picked up over many randomizations is a good basis for estimating the false discovery rate.
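The counting argument can be illustrated with a small Python sketch. It assumes the delta rule has already been turned into a lower and an upper d-value cutoff (how delta produces those cutoffs is not shown in this transcript), and it takes the FDR as the median false count divided by the number of genes called significant in the observed data. The function name, cutoffs and d-values are hypothetical illustration data.

```python
from statistics import median


def estimate_fdr(observed_d, permuted_d_sets, lower_cut, upper_cut):
    """Median count of falsely significant genes and the resulting FDR estimate."""

    def count_significant(d_values):
        # Positive significant: d >= upper_cut; negative significant: d <= lower_cut.
        return sum(1 for d in d_values if d >= upper_cut or d <= lower_cut)

    called = count_significant(observed_d)
    median_false = median([count_significant(p) for p in permuted_d_sets])
    fdr = median_false / called if called else 0.0
    return called, median_false, fdr


observed_d = [-4.0, -0.5, 0.2, 0.9, 3.5, 5.0]
permuted_d_sets = [
    [-3.2, -1.0, 0.0, 0.5, 1.0, 2.0],
    [-1.0, -0.5, 0.0, 0.3, 1.0, 3.1],
    [-2.0, -1.0, 0.0, 1.0, 2.0, 2.5],
]
called, median_false, fdr = estimate_fdr(observed_d, permuted_d_sets, -3.0, 3.0)
```

Here three genes are called significant in the observed data and the permutations yield one, one and zero significant genes, so the median false count is 1 and the estimated FDR is about one third.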

SAM Two-Class Paired

- Samples fall into two groups.
- Each member of group A is associated with a member of group B in a 1-to-1 relationship (an A-B pair).

SAM Two-Class Paired - 2

- E.g., groups A and B could respectively represent before and after a drug treatment, and each A-B pair of samples could come from the same patient before and after the treatment.
- Or, groups A and B could represent two strains for which samples were collected at several time points over a time course study. A sample collected from each of strains A and B at the same time point could form an A-B pair.
- The rest of the analysis is similar to two-class unpaired SAM. Positive significant genes are those for which Mean(Group B) is significantly larger than Mean(Group A), and the reverse is true for negative significant genes.
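A paired d-value can be sketched as follows. The slides do not state the formula, so this assumes a form analogous to the paired t-statistic: the mean of the per-pair differences (B minus A) divided by their standard error plus a small constant `s0`. The function name, `s0` and the sample values are hypothetical.

```python
from statistics import mean, stdev


def paired_d_value(group_a, group_b, s0=0.1):
    """Assumed paired d-value: mean(B - A) / (standard error of differences + s0)."""
    # group_a[i] and group_b[i] form one A-B pair (e.g., the same patient).
    diffs = [b - a for a, b in zip(group_a, group_b)]
    standard_error = stdev(diffs) / len(diffs) ** 0.5
    return mean(diffs) / (standard_error + s0)


before = [0.1, -0.2, 0.0, 0.3]   # group A, one value per patient
after = [1.1, 0.9, 1.2, 1.0]     # group B, same patients in the same order
d = paired_d_value(before, after)
d_reversed = paired_d_value(after, before)
```

With this sign convention a gene whose expression rises from A to B gets a positive d-value, matching the definition of positive significant genes above; swapping the groups flips the sign.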

SAM Multi-Class

- Extension of SAM two-class unpaired to more than 2 groups
- Experiments belong to one of at least three groups
- Analogous to one-way between-subjects ANOVA

Group 2

Group 3

SAM Multi-Class - 2

- This analysis yields only positive significant genes.
- These are genes whose means are significantly different across some combination of the groups of experiments.

SAM Censored Survival

- Each experiment (sample) is associated with an observation time, and a state at the time of observation.
- The state is either dead or censored.
- Censored means that the subject survived beyond the time point at which the sample was taken.
- A positive score means that a higher expression level for that gene implies shorter survival (i.e., higher risk), whereas a negative score means that higher expression implies longer survival.

SAM One-Class

- used to pick up genes whose mean expression across experiments is different from a user-specified mean
- analogous to a one-sample t-test
- positive genes are those whose means are greater than the specified mean, while negative genes have means smaller than the specified mean

Support Vector Machines (SVM)

- supervised learning technique
- uses supplied information such as presumptive

biological relationships between a set of

elements, and the expression profiles of elements

to produce a binary classification of elements.

Supervised Learning

- begins with the definition of a class which specifies in advance which elements should cluster together
- e.g., genes may code for enzymes in a common pathway or be part of a regulatory system, or samples may be of one tissue type or from a particular strain
- this information is used to train the SVM to discriminate members from non-members

SVM Process Overview

[Figure labels: SVM Training; SVM Classification; Elements In Classification; Elements Out of Classification]

SVM Classification

- SVM attempts to find an optimal separating

hyperplane between members of the two initial

classifications.

Separating hyperplane

Separation Problem

- an optimal hyperplane partitions the initial classification correctly and maximizes the distance from the plane to the elements on either side, the positive and negative examples
- when the training examples (initial classification) consist of very diverse expression patterns, finding an optimal hyperplane can be impossible

SVM Kernel Construction

- The expression data can be transformed to a

higher dimensional space (feature space) by

applying a kernel function. This transformation

can have the effect of allowing a separating

hyperplane to be found.
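A toy Python example can make the effect concrete. The four points below follow the classic XOR layout, which no straight line in the original two dimensions separates; after adding the product of the two coordinates as a third feature, a plane does. The feature map is chosen by hand for illustration and is an assumption, not the kernel MeV applies; a kernel function achieves the same kind of transformation implicitly.

```python
# Four training points with labels: +1 (positive examples), -1 (negative examples).
points = {
    (1.0, 1.0): +1,
    (-1.0, -1.0): +1,
    (1.0, -1.0): -1,
    (-1.0, 1.0): -1,
}


def feature_map(x):
    """Hypothetical map from 2-D input space to a 3-D feature space."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def classify(x, weights=(0.0, 0.0, 1.0), bias=0.0):
    """Side of the hyperplane weights . phi(x) + bias = 0 in feature space."""
    phi = feature_map(x)
    score = sum(w * f for w, f in zip(weights, phi)) + bias
    return 1 if score > 0 else -1


all_correct = all(classify(p) == label for p, label in points.items())
```

In feature space the hyperplane "third coordinate = 0" puts both positive examples on one side and both negative examples on the other.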

Practical SVM Issues

- Results depend heavily on the input parameters.
- Using a high degree kernel function risks artificial separation of the data.
- An iterative approach to increasing the kernel power is advisable.

SVM Results

- Two classes are produced:
  - Positive Class: contains elements with expression patterns similar to those in the positive examples in the training set.
  - Negative Class: contains all other members of the input set.
- Each of these classes has elements that fall in two groups:
  - Those initially in the class (true positives and true negatives)
  - Those recruited into the class (false positives and false negatives)

K-Nearest Neighbor Classification (KNNC) - 1

- supervised classification scheme
- user specifies the number of expected classes
- a training set of vectors is provided as input
- user specifies classes of training vectors
- training set should contain examples of each class

KNNC 2 - pre-classification filters

- Prior to classification, variance filtering can optionally be applied to all vectors (both the training set and the vectors to be classified). This will filter out genes with low variance across experiments. Note that this might filter out some genes in the training set as well.
- Correlation filtering can also be applied to the vectors to be classified. This filters out those vectors in the set to be classified that are not significantly correlated with any gene in the training set.
- Significance for correlation filtering is determined by a permutation test.

KNNC 3 - correlation filtering randomization test

1. The Pearson correlation coefficient r is computed between a given vector to be classified and each member of the training set.
2. The maximum such r is called the rmax for that vector.
3. The vector is randomized a user-specified number of times, and each time an rmax is calculated using the randomized vector (call it rmax*), just as in steps 1 and 2.
4. The proportion of times rmax* exceeds rmax over all randomizations is the p-value for that vector.
5. If the p-value for a vector is less than the user-specified p-value, that vector is retained for further analysis.
6. Steps 1-5 are repeated for every vector in the set to be classified.
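The randomization test for a single vector can be sketched in Python as below. This is a minimal illustration that assumes "randomizing" a vector means shuffling its values across experiments; the function names, the number of permutations and the toy vectors are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def r_max(vector, training_set):
    """Largest correlation between the vector and any training vector."""
    return max(pearson(vector, t) for t in training_set)


def correlation_p_value(vector, training_set, n_perm=500, seed=0):
    """Proportion of shuffles whose r_max exceeds the observed r_max."""
    rng = random.Random(seed)
    observed = r_max(vector, training_set)
    exceed = 0
    for _ in range(n_perm):
        shuffled = list(vector)
        rng.shuffle(shuffled)
        if r_max(shuffled, training_set) > observed:
            exceed += 1
    return exceed / n_perm


training_set = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.0, 3.0, 2.0, 5.0, 4.0, 6.5],
]
candidate = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
p_value = correlation_p_value(candidate, training_set)
keep = p_value < 0.05  # user-specified p-value threshold (assumed here)
```

The candidate is a scaled copy of the first training vector, so shuffling it rarely produces a correlation as high as the observed one and the vector is retained.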

KNNC 4 - Classification parameters

- Let v be a vector that needs to be classified, and T = {t1, t2, ..., t10} be the set of training vectors.
- The user specifies the class of each element of T. Say there are 4 classes.
- The user also specifies the number of neighbors k. Say k = 5.

KNNC 5 - Classification

- Suppose v's 5 nearest neighbors in set T (by Euclidean distance) are t1, t4, t8, t2, and t5.
- Since class 1 is most frequently represented among v's nearest neighbors, v is assigned to class 1.
- If there is a tie in the frequency of classes represented among the nearest neighbors, the vector remains unassigned.
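The voting rule, including the tie case, can be sketched in Python. This is a minimal illustration rather than MeV's code; the function name and the toy training set are hypothetical, and `None` is used here to mean "unassigned".

```python
from collections import Counter
from math import dist


def knn_classify(v, training, k):
    """Assign v to the most frequent class among its k nearest training vectors.

    training is a list of (vector, class_label) pairs. Returns None when the
    most frequent classes are tied, i.e., the vector remains unassigned.
    """
    neighbors = sorted(training, key=lambda item: dist(v, item[0]))[:k]
    counts = Counter(label for _, label in neighbors).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


training = [
    ((0.0, 0.0), 1), ((0.0, 1.0), 1), ((1.0, 0.0), 1),
    ((5.0, 5.0), 2), ((5.0, 6.0), 2), ((6.0, 5.0), 2),
]
assigned = knn_classify((0.2, 0.2), training, k=3)
tied = knn_classify((1.0, 0.0), [((0.0, 0.0), "A"), ((2.0, 0.0), "B")], k=2)
```

In the first call all three nearest neighbors belong to class 1; in the second call the two neighbors belong to different classes, so the vector stays unassigned.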

EASE (Expression Analysis Systematic Explorer)

EASE analysis identifies prevalent biological themes within gene clusters. The significance of each identified theme is determined by its prevalence in the cluster and in the population of genes from which the cluster was created.

Diverse Biological Roles

Consider a population of genes representing a

diverse set of biological roles or themes shown

below as different colors.

Many algorithms can be applied to expression data

to partition genes based on expression profiles

over multiple conditions. Many of these

techniques work solely on expression data and

disregard biological information.

Consider a particular cluster

- What are some of the predominant biological themes represented in the cluster, and how should significance be assigned to a discovered biological theme?

Example

Population size: 40 genes

Cluster size: 12 genes

10 genes, shown in green, have a common biological theme, and 8 of them occur within the cluster.

Consider the Outcome

80% of the genes related to the theme in the population (8 of 10) ended up within the relatively small cluster.

Contingency Matrix

A 2x2 contingency matrix is typically used to capture the relationships between cluster membership and membership in a biological theme.


Assigning Significance to the Findings

Fisher's Exact Test permits us to determine whether there are non-random associations between the two variables: expression-based cluster membership and membership in a particular biological theme.

2x2 contingency matrix:

                   Cluster
                   in     out
    Theme   in      8       2
            out     4      26

p ≈ .0002

Hypergeometric Distribution

    a  b
    c  d

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule.
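The rule itself is not reproduced in this transcript. For reference, the standard hypergeometric probability of a 2x2 matrix with cells a, b (first row) and c, d (second row), where n = a + b + c + d, can be written as:

```latex
p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}
```

Here a + b and c + d are the row totals, and a + c and b + d are the column totals, all of which are held fixed.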

Probability Computation

For our matrix,

    8   2
    4  26

we are interested not only in the probability of getting exactly 8 annotation hits in the cluster, but rather in the probability of having 8 or more hits. In this case the probabilities of each of the possible matrices are summed.

    8   2        9   1       10   0
    4  26        3  27        2  28

.0002207 + 7.27x10^-6 + 7.79x10^-8 ≈ .000228
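The summation can be reproduced with a short Python sketch. It uses the standard hypergeometric formula and the example's counts (population 40, theme 10, cluster 12, 8 hits); the function names are hypothetical, and this is a one-sided calculation for illustration rather than EASE's own code.

```python
from math import comb


def hypergeom_prob(hits, theme_size, cluster_size, population_size):
    """Probability of exactly `hits` theme genes in the cluster by random selection."""
    return (
        comb(theme_size, hits)
        * comb(population_size - theme_size, cluster_size - hits)
        / comb(population_size, cluster_size)
    )


def enrichment_p_value(hits, theme_size, cluster_size, population_size):
    """Probability of `hits` or more theme genes in the cluster."""
    max_hits = min(theme_size, cluster_size)
    return sum(
        hypergeom_prob(k, theme_size, cluster_size, population_size)
        for k in range(hits, max_hits + 1)
    )


exactly_8 = hypergeom_prob(8, 10, 12, 40)
p = enrichment_p_value(8, 10, 12, 40)  # about 0.000228, as in the sum above
```

The three terms of the sum correspond to 8, 9 and 10 hits, i.e., the three matrices shown above.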

EASE Results

- Consider all of the Results
  - EASE reports all themes represented in a cluster, and although some themes may not meet statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.
- Independently Verify Roles
  - Once found, biological themes should be independently verified using annotation resources.

Basic EASE Requirements

Annotation keys: identifiers for each gene must be loaded with the data into MeV.

EASE file system: EASE uses a file system to link annotation keys to biological themes.

EASE File System

EASE (Expression Analysis Systematic Explorer)

Hosack et al. Identifying biological themes within lists of genes with EASE. Genome Biol., 4:R70-R70.8, 2003.

NIAID graciously provided the foundation Java classes upon which the MeV version was built.

Coming Attractions

- Algorithm scripting
- Discriminant analysis
- Chromosome Viewers
- etc.