1 / 256

Analysis of Large Scale Gene Expression Data

- John Quackenbush
- September 2005

Microarray Analysis

- General microarray overview
- How do platforms compare?
- Should I subtract background?
- TM4 Array Solutions
- Annotating arrays
- How complete is the genome?
- Tools for array analysis
- Experimental design
- Normalization
- What do we measure?
- Some concepts from statistics
- Distributions
- Randomization Tests
- Multiple Testing
- Finding significant genes
- Fold Change
- T-tests
- ANOVA
- SAM

- Hierarchical clustering
- Bootstrapping
- Jack-knifing
- Self-organizing trees
- K-means/k-medians Clustering
- Self-organizing maps
- Template matching
- Gene shaving
- Relevance networks
- Principal Components Analysis
- Support Vector Machines (SVM)
- k-Nearest Neighbors
- EASE Meta-Analysis
- Demo Data Set 1

General Microarray Analysis

Levels of Biological Information DNA mRNA Prote

ins Informational Pathways Informational

Networks Cells Organs Individuals Populations

Ecologies

Genomics Functional Genomics Proteomics Metabolo

mics Systems Biology Cellular Biology Medicine Med

icine Genetics Ecology

The Future!

February 2001 Completion of the Draft Human

Genome

April 14, 2003 The Human Genome is completed

again!

October 2004 The Human Genome is now really

finished!

But what does finished mean???

Where are the pressing questions?

- Can we find the genes and assign them functions?
- Can we predict protein structures and functions?
- Can we reconstruct metabolic, signaling, and

other pathways? - Can we reconstruct informational networks?
- Can we link genotype to phenotype?
- Can we use genotype/phenotype to predict relevant

outcome? - Can we use cross-species comparisons to learn

something?

The Beast Microarray Robot from Intelligent

Automation lthttp//www.ias.comgt

Microarray Overview I

Microarray Overview II

Microarray Overview II

Affymetrix GeneChip Expression Analysis

Generate DNA Sequence

Design and synthesize chips

ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC

ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC

Affymetrix GeneChip Expression Analysis

Microarray Expression Analysis

Gene

32,448 element mouse array

Thanks to M. Ko (NIA) and B. Soares (BMAP)

kidney vs. heart 15?g total RNA

Shuibang Wang, Yan Yu, Renee Gaspard

Steps in the Process

- Select array elements and annotate them
- Build a database to manage stuff
- Print arrays and manage the lab
- Hybridize and analyze images manage data
- Analyze hybridization data and get results

What can I do with arrays?

Types of Experiments

- Class Comparison
- Can I find genes that distinguish between two

classes, such as tumor and normal? - Class Discovery
- Given what I think is a uniform group of samples,

can I find subsets that are biologically

meaningful? - Classification
- Given a set of samples in different classes, can

I assign a new, unknown sample to one of the

classes? - Mechanistic Studies
- Can I discover a causative mechanism associated

with the distinction between classes?

How do platforms compare?

Microarray Platform Comparisons

Platform comparisons may result in little to no

replication of gene expression levels. Findings

may replicate within a platform (top), but not

between platforms (bottom). Spotted long (80mer)

oligos versus Affymetrix 25mer chip.

Rogojina et al. 2003. Use of Affymetrix to

spotted oligonucleotide microarrays using two

retinal epithelial cell lines. Molecular Vision

9482-96

Experimental Design

Gene expression in Angiotensin II-induced

hypertension acute versus chronic effects

Saline Ang II

- Two Factors
- Factor 1 Treatment saline or Angiotensin II
- Factor 2 Time
- acute (24 h) or chronic (14 day)

acute chronic

Initial RNA samples 1.5 ug total RNA Total RNA

was amplified (TIGR SOP M022) to generate

antisense cRNA cRNA yield 40 - 60 ug from

amplification

Jennie Larkin, Harry Gavras thanks to Affymetrix

Experimental Design

Total RNA cRNA aminoallyl labelled cDNA 1.5

ug 50 ug

amplification

TIGR Microarray Protocol

Affymetrix GeneChip Protocol

Cy3/Cy5 dye coupling and hybridization Wash and

scan

Second cycle small sample protocol Biotinylated

cRNA Fragmentation Wash and stain Scan

DATA

Jennie Larkin, Harry Gavras thanks to Affymetrix

Affymetrix TIGR shared data

- 11,714 TCs were present on both platforms
- 5854 had values in gt 80 of experiments
- 92 (N 5358) did not statistically differ

between platforms (2-factor ANOVA, a .05)

Jennie Larkin, Harry Gavras thanks to Affymetrix

How do platforms compare?

Jennie Larkin, Harry Gavras thanks to Affymetrix

Two platforms similar values

- Both OSF2 and procollagen type III alpha 1 were

up-regulated in chronic Ang II treatment. - Even the pattern of variability between

biological replicates was reproduced between

platforms.

Jennie Larkin, Harry Gavras thanks to Affymetrix

Two platforms similar values

- Both ADAMTS1 and ADRP were up-regulated in acute

Ang II treatment. - Again, both the values and the pattern of

variability between biological replicates were

reproduced between platforms.

Jennie Larkin, Harry Gavras thanks to Affymetrix

Not all is perfect

There were differences between the arrays, but

these were the exception rather than the

rule. Metallothionein-II was only up-regulated

in acute on the TIGR arrays, but was up-regulated

in both acute and chronic on the Affy chips.

Jennie Larkin, Harry Gavras thanks to Affymetrix

When Platforms Agree

When Platforms Disagree, They Disagree

General Microarray Strategy

- Choose an experimentally interesting and

tractable model system - Design an experiment with comparisons between

related variants - Include sufficient biological replication to make

good estimates - Hybridize and collect data
- Normalize and filter
- Mine data for biological patterns of expression
- Integrate expression data with other ancillary

data such, including genotype, phenotype, the

genome, and its annotation

Should I subtract background?

Non-normalized, background subtracted data

100

Annotating andComparing Arrays

TIGR Gene Indices home page

www.tigr.org/tdb/tgi

83 species gt26,000,000 sequences

The Gene Indices home page

biocomp.dfci.harvard.edu/tgi/tgipage.html

85 species gt28,000,000 sequences

TGICL Tools are available with more coming

Gene Index Assembly process

A TC Example

GO Terms and EC Numbers

Babak Parvizi

Everything is mapped in the context of the

relevant genome

The TIGR Gene Indices lthttp//www.tigr.org.tdb/tdb

/tgigt

Dan Lee, Ingeborg Holt

Building TOGs Reflexive, Transitive Closure

Thanks to Woytek Makalowski and Mark Boguski

TOGA An Sample Alignment bithoraxoid-like

protein

(No Transcript)

RESOURCERER

Jennifer Tsai

http//pga.tigr.org/tools.shtml

RESOURCERER An Example

RESOURCERER Using Genetic Markers

Now Available QTL-based searches, promoter

region extraction, GO-Slim analysis

Just added

The Features are the same

How Complete isthe Human Genome?

is easy!

Gene Finding in Humans

Razvan Sultana

is easy?

Gene Finding in Humans

Razvan Sultana

is difficult?

Gene Finding in Humans

Razvan Sultana

is difficult?

Gene Finding in Humans

A genome and its annotation is only a hypothesis

that must be tested.

Razvan Sultana

Select TIGRHuman TCsnot mappingto human

genome(about 9000)

Razvan Sultana Laura Beaver Bryan Frank

Answer 61 Positives!

SequenceValidationUnderway Approximately50

appear good

Razvan Sultana Laura Beaver Bryan Frank

Tools for Array Analysis

TM4 Resources

TM4 Website

- Application Downloads
- Documentation and FAQs
- And Much More!

http//www.tm4.org

CD Contents

- All TM4 Applications
- User Manuals
- Supplementary Documentation
- Sample Data Sets

TM4 Reference

Saeed, A.I., Sharov, V., White, J., Li, J.,

Liang, W., Bhagabati, N., Braisted, J., et al.

2003. TM4 A Free, Open-Source System for

Microarray Data Management and Analysis.

BioTechniques 34 374-378

Microarray Data Flow

Scanner

Printer

.tiff Image File

Image Analysis

Raw Gene Expression Data

Gene Annotation

Normalization / Filtering

Normalized Data with Gene Annotation

Expression Analysis

Interpretation of Analysis Results

MAD Database Schema

MADAM Microarray Data Manager

MAGE-ML exportnow functional

Joseph White Jerry Li Alexander Saeed Vasily

Sharov Syntek Inc.

? Available with OSI source and MySQL

MAGE-ML Export with Madam

MIDAS Data Analysis

Wei Liang

Variance Stabilization, Adding Error

Models, MAANOVA, Automated Reporting

? Available with source

MeV Data Mining Tools

Alexander Saeed Alexander Sturn Nirmal

Bhagabati John Braisted Syntek Inc. Datanaut,

Inc.

Now with scriptingand the ability to save state

andrestart analysis

? Available with OSI source www.tm4.org

Microarray Data Flow

Scheduler (Machine Scheduling)

SliTrack (Machine Control)

.tiff Image File

PCR Score

MABCOS (Barcode System)

Exp Designer

Spotfinder (Image Analysis)

MADAM (Data Manager)

Expression Data

Raw .tav File

Miner (.tav File Creator)

Raw .tav File

MIDAS (Normalization)

GenePix Converter

Normalized .tav File

Query Window

MeV (Data Analysis)

Interpretation

Protocols and Methods

Publications on Microarray Tools and Techniques

Publications on Microarray Data Analysis and

Design

Publications on Microarray Data Exchange Standards

MIAME Standards Nature family, Cell family, EMBO

reports, Bioinformatics,Genome Research, Genome

Biology, Science, The Lancet,Science, and

others.

The Starting Point Designing the Experiment

The Experimental Design

- The Experimental Design dictates a good deal of

what you can do with the data - Good normalization and processing reflects the

experimental design - The design also facilitates certain comparisons

between samples and provides the statistical

power you need for assigning confidence limits to

individual measurements - The design must reflect experimental reality
- The most straight-forward designs compare

expression in two classes of samples to look for

patterns that distinguish them.

Sample Pairing for Co-Hybridization Experiments

Direct Comparison with Dye Swap

A1

B1

A2

B2

A3

B3

A4

B4

A1

B1

A2

A3

B2

B3

A4

B4

- RNA sample is not limiting (e.g. plenty of

sample) - Flip dyes account for any gene-dye effects

Balanced Block Design

A1

B1

A2

B2

A4

B4

A3

B3

- RNA sample is limiting
- Balanced blocking accounts for any gene-dye

effects

Multiple Sample Pairings

Reference Design (Indirect Comparison)

- More than two samples are compared
- (e.g. tumor classification, time course)
- Flip dyes are not necessary but can be done to

increase precision - Ratio values are inferred (indirect)
- Suited for cluster analysis need common

reference

Loop Design

Why perform flip-dye experiments?

Loops and Reference Designs

23 Hybs

10 hybs

Standard flip-dye expt

S. Wang , K. Kerr, J. Quackenbush, G. Churchill

Loops and Reference Designs

Both approaches can give equivalent results

S. Wang , K. Kerr, J. Quackenbush, G. Churchill

Loop vs. Reference Designs

- Loop design
- Can provide direct measurements
- Give more data on each experimental sample

with the same number of hybs - Require more RNA per sample
- Can unwind with a bad sample or for a gene

with bad data - Reference design
- Easily extensible
- Simple interpretation of all results
- Requires less RNA per sample
- Less sensitive to bad RNA samples and bad

array elements

Experimental Design

A1

A2

A3

A4

B1

B2

B3

B4

Keep it simple!

One Possible Experimental Paradigm

Examining Genotype, Phenotype, and Environment

Parental - stressed

Derived - stressed

Parental - unstressed

Derived - unstressed

Basic Design Principles

- Biological replicas are more informative than

correlated replicas (independent RNA, independent

slides) - More replicas are better higher statistical

power - For loops, hybridizations of individual samples

should be balanced (as many Cy3 as Cy5

labelings) - Self-self hybs add data on reproducibility and

can be used to produce error models - At a minimum, should use dye swap replicates to

compensate for any dye biases in labeling or

detection

How Many Replicates?

(Simon et al., Genetic Epidemiology 23 21-36,

2002)

n 4(za/2 zb)2 / (d/1.4s)2

Where za/2 and zb are normal percentile values at

significance level a and false negative rate b

parameter d represents the minimum detectable

log2 ratio and s represents the SD of log ratio

values. For a 0.001 and b 0.05, then za/2

-3.29 and zb -1.65. Assume d 1.0 (2-fold

change) and s 0.25, Therefore n 12 samples

(6 query and 6 control).

Normalization

Why Normalize Data?

- Goal is to measure ratios of gene expression

levels (ratio)i Ri/Giwhere Ri/Gi are,

respectively , the measured intensities for the

ith spot. - In a self-self hybridization, we would expect all

ratios to be equal to one Ri/Gi 1 for all i.

But they may not be. - Why not?
- Unequal labeling efficiencies for Cy3/Cy5
- Noise in the system
- Differential expression
- Normalization brings (appropriate) ratios back to

one.

The Starting Point The Ratio

Log2(ratio) measures treat up- and down-regulated

genes equally

log2(1) 0 log2(2) 1 log2(1/2) -1

Normalization Approaches A variety exist

- Total Intensity
- Linear Regression
- Ratio statistics described by Chen, Dougherty,

Bittner - J. Biomed. Optics (1997) 2(4) 364-374
- Iterative log(ratio) mean centering
- Lowess Correction
- And others
- Any of these using
- Entire Data Set
- User-defined Data Set/Controls

Normalization Approaches

Using the Entire Data Set

- Probe Quantification less important
- No assumption on which genes constitute
- housekeeping set
- Uses all the data
- No independent confirmation

User-defined Data Set/Controls

- Requires definition of housekeeping set
- or good added controls
- Requires good RNA quantitation
- Ignores much data

Normalization Approaches

The Solution(?)

- The best technique is experiment dependent
- A good approach is to use a combination of

techniques - All analysis methods depend on an

intelligent Experimental design

Resource A. thaliana DNA Clones for Spiking

- chlorophyll a/b binding protein (Cab)
- RUBISCO activase (RCA)
- ribulose-1,5-bisphosphate carboxylase/oxygenase

(RbcL) - lipid transfer protein 4 (LTP4)
- lipid transfer protein 6 (LTP6)
- papain-type cysteine endopeptidase (XCP2)
- root cap 1 (RCP1)
- NAC1
- triosphosphate isomerase (TIM)
- ribulose-5-phosphate kinase (PRKase)

SP6 Transcription Start

5ATTTA GGTGA CACTA TAGAA TACAA GCTTG GGCTG

CAGGT CGACT CTAGA GGATC CCCGG GCGAG CTCCC

AAAAA AAAAA AAAAA AAAAA AAAAA AAAAA CCGAA TTC3

SP6 Promoter

HindIII

PstI

SatI AccI HincII

XbaI

EcoRI

SacI

AvaI SmaI

BamHi

Clone set available at lthttp//pga.tigr.orggt

Resource B. subtillus DNA Clones for Spiking

- pGIBS-lys ATCC 87482
- pGIBS-phe ATCC 87483
- pGIBS-thr ATCC 87484
- pGIBS-trp ATCC 87485
- pGIBS-dap ATCC 87486
- Artificial polyA added to the 3end

Clone set available at lthttp//www.atcc.orggt

Normalization Approaches Total Intensity

- Conceptually, this is the simplest approach
- Assumption Total RNA (mass) used is same for
- both samples.
- So, averaged across thousands of genes, total
- hybridization should be the same for both samples

Before and After Normalization

The Starting Point The R-I Plot

- Data exhibits an intensity-dependent structure
- Uncertainty in measurements is greater at lower

intensities - Uncertainty in ratio measurements generally

greater at lower intensities - Plot log2(R/G) vs. log2(RG)
- variation Terry Speeds M-A plot with
- (½ )log2(RG)

Good Data

Bad Data from Parts Unknown

Each pen group is colored differently

Gary Churchill

Lowess Normalization

- Why LOWESS?

- Observations
- Intensity-dependent structure
- Data not mean centered at log2(ratio) 0

LOWESS (Contd)

- Local linear regression model
- Tri-cube weight function
- Least Squares

LOWESS Results

Variance stabilization/regularization

- Measurements of expression vary between any two

assays - This can be affected by changes in the mean

expression level, but normalization can help

reduce those differences - However, the variance, or spread in the data, can

be quite different between replicates (or pen

groups) - Variance stabilization can rescale the data for

each experiment to make these more comparable

A Box Plot can show the difference in

variancebetween replicates

Standard Deviation Regularization

Let aij be the raw log ratio for the jth spot in

ith block (or slide)

aij be the scaled log ratio for the jth spot in

ith block (or slide)

where Nj denotes the number of genes ith block or

ith slide, M denotes the number of blocks or

slides, aij denotes the log ratio mean of ith

block (or ith slide)

MIDAS Normalization Methods(Standard deviation

regularization)

Standard deviation regularization

Assumption log-ratio standard deviations within

each block or slide are the same.

Variance regularization can remove the bias

There are Limits to what you can Measure

The Limits of log-ratios The space we explore

The Limits of log-ratios The space we explore

The Limits of log-ratios The space we explore

What do we measure?

Dealing with Data

Before any pattern analysis can be done, one must

first normalize and filter the data.

Normalization facilitates comparisons between

datasets.

Filtering transformations can eliminate

questionable data and reduce complexity.

Expression Elements

Ratio vs. log-ratio

Ai Red intensity Bi Green intensity

Let

R

4

Gene1

3

log2(AB)

2

1

Gene2

0

AB

Advantages of log transformation Treat

up-regulated and down-regulated genes

symmetrically! Transfer multiplication

operations to addition operations! Because

Expression Vectors

Gene Expression Vectors represent the expression

of a gene over a set of experimental conditions

or sample types.

Multiple Samples?

- Goal is identify genes (or experiments) which

havesimilar patterns of expression - This is a problem in data mining
- Clustering Algorithms are most widely used
- Types
- Agglomerative Hierarchical
- Divisive k-means, SOMs
- Hybrid SOTA
- Nonclustering Principal Component Analysis

(PCA) - All depend on how one measures distance

Expression Vectors

- Crucial concept for understanding clustering
- Each gene is represented by a vector where

coordinates are its values log(ratio) in each

experiment - x log(ratio)expt1
- y log(ratio)expt2
- z log(ratio)expt3
- etc.

Expression Vectors

- Crucial concept for understanding clustering
- Each gene is represented by a vector where

coordinates are its values log(ratio) in each

experiment - x log(ratio)expt1
- y log(ratio)expt2
- z log(ratio)expt3
- etc.
- For example, if we do six experiments,
- Gene1 (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
- Gene2 (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
- Gene3 (1.2, 0.5, 0, -0.25, -0.75, -1.4)
- etc.

Expt 1 Expt 2 Expt 3 Expt 4 Expt 5 Expt 6

Expression Matrix

- These gene expression vectors of log(ratio)

values can be used to construct an expression

matrix

- Gene1 -1.2 -0.5 0 0.25 0.75

1.4 - Gene2 0.2 -0.5 1.2 -0.25 -1.0

1.5 - Gene3 1.2 0.5 0 -0.25

-0.75 -1.4 - etc.
- This is often represented as a red/green colored

matrix

Expression Matrix

The Expression Matrix is a representation of data

frommultiple microarray experiments.

Each element is a log ratio, usually log 2

(Cy5/Cy3)

Black indicates a log-ratio of zero, i.e., Cy5

and Cy3 are very close in value

Green indicates a negative log-ratio, i.e., Cy5 lt

Cy3

Gray indicates missing data

Red indicates a positive log ratio, i.e, Cy5 gt

Cy3

Expression Vectors As Points inExpression Space

Similar Expression

Experiment 3

Experiment 2

Experiment 1

Distance metrics

- Distances are measured between expression

vectors - Distance metrics define the way we measure

distances - Many different ways to measure distance
- Euclidean distance
- Pearson correlation coefficient(s)
- Manhattan distance
- Mutual information
- Kendalls Tau
- etc.
- Each has different properties and can reveal

different features of the data

Distance and Similarity

The ability to calculate a distance (or

similarity - its inverse) between two expression

vectors is fundamental to clustering

algorithms Distance between vectors is the

basis upon which decisions are made when grouping

similar patterns of expression Selection of a

distance metric defines the concept of distance

for a particular experiment

Distance a measure of similarity between genes

Some distances (MeV provides 11 metrics)

Distance Is Defined by a Metric

Distance is Defined by a Metric

1.4

-0.90

4.2

-1.00

Gene1 Gene2 Gene3 Gene4 Gene5 Gene6

Distance Matrix

- Once a distance metric has been selected, the

starting point for all clustering methods is a

distance matrix

- Gene1 0 1.5 1.2 0.25

0.75 1.4 - Gene2 1.5 0 1.3 0.55

2.0 1.5 - Gene3 1.2 1.3 0 1.3

0.75 0.3 - Gene4 0.25 0.55 1.3 0 0.25

0.4 - Gene5 0.75 2.0 0.75 0.25

0 1.2 - Gene6 1.4 1.5 0.3 0.4

1.2 0 - The elements of this matrix are the pair-wise

distances. Note that the matrix is symmetric

about the diagonal.

MeV Data Mining Tools

Alexander Saeed Alexander Sturn Nirmal

Bhagabati John Braisted Syntek Inc. Datanaut,

Inc.

? Available with OSI source

Some Concepts from Statistics

Probability distributions

- The probability of an event is the likelihood

of its occurring. - It is sometimes computed as a relative frequency

(rf), where

The probability of an event can sometimes be

inferred from a theoretical probability

distribution, such as a normal distribution.

Normal distribution

- Less than a 5 chance that the sample with mean

s came from Population 1 - s is significantly different from Mean 1 at the

p lt 0.05 significance level. - But we cannot reject the hypothesis that the

sample came from Population 2

Probability and Expression Data

- Many biological variables, such as height and

weight, can reasonably be assumed to approximate

the normal distribution. - But expression measurements? Probably not.
- Fortunately, many statistical tests are

considered to be fairly robust to violations of

the normality assumption, and other assumptions

used in these tests. - Randomization / resampling based tests can be

used to get around the violation of the normality

assumption. - Even when parametric statistical tests (the ones

that make use of normal and other distributions)

are valid, randomization tests are still useful.

Outline of a randomization test - 1

1. Compute the value of interest (i.e., the

test-statistic s) from your data set.

2. Make fake data sets from your original

data, by taking a random sub-sample of the data,

or by re-arranging the data in a random fashion.

Re-compute s from the fake data set.

Outline of a randomization test - 2

3. Repeat step 2 many times (often several

hundred to several thousand times) and record of

the fake s values from step 2 4. Draw

inferences about the significance of your

original s value by comparing it with the

distribution of the randomized (fake) s values

Outline of a randomization test - 3

- Rationale
- Ideally, we want to know the behavior of the

larger population from which the sample is drawn,

in order to make statistical inferences. - Here, we dont know that the larger population

behaves like a normal distribution, or some

other idealized distribution. All we have to work

with are the data in hand. - Our fake data sets are our best guess about

this behavior (i.e., if we had been pulling data

at random from an infinitely large population, we

might expect to get a distribution similar to

what we get by pulling random sub-samples, or by

reshuffling the order of the data in our sample)

The problem of multiple testing (adapted from

presentation by Anja von Heydebreck,

MaxPlanckInstitute for Molecular Genetics,

Dept. Computational Molecular Biology, Berlin,

Germany http//www.bioconductor.org/workshops/Heid

elberg02/mult.pdf)

- Lets imagine there are 10,000 genes on a chip,

and - none of them is differentially expressed.
- Suppose we use a statistical test for

differential expression, where we consider a gene

to be differentially expressed if it meets the

criterion at a p-value of p lt 0.05.

The problem of multiple testing 2

- Lets say that applying this test to gene G1

yields a p-value of p 0.01 - Remember that a p-value of 0.01 means that there

is a 1 chance that the gene is not

differentially expressed, i.e., - Even though we conclude that the gene is

differentially expressed (because p lt 0.05),

there is a 1 chance that our conclusion is

wrong. - We might be willing to live with such a low

probability of being wrong - BUT .....

The problem of multiple testing 3

- We are testing 10,000 genes, not just one!!!
- Even though none of the genes is differentially

expressed, about 5 of the genes (i.e., 500

genes) will be erroneously concluded to be

differentially expressed, because we have decided

to live with a p-value of 0.05 - If only one gene were being studied, a 5 margin

of error might not be a big deal, but 500 false

conclusions in one study? That doesnt sound too

good.

The problem of multiple testing 4

- There are tricks we can use to reduce the

severity of this problem. - They all involve slashing the p-value for each

test (i.e., gene), so that while the critical

p-value for the entire data set might still equal

0.05, each gene will be evaluated at a lower

p-value. - Well go into some of these techniques later.

The problem of multiple testing 5

- Dont get too hung up on p-values.
- Ultimately, what matters is biological

relevance. - P-values should help you evaluate the strength of

the evidence, rather than being used as an

absolute yardstick of significance. - Statistical significance is not necessarily the

same as biological significance.

Finding Significant Genes

- Assume we will compare two conditions with

multiple replicates for each class - Our goal is to find genes that are significantly

different between these classes - These are the genes that we will use for later

data mining

Finding Significant Genes

- Average Fold Change Difference for each gene
- suffers from being arbitrary and not taking into

account systematic variation in the data

Finding Significant Genes

- t-test for each gene
- Tests whether the difference between the mean of

the query and reference groups are the same - Essentially measures signal-to-noise
- Calculate p-value (permutations or distributions)
- May suffer from intensity-dependent effects

t-tests

T-Tests (Between Subjects or unpaired) - 1

- Assign experiments to two groups, e.g., in the

expression matrix below, assign Experiments 1, 2

and 5 to group A, and experiments 3, 4 and 6 to

group B.

2. Question Is mean expression level of a gene

in group A significantly different from mean

expression level in group B?

T-TEST - 2

3. Calculate t-statistic for each gene

4. Calculate probability value of the

t-statistic for each gene either from A.

Theoretical t-distribution OR B.

Permutation tests.

T-TEST - 3

Permutation tests

i) For each gene, compute t-statistic

ii) Randomly shuffle the values of the gene

between groups A and B, such that the reshuffled

groups A and B respectively have the same number

of elements as the original groups A and B.

T-TEST - 4

Permutation tests - continued

iii) Compute t-statistic for the randomized

gene iv) Repeat steps i-iii n times (where n is

specified by the user). v) Let x the number of

times the absolute value of the original

t-statistic exceeds the absolute values of the

randomized t-statistic over n randomizations. vi)

Then, the p-value associated with the gene 1

(x/n)

T-TEST - 5

- 5. Determine whether a genes expression levels

are significantly different between the two

groups by one of three methods - Just alpha (a significance level) If the

calculated p-value for a gene is less than or

equal to the user-input a (critical p-value), the

gene is considered significant. - OR
- Use Bonferroni corrections to reduce the

probability of erroneously classifying

non-significant genes as significant. - B) Standard Bonferroni correction The user-input

alpha is divided by the total number of genes to

give a critical p-value that is used as above gt

pcritical a/N.

T-TEST 6

5C) Adjusted Bonferroni i) The t-values for

all the genes are ranked in descending order.

ii) For the gene with the highest t-value, the

critical p-value becomes (a/N), where N is the

total number of genes for the gene with the

second-highest t-value, the critical p-value

will be (a/N-1), and so on.

TTEST 1-class (or One-sample t-test) - 1

- Used to test if the the mean expression of a gene

over all experiments is different from a

hypothesized mean.

2. Question Is the mean of the values of a given

gene vector significantly different from a

hypothesized mean?

TTEST- 1 Class - 2

3. Often, the hypothesized mean in gene

expression studies is zero, meaning that we are

looking for genes whose mean log2 ratio across

all experiments is significantly different from

zero, i.e., 4. Using 1-sample t-tests, we can

select genes which, on average, show differential

expression across all experiments (since genes

with no differential expression should have a

mean log2 ratio of zero across all expts). 5.

Calculate t-value, where Observed mean

of gene vector Hypothesized mean of gene

vector t ------------------------------------

------------------------------------------ Stand

ard error of the mean of the gene vector

TTEST 1 class - 3

6. Calculate p-value from a theoretical

t-distribution, OR 7. By permutation 7a.

Randomly pick some elements of the gene vector,

and change their values,such that the new value

of the changed element is original value 2

x (original value - hypothesized mean)

(i.e., flip the elements deviation around the

hypothesized mean) Thus, if the original gene

values are and the hypothesized mean is

zero, then the randomized gene values could

be

TTEST 1 class - 4

7b. Calculate t-value from the randomized

gene 7c. Repeat 7a and 7b as many times as

desired. If all permutations are chosen, then

every possible combination of elements in the

gene vector is chosen for flipping. 7d. The

p-value 1 (the proportion of times that the

original absolute t-value exceeds the randomized

absolute t-value over all the permutations

conducted). 8. If a genes p-value is less than

or equal to the user-specified critical p-value,

the genes mean expression over all experiments

is significantly different from the hypothesized

mean. 9. Bonferroni and adjusted Bonferroni

corrections may be applied just as in the

two-sample t-test.

Finding Significant Genes

- Volcano Plots
- Combines p-values and fold change measures
- Significant genes appear in upper corners

Finding Significant Genes

- Analysis of Variation (ANOVA)
- Which genes are most significant for separating

classes of samples? - Calculate p-value (permutations or distributions)
- Reduces to a t-test for 2 samples
- May suffer from intensity-dependent effects

One Way Analysis of Variance (ANOVA)

- Assign experiments to gt 2 groups

2. Question Is mean expression level of a gene

the same across all groups?

ANOVA - 2

3. Calculate an F-ratio for each gene,

where Mean square (groups) F

----------------------------------, which is a

measure of Mean square (error) Between

groups variability -----------------------------

--------- Within groups variability The larger

the value of F, the greater the difference among

the group means relative to the sampling error

variability (which is the within groups

variability). i.e., the larger the value of F,

the more likely it is that the differences among

the group means reflect real differences among

the means of the populations they are drawn

from, rather than being due to random sampling

error.

ANOVA - 3 4. The p-value associated with an

F-value is the probability that an F-value that

large would be obtained if there were no

differences among group means (i.e., given the

null hypothesis). Therefore, the smaller the

p-value, the less likely it is that the null

hypothesis is valid, i.e., the differences among

group means are more likely to reflect real

population differences as p-values decrease in

magnitude.

- ANOVA - 4
- 5. P-values can be obtained for the F-values from

a theoretical F-distribution, assuming that the

populations from which the data are obtained - are normally distributed, and
- have homogeneous variances.

The test is considered robust to violations of

these assumptions, provided sample sizes are

relatively large and similar across groups.

ANOVA 5 6. P-values can be obtained from

permutation tests (just like in t-tests), if one

does not want to rely on the assumptions needed

for using the F-distribution. P-values can also

be corrected for multiple comparisons (using

Bonferroni or other procedures).

Two-factor ANOVA (TFA)

Can be used to find genes whose expression is

significantly different over two factors (e.g.,

sex and strain), as well as to look for genes

with a significant interaction for these two

factors.

Strain B

Strain C

Strain A

Male

Female

TFA - 2

TFA - 3

- Ideally, design should be balanced, i.e., equal

numbers of samples in each factor A factor B

combination. - If unbalanced, the analysis can still be

conducted, but F-tests will be somewhat biased.

May need to use smaller p-values. - Can have balanced designs with no replication

(see below). In this case, interaction cannot be

tested..

Finding Significant Genes

- Significance Analysis of Microarrays (SAM)
- Uses a modified t-test by estimating and adding a

small positive constant to the denominator - Significant genes are those which exceed the

expected values from permutation analysis.

SAM test Statistic

- di Score
- si Standard Deviation
- s0 Safety Factor

SAM Variance Estimate

- Gene by gene variance estimate safety factor
- Variance equal in the two conditions
- s0 term is here to deal with cases when variance

estimates gets too close to zero

- How to choose s0?
- Test statistics are binned in 100 different group

depending on the si value - s0 is chosen so that the dispersion of the test

statistic does not vary from bin to bin - avoids aberrant values when variance estimates

close to 0

SAM Hypothesis Testing

- Permutation technique
- Multiple testing adjustment technique
- False Discovery Rate

Confidence Level False Discovery Rate

- Fix a threshold DELTA for differentially

expressed genes - For each permutation, count how many genes you

declare differentially expressed - NB In a permutation you should find 0 genes.
- Compute median number of falsely called genes in

permutations - False Discovery Rate is number of falsely called

genes divided by number of differential expressed

genes in original data

FDR percentage of NON-significant genes you can

expect to find in your result list

SAM

- SAM gives estimates of the False Discovery Rate

(FDR), which is the proportion of genes likely to

have been wrongly identified by chance as being

significant. - It is a very interactive algorithm allows users

to dynamically change thresholds for significance

(through the tuning parameter delta) after

looking at the distribution of the test

statistic. - The ability to dynamically alter the input

parameters based on immediate visual feedback,

even before completing the analysis, helps make

the data-mining process sensitive.

Significance analysis of microarrays (SAM)

- SAM can be used to identify significant genes

based on differential expression between sets of

samples. - Currently implemented for the following designs
- two-class unpaired
- two-class paired
- multi-class
- censored survival
- one-class

SAM designs

- Two-class unpaired to pick out genes whose mean

expression level is significantly different

between two groups of samples (analogous to

between subjects t-test). - Two-class paired samples are split into two

groups, and there is a 1-to-1 correspondence

between an sample in group A and one in group B

(analogous to paired t-test).

SAM designs - 2

- Multi-class identifies genes whose mean

expression is different across gt 2 groups of

samples (analogous to one-way ANOVA) - Censored survival finds genes whose expression

levels are correlated with duration of survival

using Cox-regression. - One-class selects genes whose mean expression

across experiments is different from a

user-specified mean.

SAM Two-Class 1

- Assign experiments to two groups, e.g., in the

expression matrix below, assign Experiments 1, 2

and 5 to group A, and experiments 3, 4 and 6 to

group B.

2. Question Is mean expression level of a gene

in group A significantly different from mean

expression level in group B?

SAM Two-Class 2

Permutation tests

i) For each gene, compute d-value (analogous to

t-statistic). This is the observed d-value for

that gene.

ii) Randomly shuffle the values of the gene

between groups A and B, such that the reshuffled

groups A and B respectively have the same number

of elements as the original groups A and B.

Compute the observed d-value for each randomized

gene

SAM Two-Class 3

- iii) Repeat step (ii) many times, so that each

gene has many randomized d-values. Take the

average of the randomized d-values for each gene.

This is the expected d-value of that gene. - iv) Plot the observed d-values vs. the expected

d-values

SAM Two-Class 4

Significant positive genes (i.e., mean

expression of group B gt mean expression of

group A) in red

The more a gene deviates from the observed

expected line, the more likely it is to be

significant. Any gene beyond the first gene in

the ve or ve direction on the x-axis (including

the first gene), whose observed exceeds the

expected by at least delta, is considered

significant.

SAM Two-Class 5

- For each permutation of the data, compute the

number of positive and negative significant genes

for a given delta as explained in the previous

slide. The median number of significant genes

from these permutations is the median False

Discovery Rate. - The rationale behind this is, any genes

designated as significant from the randomized

data are being picked up purely by chance (i.e.,

falsely discovered). Therefore, the median

number picked up over many randomizations is a

good estimate of false discovery rate.

SAM Two-Class Paired

- Samples fall into two groups
- Each member of group A is associated with a

member of group B in a 1-to-1 relationship

A-B pair

SAM Two-Class Paired - 2

- e.g., groups A and B could respectively

represent before and after a drug treatment,

and each A-B pair of samples could come from the

same patient before and after the treatment. - or, groups A and B could represent two strains

for which samples were collected at the several

time points over a time course study. A sample

collected from each of strain A and B at the same

time point could form an AB pair.

- The rest of the analysis is similar to two-class

unpaired SAM. Positive significant genes are

those for which Mean(Group B) is significantly

larger than Mean (Group A), and reverse is true

for negative significant genes

SAM Multi-Class

- Extension of SAM two -class unpaired to more

than 2 groups - Experiments belong to one of at least three

groups - Analogous to one-way between subjects ANOVA

SAM Multi-Class - 2

- This analysis yields only positive significant

genes - These are genes whose means are significantly

different across some combination of the groups

of experiments.

SAM Censored Survival

- Each experiment (sample) is associated with an

observation time, and a state at the time of

observation. - The state is either dead or censored
- Censored means that the subject survived

beyond the time point at which the sample was

taken. - A positive score means that a higher expression

level for that gene implies shorter survival

(i.e., higher risk), whereas a negative score

means that higher expression implies longer

survival.

Finding Patterns in the Data

Hierarchical Clustering

1. Calculate the distance between all genes. Find

the smallest distance. If several pairs share the

same similarity, use a predetermined rule to

decide between alternatives.

2. Fuse the two selected clusters to produce a

new cluster that now contains at least two

objects. Calculate the distance between the new

cluster and all other clusters.

3. Repeat steps 1 and 2 until only a single

cluster remains.

4. Draw a tree representing the results.

Hierarchical Clustering

(HCL-2)

Hierarchical Clustering

(HCL-3)

Hierarchical Tree

(HCL-4)

Agglomerative Linkage Methods

- Linkage methods are rules or metrics that return

a value that can be used to determine which

elements (clusters) should be linked. - Three linkage methods that are commonly used are

- Single Linkage
- Average Linkage
- Complete Linkage

(HCL-6)

Single Linkage

Cluster-to-cluster distance is defined as the

minimum distance between members of one cluster

and members of the another cluster. Single

linkage tends to create elongated clusters with

individual genes chained onto clusters. DAB

min ( d(ui, vj) ) where u Î A and v Î B for all

i 1 to NA and j 1 to NB

DAB

(HCL-7)

Average Linkage

Cluster-to-cluster distance is defined as the

average distance between all members of one

cluster and all members of another cluster.

Average linkage has a slight tendency to produce

clusters of similar variance. DAB 1/(NANB) S

S ( d(ui, vj) ) where u Î A and v Î B for all

i 1 to NA and j 1 to NB

DAB

(HCL-8)

Complete Linkage

Cluster-to-cluster distance is defined as the

maximum distance between members of one cluster

and members of the another cluster. Complete

linkage tends to create clusters of similar size

and variability. DAB max ( d(ui, vj) ) where

u Î A and v Î B for all i 1 to NA and j 1 to

NB

DAB

(HCL-9)

Comparison of Linkage Methods

Average

Single

Complete

(HCL-10)

Bootstrapping

Bootstrapping resampling with replacement

Original expression matrix

Various bootstrapped matrices (by experiments)

Jackknifing

Jackknifing resampling without replacement

Original expression matrix

Various jackknifed matrices (by experiments)

Analysis of bootstrapped and jackknifed support

trees

- Bootstrapped or jackknifed expression matrices

are created many times by randomly resampling the

original expression matrix, using either the

bootstrap or jackknife procedure. - Each time, hierarchical trees are created from

the resampled matrices. - The trees are compared to the tree obtained from

the original data set. - The more frequently a given cluster from the

original tree is found in the resampled trees,

the stronger the support for the cluster. - As each resampled matrix lacks some of the

original data, high support for a cluster means

that the clustering is not biased by a small

subset of the data.

Self Organizing Tree Algorithm

SOTA - 1

- Dopazo, J. , J.M Carazo, Phylogenetic

reconstruction using and unsupervised growing

neural network that adopts the topology of a

phylogenetic tree. J. Mol. Evol. 44226-233,

1997. - Herrero, J., A. Valencia, and J. Dopazo. A

hierarchical unsupervised growing neural network

for clustering gene expression patterns.

Bioinformatics, 17(2)126-136, 2001.

SOTA Characteristics

SOTA - 2

- Divisive clustering, allowing high level

hierarchical structure to be revealed without

having to completely partition the data set down

to single gene vectors - Data set is reduced to clusters arranged in a

binary tree topology - The number of resulting clusters is not fixed

before clustering - Neural network approach which has advantages

similar to SOMs such as handling large data sets

that have large amounts of noise

SOTA Topology

SOTA - 3

Centroid Vector

ap

Members

Adaptation Overview

SOTA - 4

- Each gene vector associated with the parent is

compared to the centroid vector of its offspring

cells. - The most similar cells centroid and its

neighboring cells are adapted using the

appropriate migration weights.

- Following the presentation of all genes to the

system a measure of system diversity is used to

determine if training has found an optimal

position for the offspring. - If the system diversity improves (decreases)

then another training epoch is started otherwise

training ends and a new cycle starts with a cell

division.

SOTA - 5

SOTA - 6

The most diverse cell is selected for division

at the start of the next training cycle.

Growth Termination

SOTA - 7

Expansion stops when the most diverse cells

diversity falls below a threshold.

SOTA - 8

Each training cycle ends when the overall tree

diversity stabilizes. This triggers a cell

division and possibly a new training cycle.

K-Means/Medians Clustering 1

K-Means/Medians Clustering 2

3. Calculate mean/median expression profile of

each cluster.

4. Shuffle genes among clusters such that each

gene is now in the cluster whose mean expression

profile (calculated in step 3) is the closest to

that genes expression profile.

5. Repeat steps 3 and 4 until genes cannot be

shuffled around any more, OR a user-specified

number of iterations has been reached.

k-means is most useful when the user has an a

priori hypothesis about the number of clusters

the genes should belong to.

K-Means / K-Medians Support (KMS)

- Because of the random initialization of

K-Means/K-Means, clustering results may vary

somewhat between successive runs on the same

dataset. KMS helps us validate the clustering

results obtained from K-Means/K-Medians. - Run K-Means / K-Medians multiple times.
- The KMS module generates clusters in which the

member genes frequently group together in the

same clusters (consensus clusters) across

multiple runs of K-Means / K-Medians. - The consensus clusters consist of genes that

clustered together in at least x of the K-Means

/ Medians runs, where x is the threshold

percentage input by the user.

Self-organizing maps (SOMs) 1

1. Specify the number of nodes (clusters)

desired, and also specify a 2-D geometry for the

nodes, e.g., rectangular or hexagonal

SOMs 2

2. Choose a random gene, e.g., G9

3. Move the nodes in the direction of G9. The

node closest to G9 (N2) is moved the most, and

the other nodes are moved by smaller varying

amounts. The farther away the node is from N2,

the less it is moved.

SOMs 3

4. Steps 2 and 3 (i.e., choosing a random gene

and moving the nodes towards it) are repeated

many (usually several thousand) times. However,

with each iteration, the amount that the nodes

are allowed to move is decreased.

5. Finally, each node will nestle among a

cluster of genes, and a gene will be considered

to be in the cluster if its distance to the node

in that cluster is less than its distance to any

other node.

SOM Neighborhood Options

Gaussian Neighborhood

Bubble Neighborhood

radius

G7

G7

G8

G8

G9

G9

G10

G10

G11

G11

N1

N2

N1

N2

N3

N4

N3

N4

N5

N6

N5

N6

Some move, alpha is constant.

All move, alpha is scaled.

Template Matching

- Template matching allows one to find expression

vectors which match a provided template - A template can be derived from
- a gene known to be central to the area of study
- a sample or set of samples of a particular type
- a cluster with a mean pattern of interest
- a pattern constructed to reveal trends based on

knowledge of the experimental design

PTM-2

- Sometimes it is useful to identify elements that

have complementary patterns by selecting to use

the absolute value of r.

Gene Shaving

Results in a series of nested clusters

Choose cluster of appropriate size as determined

by gap statistic calculation

Repeat until only one gene remains

Orthogonalize expression matrix with respect to

the average gene in the cluster and repeat

shaving procedure

Gene Shaving

The final cluster contains a set of genes that

are greatly affected by the experimental

conditions in a similar way.

Create random permutations of the expression

matrix and calculate R2 for each

Compare R2 of each cluster to that of the entire

expression matrix

Choose the cluster whose R2 is furthest from the

average R2 of the permuted expression matrices.

Relevance Networks

Set of genes whose expression profiles are

predictive of one another.

Can be used to identify negative correlations

between genes.

Relevance Networks

The remaining relationships between genes define

the subnets

Principal Components Analysis

- PCA simplifies the views of the data.
- Suppose we have measurements for each gene on

multiple experiments. - Suppose some of the experiments are correlated.
- PCA will ignore the redundant experiments, and

will take a weighted average of some of the

experiments, thus possibly making the trends in

the data more interpretable. - 5. The