
Some Statistical Issues in Microarray Data Analysis

- Alex Sánchez
- Estadística i Bioinformàtica
- Departament d'Estadística, Universitat de Barcelona
- Unitat d'Estadística i Bioinformàtica, IR-HUVH

Outline

- Introduction
- Experimental design
- Selecting differentially expressed genes
- Statistical tests
- Significance testing
- Linear models and Analysis of the variance
- Multiple testing
- Software for microarray data analysis

Introduction

Microarray experiments Overview

Why are we talking of statistics?

- A microarray experiment is, as its name says, an experiment, that is:
- It has been performed to determine whether some prior hypotheses are true or false (although it can also lead to new hypotheses)
- It is subject to errors, which may arise from many sources

Sources of variability

- Biological Heterogeneity in Population
- Specimen Collection/ Handling Effects
- Tumor surgical bx, FNA
- Cell Line culture condition, confluence level
- Biological Heterogeneity in Specimen
- RNA extraction
- RNA amplification
- Fluor labeling
- Hybridization
- Scanning
- PMT voltage
- laser power

(Geschwind, Nature Reviews Neuroscience, 2001)

Categories of variability

- Systematic variability
- Amount of RNA in the biopsy
- Efficiencies of lab procedures such as
- RNA extraction,
- reverse transcription,
- Labeling or
- photodetection

- Random variation
- PCR yield
- DNA quality
- spotting efficiency,
- spot size
- cross-/unspecific hybridization
- stray signal

Dealing with systematic variability

- Systematic variability has similar effects on many measurements
- Corrections can be estimated from data
- CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability

Dealing with random variation

- Random variation cannot be explicitly accounted for
- The usual way to deal with it is to assume some ERROR MODEL (e.g. $\varepsilon_i \sim N(0, \sigma^2)$)
- Assuming these error models are true:
- EXPERIMENTAL DESIGN is (must be) used to control the action of random variation
- STATISTICAL INFERENCE is (must be) used to draw conclusions in the presence of random variation

[Figure: flowchart of a microarray study: Biological question, Experimental design, Microarray experiment, Image analysis, Normalization, Quality measurement (Pass / Failed), Analysis (Estimation, Testing, Clustering, Discrimination), Biological verification and interpretation. A "Today" label marks the part covered in this talk.]

Experimental design

Why experimental design?

- The objective of experimental design is to make the analysis of the data and the interpretation of the results
- As simple and as powerful as possible
- Given the purpose of the experiment
- And the constraints of the experimental material

Scientific aims and design choice

- The primary focus of the experiment needs to be clearly stated, whether it is
- to identify differentially expressed genes
- to search for specific gene-expression patterns
- to identify phenotypic subclasses
- The aim of the experiment guides the design choice
- Sometimes only one choice is reasonable
- Sometimes different options are available

Designing microarray experiments

- The appropriate design of a microarray experiment must consider
- Design of the array
- Allocation of mRNA samples to the slides

I. Layout of the array

- Which sequences to use
- cDNAs? Selection of cDNA from library
- Riken, NIA, etc.
- Affymetrix? PMs and MMs
- Oligo probes selection (from Operon, Agilent, etc.)
- Control probes
- Which ones? Where should controls be put?
- How many sequences to use
- Should there be replicate spots within a slide?

II. Allocating samples to slides

- Types of samples
- Replication: technical vs biological
- Pooled vs individual samples
- Different design layout / data analysis
- Scientific aim of the experiment
- Efficiency, Robustness, Extensibility
- Physical limitations (cost)
- Number of slides
- Amount of material

Basic principles of experimental design

- Apply the following principles to best attain the objectives of experimental design
- Replication
- Local control or Blocking
- Randomization

1. Replication

- It's important
- To reduce uncertainty (increase precision)
- To obtain sufficient power for the tests
- As a formal basis for inferential procedures
- Consider different types of replicates
- Technical
- Duplicate spots
- Multiple hybridizations from the same sample
- Biological
- Put the most replication where the most variation is expected!

Biological vs Technical Replicates

Figure source: Nature Reviews, G. Churchill (2002)

Replication vs Pooling

- mRNA from different samples is often combined to form a pooled sample, or pool. Why?
- If each sample doesn't yield enough mRNA
- To compensate for an excess of variability (?)
- Statisticians tend not to like it, but pooling may be OK if properly done
- Combine several samples in each pool
- Use several pools from different samples
- Do not use pools when individual information is important (e.g. paired designs)

2. Blocking

- Assume we wish to perform an experiment to compare two treatments.
- The samples or their processing may not be homogeneous: there are blocks
- Subjects: Male/Female
- Arrays produced in two lots (February, March)
- If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded
- Are observed differences attributable to the treatment effect or to confounding factors?

Confounding block with treatment effects

- Two alternative designs to investigate treatment effects
- Left: treatment effects are confounded with Sex and Batch effects
- Right: treatments are balanced between blocks
- Influence of blocks is automatically compensated
- Statistical analysis may separate block from treatment effect

3. Randomisation

- Randomly assigning samples to groups to eliminate unspecific disturbances
- Randomly assign individuals to treatments.
- Randomise the order in which experiments are performed.
- Randomisation is required to ensure the validity of statistical procedures.
- Block what you can and randomize what you cannot
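The "block what you can, randomize what you cannot" rule can be sketched in a few lines of Python. This is only an illustration: the function name `randomize_within_blocks`, the sample identifiers and the two blocks (female, male) are hypothetical and do not come from the slides. Treatments are assigned at random separately inside each block, so that every block receives a balanced set of treatments.

```python
import random


def randomize_within_blocks(samples_by_block, treatments, seed=None):
    """Assign treatments at random, separately inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for block, samples in samples_by_block.items():
        # Repeat the treatment labels so each block gets a balanced set;
        # any leftover samples receive randomly chosen distinct treatments.
        reps, extra = divmod(len(samples), len(treatments))
        labels = treatments * reps + rng.sample(treatments, extra)
        rng.shuffle(labels)
        for sample, label in zip(samples, labels):
            assignment[sample] = (block, label)
    return assignment


# Hypothetical example: 4 female and 4 male subjects, two treatments.
blocks = {
    "female": ["F1", "F2", "F3", "F4"],
    "male": ["M1", "M2", "M3", "M4"],
}
design = randomize_within_blocks(blocks, ["treated", "control"], seed=1)
for sample, (block, label) in sorted(design.items()):
    print(sample, block, label)
```

Because the shuffle happens inside each block, sex can never be confounded with treatment, whatever the random draw.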

Experimental layout

- How are mRNA samples assigned to arrays?
- The experimental layout has to be chosen so that the resulting analysis can be done as efficiently and robustly as possible
- Sometimes there is only one reasonable choice
- Sometimes several choices are available

Example 1: Only one design choice

- Case 1: Meaningful biological control (C)
- Samples: Liver tissue from 4 mice treated by cholesterol modifying drugs.
- Question 1: Genes that respond differently between the T and the C.
- Question 2: Genes that responded similarly across two or more treatments relative to control.
- Case 2: Use of universal reference.
- Samples: Different tumor samples.
- Question: To discover tumor subtypes.

Example 2: A number of different designs are suitable for use (1)

- Time course experiments
- Design choice depends on the comparisons of interest

How can we decide?

- A-optimality: choose the design which minimizes the variance of the estimates of the effects of interest
- A simple example: direct vs indirect estimates
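The direct vs indirect comparison can be made concrete with a small variance calculation. This is a sketch under assumptions that are not stated in the slides: two slides are available, each slide yields one log-ratio with independent error variance $\sigma^2$, and $R$ denotes a common reference sample.

```latex
% Direct design: both slides hybridize T against C.
\hat{\delta}_{\text{direct}}
  = \tfrac{1}{2}\left[\log(T/C)_1 + \log(T/C)_2\right],
\qquad
\operatorname{Var}\bigl(\hat{\delta}_{\text{direct}}\bigr) = \frac{\sigma^2}{2}

% Indirect design: one slide T vs R, one slide C vs R.
\hat{\delta}_{\text{indirect}}
  = \log(T/R) - \log(C/R),
\qquad
\operatorname{Var}\bigl(\hat{\delta}_{\text{indirect}}\bigr) = 2\sigma^2
```

Under these assumptions the direct design gives a variance four times smaller for the same number of slides, so A-optimality would favour it when T vs C is the only effect of interest.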

Summary

- Selection of mRNA samples is important
- Most important: biological replicates
- Technical replicates are also useful, but different
- If needed and possible, use pooling wisely
- Choice of experimental layout is guided by
- The scientific question
- Experimental design principles
- Efficiency and robustness considerations
- The correspondence between experimental designs, linear models and ANOVA can be exploited to select the model and analyze the data

Experimental design, Linear Models and Analysis of Variance

- In experimental design the different sources of variability influencing the observed response may be identified.
- These sources can be related to the response using a linear model
- Analysis of variance can be used to separately estimate and test the relative importance of each source of variability.
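As an illustration of this correspondence, a design with one treatment factor and one blocking factor could be written as the linear model below. The symbols are assumptions introduced here for the sketch: $y_{ij}$ is the log-expression of a gene under treatment $i$ in block $j$, $\tau_i$ is the treatment effect and $\beta_j$ the block effect.

```latex
y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)

% ANOVA question for the treatment factor with k levels:
H_0 : \tau_1 = \tau_2 = \dots = \tau_k
```

The analysis of variance then splits the total variability of $y_{ij}$ into a treatment part, a block part and a residual part, and tests each against the residual.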

Statistical methods to detect differentially expressed genes

Class comparison: Identifying differentially expressed genes

- Identify genes differentially expressed between different conditions such as
- Treatment, cell type, ... (qualitative covariates)
- Dose, time, ... (quantitative covariates)
- Survival, infection time, ...
- Estimate effects/differences between groups, probably using log-ratios, i.e. the difference on the log scale: $\log(X) - \log(Y) = \log(X/Y)$

What is a significant change?

- Depends on the variability within groups, which may be different from gene to gene.
- To assess the statistical significance of differences, conduct a statistical test for each gene.

Different settings for statistical tests

- Indirect comparisons: 2 groups, 2 samples, unpaired
- E.g. 10 individuals: 5 suffer diabetes, 5 healthy
- One sample from each individual
- Typically: two-sample t-test or similar
- Direct comparisons: two groups, two samples, paired
- E.g. 6 individuals with brain stroke.
- Two samples from each: one from the healthy region (region 1) and one from the affected region (region 2).
- Typically: one-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions.

Different ways to do the experiment

- An experiment may use cDNA arrays (two-colour) or Affymetrix arrays (one-colour).
- Depending on the technology used, the allocation of conditions to slides changes.

| Experiment | cDNA (2-colour) | Affy (1-colour) |
|---|---|---|
| 10 individuals: Diab (5), Heal (5) | Reference design: (5) Diab/Ref, (5) Heal/Ref | Comparison design: (5) Diab vs (5) Heal |
| 6 individuals: Region 1, Region 2 | 6 slides, 1 individual per slide: (6) reg1/reg2 | 12 slides: (6) paired differences |

Natural measures of discrepancy

- For direct comparisons in two-colour or paired one-colour experiments.
- For indirect comparisons in two-colour or direct comparisons in one-colour experiments.
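The formulas for these two settings did not survive in this transcript. A plausible sketch, using standard textbook forms rather than the original slide content, is given below. The notation is assumed here: $R_1, \dots, R_n$ are the per-slide (or per-pair) log-ratios in the paired setting, and $A$, $B$ are the log-expression values of the two groups in the unpaired setting.

```latex
% Paired / direct setting: mean log-ratio and one-sample t statistic
\bar{R} = \frac{1}{n}\sum_{i=1}^{n} R_i,
\qquad
t = \frac{\bar{R}}{s_R / \sqrt{n}}

% Unpaired setting: difference of means and two-sample t statistic
% (unequal-variance form)
\bar{A} - \bar{B},
\qquad
t = \frac{\bar{A} - \bar{B}}{\sqrt{s_A^2 / n_A + s_B^2 / n_B}}
```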

Some Issues

- Can we trust average effect sizes (average difference of means) alone?
- Can we trust the t statistic alone?
- Here is evidence that the answer is no.

| Gene | M1 | M2 | M3 | M4 | M5 | M6 | Mean | SD | t |
|---|---|---|---|---|---|---|---|---|---|
| A | 2.5 | 2.7 | 2.5 | 2.8 | 3.2 | 2 | 2.61 | 0.40 | 16.10 |
| B | 0.01 | 0.05 | -0.05 | 0.01 | 0 | 0 | 0.003 | 0.03 | 0.25 |
| C | 2.5 | 2.7 | 2.5 | 1.8 | 20 | 1 | 5.08 | 7.34 | 1.69 |
| D | 0.5 | 0 | 0.2 | 0.1 | -0.3 | 0.3 | 0.13 | 0.27 | 1.19 |
| E | 0.1 | 0.11 | 0.1 | 0.1 | 0.11 | 0.09 | 0.10 | 0.01 | 33.09 |

Courtesy of Y.H. Yang


- Averages can be driven by outliers.

- t statistics can be driven by tiny variances.
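The Mean, SD and t columns of the "Some Issues" table can be recomputed from the six log-ratios of each gene. The sketch below assumes the t column is the usual one-sample statistic, mean / (SD / sqrt(n)); the recomputed values agree with the table up to rounding. The median is added here as a robust companion to the mean and is not part of the original table.

```python
import math
import statistics

# Log-ratios M1..M6 for genes A-E, copied from the "Some Issues" table.
log_ratios = {
    "A": [2.5, 2.7, 2.5, 2.8, 3.2, 2.0],
    "B": [0.01, 0.05, -0.05, 0.01, 0.0, 0.0],
    "C": [2.5, 2.7, 2.5, 1.8, 20.0, 1.0],
    "D": [0.5, 0.0, 0.2, 0.1, -0.3, 0.3],
    "E": [0.1, 0.11, 0.1, 0.1, 0.11, 0.09],
}


def one_sample_t(values):
    """Return (mean, sample SD, one-sample t statistic against 0)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean, sd, mean / (sd / math.sqrt(len(values)))


summary = {gene: one_sample_t(values) for gene, values in log_ratios.items()}
for gene, (mean, sd, t) in summary.items():
    median = statistics.median(log_ratios[gene])
    print(f"{gene}: mean={mean:.3f} median={median:.3f} sd={sd:.3f} t={t:.2f}")
```

Gene C shows the outlier problem (a single value of 20 pulls the mean well above the median), while gene E shows the tiny-variance problem (a mean of only about 0.1 yields the largest t).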

Variations in t-tests (1)

- Let
- $R_g$: mean observed log ratio
- $SE_g$: standard error of $R_g$ estimated from data on gene g.
- $SE$: standard error of $R_g$ estimated from data across all genes.
- Global t-test: $t = R_g / SE$
- Gene-specific t-test: $t = R_g / SE_g$

Some pros and cons of t-test

| Test | Pros | Cons |
|---|---|---|
| Global t-test: $t = R_g / SE$ | Yields stable variance estimate | Assumes variance homogeneity; biased if false |
| Gene-specific t-test: $t = R_g / SE_g$ | Robust to variance heterogeneity | Low power; yields unstable variance estimates (due to few data) |

t-test extensions

- SAM (Tibshirani, 2001)
- Regularized-t (Baldi, 2001)
- EB-moderated t (Smyth, 2003)
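The formulas for these three extensions are missing from this transcript. In schematic form, ignoring constant scale factors, they are commonly written as below, reusing $R_g$, $SE_g$ and $SE$ from the t-test variations. The constants $c$, $v_0$, $d_0$, $SE_0$ and the degrees of freedom $d$ are notation assumed for this sketch; how each method estimates them is not reproduced here.

```latex
% SAM: add a positive constant c to the gene-specific standard error
t_{\text{SAM}} = \frac{R_g}{c + SE_g}

% Regularized t: weighted average of global and gene-specific variances,
% with v_0 a prior "pseudo-count" and n the number of replicates
t_{\text{reg}} = \frac{R_g}{\sqrt{\dfrac{v_0\, SE^2 + (n - 1)\, SE_g^2}{v_0 + n - 2}}}

% Empirical Bayes moderated t: prior variance SE_0^2 with d_0 degrees of
% freedom, both estimated from all genes; d = residual degrees of freedom
t_{\text{mod}} = \frac{R_g}{\sqrt{\dfrac{d_0\, SE_0^2 + d\, SE_g^2}{d_0 + d}}}
```

All three shrink the gene-specific variance towards a common value, which is a compromise between the global and the gene-specific t-tests.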

Up to here: Can we generate a list of candidate genes?

With the tools we have, the reasonable steps to generate a list of candidate genes may be:

[Figure: flowchart whose intermediate steps are left as an open question ("?"), ending in "A list of candidate DE genes".]

We need an idea of how significant these values are: we'd like to assign them p-values.

Significance testing

Nominal p-values

- After a test statistic is computed, it is convenient to convert it to a p-value: the probability that a test statistic, say $S(X)$, takes values equal to or greater than that taken on the observed sample, say $S(X_0)$, under the assumption that the null hypothesis is true: $p = P[S(X) \ge S(X_0) \mid H_0 \text{ true}]$

Significance testing

- Test of significance at the $\alpha$ level
- Reject the null hypothesis if your p-value is smaller than the significance level
- It has advantages but is not free from criticism
- Genes with p-values falling below a prescribed level may be regarded as significant

Hypothesis testing overview for a single gene

| State of nature ("Truth") | Reported decision: H0 is rejected (gene is selected) | Reported decision: H0 is accepted (gene not selected) | |
|---|---|---|---|
| H0 is false (Affected) | TP, prob $1 - \beta$ (power) | FN, prob $\beta$ (Type II error) | Sensitivity = TP/(TP+FN) |
| H0 is true (Not affected) | FP, $P[\text{Rej } H_0 \mid H_0] \le \alpha$ (Type I error) | TN, prob $1 - \alpha$ | Specificity = TN/(TN+FP) |
| | Positive predictive value = TP/(TP+FP) | Negative predictive value = TN/(TN+FN) | |
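The four ratios in the margins of the table are straightforward to compute once the cells are read as counts over many genes. The function name and the counts below are hypothetical and serve only as an illustration.

```python
def classification_rates(tp, fp, tn, fn):
    """Sensitivity, specificity and predictive values from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # affected genes that were selected
        "specificity": tn / (tn + fp),  # unaffected genes left unselected
        "ppv": tp / (tp + fp),          # selected genes that are affected
        "npv": tn / (tn + fn),          # unselected genes that are unaffected
    }


# Hypothetical counts: 100 affected and 1000 unaffected genes.
rates = classification_rates(tp=80, fp=50, tn=950, fn=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the specificity is high, yet more than a third of the selected genes are false positives, because unaffected genes greatly outnumber affected ones.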

Calculation of p-values

- Standard methods for calculating p-values:
- (i) Refer to a statistical distribution table (Normal, t, F, ...) or
- (ii) Perform a permutation analysis

(i) Tabulated p-values

- Tabulated p-values can be obtained for standard test statistics (e.g. the t-test)
- They often rely on the assumption of normally distributed errors in the data
- This assumption can be checked (approximately) using a
- Histogram
- Q-Q plot

Example

- Golub data, 27 ALL vs 11 AML samples, 3051 genes
- A t-test yields 1045 genes with p < 0.05

(ii) Permutation tests

- Based on data shuffling. No distributional assumptions
- Random interchange of labels between samples
- Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics
- Repeat for every possible permutation, $b = 1, \dots, B$:
- Permute the n data points for the gene (x). The first $n_1$ are referred to as treatments, the second $n_2$ as controls
- For each gene, calculate the corresponding two-sample t-statistic, $t_b$
- After all the B permutations are done, put $p = \#\{b : |t_b| \ge |t_{observed}|\} / B$
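The permutation procedure can be sketched for a single gene as follows. Two choices here are assumptions rather than part of the slides: the t-statistic uses the unequal-variance (Welch) standard error, and all possible relabellings are enumerated exactly, which is only feasible for small groups. The data values are hypothetical.

```python
import itertools
import math
import statistics


def t_statistic(x, y):
    """Two-sample t-statistic with an unequal-variance standard error."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


def permutation_p_value(treated, control):
    """Two-sided p-value over every way of relabelling the pooled data."""
    pooled = treated + control
    n1 = len(treated)
    t_obs = abs(t_statistic(treated, control))
    count = total = 0
    for idx in itertools.combinations(range(len(pooled)), n1):
        chosen = set(idx)
        x = [pooled[i] for i in idx]
        y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that ties are not lost to rounding error.
        if abs(t_statistic(x, y)) >= t_obs - 1e-12:
            count += 1
    return count / total


# Hypothetical log-expression values for one gene, 4 treated vs 4 controls.
treated = [2.1, 2.5, 1.9, 2.8]
control = [0.2, -0.1, 0.4, 0.1]
p_value = permutation_p_value(treated, control)
print(p_value)
```

With 4 vs 4 samples there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70; this granularity is one practical limit of permutation p-values with few replicates.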

Permutation tests (2)

[Figure: volcano plot of fold change vs log(odds) (1), with regions labelled "Significant change detected" and "No change detected".]

(1) log(odds) is proportional to -log(p-value)

Linear models and Analysis of Variance to analyze designed experiments

From experimental design to linear models

- Some weaknesses of the statistical framework seen so far
- What to do if a treatment has more than 2 levels?
- How to deal with more than one treatment or experimental condition?
- How to deal with nuisance factors such as batch effects, covariates, etc.?
- Most of this can be solved with an alternative approach: Analysis of Variance

Multiple testing

How far can we trust the decision?

- The test "Reject H0 if p-value $\le \alpha$"
- is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold
- Nothing is guaranteed about P[FN]
- Optimal tests are built trying to minimize this probability
- In practical situations it is often high

What if we wish to test more than one gene at once? (1)

- Consider more than one test at once
- Two independent tests, each at the 5% level. Now the probability of getting at least one false positive is $1 - 0.95 \times 0.95 = 0.0975$
- Three tests: $1 - 0.95^3 = 0.1426$
- n tests: $1 - 0.95^n$
- This converges towards 1 as n increases
- Small p-values don't necessarily imply significance! We are not controlling the probability of a type I error anymore
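The arithmetic above generalizes to any number of tests. The short sketch below assumes the tests are independent and that every null hypothesis is true; the function name is made up for the illustration.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) for n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 2, 3, 10, 100, 1000):
    print(n, round(prob_at_least_one_false_positive(n), 4))
```

Already at a few hundred tests the probability is practically 1, which is far below the thousands of genes tested on a typical array.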

What if we wish to test more than one gene at once? (2): a simulation

- Simulation of this process for 6,000 genes with 8 treatments and 8 controls
- All the gene expression values were simulated i.i.d. from a N(0,1) distribution, i.e. NOTHING is differentially expressed in our simulation
- The number of genes falsely rejected will be, on average, $6000 \times \alpha$, i.e. if we wanted to reject all genes with a p-value of less than 1% we would falsely reject around 60 genes
- See example
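A simulation of this kind can be sketched with the Python standard library alone. One simplification is assumed here and is not in the slides: instead of a t-test, each gene is tested with a two-sample z-test that treats the variance as known (equal to 1), so that p-values can be computed without a t-distribution table. The function name and the seed are arbitrary.

```python
import math
import random
import statistics


def simulate_null_rejections(n_genes=6000, n_per_group=8, alpha=0.01, seed=0):
    """Count genes rejected at level alpha when no gene is truly changed."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # known sigma = 1 in both groups
    std_normal = statistics.NormalDist()
    rejected = 0
    for _ in range(n_genes):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (statistics.fmean(treated) - statistics.fmean(control)) / se
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-sided
        if p_value < alpha:
            rejected += 1
    return rejected


n_false = simulate_null_rejections()
print(n_false, "genes rejected although none is differentially expressed")
```

The count varies from run to run around the expected 6000 x 0.01 = 60, and every one of those genes is a false positive.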

Multiple testing: Counting errors

| State of nature ("Truth") | Decision reported: H0 is rejected (genes selected) | Decision reported: H0 is accepted (genes not selected) | Total |
|---|---|---|---|
| H0 is false (Affected) | $m_\alpha - \alpha m_0$ (S) | $(m - m_0) - (m_\alpha - \alpha m_0)$ (T) | $m - m_0$ |
| H0 is true (Not affected) | $\alpha m_0$ (V) | $m_0 - \alpha m_0$ (U) | $m_0$ |
| Total | $m_\alpha$ (R) | $m - m_\alpha$ ($m - R$) | $m$ |

- V = Type I errors (false positives)
- T = Type II errors (false negatives)
- All these quantities could be known if $m_0$ were known

How does type I error control extend to multiple testing situations?

- Selecting genes with a p-value less than $\alpha$ doesn't control P[FP] anymore
- What can be done?
- Extend the idea of type I error
- FWER and FDR are two such extensions
- Look for procedures that control the probability for these extended error types
- Mainly: adjust raw p-values

Two main error rate extensions

- Family Wise Error Rate (FWER)
- FWER is the probability of at least one false positive
- $FWER = \Pr(\#\text{ of false discoveries} > 0) = \Pr(V > 0)$
- False Discovery Rate (FDR)
- FDR is the expected value of the proportion of false positives among rejected null hypotheses
- $FDR = E[V/R;\ R > 0] = E[V/R \mid R > 0] \cdot P[R > 0]$

FDR and FWER controlling procedures

- FWER
- Bonferroni (adjusted p-value = $\min(n \cdot p, 1)$)
- Holm (1979)
- Hochberg (1988)
- Westfall & Young (1993): maxT and minP
- FDR
- Benjamini & Hochberg (1995)
- Benjamini & Yekutieli (2001)
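Two of the procedures listed above, Bonferroni and Benjamini & Hochberg, are simple enough to sketch directly as adjusted p-values. The function names and the example p-values are made up for the illustration; the Benjamini & Hochberg adjustment is written in its usual step-up form (multiply the i-th smallest p-value by n/i, then enforce monotonicity from the largest downwards).

```python
def bonferroni(p_values):
    """Bonferroni adjusted p-values: min(n * p, 1)."""
    n = len(p_values)
    return [min(n * p, 1.0) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini & Hochberg (step-up) adjusted p-values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running
    # minimum so that adjusted values never decrease with the raw p-value.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print("Bonferroni:", bonf)
print("BH:        ", bh)
```

At a 0.05 cut-off Bonferroni keeps two of these four genes while Benjamini & Hochberg keeps all four, which illustrates the trade-off between the two error rates.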

Difference between controlling FWER or FDR

- FWER: controls for no (0) false positives
- gives many fewer genes (fewer false positives),
- but you are likely to miss many true positives
- adequate if the goal is to identify a few genes that differ between two groups
- FDR: controls the proportion of false positives
- if you can tolerate more false positives
- you will get many fewer false negatives
- adequate if the goal is to pursue the study, e.g. to determine functional relationships among genes

Steps to generate a list of candidate genes revisited (2)

[Figure: flowchart: nominal p-values $P_1, P_2, \dots, P_G$; adjusted p-values $aP_1, aP_2, \dots, aP_G$; select genes with adjusted p-values smaller than $\alpha$; a list of candidate DE genes.]

Example

- Golub data, 27 ALL vs 11 AML samples, 3051 genes
- Bonferroni adjustment: 98 genes with $p_{adj} < 0.05$ ($p_{raw} < 0.000016$)

Extensions

- Some issues we have not dealt with
- Replicates within and between slides
- Several effects: use a linear model
- ANOVA: are the effects equal?
- Time series: selecting genes for trends
- Different solutions have been suggested for each problem
- Still many open questions

Examples

Ex. 1: Swirl zebrafish experiment

- Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis
- As a result some cell types are reduced and others are expanded
- A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to the wild-type zebrafish

Example 1: Experimental design

- Each microarray contained 8848 cDNA probes (either genes or EST sequences)
- 4 replicate slides: 2 sets of dye-swap pairs
- For each pair, target cDNA of the swirl mutant was labeled using one of Cy5 or Cy3 and the target cDNA of the wild type was labeled using the other dye

[Figure: design diagram linking Wild type and Swirl, with 2 slides in each dye orientation.]

Example 1: Data analysis

- Gene expression data on 8848 genes for 4 samples (slides), each hybridized with mutant and wild type
- On a gene-per-gene basis this is a one-sample problem
- Hypothesis to be tested for each gene:
- $H_0: \log_2(R/G) = 0$
- The decision will be based on average log-ratios
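One practical detail of a dye-swap design is that the log-ratio changes sign on the swapped slides, so it has to be flipped before averaging. The sketch below illustrates this for a single gene; the log-ratio values, the slide order and the function names are hypothetical.

```python
import math
import statistics


def dye_swap_corrected(log_ratios, swapped):
    """Flip log2(R/G) on dye-swapped slides so all read mutant vs wild type."""
    return [-m if is_swapped else m for m, is_swapped in zip(log_ratios, swapped)]


def one_sample_t(values):
    """One-sample t statistic for H0: mean log-ratio = 0."""
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(len(values)))


# Hypothetical log2(R/G) for one gene on the 4 slides; slides 2 and 4 are
# assumed to be the dye-swapped ones.
m = [1.2, -0.9, 1.1, -1.3]
swapped = [False, True, False, True]
corrected = dye_swap_corrected(m, swapped)
t_value = one_sample_t(corrected)
print(corrected, round(t_value, 2))
```

Without the sign flip the four values would average to nearly zero and a clearly changed gene would be missed.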

Example 2: Scavenger receptor BI (SR-BI) experiment

- Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice.
- Transgenic mice with the SR-BI gene overexpressed have low HDL cholesterol levels.
- Goal: to identify genes with altered expression in the livers of transgenic mice with the SR-BI gene overexpressed (T) compared to normal FVB control mice (C).

Example 2: Experimental design

- 8 treatment mice ($T_i$) and 8 control ones ($C_i$).
- 16 hybridizations: liver mRNA from each of the 16 mice ($T_i$, $C_i$) is labelled with Cy5, while pooled liver mRNA from the control mice (C) is labelled with Cy3.
- Probes: 6,000 cDNAs (genes), including 200 related to pathogenicity.

[Figure: design diagram: 8 T slides and 8 C slides, each hybridized against the pooled control C.]

Example 2: Data analysis

- Gene expression data on 6348 genes for 16 samples: 8 for treatment (log(T/C)) and 8 for control (log(C/C))
- On a gene-per-gene basis this is a two-sample problem
- Hypothesis to be tested for each gene:
- $H_0: \log(R_1/G) - \log(R_2/G) = 0$
- The decision will be based on the average difference of log ratios

Software for microarray data analysis

Introduction

- Microarray experiments generate huge quantities of data which have to be
- Stored, managed, visualized, processed
- Many options are available. However
- No tool satisfies all users' needs
- Trade-off. A tool must be
- Powerful but user friendly
- Complete but without too many options
- Flexible but easy to start with and go further
- Available, up to date and well documented, but affordable

So, what you need is R?

- R is an open-source system for statistical computation and graphics. It consists of
- A language
- A run-time environment with
- Graphics, a debugger, and
- Access to certain system functions
- It can be used
- Interactively, through a command language
- Or running programs stored in script files

http://www.r-project.org/

Some pros and cons

- Powerful,
- Used by statisticians
- Easy to extend
- Creating add-on packages
- Many already available
- Freely available
- Unix, Windows, Mac
- Lots of documentation

- Not very easy to learn
- Command-based
- Documentation sometimes cryptic
- Memory intensive
- Worse in Windows
- Slow at times

- We believe it is worth the effort!
- If you just want to do statistical analysis: it is easy to find alternatives
- If you intend to do microarray data analysis: probably one of the best options

R and Microarrays

- R is a popular tool among statisticians
- Once they started to work with microarrays they continued using it
- To perform the analysis
- To implement new tools
- This very quickly gave rise to lots of free R-based software to analyze microarrays
- The Bioconductor project groups many of these (but not all) developments

The Bioconductor project

- Open source and open development software project for the analysis and comprehension of genomic data.
- Most early developments as R packages.
- Extensive documentation and training material from short courses: http://www.bioconductor.org/workshop.html.
- Has reached some stability but is still evolving: what is now a standard may not be so in the future.

There's much more than R!

- Have a look at
- "My microarray software comparison"
- http://ihome.cuhk.edu.hk/b400559/arraysoft.html
