Title: Some Statistical Issues in Microarray Data Analysis
1Some Statistical Issues in Microarray Data
Analysis
- Alex Sánchez
- Estadística i Bioinformàtica
- Departament d'Estadística, Universitat de Barcelona
- Unitat d'Estadística i Bioinformàtica, IR-HUVH
2Outline
- Introduction
- Experimental design
- Selecting differentially expressed genes
- Statistical tests
- Significance testing
- Linear models and Analysis of the variance
- Multiple testing
- Software for microarray data analysis
3Introduction
4Microarray experiments Overview
5Why are we talking of statistics?
- A microarray experiment is, as its name says, an experiment; that is:
- It has been performed to determine whether some previous hypotheses are true or false (although it can also lead to new hypotheses)
- It is subject to errors, which may arise from many sources
6Sources of variability
- Biological Heterogeneity in Population
- Specimen Collection/ Handling Effects
- Tumor surgical bx, FNA
- Cell Line culture condition, confluence level
- Biological Heterogeneity in Specimen
- RNA extraction
- RNA amplification
- Fluor labeling
- Hybridization
- Scanning
- PMT voltage
- laser power
(Geschwind, Nature Reviews Neuroscience, 2001)
7Categories of variability
- Systematic variability
- Amount of RNA in the biopsy
- Efficiencies of lab procedures such as
- RNA extraction,
- reverse transcription,
- Labeling or
- photodetection
- Random variation
- PCR yield
- DNA quality
- spotting efficiency,
- spot size
- cross-/unspecific hybridization
- stray signal
8Dealing with systematic variability
- Systematic variability has similar effects on many measurements
- Corrections can be estimated from data
- CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability
9Dealing with random variation
- Random variation cannot be explicitly accounted for
- The usual way to deal with it is to assume some ERROR MODEL (e.g. $\varepsilon_i \sim N(0, \sigma^2)$)
- Assuming these error models are true
- EXPERIMENTAL DESIGN is (must be) used to control the action of random variation
- STATISTICAL INFERENCE is (must be) used to extract conclusions in the presence of random variation
10Workflow of a microarray study (diagram)
Biological question → Experimental design → Microarray experiment → Image analysis → Normalization → Analysis (Estimation, Testing, Clustering, Discrimination) → Biological verification and interpretation
Quality measurement: Pass / Failed. Part of the workflow is marked "Today" in the diagram.
11Experimental design
12Why experimental design?
- The objective of experimental design is to make the analysis of the data and the interpretation of the results
- As simple and as powerful as possible
- Given the purpose of the experiment
- And the constraints of the experimental material
13Scientific aims and design choice
- The primary focus of the experiment needs to be clearly stated, whether it is
- to identify differentially expressed genes
- to search for specific gene-expression patterns
- to identify phenotypic subclasses
- Aim of the experiment guides design choice
- Sometimes only one choice is reasonable
- Sometimes different options available
14Designing microarray experiments
- The appropriate design of a microarray experiment must consider
- Design of the array
- Allocation of mRNA samples to the slides
15I Layout of the array
- Which sequences to use
- cDNAs? Selection of cDNA from library
- Riken, NIA, etc
- Affymetrix? PMs and MMs
- Oligo probes selection (from Operon, Agilent, etc)
- Control probes
- What? Where should controls be put?
- How many sequences to use
- Should there be replicate spots within a slide?
16II Allocating samples in slides
- Types of Samples
- Replication technical vs biological
- Pooled vs individual samples
- Different design layout / data analysis
- Scientific aim of the experiment
- Efficiency, Robustness, Extensibility
- Physical limitations (cost)
- Number of slides
- Amount of material
17Basic principles of experimental design
- Apply the following principles to best attain the objectives of experimental design
- Replication
- Local control or Blocking
- Randomization
181. Replication
- It's important
- To reduce uncertainty (increase precision)
- To obtain sufficient power for the tests
- As a formal basis for inferential procedures
- Consider different types of replicates
- Technical
- Duplicate spots
- Multiple hybridizations from the same sample
- Biological
- Replicate most whatever is expected to vary most!
19Biological vs Technical Replicates
_at_ Nature reviews G. Churchill (2002)
20Replication vs Pooling
- mRNA from different samples is often combined to form a pooled sample or pool. Why?
- If each sample doesn't yield enough mRNA
- To compensate for an excess of variability?
- Statisticians tend not to like it, but pooling may be OK if properly done
- Combine several samples in each pool
- Use several pools from different samples
- Do not use pools when individual information is important (e.g. paired designs)
212. Blocking
- Assume we wish to perform an experiment to compare two treatments.
- The samples or their processing may not be homogeneous: there are blocks
- Subjects: Male/Female
- Arrays produced in two lots (February, March)
- If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded
- Are observed differences attributable to the treatment effect or to confounding factors?
22Confounding block with treatment effects
- Two alternative designs to investigate treatment effects
- Left: Treatment effects confounded with Sex and Batch effects
- Right: Treatments are balanced between blocks
- Influence of blocks is automatically compensated
- Statistical analysis may separate block from treatment effect
233. Randomisation
- Randomly assigning samples to groups to eliminate unspecific disturbances
- Randomly assign individuals to treatments.
- Randomise the order in which experiments are performed.
- Randomisation is required to ensure the validity of statistical procedures.
- Block what you can and randomize what you cannot
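The combination of blocking and randomisation can be sketched in a few lines of Python. This is only an illustration, not something taken from the original slides: the function name `randomize_within_blocks`, the sample identifiers and the block labels are all hypothetical. Within each block the samples are shuffled and treatments are then dealt out in turn, so every block receives a balanced share of each treatment.

```python
import random

def randomize_within_blocks(samples_by_block, treatments, seed=None):
    """Randomly assign treatments to samples, separately inside each block.

    samples_by_block: dict mapping a block label to a list of sample ids.
    treatments: list of treatment labels; they are dealt out in turn after
    shuffling, so a block is exactly balanced when its size is a multiple
    of len(treatments).
    """
    rng = random.Random(seed)
    assignment = {}
    for block, samples in samples_by_block.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # random order inside the block
        for i, sample in enumerate(shuffled):
            assignment[sample] = treatments[i % len(treatments)]
    return assignment

# Hypothetical example: sex is the blocking factor.
blocks = {"male": ["m1", "m2", "m3", "m4"], "female": ["f1", "f2", "f3", "f4"]}
design = randomize_within_blocks(blocks, ["treated", "control"], seed=1)
for sample, treatment in sorted(design.items()):
    print(sample, treatment)
```

Fixing `seed` makes the allocation reproducible, which is useful when the design has to be documented.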
24Experimental layout
- How are mRNA samples assigned to arrays?
- The experimental layout has to be chosen so that the resulting analysis can be done as efficiently and robustly as possible
- Sometimes there is only one reasonable choice
- Sometimes several choices are available
25Example I Only one design choice
- Case 1: Meaningful biological control (C)
- Samples: Liver tissue from 4 mice treated by cholesterol modifying drugs.
- Question 1: Genes that respond differently between the T and the C.
- Question 2: Genes that responded similarly across two or more treatments relative to control.
- Case 2: Use of universal reference.
- Samples: Different tumor samples.
- Question: To discover tumor subtypes.
26Example 2 a number of different designs are
suitable for use (1)
- Time course experiments
- Design choice depends on the comparisons of
interest
27How can we decide?
- A-optimality: choose the design which minimizes the variance of the estimates of the effects of interest
- A simple example: Direct vs indirect estimates
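The slide's worked example is not preserved in this transcript. A common way to illustrate direct versus indirect estimation is the following, under assumptions that are ours and not the original's: two slides are available, every log-ratio $M$ is measured independently with the same variance $\sigma^2$, and $R$ denotes a reference sample.

```latex
\begin{align*}
\text{Direct (T vs C on both slides):}\quad
  & \hat{\delta}_{\mathrm{dir}} = \tfrac{1}{2}\,(M_1 + M_2),
  & \operatorname{Var}(\hat{\delta}_{\mathrm{dir}}) &= \frac{\sigma^2}{2} \\
\text{Indirect (T vs R and C vs R):}\quad
  & \hat{\delta}_{\mathrm{ind}} = M_{T/R} - M_{C/R},
  & \operatorname{Var}(\hat{\delta}_{\mathrm{ind}}) &= 2\,\sigma^2
\end{align*}
```

With the same number of slides, the direct design yields a variance four times smaller under these assumptions, which is the kind of comparison an A-optimality criterion formalizes.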
28Summary
- Selection of mRNA samples is important
- Most important biological replicates
- Technical replicates also useful, but different
- If needed and possible use pooling wisely
- Choice of experimental layout guided by
- The scientific question
- Experimental design principles
- Efficiency and robustness considerations
- Correspondence between experimental
Designs-Linear Models-ANOVA can be exploited to
select model and analyze data
29Experimental design, Linear Models and Analysis
of the Variance
- In experimental design the different sources of variability influencing the observed response may be identified.
- These sources can be related to the response using a linear model
- Analysis of variance can be used to separately estimate and test the relative importance of each source of variability.
30Statistical methods to detect differentially
expressed genes
31Class comparison Identifying differentially
expressed genes
- Identify genes differentially expressed between different conditions such as
- Treatment, cell type,... (qualitative covariates)
- Dose, time, ... (quantitative covariate)
- Survival, infection time,... !
- Estimate effects/differences between groups, probably using log-ratios, i.e. the difference on log scale: $\log(X) - \log(Y) = \log(X/Y)$
32What is a significant change?
- Depends on the variability within groups, which may be different from gene to gene.
- To assess the statistical significance of differences, conduct a statistical test for each gene.
33Different settings for statistical tests
- Indirect comparisons: 2 groups, 2 samples, unpaired
- E.g. 10 individuals: 5 suffer diabetes, 5 healthy
- One sample from each individual
- Typically: Two-sample t-test or similar
- Direct comparisons: Two groups, two samples, paired
- E.g. 6 individuals with brain stroke.
- Two samples from each: one from healthy (region 1) and one from affected (region 2).
- Typically: One-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions.
34Different ways to do the experiment
- An experiment may use cDNA arrays (two-colour) or Affymetrix arrays (one-colour).
- Depending on the technology used, the allocation of conditions to slides changes.
Experiment | cDNA (2-col) | Affy (1-col)
10 indiv.: Diab (5), Heal (5) | Reference design: (5) Diab/Ref, (5) Heal/Ref | Comparison design: (5) Diab vs (5) Heal
6 indiv.: Region 1, Region 2 | 6 slides, 1 individual per slide: (6) reg1/reg2 | 12 slides: (6) Paired differences
35Natural measures of discrepancy
For Direct comparisons in two colour or
paired-one colour.
For Indirect comparisons in two colour or
Direct comparisons in one colour.
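The formulas that accompanied this slide are not preserved in the transcript. As a hedged illustration, the two usual discrepancy measures can be sketched as follows: a one-sample (paired) t statistic on per-individual differences or log-ratios, and a two-sample t statistic for unpaired groups. The function names are hypothetical, and the two-sample version shown is the Welch form, which is an assumption on our part (the slide may have used the pooled-variance form).

```python
import math
from statistics import mean, stdev

def paired_t(differences):
    """One-sample t statistic on per-individual differences (or log-ratios).

    Undefined (division by zero) if all differences are identical.
    """
    n = len(differences)
    return mean(differences) / (stdev(differences) / math.sqrt(n))

def two_sample_t(x, y):
    """Welch two-sample t statistic; does not assume equal variances."""
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se

# Hypothetical values for a single gene.
print(paired_t([0.8, 1.1, 0.9, 1.3, 0.7, 1.0]))            # direct / paired setting
print(two_sample_t([2.1, 2.4, 1.9, 2.6, 2.2], [1.0, 1.3, 0.8, 1.1, 1.2]))  # unpaired setting
```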
36Some Issues
- Can we trust average effect sizes (average difference of means) alone?
- Can we trust the t statistic alone?
- Here is evidence that the answer is no.
Gene  M1    M2    M3     M4    M5    M6    Mean   SD    t
A     2.5   2.7   2.5    2.8   3.2   2     2.61   0.40  16.10
B     0.01  0.05  -0.05  0.01  0     0     0.003  0.03  0.25
C     2.5   2.7   2.5    1.8   20    1     5.08   7.34  1.69
D     0.5   0     0.2    0.1   -0.3  0.3   0.13   0.27  1.19
E     0.1   0.11  0.1    0.1   0.11  0.09  0.10   0.01  33.09
Courtesy of Y.H. Yang
- Averages can be driven by outliers.
- t statistics can be driven by tiny variances.
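The summary columns of the table can be recomputed from the six log-ratios of each gene. The sketch below assumes, as the table suggests, that SD is the sample standard deviation and that t is the one-sample statistic mean / (SD / sqrt(6)); small differences from the printed values are due to rounding.

```python
import math
from statistics import mean, stdev

# Log-ratios M1..M6 for the five genes of the table.
log_ratios = {
    "A": [2.5, 2.7, 2.5, 2.8, 3.2, 2.0],
    "B": [0.01, 0.05, -0.05, 0.01, 0.0, 0.0],
    "C": [2.5, 2.7, 2.5, 1.8, 20.0, 1.0],
    "D": [0.5, 0.0, 0.2, 0.1, -0.3, 0.3],
    "E": [0.1, 0.11, 0.1, 0.1, 0.11, 0.09],
}

def summarize(values):
    """Return (mean, sample SD, one-sample t statistic) for one gene."""
    m = mean(values)
    s = stdev(values)
    t = m / (s / math.sqrt(len(values)))
    return m, s, t

for gene, values in log_ratios.items():
    m, s, t = summarize(values)
    print(f"{gene}: mean={m:.3f} sd={s:.3f} t={t:.2f}")
```

Gene C has the largest mean only because of the single value 20, while gene E has the largest t only because its variance is tiny, which is the point of the slide.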
39Variations in t-tests (1)
- Let
- $R_g$: mean observed log ratio
- $SE_g$: standard error of $R_g$ estimated from data on gene g.
- $SE$: standard error of $R_g$ estimated from data across all genes.
- Global t-test: $t = R_g / SE$
- Gene-specific t-test: $t = R_g / SE_g$
40Some pros and cons of t-test
Test | Pros | Cons
Global t-test, $t = R_g / SE$ | Yields stable variance estimate | Assumes variance homogeneity; biased if false
Gene-specific t-test, $t = R_g / SE_g$ | Robust to variance heterogeneity | Low power; yields unstable variance estimates (due to few data)
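The contrast between the two statistics can be made concrete with a small sketch. The slides do not say how the global SE is obtained; here it is assumed, for illustration only, to come from the average of the per-gene sample variances, with the same number of replicates for every gene. Function names and data are hypothetical.

```python
import math
from statistics import mean, variance

def gene_specific_t(values):
    """t = R_g / SE_g, with SE_g estimated from this gene's replicates only."""
    return mean(values) / math.sqrt(variance(values) / len(values))

def global_t(data):
    """t = R_g / SE, with one SE shared by all genes.

    Assumption: the common variance is the average of per-gene variances
    and every gene has the same number of replicates.
    """
    pooled_var = mean(variance(v) for v in data.values())
    return {g: mean(v) / math.sqrt(pooled_var / len(v)) for g, v in data.items()}

# Hypothetical log-ratios: one gene with a real shift, one with tiny variance.
data = {
    "shifted": [2.5, 2.7, 2.5, 2.8, 3.2, 2.0],
    "tiny_variance": [0.1, 0.11, 0.1, 0.1, 0.11, 0.09],
}
shared = global_t(data)
for gene, values in data.items():
    print(gene, round(gene_specific_t(values), 2), round(shared[gene], 2))
```

With a shared SE the tiny-variance gene no longer produces an extreme statistic, at the price of assuming that all genes have the same variance.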
41T-tests extensions
SAM (Tibshirani, 2001)
Regularized-t (Baldi, 2001)
EB-moderated t (Smyth, 2003)
42Up to here: Can we generate a list of candidate genes?
With the tools we have, the reasonable steps to generate a list of candidate genes may be
?
A list of candidate DE genes
We need an idea of how significant these values are. We'd like to assign them p-values.
43Significance testing
44Nominal p-values
- After a test statistic is computed, it is convenient to convert it to a p-value: the probability that a test statistic, say S(X), takes values equal to or greater than that taken on the observed sample, say S(X0), under the assumption that the null hypothesis is true: $p = P(S(X) \geq S(X_0) \mid H_0 \text{ true})$
45Significance testing
- Test of significance at the α level
- Reject the null hypothesis if your p-value is smaller than the significance level
- It has advantages but is not free from criticism
- Genes with p-values falling below a prescribed level may be regarded as significant
46Hypothesis testing overview for a single gene
Reported decision | H0 is rejected (gene is selected) | H0 is accepted (gene not selected) |
Truth: H0 is false (Affected) | TP, prob 1-β (power) | FN, prob β, Type II error | Sensitivity = TP/(TP+FN)
Truth: H0 is true (Not Affected) | FP, $P(\text{Rej } H_0 \mid H_0) < \alpha$, Type I error | TN, prob 1-α | Specificity = TN/(TN+FP)
 | Positive predictive value = TP/(TP+FP) | Negative predictive value = TN/(TN+FN) |
47Calculation of p-values
- Standard methods for calculating p-values
- (i) Refer to a statistical distribution table (Normal, t, F, ...) or
- (ii) Perform a permutation analysis
48(i) Tabulated p-values
- Tabulated p-values can be obtained for standard test statistics (e.g. the t-test)
- They often rely on the assumption of normally distributed errors in the data
- This assumption can be checked (approximately) using a
- Histogram
- Q-Q plot
49Example
- Golub data, 27 ALL vs 11 AML samples, 3051 genes
- A t-test yields 1045 genes with p < 0.05
50(ii) Permutation tests
- Based on data shuffling. No distributional assumptions
- Random interchange of labels between samples
- Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics
- Repeat for every possible permutation, b = 1, ..., B
- Permute the n data points for the gene (x). The first n1 are referred to as treatments, the second n2 as controls
- For each gene, calculate the corresponding two-sample t-statistic, $t_b$
- After all the B permutations are done, put $p = \#\{b : |t_b| \geq |t_{\text{observed}}|\} / B$
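The procedure can be sketched for a single gene as below. Two points are assumptions of this sketch rather than statements from the slides: instead of enumerating every possible permutation, B random permutations are drawn (a Monte Carlo approximation), and the two-sample statistic is the Welch form. The function names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def t_stat(x, y):
    """Welch two-sample t statistic, with a guard for zero standard error."""
    diff = mean(x) - mean(y)
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0.0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(treat, control, n_perm=2000, seed=None):
    """Two-sided permutation p-value: #{b : |t_b| >= |t_observed|} / B."""
    rng = random.Random(seed)
    observed = abs(t_stat(treat, control))
    pooled = list(treat) + list(control)
    n1 = len(treat)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the n samples
        # the first n1 values play the role of treatments, the rest of controls
        if abs(t_stat(pooled[:n1], pooled[n1:])) >= observed:
            count += 1
    return count / n_perm

# Hypothetical expression values for one gene.
treated = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
controls = [1.0, 1.2, 0.9, 1.1, 0.8, 1.3]
print(permutation_p_value(treated, controls, seed=0))
```

For small groups the number of distinct relabellings is limited, so the smallest attainable p-value is bounded below regardless of how large B is.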
51Permutation tests (2)
52Volcano plot: fold change vs log(odds) (1)
Significant change detected
No change detected
(1) log(odds) is proportional to -log(p-value)
53Linear models and Analysis of the Variance to
analyze designed experiments
54From experimental design to linear models
- Some weaknesses of the statistical framework presented so far
- What to do if a treatment has more than 2 levels?
- How to deal with more than one treatment or experimental condition?
- How to deal with nuisance factors such as batch effects, covariates, etc?
- Most of this can be solved with an alternative approach: Analysis of Variance
55Multiple testing
56How far can we trust the decision?
- The test "Reject H0 if p-value < α"
- is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold
- Nothing is guaranteed about P[FN]
- Optimal tests are built trying to minimize this probability
- In practical situations it is often high
57What if we wish to test more than one gene at
once? (1)
- Consider more than one test at once
- Two tests, each at the 5% level. Now the probability of getting at least one false positive is $1 - 0.95 \times 0.95 = 0.0975$
- Three tests → $1 - 0.95^3 = 0.1426$
- n tests → $1 - 0.95^n$
- Converges towards 1 as n increases
- Small p-values don't necessarily imply significance!!! → We are not controlling the probability of type I error anymore
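The arithmetic above is easy to check numerically. The sketch assumes, as the slide implicitly does, that the tests are independent; the function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 2, 3, 10, 100, 1000):
    print(n, round(prob_at_least_one_false_positive(n), 4))
```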
58What if we wish to test more than one gene at
once? (2) a simulation
- Simulation of this process for 6,000 genes with 8 treatments and 8 controls
- All the gene expression values were simulated i.i.d. from a N(0,1) distribution, i.e. NOTHING is differentially expressed in our simulation
- The number of genes falsely rejected will be on average (6000 · α), i.e. if we wanted to reject all genes with a p-value of less than 1% we would falsely reject around 60 genes
- See example
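The example referred to on the slide is not included in this transcript; a simulation of the same kind can be sketched as follows. To stay within the Python standard library, a gene is declared significant at the 1% level when the absolute pooled-variance t statistic exceeds 2.977, the two-sided 1% critical value of the t distribution with 14 degrees of freedom taken from standard tables. That critical value is only valid for 8 samples per group, and the function names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def pooled_t(x, y):
    """Equal-variance two-sample t statistic (n1 + n2 - 2 degrees of freedom)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def count_false_rejections(n_genes=6000, n_per_group=8, t_crit=2.977, seed=None):
    """Simulate null genes from N(0, 1) and count those with |t| > t_crit.

    t_crit=2.977 corresponds to a two-sided 1% test with 14 df,
    i.e. it assumes n_per_group == 8.
    """
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_genes):
        treat = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(pooled_t(treat, control)) > t_crit:
            rejected += 1
    return rejected

print(count_false_rejections(seed=42))  # expected to be near 6000 * 0.01 = 60
```

Every rejection in this simulation is a false positive by construction, since no gene is differentially expressed.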
59Multiple testing Counting errors
Decision reported | H0 is rejected (genes selected) | H0 is accepted (genes not selected) | Total
Truth: H0 is false (Affected) | $m_\alpha - \alpha m_0$ (S) | $(m - m_0) - (m_\alpha - \alpha m_0)$ (T) | $m - m_0$
Truth: H0 is true (Not Affected) | $\alpha m_0$ (V) | $m_0 - \alpha m_0$ (U) | $m_0$
Total | $m_\alpha$ (R) | $m - m_\alpha$ (m - R) | m
V = Type I errors (false positives). T = Type II errors (false negatives). All these quantities could be known if $m_0$ was known.
60How does type I error control extend to multiple
testing situations?
- Selecting genes with a p-value less than α doesn't control P[FP] anymore
- What can be done?
- Extend the idea of type I error
- FWER and FDR are two such extensions
- Look for procedures that control the probability for these extended error types
- Mainly adjust raw p-values
61Two main error rate extensions
- Family Wise Error Rate (FWER)
- FWER is the probability of at least one false positive
- $\mathrm{FWER} = \Pr(\#\text{ of false discoveries} > 0) = \Pr(V > 0)$
- False Discovery Rate (FDR)
- FDR is the expected value of the proportion of false positives among rejected null hypotheses
- $\mathrm{FDR} = E[V/R;\ R > 0] = E[V/R \mid R > 0] \cdot P[R > 0]$
62FDR and FWER controlling procedures
- FWER
- Bonferroni (adjusted p-value = min(n × p-value, 1))
- Holm (1979)
- Hochberg (1988)
- Westfall & Young (1993): maxT and minP
- FDR
- Benjamini & Hochberg (1995)
- Benjamini & Yekutieli (2001)
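Three of the procedures listed can be written as short p-value adjustments. The sketch below gives one FWER single-step method (Bonferroni), one FWER step-down method (Holm) and one FDR method (Benjamini and Hochberg). The function names are hypothetical, and the code is meant as an illustration of the standard definitions rather than as the implementation used for the results quoted in these slides.

```python
def bonferroni(pvalues):
    """Adjusted p = min(n * p, 1)."""
    n = len(pvalues)
    return [min(n * p, 1.0) for p in pvalues]

def holm(pvalues):
    """Holm step-down adjustment (controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # the multiplier shrinks from m down to 1; enforce monotonicity
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(running_max, 1.0)
    return adjusted

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print(bonferroni(raw))
print(holm(raw))
print(benjamini_hochberg(raw))
```

Genes are then selected by comparing the adjusted p-values with the chosen level, exactly as with raw p-values in the single-test case.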
63Difference between controlling FWER or FDR
- FWER → Controls the probability of having any (even one) false positive
- gives many fewer genes (fewer false positives),
- but you are likely to miss many
- adequate if the goal is to identify a few genes that differ between two groups
- FDR → Controls the expected proportion of false positives
- if you can tolerate more false positives
- you will get many fewer false negatives
- adequate if the goal is to pursue the study, e.g. to determine functional relationships among genes
64Steps to generate a list of candidate genes
revisited (2)
Nominal p-values: P1, P2, ..., PG
Adjusted p-values: aP1, aP2, ..., aPG
Select genes with adjusted p-values smaller than α
A list of candidate DE genes
65Example
- Golub data, 27 ALL vs 11 AML samples, 3051 genes
- Bonferroni adjustment: 98 genes with p_adj < 0.05 (p_raw < 0.000016)
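The raw threshold quoted follows directly from the Bonferroni rule with m = 3051 genes and level 0.05:

```latex
\[
p_{\mathrm{adj}} = \min(m \cdot p_{\mathrm{raw}},\, 1) < 0.05
\quad\Longleftrightarrow\quad
p_{\mathrm{raw}} < \frac{0.05}{3051} \approx 1.6 \times 10^{-5}
\]
```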
66Extensions
- Some issues we have not dealt with
- Replicates within and between slides
- Several effects: use a linear model
- ANOVA: are the effects equal?
- Time series: selecting genes for trends
- Different solutions have been suggested for each problem
- Still many open questions
67Examples
68Ex. 1- Swirl zebrafish experiment
- Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis
- As a result some cell types are reduced and others are expanded
- A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to the wild-type zebrafish
69Example 1 Experimental design
- Each microarray contained 8848 cDNA probes (either genes or EST sequences)
- 4 replicate slides: 2 sets of dye-swap pairs
- For each pair, target cDNA of the swirl mutant was labeled using one of Cy5 or Cy3 and the target cDNA of the wild type was labeled using the other dye
[Design diagram: Wild type vs Swirl, 2 slides in each dye orientation]
70Example 1. Data analysis
- Gene expression data on 8848 genes for 4 samples (slides), each hybridized with Mutant and Wild type
- On a gene-per-gene basis this is a one-sample problem
- Hypothesis to be tested for each gene
- $H_0: \log_2(R/G) = 0$
- The decision will be based on average log-ratios
71Example 2. Scavenger receptor BI (SR-BI)
experiment
- Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice.
- Transgenic mice with SR-BI gene overexpressed have low HDL cholesterol levels.
- Goal: To identify genes with altered expression in the livers of transgenic mice with the SR-BI gene overexpressed (T) compared to normal FVB control mice (C).
72Example 2. Experimental design
- 8 treatment mice (Ti) and 8 control ones (Ci).
- 16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C) is labelled with Cy3.
- Probes: 6,000 cDNAs (genes), including 200 related to pathogenicity.
[Design diagram: 8 T arrays and 8 C arrays, each hybridized against the pooled control C]
73Example 2. Data analysis
- Gene expression data on 6348 genes for 16 samples: 8 for treatment (log(T/C)) and 8 for control (log(C/C))
- On a gene-per-gene basis this is a 2-sample problem
- Hypothesis to be tested for each gene
- $H_0: \log(R_1/G) - \log(R_2/G) = 0$
- Decision will be based on average difference of log ratios
74Software for microarray data analysis
75Introduction
- Microarray experiments generate huge quantities of data which have to be
- Stored, managed, visualized, processed
- Many options available. However
- No tool satisfies all users' needs
- Trade-off. A tool must be
- Powerful but user friendly
- Complete but without too many options,
- Flexible but easy to start with and go further
- Available, up to date, well documented but affordable
76So, what you need is R?
- R is an open-source system for statistical computation and graphics. It consists of
- A language
- A run-time environment with
- Graphics, a debugger, and
- Access to certain system functions,
- It can be used
- Interactively, through a command language
- Or running programs stored in script files
77http://www.r-project.org/
78Some pros and cons
- Powerful,
- Used by statisticians
- Easy to extend
- Creating add-on packages
- Many already available
- Freely available
- Unix, Windows, Mac
- Lot of documentation
- Not very easy to learn
- Command-based
- Documentation sometimes cryptic
- Memory intensive
- Worse on Windows
- Slow at times
- We believe it is worth the effort!!!
- If you just want to do statistical analysis → Easy to find alternatives
- If you intend to do microarray data analysis → Probably one of the best options
79R and Microarrays
- R is a popular tool among statisticians
- Once they started to work with microarrays they continued using it
- To perform the analysis
- To implement new tools
- This very quickly gave rise to lots of free R-based software to analyze microarrays
- The Bioconductor project groups many of these (but not all) developments
80The Bioconductor project
- Open source and open development software project for the analysis and comprehension of genomic data.
- Most early developments as R packages.
- Extensive documentation and training material from short courses: http://www.bioconductor.org/workshop.html.
- Has reached some stability but is still evolving!!! → what is now a standard may not be so in the future.
81There's much more than R!
- Have a look at
- "My microarray software comparison"
- http://ihome.cuhk.edu.hk/b400559/arraysoft.html