Title: 3' A brief look into data analysis for microarray experiments
13. A brief look into data analysis for microarray experiments
- Alex Sánchez. Departament d'Estadística.
- Universitat de Barcelona
2Outline
- Introduction
- 2-fold change approach (2 conditions, 1 sample)
- T-tests and extensions (2 conditions, > 1 sample)
- Significance and multiple testing
- More than Two Conditions
- A brief introduction to clustering
3Introduction
4Identifying differentially expressed genes (filtering?)
- Goal: identify genes associated with a covariate or response of interest, such as
  - Qualitative covariates: treatment, cell type, ...
  - Quantitative covariates: dose, time
  - Responses: survival, infection time
  - Any combination of these!
- Selecting a subset of differentially expressed genes is sometimes called filtering
  - A previous step to other analysis procedures
5Life cycle
[Flow diagram: Biological question → Experimental design → Microarray experiment → Image analysis → Normalization → Analysis (estimation, testing, clustering, discrimination) → Biological verification and interpretation. A quality measurement step gives a Pass/Failed decision, and a "Today" label marks the stages covered in this talk.]
6The experimental frame
- We can distinguish different situations
  - No replicates (one chip/condition)
    - 2 conditions → one k-fold change analysis
    - > 2 conditions → several k-fold change analyses
  - There are replicates
    - 2 conditions on one chip → one-sample tests
    - 2 conditions on two chips → two-sample tests
    - More than 2 conditions → ANOVA or linear models
7Experimental frame for 2 conditions
- One chip per condition → 2-fold change
  - Null hypothesis H0: log2(Rg/Gg) = 0, g = 1, ..., G
  - Decision based on one value per gene
- Several chips → one- or two-sample tests
  - If the two samples are on the same array: replicated arrays
    - H0: log2(R/G) = 0. Decision based on average log ratios
  - If we have a common reference (both samples hybridize to the same control)
    - Hypothesis changes to H0: log2(R1/G) - log2(R2/G) = 0
    - Decision based on average difference of log ratios
8Two-fold change (two conditions, one single chip)
9Fold change (single slide) methods
- The observed (log2) ratio between two conditions is used
  - Arbitrary cut-off value (2-fold?)
  - Not a statistical test → no associated level of confidence
- Some known problems
  - Subject to bias if data are not properly normalized
  - Sensitive to variance heterogeneity across genes
  - Solutions have been suggested (next slide)
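The fold-change rule described on this slide can be sketched in a few lines. This is only an illustration: the intensities are made up, and the function names `log2_ratios` and `flag_fold_change` are hypothetical, not part of any microarray package.

```python
import math

def log2_ratios(red, green):
    """Return log2(R/G) for paired intensities (all values must be > 0)."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def flag_fold_change(red, green, k=2.0):
    """Flag genes whose |log2(R/G)| is at least log2(k); k=2 is the arbitrary cut-off."""
    cutoff = math.log2(k)
    return [abs(m) >= cutoff for m in log2_ratios(red, green)]

# Hypothetical intensities for four genes on a single two-colour chip.
red = [1200.0, 400.0, 950.0, 100.0]
green = [300.0, 410.0, 1000.0, 450.0]
flags = flag_fold_change(red, green)
print(flags)
```

Note that the rule returns a yes/no flag per gene with no measure of confidence, which is exactly the limitation listed above.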
10Origin of the 2-fold change approach
- DeRisi et al. (1997) found only 19/6300 false positives using this criterion
  - Perhaps correct for their experiment, but
  - Similar controls would be required for each new experiment → redefinition of the k-fold value
  - May be influenced by experimental factors
- A practical reason: often there are no replicates because they are too expensive
  - Making inferences with samples of size 1?!
11Some approaches to use of k-fold-change criteria (1)
- A naïve approach: standardize the data
  - Assumption of normality (following Chen et al., 1998)
  - Ratios can be converted into Z-scores
  - Each of which has an associated p-value
- Problems
  - Assumes normality and homoscedasticity
  - Must use robust estimates of centrality (e.g. median) and dispersion (e.g. MAD)
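A minimal sketch of the standardization idea, using the median and MAD as the robust estimates mentioned above. This is not Chen et al.'s actual procedure: the 1.4826 scaling (which makes the MAD comparable to a standard deviation under normality), the example log ratios and the function names are assumptions made for illustration.

```python
import math
import statistics

def robust_z_scores(log_ratios):
    """Standardize log ratios with the median (centre) and scaled MAD (spread)."""
    center = statistics.median(log_ratios)
    mad = statistics.median([abs(x - center) for x in log_ratios])
    scale = 1.4826 * mad  # MAD rescaled to match the SD of a normal distribution
    if scale == 0:
        raise ValueError("MAD is zero; z-scores are undefined")
    return [(x - center) / scale for x in log_ratios]

def two_sided_p(z):
    """Two-sided p-value of a z-score under a standard normal (an assumption)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical log2 ratios; the last gene is a clear outlier.
log_ratios = [-0.2, -0.1, 0.0, 0.1, 0.2, 3.0]
z = robust_z_scores(log_ratios)
p = [two_sided_p(v) for v in z]
```

The p-values are only as trustworthy as the normality and equal-variance assumptions listed under "Problems".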
12Some approaches to use of k-fold-change criteria (2)
- More sophisticated approaches have been proposed
  - Mainly based on relaxing distributional assumptions
- Newton et al.: Gamma-Gamma-Bernoulli hierarchical model for each (R,G).
- Roberts et al.: each (R,G) is assumed to be normally and independently distributed, with variance depending linearly on the mean.
- Sapir & Churchill: each log R/G is assumed to be distributed according to a mixture of normal and uniform distributions. Decision based on R/G only.
13Example: Matt Callow's Srb1 data (5). Newton's and Chen's single slide method
- It is not hard to do by eye
- The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future. Why?
14T-tests and extensions (2 conditions, several chips)
15Tests of Differential Expression between two conditions, several chips
- With several replicates per condition, the variability of gene expression can be taken into account on a gene-by-gene basis
- Natural measures of differential expression will be based on the mean, or difference of means, conveniently standardized
16Natural measures of discrepancy
Direct comparisons
Indirect comparisons
17Some Issues
- Can we trust average effect sizes (average difference of means) alone?
- Can we trust the t statistic alone?
- Here is evidence that the answer is no.
Courtesy of Y.H. Yang
18Some Issues
- Averages can be driven by outliers.
19Some Issues
- t statistics can be driven by tiny variances.
20Variations in t-tests (1)
- Let
  - Rg: mean observed log ratio for gene g
  - SEg: standard error of Rg estimated from data on gene g.
  - SE: standard error of Rg estimated from data across all genes.
- Global t-test: t = Rg/SE
- Gene-specific t-test: t = Rg/SEg
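The two statistics can be compared on a small example. The sketch below assumes equal numbers of replicates per gene and estimates the global SE by averaging the per-gene variances, which is only one possible pooling choice; the data and the name `t_statistics` are hypothetical.

```python
import math
import statistics

def t_statistics(data):
    """Return (global_t, gene_t) for a list of per-gene replicate log ratios.

    Assumes every gene has the same number of replicates, n.
    """
    n = len(data[0])
    means = [statistics.mean(row) for row in data]          # Rg
    variances = [statistics.variance(row) for row in data]  # per-gene sample variance
    # Assumption: the global SE pools variances by simple averaging across genes.
    global_se = math.sqrt(statistics.mean(variances) / n)   # SE
    gene_se = [math.sqrt(v / n) for v in variances]         # SEg
    global_t = [m / global_se for m in means]
    gene_t = [m / se for m, se in zip(means, gene_se)]
    return global_t, gene_t

# Hypothetical log ratios: 3 genes x 4 replicate arrays.
data = [
    [1.0, 1.2, 0.8, 1.0],    # consistent change, small variance
    [0.1, -0.1, 0.0, 0.0],   # no change
    [0.5, 2.5, -1.5, 0.5],   # noisy gene
]
global_t, gene_t = t_statistics(data)
```

For the first gene the gene-specific t is much larger than the global t because its own variance is far smaller than the pooled one, which is the behaviour the following slides warn about.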
21Some pros and cons of t-test
22Alternatives suggested
- SAM
- Regularized t-test
- B (Empirical bayes) statistic
- Others
23SAM t-test or S-test
- Adds a small constant c to the gene-specific standard error, S = Rg/(c + SEg), with c = 90th percentile of the SEg
- Genes with small fold changes will not be
selected as significant
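A rough sketch of the effect of the added constant, assuming the statistic has the form S = Rg/(c + SEg) with c taken as the 90th percentile of the SEg values. The nearest-rank percentile rule, the numbers and the function names are illustrative assumptions, not the SAM software's implementation.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]; one of several common conventions."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def s_statistics(means, std_errors, q=90.0):
    """S = Rg / (c + SEg), with c the q-th percentile of the SEg values."""
    c = percentile(std_errors, q)
    return [m / (c + se) for m, se in zip(means, std_errors)]

# Hypothetical values: gene 0 has a tiny fold change and a tiny standard error.
means = [0.05, 1.5, 0.8]
std_errors = [0.001, 0.3, 0.4]
t_like = [m / se for m, se in zip(means, std_errors)]  # ordinary gene-specific t
s = s_statistics(means, std_errors)
```

Gene 0 has the largest ordinary t but the smallest S, so a small fold change with a tiny variance no longer dominates the ranking.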
24Regularized t-test statistic
- ν0: relative contributions of σ and σg
- n: number of replicate measurements for each condition
25Can we generate a list of candidate genes?
With the tools we have, the reasonable steps to generate a list of candidate genes may be
→ A list of candidate DE genes
We need an idea of how significant these values are → We'd like to assign them p-values
26Significance and multiple testing
27Nominal p-values
- After a test statistic is computed, it is convenient to convert it to a p-value: the probability of occurrence of a test statistic equal to, or more extreme than, the observed value under the assumption that the null hypothesis is true
  - p = P(|S| ≥ |S0| | H0 true)
28Significance testing
- Test of significance at the α level
  - Reject the null hypothesis if your p-value is smaller than the significance level
  - It has advantages, but it is not free from criticism
- Genes with p-values falling below a prescribed level may be regarded as significant
29Calculation of p-values
- Standard methods for calculating p-values
  - (i) Refer to a statistical distribution table (Normal, t, F, ...) or
  - (ii) Perform a permutation analysis
30(i) Tabulated p-values
- Tabulated p-values can be obtained for standard test statistics (e.g. the t-test)
- They often rely on the assumption of normally distributed errors in the data
- This assumption can be checked (approximately) using a
  - Histogram
  - Q-Q plot
31Histogram QQ-plots of t-statistics
32More about QQ plots
- Not only useful for checking normality
- Also to identify genes with extreme t-values: values off the line, at one end or another
- Very useful with thousands of genes, but
  - We can't expect all differentially expressed genes to stand out as extremes
  - Many will be masked by more extreme random variation, which is a big problem in this context
33(ii) Permutation tests
- Based on data shuffling. No distributional assumptions
- Random interchange of labels between samples
- Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics
- Repeat for every possible permutation, b = 1, ..., B
  - Permute the n data points for the gene (x). The first n1 are referred to as treatments, the second n2 as controls.
  - For each gene, calculate the corresponding two-sample t-statistic, tb.
- After all the B permutations are done, put p = #{b : |tb| ≥ |tobserved|}/B
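The permutation procedure for a single gene can be sketched as follows. The sketch enumerates every way of relabelling n1 of the n values as treatments, uses a pooled-variance two-sample t (an assumption; the slides do not say which variant is meant), and reports the proportion of relabellings whose |t| is at least as large as the observed one. Data and function names are hypothetical.

```python
import itertools
import math
import statistics

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled * (1.0 / nx + 1.0 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se

def permutation_p_value(treated, control):
    """Exact two-sided permutation p-value over all label assignments."""
    values = list(treated) + list(control)
    n1 = len(treated)
    t_obs = abs(two_sample_t(treated, control))
    count = total = 0
    for idx in itertools.combinations(range(len(values)), n1):
        chosen = set(idx)
        x = [values[i] for i in idx]
        y = [values[i] for i in range(len(values)) if i not in chosen]
        total += 1
        # Small tolerance so ties with the observed value are counted.
        if abs(two_sample_t(x, y)) >= t_obs - 1e-12:
            count += 1
    return count / total

# Hypothetical expression values for one gene: 4 treatments, 4 controls.
treated = [5.1, 5.4, 4.9, 5.6]
control = [3.0, 3.2, 2.9, 3.3]
p_value = permutation_p_value(treated, control)
```

With 4 + 4 samples there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70; with few replicates the permutation distribution is coarse. For larger designs a random subset of B permutations is commonly used instead of full enumeration.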
34Permutation tests (2)
35Permutation tests (3)
36Are these p-values correct?
- Statistical tests usually control type I error: the probability of rejecting H0 when it is true
- High number of tests → Problem!
  - If we perform 10,000 simultaneous tests on samples under the null hypothesis
  - And fix a type I error α of 2%
  - We expect to reject H0 about 200 times
- To avoid this, adjust p-values
37Why should we adjust p-values?
- A simulation study illustrates why
  - Simulations of this process for 6,000 genes with 8 treatments and 8 controls.
  - All the gene expression values were simulated independent and identically distributed (i.i.d.) from a N(0,1) distribution,
  - i.e. NOTHING is differentially expressed.
38Unadjusted p-values
Clearly we can't just use standard p-value thresholds (.05, .01).
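A small simulation in the spirit of the one described (6,000 null genes, 8 treatments and 8 controls, all values i.i.d. N(0,1)) shows how many genes pass an unadjusted 5% threshold by chance alone. This is a sketch, not the original study: the seed, the pooled-variance t and the hard-coded critical value (about 2.145 for a two-sided 5% test with 14 degrees of freedom) are assumptions.

```python
import math
import random

# Approximate two-sided 5% critical value of Student's t with 14 d.f.
# (valid only for 8 + 8 samples).
T_CRIT_14DF = 2.145

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pooled_t(x, y):
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1.0 / nx + 1.0 / ny))

def count_false_positives(n_genes=6000, n_per_group=8, seed=1):
    """Count null genes declared significant at the unadjusted 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(pooled_t(x, y)) > T_CRIT_14DF:
            hits += 1
    return hits

hits = count_false_positives()
expected = 6000 * 0.05  # about 300 false positives expected by chance
print(hits, expected)
```

Every one of the genes counted in `hits` is a false positive, since nothing is differentially expressed in the simulated data.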
39Steps to generate a list of candidate genes revisited (2)
Nominal p-values: P1, P2, ..., PG
→ Adjusted p-values: aP1, aP2, ..., aPG
→ Select genes with adjusted p-values smaller than α
→ A list of candidate DE genes
40Multiple Testing
- Define an adequate type I error
- Use a procedure that
  - Ensures strict control of the type I error
  - Is powerful (few false negatives)
  - Is based on the joint distribution of the multiple tests
- Calculate an adjusted p-value for each gene that reflects the global type I error
41Multiple Testing (2): Approaches
- Two alternatives for controlling type I error in multiple testing are
  - Control the family-wise error rate (FWER): the probability of making one or more type I errors
    - Bonferroni, Westfall and Young, etc.
  - Control the false discovery rate (FDR): the expected proportion of false positives among all of the rejected null hypotheses
42Multiple Testing (3)
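- False discovery rate: FDR = E(V/R)
- Family-wise error rate: FWER = P(V ≥ 1)
- Here V is the number of false positives (true null hypotheses rejected) and R is the total number of rejected hypotheses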
43Some Advantages of p-value Adjustment
- Test level (size) does not need to be determined in advance
- Some procedures are most easily described in terms of their adjusted p-values
- Usually easily estimated using resampling
- Procedures can be readily compared based on the corresponding adjusted p-values
44Some p-values adjustment methods
- Bonferroni adjustment
- Westfall, PH and SS Young (1993) Resampling-based multiple testing (max T).
- Benjamini, Y and Y Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing
- J Storey (2001) The positive false discovery rate: a Bayesian interpretation and the q-value.
- Y Ge et al (2001) Fast algorithm for resampling based p-value adjustment for multiple testing.
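Two of the adjustments listed above are simple enough to sketch directly: Bonferroni (controls the FWER) and the Benjamini-Hochberg step-up procedure (controls the FDR). The nominal p-values are made up and the function names are hypothetical; the resampling-based methods (Westfall and Young, Ge et al.) are not covered here.

```python
def bonferroni(pvalues):
    """Bonferroni adjusted p-values: min(1, m * p)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up, monotone)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical nominal p-values for five genes.
nominal = [0.001, 0.008, 0.020, 0.041, 0.60]
adj_bonf = bonferroni(nominal)
adj_bh = benjamini_hochberg(nominal)
selected_bonf = [i for i, p in enumerate(adj_bonf) if p < 0.05]
selected_bh = [i for i, p in enumerate(adj_bh) if p < 0.05]
```

In this example Bonferroni keeps two genes and Benjamini-Hochberg keeps three at the 0.05 level, which reflects the usual trade-off: FDR control is less strict and therefore more powerful.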
45More than 2 conditions
- K-samples experiments
- Factorial experiments
46More than Two Conditions
- Many experiments are intended to make complex comparisons
  - Compare several conditions (> 2 treatments)
  - Compare the joint effect of two drugs
  - Compare two strains of maize (mutant and wild type) at two different times
- These can be analysed using factorial designs, which involve the use of ANOVA models
- We don't discuss them here. See references
47Anova model
48Finding patterns in genes
- Introduction to clustering
49Expression profiles in microarray data
[Figure: expression matrix with genes as rows and experimental conditions as columns. A column is the expression profile for all genes in a single experimental condition; a row is the expression profile for one gene in all experimental conditions.]
50Why should we cluster data?
- Gene expression studies assume that genes with similar function
  - Have similar patterns of expression
  - Have common transcription factors
- If we believe this is true, the analysis of patterns in the expression matrix should help to identify
  - Biological function for uncharacterized genes
  - Transcription factors for genes
51Does common expression mean common regulation?
52Genes associated with pathologies
53Leukemia typologies identification
54Cluster analysis
- Multivariate statistical methods (data mining, machine learning, ...) that
  - Given a set of individuals (points),
  - Characterized by a series of attributes,
  - And a similarity measure between them,
- Allow groups (clusters) to be formed such that
  - Points inside a group are more similar to each other than points in different groups
55Clustering Group identification
56Clustering expression data
- Cluster genes (rows), to (try to) identify groups of co-regulated genes.
- Cluster samples (columns), to (try to) identify phenotypes using their molecular profiles
- One can cluster genes and samples simultaneously
57Basic issues in clustering
- Issues to decide before clustering
- Which genes/arrays to use?
- Which similarity/dissimilarity measure?
- Which clustering algorithm?
- It's an exploratory technique
- There's no optimal solution
- Any method will yield groups
- Which method gives good groups?
59Similarity measures
- A similarity/dissimilarity measure between two individuals i, j is an index, sij (usually between 0 and 1, 0 ≤ sij ≤ 1), which measures the intensity with which i and j are related.
- It's usually measured using
  - A similarity coefficient with categorical variables
  - A distance function with continuous variables (this is the case of expression data)
60Similarity measures (1)
61Distance functions (2)
- Amounts to Pearson's correlation coefficient if variables are centered
62Pearson's correlation coefficient
- Widely used in expression studies
- Some known problems
- Ignores data variability
- Can yield false positives (a-b)
- There exist robust variants
- jackknife
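To make the contrast between a distance and a correlation concrete, the sketch below computes the Euclidean distance and Pearson's correlation for a few made-up expression profiles, and checks that the cosine of the angle between centered profiles coincides with Pearson's coefficient. The profiles and helper names are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def center(x):
    """Subtract the profile mean from every value."""
    m = sum(x) / len(x)
    return [a - m for a in x]

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson's correlation coefficient from its covariance definition."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)

# Hypothetical profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
d_ab = euclidean(gene_a, gene_b)
cos_centered_ab = cosine(center(gene_a), center(gene_b))
```

Here `gene_a` and `gene_b` are perfectly correlated although they are far apart in Euclidean terms: the correlation looks only at the shape of the profile and ignores its scale.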
63Clustering algorithms
- Many algorithms, based upon many ideas
- Most popular
  - Hierarchical methods
    - Agglomerative (N → 1) or Divisive (1 → N)
  - Partitioning methods
    - K-means, Self-organizing maps (SOM)
- Other
  - Support Vector Machines (SVM)
64Clustering algorithms
65Partitioning methods
- Each element is initially assigned to one of K groups, which have been specified a priori
- A cost function is used to re-assign individuals to groups until an optimality criterion is reached (e.g. minimize the sum of squares inside clusters)
- Some partitioning methods
  - k-means, partitioning around medoids (PAM), self-organizing maps (SOM), model-based clustering
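A bare-bones k-means illustrates the re-assignment loop described above: assign each profile to its nearest centre, recompute the centres, and stop when the assignment no longer changes. The initial centres are passed in explicitly (an assumption made to keep the example deterministic), and the profiles and names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means; returns (labels, centers)."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centre for every point.
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical expression profiles (4 genes x 3 conditions), K = 2.
profiles = [
    [0.1, 0.2, 0.1],
    [0.0, 0.3, 0.2],
    [2.0, 2.1, 1.9],
    [2.2, 1.8, 2.0],
]
labels, centers = kmeans(profiles, initial_centers=[profiles[0], profiles[2]])
```

The result depends on K and on the starting centres, and every gene ends up in some cluster whether or not it belongs there.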
66Partitioning methods
67Hierarchical methods
- Number of groups isn't defined a priori
- Build a dendrogram such that the nearer two individuals are in it, the more similar they are
- Cutting it at any level gives clusters
- Can be
  - Agglomerative (bottom-up): N → 1 groups
  - Divisive (top-down): 1 → N groups
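The agglomerative (bottom-up) variant can be sketched as a loop that starts with N singleton clusters and repeatedly merges the closest pair. The sketch below uses average linkage with Euclidean distance, which is just one of several possible choices, and stops when a requested number of clusters remains (equivalent to cutting the dendrogram at that level). Data and names are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(points, n_clusters):
    """Agglomerative clustering; returns (clusters, merges).

    clusters: lists of point indices; merges: (members_a, members_b, distance)
    in the order in which the merges happened.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                total = sum(euclidean(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical expression profiles (4 genes x 3 conditions).
profiles = [
    [0.1, 0.2, 0.1],
    [0.0, 0.3, 0.2],
    [2.0, 2.1, 1.9],
    [2.2, 1.8, 2.0],
]
clusters, merges = average_linkage(profiles, n_clusters=2)
```

Letting the loop run down to a single cluster records the full merge history, which is the information a dendrogram displays.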
68Hierarchical methods
69Partitioning vs Hierarchical
Hierarchical
- Advantages
  - Faster computation.
  - Visual.
- Disadvantages
  - Unrelated genes are eventually joined
  - Rigid, cannot correct later for erroneous decisions made earlier.
  - Hard to define clusters.
Partitioning
- Advantages
  - Optimal for certain criteria.
  - Genes automatically assigned to clusters
- Disadvantages
  - Need initial k
  - Often require long computation times.
  - All genes are forced into a cluster.
70Example Yeast cell-cycle
71Acknowledgments
- Special thanks to Yee Hwa Yang (UCSF) for allowing me to use some of her materials
- Sandrine Dudoit & Terry Speed, U.C. Berkeley
- M. Carme Ruíz de Villa, U. Barcelona
- Sara Marsal, U. Reumatología, HVH Barcelona