3. A brief look into data analysis for microarray experiments

Provided by: alexsa

1
3. A brief look into data analysis for
microarray experiments

Alex Sánchez. Departament d'Estadística.
Universitat de Barcelona

2
Outline

Introduction
2-fold change approach (2 conditions, 1 sample)
T-tests and extensions (2 conditions, > 1 sample)
Significance and multiple testing
More than Two Conditions
A brief introduction to clustering

3
Introduction
4
Identifying differentially expressed genes
(filtering?)

Goal: identify genes associated with a covariate
or response of interest, such as
Qualitative covariates: treatment, cell type
Quantitative covariates: dose, time
Responses: survival, infection time
Any combination of these!
Selecting a subset of differentially expressed
genes is sometimes called filtering
It is a step that precedes other analysis procedures

5
Life cycle
Biological question
Experimental design
Failed
Microarray experiment
Quality Measurement
Image analysis
Today
Normalization
Pass
Analysis
Discrimination
Clustering
Testing
Estimation
Biological verification and interpretation
6
The experimental frame

We can distinguish different situations:
No replicates (one chip/condition)
2 conditions → One k-fold change analysis
> 2 conditions → Several k-fold change analyses
There are replicates
2 conditions on one chip → One-sample tests
2 conditions on two chips → Two-sample tests
More than 2 conditions → ANOVA or linear models

7
Experimental frame for 2 conditions

One chip per condition → 2-fold change
Null hypothesis H0: log2(Rg/Gg) = 0, g = 1, ..., G
Decision based on one value per gene
Several chips → one- or two-sample tests
If the two samples are on the same array (replicated arrays):
H0: log2(R/G) = 0. Decision based on the average
log ratios
If we have a common reference (both samples
hybridize to the same control) →
the hypothesis changes to H0: log(R1/G) - log(R2/G) = 0
Decision based on the average difference of log ratios

8
Two-fold change (two conditions, one single chip)
9
Fold change (single slide) methods

The observed (log2) ratio between two conditions
is used
Arbitrary cut-off value (2 fold?)
Not a statistical test → no associated level
of confidence
Some known problems:
Subject to bias if data not properly normalized
Sensitive to variance heterogeneity across genes
Solutions have been suggested (next slide)
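
The cut-off rule on this slide can be sketched in a few lines. This is only an illustration: the function name `fold_change_calls`, the threshold parameter `k` and the intensity values are hypothetical, and the intensities are assumed to be already normalized.

```python
import math

def fold_change_calls(red, green, k=2.0):
    """Return (log2 ratio, selected) per gene; selected when |log2(R/G)| >= log2(k)."""
    threshold = math.log2(k)
    calls = []
    for r, g in zip(red, green):
        m = math.log2(r / g)
        calls.append((m, abs(m) >= threshold))
    return calls

# Hypothetical normalized intensities for four genes on one chip.
red = [1200.0, 400.0, 950.0, 3000.0]
green = [500.0, 410.0, 2100.0, 2900.0]

calls = fold_change_calls(red, green)
for gene, (m, selected) in enumerate(calls, start=1):
    print(f"gene {gene}: log2(R/G) = {m:+.2f} {'selected' if selected else 'not selected'}")
```

Working on the log2 scale makes the rule symmetric: a 2-fold increase and a 2-fold decrease are both one unit away from zero. No confidence level is attached to any of these calls.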

10
Origin of the 2-fold change approach

DeRisi et al. (1997) found only 19/6300 false
positives using this criterion
Perhaps correct for their experiment, but
Similar controls would be required for each new
experiment → redefinition of the k-fold value
May be influenced by experimental factors
A practical reason: often there are no replicates
because they are too expensive
Making inferences with samples of size 1?!!

11
Some approaches to use of k-fold-change criteria
(1)

A naïve approach: standardize the data
Assumption of normality (following Chen et al.,
1998)
Ratios can be converted into Z-scores,
each of which has an associated p-value
Problems:
Assumes normality and homoscedasticity
Must use robust estimates of centrality (e.g.
median) and dispersion (e.g. MAD)
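
A minimal sketch of the standardization described above, assuming the log ratios are centered with the median and scaled with the MAD. The factor 1.4826 (which makes the MAD comparable to a normal standard deviation), the function names and the data are assumptions added for illustration, not part of the original slides.

```python
from statistics import NormalDist, median

def robust_z_scores(log_ratios):
    """Standardize log ratios with the median and the scaled MAD."""
    center = median(log_ratios)
    mad = median([abs(x - center) for x in log_ratios])
    scale = 1.4826 * mad  # makes the MAD comparable to a normal standard deviation
    return [(x - center) / scale for x in log_ratios]

def two_sided_p_value(z):
    """Two-sided p-value of a Z-score under the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical log2 ratios; the sixth gene is the only clear outlier.
log_ratios = [0.10, -0.20, 0.05, 0.00, -0.10, 2.50, 0.15, -0.05]
z_scores = robust_z_scores(log_ratios)
p_values = [two_sided_p_value(z) for z in z_scores]

for x, z, p in zip(log_ratios, z_scores, p_values):
    print(f"log ratio {x:+.2f}  z = {z:+7.2f}  p = {p:.3g}")
```

The p-values are only as good as the normality and equal-variance assumptions listed on the slide. The sketch also fails if the MAD is zero.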

12
Some approaches to use of k-fold-change criteria
(2)

More sophisticated approaches have been proposed,
mainly based on relaxing distributional
assumptions
Newton et al.: Gamma-Gamma-Bernoulli hierarchical
model for each (R,G).
Roberts et al.: Each (R,G) is assumed to be
normally and independently distributed with
variance depending linearly on the mean.
Sapir & Churchill: Each log(R/G) is assumed to be
distributed according to a mixture of normal and
uniform distributions. Decision based on R/G only.

13
Example: Matt Callow's Srb1 data (5).
Newton's and Chen's single-slide methods

It is not hard to do by eye
The problem is probably beyond formal statistical
inference (valid p-values, etc.) for the
foreseeable future. Why?

14
T-tests and extensions (2 conditions, several
chips)
15
Tests of Differential Expression between two
conditions, several chips

With several replicates per condition →
the variability of gene expression,
on a gene-per-gene basis,
can be taken into account
Natural measures of differential expression
will be based on the mean, or difference of
means, conveniently standardized

16
Natural measures of discrepancy
Direct comparisons
Indirect comparisons
17
Some Issues

Can we trust average effect sizes (average
difference of means) alone?
Can we trust the t statistic alone?
Here is evidence that the answer is no.

Courtesy of Y.H. Yang
18
Some Issues

Can we trust average effect sizes (average
difference of means) alone?
Can we trust the t statistic alone?
Here is evidence that the answer is no.

Courtesy of Y.H. Yang

Averages can be driven by outliers.

19
Some Issues

Can we trust average effect sizes (average
difference of means) alone?
Can we trust the t statistic alone?
Here is evidence that the answer is no.

t-statistics can be driven by tiny variances.

Courtesy of Y.H. Yang
20
Variations in t-tests (1)

Let
Rg = mean observed log ratio
SEg = standard error of Rg estimated from data on
gene g.
SE = standard error of Rg estimated from data
across all genes.
Global t-test: t = Rg/SE
Gene-specific t-test: t = Rg/SEg
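
The two statistics can be compared on a small example. This is a sketch under one assumption not stated on the slide: the global SE is taken from the average of the gene-wise variances. The function name and the replicate values are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistics(log_ratios_by_gene):
    """Return (global t, gene-specific t) for replicated log ratios per gene."""
    n = len(log_ratios_by_gene[0])  # replicates per gene, assumed equal
    means = [mean(g) for g in log_ratios_by_gene]
    variances = [variance(g) for g in log_ratios_by_gene]
    se_gene = [math.sqrt(v / n) for v in variances]
    # Assumption: the global SE pools the gene-wise variances by averaging them.
    se_global = math.sqrt(mean(variances) / n)
    t_global = [m / se_global for m in means]
    t_specific = [m / se for m, se in zip(means, se_gene)]
    return t_global, t_specific

# Hypothetical replicated log ratios (4 arrays) for three genes.
data = [
    [1.00, 1.20, 0.80, 1.10],   # large, consistent change
    [0.05, 0.06, 0.05, 0.06],   # negligible change, tiny variance
    [0.30, -0.40, 0.50, -0.20], # noisy, no clear change
]
t_global, t_specific = t_statistics(data)
for g, (tg, ts) in enumerate(zip(t_global, t_specific), start=1):
    print(f"gene {g}: global t = {tg:6.2f}  gene-specific t = {ts:6.2f}")
```

In this example the gene-specific t ranks the second gene first even though its change is negligible, because its standard error is tiny. The global t ranks the first gene first.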

21
Some pros and cons of t-test
22
Alternatives suggested

SAM
Regularized t-test
B (empirical Bayes) statistic
Others

23
SAM t-test or S-test

Adds a small constant c to the standard error
(c = perc90(SEg), the 90th percentile of the SEg values)
Genes with small fold changes will not be
selected as significant
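
A hedged sketch of how such a statistic might be computed, assuming the form S = Rg / (c + SEg) with c taken as the 90th percentile of the gene-wise standard errors. The nearest-rank percentile rule, the function names and the numbers are illustrative choices, not the SAM software's implementation.

```python
import math

def percentile_nearest_rank(values, q):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def s_statistics(mean_log_ratios, standard_errors, q=0.90):
    """S = Rg / (c + SEg), with c the q-th percentile of the SEg values."""
    c = percentile_nearest_rank(standard_errors, q)
    return [r / (c + se) for r, se in zip(mean_log_ratios, standard_errors)]

# Hypothetical summaries: mean log ratio and standard error for three genes.
means = [1.025, 0.055, 0.050]
ses = [0.0854, 0.0029, 0.2102]

plain_t = [r / se for r, se in zip(means, ses)]
s_stats = s_statistics(means, ses)
for g, (t, s) in enumerate(zip(plain_t, s_stats), start=1):
    print(f"gene {g}: t = {t:6.2f}  S = {s:5.2f}")
```

The added constant keeps the denominator away from zero, so the second gene (small change, tiny standard error) no longer gets the largest statistic.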

24
Regularized t-test statistic

ν0: relative contributions of σ and σg
n: number of replicate measurements for each
condition

25
Can we generate a list of candidate genes?
With the tools we have, the reasonable steps to
generate a list of candidate genes may be:
→
A list of candidate DE genes
We need an idea of how significant these
values are → we'd like to assign them p-values
26
Significance and multiple testing
27
Nominal p-values

After a test statistic is computed, it is
convenient to convert it to a p-value: the
probability of occurrence of a test statistic
equal to, or more extreme than, the observed value
under the assumption that the null hypothesis is
true: p = P(S ≥ S0 | H0 true)

28
Significance testing

Test of significance at the α level:
Reject the null hypothesis if your p-value is
smaller than the significance level
It has advantages but is not free from criticism
Genes with p-values falling below a prescribed
level may be regarded as significant

29
Calculation of p-values

Standard methods for calculating p-values:
(i) Refer to a statistical distribution table
(Normal, t, F, ...) or
(ii) Perform a permutation analysis

30
(i) Tabulated p-values

Tabulated p-values can be obtained for standard
test statistics (e.g. the t-test)
They often rely on the assumption of normally
distributed errors in the data
This assumption can be checked (approximately)
using a
Histogram
Q-Q plot

31
Histogram and QQ-plots of t-statistics
32
More about QQ plots

Not only useful for checking normality
Also to identify genes with extreme t-values:
values off the line, at one end or another
Very useful with thousands of genes, but
We can't expect all differentially expressed
genes to stand out as extremes:
many will be masked by more extreme random
variation, which is a big problem in this context

33
(ii) Permutation tests

Based on data shuffling. No distributional assumptions
Random interchange of labels between samples
Estimate p-values for each comparison (gene) by
using the permutation distribution of the
t-statistics
Repeat for every possible permutation, b = 1, ..., B:
Permute the n data points for the gene (x). The
first n1 are referred to as treatments, the
second n2 as controls.
For each gene, calculate the corresponding
two-sample t-statistic, tb.
After all B permutations are done, put
p = #{b : |tb| ≥ |tobserved|} / B
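
The procedure above can be sketched for a single gene as follows. This is only an illustration under stated assumptions: the two-sample statistic is taken to be the unequal-variance (Welch) form, every distinct split of the pooled values into n1 treatments and n2 controls is enumerated, and the data and function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def two_sample_t(x, y):
    """Two-sample t-statistic with separate variances (Welch form)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def permutation_p_value(treatments, controls):
    """p = #{b : |t_b| >= |t_observed|} / B over all label reassignments."""
    pooled = list(treatments) + list(controls)
    n1 = len(treatments)
    t_obs = abs(two_sample_t(treatments, controls))
    count = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        x = [pooled[i] for i in idx]
        y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed split always counts itself.
        if abs(two_sample_t(x, y)) >= t_obs - 1e-12:
            count += 1
    return count / total

# Hypothetical expression values for one gene: 4 treatments, 4 controls.
treatments = [2.1, 2.5, 1.9, 2.8]
controls = [0.2, -0.1, 0.4, 0.1]
p = permutation_p_value(treatments, controls)
print(f"permutation p-value = {p:.4f}")
```

With 4 + 4 samples there are only 70 distinct splits, so the smallest attainable two-sided p-value is 2/70, which is what this fully separated example gives. With larger groups one would typically sample B random permutations instead of enumerating all of them.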

34
Permutation tests (2)
35
Permutation tests (3)
36
Are these p-values correct?

Statistical tests usually control the type I error:
the probability of rejecting H0 when it is true
High number of tests → Problem!!!
If we perform 10,000 simultaneous tests on
samples under the null hypothesis
and fix a type I error α of 2%,
we expect to reject H0 about 200 times
To avoid this: adjust p-values

37
Why should we adjust p-values?

A simulation study illustrates why
Simulations of this process for 6,000 genes with
8 treatments and 8 controls.
All the gene expression values were simulated as
independent and identically distributed (i.i.d.)
draws from a N(0,1) distribution,
i.e. NOTHING is differentially expressed.
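
A simulation along these lines can be sketched as follows. It is not the original simulation: the seed, the pooled-variance t-statistic and the function names are assumptions, and significance at the 5% level is judged against 2.145, the two-sided 5% critical value of Student's t with 14 degrees of freedom (8 + 8 - 2).

```python
import math
import random

T_CRIT_14DF_5PCT = 2.145  # two-sided 5% critical value of Student's t, 14 d.f.

def pooled_t(x, y):
    """Equal-variance two-sample t-statistic."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def count_false_positives(n_genes=6000, n_per_group=8, seed=0):
    """Count genes called significant when every value is N(0,1) noise.

    The critical value above is only valid for n_per_group = 8.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(pooled_t(x, y)) > T_CRIT_14DF_5PCT:
            hits += 1
    return hits

false_positives = count_false_positives()
print(f"{false_positives} of 6000 null genes have unadjusted p < 0.05")
```

Around 5% of 6,000, that is roughly 300 genes, are expected to pass the unadjusted threshold even though none is differentially expressed.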

38
Unadjusted p-values
Clearly we can't just use standard p-value
thresholds (.05, .01).
39
Steps to generate a list of candidate genes
revisited (2)
Nominal p-values: P1, P2, ..., PG
Adjusted p-values: aP1, aP2, ..., aPG
Select genes with adjusted p-values smaller than α
A list of candidate DE genes
40
Multiple Testing

Define an adequate type I error rate
Use a procedure that
Ensures strict control of the type I error
Is powerful (few false negatives)
Is based on the joint distribution of the
multiple tests
Calculate an adjusted p-value for each gene that
reflects the global type I error

41
Multiple Testing (2): Approaches

Two alternatives for controlling type I error in
multiple testing are
Control the family-wise error rate (FWER): the
probability of making one or more type I errors
Bonferroni, Westfall and Young, etc.
Control the false discovery rate (FDR): the expected
proportion of false positives among all of the
rejected null hypotheses

42
Multiple Testing (3)

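False discovery rate: FDR = E(V/R)
Family-wise error rate: FWER = P(V ≥ 1)
(V = number of false positives, R = number of rejected null hypotheses)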

43
Some Advantages of p-value Adjustment

Test level (size) does not need to be determined
in advance
Some procedures are most easily described in terms of
their adjusted p-values
Usually easily estimated using resampling
Procedures can be readily compared based on the
corresponding adjusted p-values

44
Some p-value adjustment methods

Bonferroni adjustment
Westfall, PH and SS Young (1993) Resampling-based
multiple testing (max T).
Benjamini, Y and Y Hochberg (1995) Controlling the
false discovery rate: a practical and powerful
approach to multiple testing.
J Storey (2001) The positive false discovery
rate: a Bayesian interpretation and the q-value.
Y Ge et al. (2001) Fast algorithm for resampling-based
p-value adjustment for multiple testing.
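
Two of the adjustments listed above are simple enough to sketch directly: the Bonferroni adjustment (FWER control) and the Benjamini-Hochberg step-up adjustment (FDR control). The function names and the p-values are hypothetical, and the resampling-based methods are not covered here.

```python
def bonferroni(p_values):
    """Bonferroni adjusted p-values: min(1, m * p)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up, FDR control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical nominal p-values for five genes.
nominal = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_bonf = bonferroni(nominal)
adj_bh = benjamini_hochberg(nominal)
for p, b, h in zip(nominal, adj_bonf, adj_bh):
    print(f"nominal {p:.3f}  Bonferroni {b:.3f}  BH {h:.5f}")
```

A gene is then selected when its adjusted p-value falls below the chosen level. The Bonferroni values are never smaller than the Benjamini-Hochberg ones, which reflects the stricter error criterion.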

45
More than 2 conditions

K-samples experiments
Factorial experiments

46
More than Two Conditions

Many experiments are intended to make complex
comparisons
Compare several conditions (> 2 treatments)
Compare the joint effect of two drugs
Compare two strains of maize (mutant and wild type)
at two different times
These can be analysed using factorial designs,
which involve the use of ANOVA models
We don't discuss them here. See references

47
ANOVA model
48
Finding patterns in genes

Introduction to clustering

49
Expression profiles in microarray data
Expression profile for all genes in a single
experimental condition
Genes
Experimental conditions
Expression profile for one gene in all
experimental conditions
50
Why should we cluster data?

Gene expression studies assume that genes with
similar function
Have similar patterns of expression
Have common transcription factors
If we believe the above is true, the analysis
of patterns in the expression matrix should help
to identify
Biological function for uncharacterized genes
Transcription factors for genes

51
Does common expression mean common regulation?
52
Genes associated with pathologies
53
Leukemia typologies identification
54
Cluster analysis

Multivariate statistical methods (data mining,
machine learning, ...) that,
given a set of individuals (points)
characterized by a series of attributes,
and a similarity measure between them,
allow us to form groups (clusters) such that
points inside a group are more similar to
each other than to points in different groups

55
Clustering Group identification
56
Clustering expression data

Cluster genes (rows), to (try to) identify groups
of co-regulated genes.
Cluster samples (columns), to (try to) identify
phenotypes using their molecular profiles
One can cluster genes and samples simultaneously

57
Basic issues in clustering

Issues to decide before clustering
Which genes/arrays to use?
Which similarity/dissimilarity measure?
Which clustering algorithm?
It's an exploratory technique
There's no optimal solution
Any method will yield groups
Which method gives good groups?

58
(No Transcript)
59
Similarity measures

A similarity/dissimilarity measure between two
individuals i, j is an index, sij (usually
between 0 and 1, 0 ≤ sij ≤ 1), which measures the
intensity with which i and j are related.
It's usually measured using
A similarity coefficient with categorical
variables
A distance function with continuous
variables (this is the case of expression data)

60
Similarity measures (1)
61
Distance functions (2)

Amounts to Pearson's correlation coefficient if
variables are centered
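
The slide's formula did not survive in this transcript. Assuming the measure in question is the cosine (uncentered correlation) between two expression profiles, the statement can be checked numerically. The gene profiles and function names below are hypothetical.

```python
import math

def center(x):
    """Subtract the mean from every element of a profile."""
    m = sum(x) / len(x)
    return [a - m for a in x]

def cosine_similarity(x, y):
    """Cosine of the angle between two profiles (uncentered correlation)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson's correlation coefficient from its usual definition."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles of two genes over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.5, 5.5, 9.0]

r = pearson(gene_a, gene_b)
r_from_cosine = cosine_similarity(center(gene_a), center(gene_b))
uncentered = cosine_similarity(gene_a, gene_b)
distance = 1.0 - r  # a common correlation-based dissimilarity

print(f"Pearson r = {r:.4f}, cosine of centered profiles = {r_from_cosine:.4f}")
print(f"cosine of raw profiles = {uncentered:.4f}, 1 - r = {distance:.4f}")
```

The cosine of the centered profiles coincides with Pearson's r, while the cosine of the raw profiles does not.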

62
Pearson's correlation coefficient

Widely used in expression studies
Some known problems:
Ignores data variability
Can yield false positives (a-b)
There exist robust variants
jackknife

63
Clustering algorithms

Many algorithms, based upon many ideas
Most popular
Hierarchical methods
Agglomerative (N → 1) or Divisive (1 → N)
Partitioning methods
K-means, Self-organizing maps (SOM)
Other
Support Vector Machines (SVM)

64
Clustering algorithms
65
Partitioning methods

Each element is initially assigned to one of K
groups which have been specified a priori
A cost function is used to re-assign individuals
to groups until an optimality criterion is
reached (e.g. minimizing the within-cluster sum of squares)
Some partitioning methods:
k-means, partitioning around medoids (PAM),
self-organizing maps (SOM), model-based
clustering
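
As an illustration of the re-assignment loop described above, here is a minimal k-means sketch. It assumes Euclidean distance and user-supplied starting centroids. The function name and the expression profiles are hypothetical, and real analyses would normally rely on an established implementation.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical expression profiles (3 conditions) for four genes.
profiles = [
    [0.1, 0.2, 0.1],
    [0.0, 0.3, 0.2],
    [2.0, 2.1, 1.9],
    [2.2, 1.8, 2.0],
]
# Deliberately poor start: both initial centroids sit in the low-expression group.
assignment, centroids = k_means(profiles, centroids=[profiles[0], profiles[1]])
print(assignment)
print(centroids)
```

Even from the poor start, the re-assignment steps recover the two groups here. In general k-means only reaches a local optimum, so results depend on k and on the starting centroids.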

66
Partitioning methods
67
Hierarchical methods

Number of groups isn't defined a priori
Build a dendrogram such that the nearer two
individuals are in it → the more similar they
are
Cutting it at any level gives clusters
Can be:
Agglomerative (bottom-up): N → 1 groups
Divisive (top-down): 1 → N groups
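
The agglomerative (bottom-up) variant can be sketched as repeated merging of the two closest clusters. The sketch below assumes Euclidean distance and average linkage, which the slide does not specify. Stopping at `n_clusters` groups plays the role of cutting the dendrogram, and the names and data are hypothetical.

```python
import math

def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Average linkage: mean pairwise distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        d, ia, ib = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[ia], clusters[ib], d))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return clusters, merges

# Hypothetical expression profiles (3 conditions) for four genes.
profiles = [
    [0.1, 0.2, 0.1],
    [0.0, 0.3, 0.2],
    [2.0, 2.1, 1.9],
    [2.2, 1.8, 2.0],
]
clusters, merges = agglomerate(profiles, n_clusters=2)
print(clusters)
for left, right, height in merges:
    print(f"merged {left} and {right} at height {height:.3f}")
```

The merge heights are what a dendrogram would display. Running with `n_clusters=1` performs the full N → 1 sequence, including the final merge of two clearly unrelated groups.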

68
Hierarchical methods
69
Partitioning vs Hierarchical

Hierarchical
Advantages
Faster computation.
Visual.
Disadvantages
Unrelated genes are eventually joined.
Rigid, cannot correct later for erroneous
decisions made earlier.
Hard to define clusters.

Partitioning
Advantages
Optimal for certain criteria.
Genes automatically assigned to clusters.
Disadvantages
Need initial k.
Often require long computation times.
All genes are forced into a cluster.

70
Example: Yeast cell-cycle
71
Acknowledgments

Special thanks to Yee Hwa Yang (UCSF) for
allowing me to use some of her materials
Sandrine Dudoit & Terry Speed, U.C. Berkeley
M. Carme Ruíz de Villa, U. Barcelona
Sara Marsal, U. Reumatología, HVH Barcelona
