Using Statistical Methods to Obtain a List of Differentially Expressed Genes

1
Using Statistical Methods to Obtain a List of
Differentially Expressed Genes
Tim Bancroft, Dan Nettleton
BBSI Summer School, Iowa State University
June 16, 2009
2
Wild-type vs. Myostatin Knockout Mice
Belgian Blue cattle have a mutation in the
myostatin gene.
3
Affymetrix GeneChips on 5 Mice per Genotype
[Figure: ten GeneChips, one per mouse, five labeled M (mutant) and five labeled WT (wild type).]
4
The Dataset
[Table: one row per gene; columns are Gene ID, Wild Type expression values, and Mutant expression values.]
5
A Standard Analysis

Two-sample t-test for each gene.
Test the null hypothesis $H_{0i}: \mu_{i1} = \mu_{i2}$
for the ith gene (wild-type mean = mutant mean).
Compute p-values by comparing t-statistics to a
t-distribution with 8 d.f.
Use an adjustment for multiple testing to create
a list of genes declared to be differentially
expressed.
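
The per-gene test above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' code: the expression values are hypothetical, and the p-value uses a closed-form series for the t distribution that holds only for even degrees of freedom (which covers the 8 d.f. case here). In practice a statistics library would normally be used instead.

```python
import math
from statistics import mean, variance


def two_sided_t_p_value(t, df):
    """Two-sided p-value for a t statistic.

    Uses a closed-form series for the t distribution that is valid
    only when df is even (df = 8 for two groups of five).
    """
    if df < 2 or df % 2 != 0:
        raise ValueError("this closed form requires an even df >= 2")
    sin_theta = abs(t) / math.sqrt(t * t + df)
    cos_sq = df / (t * t + df)
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= (2 * j - 1) / (2 * j) * cos_sq
        total += term
    # sin_theta * total equals P(|T| <= |t|)
    return 1.0 - sin_theta * total


def pooled_t_test(wild_type, mutant):
    """Equal-variance two-sample t-test; returns (t, two-sided p-value)."""
    n1, n2 = len(wild_type), len(mutant)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(wild_type)
                  + (n2 - 1) * variance(mutant)) / df
    t = (mean(wild_type) - mean(mutant)) / math.sqrt(
        pooled_var * (1 / n1 + 1 / n2))
    return t, two_sided_t_p_value(t, df)


# Hypothetical expression values for one gene (not from the slides).
wt = [1.0, 2.0, 3.0, 4.0, 5.0]
mut = [3.0, 4.0, 5.0, 6.0, 7.0]
t_stat, p_value = pooled_t_test(wt, mut)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Applying `pooled_t_test` to every row of the dataset would give one p-value per gene, which is the input to the multiple-testing adjustment.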

6
The Dataset
[Table: the same dataset with an added p-value column holding $p_1, p_2, \ldots, p_{22690}$, one p-value per gene.]
7
Histogram of p-values from the Two-Sample t-Tests
[Figure: histogram; x-axis p-value, y-axis Number of Genes.]
8
Example p-value Distributions
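Two-sample t-test of $H_0: \mu_1 = \mu_2$; $n_1 = n_2 = 5$, variance $= 1$
[Figure: three panels of p-value distributions, for $\mu_1 - \mu_2 = 1$, $\mu_1 - \mu_2 = 0.5$, and $\mu_1 - \mu_2 = 0$.]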
9
Histogram of p-values from the Two-Sample t-Tests
[Figure: histogram; x-axis p-value, y-axis Number of Genes.]
10
The Multiple Testing Problem

Suppose one test of interest has been conducted
for each of m genes in a microarray experiment.
Let $p_1, p_2, \ldots, p_m$ denote the p-values
corresponding to the m tests.
Let $H_{01}, H_{02}, \ldots, H_{0m}$ denote the null
hypotheses corresponding to the m tests.

11
The Multiple Testing Problem (continued)

Suppose $m_0$ of the null hypotheses are true and $m_1$
of the null hypotheses are false.
Let c denote a value between 0 and 1 that will
serve as a cutoff for significance:
- Reject $H_{0i}$ if $p_i \le c$
(declare significant)
- Fail to reject (or accept) $H_{0i}$ if $p_i > c$
(declare non-significant)

12
Table of Outcomes
                 Accept Null         Reject Null
                 Declare Non-Sig.    Declare Sig.
                 No Discovery        Declare Discovery
                 Negative Result     Positive Result      Total
True Nulls       U                   V                    $m_0$
False Nulls      T                   S                    $m_1$
Total            W                   R                    m
13
Table of Outcomes
(Same table as on slide 12.)
Random Variables: U, V, T, S, W, R
Constants: $m_0$, $m_1$, m
14
Table of Outcomes
(Same table as on slide 12.)
Unobservable: U, V, T, S, $m_0$, $m_1$
Observable: W, R, m
15
Table of Outcomes
(Same table as on slide 12.)
V = number of false positives = number of false
discoveries = number of type I errors
16
False Discovery Rate (FDR)

FDR is an error measure that can be useful for
multiple testing problems encountered in
microarray experiments.
FDR was introduced by Benjamini and Hochberg
(1995) and is formally defined as follows:
R = # of rejected null hypotheses
V = # of type I errors (false discoveries)
$\mathrm{FDR} = E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ otherwise.
Controlling FDR amounts to choosing the
significance cutoff c so that FDR is less than or
equal to some desired level $\alpha$.

17
A Conceptual Description of FDR

Suppose a scientist conducts 100 independent
microarray experiments.
For each experiment, the scientist produces a
list of genes declared to be differentially
expressed by testing a null hypothesis for each
gene.
For each list consider the ratio of the number of
false positive results to the total number of
genes on the list (set this ratio to 0 if the
list contains no genes).
The FDR is approximated by the average of the
ratios described above.
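
The averaging idea above can be mimicked with a small simulation. This is only a sketch under stated assumptions: each "experiment" uses simple z-tests with known unit variance rather than t-tests on real microarray data, and the number of genes, effect size, and cutoff are hypothetical values chosen for illustration.

```python
import math
import random


def two_sided_z_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def average_false_discovery_ratio(n_experiments, m0, m1, effect, c, seed=0):
    """Average V/R over simulated experiments (ratio set to 0 when R = 0).

    m0 true nulls have z ~ N(0, 1); m1 false nulls have z ~ N(effect, 1).
    A test is declared significant when its p-value is <= c.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_experiments):
        v = sum(two_sided_z_p_value(rng.gauss(0.0, 1.0)) <= c
                for _ in range(m0))
        s = sum(two_sided_z_p_value(rng.gauss(effect, 1.0)) <= c
                for _ in range(m1))
        r = v + s
        ratios.append(v / r if r > 0 else 0.0)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # 100 experiments, 900 true nulls, 100 false nulls, cutoff c = 0.01.
    print(average_false_discovery_ratio(100, 900, 100, 3.0, 0.01))
```

In a simulation V is known because the true nulls are known; in a real experiment only R is observed, which is why FDR has to be controlled or estimated rather than computed.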

18
The Benjamini and Hochberg Procedure for
(Strongly) Controlling FDR at Level $\alpha$

Let $p_{(1)}, p_{(2)}, \ldots, p_{(m)}$ denote the m p-values
ordered from smallest to largest.
Find the largest integer k so that $p_{(k)} \le k \alpha / m$.
If no such k exists, set $c = 0$ (declare nothing
significant).
Otherwise set $c = p_{(k)}$ (reject the nulls
corresponding to the smallest k p-values).
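
The steps above translate directly into code. The following is a minimal sketch of the procedure as stated on this slide; the function name and the example p-values are hypothetical.

```python
def bh_cutoff(pvalues, alpha):
    """Benjamini-Hochberg step-up rule.

    Returns (k, c): k is the number of rejected nulls and c is the
    p-value cutoff (0.0 when nothing is declared significant).
    """
    m = len(pvalues)
    ordered = sorted(pvalues)
    k = 0
    for rank, p in enumerate(ordered, start=1):
        # Keep the LARGEST rank that satisfies p_(k) <= k * alpha / m.
        if p <= rank * alpha / m:
            k = rank
    c = ordered[k - 1] if k > 0 else 0.0
    return k, c


# Hypothetical p-values for ten genes.
example = [0.001, 0.008, 0.039, 0.041, 0.042,
           0.06, 0.074, 0.205, 0.212, 0.216]
k, c = bh_cutoff(example, alpha=0.05)
print(f"reject {k} nulls, cutoff c = {c}")
```

Note that the loop does not stop at the first failure: a larger rank can still satisfy the inequality, and the procedure then rejects every null with a p-value at or below that cutoff.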

19
An Example

Suppose 10,000 genes are tested for differential
expression between two treatments.
Suppose the 200th smallest p-value is 0.001.
If no genes were truly differentially expressed,
how many of the 10,000 p-values would be expected
to be less than or equal to 0.001?
Use the calculations above to provide an estimate
of the proportion of false positive results among
the list of 200 genes with p-values no larger
than 0.001.

20
P-value Distribution Under H0

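It can be shown that $P(\text{p-value} \le c) = c$ for all $c \in (0,1)$
under $H_0$ (cdf property).
P-values are uniformly distributed on the open
interval (0,1) under the null hypothesis.

[Figure: p-value distributions for $\mu_1 - \mu_2 = 0$, $\mu_1 - \mu_2 = 0.5$, and $\mu_1 - \mu_2 = 1$.]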
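
The uniformity claim can be checked empirically. The sketch below is an assumption-laden stand-in for the slide's setting: it uses a two-sample z-test with known variance 1 (not the t-test from the slides) so that it needs only the standard library, with $n_1 = n_2 = 5$ and simulated data.

```python
import math
import random


def simulate_p_values(delta, n_sims, n=5, seed=0):
    """Two-sided z-test p-values for mean difference delta, variance 1."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n)  # standard error of the difference in means
    pvals = []
    for _ in range(n_sims):
        g1 = [rng.gauss(delta, 1.0) for _ in range(n)]
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(g1) / n - sum(g2) / n) / se
        pvals.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return pvals


def fraction_at_most(pvals, c):
    """Empirical estimate of P(p-value <= c)."""
    return sum(p <= c for p in pvals) / len(pvals)


if __name__ == "__main__":
    for delta in (0.0, 0.5, 1.0):
        pvals = simulate_p_values(delta, 20000)
        print(delta, fraction_at_most(pvals, 0.05))
```

With `delta = 0` the fraction of p-values at or below any cutoff c should be close to c; as `delta` grows, small p-values become more common and the distribution moves away from uniform.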
21
Solution

If all 10,000 null hypotheses were true, we would
expect $0.001 \times 10{,}000 = 10$ tests to yield p-values
less than or equal to 0.001.
A simple estimate of the proportion of false
positive results among the list of 200 genes with
p-values no larger than 0.001 is
$0.001 \times 10{,}000 / 200 = 0.05$.
Recall that the BH FDR procedure involves
finding the largest integer k so that $p_{(k)} \le k \alpha / m$.
This is equivalent to finding the largest integer
k such that $p_{(k)} m / k \le \alpha$.
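
The arithmetic in this solution, and its link to the BH inequality, can be written out as a short sketch. The function names are hypothetical and chosen only for this illustration.

```python
def expected_null_hits(p_cutoff, m):
    """Expected number of p-values <= p_cutoff if all m nulls are true."""
    return p_cutoff * m


def estimated_false_positive_proportion(p_cutoff, m, k):
    """Expected null hits divided by the k genes on the list."""
    return expected_null_hits(p_cutoff, m) / k


def bh_condition(p_k, k, m, alpha):
    """BH inequality for the kth smallest p-value."""
    return p_k <= k * alpha / m


# Numbers from the example: m = 10,000 genes, 200th smallest p-value 0.001.
print(expected_null_hits(0.001, 10_000))
print(estimated_false_positive_proportion(0.001, 10_000, 200))
```

Comparing `estimated_false_positive_proportion(p_k, m, k)` with $\alpha$ is the same check as `bh_condition(p_k, k, m, alpha)`, just rearranged.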

22
Other Methods for Estimating or Controlling FDR

Rewrite $p_{(k)} m / k = p_{(k)} (m_0 + m_1) / k = (p_{(k)} m_0 + p_{(k)} m_1) / k$.
Actual number of type I errors ???
Consider finding the largest integer k such
that $p_{(k)} m_0 / k \le \alpha$.
Produces a gene list at least as long while
controlling FDR at the same level.
But since $m_0$ is unknown, it is replaced with an
estimate: $p_{(k)} \hat{m}_0 / k \le \alpha$.

23
Histogram of p-values from the Two-Sample t-Tests
[Figure: histogram; x-axis p-value, y-axis Number of Genes.]
24
Mixture of a Uniform Distribution and a
Distribution Stochastically Smaller than Uniform
[Figure: histogram; x-axis p-value, y-axis Number of Genes.]
25
Estimating FDR Using Estimates of m0

Benjamini, Y., and Hochberg, Y. (2000). On the
adaptive control of the false discovery rate in
multiple testing with independent statistics.
Journal of Educational and Behavioral Statistics,
25, 60-83.
Mosig, M. O., Lipkin, E., Galina, K., Tchourzyna,
E., Soller, M., and Friedmann, A. (2001). A
whole genome scan for quantitative trait loci
affecting milk protein percentage in
Israeli-Holstein cattle, by means of selective
milk DNA pooling in a daughter design, using an
adjusted false discovery rate criterion.
Genetics, 157, 1683-1698.
Storey, J. D., and Tibshirani, R. (2001).
Estimating false discovery rates under
dependence, with applications to DNA microarrays.
Technical Report 2001-28, Department of
Statistics, Stanford University.
Storey, J. D. (2002). A direct approach to false
discovery rates. Journal of the Royal Statistical
Society, Series B, 64, 479-498.

26
Estimating FDR Using Estimates of m0

Genovese, C. and Wasserman, L. (2002). Operating
characteristics and extensions of the false
discovery rate procedure, Journal of the Royal
Statistical Society, Series B, 64, 499-517.
Storey, J. D. (2003). The positive false discovery
rate: A Bayesian interpretation and the q-value.
Annals of Statistics, 31, 2013-2035.
Storey, J. D., and Tibshirani, R. (2003).
Statistical significance for genomewide studies.
Proceedings of the National Academy of Sciences,
100, 9440-9445.
Storey, J. D., Taylor, J. E., and Siegmund, D. (2004).
Strong control, conservative point estimation,
and simultaneous conservative consistency of
false discovery rates: A unified approach.
Journal of the Royal Statistical Society, Series
B, 66, 187-205.

27
Estimating FDR Using Estimates of m0

Fernando, R. L., Nettleton, D., Southey, B. R.,
Dekkers, J. C. M., Rothschild, M. F., and Soller,
M. (2004). Controlling the proportion of false
positives (PFP) in multiple dependent tests.
Genetics, 166, 611-619.
Genovese, C. and Wasserman, L. (2004). A
stochastic process approach to false discovery
control. The Annals of Statistics, 32,
1035-1061.
Nettleton, D., Hwang, J.T.G., Caldo, R.A., Wise,
R.P. (2006). Estimating the number of true null
hypotheses from a histogram of p-values. Journal
of Agricultural, Biological, and Environmental
Statistics, 11, 337-356.

28
Estimating FDR Using Estimates of m0

Ruppert, D., Nettleton, D., Hwang, J.T.G. (2007).
Exploring the information in p-values for the
analysis and planning of multiple-test
experiments. Biometrics, 63, 483-495.
Plus many more....

29
A method for obtaining a list of genes that has
an estimated FDR $\le \alpha$

1. Find the largest integer k such that
$p_{(k)} \hat{m}_0 / k \le \alpha$,
where $\hat{m}_0$ is an estimate of the number of
true null hypotheses among the m tests.
2. If no such k exists, declare nothing
significant. Otherwise, reject the null
hypotheses corresponding to the smallest k
p-values.
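
A sketch of this two-step method follows. The slides do not say which estimator of the number of true nulls to use, so the estimator here is an assumption: a simple fixed-threshold rule in the spirit of Storey (2002), which counts p-values above a threshold `lam` and scales by `1 / (1 - lam)`. The function names, `lam = 0.5`, and the example p-values are all hypothetical.

```python
def estimate_m0(pvalues, lam=0.5):
    """Assumed estimator of the number of true nulls.

    P-values from true nulls are uniform, so about m0 * (1 - lam) of
    them should exceed lam; p-values above lam are treated as nulls.
    """
    m = len(pvalues)
    m0_hat = sum(p > lam for p in pvalues) / (1.0 - lam)
    return min(m0_hat, float(m))  # never estimate more nulls than tests


def adaptive_rejections(pvalues, alpha, m0_hat):
    """Largest k with p_(k) * m0_hat / k <= alpha (0 if none exists)."""
    k = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p * m0_hat / rank <= alpha:
            k = rank
    return k


# Hypothetical p-values for ten genes.
example = [0.001, 0.002, 0.003, 0.004, 0.03, 0.2, 0.4, 0.6, 0.7, 0.9]
m0_hat = estimate_m0(example)
print(m0_hat, adaptive_rejections(example, 0.05, m0_hat))
```

Passing `m0_hat = len(pvalues)` recovers the ordinary BH list, so any estimate smaller than m can only make the list the same length or longer.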

30
q-values

Recall that a p-value for an individual test can
be defined as the smallest significance level
(tolerable type 1 error rate) for which we can
reject the null hypothesis.
The q-value for one test in a family of tests is
the smallest FDR for which we can reject the null
hypothesis for that one test and all others with
smaller p-values.

31
The q-value for a given test fills the blanks in
the following sentences:

To reject the null hypothesis for this test and
all others with smaller p-values, I must be
willing to accept a false discovery rate of
_______.
To include this gene on my list of
differentially expressed genes, I must be willing
to accept a false discovery rate of _____.

32
Computation and Use of q-values

Let $q_{(i)}$ denote the q-value that corresponds to
the ith smallest p-value $p_{(i)}$.
$q_{(i)} = \min\{\, p_{(k)} \hat{m}_0 / k : k = i, \ldots, m \,\}$.
To produce a list of genes with estimated FDR
$\le \alpha$, include all genes with q-values $\le \alpha$.
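
The q-value formula above is a running minimum taken from the largest p-value downward. A minimal sketch, assuming an estimate of the number of true nulls is already available (the value 6.0 and the p-values below are hypothetical):

```python
def q_values(pvalues, m0_hat):
    """q-value for each p-value, returned in the input order.

    q_(i) = min over k >= i of p_(k) * m0_hat / k, where p_(k) is the
    kth smallest p-value.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = float("inf")
    # Walk from the largest p-value to the smallest, tracking the minimum.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m0_hat / rank)
        q[idx] = running_min
    return q


# Hypothetical p-values (unsorted) and a hypothetical estimate m0_hat = 6.
example = [0.9, 0.001, 0.03, 0.002, 0.6, 0.003, 0.2, 0.004, 0.7, 0.4]
qs = q_values(example, 6.0)
gene_list = [i for i, q in enumerate(qs) if q <= 0.05]
print(gene_list)
```

The running minimum guarantees that q-values never decrease as p-values increase, so thresholding q-values at a given level always selects the genes with the smallest p-values.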

33
We will convert these p-values to q-values using
the method of Storey and Tibshirani.
[Figure: histogram; x-axis p-value, y-axis Number of Genes.]
34
[Figure or table: p-values and the corresponding q-values.]
35
[Figure or table: p-values and the corresponding q-values.]
36
Remarks

In many cases, it will be difficult to separate
many of the differentially expressed genes
from the non-differentially expressed genes.
Genes with a small expression change relative to
their variation will have a p-value distribution
that is not far from uniform if the number of
experimental units per treatment is low.

37
Example p-value Distributions
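Two-sample t-test of $H_0: \mu_1 = \mu_2$; $n_1 = n_2 = 5$, variance $= 1$
[Figure: three panels of p-value distributions, for $\mu_1 - \mu_2 = 1$, $\mu_1 - \mu_2 = 0.5$, and $\mu_1 - \mu_2 = 0$.]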
38
Remarks

In many cases, it will be difficult to separate
many of the differentially expressed genes
from the non-differentially expressed genes.
Genes with a small expression change relative to
their variation will have a p-value distribution
that is not far from uniform if the number of
experimental units per treatment is low.
To do a better job of separating the
differentially expressed genes from the
non-differentially expressed genes we need to use
good experimental designs with more replications
per treatment.
Many experiments call for more complicated
analyses than simple t-tests, but multiple
testing issues remain.

39
Using Information about Genes to Interpret the
Results of Microarray Experiments

Based on a large body of past research, some
information is known about many of the genes
represented on a microarray.
The information might include tissues in which a
gene is known to be expressed, the biological
process in which a gene's protein is known to
act, or other general or quite specific details
about the function of the protein produced by a
gene.
By examining this information in concert with the
results of a microarray experiment, biologists
can often gain a greater understanding of their
microarray experiments.

40
Gene Ontology (GO) Terms

GO terms provide one example of information that
is available about genes.
The GO project provides three ontologies
(structured controlled vocabularies) that
describe a gene's
1. Biological Processes,
2. Cellular Components, and
3. Molecular Functions.

41
Gene Ontology (GO) Terms

Each gene may be associated with 0 or more GO
terms in a given ontology.
The GO terms in each ontology have varying levels
of specificity.
The GO terms in each ontology can be organized in
a directed acyclic graph (DAG) where each node
represents a term and arrows point from specific
terms to more general terms.

42
Portion of the Biological Processes Ontology
Shown in a DAG
[Figure: DAG with the nodes listed below; the edges are not recoverable from the transcript.]
Alcohol Metabolic Process
Energy Derivation by Oxidation of Organic Compounds
Carbohydrate Metabolic Process
Generation of Precursor Metabolites and Energy
Cellular Metabolic Process
Primary Metabolic Process
Macromolecule Metabolic Process
Cellular Process
Metabolic Process
Biological Process
43
Constructing Gene Categories from GO Terms

The set of genes associated with any particular
GO term could be considered as a category or gene
set of interest for subsequent testing.
For example, we might ask if genes that are
associated with the Molecular Function term
muscle alpha-actinin binding are affected by a
treatment of interest.
We could simultaneously query many groups,
general and specific, to better understand the
impact of treatment on expression.

44
Simultaneous Testing of Multiple Categories with
Various Levels of Specificity
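[Figure: hierarchy of Molecular Function terms with the nodes listed below; the edges are not recoverable from the transcript.]
muscle alpha-actinin binding
alpha-actinin binding
beta-actinin binding
actinin binding
myosin binding
ATPase binding
RNA polymerase core enzyme binding
cytoskeletal protein binding
enzyme binding
protein binding
binding
molecular function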
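
Building categories at several levels of specificity amounts to propagating each gene's annotations up the DAG. The sketch below is illustrative only: the child-to-parent edges are an assumed reading of how a subset of the terms above nest (the transcript does not preserve the edges), and the gene names and annotations are hypothetical.

```python
# Assumed child -> parents edges for a subset of the terms (illustrative).
PARENTS = {
    "muscle alpha-actinin binding": ["alpha-actinin binding"],
    "alpha-actinin binding": ["actinin binding"],
    "actinin binding": ["cytoskeletal protein binding"],
    "myosin binding": ["cytoskeletal protein binding"],
    "cytoskeletal protein binding": ["protein binding"],
    "enzyme binding": ["protein binding"],
    "protein binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical direct annotations: gene -> most specific known terms.
ANNOTATIONS = {
    "geneA": ["muscle alpha-actinin binding"],
    "geneB": ["myosin binding"],
    "geneC": ["enzyme binding"],
}


def ancestors(term, parents):
    """All more general terms reachable from term by following arrows."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def gene_sets(annotations, parents):
    """Map each term to the genes annotated to it or to a descendant."""
    sets = {}
    for gene, terms in annotations.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                sets.setdefault(t, set()).add(gene)
    return sets


categories = gene_sets(ANNOTATIONS, PARENTS)
print(sorted(categories["cytoskeletal protein binding"]))
```

Each resulting set is one candidate category; general terms collect many genes and specific terms only a few, which is what allows the same data to be queried at different levels of detail.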
45
Some Formal Methods for Testing Gene Categories
with Microarray Data

Fisher's exact test on lists of genes declared to
be differentially expressed (DDE)
Gene Set Enrichment Analysis (GSEA)
Significance Analysis of Function and Expression
(SAFE)
Pathway Level Analysis of Gene Expression (PLAGE)
Domain Enhanced Analysis (DEA)
Many others appearing and soon to appear

46
Number of Genes Declared to be Differentially
Expressed for Various Estimated FDR Levels
FDR     Number of Genes    P-Value Threshold
0.01    8                  0.000003
0.05    313                0.000900
0.10    748                0.004339
0.15    1465               0.012730
0.20    2143               0.024909
FDR estimated using the method of Storey and
Tibshirani (2003).
47
Are genes of category X overrepresented among the
genes declared to be differentially expressed?

                                     Gene of Category X?
                                     yes      no       Total
Declared to be             yes       50       263      313
Differentially Expressed?  no        50       22327    22377
                           Total     100      22590    22690

Highly significant overrepresentation
according to a chi-square test or Fisher's exact
test.
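
The overrepresentation claim for this table can be checked with a one-sided Fisher's exact test, which is an upper-tail hypergeometric probability. The sketch below uses exact integer arithmetic from the standard library; the function name is hypothetical, and the one-sided form is a choice made here, since the slide does not say which alternative was used.

```python
from math import comb


def overrepresentation_p_value(k, category_size, list_size, total):
    """P(X >= k) for a hypergeometric count X.

    X is the number of category genes on a list of list_size genes drawn
    at random, without replacement, from total genes, of which
    category_size belong to the category.
    """
    denominator = comb(total, list_size)
    upper = min(category_size, list_size)
    numerator = sum(
        comb(category_size, j) * comb(total - category_size, list_size - j)
        for j in range(k, upper + 1)
    )
    return numerator / denominator


# Counts from the table: 50 of the 313 listed genes are in category X,
# which has 100 genes among 22,690 in total.
p = overrepresentation_p_value(50, 100, 313, 22690)
print(p)
```

Under random selection about 313 x 100 / 22690, or roughly 1.4, category genes would be expected on the list, so observing 50 gives an extremely small p-value. This calculation still treats genes as independent, which is one of the concerns raised on the next slide.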
48
Problems with Chi-Square or Fisher's Exact Test
for Detecting Overrepresentation

The outcome of the overrepresentation test
depends on the significance threshold used to
declare genes differentially expressed.
Functional categories in which many genes exhibit
small changes may go undetected.
Genes are not independent, so a key assumption of
the chi-square and Fisher's exact tests is
violated.
Information in the multivariate distribution of
genes in a category is not utilized.