Using Statistical Methods to Obtain a List of Differentially Expressed Genes

1
Using Statistical Methods to Obtain a List of
Differentially Expressed Genes
Tim Bancroft and Dan Nettleton
BBSI Summer School
Iowa State University
June 16, 2009
2
Wild-type vs. Myostatin Knockout Mice
Belgian Blue cattle have a mutation in the
myostatin gene.
3
Affymetrix GeneChips on 5 Mice per Genotype
[Figure: GeneChip layout for 10 mice, five wild type (WT) and five myostatin knockout (M)]
4
The Dataset
[Table: one row per gene; columns are Gene ID, Wild Type, and Mutant]
5
A Standard Analysis
  • Two-sample t-test for each gene.
  • Test the null hypothesis H0i: µ1i = µ2i
    for the ith gene (wild-type
    mean = mutant mean).
  • Compute p-values by comparing t-statistics to a
    t-distribution with 8 d.f.
  • Use an adjustment for multiple testing to create
    a list of genes declared to be differentially
    expressed.
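
The per-gene test described above can be sketched in a few lines of standard-library Python. This is only an illustration: the function names and the expression values are hypothetical, the t-statistic uses the pooled-variance form (which gives the 8 d.f. mentioned on the slide when there are 5 mice per genotype), and the p-value uses a closed-form expression for the t-distribution that holds only for even degrees of freedom.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Pooled-variance two-sample t-statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, df


def two_sided_p_value(t, df):
    """Two-sided p-value P(|T| > |t|); this closed form is valid for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form covers even degrees of freedom only")
    cos2 = df / (df + t * t)
    sin_theta = math.sqrt(1 - cos2)
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= (2 * j - 1) / (2 * j) * cos2
        total += term
    return 1 - sin_theta * total


# Hypothetical expression values for one gene (5 wild-type, 5 mutant mice).
wild_type = [7.1, 6.9, 7.3, 7.0, 7.2]
mutant = [7.9, 8.1, 7.8, 8.2, 8.0]
t, df = pooled_t_statistic(wild_type, mutant)
p = two_sided_p_value(t, df)
print(f"t = {t:.2f}, df = {df}, p = {p:.2e}")
```

In practice the same computation would be repeated for every gene to obtain one p-value per gene; a statistics library would normally replace the hand-written p-value function.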

6
The Dataset
[Table: one row per gene; columns are Gene ID, Wild Type, Mutant, and p-value, with p-values p1, p2, ..., p22690]
7
Histogram of p-values from the Two-Sample t-Tests
[Figure: histogram; x-axis: p-value, y-axis: number of genes]
8
Example p-value Distributions
Two-sample t-test of H0: µ1 = µ2, n1 = n2 = 5, variance = 1
[Figure: three p-value histograms, for µ1 - µ2 = 1, µ1 - µ2 = 0.5, and µ1 - µ2 = 0]
9
Histogram of p-values from the Two-Sample t-Tests
[Figure: histogram; x-axis: p-value, y-axis: number of genes]
10
The Multiple Testing Problem
  • Suppose one test of interest has been conducted
    for each of m genes in a microarray experiment.
  • Let p1, p2, ... , pm denote the p-values
    corresponding to the m tests.
  • Let H01, H02, ... , H0m denote the null
    hypotheses corresponding to the m tests.

11
The Multiple Testing Problem (continued)
  • Suppose m0 of the null hypotheses are true and m1
    of the null hypotheses are false.
  • Let c denote a value between 0 and 1 that will
    serve as a cutoff for significance:
  • - Reject H0i if pi ≤ c
    (declare significant)
  • - Fail to reject (or accept) H0i if pi > c
    (declare non-significant)

12
Table of Outcomes
              Accept Null        Reject Null
              Declare Non-Sig.   Declare Sig.
              No Discovery       Declare Discovery
              Negative Result    Positive Result     Total
True Nulls    U                  V                   m0
False Nulls   T                  S                   m1
Total         W                  R                   m
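
As a small illustration of how the four cells of the table arise, the sketch below counts U, V, T, and S for a given cutoff c. It assumes the truth of each null hypothesis is known, which is only the case in simulations; the function name and the data are hypothetical.

```python
def outcome_counts(p_values, null_is_true, c):
    """Count U, V, T, S for cutoff c, given which nulls are really true."""
    counts = {"U": 0, "V": 0, "T": 0, "S": 0}
    for p, is_null in zip(p_values, null_is_true):
        rejected = p <= c
        if is_null:
            counts["V" if rejected else "U"] += 1  # true null: false positive or not
        else:
            counts["S" if rejected else "T"] += 1  # false null: detected or missed
    return counts


# Hypothetical p-values and (normally unknown) truth labels.
p_values = [0.001, 0.2, 0.03, 0.7, 0.04]
null_is_true = [False, True, True, True, False]
counts = outcome_counts(p_values, null_is_true, c=0.05)
print(counts)
```

The row and column totals follow directly: m0 = U + V, m1 = T + S, W = U + T, and R = V + S.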
13
Table of Outcomes
[Table of outcomes as on slide 12.]
Random variables: U, V, T, S, W, R
Constants: m0, m1, m
14
Table of Outcomes
[Table of outcomes as on slide 12.]
Unobservable: U, V, T, S, m0, m1
Observable: W, R, m
15
Table of Outcomes
[Table of outcomes as on slide 12.]
V = number of false positives = number of false
discoveries = number of type 1 errors
16
False Discovery Rate (FDR)
  • FDR is an error measure that can be useful for
    multiple testing problems encountered in
    microarray experiments.
  • FDR was introduced by Benjamini and Hochberg
    (1995) and is formally defined as follows:
  • R = number of rejected null hypotheses
  • V = number of type I errors (false discoveries)
  • FDR = E(Q), where Q = V/R if R > 0 and Q = 0 otherwise.
  • Controlling FDR amounts to choosing the
    significance cutoff c so that FDR is less than or
    equal to some desired level α.

17
A Conceptual Description of FDR
  • Suppose a scientist conducts 100 independent
    microarray experiments.
  • For each experiment, the scientist produces a
    list of genes declared to be differentially
    expressed by testing a null hypothesis for each
    gene.
  • For each list consider the ratio of the number of
    false positive results to the total number of
    genes on the list (set this ratio to 0 if the
    list contains no genes).
  • The FDR is approximated by the average of the
    ratios described above.
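
The thought experiment above can be mimicked with a small simulation. The sketch below is an assumption-laden illustration, not part of the original slides: it uses a fixed cutoff c, draws null p-values from a uniform distribution, and draws non-null p-values as the fourth power of a uniform variate simply to make them stochastically smaller than uniform. All names and settings are hypothetical.

```python
import random


def false_discovery_proportion(p_values, null_is_true, c):
    """Ratio V/R for one experiment; defined as 0 when nothing is rejected."""
    rejected = [is_null for p, is_null in zip(p_values, null_is_true) if p <= c]
    return sum(rejected) / len(rejected) if rejected else 0.0


def simulate_fdr(n_experiments=100, m0=900, m1=100, c=0.01, seed=1):
    """Approximate the FDR by averaging V/R over simulated experiments."""
    rng = random.Random(seed)
    null_is_true = [True] * m0 + [False] * m1
    ratios = []
    for _ in range(n_experiments):
        p_null = [rng.random() for _ in range(m0)]      # uniform under H0
        p_alt = [rng.random() ** 4 for _ in range(m1)]  # assumed non-null model
        ratios.append(false_discovery_proportion(p_null + p_alt, null_is_true, c))
    return sum(ratios) / len(ratios)


fdr = simulate_fdr()
print(f"approximate FDR at c = 0.01: {fdr:.3f}")
```

The ratio can only be computed here because the simulation knows which nulls are true; in a real experiment V is unobservable.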

18
The Benjamini and Hochberg Procedure for
(Strongly) Controlling FDR at Level α
  • Let p(1), p(2), ... , p(m) denote the m p-values
    ordered from smallest to largest.
  • Find the largest integer k so that p(k) ≤ k α /
    m.
  • If no such k exists, set c = 0 (declare nothing
    significant).
  • Otherwise set c = p(k) (reject the nulls
    corresponding to the smallest k p-values).
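
The steps of the procedure translate almost line for line into code. The following is a minimal sketch with a hypothetical function name and made-up p-values; it returns the cutoff c together with the indices of the rejected hypotheses.

```python
def benjamini_hochberg(p_values, alpha):
    """Return (c, rejected_indices) for the BH procedure at level alpha."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank  # keep the largest rank that satisfies the condition
    if k == 0:
        return 0.0, []  # no such k: declare nothing significant
    c = p_values[order[k - 1]]
    return c, sorted(order[:k])


# Hypothetical p-values for ten genes.
p_values = [0.041, 0.001, 0.212, 0.008, 0.039, 0.06, 0.074, 0.205, 0.042, 0.216]
c, rejected = benjamini_hochberg(p_values, alpha=0.05)
print(c, rejected)
```

Note that the search is for the largest qualifying k, so a p-value can be rejected even if it exceeds its own threshold, as long as some larger p-value meets its threshold.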

19
An Example
  • Suppose 10,000 genes are tested for differential
    expression between two treatments.
  • Suppose the 200th smallest p-value is 0.001.
  • If no genes were truly differentially expressed,
    how many of the 10,000 p-values would be expected
    to be less than or equal to 0.001?
  • Use the calculations above to provide an estimate
    of the proportion of false positive results among
    the list of 200 genes with p-values no larger
    than 0.001.

20
P-value Distribution Under H0
  • It can be shown that P(p-value ≤ c) = c for all c ∈
    (0,1) under H0 (cdf property).
  • P-values are uniformly distributed on the open
    interval (0,1) under the null hypothesis.

[Figure: p-value histograms for µ1 - µ2 = 0, µ1 - µ2 = 0.5, and µ1 - µ2 = 1]
21
Solution
  • If all 10,000 null hypotheses were true, we would
    expect 0.001 × 10,000 = 10 tests to yield p-values
    less than or equal to 0.001.
  • A simple estimate of the proportion of false
    positive results among the list of 200 genes with
    p-values no larger than 0.001 is 0.001 × 10,000 / 200
    = 0.05.
  • Recall that the BH FDR procedure involves
    finding the largest integer k so that p(k) ≤ k α
    / m.
  • This is equivalent to finding the largest integer
    k such that p(k) m / k ≤ α.
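
The arithmetic in this solution is the quantity p(k) m / k evaluated at k = 200. A short sketch, with a hypothetical function name, makes the two steps explicit:

```python
def estimated_false_positive_proportion(p_k, m, k):
    """Estimate p(k) * m / k for the list of the k smallest p-values."""
    # Expected number of p-values <= p_k if all m nulls were true.
    expected_false_positives = p_k * m
    # Divide by the number of genes actually on the list.
    return expected_false_positives / k


estimate = estimated_false_positive_proportion(p_k=0.001, m=10_000, k=200)
print(estimate)
```

Comparing this estimate with α for each k, and keeping the largest k that passes, is the BH procedure written in its equivalent form.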

22
Other Methods for Estimating or Controlling FDR

Rewrite p(k) m / k = p(k) (m0 + m1) / k =
( p(k) m0 + p(k) m1 ) / k
Actual number of type I errors ???
Consider finding the largest integer k such
that p(k) m0 / k ≤ α
Produces a gene list at least as long while
controlling FDR at the same level
But since m0 is unknown, it is replaced with an
estimate m̂0: p(k) m̂0 / k ≤ α

23
Histogram of p-values from the Two-Sample t-Tests
[Figure: histogram; x-axis: p-value, y-axis: number of genes]
24
Mixture of a Uniform Distribution and a
Distribution Stochastically Smaller than Uniform
[Figure: histogram; x-axis: p-value, y-axis: number of genes]
25
Estimating FDR Using Estimates of m0
  • Benjamini Y. and Hochberg Y. (2000). On the
    adaptive control of the false discovery rate in
    multiple testing with independent statistics.
    Journal of Educational and Behavioral Statistics
    25, 60-83.
  • Mosig, M. O., Lipkin, E., Galina, K., Tchourzyna,
    E., Soller, M., and Friedmann, A. (2001). A
    whole genome scan for quantitative trait loci
    affecting milk protein percentage in
    Israeli-Holstein cattle, by means of selective
    milk DNA pooling in a daughter design, using an
    adjusted false discovery rate criterion.
    Genetics, 157, 1683-1698.
  • Storey, J. D., and Tibshirani, R. (2001).
    Estimating false discovery rates under
    dependence, with applications to DNA microarrays.
    Technical Report 2001-28, Department of
    Statistics, Stanford University.
  • Storey, J. D. (2002). A direct approach to false
    discovery rates. Journal of the Royal Statistical
    Society, Series B, 64, 479-498.

26
Estimating FDR Using Estimates of m0
  • Genovese, C. and Wasserman, L. (2002). Operating
    characteristics and extensions of the false
    discovery rate procedure. Journal of the Royal
    Statistical Society, Series B, 64, 499-517.
  • Storey, J. D. (2003). The positive false discovery
    rate: A Bayesian interpretation and the q-value.
    Annals of Statistics, 31, 2013-2035.
  • Storey, J. D., and Tibshirani, R. (2003).
    Statistical significance for genomewide studies.
    Proceedings of the National Academy of Sciences,
    100, 9440-9445.
  • Storey, J. D., Taylor, J. E., and Siegmund, D. (2004).
    Strong control, conservative point estimation,
    and simultaneous conservative consistency of
    false discovery rates: A unified approach.
    Journal of the Royal Statistical Society, Series
    B, 66, 187-205.

27
Estimating FDR Using Estimates of m0
  • Fernando, R. L., Nettleton, D., Southey, B. R.,
    Dekkers, J. C. M., Rothschild, M. F., and Soller,
    M. (2004). Controlling the proportion of false
    positives (PFP) in multiple dependent tests.
    Genetics, 166, 611-619.
  • Genovese, C. and Wasserman, L. (2004). A
    stochastic process approach to false discovery
    control. The Annals of Statistics, 32,
    1035-1061.
  • Nettleton, D., Hwang, J. T. G., Caldo, R. A., and Wise,
    R. P. (2006). Estimating the number of true null
    hypotheses from a histogram of p-values. Journal
    of Agricultural, Biological, and Environmental
    Statistics, 11, 337-356.

28
Estimating FDR Using Estimates of m0
  • Ruppert, D., Nettleton, D., and Hwang, J. T. G. (2007).
    Exploring the information in p-values for the
    analysis and planning of multiple-test
    experiments. Biometrics, 63, 483-495.
  • Plus many more....

29
A method for obtaining a list of genes that has
an estimated FDR ≤ α
  • 1. Find the largest integer k such that
    p(k) m̂0 / k ≤ α,
    where m̂0 is an estimate of the number of
    true null hypotheses among the m tests.
  • 2. If no such k exists, declare nothing
    significant. Otherwise, reject the null
    hypotheses corresponding to the smallest k
    p-values.
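
The two steps above can be sketched as follows. The slides do not say which estimator of m0 to use, so the estimator here is an assumption: a simple fixed-threshold rule in the spirit of Storey (2002), which counts the p-values above a threshold lam and rescales by 1 / (1 - lam). The function names, the default lam = 0.5, and the p-values are all hypothetical.

```python
def estimate_m0(p_values, lam=0.5):
    """Assumed estimator of m0: count of p-values above lam, scaled by 1/(1-lam)."""
    m = len(p_values)
    return min(m, sum(p > lam for p in p_values) / (1 - lam))


def adaptive_gene_list(p_values, alpha, m0_hat):
    """Indices of the k smallest p-values, k largest with p(k) * m0_hat / k <= alpha."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] * m0_hat / rank <= alpha:
            k = rank
    return sorted(order[:k])  # empty list when no such k exists


# Hypothetical p-values for ten genes.
p_values = [0.001, 0.008, 0.012, 0.041, 0.042, 0.06, 0.074, 0.205, 0.612, 0.816]
m0_hat = estimate_m0(p_values)
adaptive = adaptive_gene_list(p_values, alpha=0.05, m0_hat=m0_hat)
plain_bh = adaptive_gene_list(p_values, alpha=0.05, m0_hat=len(p_values))
print(m0_hat, len(adaptive), len(plain_bh))
```

Setting m0_hat equal to m recovers the BH procedure, so whenever the estimate is below m the adaptive list is at least as long as the BH list. With very few tests, as in this toy example, the estimate of m0 is too noisy to be trusted.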



30
q-values
  • Recall that a p-value for an individual test can
    be defined as the smallest significance level
    (tolerable type 1 error rate) for which we can
    reject the null hypothesis.
  • The q-value for one test in a family of tests is
    the smallest FDR for which we can reject the null
    hypothesis for that one test and all others with
    smaller p-values.

31
The q-value for a given test fills the blanks in
the following sentences:
  • To reject the null hypothesis for this test and
    all others with smaller p-values, I must be
    willing to accept a false discovery rate of
    _______.
  • To include this gene on my list of
    differentially expressed genes, I must be willing
    to accept a false discovery rate of _____.

32
Computation and Use of q-values
  • Let q(i) denote the q-value that corresponds to
    the ith smallest p-value p(i).
  • q(i) = min { p(k) m̂0 / k : k = i, ..., m }.
  • To produce a list of genes with estimated FDR
    ≤ α, include all genes with q-values ≤ α.
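
The minimum over k = i, ..., m can be computed in a single backward pass over the sorted p-values. The sketch below follows the formula on this slide directly; the function name, the value of the m0 estimate, and the p-values are hypothetical, and no cap at 1 is applied because the slide does not mention one.

```python
def q_values(p_values, m0_hat):
    """q-value for each test: min over k >= rank of p(k) * m0_hat / k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = float("inf")
    # Walk from the largest p-value down so the running minimum covers k = i..m.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m0_hat / rank)
        q[i] = running_min
    return q


# Hypothetical p-values and a hypothetical estimate of m0.
p_values = [0.001, 0.008, 0.012, 0.041, 0.042, 0.06, 0.074, 0.205, 0.612, 0.816]
q = q_values(p_values, m0_hat=4.0)
gene_list = [i for i, value in enumerate(q) if value <= 0.05]
print(gene_list)
```

Because of the running minimum, q-values never decrease as p-values increase, so thresholding q-values at α always selects a set of the smallest p-values.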


33
We will convert these p-values to q-values using
the method of Storey and Tibshirani.
[Figure: histogram; x-axis: p-value, y-axis: number of genes]
34
p-values q-values
35
p-values q-values
36
Remarks
  • In many cases, it will be difficult to separate
    many of the differentially expressed genes
    from the non-differentially expressed genes.
  • Genes with a small expression change relative to
    their variation will have a p-value distribution
    that is not far from uniform if the number of
    experimental units per treatment is low.

37
Example p-value Distributions
Two-sample t-test of H0: µ1 = µ2, n1 = n2 = 5, variance = 1
[Figure: three p-value histograms, for µ1 - µ2 = 1, µ1 - µ2 = 0.5, and µ1 - µ2 = 0]
38
Remarks
  • In many cases, it will be difficult to separate
    many of the differentially expressed genes
    from the non-differentially expressed genes.
  • Genes with a small expression change relative to
    their variation will have a p-value distribution
    that is not far from uniform if the number of
    experimental units per treatment is low.
  • To do a better job of separating the
    differentially expressed genes from the
    non-differentially expressed genes we need to use
    good experimental designs with more replications
    per treatment.
  • Many experiments call for more complicated
    analyses than simple t-tests, but multiple
    testing issues remain.

39
Using Information about Genes to Interpret the
Results of Microarray Experiments
  • Based on a large body of past research, some
    information is known about many of the genes
    represented on a microarray.
  • The information might include tissues in which a
    gene is known to be expressed, the biological
    process in which a gene's protein is known to
    act, or other general or quite specific details
    about the function of the protein produced by a
    gene.
  • By examining this information in concert with the
    results of a microarray experiment, biologists
    can often gain a greater understanding of their
    microarray experiments.

40
Gene Ontology (GO) Terms
  • GO terms provide one example of information that
    is available about genes.
  • The GO project provides three ontologies
    (structured controlled vocabularies) that
    describe a gene's
  • 1. Biological Processes,
  • 2. Cellular Components, and
  • 3. Molecular Functions.

41
Gene Ontology (GO) Terms
  • Each gene may be associated with 0 or more GO
    terms in a given ontology.
  • The GO terms in each ontology have varying levels
    of specificity.
  • The GO terms in each ontology can be organized in
    a directed acyclic graph (DAG) where each node
    represents a term and arrows point from specific
    terms to more general terms.

42
Portion of the Biological Processes
Ontology Shown in a DAG
Alcohol Metabolic Process
Energy Derivation by Oxidation of Organic
Compounds
Carbohydrate Metabolic Process
Generation of Precursor Metabolites and Energy
Cellular Metabolic Process
Primary Metabolic Process
Macromolecule Metabolic Process
Cellular Process
Metabolic Process
Biological Process
43
Constructing Gene Categories from GO Terms
  • The set of genes associated with any particular
    GO term could be considered as a category or gene
    set of interest for subsequent testing.
  • For example, we might ask if genes that are
    associated with the Molecular Function term
    "muscle alpha-actinin binding" are affected by a
    treatment of interest.
  • We could simultaneously query many groups,
    general and specific, to better understand the
    impact of treatment on expression.

44
Simultaneous Testing of Multiple Categories with
Various Levels of Specificity
muscle alpha-actinin binding
alpha-actinin binding
beta-actinin binding
actinin binding
myosin binding
ATPase binding
RNA polymerase core enzyme binding
cytoskeletal protein binding
enzyme binding
protein binding
binding
molecular function
45
Some Formal Methods for Testing Gene Categories
with Microarray Data
  • Fisher's exact test on lists of genes declared to
    be differentially expressed (DDE)
  • Gene Set Enrichment Analysis (GSEA)
  • Significance Analysis of Function and Expression
    (SAFE)
  • Pathway Level Analysis of Gene Expression (PLAGE)
  • Domain Enhanced Analysis (DEA)
  • Many others appearing and soon to appear

46
Number of Genes Declared to be Differentially
Expressed for Various Estimated FDR Levels
FDR     Number of Genes     P-Value Threshold
0.01          8             0.000003
0.05        313             0.000900
0.10        748             0.004339
0.15       1465             0.012730
0.20       2143             0.024909
FDR estimated using the method of Storey and
Tibshirani (2003).
47
Are genes of category X overrepresented among the
genes declared to be differentially expressed?
                                Gene of Category X?
Declared to be
Differentially Expressed?       yes        no     Total
yes                              50       263       313
no                               50     22327     22377
Total                           100     22590     22690
Highly significant overrepresentation
according to a chi-square test or Fisher's exact
test.
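
To see why the overrepresentation in this table is called highly significant, the one-sided Fisher's exact p-value can be computed from the hypergeometric distribution. The sketch below is an illustration with hypothetical function names; it works on the log scale with `math.lgamma` to avoid forming very large binomial coefficients.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def overrepresentation_p_value(a, b, c, d):
    """One-sided Fisher exact p-value P(X >= a) for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b   # genes declared differentially expressed
    col1 = a + c   # genes in category X
    n = a + b + c + d
    log_denom = log_comb(n, row1)
    upper = min(row1, col1)
    return sum(
        math.exp(log_comb(col1, x) + log_comb(n - col1, row1 - x) - log_denom)
        for x in range(a, upper + 1)
    )


# Counts from the table on this slide.
expected = 313 * 100 / 22690   # category X genes expected on the list by chance
p_value = overrepresentation_p_value(50, 263, 50, 22327)
print(f"expected = {expected:.2f}, observed = 50, p-value = {p_value:.1e}")
```

Fewer than two category X genes would be expected on the list by chance, while 50 were observed, which is why the p-value is vanishingly small.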
48
Problems with Chi-Square or Fisher's Exact Test
for Detecting Overrepresentation
  • The outcome of the overrepresentation test
    depends on the significance threshold used to
    declare genes differentially expressed.
  • Functional categories in which many genes exhibit
    small changes may go undetected.
  • Genes are not independent, so a key assumption of
    the chi-square and Fisher's exact tests is
    violated.
  • Information in the multivariate distribution of
    genes in a category is not utilized.