Summer Inst. of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis 1:30pm

1
Summer Inst. of Epidemiology and Biostatistics, 2010
Gene Expression Data Analysis
1:30pm - 5:00pm in Room W2015
  • Carlo Colantuoni
  • carlo_at_illuminatobiotech.com

http://www.illuminatobiotech.com/GEA2010/GEA2010.htm
2
Class Outline
  • Basic Biology & Gene Expression Analysis
    Technology
  • Data Preprocessing, Normalization, & QC
  • Measures of Differential Expression
  • Multiple Comparison Problem
  • Clustering and Classification
  • The R Statistical Language and Bioconductor
  • GRADES: independent project with Affymetrix data.

3
Class Outline - Detailed
  • Basic Biology & Gene Expression Analysis
    Technology
      • The Biology of Our Genome & Transcriptome
      • Genome and Transcriptome Structure & Databases
      • Gene Expression Microarray Technology
  • Data Preprocessing, Normalization, & QC
      • Intensity Comparison & Ratio vs. Intensity Plots
        (log transformation)
      • Background correction (PM-MM, RMA, GCRMA)
      • Global Mean Normalization
      • Loess Normalization
      • Quantile Normalization (RMA & GCRMA)
      • Quality Control: batches, plates, pins, hybs,
        washes, and other artifacts
      • Quality Control: PCA and MDS for dimension
        reduction
      • SVA: Surrogate Variable Analysis
  • Measures of Differential Expression
      • Basic Statistical Concepts
      • T-tests and Associated Problems
      • Significance analysis in microarrays (SAM)
        & Empirical Bayes
      • Complex ANOVAs (limma package in R)
  • Multiple Comparison Problem

4
DAY 3
  • Measures of Differential Expression
  • Review of basic statistical concepts
  • T-tests and associated problems
  • Significance analysis in microarrays (SAM)
    (Empirical Bayes)
  • Complex ANOVAs (limma package in R)
  • Multiple Comparison Problem: Bonferroni, FDR
  • Differential Expression of Functional Gene Groups
  • Notes on Experimental Design
5
Slides from Rob Scharpf
6
Fold-Change? T-Statistics?
Some genes are more variable than others
7
Slides from Rob Scharpf
8
Slides from Rob Scharpf
9
Slides from Rob Scharpf
10
Slides from Rob Scharpf
11
Slides from Rob Scharpf
12
Slides from Rob Scharpf
13
X1 - X2 is normally distributed if X1 and X2 are
normally distributed. Is this the case in
microarray data?
Slides from Rob Scharpf
14
Problem 1: T-statistic not t-distributed.
Implication: p-values/inference incorrect
15
P-values by permutation
  • It is common that the assumptions used to derive
    the statistics do not hold closely enough to
    yield useful p-values (e.g. when T-statistics are
    not t-distributed).
  • An alternative is to use permutations.

16
p-values by permutations
  • We focus on one gene only. For the bth iteration,
    b = 1, ..., B:
  • Permute the n data points for the gene (x). The
    first n1 are referred to as treatments, the
    second n2 as controls.
  • For each permutation, calculate the corresponding
    two-sample t-statistic, t_b.
  • After all the B permutations are done:
  • p = #{b : |t_b| ≥ |t_observed|} / B
  • This does not yet address the issue of multiple
    tests!
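
The permutation steps above can be sketched for a single gene as follows. This is a minimal illustration in Python rather than R, using only the standard library. The function names, the Welch form of the t-statistic, and the example values are assumptions made for the sketch, not something taken from the slides.

```python
import random
import statistics
from math import sqrt

def two_sample_t(x, y):
    """Two-sample t-statistic with an unequal-variance standard error."""
    se = sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

def permutation_p_value(treatments, controls, n_perm=2000, seed=0):
    """Fraction of permutations whose |t_b| is at least |t_observed|."""
    rng = random.Random(seed)
    t_observed = two_sample_t(treatments, controls)
    pooled = list(treatments) + list(controls)
    n1 = len(treatments)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # permute the n data points
        t_b = two_sample_t(pooled[:n1], pooled[n1:])
        if abs(t_b) >= abs(t_observed):
            hits += 1
    return hits / n_perm

# Hypothetical log-expression values for one gene.
treatments = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4]
controls = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
print(permutation_p_value(treatments, controls))
```

With only a handful of arrays per group the number of distinct permutations is small, so the smallest attainable p-value is limited by the design rather than by `n_perm`.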

17
The volcano plot shows, for a particular test,
negative log p-value against the effect size (M).
Another problem with t-tests
18
Remember this?
19
Problem 2: t-statistic bigger for genes with
smaller standard error estimates.
Implication: Ranking might not be optimal
20
Problem 2
  • With low N, SD estimates are unstable
  • Solutions:
      • Significance Analysis in Microarrays (SAM)
      • Empirical Bayes methods and Stein estimators

21
Significance analysis in microarrays (SAM)
  • A clever adaptation of the t-ratio to borrow
    information across genes
  • Implemented in Bioconductor in the siggenes
    package

Significance analysis of microarrays applied to
the ionizing radiation response, Tusher et al.,
PNAS 2001
22
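SAM d-statistic
  • For gene i:

$d_i = \frac{\bar{x}_{i1} - \bar{x}_{i2}}{s_i + s_0}$

  • $\bar{x}_{i1}$: mean of sample 1
  • $\bar{x}_{i2}$: mean of sample 2
  • $s_i$: standard deviation of repeated measurements for
    gene i
  • $s_0$: exchangeability factor estimated using all genes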
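
A small sketch of the d-statistic, taking d as the difference in group means divided by (s_i + s0). It is written in Python rather than R. The function names, the pooled standard error used for s_i, the fixed value of `s0`, and the two example genes are assumptions for illustration; SAM itself estimates s0 from all genes.

```python
import statistics
from math import sqrt

def pooled_se(x, y):
    """Pooled standard error of the difference in means (assumed form of s_i)."""
    n1, n2 = len(x), len(y)
    ss = (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    return sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))

def sam_d(x, y, s0):
    """Relative difference d = (mean1 - mean2) / (s_i + s0)."""
    return (statistics.mean(x) - statistics.mean(y)) / (pooled_se(x, y) + s0)

# Two hypothetical genes with the same mean difference but different spread.
quiet = ([5.00, 5.01, 4.99], [4.90, 4.91, 4.89])
noisy = ([5.3, 4.6, 5.1], [4.9, 5.3, 4.5])
for name, (x, y) in [("quiet", quiet), ("noisy", noisy)]:
    print(name, round(sam_d(x, y, s0=0.0), 2), round(sam_d(x, y, s0=0.1), 2))
```

With `s0 = 0` the low-variance gene gets a very large statistic from a tiny difference. A positive `s0` pulls that statistic down sharply while changing the noisier gene much less.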
23
Minimize the average CV across all genes.
24
Scatter plots of relative difference (d) vs
standard deviation (s) of repeated expression
measurements
A fix for this problem
Relative difference for a permutation of the
data that was balanced between cell lines 1 and 2.
Random fluctuations in the data, measured by
balanced permutations (for cell line 1 and 2)
25
eBayes: Borrowing Strength
  • An advantage of having tens of thousands of genes
    is that we can try to learn about typical
    standard deviations by looking at all genes
  • Empirical Bayes gives us a formal way of doing
    this
  • Shrinkage of variance estimates toward a
    prior: moderated t-statistics eliminate
    extreme statistics due to small variances.
  • Implemented in the limma package in R. In
    addition, limma provides methods for more complex
    experimental designs beyond simple, two-sample
    designs.
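
For reference, the variance shrinkage behind limma's moderated t-statistic (Smyth, 2004) can be written as below. The symbols are not taken from the slides: $s_g^2$ is the ordinary variance estimate for gene g on $d_g$ degrees of freedom, $s_0^2$ and $d_0$ are the prior variance and prior degrees of freedom estimated from all genes, $\hat{\beta}_g$ is the estimated log fold change, and $v_g$ is the unscaled variance factor from the design. Note that this $s_0$ is a prior standard deviation, not the SAM exchangeability factor.

```latex
\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g},
\qquad
\tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}}
```

Under the null hypothesis the moderated statistic is compared to a t-distribution with $d_0 + d_g$ degrees of freedom, which is where the gain over the ordinary t-test with few replicates comes from.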

26
The Multiple Comparison Problem
  • (some slides courtesy of John Storey)

27
Hypothesis Testing
  • Test for each gene
  • Null Hypothesis: no differential expression.
  • Two types of errors can be committed:
  • Type I error or false positive (say that a gene
    is differentially expressed when it is not, i.e.,
    reject a true null hypothesis).
  • Type II error or false negative (fail to identify
    a truly differentially expressed gene, i.e., fail
    to reject a false null hypothesis).

28
Hypothesis testing
  • Once you have a given score for each gene, how do
    you decide on a cut-off?
  • p-values are most common.
  • How do we decide on a cut-off when we are looking
    at many 1000s of tests?
  • Are 0.05 and 0.01 appropriate? How many false
    positives would we get if we applied these
    cut-offs to long lists of genes?

29
Multiple Comparison Problem
  • Even if we have good approximations of our
    p-values, we still face the multiple comparison
    problem.
  • When performing many independent tests, p-values
    no longer have the same interpretation.

30
Bonferroni Procedure
$\alpha = 0.05$, Tests = 1000
$\alpha^{*} = 0.05 / 1000 = 0.00005$
or
$p^{*} = p \times 1000$
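
The Bonferroni adjustment on this slide (divide the significance level by the number of tests, or equivalently multiply each p-value by it) can be sketched as follows. The function name and the p-values are made up for illustration. For scale, 1000 truly null tests at an unadjusted 0.05 cut-off would be expected to give about 50 false positives.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and reject/keep decisions."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]   # p* = p x m, capped at 1
    rejected = [p <= alpha / m for p in p_values]    # compare p to alpha / m
    return adjusted, rejected

# 1000 hypothetical tests: three small p-values and 997 unremarkable ones.
p_values = [0.00001, 0.0004, 0.03] + [0.5] * 997
adjusted, rejected = bonferroni(p_values)
print(adjusted[:3], rejected[:3])
```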
31
Bonferroni Procedure
Too conservative.
How else can we interpret many 1000s of observed
statistics?
Instead of evaluating each statistic individually,
can we assess a list of statistics?
FDR (Benjamini & Hochberg 1995)
32
FDR
  • Given a cut-off statistic, FDR gives us an
    estimate of the proportion of hits in our list of
    differentially expressed genes that are false.

Null: Equivalent Expression
Alternative: Differential Expression
33
False Discovery Rate
  • The false discovery rate measures the
    proportion of false positives among all genes
    called significant
  • This is usually appropriate because one wants to
    find as many truly differentially expressed genes
    as possible with relatively few false positives
  • The false discovery rate gives an estimate of the
    rate at which further biological verification
    will result in dead-ends
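
One common way to control the false discovery rate from a list of p-values is the Benjamini-Hochberg step-up procedure. The sketch below is in Python rather than R, and the function name and example p-values are assumptions for illustration. It returns adjusted p-values, so calling genes with an adjusted value of at most 0.05 targets an FDR of 5%.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
adjusted = benjamini_hochberg(p_values)
called = [p for p, a in zip(p_values, adjusted) if a <= 0.05]
print(called)
```

In R, `p.adjust(p, method = "BH")` performs the same adjustment.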

34
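Distribution of Statistics
[Figure: distributions of observed and permuted statistics, N = 90; x-axis: statistic]
35
Distribution of Statistics
[Figure: distributions of observed and permuted statistics, with the FDR labeled; x-axis: statistic]
36
Distribution of p-values
[Figure: distributions of observed and permuted p-values, N = 90; x-axis: p-value]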
37
SAM produces a modified T-statistic (d), and has
an approach to the multiple comparison problem.
38
Scatter plots of relative difference (d) vs
standard deviation (s) of repeated expression
measurements
A fix for this problem
Relative difference for a permutation of the
data that was balanced between cell lines 1 and 2.
Random fluctuations in the data, measured by
balanced permutations (for cell line 1 and 2)
39
Selected genes: beyond expected distribution
40
FDR = False Positives / Total Positive Calls
This FDR analysis requires enough samples in each
condition to estimate a statistic for each gene
(the observed statistic distribution), and enough
samples in each condition to permute many times
and recalculate this statistic (the null statistic
distribution).
What if we don't have this?
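
The ratio on this slide can be sketched as a permutation-based estimate: the average number of permuted statistics beyond a cut-off stands in for the false positives, and the number of observed statistics beyond the same cut-off gives the total positive calls. The function name and the toy numbers are assumptions for illustration. This is a simplified sketch and not the exact SAM procedure.

```python
def permutation_fdr(observed, permuted_sets, cutoff):
    """Estimate FDR for calls with |statistic| >= cutoff."""
    called = sum(1 for t in observed if abs(t) >= cutoff)
    if called == 0:
        return 0.0
    # Average calls per permuted dataset = expected number of false positives.
    expected_false = sum(
        sum(1 for t in perm if abs(t) >= cutoff) for perm in permuted_sets
    ) / len(permuted_sets)
    return min(1.0, expected_false / called)

# Toy numbers (made up): 6 genes, 2 permuted datasets.
observed = [5.2, -4.1, 3.3, 0.4, -0.9, 1.1]
permuted_sets = [[3.5, 0.2, -0.7, 1.0, 0.1, -0.3],
                 [0.8, -1.2, 0.5, 0.0, 2.9, -0.6]]
print(permutation_fdr(observed, permuted_sets, cutoff=3.0))
```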
41
FDR 0.05 Beyond 0.9
42
FDR 0.05 Beyond 0.9
43
(No Transcript)
44
(No Transcript)
45
False Positive Rate versus False Discovery Rate
  • False positive rate is the rate at which truly
    null genes are called significant
  • False discovery rate is the rate at which
    significant genes are truly null
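
In the usual notation (assumed here, not taken from the slides), let $m_0$ be the number of truly null genes, $V$ the number of truly null genes called significant, and $R$ the total number of genes called significant. The two rates then differ only in their denominators:

```latex
\mathrm{FPR} = \frac{E[V]}{m_0},
\qquad
\mathrm{FDR} = E\!\left[\frac{V}{\max(R, 1)}\right]
```

The false positive rate is conditioned on the truth (null genes), while the false discovery rate is conditioned on the decision (called genes). This is why a small FPR can still give a large FDR when $m_0$ is large and few genes are truly changed.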

46
False Positive Rate and P-values
  • The p-value is a measure of significance in terms
    of the false positive rate (aka Type I error
    rate)
  • P-value is defined to be the minimum false
    positive rate at which the statistic can be
    called significant
  • Can be described as the probability a truly null
    statistic is as or more extreme than the
    observed one

47
False Discovery Rate and Q-values
  • The q-value is a measure of significance in terms
    of the false discovery rate
  • Q-value is defined to be the minimum false
    discovery rate at which the statistic can be
    called significant
  • Can be described as the probability a statistic
    as or more extreme is truly null

48
Power and Sample Size Calculations are Hard
  • Need to specify:
      • α (Type I error rate, false positives) or FDR
      • σ (stdev will be sample- and gene-specific)
      • Effect size (how do we estimate?)
      • Power (1 - β, β = Type II error rate)
      • Sample Size
  • Some papers:
      • Mueller, Parmigiani et al. JASA (2004)
      • Rich Simon's group, Biostatistics (2005)
      • Tibshirani. A simple method for assessing sample
        sizes in microarray experiments. BMC
        Bioinformatics. 2006 Mar 2;7:106.
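
To show how the quantities listed above interact, here is a textbook normal-approximation calculation of the sample size per group for a single gene in a two-group comparison. It is a rough sketch only: the function name and the example values are assumptions, and it ignores the gene-specific variances and FDR-based criteria that the cited papers address.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(effect, sd, alpha=0.05, power=0.8):
    """n per group = 2 * (sd * (z_{1-alpha/2} + z_{power}) / effect)^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided Type I error rate
    z_beta = z.inv_cdf(power)            # power = 1 - Type II error rate
    return ceil(2 * (sd * (z_alpha + z_beta) / effect) ** 2)

# Hypothetical gene: log2 fold change of 1, standard deviation of 0.5.
print(samples_per_group(effect=1.0, sd=0.5))
print(samples_per_group(effect=1.0, sd=0.5, alpha=0.001))
```

Tightening alpha to allow for multiple testing raises the required sample size, which is one reason these calculations are hard to do once for a whole array.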

49
(No Transcript)
50
Beyond Individual Genes: Functional Gene Groups
  • Borrow statistical power across entire dataset
  • Integrate preexisting biological knowledge

51
Functional Annotation of Lists of Genes
KEGG, PFAM, SWISS-PROT, GO, DRAGON, DAVID/EASE,
MatchMiner, BioConductor (R)
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
(No Transcript)
59
Gene Cross-Referencing and Gene Annotation Tools
in BioConductor (in the R statistical language)
  • annotate package
  • Microarray-specific metadata packages
  • DB-specific metadata packages
  • AnnBuilder package
60
Annotation Tools in BioConductor: annotate package
  • Functions for accessing data in metadata packages.
  • Functions for accessing NCBI databases.
  • Functions for assembling HTML tables.
61
Annotation Tools in BioConductor: Annotation for
Commercial Microarrays
  • Array-specific metadata packages
62
Annotation Tools in BioConductor: Functional
Annotation with other DBs
  • GO metadata package
63
Annotation Tools in BioConductor: Functional
Annotation with other DBs
  • KEGG metadata package
64
Is there enrichment in our list of differentially
expressed genes for a particular functional gene
group or pathway?
Threshold Enrichment: One Way of Assessing
Differential Expression of Functional Gene Groups
65
Threshold Enrichment
66
Threshold Enrichment: One Way of Assessing
Differential Expression of Functional Gene Groups
67
Threshold Enrichment: One Way of Assessing
Differential Expression of Functional Gene Groups
The argument lower.tail indicates whether you are
looking for over- or under-representation of
differentially expressed genes within a
particular functional group (use lower.tail=F
for over-representation).
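
Threshold enrichment is usually assessed with a hypergeometric tail probability. The sketch below computes the over-representation p-value directly, in Python rather than R. The function name, argument names, and counts are assumptions for illustration. On the assumption that the slide refers to R's `phyper`, the equivalent call would be `phyper(k - 1, group_size, universe - group_size, list_size, lower.tail = FALSE)`.

```python
from math import comb

def hypergeom_upper_tail(k, group_size, universe, list_size):
    """P(X >= k): chance of k or more group genes in a random gene list."""
    total = comb(universe, list_size)
    upper = min(group_size, list_size)
    return sum(
        comb(group_size, i) * comb(universe - group_size, list_size - i)
        for i in range(k, upper + 1)
    ) / total

# Hypothetical counts: 10,000 genes on the array, 100 in the functional group,
# 200 called differentially expressed, 10 of those in the group (2 expected).
print(hypergeom_upper_tail(k=10, group_size=100, universe=10000, list_size=200))
```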
68
Can we use more of our data than Threshold
Enrichment (that only uses the top of our gene
list)?
69
  • Beyond threshold enrichment

70
Functional Gene Subgroups within An Experiment
71
Statistics for Analysis of Differential
Expression of Gene Subgroups
Is THIS
Different from THIS?
72
Over-Expression of a Group of Functionally
Related Genes
p < 7.42e-08
T statistic
73
Is THIS
Different from THIS?
Conceptually Distinct from Threshold Enrichment
and the Hypergeometric test!
Statistical Tests:
  • χ²
  • Kolmogorov-Smirnov
  • Product of Probabilities
  • GSEA
  • PAGE
  • geneSetTest (Wilcoxon rank sum)
74
χ²
χ² is the sum of D values over the histogram bins,
where D is computed in each bin from the observed
count (O) and the expected count (E)
All Genes
Subset of Interest
Kolmogorov-Smirnov
All Genes
Subset of Interest
76
Product of Individual Probabilities
All Genes
Subset of Interest
77
What shape/type of distributions would each of
these tests be sensitive to?
Statistics from gene subgroup
All statistics
78
Gene Set Enrichment Analysis (GSEA)
Subramanian et al, 2005 PNAS
79
Gene Set Enrichment Analysis (GSEA)
80
Gene Set Enrichment Analysis (GSEA)
81
Gene Set Enrichment Analysis (GSEA)
82
Gene Set Enrichment Analysis (GSEA)
83
Gene Set Enrichment Analysis (GSEA)
84
Parametric Analysis of Gene Set Enrichment (PAGE)
Kim et al, 2005 BMC Bioinformatics
85
Parametric Analysis of Gene Set Enrichment (PAGE)
86
(No Transcript)
87
$Z = \frac{S_m - \mu}{\sigma / m^{0.5}}$
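
The PAGE Z-score compares the mean statistic of a gene set (Sm, for a set of m genes) with the mean and standard deviation of the statistics for all genes. A minimal sketch follows, in Python rather than R. The function name, the use of the population standard deviation, and the example values are assumptions for illustration.

```python
import statistics
from math import sqrt
from statistics import NormalDist

def page_z(all_stats, set_stats):
    """Z = (Sm - mu) / (sigma / sqrt(m)) for one gene set."""
    mu = statistics.mean(all_stats)
    sigma = statistics.pstdev(all_stats)   # assumed: population SD of all genes
    m = len(set_stats)
    return (statistics.mean(set_stats) - mu) / (sigma / sqrt(m))

# Hypothetical per-gene statistics (e.g. log fold changes) and one gene set.
all_stats = [-2.0, -1.0, 0.0, 1.0, 2.0, 0.5, -0.5, 0.0]
set_stats = [1.0, 2.0, 0.5]
z = page_z(all_stats, set_stats)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 3), round(p_two_sided, 4))
```

The normal reference distribution relies on the central limit theorem, so it is more trustworthy for larger gene sets.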
88
A simple method in Bioconductor:
geneSetTest (limma)
Test whether a set of genes is enriched for
differential expression.
Usage: geneSetTest(selected, statistics,
alternative="mixed", type="auto", ranks.only=TRUE,
nsim=10000)
The test statistic used for the gene-set-test is
the mean of the statistics in the set. If
ranks.only is TRUE then only the ranks of the
statistics are used. In this case the p-value is
obtained from a Wilcoxon test. If ranks.only is
FALSE, then the p-value is obtained by simulation
using nsim randomly selected sets of
genes.
Argument alternative: "mixed" and "either" ask
fundamentally different questions.
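
To make the rank-based version concrete, here is a stand-alone sketch of a one-sided Wilcoxon rank-sum test for a gene set, using the normal approximation. It is written in Python and is not limma's implementation: the function name is hypothetical, `selected` is taken to be a list of gene indices, and no tie or continuity correction is applied to the variance.

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_gene_set_test(statistics_all, selected):
    """One-sided p-value that the selected genes rank higher than the rest."""
    n = len(statistics_all)
    order = sorted(range(n), key=lambda i: statistics_all[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and statistics_all[order[end + 1]] == statistics_all[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1      # tied values share a rank
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    m = len(selected)
    w = sum(ranks[i] for i in selected)         # rank sum of the gene set
    mean_w = m * (n + 1) / 2
    sd_w = sqrt(m * (n - m) * (n + 1) / 12)
    return 1 - NormalDist().cdf((w - mean_w) / sd_w)

# Hypothetical statistics for 20 genes; the set holds the five largest.
stats = [float(v) for v in range(20)]
print(rank_sum_gene_set_test(stats, selected=[15, 16, 17, 18, 19]))
```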
89
Wilcoxon test
90
What shape/type of distributions would each of
these tests be sensitive to?
Statistics from gene subgroup
All statistics
91
Analysis of Gene Networks
92
Large Protein Interaction Network
Network Regulated in Sample 1
93
Large Protein Interaction Network
Network Regulated in Sample 2
Network Regulated in Sample 1
94
Large Protein Interaction Network
Network Regulated in Sample 3
Network Regulated in Sample 1
Network Regulated in Sample 2
95
Large Protein Interaction Network
Network of Interest
Network Regulated in Sample 1
Network Regulated in Sample 2
Network Regulated in Sample 3
96
Additional Notes on Experimental Design
97
Old-School Experimental Design: Randomization
98
Replicates in a mouse model
Dissection of tissue
Biological Replicates
RNA Isolation
Amplification
Technical Replicates
Probe labelling
Hybridization
99
Common question in experimental design
  • Should I pool mRNA samples across subjects in an
    effort to reduce the effect of biological
    variability (or cost)?

100
Two simple designs
  • The following two designs have roughly the same
    cost
  • 3 individuals, 3 arrays
  • Pool of three individuals, 3 technical replicates
  • To a statistician the second design seems
    obviously worse. But I found it hard to convince
    many biologists of this.
  • 3 pools of 3 animals on individual arrays?

101
Cons of Pooling Everything
  • You cannot measure within-class variation
  • Therefore, no population inference is possible
  • Mathematical averaging is an alternative way of
    reducing variance.
  • Pooling may have non-linear effects
  • You cannot take the log before you
    average: $E[\log(X+Y)] \neq E[\log(X)] + E[\log(Y)]$
  • You cannot detect outliers

If the measurements are independent and
identically distributed
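
The non-linearity point can be checked with a small numerical example: a pooled sample behaves like the log of the averaged raw signal, while separate arrays let you average the logs. The intensities below are made up, and treating a pool as an exact arithmetic mean of the individual signals is itself a simplifying assumption.

```python
from math import log2
from statistics import mean

# Hypothetical raw intensities for one gene in three individuals.
individuals = [100.0, 400.0, 1600.0]

log_of_pool = log2(mean(individuals))                # one pooled array
mean_of_logs = mean(log2(x) for x in individuals)    # three arrays, averaged after log

print(round(log_of_pool, 3), round(mean_of_logs, 3))
```

The pooled value is pulled toward the individual with the highest expression, and the spread between individuals is no longer observable.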
102
Cons specific to microarrays
  • Different genes have dramatically different
    biological variances.
  • Not measuring this variance will result in genes
    with larger biological variance having a better
    chance of being considered more important

103
Higher variance: larger fold change
We compute fold change for each gene (Y axis).
From 12 individuals we estimate gene-specific
variance (X axis).
If we pool we never see this variance.
104
Remember this?
105
Useful Books
  • Statistical analysis of gene expression
    microarray data - Speed
  • Analysis of gene expression data - Parmigiani
  • Bioinformatics and computational biology
    solutions using R - Irizarry