Title: Assessing Differential Expression in Mixtures of Cell Types
1Assessing Differential Expression in Mixtures of
Cell Types
2Statistical Testing in a Nutshell
Question / Hypothesis: Is the expression of gene g
in cell type 1 higher than in cell type 2?
Data: Expression of gene g in several
samples (absolute scale)
3What is a good Statistic?
[Figure: two panels labeled "good statistic" and "poor statistic". Each panel shows the probability density of the statistic for genes whose expression in class 1 is not higher than in class 2, and for genes whose expression in class 1 > class 2.]
4What is a good Statistic?
Sensitivity
(Proportion of truly higher expressed genes
that are found by the statistic)
Type I error
(Proportion of non-higher expressed genes that
are erroneously found)
[Figure: genes that are found by the statistic at a given threshold.]
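A minimal sketch of the two definitions above, assuming that a gene counts as "found" when its statistic exceeds the threshold. The function name, the scores, and the truth labels are made up for illustration.

```python
def sensitivity_and_type1(scores, truly_higher, threshold):
    """Return (sensitivity, type I error) for the rule 'found if score > threshold'."""
    found = [s > threshold for s in scores]
    # Truly higher expressed genes that were found.
    true_pos = sum(f and t for f, t in zip(found, truly_higher))
    # Non-higher expressed genes that were erroneously found.
    false_pos = sum(f and not t for f, t in zip(found, truly_higher))
    n_pos = sum(truly_higher)
    n_neg = len(truly_higher) - n_pos
    return true_pos / n_pos, false_pos / n_neg


# Illustrative (invented) statistics for six genes and their true status.
scores = [3.1, 2.4, 0.2, 1.9, -0.5, 2.2]
truly_higher = [True, True, False, True, False, False]

sens, type1 = sensitivity_and_type1(scores, truly_higher, threshold=2.0)
print(f"sensitivity = {sens:.3f}, type I error = {type1:.3f}")
```

Lowering the threshold finds more genes, so both quantities can only grow or stay the same.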
5What is a good Statistic?
[Figure: sensitivity and type I error of the genes that are found by the statistic at a given threshold.]
6What is a good Statistic?
[Figure: sensitivity plotted against type I error (both axes from 0 to 1) as the threshold varies.]
7What is a good Statistic?
[Figure: sensitivity plotted against type I error (both axes from 0 to 1) as the threshold varies.]
8Receiver Operating Characteristic (ROC)
[Figure: two ROC curves (sensitivity versus type I error, both axes from 0 to 1), one for a good statistic and one for a poor statistic.]
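One way to trace an ROC curve is to sweep the threshold from high to low and record (type I error, sensitivity) after each gene is added to the found set. The sketch below assumes no tied scores and uses invented example values. The area under the curve is computed with the trapezoid rule as a single-number summary, which the slides themselves do not mention.

```python
def roc_curve(scores, truly_higher):
    """Return ROC points (type I error, sensitivity), threshold swept from high to low."""
    n_pos = sum(truly_higher)
    n_neg = len(truly_higher) - n_pos
    ranked = sorted(zip(scores, truly_higher), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


truly_higher = [True, True, True, False, False, False]
good_scores = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]   # separates the two groups perfectly
poor_scores = [3.0, 0.5, 1.5, 2.5, 1.0, 2.0]   # groups are interleaved

auc_good = area_under_curve(roc_curve(good_scores, truly_higher))
auc_poor = area_under_curve(roc_curve(poor_scores, truly_higher))
print(auc_good, auc_poor)
```

A good statistic gives a curve that hugs the upper left corner (area near 1), while a statistic that orders the genes no better than chance stays near the diagonal (area near 0.5).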
9Statistics for the Detection of Differential Gene
Expression
Bad idea: Subtract the estimated sample means to obtain the difference d.
Problem: d is not scale invariant.
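The scale problem can be checked numerically. In the sketch below the expression values are invented, and the Welch t-statistic is used as one example of a standardized statistic; this choice is an assumption, not something the slide specifies. Multiplying all measurements by a constant multiplies the mean difference by the same constant and leaves the t-statistic unchanged.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference(a, b):
    """Difference of the sample means (the 'bad idea' statistic)."""
    return mean(a) - mean(b)


def welch_t(a, b):
    """Mean difference divided by its estimated standard error."""
    return mean_difference(a, b) / sqrt(variance(a) / len(a) + variance(b) / len(b))


class1 = [10.2, 11.1, 9.8, 10.6]   # invented expression values
class2 = [8.9, 9.4, 9.1, 8.7]

scale = 100.0                       # e.g. a change of measurement units
scaled1 = [scale * v for v in class1]
scaled2 = [scale * v for v in class2]

d_raw, d_scaled = mean_difference(class1, class2), mean_difference(scaled1, scaled2)
t_raw, t_scaled = welch_t(class1, class2), welch_t(scaled1, scaled2)
print(d_raw, d_scaled)   # differs by the factor `scale`
print(t_raw, t_scaled)   # agrees up to rounding
```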
10Problems in Mixed Cell Samples
Varying cell proportions lead to varying
observations
Examples
- Tumor biopsies with varying proportions of tumor
tissue in each sample. An estimate of the tumor
tissue proportion is provided by the pathologist.
- RNAi experiments in which transfection efficiency
(resp. knockdown efficiency) is substantially
below 100%. The proportion of cells in which RNAi
works is estimated via TaqMan analysis of the
target RNA.
11Problems in Mixed Cell Samples
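Let S be a collection of samples. For each sample s in S, let x_s be the
relative proportion of cells of type 1 in s.
Then, with $\mu_1$ and $\mu_2$ denoting the expression of gene g in pure cell types 1 and 2, the expected expression in sample s is
$y_s = x_s \mu_1 + (1 - x_s) \mu_2 = a + d x_s$, with intercept $a = \mu_2$ and slope $d = \mu_1 - \mu_2$.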
12Connections to Linear Regression
Idea: Observe that d can be estimated as the
slope of the linear fit of the data.
Though the resulting statistic is formally identical to the t-statistic, the
difference is in the calculation of d.
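As an illustration of the regression view, the sketch below fits an ordinary least-squares line of expression against the proportion of type 1 cells and divides the slope by its standard error. The data and the function name are invented, and the plain least-squares standard error is an assumption: the exact form of the statistic on the slide is not recoverable.

```python
from math import sqrt


def slope_t_statistic(x, y):
    """Least-squares fit y ~ a + d*x; return (d, a, d / standard error of d)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    d = sxy / sxx
    a = y_bar - d * x_bar
    rss = sum((yi - a - d * xi) ** 2 for xi, yi in zip(x, y))
    se_d = sqrt(rss / (n - 2) / sxx)
    return d, a, d / se_d


# Invented example: proportions of cell type 1 and measured expression of one gene.
proportions = [0.1, 0.3, 0.5, 0.7, 0.9]
expression = [5.25, 5.55, 6.05, 6.35, 6.85]

d_hat, a_hat, t_stat = slope_t_statistic(proportions, expression)
print(d_hat, a_hat, t_stat)
```

Under a linear mixing model the intercept estimates the expression at proportion 0 (pure cell type 2), and intercept plus slope estimates the expression at proportion 1 (pure cell type 1).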
13Ingredients for the Calculation of d: Shrinkage
estimation of d using penalized regression
The standard regression estimate for d is the
minimizer of the quadratic loss
$\sum_{s \in S} \left( y_s - (a + d x_s) \right)^2$,
where $y_s$ is the measurement and $a + d x_s$ is the linear fit with intercept $a$.
A way to make linear regression more robust is to
add a linear penalty term for d (Lasso,
R. Tibshirani 1996):
$\sum_{s \in S} \left( y_s - (a + d x_s) \right)^2 + \lambda |d|$
Here, λ is the so-called shrinkage parameter. It
is comparable to s0 in the SAM statistic and
avoids overfitting.
How do we choose λ?
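For a single slope with an unpenalized intercept, a Lasso-type estimate has a closed form. The sketch assumes the loss is the sum of squared residuals plus `lam * |d|`, where `lam` is the shrinkage parameter. After centering x and y, the minimizer is the soft-thresholded cross product divided by the sum of squares of x. The function names and data are illustrative.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_slope(x, y, lam):
    """Minimize sum_s (y_s - a - d*x_s)^2 + lam*|d| over (a, d); return (a, d)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Setting the (sub)derivative in d to zero gives 2*d*sxx = 2*sxy - lam*sign(d).
    d = soft_threshold(sxy, lam / 2.0) / sxx
    a = y_bar - d * x_bar
    return a, d


proportions = [0.1, 0.3, 0.5, 0.7, 0.9]
expression = [5.25, 5.55, 6.05, 6.35, 6.85]

_, d_unpenalized = lasso_slope(proportions, expression, lam=0.0)
_, d_shrunk = lasso_slope(proportions, expression, lam=0.4)
_, d_zero = lasso_slope(proportions, expression, lam=2.0)
print(d_unpenalized, d_shrunk, d_zero)
```

With `lam = 0` this reduces to the ordinary least-squares slope. Larger values pull the slope toward zero and eventually set it to exactly zero, which is how the penalty guards against overfitting weak effects.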
14Ingredients for the Calculation of d: Selection
of the shrinkage parameter λ
Idea: Taking a Bayesian view, d can be derived as
the mode of a posterior distribution.
15Ingredients for the Calculation of d: Selection
of the shrinkage parameter λ
If we assume that the entries in d follow a
Laplace(w, 0) distribution, then the shrinkage
parameter λ can be derived from the shape
parameter w of this distribution as λ = 1/(2w).
Apply a Kolmogorov-Smirnov test to justify
this assumption and fit the shape parameter w.
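A rough sketch of this step using only the Python standard library. It assumes that w is the scale of a zero-centered Laplace density exp(-|d|/w) / (2w); the slide does not spell out its parametrization. The slopes are simulated rather than real data. The scale is fitted by maximum likelihood (the mean absolute value), and the Kolmogorov-Smirnov distance is computed by hand.

```python
import random
from math import exp, sqrt


def fit_laplace_scale(values):
    """Maximum-likelihood scale of a Laplace distribution with location fixed at 0."""
    return sum(abs(v) for v in values) / len(values)


def laplace_cdf(v, w):
    """CDF of the Laplace distribution with location 0 and scale w."""
    return 0.5 * exp(v / w) if v < 0 else 1.0 - 0.5 * exp(-v / w)


def ks_statistic(values, w):
    """Largest gap between the empirical CDF and the Laplace(w, 0) CDF."""
    ordered = sorted(values)
    n = len(ordered)
    gap = 0.0
    for i, v in enumerate(ordered):
        f = laplace_cdf(v, w)
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


rng = random.Random(0)
true_w = 0.5
# The difference of two independent exponentials with mean w is Laplace with scale w.
slopes = [rng.expovariate(1 / true_w) - rng.expovariate(1 / true_w)
          for _ in range(2000)]

w_hat = fit_laplace_scale(slopes)
lam = 1.0 / (2.0 * w_hat)               # relation between lambda and w as stated on the slide
ks = ks_statistic(slopes, w_hat)
critical = 1.36 / sqrt(len(slopes))     # asymptotic 5% critical value for a fully specified null
print(w_hat, lam, ks, critical)
```

Because w is estimated from the same data, the standard critical value is only approximate, so the comparison should be read as a rough plausibility check of the Laplace assumption.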
16Ingredients for the Calculation of d: Adjustment
to heteroscedasticity of the measurements
Variance depends upon expression intensity.
Use the error model proposed in vsn, and use vsn to
estimate the individual variances.
Perform a weighted penalized linear regression
instead of a simple linear regression for d.
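A weighted variant of the penalized slope estimate can be sketched as follows. It assumes that the regression weights are inverse variances, which is the usual choice in weighted least squares but is not stated explicitly on the slide. The variances here are invented rather than estimated by vsn.

```python
def weighted_lasso_slope(x, y, weights, lam):
    """Minimize sum_s w_s*(y_s - a - d*x_s)^2 + lam*|d| over (a, d); return (a, d)."""
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    # Soft-threshold the weighted cross product by lam / 2.
    magnitude = max(abs(sxy) - lam / 2.0, 0.0)
    d = (magnitude if sxy > 0 else -magnitude) / sxx
    a = y_bar - d * x_bar
    return a, d


proportions = [0.0, 0.5, 1.0]
expression = [1.0, 2.0, 9.0]            # the last measurement is very noisy
variances = [1.0, 1.0, 100.0]           # invented per-measurement variances
inverse_var = [1.0 / v for v in variances]

_, d_equal = weighted_lasso_slope(proportions, expression, [1.0, 1.0, 1.0], lam=0.0)
_, d_weighted = weighted_lasso_slope(proportions, expression, inverse_var, lam=0.0)
print(d_equal, d_weighted)
```

Down-weighting the high-variance measurement keeps it from dominating the fit, so the weighted slope stays close to what the two reliable measurements suggest.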
17Ingredients for the Calculation of d: The
Two-Component Model
(Taken from W. Huber)
measured intensity = offset + gain × true abundance
A robust fitting method for the estimation of the
parameters a_i, b_i, s_1, s_2 has been developed by
W. Huber and A. v. Heydebreck. It has been
implemented in the R package vsn.
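To see why such a model produces intensity-dependent variance, the simulation below draws measurements as offset + gain * abundance * exp(eta) + eps, with normally distributed eta and eps. This additive-multiplicative form and all parameter names and values are assumptions made for illustration. The code does not call or reproduce the vsn package or its fitting procedure.

```python
import random
from math import exp
from statistics import variance


def simulate_intensity(abundance, offset, gain, sd_mult, sd_add, rng):
    """One simulated measurement with multiplicative and additive noise."""
    eta = rng.gauss(0.0, sd_mult)   # multiplicative error on the log scale
    eps = rng.gauss(0.0, sd_add)    # additive background error
    return offset + gain * abundance * exp(eta) + eps


rng = random.Random(1)
params = dict(offset=50.0, gain=2.0, sd_mult=0.2, sd_add=20.0)

variance_by_abundance = {}
for abundance in (10.0, 100.0, 1000.0):
    draws = [simulate_intensity(abundance, rng=rng, **params) for _ in range(2000)]
    variance_by_abundance[abundance] = variance(draws)

for abundance, var in variance_by_abundance.items():
    print(f"abundance {abundance:7.1f}: empirical variance {var:12.1f}")
```

At low abundance the additive term dominates and the variance is roughly constant. At high abundance the multiplicative term dominates and the variance grows with intensity, which is the reason for weighting the regression.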
18The final algorithm
- Estimate the empirical distribution of the
entries of d by a simple linear regression
- Fit a Laplace distribution to obtain the
shrinkage parameter λ
- Estimate the individual measurement variances by
the vsn procedure, determine regression weights
w_s
- For each gene, calculate the t-statistic
19Validation by Simulation (1)
- Data generation
- Fix all gene expression values in class 1 to one
value
- Alter half of the genes in class 2 by some
constant value
- Add normally distributed noise.
20Validation by Simulation (2)
- Data generation
- Draw the expression values of class 1 from a log-normal distribution
- Draw a fold change vector f
- Calculate the expression values of class 2 as the product of the class 1 values with the
fold change vector
- Generate 8 samples with mixture proportions
evenly distributed in [0,1]
- Add normally distributed noise
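The data generation steps above can be sketched as follows. Several details are assumptions, because the slide's formulas are not recoverable: the log-normal parameters, the distribution of the fold changes (log-normal around 1 here), the noise level, the reading of "evenly distributed" as evenly spaced, and the linear mixing of the two classes by proportion.

```python
import random


def simulate_mixture_data(n_genes, n_samples, noise_sd, rng):
    """Return (proportions, class1, class2, data) with data[gene][sample]."""
    # Mixture proportions of class 1, evenly spaced from 0 to 1.
    proportions = [i / (n_samples - 1) for i in range(n_samples)]
    # Expression in pure class 1 (hypothetical log-normal parameters).
    class1 = [rng.lognormvariate(5.0, 1.0) for _ in range(n_genes)]
    # Fold changes; the distribution used on the slide is unknown.
    fold_change = [rng.lognormvariate(0.0, 0.5) for _ in range(n_genes)]
    class2 = [m * f for m, f in zip(class1, fold_change)]
    # Linear mixing of the two classes plus normally distributed noise.
    data = [[x * m1 + (1.0 - x) * m2 + rng.gauss(0.0, noise_sd) for x in proportions]
            for m1, m2 in zip(class1, class2)]
    return proportions, class1, class2, data


rng = random.Random(42)
proportions, class1, class2, data = simulate_mixture_data(
    n_genes=1000, n_samples=8, noise_sd=10.0, rng=rng)
print(proportions)
print(len(data), len(data[0]))
```

Genes with `class1[g] > class2[g]` are the truly higher expressed genes against which a statistic can be scored.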
21Validation by Simulation (2)
[Figure: two panels comparing the statistics labeled t and SAM. ROC curve: sensitivity versus type I error. Predictive power: positive predictive value (PPV) versus number of found genes.]
PPV: fraction of truly higher expressed genes among the found genes.
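The positive predictive value as a function of the number of found genes can be computed by ranking the genes by their statistic and counting the true positives among the top k. The scores and labels below are invented, and ties are assumed not to occur.

```python
def ppv_by_number_found(scores, truly_higher):
    """PPV among the k top-scoring genes, for k = 1 .. number of genes."""
    ranked = sorted(zip(scores, truly_higher), key=lambda pair: -pair[0])
    ppv = []
    true_pos = 0
    for k, (_, is_pos) in enumerate(ranked, start=1):
        true_pos += 1 if is_pos else 0
        ppv.append(true_pos / k)
    return ppv


scores = [3.1, 2.4, 0.2, 1.9, -0.5, 2.2]
truly_higher = [True, True, False, True, False, False]

ppv = ppv_by_number_found(scores, truly_higher)
for k, value in enumerate(ppv, start=1):
    print(f"top {k} genes: PPV = {value:.3f}")
```

Unlike the ROC curve, this view depends on how many truly higher expressed genes exist in the data, since the last value always equals their overall fraction.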
22Validation by simulated data (3)
- Generate 16 mixture samples as in example (2)
- Calculate the statistics for 4 configurations
23Outlook
- Prove superiority over the method proposed by Ghosh
- Complete the paper
- Write an R software package
- Apply the methodology to RNAi data
Thanks to ...
- Tim Beissbarth for data acquisition and
preprocessing
- Andreas Buness, Markus Ruschhaupt for helpful
discussions
- Gordon Smyth, Dileepa Diyagama, Andrew Holloway
from the WEHI (Melbourne) for the data
- Annemarie Poustka, Holger Sültmann