Title: The Imputation of Missing Values in cDNA Microarray Data, and its Effect on the Significance Analysis of Differential Expression
The Imputation of Missing Values in cDNA Microarray Data, and its Effect on the Significance Analysis of Differential Expression
- Rebecka Jornsten
- Department of Statistics, Rutgers University
- http://www.stat.rutgers.edu/rebecka
- Cambridge, November 9-12 2005
- Joint work with
- Dr. Ming Ouyang, Hui Wang, and Prof. Bill Welsh
(UMDNJ)
- Outline
- SECTION 1: DNA Microarray Data Structure
- SECTION 2: Missing Data Imputation Methods
- SECTION 3: LinCmb - Adaptive Missing Data Imputation
- SECTION 4: LinCmb - Local vs. Global Imputation
- SECTION 5: RMSE Comparisons
- SECTION 6: Significance Analysis
- SECTION 7: Conclusions and Future Work
SECTION 1
- Missing data
- - as much as 10% on a single array!
- manually flagged, or flagged by the image processing routine - smears, high background, signal below detection level, ...
[Figure: ideal vs. actual microarray images - smears/dust, low-intensity spots, overlapping spots, high background]
Gene Expression Data
SECTION 1
- Gene expression data on p genes for n samples
mRNA samples = different experiments/arrays

        sample1  sample2  sample3  sample4  sample5
gene1     0.46     0.30     0.80     1.51     0.90   ...
gene2    -0.10     0.49     0.24     0.06     0.46   ...
gene3       NA     0.74     0.04     0.10     0.20   ...
gene4    -0.45    -1.03    -0.79    -0.56    -0.32   ...
gene5    -0.06     1.06     1.35     1.09    -1.09   ...
(rows = genes, columns = samples)
- High-level analysis tasks:
- - Significance analysis
- - Clustering/Classification
- How to deal with missing values?
- Missing value imputation methods are usually
compared in terms of RMSE (root mean square
error), not in terms of their effect on
high-level analysis.
SECTION 2
- Missing value imputation methods
- Impute with 0
- Impute with mean(gene) - ROWimpute
- Gaussian mixtures (GMC)
- K-nearest-neighbors (kNN)
- Transform-based methods (SVD, BPCA)
- And many, many more...
- So... how do we decide which method to use?
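Two of the simplest library members, ROWimpute and kNN, can be sketched in a few lines of numpy. This is an illustration under our own naming (`row_impute`, `knn_impute`), not the authors' implementation; in particular this kNN variant assumes some genes have no missing values at all:

```python
import numpy as np

def row_impute(M):
    """ROWimpute: replace each missing value by its gene's (row's) mean.
    M is a genes-by-samples array with np.nan marking missing entries."""
    M = M.copy()
    row_means = np.nanmean(M, axis=1)
    rows, cols = np.where(np.isnan(M))
    M[rows, cols] = row_means[rows]
    return M

def knn_impute(M, k=3):
    """kNN impute: fill a gene's missing entry with the average of that
    column over the k complete genes nearest in Euclidean distance on the
    observed columns.  Assumes at least k rows of M are fully observed."""
    M = M.copy()
    complete = M[~np.isnan(M).any(axis=1)]           # fully observed genes
    for i in np.where(np.isnan(M).any(axis=1))[0]:
        obs = ~np.isnan(M[i])                        # observed columns of gene i
        d = np.sqrt(((complete[:, obs] - M[i, obs]) ** 2).sum(axis=1))
        nbrs = complete[np.argsort(d)[:k]]           # k nearest complete genes
        M[i, ~obs] = nbrs[:, ~obs].mean(axis=0)
    return M
```

Both return a completed copy of the matrix, which is the form LinCmb consumes below.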
SECTION 2
- Which method?
- Is there a best method for any scenario?
- Which criteria should we use to compare imputation methods?
- Comparisons are usually based on RMSE.
- What about the effect on the high-level analyses (testing, clustering, ...)?
SECTION 3
- LinCmb adaptive imputation
- Which method to use? Let the data decide!
- We combine a library of imputation methods (ROW, kNN, SVD, BPCA, GMC (1-5 mixtures))
- How? We assign a weight to each method - Model Stacking.
- The weights are obtained through training on the data at hand.
SECTION 3
- LinCmb adaptive imputation
- Denote by M the gene expression data, with M_true the true values of the missing entries.
- Denote by R, K, S, B, G1, G2, ..., G5 the imputations of the missing entries by methods ROW, kNN, SVD, BPCA and GMC.
- We obtain model weights r, k, s, b, g1, g2, ..., g5 that minimize
- || M_true - (rR + kK + sS + bB + g1G1 + ... + g5G5) ||^2
- s.t. (r, k, s, b, g1, ..., g5) >= 0, r + k + s + b + g1 + ... + g5 = 1
- But we don't know M_true!
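The constrained least-squares fit can be sketched with a projected-gradient solver; the talk does not specify the optimizer, so both the solver choice and the names `lincmb_weights` / `project_simplex` are ours. The columns of X are the constituent methods' imputations of the training entries, y the corresponding known values:

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection of w onto the simplex {w >= 0, sum(w) = 1}."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(w - theta, 0.0)

def lincmb_weights(y, X, n_iter=2000):
    """Minimize ||y - X w||^2 subject to w >= 0, sum(w) = 1, by projected
    gradient descent.  y: known values of the training entries; X: one
    column per constituent method's imputation of those entries."""
    m = X.shape[1]
    w = np.full(m, 1.0 / m)                      # start at uniform weights
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1/L, L = Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = project_simplex(w - step * grad)
    return w
```

Because the weights live on the simplex, the combined imputation is always a convex combination of the library's imputations, so it never extrapolates beyond them.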
SECTION 3
- LinCmb adaptive imputation
- The proportion of missing values in the data matrix is p.
- We first use kNN to impute the missing values M_true.
- We generate fake missing values M_fake (proportion p/(1-p)). Their true values are known to LinCmb. If an M_fake entry coincides with an M_true entry, it is not treated as missing.
- We can now train the weights using M_fake.
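The fake-missing mask can be sketched as follows (our naming; the rate p/(1-p) over the observed entries makes the expected number of fake missing values match the number of true ones, as on the slide):

```python
import numpy as np

def make_fake_missing(M, rng=None):
    """Return a boolean mask of 'fake missing' entries drawn only from the
    OBSERVED entries of M (np.nan marks the truly missing ones).  With a
    true missing proportion p, observed entries are masked at rate p/(1-p)."""
    rng = np.random.default_rng(rng)
    true_missing = np.isnan(M)
    p = true_missing.mean()
    fake = (rng.random(M.shape) < p / (1.0 - p)) & ~true_missing
    return fake
```

Training then deletes the masked entries, imputes them with every library method, and fits the weights against their known values.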
SECTION 3
- LinCmb adaptive imputation
- Of course, there is instability in estimating the weights for LinCmb.
- We repeat the training procedure T times (e.g. T = 30-200), obtaining several sets of weights r, k, s, b, g1, ..., g5.
- To impute M_true, we combine the T models via a simple model averaging procedure.
- (We also tried median models, but this did not significantly alter/improve the results.)
SECTION 4
An illustration on gene expression data
- We illustrate LinCmb on a Liver/Liver-cancer data set from the Stanford Microarray Database (Gollub et al. 2003).
- The original data set consisted of 207 arrays, 23000 probes. After pre-screening we obtain a data set of 20 liver and 20 cancer samples, for 6511 probes with no missing values.
- Simulation setup
- Using the liver data set we simulate 1%, 4%, or 7% missing values, at random (MAR).
- We repeat this 200 times. Each iteration we compute the RMSE of LinCmb and its constituent methods.
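The evaluation criterion of the simulation is RMSE over the deleted entries only; a minimal helper (our naming):

```python
import numpy as np

def rmse_on_missing(M_true, M_imputed, missing_mask):
    """RMSE computed only over the entries that were deleted in the
    simulation and then imputed; observed entries do not contribute."""
    diff = M_true[missing_mask] - M_imputed[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Averaging this quantity over the 200 repetitions gives the per-method RMSE curves compared in the next section.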
SECTION 4
[Figure: Results - model weights, liver samples (global vs. local methods)]
SECTION 4
[Figure: Results - model weights, liver cancer samples (global vs. local methods)]
SECTION 4
- Model weights - Local vs. Global
- When there are few missing values, local methods perform quite well, and LinCmb assigns large weights to these methods.
- When there are a lot of missing values, local information is not available, and LinCmb assigns large weights to global methods.
- For the more heterogeneous cancer data, LinCmb assigns larger weights to the local methods.
SECTION 5
[Figure: RMSE(method) - RMSE(LinCmb) for liver data, at 1%, 4% and 7% missing]
[Figure: RMSE of LinCmb, liver cancer data, at 4% and 7% missing]
SECTION 5
- RMSE
- LinCmb is clearly the best method in terms of RMSE.
- SVD and ROWimpute perform poorly!
- Local and global methods are competitive if there are only a few missing values,
- but local method performance deteriorates when there are a lot of missing values.
- Among local methods, the popular kNN is not competitive.
SECTION 6
- Differential Expression
- Imputation methods are usually compared in terms of RMSE.
- Since LinCmb is trained to minimize the MSE, LinCmb is the winner in terms of RMSE.
- But what about the effect on high-level analysis? Shouldn't that be our focus?
- We decide to compare LinCmb to its constituent methods in terms of accuracy of gene selection.
SECTION 6
- Differential Expression
- We will focus on two types of tests:
- - the standard t-test
- - the regularized t-test of (Baldi and Long, 2001)
- We identify a set of significantly differentially expressed genes via the p-value adjustment method/FDR procedure (Benjamini-Hochberg, 1995).
- We fix the FDR level at 0.1 in this study (we have explored other cutoffs as well).
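The Benjamini-Hochberg step-up procedure used here is simple to state in code (a standard textbook sketch, not tied to this study's implementation):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """BH step-up at FDR level q: sort the m p-values, find the largest i
    with p_(i) <= i*q/m, and reject hypotheses 1..i.  Returns a boolean
    'significant' mask in the original order of pvals."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # i*q/m for i = 1..m
    below = p[order] <= thresh
    sig = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest i passing the bound
        sig[order[: k + 1]] = True                # step-up: reject all up to k
    return sig
```

Run on the complete data, the resulting significant set is the gold standard list GS of the next slide.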
SECTION 6
- Differential Expression
- On the original data set (no missing values) this selection produces a gold standard gene list GS.
- As before, we simulate missing data and impute using the various methods.
- We then compare the gene lists GS' produced using the imputed data to the list GS.
- How many genes are in GS' but not in GS?
- How many genes are in GS but not in GS'?
FPR (false positive rate) = |GS' \ GS| / |GS'|
FNR (false negative rate) = |GS \ GS'| / |GS|
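These two set-based rates translate directly into code (our helper name):

```python
def fpr_fnr(gold, selected):
    """FPR = |selected \ gold| / |selected|: fraction of the imputed-data
    list GS' that is not in the gold standard GS.
    FNR = |gold \ selected| / |gold|: fraction of GS missed by GS'."""
    gold, selected = set(gold), set(selected)
    fpr = len(selected - gold) / len(selected) if selected else 0.0
    fnr = len(gold - selected) / len(gold) if gold else 0.0
    return fpr, fnr
```

For example, a selected list sharing two of four gold genes plus one spurious gene has FPR = 1/3 and FNR = 1/2.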
SECTION 6
[Figure: Differential expression, standard t-test - false positive/false negative rates (axes: power, FPR)]
SECTION 6
[Figure: Differential expression, regularized t-test - false positive/false negative rates (axes: power, FPR)]
SECTION 6
- Differential Expression
- (FPR + FNR) with the regularized t-test < (FPR + FNR) with the standard t-test
- NONE (no imputation) has the lowest FPR, and the highest FNR.
- FPR increases drastically using ROW - it underestimates the variance.
- If a regularized test is used or few values are missing, NONE performs overall better than ROW!
- kNN is a little better than ROW in terms of FPR, but tends to have a slightly higher FNR.
- BPCA and LinCmb seem to control FPR and FNR.
SECTION 6
- Another Comparison
- It's difficult to compare FPR and FNR simultaneously.
- We decide to compare FPR at fixed FNR levels.
- We construct two gene lists from the imputed data:
- - GS100: the shortest gene ranking list that completely contains GS (FNR = 0)
- - GS95: the shortest gene ranking list that contains 95% of GS (FNR = 5%)
- We compute the false positive rate, FPR = |GS100 \ GS| / |GS100|. We compute FPR for GS95 in a similar fashion.
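The GS100/GS95 construction is a prefix search over the significance ranking. A sketch under the assumption that the ranking is ordered most-significant-first (`shortest_covering_list` is our name):

```python
import math

def shortest_covering_list(ranking, gold, coverage=1.0):
    """Return the shortest prefix of `ranking` (genes ordered by decreasing
    significance) containing the given fraction of the gold-standard set:
    coverage=1.0 gives GS100 (FNR = 0), coverage=0.95 gives GS95 (FNR = 5%)."""
    gold = set(gold)
    need = math.ceil(coverage * len(gold))   # gold genes the prefix must hold
    hits = 0
    for i, g in enumerate(ranking):
        if g in gold:
            hits += 1
            if hits >= need:
                return list(ranking[: i + 1])
    return list(ranking)                     # ranking exhausted: return all
```

The FPR at fixed FNR is then the fraction of the returned prefix that lies outside GS, e.g. |GS100 \ GS| / |GS100|.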
SECTION 6
[Figure: FPR for GS100, reg-t and std-t, at 1%, 4% and 7% missing]
SECTION 6
[Figure: FPR for GS95, reg-t and std-t, at 1%, 4% and 7% missing]
SECTION 6
- FPR, GS100 - 1% to 7% missing
- SVD and ROW perform OK in terms of FPR despite poor performance in terms of RMSE.
- Local methods are competitive when only 1% of values are missing, but performance deteriorates quickly.
- kNN performs poorly at 4-7% missing.
- FPR performance does not deteriorate as rapidly as RMSE.
- Methods are more comparable when reg-t is used.
- FPR, GS95 - 1% missing
- FPR at 5% FNR mimics the RMSE performance (SVD and ROW perform poorly), but the performance curve is flatter for FPR than RMSE.
- With 1% missing values, forgoing imputation is not a bad strategy.
SECTION 7
- Discussion
- LinCmb assigns weights to constituent methods: local methods are assigned large weights when local information is available.
- LinCmb's RMSE performance is better than any of the constituent methods.
- FPR performance indicates that high-level analysis is not as sensitive to imputation - but stay away from ROWimpute, SVDimpute and kNN!
- Better not to impute when few values are missing.
- Robust and global methods are in general better in terms of FPR.
SECTION 7
- Future work
- Where do we go from here?
- Current work
- We are extending this to deal with imputation differently within an array: for genes with many missing values we use a global approach, for genes with few missing values we use a local approach.
- Extension to test the Missing-at-Random assumption. We should treat observations that are not missing at random as censored (i.e. informative missingness).