Title: The Imputation of Missing Values in cDNA Microarray Data, and its Effect on the Significance Analysis of Differential Expression
The Imputation of Missing Values in cDNA Microarray Data, and its Effect on the Significance Analysis of Differential Expression
- Rebecka Jornsten
- Department of Statistics, Rutgers University
- http://www.stat.rutgers.edu/rebecka
- Cambridge, November 9-12 2005
- Joint work with
- Dr. Ming Ouyang, Hui Wang, and Prof. Bill Welsh
(UMDNJ)
- Outline
- SECTION 1: DNA Microarray Data Structure
- SECTION 2: Missing Data Imputation Methods
- SECTION 3: LinCmb - Adaptive Missing Data Imputation
- SECTION 4: LinCmb - Local vs. Global Imputation
- SECTION 5: RMSE Comparisons
- SECTION 6: Significance Analysis
- SECTION 7: Conclusions and Future Work
SECTION 1
- Missing data
- - as much as 10% on a single array!
- manually flagged, or flagged by the image processing routine - smears, high background, signal below detection level, ...
[Figure: ideal vs. actual microarray images - smears/dust, low-intensity spots, overlapping spots, high background]
Gene Expression Data
SECTION 1
- Gene expression data on p genes for n samples
mRNA samples = different experiments/arrays

        sample1  sample2  sample3  sample4  sample5
gene1     0.46     0.30     0.80     1.51     0.90   ...
gene2    -0.10     0.49     0.24     0.06     0.46   ...
gene3       NA     0.74     0.04     0.10     0.20   ...
gene4    -0.45    -1.03    -0.79    -0.56    -0.32   ...
gene5    -0.06     1.06     1.35     1.09    -1.09   ...
(rows = genes, columns = samples)
- High-level analysis tasks:
- - Significance analysis
- - Clustering/Classification
- How to deal with missing values?
- Missing value imputation methods are usually
compared in terms of RMSE (root mean square
error), not in terms of their effect on
high-level analysis.
SECTION 2
- Missing value imputation methods
- Impute with 0
- Impute with mean(gene) - ROWimpute
- Gaussian mixtures (GMC)
- K-nearest-neighbors (kNN)
- Transform-based methods (SVD, BPCA)
- And many, many more...
- So... how do we decide which method to use?
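Two of the simplest library members, ROWimpute and kNN, can be sketched in a few lines of numpy. This is an illustration under our own naming (`row_impute`, `knn_impute`), not the authors' implementation; in particular this kNN variant assumes some genes have no missing values at all:

```python
import numpy as np

def row_impute(M):
    """ROWimpute: replace each missing value by its gene's (row's) mean.
    M is a genes-by-samples array with np.nan marking missing entries."""
    M = M.copy()
    row_means = np.nanmean(M, axis=1)
    rows, cols = np.where(np.isnan(M))
    M[rows, cols] = row_means[rows]
    return M

def knn_impute(M, k=3):
    """kNN impute: fill a gene's missing entry with the average of that
    column over the k complete genes nearest in Euclidean distance on the
    observed columns.  Assumes at least k rows of M are fully observed."""
    M = M.copy()
    complete = M[~np.isnan(M).any(axis=1)]           # fully observed genes
    for i in np.where(np.isnan(M).any(axis=1))[0]:
        obs = ~np.isnan(M[i])                        # observed columns of gene i
        d = np.sqrt(((complete[:, obs] - M[i, obs]) ** 2).sum(axis=1))
        nbrs = complete[np.argsort(d)[:k]]           # k nearest complete genes
        M[i, ~obs] = nbrs[:, ~obs].mean(axis=0)
    return M
```

Both return a completed copy of the matrix, which is the form LinCmb consumes below.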
SECTION 2
- Which method?
- Is there a best method for any scenario?
- Which criteria should we use to compare imputation methods?
- Comparisons are usually based on RMSE.
- What about the effect on the high-level analyses (testing, clustering, ...)?
SECTION 3
- LinCmb adaptive imputation
- Which method to use? Let the data decide!
- We combine a library of imputation methods (ROW, kNN, SVD, BPCA, GMC (1-5 mixtures))
- How? We assign a weight to each method - Model Stacking.
- The weights are obtained through training on the data at hand.
SECTION 3
- LinCmb adaptive imputation
- Denote by M the gene expression data, with M_true the true values of the missing entries.
- Denote by R, K, S, B, G1, G2, ..., G5 the imputations of the missing entries by methods ROW, kNN, SVD, BPCA and GMC.
- We obtain model weights r, k, s, b, g1, g2, ..., g5 that minimize
- || M_true - (rR + kK + sS + bB + g1G1 + ... + g5G5) ||^2
- s.t. (r, k, s, b, g1, ..., g5) >= 0, r + k + s + b + g1 + ... + g5 = 1
- But we don't know M_true!
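The constrained least-squares fit can be sketched with a projected-gradient solver; the talk does not specify the optimizer, so both the solver choice and the names `lincmb_weights` / `project_simplex` are ours. The columns of X are the constituent methods' imputations of the training entries, y the corresponding known values:

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection of w onto the simplex {w >= 0, sum(w) = 1}."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(w - theta, 0.0)

def lincmb_weights(y, X, n_iter=2000):
    """Minimize ||y - X w||^2 subject to w >= 0, sum(w) = 1, by projected
    gradient descent.  y: known values of the training entries; X: one
    column per constituent method's imputation of those entries."""
    m = X.shape[1]
    w = np.full(m, 1.0 / m)                      # start at uniform weights
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1/L, L = Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = project_simplex(w - step * grad)
    return w
```

Because the weights live on the simplex, the combined imputation is always a convex combination of the library's imputations, so it never extrapolates beyond them.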
SECTION 3
- LinCmb adaptive imputation
- The proportion of missing values in the data matrix is p.
- We first use kNN to impute the missing values M_true.
- We generate fake missing values M_fake (proportion p/(1-p)). Their true values are known to LinCmb. If an M_fake entry coincides with an M_true entry, it is not treated as missing.
- We can now train the weights using M_fake.
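The fake-missing mask can be sketched as follows (our naming; the rate p/(1-p) over the observed entries makes the expected number of fake missing values match the number of true ones, as on the slide):

```python
import numpy as np

def make_fake_missing(M, rng=None):
    """Return a boolean mask of 'fake missing' entries drawn only from the
    OBSERVED entries of M (np.nan marks the truly missing ones).  With a
    true missing proportion p, observed entries are masked at rate p/(1-p)."""
    rng = np.random.default_rng(rng)
    true_missing = np.isnan(M)
    p = true_missing.mean()
    fake = (rng.random(M.shape) < p / (1.0 - p)) & ~true_missing
    return fake
```

Training then deletes the masked entries, imputes them with every library method, and fits the weights against their known values.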
SECTION 3
- LinCmb adaptive imputation
- Of course, there is instability in estimating the weights for LinCmb.
- We repeat the training procedure T times (e.g. T = 30-200), obtaining several sets of weights r, k, s, b, g1, ..., g5.
- To impute M_true, we combine the T models via a simple model averaging procedure.
- (We also tried median models, but this did not significantly alter/improve the results.)
SECTION 4
An illustration on gene expression data
- We illustrate LinCmb on a Liver/Liver-cancer data set from the Stanford Microarray Database (Gollub et al. 2003).
- The original data set consisted of 207 arrays, 23000 probes. After pre-screening we obtain a data set of 20 liver and 20 cancer samples, for 6511 probes with no missing values.
- Simulation setup
- Using the liver data set we simulate 1%, 4%, or 7% missing values, at random (MAR).
- We repeat this 200 times. Each iteration we compute the RMSE of LinCmb and its constituent methods.
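The evaluation criterion of the simulation is RMSE over the deleted entries only; a minimal helper (our naming):

```python
import numpy as np

def rmse_on_missing(M_true, M_imputed, missing_mask):
    """RMSE computed only over the entries that were deleted in the
    simulation and then imputed; observed entries do not contribute."""
    diff = M_true[missing_mask] - M_imputed[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Averaging this quantity over the 200 repetitions gives the per-method RMSE curves compared in the next section.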
SECTION 4
[Figure: Results - model weights, liver samples (global vs. local methods)]
SECTION 4
[Figure: Results - model weights, liver cancer samples (global vs. local methods)]
SECTION 4
- Model weights - Local vs. Global
- When there are few missing values, local methods perform quite well, and LinCmb assigns large weights to these methods.
- When there are a lot of missing values, local information is not available, and LinCmb assigns large weights to global methods.
- For the more heterogeneous cancer data, LinCmb assigns larger weights to the local methods.
SECTION 5
[Figure: RMSE(method) - RMSE(LinCmb) for liver data, at 1%, 4% and 7% missing]
[Figure: RMSE of LinCmb, liver cancer data, at 4% and 7% missing]
SECTION 5
- RMSE
- LinCmb is clearly the best method in terms of RMSE.
- SVD and ROWimpute perform poorly!
- Local and global methods are competitive if there are only a few missing values,
- but local method performance deteriorates when there are a lot of missing values.
- Among local methods, the popular kNN is not competitive.
SECTION 6
- Differential Expression
- Imputation methods are usually compared in terms of RMSE.
- Since LinCmb is trained to minimize the MSE, LinCmb is the winner in terms of RMSE.
- But what about the effect on high-level analysis? Shouldn't that be our focus?
- We decide to compare LinCmb to its constituent methods in terms of accuracy of gene selection.
SECTION 6
- Differential Expression
- We will focus on two types of tests:
- - the standard t-test
- - the regularized t-test of (Baldi and Long, 2001)
- We identify a set of significantly differentially expressed genes via the p-value adjustment method/FDR procedure (Benjamini-Hochberg, 1995).
- We fix the FDR level at 0.1 in this study (we have explored other cutoffs as well).
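The Benjamini-Hochberg step-up procedure used here is simple to state in code (a standard textbook sketch, not tied to this study's implementation):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """BH step-up at FDR level q: sort the m p-values, find the largest i
    with p_(i) <= i*q/m, and reject hypotheses 1..i.  Returns a boolean
    'significant' mask in the original order of pvals."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # i*q/m for i = 1..m
    below = p[order] <= thresh
    sig = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest i passing the bound
        sig[order[: k + 1]] = True                # step-up: reject all up to k
    return sig
```

Run on the complete data, the resulting significant set is the gold standard list GS of the next slide.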
SECTION 6
- Differential Expression
- On the original data set (no missing values) this selection produces a gold standard gene list GS.
- As before, we simulate missing data and impute using the various methods.
- We then compare the gene lists GS' produced using the imputed data to the list GS.
- How many genes are in GS' but not in GS?
- How many genes are in GS but not in GS'?
FPR (false positive rate) = |GS' \ GS| / |GS'|
FNR (false negative rate) = |GS \ GS'| / |GS|
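These two set-based rates translate directly into code (our helper name):

```python
def fpr_fnr(gold, selected):
    """FPR = |selected \ gold| / |selected|: fraction of the imputed-data
    list GS' that is not in the gold standard GS.
    FNR = |gold \ selected| / |gold|: fraction of GS missed by GS'."""
    gold, selected = set(gold), set(selected)
    fpr = len(selected - gold) / len(selected) if selected else 0.0
    fnr = len(gold - selected) / len(gold) if gold else 0.0
    return fpr, fnr
```

For example, a selected list sharing two of four gold genes plus one spurious gene has FPR = 1/3 and FNR = 1/2.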
SECTION 6
[Figure: Differential expression, standard t-test - false positive/false negative rates (axes: power, FPR)]
SECTION 6
[Figure: Differential expression, regularized t-test - false positive/false negative rates (axes: power, FPR)]
SECTION 6
- Differential Expression
- (FPR + FNR) with the regularized t-test < (FPR + FNR) with the standard t-test
- NONE (no imputation) has the lowest FPR, and the highest FNR.
- FPR increases drastically using ROW - it underestimates the variance.
- If a regularized test is used or few values are missing, NONE performs overall better than ROW!
- kNN is a little better than ROW in terms of FPR, but tends to have a slightly higher FNR.
- BPCA and LinCmb seem to control FPR and FNR.
SECTION 6
- Another Comparison
- It's difficult to compare FPR and FNR simultaneously.
- We decide to compare FPR at fixed FNR levels.
- We construct two gene lists from the imputed data:
- - GS100: the shortest gene ranking list that completely contains GS (FNR = 0)
- - GS95: the shortest gene ranking list that contains 95% of GS (FNR = 5%)
- We compute the false positive rate, FPR = |GS100 \ GS| / |GS100|. We compute FPR for GS95 in a similar fashion.
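The GS100/GS95 construction is a prefix search over the significance ranking. A sketch under the assumption that the ranking is ordered most-significant-first (`shortest_covering_list` is our name):

```python
import math

def shortest_covering_list(ranking, gold, coverage=1.0):
    """Return the shortest prefix of `ranking` (genes ordered by decreasing
    significance) containing the given fraction of the gold-standard set:
    coverage=1.0 gives GS100 (FNR = 0), coverage=0.95 gives GS95 (FNR = 5%)."""
    gold = set(gold)
    need = math.ceil(coverage * len(gold))   # gold genes the prefix must hold
    hits = 0
    for i, g in enumerate(ranking):
        if g in gold:
            hits += 1
            if hits >= need:
                return list(ranking[: i + 1])
    return list(ranking)                     # ranking exhausted: return all
```

The FPR at fixed FNR is then the fraction of the returned prefix that lies outside GS, e.g. |GS100 \ GS| / |GS100|.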
SECTION 6
[Figure: FPR for GS100, reg-t and std-t, at 1%, 4% and 7% missing]
SECTION 6
[Figure: FPR for GS95, reg-t and std-t, at 1%, 4% and 7% missing]
SECTION 6
- FPR, GS100 - 1% to 7% missing
- SVD and ROW perform OK in terms of FPR despite poor performance in terms of RMSE.
- Local methods are competitive when only 1% of values are missing, but performance deteriorates quickly.
- kNN performs poorly at 4-7% missing.
- FPR performance does not deteriorate as rapidly as RMSE.
- Methods are more comparable when reg-t is used.
- FPR, GS95 - 1% missing
- FPR at 5% FNR mimics the RMSE performance (SVD and ROW perform poorly), but the performance curve is flatter for FPR than RMSE.
- With 1% missing values, forgoing imputation is not a bad strategy.
SECTION 7
- Discussion
- LinCmb assigns weights to constituent methods: local methods are assigned large weights when local information is available.
- LinCmb's RMSE performance is better than any of the constituent methods.
- FPR performance indicates that high-level analysis is not as sensitive to imputation - but stay away from ROWimpute, SVDimpute and kNN!
- Better not to impute when few values are missing.
- Robust and global methods are in general better in terms of FPR.
SECTION 7
- Future work
- Where do we go from here?
- Current work
- We are extending this to deal with imputation differently within an array: for genes with many missing values we use a global approach, for genes with few missing values we use a local approach.
- Extension to test the Missing-at-Random assumption. We should treat observations that are not missing at random as censored (i.e. informative missingness).