The Imputation of Missing Values in cDNA Microarray Data, and its Effect on the Significance Analysis of Differential Expression (PowerPoint PPT presentation)
1
The Imputation of Missing Values in cDNA
Microarray Data, and its Effect on the
Significance Analysis of Differential Expression.
  • Rebecka Jornsten
  • Department of Statistics, Rutgers University
  • http://www.stat.rutgers.edu/~rebecka
  • Cambridge, November 9-12, 2005
  • Joint work with
  • Dr. Ming Ouyang, Hui Wang, and Prof. Bill Welsh
    (UMDNJ)

2
  • Outline
  • SECTION 1: DNA Microarrays - Data Structure
  • SECTION 2: Missing Data Imputation Methods
  • SECTION 3: LinCmb - Adaptive Missing Data
    Imputation
  • SECTION 4: LinCmb - Local vs. Global Imputation
  • SECTION 5: RMSE Comparisons
  • SECTION 6: Significance Analysis
  • SECTION 7: Conclusion and Future Work

3
SECTION 1
  • Missing data
  • - as much as 10% on a single array!
  • - manually flagged, or flagged by the image
    processing routine
  • - smears, high background, signal below detection
    level, ...

[Figure: an ideal microarray image vs. actual images, showing smears, dust, low-intensity spots, overlapping spots, and high background]
4
Gene Expression Data
SECTION 1
  • Gene expression data on p genes for n samples

mRNA samples = different experiments/arrays

Genes (rows) x samples (columns):

        sample1  sample2  sample3  sample4  sample5
  1       0.46     0.30     0.80     1.51     0.90   ...
  2      -0.10     0.49     0.24     0.06     0.46   ...
  3        NA      0.74     0.04     0.10     0.20   ...
  4      -0.45    -1.03    -0.79    -0.56    -0.32   ...
  5      -0.06     1.06     1.35     1.09    -1.09   ...
  • High-level analysis tasks: significance
    analysis, clustering/classification
  • How to deal with missing values?
    Ignore, or impute.
  • Missing value imputation methods are usually
    compared in terms of RMSE (root mean square
    error), not in terms of their effect on
    high-level analysis.

5
SECTION 2
  • Missing value imputation methods
  • Impute with 0
  • Impute with mean(gene): ROWimpute
  • Gaussian mixtures (GMC)
  • K-nearest-neighbors (kNN)
  • Transform based methods (SVD, BPCA)
  • And many, many more
  • So... how do we decide which method to use?
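Two of the simplest constituent methods, ROWimpute and kNN, can be sketched as follows. This is a minimal illustration (NaN marks missing values; function names and the Euclidean-distance neighbor choice are ours, not the authors' implementation):

```python
import numpy as np

def row_impute(M):
    """ROWimpute: replace each gene's missing values with that gene's row mean."""
    M = M.copy()
    row_means = np.nanmean(M, axis=1)
    idx = np.where(np.isnan(M))
    M[idx] = row_means[idx[0]]
    return M

def knn_impute(M, k=5):
    """kNN imputation sketch: for each gene with missing values, average the
    k most similar genes (Euclidean distance on a row-mean-completed copy)."""
    M = M.copy()
    filled = row_impute(M)                 # distances computed on a completed copy
    for i in np.where(np.isnan(M).any(axis=1))[0]:
        d = np.sqrt(((filled - filled[i]) ** 2).sum(axis=1))
        d[i] = np.inf                      # exclude the gene itself
        neighbors = np.argsort(d)[:k]
        miss = np.isnan(M[i])
        M[i, miss] = filled[neighbors][:, miss].mean(axis=0)
    return M
```

Here M is assumed to be a genes-by-arrays matrix of log-ratios, as in the data-structure slide above.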

6
SECTION 2
  • Which method?
  • Is there a best method for any scenario?
  • Which criteria should we use to compare
    imputation methods?
  • Comparisons are usually based on RMSE.
  • What about the effect on the high-level analyses
    (testing, clustering, ...)?

7
SECTION 3
  • LinCmb adaptive imputation
  • Which method to use? Let the data decide!
  • We combine a library of imputation methods (ROW,
    kNN, SVD, BPCA, GMC (1-5 mixtures))
  • How? We assign a weight to each method:
    model stacking.
  • The weights are obtained through training on the
    data at hand.

8
SECTION 3
  • LinCmb adaptive imputation
  • Denote by M the gene expression data, with some
    entries of M missing.
  • Denote by R, K, S, B, G1, G2, ..., G5 the
    imputations of M by methods ROW, kNN, SVD, BPCA
    and GMC.
  • We obtain model weights r, k, s, b, g1, g2, ..., g5
    that minimize
      || M - (rR + kK + sS + bB + g1G1 + ... + g5G5) ||^2
      s.t. (r, k, s, b, g1, ..., g5) >= 0,
           r + k + s + b + g1 + ... + g5 = 1
  • But we don't know the true values at the missing
    entries of M!
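The constrained least-squares fit of the stacking weights can be sketched with scipy. This is an illustrative stand-in for LinCmb's training step (the function name is ours), not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

def fit_lincmb_weights(target, estimates):
    """Fit non-negative weights summing to one that combine the candidate
    imputations (rows of `estimates`) to match `target` in least squares."""
    m = estimates.shape[0]
    objective = lambda w: np.sum((target - w @ estimates) ** 2)
    res = minimize(
        objective,
        x0=np.full(m, 1.0 / m),                    # start from equal weights
        bounds=[(0.0, None)] * m,                  # each w_i >= 0
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
        method="SLSQP",
    )
    return res.x
```

Here `target` plays the role of the known values at the fake missing positions (next slide), and each row of `estimates` is one constituent method's imputation at those positions.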
9
SECTION 3
  • LinCmb adaptive imputation
  • The proportion of missing values in the data
    matrix is p.
  • We first use kNN to impute the missing values
    Mtrue.
  • We generate fake missing values Mfake
    (proportion p/(1-p)). Their true values are known
    to LinCmb. If an Mfake entry coincides with an
    Mtrue entry, it is not treated as missing.
  • We can now train the weights using Mfake.
  • We can now train the weights using Mfake
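The fake-missing-value step can be sketched as follows (illustrative; `fake_missing_mask` is our name, not from the paper). If a fraction p of entries is truly missing, drawing fakes at rate p/(1-p) among the observed entries mirrors the real missingness rate:

```python
import numpy as np

def fake_missing_mask(M, rng=None):
    """Draw a 'fake' missing mask over the observed entries of M.

    With a fraction p of entries truly missing (NaN), sampling observed
    entries at rate p/(1-p) yields roughly p * M.size additional holes,
    whose true values remain known for training the weights."""
    rng = np.random.default_rng(rng)
    observed = ~np.isnan(M)
    p = 1.0 - observed.mean()                      # true missing proportion
    fake = (rng.random(M.shape) < p / (1.0 - p)) & observed
    return fake
```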

10
SECTION 3
  • LinCmb adaptive imputation
  • Of course, there is instability in estimating the
    weights for LinCmb.
  • We repeat the training procedure T times (e.g.
    T = 30-200), obtaining several sets of weights
    r, k, s, b, g1, ..., g5.
  • To impute Mtrue, we combine the T models via a
    simple model averaging procedure.
  • (We also tried median models, but this did not
    significantly alter/improve the results.)

11
SECTION 4
An illustration on Gene expression data
  • We illustrate LinCmb on a Liver/Liver cancer data
    set from the Stanford Microarray Database
    (Gollub et al., 2003).
  • The original data set consisted of 207 arrays and
    23,000 probes. After pre-screening we obtain a
    data set of 20 liver and 20 cancer samples,
    for 6,511 probes with no missing values.
  • Simulation setup:
  • Using the liver data set we simulate
    1%, 4%, or 7% missing values, at random (MAR).
  • We repeat this 200 times. In each iteration we
    compute the RMSE of LinCmb and its constituent
    methods.
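The MAR simulation and RMSE comparison can be sketched as below (illustrative; the paper uses 200 repetitions, fewer here, and `imputation_rmse` is our name):

```python
import numpy as np

def imputation_rmse(M_complete, impute_fn, miss_rate=0.01, reps=5, seed=0):
    """Delete entries of a complete matrix at random (MAR), impute them with
    `impute_fn`, and report the RMSE on the deleted entries, averaged over
    `reps` repetitions."""
    rng = np.random.default_rng(seed)
    rmses = []
    for _ in range(reps):
        mask = rng.random(M_complete.shape) < miss_rate
        M = M_complete.copy()
        M[mask] = np.nan                           # hide the chosen entries
        M_hat = impute_fn(M)
        rmses.append(np.sqrt(np.mean((M_hat[mask] - M_complete[mask]) ** 2)))
    return float(np.mean(rmses))
```

Any imputation routine with the signature `impute_fn(M) -> M_hat` (such as the ROW/kNN sketches earlier) can be plugged in and compared on equal footing.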

12
SECTION 4
Results: model weights, liver samples
[Figure: estimated weights for global vs. local methods]
13
SECTION 4
Results: model weights, liver cancer samples
[Figure: estimated weights for global vs. local methods]
14
SECTION 4
  • Model weights
  • Local vs. Global
  • When there are few missing values, local methods
    perform quite well, and LinCmb assigns large
    weights to these methods.
  • When there are a lot of missing values, local
    information is not available, and LinCmb assigns
    large weights to global methods.
  • For the more heterogeneous cancer data, LinCmb
    assigns larger weights to the local methods.

15
RMSE(method) - RMSE(LinCmb) for liver.
SECTION 5
[Figure: differences at 1%, 4%, and 7% missing; RMSE of LinCmb for liver and liver cancer data]
16
SECTION 5
  • RMSE
  • LinCmb is clearly the best method in terms of
    RMSE.
  • SVD and ROWimpute perform poorly!
  • Local and global methods are competitive if there
    are only a few missing values, but local method
    performance deteriorates when there are a lot of
    missing values.
  • Among local methods, the popular kNN is not
    competitive.

17
SECTION 6
  • Differential Expression
  • Imputation methods are usually compared in terms
    of RMSE.
  • Since LinCmb is trained to minimize the MSE,
    LinCmb is the winner in terms of RMSE.
  • But what about the effect on high-level analysis?
    Shouldn't that be our focus?
  • We decide to compare LinCmb to its constituent
    methods in terms of accuracy of gene selection.

18
SECTION 6
  • Differential Expression
  • We will focus on two types of tests:
    - the standard t-test
    - the regularized t-test (Baldi and Long, 2001)
  • We identify a set of significantly differentially
    expressed genes via the p-value adjustment
    method/FDR procedure (Benjamini-Hochberg, 1995).
  • We fix the FDR level at 0.1 in this study (we
    have explored other cutoffs as well).
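The Benjamini-Hochberg step-up procedure referenced above can be sketched in a few lines (a minimal illustration, not the authors' code):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """Benjamini-Hochberg step-up procedure: return a boolean mask of
    hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m           # q * i / m for rank i
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])             # largest rank passing
        reject[order[: k + 1]] = True              # reject all up to that rank
    return reject
```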

19
SECTION 6
  • Differential Expression
  • On the original data set (no missing values) this
    selection produces a gold-standard gene list GS.
  • As before, we simulate missing data and impute
    using the various methods.
  • We then compare the gene lists GS* produced using
    the imputed data to the list GS.
  • How many genes are in GS* but not in GS?
  • How many genes are in GS but not in GS*?

FPR (false positive rate) = |GS* \ GS| / |GS*|
FNR (false negative rate) = |GS \ GS*| / |GS|
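Given the gold-standard list GS and an imputed-data list GS*, these two rates can be computed as (illustrative; `fpr_fnr` is our name):

```python
def fpr_fnr(gs_star, gs):
    """False positive / false negative rates of a gene list gs_star
    (selected after imputation) against the gold-standard list gs."""
    gs_star, gs = set(gs_star), set(gs)
    fpr = len(gs_star - gs) / len(gs_star)   # selected but not in gold standard
    fnr = len(gs - gs_star) / len(gs)        # gold standard but not selected
    return fpr, fnr
```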
20
SECTION 6
Differential Expression, Standard t: false positive / false negative rates
[Figure: FNR (power) vs. FPR, by method and missing rate]
21
SECTION 6
Differential Expression, Regularized t: false positive / false negative rates
[Figure: FNR (power) vs. FPR, by method and missing rate]
22
SECTION 6
  • Differential Expression
  • (FPR + FNR) for reg-t < (FPR + FNR) for std-t
  • NONE (no imputation) has the lowest FPR, and the
    highest FNR.
  • FPR increases drastically using ROW
    (it underestimates the variance).
  • If a regularized test is used, or few values are
    missing, NONE performs better overall than ROW!
  • kNN is a little better than ROW in terms of FPR,
    but tends to have a slightly higher FNR.
  • BPCA and LinCmb seem to control both FPR and FNR.

23
  • Another Comparison
  • It's difficult to compare FPR and FNR
    simultaneously.
  • We decide to compare FPR at fixed FNR levels.
  • We construct two gene lists from the imputed data:
    - GS100: the shortest gene ranking list that
      completely contains GS (FNR = 0)
    - GS95: the shortest gene ranking list that
      contains 95% of GS (FNR = 5%)
  • We compute the false positive rate,
    FPR = |GS100 \ GS| / |GS100|. We compute
    FPR for GS95 in a similar fashion.
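The GS100/GS95 construction can be sketched as follows (illustrative; `fpr_at_coverage` is our name): given the genes ranked by significance on the imputed data, take the shortest prefix covering the required fraction of GS and report that prefix's FPR.

```python
import math

def fpr_at_coverage(ranking, gs, coverage=1.0):
    """Shortest prefix of `ranking` containing `coverage` of the
    gold-standard set gs; returns its FPR = |prefix \\ gs| / |prefix|.

    coverage=1.0 gives GS100 (FNR = 0); coverage=0.95 gives GS95."""
    gs = set(gs)
    need = math.ceil(coverage * len(gs))     # gold-standard genes to cover
    hits = 0
    for n, gene in enumerate(ranking, start=1):
        hits += gene in gs
        if hits >= need:
            return (n - hits) / n            # non-GS genes in the prefix
    raise ValueError("ranking does not cover the requested fraction of gs")
```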

SECTION 6
24
FPR for GS100
SECTION 6
[Figure: FPR at 1%, 4%, and 7% missing; reg-t and std-t]
25
FPR for GS95
SECTION 6
[Figure: FPR at 1%, 4%, and 7% missing; reg-t and std-t]
26
  • FPR GS100 - 1% to 7% missing
  • SVD and ROW perform OK in terms of FPR, despite
    poor performance in terms of RMSE.
  • Local methods are competitive when only 1% of
    values are missing, but performance deteriorates
    quickly.
  • kNN performs poorly at 4-7% missing.
  • FPR performance does not deteriorate as rapidly
    as RMSE.
  • Methods are more comparable when the reg-t is used.

SECTION 6
  • FPR GS95 - 1% missing
  • FPR at 5% FNR mimics the RMSE performance (SVD
    and ROW perform poorly), but the performance
    curve is flatter for FPR than for RMSE.
  • With 1% missing values, forgoing imputation is
    not a bad strategy.

27
SECTION 7
  • Discussion
  • LinCmb assigns weights to constituent methods:
    local methods are assigned large weights when
    local information is available.
  • LinCmb's RMSE performance is better than that of
    any of the constituent methods.
  • FPR performance indicates that high-level
    analysis is not as sensitive to imputation, but
    stay away from ROWimpute, SVDimpute and kNN!
  • Better not to impute when few values are missing.
  • Robust and global methods are in general better
    in terms of FPR.

28
SECTION 7
  • Future work
  • Where do we go from here?
  • Current work
  • We are extending this to handle imputation
    differently within an array: for genes with many
    missing values we use a global approach; for
    genes with few missing values, a local approach.
  • Extension: testing the Missing-at-Random
    assumption. We should treat observations that are
    not missing at random as censored (i.e.,
    informative missingness).