1
Control of selection bias in microarray data
analysis
  • C. Furlanello, M. Serafini, S. Merler, G. Jurman
  • ITC-irst, Trento

Microarray Meeting 2003, Milan, 10th June 2003
2
The standard approach in gene profiling (risk of
selection bias):
  • Choose a suitable classifier/ranking method pair
  • Filter/rank and obtain an optimal number of features
  • Train the classifier and obtain a model
  • Publish results
  • Validate the model (gene profile predictor)
  • Apply it to unlabeled data
3
The selection bias problem
Consider the public colon cancer dataset, and
apply the above procedure without taking care to
exclude test set data from the feature ranking and
the model development.
A zero cross-validation (CV) error may be obtained
with only 8 genes (*). But when the same experiment
is repeated after a label randomization, a very
similar result is reached: 14 genes are sufficient
to get a zero error estimate.
(*) Similar results of near-perfect classification
with few genes were published in PNAS, Machine
Learning, Genome Research, etc.
Colon cancer data
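A minimal sketch of this leakage on pure noise, using a nearest-centroid classifier and a class-mean-difference filter as illustrative stand-ins for the slide's SVM/ranking pair (all names, sizes and thresholds here are assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure noise: 60 samples, 2000 Gaussian features, random binary labels.
X = rng.standard_normal((60, 2000))
y = rng.integers(0, 2, 60)

def top_features(X, y, k):
    # Illustrative filter: rank by absolute difference of class means.
    d = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(d)[::-1][:k]

def centroid_cv_accuracy(X, y, n_folds=5):
    # K-fold CV accuracy of a nearest-centroid classifier.
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    correct = 0
    for f in folds:
        tr = np.setdiff1d(idx, f)
        c0 = X[tr][y[tr] == 0].mean(axis=0)
        c1 = X[tr][y[tr] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[f] - c1, axis=1)
                < np.linalg.norm(X[f] - c0, axis=1)).astype(int)
        correct += int((pred == y[f]).sum())
    return correct / len(y)

# WRONG: the filter sees all 60 samples, so every CV test fold has
# already leaked into the feature selection.
biased = centroid_cv_accuracy(X[:, top_features(X, y, 20)], y)
print(f"biased CV accuracy on pure noise: {biased:.2f}")
```

Even though the labels are random, the leaked selection step makes the CV estimate land well above the 0.5 chance level.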
4
This effect is by no means coincidental: the same
procedure yields a zero error estimate with very
few genes on synthetic data, too.
Simulation data: 50 + 50 samples, binary
classification problem, 5000 Gaussian features;
f1000: 1000 relevant features; f0: no relevant
features.
Even on purely noisy data, the machine flags some
features as relevant, with 20 of them sufficient
to reach perfect classification.
This is known as the selection bias (Ambroise and
McLachlan, 2002).
Typical wrong experimental settings
  • Consider 3 subset S0, S1, S2 of a dataset S.
  • Get the optimal number of features on S0 , e.g.
    by min CV errorTrain the model on S1 and test it
    on the disjoint set S2 (right)
  • If S2 is not available, often a CV error estimate
    over S0 U S1 is used as predictive (or in
    general, if S0 and S2 overlap) ? selection bias!
  • over-optimistic model performance (perfect or
    near perfect classification with very few genes
    wrong!)

5
The methodology scheme
The approach we propose to avoid the selection
bias consists of an external stratified-partition
resampling scheme coupled with an internal K-fold
cross-validation applied to a ranking method.
(Diagram: OFS-M, the feature ranking and modeling
step; exponential fit; ONF, the optimal number of
features, chosen by 3-fold CV)
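The external part of the scheme can be sketched as follows: every replicate re-runs the feature selection on its training portion only and validates on the held-out portion. The filter and classifier below are the same illustrative stand-ins as before, not the authors' SVM pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2000))   # pure noise again
y = rng.integers(0, 2, 60)

def top_features(X, y, k):
    # Illustrative filter: absolute difference of class means.
    d = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(d)[::-1][:k]

def centroid_predict(Xtr, ytr, Xte):
    # Nearest-centroid prediction as a lightweight classifier stand-in.
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    return (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)

# External resampling: selection sees training data only; the test
# part of each replicate stays disjoint from every modeling step.
correct = total = 0
for _ in range(20):
    perm = rng.permutation(len(y))
    tr, te = perm[:40], perm[40:]
    feats = top_features(X[tr], y[tr], 20)
    pred = centroid_predict(X[tr][:, feats], y[tr], X[te][:, feats])
    correct += int((pred == y[te]).sum())
    total += len(te)
acc = correct / total
print(f"unbiased test accuracy on pure noise: {acc:.2f}")
```

With the leak closed, accuracy on noise falls back to the chance level, as it should.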
6
The modeling procedure is replicated on resampled
sets, with validation always performed on a
disjoint test set.
VAL
The multiplicity of extractions of genes in the
replicated experiments may be used as an
additional measure of importance.
Finally, the estimates are compared with
experiments with randomized labels in order to
detect design problems.
7
The classifier/ranking machine
As the methodology steps involve a heavy
computational workload, the choice of the
(ranking, classifier) pair is crucial.
Support Vector Machine (SVM): this learning
technique identifies the maximal-margin hyperplane
separating two classes of data. It is fast (in the
linear case), accurate and customizable through
parameter tuning.
The RFE idea: given N features (genes), train an
SVM, compute a cost function J from the internals
of the SVM, rank the features in terms of their
contribution to J, and discard the feature
contributing least to J. Reapply the procedure on
the remaining N-1 features. This is called
Recursive Feature Elimination (RFE).
8
The entropy-based approach
BUT RFE is computationally very expensive.
A different feature selection approach is needed:
→ discard at each run a fraction v of the
remaining n features (v-RFE), or
→ adaptively discard a number of features
depending on the behavior of J or, better, of a
function p_i.
IDEA: in order to get rid of a suitable chunk of
genes, two measures may be employed to evaluate
the contribution of the features to the SVM: the
entropy H and the mean M of the SVM weight
distribution.
9
The E-RFE algorithm
The resulting procedure is faster than other
RFE-based methods while reaching comparable
accuracy.
10
Results I
Last but not least:
The E-RFE procedure is fast and reasonably
accurate on both synthetic and microarray
datasets. The VAL scheme avoids both overfitting
and the selection bias.
ATE: average test error rate (%) over 50 VAL runs;
n: average optimal number of features;
Time: elapsed time for each run (s)
11
Results II
Even on a more recent and larger public microarray
dataset (76 samples described by the expression
levels of 16063 genes), the accuracy reached is
good.
ATE: average test error rate (%) over 50 VAL runs;
n: average optimal number of features;
Time: elapsed time for each run (s)
The speed gain of E-RFE with respect to v-RFE can
also be read off the curves showing the steps
required to discard the features.
Ramaswamy et al., PNAS 2001 and Nature Genetics
2003: 64 primary adenocarcinomas vs. 12 metastatic
adenocarcinomas, Affymetrix technology.