Title: SVM-based techniques for biomarker discovery in proteomic pattern data
1SVM-based techniques for biomarker discovery in
proteomic pattern data
- Elena Marchiori
- Department of Computer Science
- Vrije Universiteit Amsterdam
2Overview
- Variable selection
- SVM-based techniques
- Application to proteomic pattern data
- Results
- Conclusion
3Variable Selection
- Select a small subset of input variables (for example, genes in gene expression data or m/z values in proteomic pattern data) to be used for building a classifier
- Advantages
  - it is cheaper to measure fewer variables
  - the resulting classifier is simpler and potentially faster
  - prediction accuracy may improve by discarding irrelevant variables
  - identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)
4Support Vector Machines
- Advantages
  - maximize the margin between two classes in the feature space characterized by a kernel function
  - are robust with respect to high input dimension
- Disadvantages
  - difficult to incorporate background knowledge
  - sensitive to outliers
5Binary classification
$f(x) = \operatorname{sign}(w^{T} x + b)$
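A minimal sketch of this decision rule in plain Python. The function name `linear_decision` and the example weights are illustrative assumptions, as is the convention of assigning a score of exactly zero to the positive class.

```python
from typing import Sequence


def linear_decision(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return the class label sign(w^T x + b) as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumed convention: a score of exactly 0 is assigned to class +1.
    return 1 if score >= 0 else -1


w = [2.0, -1.0]   # hypothetical weight vector
b = -0.5          # hypothetical bias

print(linear_decision(w, b, [1.0, 0.5]))  # score = 2.0 - 0.5 - 0.5 = 1.0, prints 1
print(linear_decision(w, b, [0.0, 1.0]))  # score = -1.0 - 0.5 = -1.5, prints -1
```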
6Linear Separators
7SVM separable classes
Support vectors uniquely characterize the optimal hyperplane
[Figure: two separable classes, with the margin, the optimal hyperplane, and a support vector labeled]
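For reference, the separable case is usually written as the standard hard-margin problem below. The symbols $x_i$, $y_i \in \{-1, +1\}$ and $n$ (training points, labels, number of samples) are notation assumed here, not taken from the slides.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{T} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

In this formulation the margin equals $2 / \lVert w \rVert$, and the support vectors are the training points for which the constraint holds with equality.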
8SVM and outliers
[Figure: SVM separating hyperplane with an outlier labeled]
9SVM-RFE
- Linear binary classifier decision function: $f(x) = \operatorname{sign}(w^{T} x + b)$
- Recursive Feature Elimination (SVM-RFE)
  - at each iteration
    - eliminate a fraction (the threshold) of the variables with the lowest scores
    - recompute the scores of the remaining variables
- SVM-RFE based algorithms
  - run SVM-RFE with different thresholds
  - JOIN: select the variables occurring more than cutoff times
  - ENSEMBLE: take the majority vote of the resulting classifiers
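The JOIN selection rule can be sketched as a simple count over the variable subsets returned by the individual SVM-RFE runs. The function name `join_select` and the example subsets are hypothetical; only the rule "occurring more than cutoff times" comes from the slide.

```python
from collections import Counter
from typing import Iterable, List


def join_select(subsets: Iterable[Iterable[int]], cutoff: int) -> List[int]:
    """Return the variables that occur in more than `cutoff` of the subsets."""
    # set(subset) makes sure a variable is counted at most once per run.
    counts = Counter(v for subset in subsets for v in set(subset))
    return sorted(v for v, c in counts.items() if c > cutoff)


# Hypothetical subsets from four SVM-RFE runs with different thresholds.
runs = [[0, 3, 7], [0, 3, 5], [0, 2, 7], [0, 3, 9]]

# Variable 0 occurs 4 times and variable 3 occurs 3 times; variable 7 occurs
# only twice, which is not more than the cutoff of 2.
print(join_select(runs, cutoff=2))  # [0, 3]
```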
10SVM-RFE (I. Guyon et al., Machine Learning, 46, 389-422, 2002)
11SVM-RFE variant
- Input: train set, threshold T, number N of variables to be selected
- Output: subset of variables of size N
- RFE
  - Train: run a linear SVM on the train set
  - Score: generate a sequence of variables ordered by the absolute value of their weight
  - Eliminate: remove a fraction T of the variables (those with the smallest absolute weight) from the ordered sequence
  - Repeat (train, score, eliminate) on the train set restricted to the remaining variables until only N variables are left
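The train/score/eliminate loop above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the linear SVM is replaced by a small Pegasos-style subgradient trainer without bias term (`train_linear_svm`, a stand-in, not the solver used in the study), the last elimination step is clipped so that exactly N variables remain, at least one variable is dropped per iteration, and the toy data are synthetic.

```python
import random
from typing import List, Sequence


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     epochs: int = 50, lam: float = 0.01, seed: int = 0) -> List[float]:
    """Stand-in linear SVM: stochastic subgradient descent on the regularized
    hinge loss (Pegasos-style). Returns the weight vector; the bias is omitted."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    step = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]      # regularization shrink
            if margin < 1.0:                               # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def svm_rfe(X: Sequence[Sequence[float]], y: Sequence[int],
            threshold: float, n_select: int) -> List[int]:
    """Return the indices of the n_select variables kept by the RFE loop."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        # Train on the data restricted to the remaining variables.
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        # Score: order variables by absolute weight, largest first.
        ranked = sorted(range(len(remaining)), key=lambda k: abs(w[k]), reverse=True)
        # Eliminate: drop a fraction `threshold`, at least one variable,
        # and never go below n_select (assumed clipping rule).
        n_drop = max(1, int(threshold * len(remaining)))
        n_drop = min(n_drop, len(remaining) - n_select)
        keep = ranked[:len(remaining) - n_drop]
        remaining = sorted(remaining[k] for k in keep)
    return remaining


# Synthetic toy data: variables 0 and 3 carry the class signal, the rest is noise.
rng = random.Random(1)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = []
for label in y:
    row = [rng.uniform(-0.05, 0.05) for _ in range(6)]
    for j in (0, 3):
        row[j] = label + rng.uniform(-0.2, 0.2)
    X.append(row)

print(svm_rfe(X, y, threshold=0.5, n_select=2))  # [0, 3]
```

With a threshold of 0.5 the six toy variables are reduced to three and then to two, so the SVM is retrained only twice; smaller thresholds eliminate more slowly and retrain more often.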
12JOIN and ENSEMBLE SVM-RFE
13Case Study proteomic pattern data
- Petricoin et al. papers
- Commercial analysis software (Proteome Quest): http://www.correlogic.com/
- Data sets available at http://ncifdaproteomics.com/ppatterns.php
14Data generation: SELDI-TOF MS (surface-enhanced laser desorption/ionization time-of-flight mass spectrometry)
- Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins.
- The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its time of flight, because small proteins fly faster than heavy ones.
1. Serum on protein-binding plate
2. Insert plate in vacuum chamber
3. Irradiate plate with laser
4. This launches the proteins/peptides
5. Measure time of flight (TOF) of ions, related to the molecular weight of proteins
15Example of proteomic pattern profile from one
blood sample
[Figure: spectrum from one blood sample; x-axis: time of flight, y-axis: abundance]
- Heavier peptides move slower, so time of flight corresponds to weight
- Weight corresponds to peptides
- Measurement of relative abundance of detected peptides in serum
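The link between time of flight and weight can be made explicit with the idealized textbook relation for a linear time-of-flight analyzer. The symbols are assumptions introduced here: an ion of mass $m$ and charge $ze$ is accelerated through a potential $U$ and then drifts over a length $L$ with velocity $v$.

```latex
z e U = \tfrac{1}{2} m v^{2}
\quad\Longrightarrow\quad
t = \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}}
\quad\Longrightarrow\quad
t \propto \sqrt{m/z}
```

Under this idealization each time of flight maps to one m/z value, which is why the spectrum can be indexed by m/z.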
16How to use such data?
- Diagnostic tool
  - design a classifier for discriminating healthy from diseased samples
- Biomarker identification
  - Variable subset selection (VSS): select a subset of input variables (m/z values) that best discriminate the two classes (potential biomarkers)
17Commercial Tools
- Proteome Quest (Correlogic): GA + clustering, no pre-selection (Petricoin et al., The Lancet 2002)
- Propeak (3Z Informatics): separability analysis + bootstrap
- Biomarker AMplification Filter, BAMF (Eclipse Diagnostics): ?
18Non-commercial Techniques
- Pre-processing + ranking + kNN (Zhu et al., PNAS 2003)
- Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002)
- Filter FS + classifier (Liu et al., Genome Informatics 2002)
- GA + SVM, SVM-RFE ensemble (Jong et al., EvoBIO 2004; Jong et al., CIBCB 2004)
- Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)
19Goal and Methods
- Goal: analyze the performance of SVM-based techniques for classification and variable selection with proteomic pattern data
- SVM
- SVM-RFE
- Ensemble SVM-RFE
  - Majority vote of the classifiers obtained from SVM-RFE with different threshold values
- Join SVM-RFE
  - SVM trained on the N variables that have been selected most often by SVM-RFE with different threshold values
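A minimal sketch of the majority vote used by Ensemble SVM-RFE, assuming each classifier outputs labels in {+1, -1}. The function name `majority_vote`, the example predictions, and the tie-breaking rule (ties go to +1, which matters when the number of classifiers is even) are assumptions made for illustration.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """predictions[k][i] is the +1/-1 label that classifier k gives to sample i."""
    n_samples = len(predictions[0])
    votes = []
    for i in range(n_samples):
        total = sum(p[i] for p in predictions)
        # Assumed tie-breaking rule: a zero sum is resolved as +1.
        votes.append(1 if total >= 0 else -1)
    return votes


# Hypothetical predictions of three SVM-RFE classifiers on four samples.
preds = [
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [-1, 1, 1, -1],
]
print(majority_vote(preds))  # [1, 1, 1, -1]
```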
20DataSets
- Two proteomic pattern datasets, from prostate and ovarian cancer, from the NCI/CCR and FDA/CBER Clinical Proteomics Program Databank

Dataset           M/z values   Healthy           Cancer   Total
Prostate          15154        253               69       322
Ovarian 4/03/02   15154        115 (15 benign)   100      215

Data sets available at http://ncifdaproteomics.com/ppatterns.php
21Experimental Setup
- 10 random partitions of the dataset: T (50%), H (25%), V (25%)
- Algorithms
  - SVM trained on the union of T and H
  - SVM-RFE(threshold) with thresholds 0.2, 0.3, 0.4, 0.5, 0.6, 0.7
    - Choose the threshold giving the best classifier sensitivity on H
  - JOIN(cutoff, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with cutoffs 1, 2, 3, 4, 5
    - Choose the cutoff giving the best classifier sensitivity on H
- Performance: average (over the 10 V's)
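The partitioning and model-selection steps of this setup can be sketched as below. The helper names (`split_thv`, `sensitivity`, `best_threshold`), the rounding of the 50/25/25 split, the coding of cancer as +1, the tie-breaking rule (smallest threshold wins), and the example sensitivity values are all assumptions made for illustration, not details from the slides.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_thv(n_samples: int, seed: int) -> Tuple[List[int], List[int], List[int]]:
    """Randomly partition sample indices into T (about 50%), H (25%), V (rest)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_t = n_samples // 2
    n_h = n_samples // 4
    return idx[:n_t], idx[n_t:n_t + n_h], idx[n_t + n_h:]


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positive (+1, assumed to be cancer) samples predicted as +1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    return tp / (tp + fn) if tp + fn else 0.0


def best_threshold(sens_on_h: Dict[float, float]) -> float:
    """Pick the threshold with the highest sensitivity on H (ties: smallest)."""
    return max(sorted(sens_on_h), key=lambda th: sens_on_h[th])


# Ten random partitions of a dataset with 322 samples (the prostate set size).
partitions = [split_thv(322, seed) for seed in range(10)]
t_idx, h_idx, v_idx = partitions[0]
print(len(t_idx), len(h_idx), len(v_idx))  # 161 80 81

# Hypothetical sensitivities measured on H for three thresholds.
print(best_threshold({0.2: 0.90, 0.3: 0.95, 0.4: 0.95}))  # 0.3
```

The same selection function would apply to the JOIN cutoffs; the final figure reported for each method is then the average of its performance on the ten V sets.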
22Results Prostate Dataset
23Results Ovarian Dataset
24Controversy
- Noise, bias, results reliability and reproducibility in serum proteomics
  - Sorace, Zhan, BMC Bioinformatics, 2004
  - Petricoin, BMC Bioinformatics, 2004
  - Baggerly, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
  - Liotta, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
  - Ransohoff, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
25Conclusion
- Many machine learning techniques can be used for potential biomarker detection with proteomic pattern data.
- SVM-based techniques are a potentially effective choice because of the high input dimension of such data.
- Computational analysis of proteomic pattern data has to use a correct methodology that accounts for the biases induced by the selection and classification algorithms and by the data splitting.
- Problems related to the reliability and reproducibility of the data are inherent to the laboratory technology and are currently being addressed by researchers and practitioners.
26Acknowledgments
- Connie Jimenez (Biology, VUMC)
- Aad van der Vaart (Statistics, VUA)