SVM-based techniques for biomarker discovery in proteomic pattern data

1
SVM-based techniques for biomarker discovery in
proteomic pattern data

Elena Marchiori
Department of Computer Science
Vrije Universiteit Amsterdam

2
Overview

Variable selection
SVM-based techniques
Application to proteomic pattern data
Results
Conclusion

3
Variable Selection

Select a small subset of input variables (for
example genes in gene expression data, m/z values
in proteomic pattern data) which are used for
building a classifier
Advantages
it is cheaper to measure fewer variables
the resulting classifier is simpler and
potentially faster
prediction accuracy may improve by discarding
irrelevant variables
identifying relevant variables gives more insight
into the nature of the corresponding
classification problem (biomarker detection)

4
Support Vector Machines

Advantages
maximize the margin between two classes in the
feature space characterized by a kernel function
are robust with respect to high input dimension
Disadvantages
difficult to incorporate background knowledge
sensitive to outliers

5
Binary classification
f(x) = sign(w^T x + b)
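
A minimal sketch of the decision function above, in plain Python. The weight vector, bias and inputs are made-up numbers for illustration, and mapping a score of exactly zero to +1 is an assumption (the slide does not say how ties are handled).

```python
def decision(w, b, x):
    """Linear binary classifier f(x) = sign(w^T x + b), returning +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumption: a score of exactly 0 is assigned to the +1 class.
    return 1 if score >= 0 else -1


# Hypothetical weights and bias, not taken from the slides.
w = [2.0, -1.0]
b = -0.5

print(decision(w, b, [1.0, 0.5]))  # score = 2.0 - 0.5 - 0.5 = 1.0
print(decision(w, b, [0.0, 1.0]))  # score = -1.0 - 0.5 = -1.5
```
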
6
Linear Separators
7
SVM: separable classes
Support vectors uniquely characterize the optimal
hyper-plane
[Figure labels: margin, optimal hyper-plane, support vector]
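
For reference, the optimal hyper-plane for separable classes is usually written as the following textbook optimization problem. The slide only shows the picture; the symbols x_i (training samples), y_i in {-1, +1} (class labels) and n (number of samples) are assumptions introduced here.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{T}x_i + b) \ge 1,\qquad i = 1,\dots,n,
\qquad\text{with margin } \frac{2}{\lVert w\rVert}.
```

The support vectors are the samples for which the constraint holds with equality.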
8
SVM and outliers
[Figure label: outlier]
9
SVM-RFE

Linear binary classifier decision function
Recursive Feature Elimination (SVM-RFE)
at each iteration:
eliminate a fraction (the threshold) of the variables,
those with the lowest scores
recompute the scores of the remaining variables
SVM-RFE based algorithms
run SVM-RFE with different thresholds
JOIN: select the variables occurring more than cutoff
times
ENSEMBLE: take the majority vote of the resulting
classifiers
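
The JOIN step can be sketched as a simple count over the variable subsets returned by the individual SVM-RFE runs. The function name `join_select` and the example subsets are hypothetical; the only rule taken from the slide is "occurring more than cutoff times".

```python
from collections import Counter


def join_select(selections, cutoff):
    """Return the variables that occur in more than `cutoff` selections.

    `selections` holds one list of variable indices per SVM-RFE run
    (one run per threshold).
    """
    counts = Counter(v for sel in selections for v in set(sel))
    return sorted(v for v, c in counts.items() if c > cutoff)


# Made-up outputs of four SVM-RFE runs (indices of selected m/z values).
runs = [[3, 7, 12], [3, 7, 40], [3, 12, 40], [3, 7, 12]]

# Variable 3 occurs 4 times, 7 and 12 occur 3 times, 40 occurs twice.
print(join_select(runs, cutoff=2))  # [3, 7, 12]
```
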

10
SVM-RFE: I. Guyon et al., Machine
Learning, 46, 389-422, 2002
11
SVM-RFE variant

Input: train set, threshold T, number N of
variables to be selected
Output: subset of variables of size N
RFE
Train: run linear SVM on the train set
Score: generate a sequence of variables ordered
by the absolute value of their weight
Eliminate: remove a fraction T of the variables (those
with the smallest absolute weights) from the ordered
sequence
Repeat (train, score, eliminate) on the train set
restricted to the remaining variables until only N
variables are left
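
The train/score/eliminate loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation behind the slides: `train_linear_svm` is a small Pegasos-style subgradient trainer standing in for a real SVM solver (no bias term), all function and parameter names are hypothetical, and the toy data are made up, with variables 0 and 3 carrying the class signal.

```python
import random


def train_linear_svm(X, y, epochs=200, lam=0.01, seed=0):
    """Stand-in linear SVM: Pegasos-style subgradient descent on the
    regularized hinge loss. Returns the weight vector (bias omitted)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    idx = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1:  # sample violates the margin: hinge-loss step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def svm_rfe(X, y, T, N, train=train_linear_svm):
    """Return the indices of the N variables kept by recursive elimination.

    T is the fraction of the remaining variables removed per iteration.
    """
    remaining = list(range(len(X[0])))
    while len(remaining) > N:
        # Train: fit on the data restricted to the remaining variables.
        Xr = [[row[j] for j in remaining] for row in X]
        w = train(Xr, y)
        # Score: order variables by decreasing absolute weight.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]), reverse=True)
        # Eliminate: drop a fraction T (at least one), never going below N.
        n_drop = max(1, int(T * len(remaining)))
        n_keep = max(N, len(remaining) - n_drop)
        remaining = sorted(remaining[k] for k in order[:n_keep])
    return remaining


# Made-up toy data: 40 samples, 6 variables, signal in variables 0 and 3.
rng = random.Random(1)
X, y = [], []
for i in range(40):
    label = 1 if i % 2 == 0 else -1
    row = [rng.gauss(0.0, 1.0) for _ in range(6)]
    row[0] += 3.0 * label
    row[3] -= 3.0 * label
    X.append(row)
    y.append(label)

selected = svm_rfe(X, y, T=0.5, N=2)
print(selected)  # indices of the two variables that survive elimination
```

The `train` argument is pluggable so that a proper SVM solver can replace the stand-in without touching the elimination loop.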

12
JOIN and ENSEMBLE SVM-RFE
13
Case Study: proteomic pattern data

Petricoin et al. papers
Commercial analysis software (Proteome Quest)
http://www.correlogic.com/
Data sets available at
http://ncifdaproteomics.com/ppatterns.php

14
Data generation: SELDI-TOF MS (surface-enhanced
laser desorption/ionization time-of-flight mass
spectrometry)

Method for profiling a population of proteins in
a sample according to the size and net charge of
individual proteins.
The readout is a spectrum of peaks. The position
of a protein in the spectrum corresponds to its
time of flight because the small proteins fly
faster than the heavy ones.

1 Serum on protein binding plate
2 Insert plate in vacuum chamber
3 Irradiate plate with laser
4 This launches the proteins / peptides
5 Measure time of flight (TOF) of ions, related to the
molecular weight of proteins
15
Example of proteomic pattern profile from one
blood sample
[Figure: spectrum, abundance versus time of flight]

Heavier peptides move slower ->
Time of flight corresponds to weight
Weight corresponds to peptides
Measurement of relative abundance of detected
peptides in serum
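
Why the time of flight corresponds to weight can be seen from the idealized relation for a linear time-of-flight analyzer. This is a textbook simplification rather than something stated on the slides; the symbols U (accelerating voltage), L (length of the flight tube), e (elementary charge), z (charge number), m (ion mass), v (ion velocity) and t (time of flight) are introduced here as assumptions.

```latex
z e U = \tfrac{1}{2} m v^{2}
\;\Longrightarrow\;
t = \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}}
\;\Longrightarrow\;
\frac{m}{z} = \frac{2 e U\, t^{2}}{L^{2}}
```

Under these assumptions the flight time grows with the square root of m/z, which is why each position on the time axis can be read as an m/z value.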

16
How to use such data?

Diagnostic tool
design a classifier for discriminating healthy
from diseased samples
Biomarker identification
Variable subset selection (VSS): select a subset
of input variables (m/z values) that best
discriminate the two classes (potential
biomarkers)

17
Commercial Tools

Proteome Quest (Correlogic): GA + clustering, no
pre-selection (Petricoin et al., The Lancet 2002)
Propeak (3Z Informatics): separability analysis +
bootstrap
Biomarker AMplification Filter BAMF (Eclipse
Diagnostics): ?

18
Non-commercial Techniques

Pre-processing + ranking + kNN (Zhu et al., PNAS
2003)
Pre-selection + boosted decision trees (Qu et
al., Clin. Chem. 2002)
Filter FS + classifier (Liu et al., Genome
Informatics 2002)
GA + SVM, SVM-RFE + ensemble (Jong et al., EvoBIO
2004, Jong et al., CIBCB 2004)
Many others: any ML method for classification/FS
(see, e.g., special issue on FS, JMLR 2003)

19
Goal and Methods

Goal: analyze the performance of SVM-based techniques
for classification and variable selection with
proteomic pattern data
SVM
SVM-RFE
Ensemble SVM-RFE
majority vote of the SVM-RFE classifiers obtained
from SVM-RFE with different threshold values
Join SVM-RFE
SVM trained on the N variables that have been
selected most often by SVM-RFE with different
threshold values
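
A minimal sketch of the two combination rules, assuming class labels +1/-1. The names `majority_vote` and `most_frequent` are hypothetical, and both tie-breaking rules (a tied vote goes to +1, equally frequent variables are ordered by index) are assumptions, since the slides do not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Ensemble SVM-RFE: combine one +1/-1 prediction per classifier."""
    # Assumption: a tie (possible with an even number of classifiers) -> +1.
    return 1 if sum(predictions) >= 0 else -1


def most_frequent(selections, n):
    """Join SVM-RFE: the n variables selected most often across the runs."""
    counts = Counter(v for sel in selections for v in set(sel))
    # Assumption: equally frequent variables are ordered by their index.
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    return sorted(ranked[:n])


# Made-up predictions of three classifiers for one sample.
print(majority_vote([1, 1, -1]))  # 1

# Made-up variable subsets from three SVM-RFE runs.
runs = [[3, 7, 12], [3, 7, 40], [3, 12, 40]]
print(most_frequent(runs, 2))  # [3, 7]
```
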

20
DataSets

Two proteomic pattern datasets, from prostate and
ovarian cancer, from the NCI/CCR and FDA/CBER Clinical
Proteomics Program Databank

Dataset           M/z values   healthy           cancer   tot
Prostate          15154        253               69       322
Ovarian 4/03/02   15154        115 (15 benign)   100      215

Data sets available at
http://ncifdaproteomics.com/ppatterns.php
21
Experimental Setup

10 random partitions of each dataset: T (50%), H (25%), V
(25%)
Algorithms
SVM trained on the union of T and H
SVM-RFE(threshold) with thresholds
0.2, 0.3, 0.4, 0.5, 0.6, 0.7
choose the threshold giving the best classifier
sensitivity on H
JOIN(cutoff, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with
cutoffs 1, 2, 3, 4, 5
choose the cutoff giving the best classifier sensitivity
on H
Performance: average over the 10 V sets
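
The setup above can be sketched as follows. The function names are hypothetical, the rounding of the 50/25/25 split and the unstratified shuffle are assumptions, treating cancer as the positive class for sensitivity is an assumption, and the sensitivities in the usage example are made-up numbers, not results from the study.

```python
import random


def split_thv(n_samples, seed):
    """Random partition of sample indices into T (50%), H (25%), V (rest)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_t = n_samples // 2
    n_h = n_samples // 4
    return idx[:n_t], idx[n_t:n_t + n_h], idx[n_t + n_h:]


def sensitivity(y_true, y_pred, positive=1):
    """Fraction of positive samples that are predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


def pick_best(candidates, score_on_h):
    """Return the candidate (threshold or cutoff) scoring highest on H."""
    return max(candidates, key=score_on_h)


# Ten random partitions of a dataset with 322 samples (prostate data size).
partitions = [split_thv(322, seed) for seed in range(10)]

# Made-up sensitivities on H for three thresholds.
h_scores = {0.2: 0.80, 0.3: 0.90, 0.4: 0.85}
print(pick_best(h_scores, h_scores.get))  # 0.3
```

The reported performance would then be the mean of the chosen model's score over the ten V sets.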

22
Results: Prostate Dataset
23
Results: Ovarian Dataset
24
Controversy

Noise, bias, and the reliability and
reproducibility of results in serum proteomics
Sorace, Zhan, BMC Bioinformatics, 2004.
Petricoin, BMC Bioinformatics, 2004.
Baggerly, Journal of the National Cancer
Institute, vol. 97, no. 4, 2005.
Liotta, Journal of the National Cancer Institute,
vol. 97, no. 4, 2005.
Ransohoff, Journal of the National Cancer
Institute, vol. 97, no. 4, 2005.

25
Conclusion

Many machine learning techniques can be used for
potential biomarker detection with proteomic
pattern data.
SVM-based techniques are a potentially effective
choice because of the high input dimension of
such data.
Computational analysis of proteomic pattern data
has to follow a correct methodology that accounts for
the biases induced by the selection and
classification algorithms and by the data
splitting.
Problems related to the reliability and
reproducibility of the data are inherent to the
laboratory technology and are currently being addressed by
researchers and practitioners.
