1
SVM-based techniques for biomarker discovery in
proteomic pattern data
  • Elena Marchiori
  • Department of Computer Science
  • Vrije Universiteit Amsterdam

2
Overview
  • Variable selection
  • SVM-based techniques
  • Application to proteomic pattern data
  • Results
  • Conclusion

3
Variable Selection
  • Select a small subset of input variables (for
    example genes in gene expression data, m/z values
    in proteomic pattern data) that are used for
    building the classifier
  • Advantages:
      • it is cheaper to measure fewer variables
      • the resulting classifier is simpler and
        potentially faster
      • prediction accuracy may improve by discarding
        irrelevant variables
      • identifying relevant variables gives more insight
        into the nature of the corresponding
        classification problem (biomarker detection)

4
Support Vector Machines
  • Advantages:
      • maximize the margin between two classes in the
        feature space characterized by a kernel function
      • are robust with respect to high input dimension
  • Disadvantages:
      • difficult to incorporate background knowledge
      • sensitive to outliers

5
Binary classification
$f(x) = \mathrm{sign}(w^T x + b)$
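
A minimal sketch of the decision function above in plain Python. The weight vector, bias, and sample points are hypothetical values chosen only for illustration, and treating a score of exactly zero as the positive class is an assumed convention.

```python
def decision(w, x, b):
    """Return +1 or -1 according to the sign of w^T x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumed convention: a score of exactly zero goes to the positive class.
    return 1 if score >= 0 else -1


# Hypothetical weights for a two-variable problem.
w, b = [2.0, -1.0], -0.5
print(decision(w, [1.0, 0.5], b))  # score = 2.0 - 0.5 - 0.5 = 1.0
print(decision(w, [0.0, 1.0], b))  # score = -1.0 - 0.5 = -1.5
```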
6
Linear Separators
7
SVM: separable classes
  • Support vectors uniquely characterize the optimal
    hyper-plane
[Figure: two separable classes, with the margin, the optimal hyper-plane, and the support vectors labeled]
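
For reference, the optimal hyper-plane for separable classes is usually written as the following hard-margin problem. This is the standard textbook formulation rather than something stated on the slide; the training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ and the sample count $n$ are notation introduced here.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w^T x_i + b) \ge 1, \quad i = 1, \dots, n,
\qquad
\text{margin} = \frac{2}{\lVert w \rVert}
```

The support vectors are the training points for which the constraint holds with equality.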
8
SVM and outliers
[Figure: data set with one point labeled as an outlier]
9
SVM-RFE
  • Linear binary classifier decision function
  • Recursive Feature Elimination (SVM-RFE)
      • at each iteration:
          • eliminate a threshold fraction of the variables
            with the lowest scores
          • recompute the scores of the remaining variables
  • SVM-RFE based algorithms
      • run SVM-RFE with different thresholds
      • JOIN: select variables occurring more than cutoff
        times
      • ENSEMBLE: consider the majority vote of the resulting
        classifiers

10
SVM-RFE: I. Guyon et al., Machine Learning, 46,
389-422, 2002
11
SVM-RFE variant
  • Input: train set, threshold T, number N of
    variables to be selected
  • Output: subset of variables of size N
  • RFE:
      • Train: run a linear SVM on the train set
      • Score: generate a sequence of variables ordered
        by the absolute value of their weight
      • Eliminate: remove a fraction T of the variables,
        those with the lowest scores, from the ordered
        sequence
      • Repeat (train, score, eliminate) on the train set
        restricted to the remaining variables until only N
        variables are left
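
The train / score / eliminate loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the talk: `train_linear_svm` is a hypothetical stand-in that fits a linear SVM by sub-gradient descent on the regularized hinge loss, T is read as a fraction of the remaining variables, and at least one variable is dropped per iteration.

```python
def train_linear_svm(X, y, epochs=100, eta=0.1, lam=0.01):
    """Stand-in linear SVM trainer (assumption): sub-gradient descent on the
    L2-regularized hinge loss. X is a list of rows, y has labels in {-1, +1}."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # L2 shrinkage
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def rfe(X, y, T, N, train=train_linear_svm):
    """Return the indices of the N variables that survive elimination."""
    remaining = list(range(len(X[0])))
    while len(remaining) > N:
        # Train on the data restricted to the surviving variables.
        Xr = [[row[j] for j in remaining] for row in X]
        w, _ = train(Xr, y)
        # Score: order variables by |weight|, lowest first.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]))
        # Eliminate a fraction T, at least one variable, never going below N.
        n_drop = min(max(1, int(T * len(remaining))), len(remaining) - N)
        dropped = {remaining[k] for k in order[:n_drop]}
        remaining = [j for j in remaining if j not in dropped]
    return remaining


# Toy data: variable 0 is most informative, variable 1 less so, 2 and 3 are noise.
X = [[2, 1, 0.1, -0.1], [2, 1, -0.1, 0.1],
     [-2, -1, 0.1, -0.1], [-2, -1, -0.1, 0.1]]
y = [1, 1, -1, -1]
print(rfe(X, y, T=0.5, N=2))
```

Passing the trainer as an argument keeps the elimination loop independent of how the linear SVM is fitted, so a library solver could be substituted for the stand-in.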

12
JOIN and ENSEMBLE SVM-RFE
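
A minimal sketch of the JOIN step, assuming it receives one selected-variable list per SVM-RFE run (one run per threshold) and keeps the variables that occur in strictly more than `cutoff` of them. The function name and the example selections are hypothetical; reading "more than cutoff times" as a strict inequality is also an assumption.

```python
from collections import Counter


def join_select(selections, cutoff):
    """Keep variables selected in strictly more than `cutoff` runs."""
    # set(sel) guards against counting a variable twice within one run.
    counts = Counter(v for sel in selections for v in set(sel))
    return sorted(v for v, c in counts.items() if c > cutoff)


# Hypothetical variable indices returned by four SVM-RFE runs.
runs = [[0, 3, 5], [0, 5, 7], [0, 2, 5], [0, 3, 9]]
print(join_select(runs, cutoff=1))
print(join_select(runs, cutoff=2))
```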
13
Case Study: proteomic pattern data
  • Petricoin et al. papers
  • Commercial analysis software (Proteome Quest):
    http://www.correlogic.com/
  • Data sets available at
    http://ncifdaproteomics.com/ppatterns.php

14
Data generation: SELDI-TOF MS (surface-enhanced
laser desorption/ionization time-of-flight mass
spectrometry)
  • Method for profiling a population of proteins in
    a sample according to the size and net charge of
    individual proteins.
  • The readout is a spectrum of peaks. The position
    of a protein in the spectrum corresponds to its
    time of flight, because lighter proteins fly
    faster than heavier ones.

1. Serum on protein binding plate
2. Insert plate in vacuum chamber
3. Irradiate plate with laser
4. This launches the proteins / peptides
5. Measure time of flight (TOF) of ions, related to the
   molecular weight of proteins
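
The link between time of flight and molecular weight can be made explicit under the usual idealization of a linear TOF analyzer. The formula is not given on the slide; the symbols are assumptions introduced here: an ion of mass $m$ and charge $ze$ is accelerated through a potential $U$ and then drifts over a field-free length $L$ with velocity $v$.

```latex
z e U = \tfrac{1}{2}\, m v^2
\quad\Longrightarrow\quad
t = \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}}
\quad\Longrightarrow\quad
\frac{m}{z} = \frac{2 e U}{L^2}\, t^2
```

Flight time therefore grows with the square root of m/z, which is why heavier ions arrive later.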
15
Example of proteomic pattern profile from one
blood sample
[Figure: spectrum of abundance against time of flight]
  • Heavier peptides move slower, so time of flight
    corresponds to weight
  • Weight corresponds to peptides
  • Measurement of relative abundance of detected
    peptides in serum

16
How to use such data?
  • Diagnostic tool:
      • design a classifier for discriminating healthy
        from disease samples
  • Biomarker identification:
      • Variable subset selection (VSS): select a subset
        of input variables (m/z values) that best
        discriminate the two classes (potential
        biomarkers)

17
Commercial Tools
  • Proteome Quest (Correlogic): GA + clustering, no
    pre-selection (Petricoin et al., The Lancet 2002)
  • Propeak (3Z Informatics): separability analysis +
    bootstrap
  • Biomarker AMplification Filter, BAMF (Eclipse
    Diagnostics): ?

18
Non-commercial Techniques
  • Pre-processing + ranking + kNN (Zhu et al., PNAS
    2003)
  • Pre-selection + boosted decision trees (Qu et
    al., Clin. Chem. 2002)
  • Filter FS + classifier (Liu et al., Genome
    Informatics 2002)
  • GA + SVM, SVM-RFE ensemble (Jong et al., EvoBIO
    2004; Jong et al., CIBCB 2004)
  • Many others: any ML method for classification/FS
    (see, e.g., special issue on FS, JMLR 2003)

19
Goal and Methods
  • Goal: analyze the performance of SVM-based techniques
    for classification and variable selection with
    proteomic pattern data
  • SVM
  • SVM-RFE
  • Ensemble SVM-RFE
      • Majority vote of the SVM-RFE classifiers obtained
        from SVM-RFE with different threshold values
  • Join SVM-RFE
      • SVM trained on the N variables that have been
        selected most often by SVM-RFE with different
        threshold values
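
The majority vote used by Ensemble SVM-RFE can be sketched as below, assuming each classifier outputs labels in {-1, +1} for the same samples. The function name and the example predictions are hypothetical, and sending ties to the negative class is an assumed convention (ties are possible with an even number of classifiers).

```python
def majority_vote(predictions):
    """predictions: one list of labels in {-1, +1} per classifier,
    all aligned on the same samples. Returns the voted label per sample."""
    votes = [sum(column) for column in zip(*predictions)]
    # Assumed convention: a tied vote (sum of zero) goes to the negative class.
    return [1 if v > 0 else -1 for v in votes]


# Hypothetical outputs of three SVM-RFE classifiers on three samples.
preds = [[1, -1, 1],
         [1, 1, -1],
         [-1, -1, 1]]
print(majority_vote(preds))
```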

20
Datasets
  • Two proteomic pattern datasets from prostate and
    ovarian cancer from the NCI/CCR and FDA/CBER Clinical
    Proteomics Program Databank

Dataset           M/z values   Healthy           Cancer   Total
Prostate          15154        253               69       322
Ovarian 4/03/02   15154        115 (15 benign)   100      215

Data sets available at
http://ncifdaproteomics.com/ppatterns.php
21
Experimental Setup
  • 10 random partitions of the dataset: T (50%), H (25%),
    V (25%)
  • Algorithms:
      • SVM trained on the union of T and H
      • SVM-RFE(threshold) with thresholds
        0.2, 0.3, 0.4, 0.5, 0.6, 0.7
          • Choose the threshold giving the best classifier
            sensitivity on H
      • JOIN(cutoff, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with
        cutoffs 1, 2, 3, 4, 5
          • Choose the cutoff giving the best classifier
            sensitivity on H
  • Performance: average over the 10 V sets
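
A minimal sketch of the data splitting and of the sensitivity measure used for model selection, assuming the sizes of T, H, and V are 50%, 25%, and 25% of the samples. The function names are hypothetical, and whether the original partitions were stratified by class is not stated, so this version is a plain random shuffle.

```python
import random


def partition(n_samples, seed):
    """Randomly split sample indices into T (50%), H (25%), V (the rest)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_t, n_h = n_samples // 2, n_samples // 4
    return idx[:n_t], idx[n_t:n_t + n_h], idx[n_t + n_h:]


def sensitivity(y_true, y_pred, positive=1):
    """Fraction of positive (e.g. cancer) samples that are predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Ten random partitions of a 322-sample dataset (the prostate set size).
splits = [partition(322, seed) for seed in range(10)]
T, H, V = splits[0]
print(len(T), len(H), len(V))
```

Each threshold or cutoff would then be scored with `sensitivity` on H, and the chosen configuration evaluated on V.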

22
Results: Prostate Dataset
23
Results: Ovarian Dataset
24
Controversy
  • Noise, bias, results reliability and
    reproducibility in serum proteomics:
      • Sorace, Zhan, BMC Bioinformatics, 2004.
      • Petricoin, BMC Bioinformatics, 2004.
      • Baggerly, Journal of the National Cancer
        Institute, vol. 97, no. 4, 2005.
      • Liotta, Journal of the National Cancer Institute,
        vol. 97, no. 4, 2005.
      • Ransohoff, Journal of the National Cancer
        Institute, vol. 97, no. 4, 2005.

25
Conclusion
  • Many machine learning techniques can be used for
    potential biomarker detection with proteomic
    pattern data.
  • SVM-based techniques are a potentially effective
    choice because of the high input dimension of
    such data.
  • Computational analysis of proteomic pattern data
    has to use a correct methodology that accounts for
    the biases induced by the selection and
    classification algorithms and by the data
    splitting.
  • Problems related to the reliability and
    reproducibility of the data are inherent to the
    laboratory technology and are currently being
    addressed by researchers and practitioners.

26
Acknowledgments
  • Connie Jimenez (Biology, VUMC)
  • Aad van der Vaart (Statistics, VUA)