PepArML: A model-free, result-combining peptide identification arbiter via machine learning




1
PepArML: A model-free, result-combining peptide
identification arbiter via machine learning
  • Xue Wu, Chau-Wen Tseng, Nathan Edwards
  • University of Maryland, College Park, and
  • Georgetown University Medical Center

2
Comparison of Search Engines
  • No single score is comprehensive
  • Search engines disagree
  • Many spectra lack confident peptide assignment
  • Many spectra lack any peptide assignment

Searle et al. JPR 7(1), 2008
3
Black-box Techniques
  • Significance re-estimation
      • Target-Decoy search
      • Bimodal distribution fit
  • Supervised machine learning
      • Train predictors on synthetic datasets
      • Select and/or create (many) good features
  • Result combiners
      • Incorrect peptide IDs unlikely to match
      • Significance re-estimation
      • Independence and/or supervised model
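
The target-decoy idea listed above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the talk: it assumes each peptide-spectrum match (PSM) carries a score where higher is better, that decoy hits come from a decoy database about the same size as the target database, and the names `target_decoy_fdr`, `threshold_for_fdr`, and the sample scores are all made up for the example.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    psms: list of (score, is_decoy) pairs; higher score means a better match.
    Assumes decoy hits above the threshold approximate the number of false
    target hits above it (equal-size target and decoy databases).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR is <= max_fdr, or None."""
    best = None
    # Walk from the strictest threshold to the loosest; the estimate is not
    # monotone, so keep the last (lowest) threshold that still qualifies.
    for t in sorted({score for score, _ in psms}, reverse=True):
        if target_decoy_fdr(psms, t) <= max_fdr:
            best = t
    return best


# Made-up scores for illustration only.
sample_psms = [
    (9.1, False), (8.4, False), (7.9, False), (7.2, True), (6.8, False),
    (6.1, False), (5.5, True), (5.0, False), (4.2, True), (3.9, True),
]

print(target_decoy_fdr(sample_psms, 6.0))
print(threshold_for_fdr(sample_psms, 0.25))
```

The bimodal-fit alternative mentioned on the same slide would replace the decoy count with a fitted mixture of "correct" and "incorrect" score distributions; it is not shown here.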

4
PepArML
  • Unified machine learning result combiner
      • Significance re-estimation too!
  • Model-free feature use and result combination
      • Use agreement and features if useful
  • Unsupervised training procedure
      • No loss of classification performance
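
As a rough illustration of "use agreement and features if useful", the sketch below merges the top hit per spectrum from several search engines into one feature row per (spectrum, peptide) pair, with an agreement count alongside the per-engine scores. The slides do not list PepArML's actual features, so the engine keys, the `build_features` helper, the input layout, and the sample values are all assumptions made for this example.

```python
ENGINES = ("tandem", "mascot", "omssa")  # illustrative keys, not PepArML's own names


def build_features(results):
    """Merge per-engine top hits into one feature row per (spectrum, peptide).

    results: {engine: {spectrum_id: (peptide, score)}}
    Returns: {(spectrum_id, peptide): {"agreement": n, "<engine>_score": value or None}}
    """
    features = {}
    for engine in ENGINES:
        for spectrum_id, (peptide, score) in results.get(engine, {}).items():
            row = features.setdefault((spectrum_id, peptide), {"agreement": 0})
            row[engine + "_score"] = score
            row["agreement"] += 1  # one more engine proposes this peptide
    # Give every row the same columns; None marks "engine did not report this peptide".
    for row in features.values():
        for engine in ENGINES:
            row.setdefault(engine + "_score", None)
    return features


# Made-up search results for two spectra.
results = {
    "tandem": {"s1": ("PEPTIDEK", 42.0), "s2": ("ELVISK", 12.5)},
    "mascot": {"s1": ("PEPTIDEK", 55.0), "s2": ("LIVESR", 20.1)},
    "omssa": {"s1": ("PEPTIDEK", 8.3)},
}
features = build_features(results)
for key, row in sorted(features.items()):
    print(key, row)
```

A downstream classifier is then free to lean on the agreement column, the raw scores, or both, which is one way to read "model-free" here.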

5
PepArML Overview
[Diagram: X!Tandem, Mascot, OMSSA, and other search engines; PepArML]
6
PepArML Overview
[Diagram: feature extraction; X!Tandem, Mascot, OMSSA, and other search engines; PepArML]
7
Dataset Construction
[Diagram: X!Tandem, Mascot, and OMSSA results with T/F labels]
8
Dataset Construction
  • Calibrant 8 Protein Mix (C8)
      • 4594 MS/MS spectra (LTQ)
      • 618 (11.2%) true positives
  • Sashimi 17mix_test2 (S17)
      • 1389 MS/MS spectra (Q-TOF)
      • 354 (25.4%) true positives
  • AURUM 1.0 (364 Proteins)
      • 7508 MS/MS spectra (MALDI-TOF-TOF)
      • 3775 (50.3%) true positives

9
PepArML Machine Learning
  • Machine learning (generally) helps single search
    engines
  • PepArML result-combiner (C-TMO) improves on
    single search engines
  • Sometimes combining two search engines works as
    well, or better, than three

10
PepArML vs Search Engines (C8)
11
True vs. Est. FDR (C-TMO, C8)
12
PepArML vs Search Engines (C8)
13
PepArML Pairs vs PepArML (C8)
14
Sensitivity Comparison
15
Feature Evaluation
[Figure panels: Tandem, Mascot, OMSSA]
16
Application to Real Data
  • How well do these models generalize?
  • Different instruments
      • Spectral characteristics change scores
  • Search parameters
      • Different parameters change score values
  • Supervised learning requires:
      • (Synthetic) experimental data from every
        instrument
      • Search results from available search engines
      • Training/models for all parameters × search
        engine sets × instruments

17
Model Generalization
Train S17 / Score S17
Train C8 / Score S17
18
Rescuing Machine Learning
  • Train a new machine learning model for every
    dataset!
      • Generalization not required
      • No predetermined search engines, parameters,
        instruments, features
  • Perhaps we can guess the true proteins
      • Most proteins not in doubt
      • Machine learning can tolerate imperfect labels
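
One way to read the bullets above is as an iterative loop: guess a protein set, label every peptide ID by whether its protein is in that set, train on those imperfect labels, re-guess the proteins from the confident IDs, and repeat until the set stops changing. The sketch below is only an illustration of that loop under strong simplifications. It swaps in a single-score threshold classifier (`train_stump`) for whatever learner PepArML actually uses, and `self_train`, the `min_peptides` rule, and the sample data are all assumptions.

```python
def train_stump(scores, labels):
    """Pick the score threshold with the fewest disagreements against noisy labels."""
    best_t, best_err = None, None
    for t in sorted(set(scores)):
        err = sum((s >= t) != y for s, y in zip(scores, labels))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def self_train(psms, initial_proteins, min_peptides=2, max_rounds=10):
    """psms: list of (score, peptide, protein). Returns (proteins, threshold)."""
    proteins = set(initial_proteins)
    threshold = None
    for _ in range(max_rounds):
        # Imperfect labels: an ID counts as "true" if its protein is currently believed.
        labels = [protein in proteins for _, _, protein in psms]
        threshold = train_stump([score for score, _, _ in psms], labels)
        # Re-guess the proteins from peptide IDs the current model accepts.
        confident = {}
        for score, peptide, protein in psms:
            if score >= threshold:
                confident.setdefault(protein, set()).add(peptide)
        new_proteins = {p for p, peps in confident.items() if len(peps) >= min_peptides}
        if new_proteins == proteins:
            break  # converged: the protein guess no longer changes
        proteins = new_proteins
    return proteins, threshold


# Made-up data: the initial guess wrongly includes P3.
toy_psms = [
    (9.0, "AAK", "P1"), (8.5, "BBK", "P1"), (8.0, "CCK", "P2"), (7.5, "DDK", "P2"),
    (3.0, "EEK", "P3"), (2.5, "FFK", "P4"), (2.0, "GGK", "P1"), (1.5, "HHK", "P5"),
]
final_proteins, final_threshold = self_train(toy_psms, {"P1", "P2", "P3"})
print(sorted(final_proteins), final_threshold)
```

In this toy run the wrongly guessed protein drops out after one round, which is the behavior "machine learning can tolerate imperfect labels" is pointing at; real data would need a multi-feature learner and a stricter protein rule.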

19
Unsupervised Learning
20
Unsupervised Learning (S17)
21
Unsupervised Learning (S17)
22
Protein Selection Heuristic
  • Modeled on typical protein identification
    criteria
      • High confidence peptide IDs
      • At least 2 non-overlapping peptides
      • At least 10% sequence coverage
  • Robust, fast convergence
  • Easily enforce additional constraints
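
The selection criteria above translate fairly directly into code. The following is a sketch under assumptions: the slide does not define "high confidence", so the 0.9 cutoff is a placeholder, peptides are located by their first exact match in the protein sequence, and `select_proteins` and the sample sequences are invented for illustration.

```python
def select_proteins(proteins, peptide_ids, min_confidence=0.9,
                    min_peptides=2, min_coverage=0.10):
    """Return IDs of proteins supported by enough confident peptide IDs.

    proteins: {protein_id: sequence}
    peptide_ids: list of (protein_id, peptide, confidence)
    min_confidence is an assumed cutoff; the slide only says "high confidence".
    """
    selected = set()
    for protein_id, sequence in proteins.items():
        # Half-open [start, end) spans of confident peptides (first occurrence only).
        spans = set()
        for pid, peptide, confidence in peptide_ids:
            if pid != protein_id or confidence < min_confidence:
                continue
            start = sequence.find(peptide)
            if start >= 0:
                spans.add((start, start + len(peptide)))
        # Greedy by earliest end gives the largest set of non-overlapping spans.
        non_overlapping, last_end = 0, -1
        covered = set()
        for start, end in sorted(spans, key=lambda span: span[1]):
            covered.update(range(start, end))
            if start >= last_end:
                non_overlapping += 1
                last_end = end
        coverage = len(covered) / len(sequence) if sequence else 0.0
        if non_overlapping >= min_peptides and coverage >= min_coverage:
            selected.add(protein_id)
    return selected


# Made-up sequences and peptide IDs.
toy_proteins = {
    "P1": "ACDEFGHIKLMNPQRSTVWYACDEFGHIKLMNPQRSTVWY",
    "P2": "MNPQRSTVWYACDEFGHIKL",
    "P3": "GHIKLMNPQRSTVWYACDEF",
}
toy_ids = [
    ("P1", "ACDEFGHIK", 0.99), ("P1", "LMNPQR", 0.95),   # two adjacent peptides
    ("P2", "STVWY", 0.95), ("P2", "ACDEFGHIK", 0.40),     # second ID is low confidence
    ("P3", "GHIKLMNPQR", 0.97), ("P3", "KLMNPQR", 0.96),  # overlapping peptides
]
print(select_proteins(toy_proteins, toy_ids))
```

Extra constraints (for example, requiring tryptic termini) could be added as further filters inside the inner loop without changing the overall shape.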

23
What about real data?
  • Dr. Rado Goldman (LCCC, GUMC)
  • Proteolytic serum peptides from clinical
    hepatocellular carcinoma samples
  • 200 MALDI MS/MS Spectra (TOF-TOF)
  • PepArML for non-specific search of IPI-Human
  • Increase in confidence and sensitivity
  • Observation of ragged proteolytic trimming

24
Protein Identification Example
[Figure: protein identification example, with columns labeled M, T, O]

25
Future Directions
  • Apply to more experimental datasets
  • Integrate
      • novel features
      • new search engines, spectral matching
      • multiple searches with varied parameters,
        sequence databases
  • Construct meta-search engine
  • FDR by bimodal fit instead of decoys
  • Release as open source
      • http://peparml.sourceforge.org

26
http://PepArML.SourceForge.Net
27
Acknowledgements
  • Xue Wu, Dr. Chau-Wen Tseng
      • Computer Science, University of Maryland, College
        Park
  • Dr. Brian Balgley, Dr. Paul Rudnick
      • Calibrant Biosystems, NIST
  • Dr. Rado Goldman, Dr. Yanming An
      • Department of Oncology, Georgetown University
        Medical Center
  • Kam Ho To
      • Biochemistry Masters student, Georgetown
        University
  • Funding: NIH/NCI CPTAC

29
PepArML vs Search Engines (S17)
30
PepArML vs Search Engines (S17)
31
PepArML Pairs vs PepArML (C8)
32
PepArML Pairs vs PepArML (S17)
33
PepArML Pairs vs PepArML (S17)
34
Unsupervised Learning (C8)
35
Unsupervised Learning (C8)