Improving Statistical Significance Assignment in Mass Spectrometry Based Peptide Identification

1
Improving Statistical Significance Assignment in
Mass Spectrometry Based Peptide Identification
Yi-Kuo Yu
Quantitative Molecular Biological Physics (QMBP) Group
National Center for Biotechnology Information
National Library of Medicine, National Institutes of Health
  • Overview
  • Statistical Significance in Peptide Identification
  • Statistics through de novo method
  • Example
  • New way to Combine Search Results

2
QMBP Research using Biowulf
  • Molecular Dynamics
      • Protein Folding Simulations
  • Molecular Networks
      • Information Transduction in protein-protein interaction networks
  • Molecular Interactions
      • Exact electrostatic force/energy
  • Mass Spectrometry
      • Statistics of peptide/protein ID

3
Mass Spect. Task force
LCDR Gelio Alves
Dr. Aleksey Ogurtsov

Relevant References
  1. Gelio Alves and Yi-Kuo Yu.
     Statistical Characterization of a 1D Random Potential Problem with
     applications in score statistics of MS-based peptide sequencing.
     Physica A 387 (2008), 6538-6544. doi:10.1016.
  2. G. Alves, A. Ogurtsov, Wells W. Wu, Guanghui Wang, R.-F. Shen and Yi-Kuo Yu.
     Calibrating E-values for MS2 Database Search Methods.
     Biology Direct 2 (2007), 26.
  3. Gelio Alves, Wells W. Wu, Guanghui Wang, Rong-Fong Shen and Yi-Kuo Yu.
     Enhancing Peptide Identification Confidence by Combining Search Methods.
     Journal of Proteome Research 7 (2008), 3102-3113.

4
Overview: MS-based Proteomics
Protein identification is important for Proteomics/Systems Biology.
  • Important issues
  • (1) Protein ID in a mixture,
  • (2) Protein Circuit / Localization,
  • (3) Signaling and Communication.

It is desirable to understand which proteins are involved.
[Figure: a generic pathway]
5
What can mass spec do?
Protein identification through peptide identification.
MS/MS produces partial-peptide fragments (a, b, c ions and x, y, z ions)
and thus provides more information about the peptide for sequencing.
Given a set of MS/MS spectra, one may identify the peptides involved,
by database searches or de novo sequencing, and then infer the proteins
involved.
6
What is the problem?
Confidence assignment in peptide identifications (how to confidently
interpret biological experiments):
  • Where to draw the line when selecting peptide candidates?
  • How to rank peptide candidates across spectra?
  • How to compare results analyzed using different search methods?
    (Does a top hit in method M1 carry the same meaning as that in
    method M2?)
  • How to compare results from different experiments?

A possible solution is to have robust statistical significance
assignment that provides
  • a quantifiable confidence measure for peptide ID
  • the flexibility to compare results from different
    spectra and even from different search methods.


7
Solid Statistics (E-values) might be our best rescue
In the context of peptide searches, both the E- and P-values may be
viewed as monotonically decreasing functions of some algorithm-dependent
quality score S.
  • For a given quality score cutoff, the P-value is the probability of
    finding a random hit with quality score greater than or equal to
    the cutoff.
  • The E-value is defined as the expected number of hits in a random
    database with quality score greater than or equal to the cutoff:
    E = P × (random_db_size)
    (equivalent to the Bonferroni correction).
Key assumption needed: aside from the true peptides, the rest of the
peptides in the database appear to be random with respect to any given
MS/MS spectrum.
Using correct E-values, we can compare search results from different
spectra and even different search methods!
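The relation between P-value and E-value stated on this slide can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and `random_db_size` is assumed here to mean the number of random candidate peptides scored against the spectrum.

```python
def e_value(p_value, random_db_size):
    """Expected number of random hits at or above the score cutoff.

    Assumption: random_db_size is the number of random candidate
    peptides scored against the spectrum.
    """
    return p_value * random_db_size


def prob_at_least_one_random_hit(p_value, random_db_size):
    """Probability of one or more random hits among independent candidates.

    This exact value never exceeds the E-value, which is why multiplying
    P by the database size acts as a Bonferroni-style correction.
    """
    return 1.0 - (1.0 - p_value) ** random_db_size


if __name__ == "__main__":
    p, n = 1e-6, 1000
    print(e_value(p, n))
    print(prob_at_least_one_random_hit(p, n))
```

For small E-values the two quantities nearly coincide, so an E-value well below 1 can be read approximately as a database-corrected P-value.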
8
Aren't there many methods reporting E-values already? Why not just use
them?
Apparently, most reported E-values deviate from the textbook definition.
9
To circumvent the statistical inaccuracy:
(1) We developed RAId_DbS, a new search method that has satisfactory
statistics (see below) without losing performance (see ROC curves to
the right).

[Figure panels: using profile data; using centroid data]
10
(2) We provide a protocol to calibrate E-values.
There exist methods that do not report E-values. To compare the search
results from these methods, one needs to calibrate their statistics; see
G. Alves, A. Y. Ogurtsov, Y.-K. Yu, Calibrating E-values for MS2
Database Search Methods, Biology Direct 2 (2007), 26.
Problem: may lose spectrum-specific statistics.
11
(3) Statistical calibration leads to a way to combine search results
from different methods (but cannot enforce spectrum-specific
statistics); see Alves et al., Enhancing Peptide Identification
Confidence by Combining Search Methods, J. Proteome Res. 7 (2008),
3102-3113.
Other advantage of having accurate E-values: a simple connection to the
False Discovery Rate (FDR) (formula shown on the slide), where Ec is the
E-value cutoff, N is the total number of spectra, and H(Ec) is the
cumulative number of hits with E-value smaller than or equal to Ec.
No need to search a decoy database to get the FDR!
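The FDR formula itself is not preserved in this transcript. As a hedged illustration only, one estimate consistent with the three quantities defined on the slide is FDR(Ec) ≈ N × Ec / H(Ec): with accurate E-values, about Ec random hits per spectrum are expected at cutoff Ec, so N spectra yield about N × Ec false hits among the H(Ec) reported. The formula and the function name below are assumptions, not taken from the slide.

```python
def estimated_fdr(hit_e_values, n_spectra, e_cutoff):
    """Estimate FDR at an E-value cutoff without a decoy database.

    ASSUMED formula (not preserved in the transcript):
        FDR(Ec) ~= N * Ec / H(Ec)
    where N is the number of spectra and H(Ec) counts hits with
    E-value <= Ec.
    """
    hits = sum(1 for e in hit_e_values if e <= e_cutoff)  # H(Ec)
    if hits == 0:
        return 0.0  # nothing reported at this cutoff
    return min(1.0, n_spectra * e_cutoff / hits)


if __name__ == "__main__":
    # Made-up E-values for the hits of five spectra, for illustration.
    e_values = [1e-4, 1e-3, 0.01, 0.5, 2.0]
    print(estimated_fdr(e_values, n_spectra=5, e_cutoff=0.01))
```

The estimate is only as good as the E-values fed into it, which is the slide's argument for getting the statistics right first.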
12
Spectrum-Specific Statistics
Why spectrum-specific statistics?
Fragment peaks depend on
  • the parent ion charge state,
  • the presence of co-eluted materials and their physical interactions
    with each other,
  • the relative kinetic energy of the inert gas (CID) or of the
    electrons (ECD, ETD),
  • the peptide/co-eluted material concentrations,
  • the peptide/co-eluted material conformation in the gaseous phase,
    etc.
13
The complication
  • Spectrum-specific noise demands spectrum-specific statistics.
  • Not every search method can do this.
  • Only two known methods use spectrum-specific statistics:
      • X!Tandem (fitted empirically)
      • RAId_DbS (derived theoretically)
  • Recently, SEQUEST developers have also investigated the possibility
    of using spectrum-specific XCorr statistics.

14
A new approach: obtaining a statistical standard from scoring all
possible peptides
Merit: bypasses the need for a decoy database (when FDR is considered)
and the need for E-value calibration.
Challenge: the astronomically large number of peptides to score. For a
peptide of molecular weight 2300 Da there are about 10^26 possible
peptides. Scoring 10^9 peptides per second would take 3.2 × 10^9 years!
Solution: see our recent paper, Physica A 387 (2008), 6538-6544.
doi:10.1016.
[Figure labels: all possible human tryptic peptides; all possible
tryptic peptides]
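The time estimate on this slide can be checked with a short calculation. It assumes the slide's figures are 10^26 candidate peptides and 10^9 peptides scored per second (the exponents are flattened in the transcript), and a year of 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 seconds

n_peptides = 1e26        # assumed count for a 2300 Da peptide
peptides_per_second = 1e9

seconds_needed = n_peptides / peptides_per_second
years_needed = seconds_needed / SECONDS_PER_YEAR

# Comes out near 3.2e9 years, in line with the figure on the slide.
print(f"{years_needed:.2e} years")
```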
15
A new approach: obtaining a statistical standard from scoring all
possible peptides (cont.)
This dynamic programming algorithm can score all possible peptides in a
few seconds. See Physica A 387 (2008), 6538-6544.
The algorithm is also capable of incorporating internal structures, such
as peptide lengths, hydrophobicity, etc., by extending the dimension of
the internal array.
Scoring functions: RAId_DbS, Hyperscore (X!Tandem), K-score, XCorr, WP.
A similar algorithm was proposed independently by Pevzner's group,
J. Proteome Res. 7 (2008), 3354-3363.
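The flavor of such a dynamic program can be shown with a toy sketch. It is not the algorithm from the cited papers. It assumes integer residue masses and an additive integer score in which every prefix mass that lands on a spectrum peak earns that peak's weight. All names and values are hypothetical.

```python
from collections import Counter


def score_histogram(residue_masses, peak_scores, target_mass):
    """Count residue sequences of total mass target_mass by score.

    table[m] maps score -> number of sequences whose masses sum to m.
    Every sequence ending at mass m gains peak_scores.get(m, 0), so the
    work grows with target_mass, not with the number of sequences.
    """
    table = [Counter() for _ in range(target_mass + 1)]
    table[0][0] = 1  # the empty sequence: mass 0, score 0
    for m in range(1, target_mass + 1):
        gain = peak_scores.get(m, 0)
        for r in residue_masses:
            if r <= m:
                # Extend every sequence of mass m - r by one residue.
                for score, count in table[m - r].items():
                    table[m][score + gain] += count
    return table[target_mass]


if __name__ == "__main__":
    # Toy alphabet with residue masses 1 and 2, one peak at mass 1.
    hist = score_histogram([1, 2], {1: 1}, 3)
    print(dict(sorted(hist.items())))  # {0: 1, 1: 2}
```

In the toy run, the three sequences of total mass 3 are (1,1,1), (1,2) and (2,1). The first two pass through the peak at mass 1 and score 1, and the last scores 0. Extending the table index with extra coordinates such as peptide length would correspond to the "extending the dimension of the internal array" remark on the slide.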
16
The P-value of each candidate peptide from the 50 MB random database is
inferred from the de novo score histogram of all possible peptides.
[Figure labels: ISB, centroid data (RAId_DbS strategy)]
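Reading a P-value off a score histogram can be sketched as a tail sum. The sketch assumes every possible peptide carries equal weight, which may differ from the weighting used in the actual method. The function name and the toy histogram are hypothetical.

```python
def p_value_from_histogram(histogram, score):
    """Fraction of all scored peptides with score >= the given score.

    histogram maps score -> number of peptides with that score.
    Assumption: each peptide is weighted equally.
    """
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    tail = sum(count for s, count in histogram.items() if s >= score)
    return tail / total


if __name__ == "__main__":
    toy_histogram = {0: 1, 1: 2}  # made-up counts for illustration
    print(p_value_from_histogram(toy_histogram, 1))
```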
17
Combining search results
For a given spectrum s, search a database using methods M1 and M2;
these return hit lists L1(s) and L2(s), respectively, along with
database P-values Pdb.

L1(s)
  Peptide    Pdb
  SAMPLER    1.4e-4
  TVPMRQK    4.6e-2
  HVGTMHK    0.13

L2(s)
  Peptide    Pdb
  GAMHLER    3.4e-6
  TVPMRQK    1.6e-3
  VGTMGSK    0.06

L1(s) U L2(s)
  Peptide    Pdb (M1)   Pdb (M2)
  GAMHLER    1.0        3.4e-6
  SAMPLER    1.4e-4     1.0
  TVPMRQK    4.6e-2     1.6e-3
  VGTMGSK    1.0        0.06
  HVGTMHK    0.13       1.0

A peptide not present in a report list is assigned a database P-value
of 1.
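The union step on this slide can be sketched as follows, using the slide's own example values. The function name is hypothetical, and the sketch covers only the merging of the lists, not how the paired P-values are subsequently combined.

```python
def merge_hit_lists(hits_m1, hits_m2, missing_p=1.0):
    """Union of two hit lists, as (P from M1, P from M2) per peptide.

    A peptide absent from one method's list gets database P-value 1
    for that method.
    """
    peptides = list(hits_m1) + [p for p in hits_m2 if p not in hits_m1]
    return {
        p: (hits_m1.get(p, missing_p), hits_m2.get(p, missing_p))
        for p in peptides
    }


if __name__ == "__main__":
    l1 = {"SAMPLER": 1.4e-4, "TVPMRQK": 4.6e-2, "HVGTMHK": 0.13}
    l2 = {"GAMHLER": 3.4e-6, "TVPMRQK": 1.6e-3, "VGTMGSK": 0.06}
    for peptide, (p1, p2) in merge_hit_lists(l1, l2).items():
        print(f"{peptide}  {p1:<8g} {p2:g}")
```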
18
Remarks and Acknowledgement
It is anticipated that combining search methods that are orthogonal to
each other might be most advantageous. It is easy to check the
correlation between the information utilized by various scoring methods.
RAId_denovo can be accessed from
http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/
(a standalone version will be available for download this summer).
We thank the administrative group of the Biowulf computers for constant
technical support, which considerably helped our computational progress
in improving the peptide identification statistics over the past few
years.
We thank Dr. R.-F. Shen for providing various peptide MS/MS data.