Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

1
Statistical Significance for Peptide
Identification by Tandem Mass Spectrometry
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
Mass Spectrometry for Proteomics
  • Measure mass of many (bio)molecules
    simultaneously
  • High bandwidth
  • Mass is an intrinsic property of all
    (bio)molecules
  • No prior knowledge required

3
Mass Spectrometry for Proteomics
  • Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias
  • Mass is an intrinsic property of all
    (bio)molecules
  • ...but need a reference to compare to

4
High Bandwidth
5
Mass is fundamental!
6
Mass Spectrometry for Proteomics
  • Mass spectrometry has been around since the turn
    of the 20th century...
  • ...why is MS based Proteomics so new?
  • Ionization methods
  • MALDI, Electrospray
  • Protein chemistry automation
  • Chromatography, Gels, Computers
  • Protein sequence databases
  • A reference for comparison

7
Sample Preparation for Peptide Identification
8
Single Stage MS
Figure: MS spectrum (x-axis: m/z)
9
Tandem Mass Spectrometry (MS/MS)
Figure: precursor selection from the MS spectrum (x-axis: m/z)
10
Tandem Mass Spectrometry (MS/MS)
Figure: precursor selection and collision induced dissociation (CID),
giving the MS/MS spectrum (x-axis: m/z)
11
Peptide Fragmentation
Peptides consist of amino acids arranged in a
linear backbone.
N-terminus: H-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-OH :C-terminus
Side chains R_{i-1}, R_i, R_{i+1} belong to AA residues i-1, i, i+1.
12
Peptide Fragmentation
13
Peptide Fragmentation
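Figure: cleaving the backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH- after
residue i+1 yields the b_{i+1} ion (N-terminal part) and the y_{n-i-1} ion
(C-terminal part).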
14
Peptide Fragmentation
15
Peptide Fragmentation
Peptide (N- to C-terminus): S G F L E E D E L K
Residue   b ion m/z   y ion m/z
S              88        1166
G             145        1080
F             292        1022
L             405         875
E             534         762
E             663         633
D             778         504
E             907         389
L            1020         260
K            1166         147
Figure: MS/MS spectrum, intensity (0 to 100) vs. m/z (250 to 1000)
16
Peptide Fragmentation
Same b and y ion table as the previous slide.
Figure: MS/MS spectrum, intensity (0 to 100) vs. m/z (250 to 1000),
with peaks labeled b3 to b9 and y2 to y9
17
Peptide Identification
  • For each (likely) peptide sequence
  • 1. Compute fragment masses
  • 2. Compare with spectrum
  • 3. Retain those that match well
  • Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI's nr, ...
  • Automated, high-throughput peptide identification
    in complex mixtures
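The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any search engine's actual scoring: only singly charged b and y ions are computed, the residue masses are standard monoisotopic values, and the function names, the 0.5 Da tolerance, the candidate list and the observed peaks are all hypothetical. The example peptide SGFLEEDELK is the one from the fragmentation slides; the computed b1 to b9 and y1 to y9 values agree with the slide's table to within about 1 Da.

```python
# Monoisotopic residue masses (Da) for the amino acids used in this example.
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259, "F": 147.06841,
}
PROTON = 1.00728
WATER = 18.01056


def fragment_masses(peptide):
    """Step 1: return (b_ions, y_ions) m/z lists for singly charged fragments.

    b_i covers the first i residues, y_i the last i residues (i = 1..n-1).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, peaks, tolerance=0.5):
    """Step 2: count theoretical fragments within `tolerance` of any peak."""
    b_ions, y_ions = fragment_masses(peptide)
    return sum(
        any(abs(frag - peak) <= tolerance for peak in peaks)
        for frag in b_ions + y_ions
    )


def rank_candidates(candidates, peaks, tolerance=0.5):
    """Step 3: score every candidate and return (score, peptide), best first."""
    scored = [(count_matches(p, peaks, tolerance), p) for p in candidates]
    return sorted(scored, reverse=True)


# Hypothetical observed peaks and hypothetical candidate peptides.
peaks = [260.2, 292.1, 405.2, 534.3, 633.3, 762.4, 875.4, 907.4]
candidates = ["SGFLEEDELK", "SGFLEEDLEK", "KLEDEELFGS"]
ranking = rank_candidates(candidates, peaks)
for score, peptide in ranking:
    print(peptide, score)
```

A plain count of matched fragments is the simplest possible score; note that the near-identical candidate SGFLEEDLEK still matches most of the peaks, which is why a ranking alone does not say whether the top hit is correct.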

18
High Quality Peptide Identification: E-value < 10^-8
19
Moderate quality peptide identification: E-value < 10^-3
20
Amino-Acid Molecular Weights
21
Peptide Identification
  • Peptide fragmentation by CID is poorly understood
  • MS/MS spectra represent incomplete information
    about amino-acid sequence
  • I/L, K/Q, GG/N, ...
  • Correct identifications don't come with a
    certificate!
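The ambiguities I/L, K/Q and GG/N come down to residue masses that are equal or nearly equal. The sketch below makes that concrete; the masses are standard monoisotopic residue masses, while the helper names and the tolerances are assumptions made for illustration.

```python
# Monoisotopic residue masses (Da), listed only for the residues compared here.
RESIDUE_MASS = {
    "G": 57.02146, "N": 114.04293,
    "I": 113.08406, "L": 113.08406,
    "K": 128.09496, "Q": 128.05858,
}


def mass(residues):
    """Total residue mass of a short string of amino-acid letters."""
    return sum(RESIDUE_MASS[aa] for aa in residues)


def indistinguishable(a, b, tolerance):
    """True if a and b differ in mass by no more than `tolerance` Da."""
    return abs(mass(a) - mass(b)) <= tolerance


for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
    diff = abs(mass(a) - mass(b))
    print(f"{a} vs {b}: mass difference {diff:.5f} Da")
```

I and L are isomers and cannot be separated by mass at any tolerance, whereas K and Q differ by roughly 0.036 Da and are only confused when the mass tolerance is wider than that.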

22
Peptide Identification
  • High-throughput workflows demand we analyze all
    spectra, all the time.
  • Spectra may not contain enough information to be
    interpreted correctly
  • "bad static on a cell phone"
  • Peptides may not match our assumptions
  • "It's all Greek to me"
  • "Don't know" is an acceptable answer!

23
Peptide Identification
  • Rank the best peptide identifications
  • Is the top ranked peptide correct?

24
Peptide Identification
  • Rank the best peptide identifications
  • Is the top ranked peptide correct?

25
Peptide Identification
  • Rank the best peptide identifications
  • Is the top ranked peptide correct?

26
Peptide Identification
  • Incorrect peptide has best score
  • Correct peptide is missing?
  • Potential for incorrect conclusion
  • What score ensures no incorrect peptides?
  • Correct peptide has weak score
  • Insufficient fragmentation, poor score
  • Potential for weakened conclusion
  • What score ensures we find all correct peptides?

27
Statistical Significance
  • Can't prove particular identifications are right
    or wrong...
  • ...need to know fragmentation in advance!
  • A minimal standard for identification scores...
  • ...better than guessing.
  • p-value, E-value, statistical significance

28
Pin the tail on the donkey
29
Probability Concepts
  • Throwing darts
  • One at a time
  • Blindfolded
  • Uniform distribution?
  • Independent?
  • Identically distributed?
  • Pr[Dart hits 20] = 0.05

30
Probability Concepts
  • Throwing darts
  • One at a time
  • Blindfolded
  • Three darts
  • Pr[Hitting 20 3 times]
  • = 0.05 x 0.05 x 0.05 = 0.000125
  • Pr[Hit 20 at least twice]
  • = 0.007125 + 0.000125 = 0.00725
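The three-dart numbers can be checked by brute force. The sketch below assumes the model from the previous slide: 20 equally likely sectors (so Pr[hit 20] = 1/20 = 0.05) and independent throws; it simply enumerates all 20^3 outcomes. The variable names are illustrative.

```python
from itertools import product

SECTORS = range(1, 21)  # 20 equally likely sectors, so Pr[hit 20] = 0.05

# Every possible result of three independent throws.
outcomes = list(product(SECTORS, repeat=3))
total = len(outcomes)  # 20**3 = 8000

three_hits = sum(1 for o in outcomes if o.count(20) == 3)
exactly_two = sum(1 for o in outcomes if o.count(20) == 2)
at_least_two = sum(1 for o in outcomes if o.count(20) >= 2)

p_three = three_hits / total
p_exactly_two = exactly_two / total
p_at_least_two = at_least_two / total

print(p_three, p_exactly_two, p_at_least_two)
```

The term 0.007125 is the probability of exactly two 20s (3 x 0.05 x 0.05 x 0.95); adding the all-three case gives the "at least twice" probability.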

31
Probability Concepts
32
Probability Concepts
  • Throwing darts
  • One at a time
  • Blindfolded
  • 100 darts
  • Pr[Hitting 20 exactly 3 times]
  • = 0.139575
  • Pr[Hit 20 at least twice]
  • = 0.9629188
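With 100 darts enumeration is no longer practical, but the binomial distribution gives the same quantities directly. A minimal sketch, assuming independent throws with Pr[hit 20] = 0.05; the function names are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Pr[exactly k successes in n independent trials with success prob. p]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_tail(k, n, p):
    """Pr[at least k successes], computed as 1 minus the lower tail."""
    return 1.0 - sum(binomial_pmf(j, n, p) for j in range(k))


p_exactly_three = binomial_pmf(3, 100, 0.05)
p_at_least_two = binomial_tail(2, 100, 0.05)

print(f"{p_exactly_three:.6f} {p_at_least_two:.7f}")
```

Here 0.139575 is the probability of exactly three 20s in 100 throws; with only three darts the same event had probability 0.000125.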

33
Probability Concepts
34
Match Score
  • Dartboard represents the mass range of the
    spectrum
  • Peaks of a spectrum are slices
  • Width of slice corresponds to mass tolerance
  • Darts represent
  • random masses
  • masses of fragments of a random peptide
  • masses of peptides of a random protein
  • masses of biomarkers from a random class
  • How many darts do we get to throw?

35
Match Score
  • What is the probability that we match at least 5
    peaks?

Figure labels (m/z): 270, 330, 870, 550, 755, 580
36
Match Score
  • Pr[Match ≥ s peaks]
  • = Binomial(p, n)
  • ≈ Poisson(p n), for small p and large n
  • p is prob. of random mass / peak match,
  • n is number of darts (fragments in our answer)
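To see how close the two models are, the sketch below compares the binomial and Poisson upper tails for one setting. The values of n, p and s are hypothetical, chosen only so that p is small and n moderately large; the function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_tail(s, n, p):
    """Pr[at least s random matches] under Binomial(n, p)."""
    lower = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(s))
    return 1.0 - lower


def poisson_tail(s, lam):
    """Pr[at least s random matches] under Poisson(lam)."""
    lower = sum(exp(-lam) * lam**j / factorial(j) for j in range(s))
    return 1.0 - lower


n = 50    # hypothetical number of fragment masses (darts)
p = 0.02  # hypothetical chance that one random mass matches some peak
s = 5     # number of matched peaks observed

exact = binomial_tail(s, n, p)
approx = poisson_tail(s, n * p)
print(f"binomial {exact:.5f}  poisson {approx:.5f}")
```

Both tails assume the darts are independent and identically distributed, which is the assumption questioned on the following slides.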

37
Match Score
  • Theoretical distribution
  • Used by OMSSA
  • Proposed, in various forms, by many.
  • Probability of random mass / peak match
  • IID (independent, identically distributed)
  • Based on match tolerance

38
Match Score
  • Theoretical distribution assumptions
  • Each dart is independent
  • Peaks are not related
  • Each dart is identically distributed
  • Chance of random mass / peak match is the same
    for all peaks

39
Tournament Size
Figure panels: 100, 1000, 10000, and 100000 people
(axis: number of 20s with 100 darts)
40
Tournament Size
Figure panels: 100, 1000, 10000, and 100000 people
(axis: number of 20s with 100 darts)
41
Number of Trials
  • Tournament size = number of trials
  • Number of peptides tried
  • Related to sequence database size
  • Probability that a random match score is ≥ s
  • = 1 - Pr[all match scores < s]
  • = 1 - (Pr[match score < s])^Trials (*)
  • Assumes IID!
  • Expect value
  • E = Trials x Pr[match ≥ s]
  • Corresponds to Bonferroni bound on (*)
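The effect of the number of trials can be shown numerically. In the sketch below the single-comparison probability and the trial counts are hypothetical, and the function names are illustrative; the formulas are the IID expression and the expect value from the bullets above.

```python
def prob_any_random_match(p_single, trials):
    """Pr[at least one of `trials` IID random scores reaches s].

    p_single is Pr[match score >= s] for one random peptide.
    """
    return 1.0 - (1.0 - p_single) ** trials


def e_value(p_single, trials):
    """Expected number of random peptides scoring >= s."""
    return trials * p_single


p_single = 1e-6  # hypothetical per-peptide probability of a score >= s
for trials in (100, 10_000, 1_000_000):
    p_any = prob_any_random_match(p_single, trials)
    e = e_value(p_single, trials)
    print(f"trials={trials:>9}  Pr[any]={p_any:.6f}  E={e:.6f}")
```

The expect value is always at least as large as the exact probability, which is the sense in which it acts as a Bonferroni bound; the two are nearly equal while E is much smaller than 1 and diverge as the database grows.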

42
Better Dart Throwers
43
Better Random Models
  • Comparison with completely random model isn't
    really fair
  • Match scores for real spectra with real peptides
    obey rules
  • Even incorrect peptides match with non-random
    structure!

44
Better Random Models
  • Want to generate random fragment masses (darts)
    that behave more like the real thing
  • Some fragments are more likely than others
  • Some fragments depend on others
  • Theoretical models can only incorporate this
    structure to a limited extent.

45
Better Random Models
  • Generate random peptides
  • Real looking fragment masses
  • No theoretical model!
  • Must use empirical distribution
  • Usually require they have the correct precursor
    mass
  • Score function can model anything we like!

46
Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
47
Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
48
Better Random Models
  • Truly random peptides don't look much like real
    peptides
  • Just use peptides from the sequence database!
  • Caveats
  • Correct peptide (non-random) may be included
  • Peptides are not independent
  • Reverse sequence avoids only the first problem

49
Extrapolating from the Empirical Distribution
  • Often, the empirical shape is consistent with a
    theoretical model

Geer et al., J. Proteome Research, 2004
Fenyo & Beavis, Anal. Chem., 2003
50
False Positive Rate Estimation
  • Each spectrum is a chance to be right, wrong, or
    inconclusive.
  • How many decisions are wrong?
  • Given identification criteria
  • SEQUEST Xcorr, E-value, Score, etc., plus...
  • ...threshold
  • Use decoy sequences
  • random, reverse, cross-species
  • Identifications must be incorrect!

51
False Positive Rate Estimation
  • # FP in real search ≈ # hits in decoy search
  • Need same size database, or rate conversion
  • FP Rate = decoy hits / real hits
  • FP Rate = 2 x decoy hits / (real hits + decoy hits)
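Both estimates are simple counts once a score threshold is fixed. In the sketch below the scores, the threshold and the function names are hypothetical, and higher scores are assumed to be better. The second form is the one commonly used when real and decoy sequences are searched together in one database; the slide itself does not say which setting each formula belongs to.

```python
def count_hits(scores, threshold):
    """Number of identifications whose score passes the threshold."""
    return sum(1 for s in scores if s >= threshold)


def fp_rate_separate(real_scores, decoy_scores, threshold):
    """decoy hits / real hits, for equally sized real and decoy databases."""
    real = count_hits(real_scores, threshold)
    decoy = count_hits(decoy_scores, threshold)
    return decoy / real if real else 0.0


def fp_rate_combined(real_scores, decoy_scores, threshold):
    """2 x decoy hits / (real hits + decoy hits)."""
    real = count_hits(real_scores, threshold)
    decoy = count_hits(decoy_scores, threshold)
    total = real + decoy
    return 2 * decoy / total if total else 0.0


# Hypothetical best-hit scores, one per spectrum.
real_scores = [5, 4, 3, 2, 1, 6, 7, 8]
decoy_scores = [1, 2, 3, 0.5]

for threshold in (2, 3, 4):
    print(
        threshold,
        fp_rate_separate(real_scores, decoy_scores, threshold),
        fp_rate_combined(real_scores, decoy_scores, threshold),
    )
```

Raising the threshold lowers the estimated FP rate at the cost of fewer accepted identifications, which is the trade-off a single global threshold has to settle.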

52
False Positive Rate Estimation
  • A form of statistical significance
  • In theory, E-value and a FP rate are the same.
  • Search engine independent
  • Easy to implement
  • Assumes a single threshold for all spectra
  • Spectrum/Peptide Identification scores are not
    iid!...
  • ...but E-values, in principle, are.

53
Peptide Prophet
  • From the Institute for Systems Biology
  • Keller et al., Anal. Chem. 2002
  • Re-analysis of SEQUEST results
  • Spectra are trials
  • Assumes that many of the spectra are not
    correctly identified

54
Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
55
Peptide Prophet
  • Assumes a bimodal distribution of scores, with a
    particular shape
  • Ignores database size
  • but it is included implicitly
  • Like empirical distribution for peptide sampling,
    can be applied to any score function
  • Can be applied to any search engine's results

56
Peptide Prophet
  • Caveats
  • Are spectra scores sampled from the same
    distribution?
  • Are there enough correct identifications for
    second peak?
  • Are spectra independent observations?
  • Are distributions appropriately shaped?
  • Huge improvement over raw SEQUEST results

57
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
58
Peptides to Proteins
59
Peptides to Proteins
  • A peptide sequence may occur in many different
    protein sequences
  • Variants, paralogues, protein families
  • Separation, digestion and ionization are not well
    understood
  • Proteins in sequence database are extremely
    non-random, and very dependent

60
Publication Guidelines
61
Publication Guidelines
  • Computational parameters
  • Spectral processing
  • Sequence database
  • Search program
  • Statistical analysis
  • Number of peptides per protein
  • Each peptide sequence counts once!
  • Multiple forms of the same peptide count once!
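The counting rule in the last two bullets amounts to counting distinct peptide sequences per protein, ignoring charge state and modification. A minimal sketch, assuming identifications arrive as (protein, sequence, charge, modification) tuples; that record layout, the sample data and the function name are all hypothetical.

```python
def distinct_peptides_per_protein(identifications):
    """Count distinct peptide sequences per protein.

    Different charge states or modified forms of one sequence count once.
    """
    per_protein = {}
    for protein, sequence, _charge, _modification in identifications:
        per_protein.setdefault(protein, set()).add(sequence)
    return {protein: len(seqs) for protein, seqs in per_protein.items()}


# Hypothetical identifications: (protein, sequence, charge, modification).
identifications = [
    ("protA", "SGFLEEDELK", 2, None),
    ("protA", "SGFLEEDELK", 3, None),         # same peptide, other charge
    ("protA", "SGFLEEDELK", 2, "phospho"),    # same peptide, modified form
    ("protA", "AAR", 1, None),
    ("protB", "TPLVNDK", 2, None),
]

counts = distinct_peptides_per_protein(identifications)
single_peptide_proteins = [p for p, c in counts.items() if c == 1]
print(counts, single_peptide_proteins)
```

Proteins left with a count of one are the single-peptide proteins that need explicit justification.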

62
Publication Guidelines
  • Single-peptide proteins must be explicitly
    justified by
  • Peptide sequence
  • N and C terminal amino-acids
  • Precursor mass and charge
  • Peptide Scores
  • Multiple forms of the peptide counted once!
  • Biological conclusions based on single-peptide
    proteins must show the spectrum

63
Publication Guidelines
  • More stringent requirements for PMF data
    analysis
  • Similar to that for tandem mass spectra
  • Management of protein redundancy
  • Peptides identified from a different species?
  • Spectra submission encouraged

64
Summary
  • Could guessing be as effective as a search?
  • More guesses improve the best guess
  • Better guessers help us be more discriminating
  • Peptides to proteins is not as simple as it seems
  • Publication guidelines reflect sound statistical
    principles.