Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

1
Statistical Significance for Peptide
Identification by Tandem Mass Spectrometry

Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park

2
Mass Spectrometry for Proteomics

Measure mass of many (bio)molecules simultaneously
High bandwidth
Mass is an intrinsic property of all (bio)molecules
No prior knowledge required

3
Mass Spectrometry for Proteomics

Measure mass of many molecules simultaneously
...but not too many, abundance bias
Mass is an intrinsic property of all (bio)molecules
...but need a reference to compare to

4
High Bandwidth
5
Mass is fundamental!
6
Mass Spectrometry for Proteomics

Mass spectrometry has been around since the turn of the 20th century...
...why is MS-based proteomics so new?
Ionization methods: MALDI, Electrospray
Protein chemistry automation: Chromatography, Gels, Computers
Protein sequence databases: A reference for comparison

7
Sample Preparation for Peptide Identification
8
Single Stage MS
MS
m/z
9
Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum (m/z axis) with precursor selection]
10
Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID)
[Figure: precursor spectrum and the resulting MS/MS spectrum (m/z axes)]
11
Peptide Fragmentation
Peptides consist of amino acids arranged in a linear backbone.
[Backbone diagram, N-terminus to C-terminus:]
H-HN-CH(R[i-1])-CO-NH-CH(R[i])-CO-NH-CH(R[i+1])-CO-OH
AA residue i-1, AA residue i, AA residue i+1 (side chains R[i-1], R[i], R[i+1])
12
Peptide Fragmentation
13
Peptide Fragmentation
[Figure: backbone segment -HN-CH-CO-NH-CH-CO-NH- with side chains R[i] and R[i+1]; fragment ions labeled b[i+1] and y[n-i-1]]
14
Peptide Fragmentation
15
Peptide Fragmentation
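Fragment ion ladder for the example peptide (residues listed from the N-terminus):

b ion m/z   Residue   y ion m/z
    88         S        1166
   145         G        1080
   292         F        1022
   405         L         875
   534         E         762
   663         E         633
   778         D         504
   907         E         389
  1020         L         260
  1166         K         147

[Figure: MS/MS spectrum; axes: intensity (0-100) vs. m/z (250-1000)]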
16
Peptide Fragmentation
[Figure: the same b and y ion ladder as on the previous slide, now with the spectrum peaks annotated b3 through b9 and y2 through y9; axes: intensity (0-100) vs. m/z (250-1000)]
17
Peptide Identification

For each (likely) peptide sequence
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
Peptide sequences from protein sequence databases
Swiss-Prot, IPI, NCBI's nr, ...
Automated, high-throughput peptide identification in complex mixtures
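
Step 1 above (compute fragment masses) can be sketched in a few lines. This is a minimal illustration rather than the method of any particular search engine. It assumes singly charged b and y ions, monoisotopic residue masses, and no modifications. The function name `fragment_masses` and the small residue table are illustrative. The example peptide SGFLEEDELK is the residue list from the Peptide Fragmentation slides read from the N-terminus, which is itself an assumption (it is consistent with the b ion at 88 for S).

```python
# Monoisotopic residue masses (Da), only for the residues used in the example.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_masses(peptide):
    """Return (b_ions, y_ions) m/z lists for singly charged fragments.

    b_ions[k-1] is the prefix of length k and y_ions[k-1] the suffix of
    length k, for k = 1 .. len(peptide) - 1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    suffix = 0.0
    for k in range(1, len(masses)):
        prefix += masses[k - 1]   # add the k-th residue from the N-terminus
        suffix += masses[-k]      # add the k-th residue from the C-terminus
        b_ions.append(prefix + PROTON)
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_masses("SGFLEEDELK")
print([round(m, 1) for m in b_ions])
print([round(m, 1) for m in y_ions])
```

Under these assumptions the computed b1 to b9 and y1 to y9 values fall within 1 Da of the ladder shown on the fragmentation slides. Step 2 would then compare each computed mass against the observed peaks within a mass tolerance.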

18
High Quality Peptide Identification: E-value < 10^-8
19
Moderate quality peptide identification: E-value < 10^-3
20
Amino-Acid Molecular Weights
21
Peptide Identification

Peptide fragmentation by CID is poorly understood
MS/MS spectra represent incomplete information about amino-acid sequence
I/L, K/Q, GG/N, ...
Correct identifications don't come with a certificate!

22
Peptide Identification

High-throughput workflows demand we analyze all spectra, all the time.
Spectra may not contain enough information to be interpreted correctly
bad static on a cell phone
Peptides may not match our assumptions
It's all Greek to me
"Don't know" is an acceptable answer!

23
Peptide Identification

Rank the best peptide identifications
Is the top ranked peptide correct?

24
Peptide Identification

Rank the best peptide identifications
Is the top ranked peptide correct?

25
Peptide Identification

Rank the best peptide identifications
Is the top ranked peptide correct?

26
Peptide Identification

Incorrect peptide has best score
Correct peptide is missing?
Potential for incorrect conclusion
What score ensures no incorrect peptides?
Correct peptide has weak score
Insufficient fragmentation, poor score
Potential for weakened conclusion
What score ensures we find all correct peptides?

27
Statistical Significance

Can't prove particular identifications are right or wrong...
...need to know fragmentation in advance!
A minimal standard for identification scores...
...better than guessing.
p-value, E-value, statistical significance

28
Pin the tail on the donkey
29
Probability Concepts

Throwing darts
One at a time
Blindfolded
Uniform distribution?
Independent?
Identically distributed?
Pr[ Dart hits 20 ] = 0.05

30
Probability Concepts

Throwing darts
One at a time
Blindfolded
Three darts
Pr[ Hitting 20 3 times ] = 0.05 × 0.05 × 0.05
Pr[ Hit 20 at least twice ] = 0.007125 + 0.000125
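
The two numbers in the last line are the terms of a binomial sum. As a worked check, write X for the number of darts (out of three) that hit 20, and assume independent throws that each hit 20 with probability 0.05:

```latex
\begin{align*}
\Pr[X = 3] &= 0.05^3 = 0.000125 \\
\Pr[X = 2] &= \binom{3}{2}\,(0.05)^2\,(0.95) = 0.007125 \\
\Pr[X \ge 2] &= 0.007125 + 0.000125 = 0.00725
\end{align*}
```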

31
Probability Concepts
32
Probability Concepts

Throwing darts
One at a time
Blindfolded
100 darts
Pr[ Hitting 20 exactly 3 times ] ≈ 0.139575
Pr[ Hit 20 at least twice ] ≈ 0.9629188
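
The 100-dart figures can be reproduced with the binomial distribution. The sketch below assumes independent, identically distributed throws with Pr[hit 20] = 0.05. The helper names `binom_pmf` and `binom_tail` are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k hits in n independent throws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(k, n, p):
    """Probability of at least k hits in n independent throws."""
    return 1.0 - sum(binom_pmf(j, n, p) for j in range(k))


p_hit = 0.05  # Pr[dart hits 20], from the slides
print(binom_pmf(3, 100, p_hit))   # exactly three 20s in 100 darts
print(binom_tail(2, 100, p_hit))  # at least two 20s in 100 darts
```

With more darts, an outcome that was rare for three darts (at least two 20s) becomes nearly certain, which is the point the later slides build on.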

33
Probability Concepts
34
Match Score

Dartboard represents the mass range of the spectrum
Peaks of a spectrum are slices
Width of slice corresponds to mass tolerance
Darts represent
random masses
masses of fragments of a random peptide
masses of peptides of a random protein
masses of biomarkers from a random class
How many darts do we get to throw?

35
Match Score

What is the probability that we match at least 5 peaks?

[Figure: example spectrum with the values 270, 330, 550, 580, 755, and 870 marked]
36
Match Score

Pr[ Match ≥ s peaks ]
Number of matches ~ Binomial( p , n )
≈ Poisson( p n ), for small p and large n
p is prob. of random mass / peak match,
n is number of darts (fragments in our answer)
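
A minimal sketch of this match-score calculation follows, comparing the binomial tail with its Poisson approximation. The values of n, p, and s are hypothetical and chosen only for illustration. In practice p would be derived from the match tolerance and the number of peaks, which is not modeled here. The function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_tail(s, n, p):
    """Pr[at least s matches] when the match count is Binomial(n, p)."""
    below = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s))
    return 1.0 - below


def poisson_tail(s, lam):
    """Pr[at least s matches] when the match count is Poisson(lam)."""
    below = sum(exp(-lam) * lam**k / factorial(k) for k in range(s))
    return 1.0 - below


# Hypothetical numbers, for illustration only.
n = 20     # darts: fragment masses of the candidate peptide
p = 0.02   # chance that one random mass lands within tolerance of a peak
s = 5      # number of matched peaks observed

print(binomial_tail(s, n, p))
print(poisson_tail(s, n * p))
```

Both functions return the probability of doing at least this well by chance under the IID dart model. A small value says the match is better than guessing, not that the identification is correct.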

37
Match Score

Theoretical distribution
Used by OMSSA
Proposed, in various forms, by many.
Probability of random mass / peak match
IID (independent, identically distributed)
Based on match tolerance

38
Match Score

Theoretical distribution assumptions
Each dart is independent
Peaks are not related
Each dart is identically distributed
Chance of random mass / peak match is the same for all peaks

39
Tournament Size
[Figure panels: 100 darts, number of 20s, for tournaments of 100, 1000, 10000, and 100000 people]
40
Tournament Size
[Figure panels: 100 darts, number of 20s, for tournaments of 100, 1000, 10000, and 100000 people]
41
Number of Trials

Tournament size = number of trials
Number of peptides tried
Related to sequence database size
Probability that at least one random match score is ≥ s
= 1 - Pr[ all match scores < s ]
= 1 - Pr[ match score < s ]^Trials (*)
Assumes IID!
Expect value
E = Trials × Pr[ match ≥ s ]
Corresponds to Bonferroni bound on (*)
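
The relationship between the two quantities on this slide can be sketched numerically. The code assumes IID trials, as the slide does. The single-trial probability and the trial counts are hypothetical, and the function names are illustrative.

```python
def prob_any_random_match(p_single, trials):
    """1 - (1 - p)^Trials: chance that at least one of the trials scores >= s."""
    return 1.0 - (1.0 - p_single) ** trials


def expect_value(p_single, trials):
    """E = Trials x Pr[match >= s]: expected number of random matches scoring >= s."""
    return trials * p_single


# Hypothetical single-trial probability; trial counts echo the tournament sizes.
p_single = 1e-6
for trials in (100, 1_000, 10_000, 100_000):
    print(trials, prob_any_random_match(p_single, trials), expect_value(p_single, trials))
```

The expect value is always at least as large as the exact probability, which is the sense in which it acts as a Bonferroni bound. The two are close when the expect value is small.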

42
Better Dart Throwers
43
Better Random Models

Comparison with a completely random model isn't really fair
Match scores for real spectra with real peptides obey rules
Even incorrect peptides match with non-random structure!

44
Better Random Models

Want to generate random fragment masses (darts) that behave more like the real thing
Some fragments are more likely than others
Some fragments depend on others
Theoretical models can only incorporate this structure to a limited extent.

45
Better Random Models

Generate random peptides
Real looking fragment masses
No theoretical model!
Must use empirical distribution
Usually require they have the correct precursor mass
Score function can model anything we like!
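
One way to sketch the empirical approach is shown below. It makes two choices the slides do not specify. Random peptides are produced by shuffling the residues of a candidate, which keeps the amino-acid composition and therefore the precursor mass unchanged. The p-value uses the add-one estimator (hits + 1) / (N + 1), so it is never exactly zero. The function names are illustrative, and the scores in the usage lines are made up.

```python
import random


def shuffled_peptides(peptide, count, seed=0):
    """Return `count` random permutations of the residues in `peptide`."""
    rng = random.Random(seed)
    residues = list(peptide)
    shuffled = []
    for _ in range(count):
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled


def empirical_p_value(observed_score, null_scores):
    """Fraction of null scores at least as good as the observed score (add-one)."""
    hits = sum(1 for score in null_scores if score >= observed_score)
    return (hits + 1) / (len(null_scores) + 1)


# Usage: score each shuffled peptide against the spectrum with any score
# function, then compare the real score with that empirical distribution.
decoys = shuffled_peptides("SGFLEEDELK", 5)
print(decoys)
print(empirical_p_value(5.0, [1.0, 2.0, 3.0, 4.0]))  # made-up scores
```

Because only the ranks of the scores matter, this works with any score function, at the cost of scoring many random peptides per spectrum.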

46
Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
47
Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
48
Better Random Models

Truly random peptides don't look much like real peptides
Just use peptides from the sequence database!
Caveats
Correct peptide (non-random) may be included
Peptides are not independent
Reverse sequence avoids only the first problem

49
Extrapolating from the Empirical Distribution

Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004
Fenyo & Beavis, Anal. Chem., 2003
50
False Positive Rate Estimation

Each spectrum is a chance to be right, wrong, or inconclusive.
How many decisions are wrong?
Given identification criteria
SEQUEST Xcorr, E-value, Score, etc., plus...
...threshold
Use decoy sequences
random, reverse, cross-species
Identifications must be incorrect!

51
False Positive Rate Estimation

# FP in real search ≈ # hits in decoy search
Need same size database, or rate conversion
FP Rate ≈ # decoy hits / # real hits
FP Rate ≈ 2 × # decoy hits / ( # real hits + # decoy hits )
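
The two estimates can be sketched as follows. The code assumes the decoy database is the same size as the real one. It also reads the second formula as applying to a single search over the combined real and decoy sequences, which is an interpretation rather than something the slide states. The function names and the hit counts in the usage lines are hypothetical.

```python
def hits_above(scores, threshold):
    """Count identifications whose score passes the threshold."""
    return sum(1 for score in scores if score >= threshold)


def fp_rate_separate(decoy_hits, real_hits):
    """Separate searches: decoy hits estimate the false positives among real hits."""
    if real_hits == 0:
        raise ValueError("no real hits at this threshold")
    return decoy_hits / real_hits


def fp_rate_combined(decoy_hits, real_hits):
    """Combined search: each decoy hit stands for one more false hit among the reals."""
    total = real_hits + decoy_hits
    if total == 0:
        raise ValueError("no hits at this threshold")
    return 2 * decoy_hits / total


# Hypothetical counts at a single score threshold.
print(fp_rate_separate(decoy_hits=5, real_hits=100))
print(fp_rate_combined(decoy_hits=5, real_hits=95))
```

Sweeping the threshold and recomputing the rate gives the score cutoff needed for a chosen false positive rate.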

52
False Positive Rate Estimation

A form of statistical significance
In theory, E-value and a FP rate are the same.
Search engine independent
Easy to implement
Assumes a single threshold for all spectra
Spectrum/peptide identification scores are not IID!...
...but E-values, in principle, are.

53
Peptide Prophet

From the Institute for Systems Biology
Keller et al., Anal. Chem. 2002
Re-analysis of SEQUEST results
Spectra are trials
Assumes that many of the spectra are not correctly identified

54
Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
55
Peptide Prophet

Assumes a bimodal distribution of scores, with a particular shape
Ignores database size...
...but it is included implicitly
Like the empirical distribution for peptide sampling, can be applied to any score function
Can be applied to any search engine's results
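
The bimodal idea can be illustrated with a two-component mixture and Bayes' rule. This is only a sketch of the general principle. Normal shapes are used for both components purely for simplicity, and the parameters are invented rather than fitted. The shapes and fitting procedure actually used by Peptide Prophet are described in Keller et al. and are not reproduced here. The function names are illustrative.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return exp(-0.5 * z * z) / (sd * sqrt(2.0 * pi))


def prob_correct(score, frac_correct, correct, incorrect):
    """Posterior Pr[correct | score] under a two-component mixture.

    `correct` and `incorrect` are (mean, sd) pairs; `frac_correct` is the
    assumed fraction of spectra that are correctly identified.
    """
    w_correct = frac_correct * normal_pdf(score, *correct)
    w_incorrect = (1.0 - frac_correct) * normal_pdf(score, *incorrect)
    return w_correct / (w_correct + w_incorrect)


# Hypothetical mixture parameters, not fitted to any data.
for score in (0.0, 2.0, 4.0):
    print(score, prob_correct(score, 0.3, (4.0, 1.0), (0.0, 1.0)))
```

Because the mixture is estimated from the search results themselves, database size never appears explicitly. It enters only through where the incorrect-score component ends up.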

56
Peptide Prophet