Title: Statistical calibration of MS/MS spectrum library search scores
1Statistical calibration of MS/MS spectrum library
search scores
- Barbara Frewen
- January 10, 2011
- University of Washington
2Protein identification
Peptides: EYWDYEAHMIEWGQIDDYQLVR GGTNIITLLDVVK VVVF LFDLLYFNGEPLV YQTTGQVQYSCLVR LIVVNSEDQLR HPLISLLLL IAFYSTSSEAFVPK
Protein Mixture
Proteins: B0205.7 casein kinase; C29A12.3a lig-1 DNA ligase; C29E6.1a mucin like protein
Digestion to Peptides
3Acquiring MS/MS spectra
Cell lysis
Isolate Proteins
Digest to Peptides
Load onto column
µLC/µLC
MS
MS/MS
4Which proteins are in my sample?
Peptides: EYWDYEAHMIEWGQIDDYQLVR GGTNIITLLDVVK VVVF LFDLLYFNGEPLV YQTTGQVQYSCLVR LIVVNSEDQLR HPLISLLLL IAFYSTSSEAFVPK
Protein Mixture
Proteins: B0205.7 casein kinase; C29A12.3a lig-1 DNA ligase; C29E6.1a mucin like protein
Digestion to Peptides
5Matching a spectrum to a peptide sequence
- De novo
- Infer peptide sequence from m/z of observed peaks
- Database search
- Compare observed peaks to predicted peaks for each peptide from a list of candidate sequences
- Library search
- Compare observed peaks to known spectra
6Building a spectrum library
- Ideally, infuse synthesized peptides
- ISB has gold standard spectra from five peptides per protein in human
- University of Washington (MacCoss) will have spectra from 790 transcription factors and 350 kinases
- Alternatively, use high-quality peptide-spectrum matches from shotgun proteomics experiments
- BiblioSpec now parses search results from SEQUEST, Mascot, X! Tandem, ProteinPilot, Scaffold
7Library file formats
BiblioSpec binary SQLite
compact ? ?
fast ?
flexible/extensible ?
accessible ?
8Using a spectrum library
Spectrum identification via library searching
Resource for designing SRM directed experiments
Compact, unified format for compiling results and
sharing between labs
9Searching a spectrum library
SEQUEST
BiblioSpec
Peptide ID list
Scan1 0.7 EGSSDEEVP
Scan1 0.3 TFAEILNPI
Scan1 0.2 ARFDLNNHD
-------------------
Scan2 0.5 EDEESIRAV
Scan2 0.2 WLGDDCFMV
Scan2 0.1 IDRAAWKAV
-------------------
Scan3 0.2 EITTRDMGN
Scan3 0.1 GRNMCTAKL
MS/MS query spectra
m/z 594.2
score 0.2
Library of identified spectra
2 GDTIENFK 300.4
1 CGCCLYNT 522.3
2 FMACSDEK 593.9
3 QWDKEPPR 765.1
3 NGISLTIVR 940.4
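A minimal sketch of the search illustrated above: each query is compared only to library spectra with a nearby precursor m/z, and candidates are ranked by a similarity score. The slides do not spell out the scoring function, so the normalized dot product of binned spectra used here is an assumption, as are the names `bin_spectrum`, `similarity`, `search_library`, the 1.0 m/z bin width, and the 3.0 m/z precursor tolerance.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumed value)."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def similarity(query_peaks, library_peaks, bin_width=1.0):
    """Normalized dot product between two binned spectra, in [0, 1] (assumed score)."""
    q = bin_spectrum(query_peaks, bin_width)
    lib = bin_spectrum(library_peaks, bin_width)
    dot = sum(v * lib.get(k, 0.0) for k, v in q.items())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_l = math.sqrt(sum(v * v for v in lib.values()))
    if norm_q == 0.0 or norm_l == 0.0:
        return 0.0
    return dot / (norm_q * norm_l)

def search_library(query_mz, query_peaks, library, mz_tolerance=3.0):
    """Score the query against library entries within the precursor tolerance.

    `library` is a list of dicts with hypothetical keys
    "sequence", "precursor_mz", and "peaks" (list of (m/z, intensity)).
    Returns (score, sequence) pairs, best match first.
    """
    hits = []
    for entry in library:
        if abs(entry["precursor_mz"] - query_mz) <= mz_tolerance:
            score = similarity(query_peaks, entry["peaks"])
            hits.append((score, entry["sequence"]))
    return sorted(hits, reverse=True)
```

With a query at m/z 594.2, only a library entry such as the one at 593.9 would fall inside the assumed tolerance and be scored.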
10Comparing library and database search
- Created a large library of spectra from worm peptides
- Identified a different set of spectra using both library and database search
- Compared BiblioSpec results with SEQUEST results to evaluate performance
- spectrum score library SEQUEST agree?
- 0.17 AFEQWK LVVAMK NO False positive
- 0.83 DLAVER DLAVER YES True positive
11Similarity score discriminates between correct
and incorrect matches
[Figure: histogram of search scores for matches that agree and disagree with SEQUEST; ROC curve, AUC 0.978]
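The area under the ROC curve can be read as the probability that a randomly chosen correct match scores higher than a randomly chosen incorrect one. The sketch below computes it directly from that definition; the function name `roc_auc` and the use of SEQUEST agreement as the correct/incorrect label are illustrative assumptions, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from scores and boolean labels.

    Counts, over all (correct, incorrect) pairs, how often the correct
    match has the higher score; ties count as half a win.
    """
    correct = [s for s, y in zip(scores, labels) if y]
    incorrect = [s for s, y in zip(scores, labels) if not y]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect match")
    wins = 0.0
    for c in correct:
        for i in incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(correct) * len(incorrect))
```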
12BiblioSpec and SEQUEST results agree
- BiblioSpec found 91% of SEQUEST IDs
- Two reasons BiblioSpec and SEQUEST disagree
- Query ion not in library
- BiblioSpec found a different peptide to be more similar
- Only 7% of query spectra not correctly identified were in library. Most disagreed because the correct match was not in library.
13Compute p-values to evaluate results
- The BiblioSpec search score provides good discrimination
- But it's unclear where to place a threshold between correct and incorrect matches
- Use statistical methods to estimate the probability that a match is incorrect and to estimate the fraction of incorrect matches above a score threshold.
14How likely is the match incorrect?
- Distribution of scores for a spectrum vs all possible incorrect matches
[Figure: null score distribution, x-axis: score. Low score: large area to right, p-value 0.4. High score: small area to right, p-value 0.01.]
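The "area to the right" can be approximated empirically when a sample of scores from incorrect matches is available. The sketch below is one common way to do this, not necessarily the method used in the talk; the name `empirical_p_value` and the add-one correction (which keeps the p-value from being exactly zero) are assumptions.

```python
def empirical_p_value(score, null_scores):
    """Fraction of null scores at or above the observed score.

    Adding one to numerator and denominator counts the observed score
    itself, so the result is never exactly zero.
    """
    at_or_above = sum(1 for s in null_scores if s >= score)
    return (at_or_above + 1) / (len(null_scores) + 1)
```

A low score leaves most of the null sample to its right and gets a large p-value; a high score leaves little and gets a small one.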
15Estimating the null distribution
- Representative sample of scores from incorrect matches
- Guarantee they are incorrect by using decoys
- In database searching, scores from decoy peptides are used to estimate the null distribution
- How can we create decoy spectra?
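Given decoy scores, the fraction of incorrect matches above a score threshold can be estimated by counting. This is a simple sketch, assuming the decoy set is the same size as the real library so that decoy hits above the threshold approximate the number of incorrect real hits; the function name `estimated_false_fraction` is hypothetical.

```python
def estimated_false_fraction(target_scores, decoy_scores, threshold):
    """Estimate the fraction of incorrect matches among real matches
    scoring at or above `threshold`.

    Assumes one decoy search per real search, so the count of decoy
    matches above the threshold estimates the count of incorrect real
    matches above it.
    """
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to be wrong about
    return min(1.0, decoys / targets)
```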
16Generate decoy spectra by shifting the m/z of
the peaks
- Requirements
- fast to generate
- sequence agnostic
- representative scores
- Evaluation
- score distributions mimic real spectra
- generate a data set of incorrect matches to real
spectra
real spectrum
decoy spectrum
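A decoy made by shifting peak m/z values needs no peptide sequence and takes one pass over the peaks. The sketch below wraps peaks that pass the top of the m/z range back to the bottom; the function name, the shift size, and the range bounds are all assumed parameters, since the slides do not give the values used.

```python
def circular_shift_decoy(peaks, shift, min_mz, max_mz):
    """Shift every peak's m/z by `shift`, wrapping around [min_mz, max_mz).

    `peaks` is a list of (m/z, intensity) pairs. Intensities are kept,
    so the decoy has the same number of peaks and intensity profile.
    """
    if max_mz <= min_mz:
        raise ValueError("max_mz must be greater than min_mz")
    span = max_mz - min_mz
    decoy = [
        (min_mz + ((mz - min_mz + shift) % span), intensity)
        for mz, intensity in peaks
    ]
    return sorted(decoy)  # return peaks in ascending m/z order
```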
17Circularly shifted peaks are similar to real
spectra
18Circularly shifted peaks are similar to real
spectra
19Percolator computes p-values
- Semi-supervised machine learning to classify correct versus incorrect matches
- Trains with high-scoring real matches vs decoy matches
- Classifies all real matches using that model
http://per-colator.com
Käll et al. 2007 Nature Methods; Käll et al. 2008 Bioinformatics
20Evaluate p-values
- Compute p-values for incorrect matches to real spectra
- Percolator p-values should correspond with rank-based p-values
ID           Percolator   rank  rank/n
745AF_8518   0.000230787  1     1/n
691AF_10025  0.000461467  2     2/n
691AF_10107  0.000692201  3     3/n
691AF_10301  0.000922934  4     4/n
...          ...          ...   ...
691AF_5048   0.001153669  12    12/n
...          ...          ...   ...
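The check rests on a standard property: p-values of incorrect matches should be uniformly distributed, so the i-th smallest of n should be close to i/n. A minimal sketch of building that comparison follows; the function name `rank_p_values` is hypothetical.

```python
def rank_p_values(calculated_p_values):
    """Pair each calculated p-value with its rank-based p-value.

    Returns (calculated p-value, rank, rank / n) tuples sorted by
    calculated p-value, where n is the number of incorrect matches.
    """
    n = len(calculated_p_values)
    ordered = sorted(calculated_p_values)
    return [(p, rank, rank / n) for rank, p in enumerate(ordered, start=1)]
```

Plotting the first element of each tuple against the third gives a calibration plot; well-calibrated p-values fall along the diagonal.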
21Calibrating p-values
Calculated p-value
Rank p-value
22Better discrimination with p-values
- Percolator combines
- search score
- delta m/z
- delta search score
- charge
- peptide length
- candidates
- copies in library
precision (tp / (tp + fp))
recall (tp / (tp + fn))
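Precision and recall at a given p-value threshold follow directly from the counts of true positives, false positives, and false negatives. In this sketch a match is accepted when its p-value is at or below the threshold; the function name and the values returned when a denominator is zero are assumed conventions.

```python
def precision_recall(p_values, is_correct, threshold):
    """Precision and recall when accepting matches with p-value <= threshold.

    `is_correct` holds one boolean per match (for example, agreement
    with an independent identification).
    """
    tp = fp = fn = 0
    for p, correct in zip(p_values, is_correct):
        accepted = p <= threshold
        if accepted and correct:
            tp += 1
        elif accepted:
            fp += 1
        elif correct:
            fn += 1
    # Assumed conventions for empty denominators.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```

Sweeping the threshold from strict to permissive traces out a precision-recall curve.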
23Better discrimination with p-values
24p-values distinguish between correct and
incorrect matches
precision (tp / (tp + fp))
recall (tp / (tp + fn))
25p-values distinguish between correct and
incorrect matches
26p-values provide a universal metric for comparing
to other search results
high scoring matches
library search
Spectra
Compiled results
low scoring spectra
database search
high scoring matches
27Acknowledgements
- MacCoss lab
- Jesse Canterbury
- Michael Bereman
- Jarrett Egertson
- Greg Finney
- Eileen Heimer
- Edward Hsieh
- Alana Killeen
- Brendan MacLean
- Gennifer Merrihew
- Daniela Tomazela
29Percolator distinguishes between correct and
incorrect matches
30Spectrum-sequence assignments
- spectrum score library SEQUEST agree?
- 0.17 AFEQWK LVVAMK NO False positive
- 0.83 DLAVER DLAVER YES True positive
31Test procedure
Query Spectra: unfractionated worm, one MuDPIT, 220,845 spectra; similar DTASelect criteria: 14,926 spectra, 5,358 ions
MS/MS spectra: whole worm lysate, 4 fractionation methods, 31 MuDPITs, 6,634,874 spectra
Peptide ID List
Scan1 0.7 EGSSDEEVP
Scan1 0.3 TFAEILNPI
Scan1 0.2 ARFDLNNHD
-------------------
Scan2 0.5 EDEESIRAV
Scan2 0.2 WLGDDCFMV
Scan2 0.1 IDRAAWKAV
-------------------
Scan3 0.2 EITTRDMGN
Scan3 0.1 GRNMCTAKL
BlibSearch
SEQUEST DTASelect
Library
List of spectrum-sequence pairs: 366,400 spectra, estimated 51 false positives
BlibFilter
Library: Multiple spectra per peptide
Filtered Library Statistics: 26,708 spectra, 21,264 sequences, 3,573 proteins
file scan seq
run1.ms2 404 DALLQW
run1.ms2 651 PJAMVM
run5.ms2 924 SAITTY
BlibBuild
32Optimize processing parameters
- Noise removal
- a fixed number of peaks
- a fixed fraction of the total intensity
- all peaks above a defined noise level
- Intensity normalization
- log transform
- bin peaks, divide by base peak in each bin
- square root of intensity
- square root weighted by peak m/z
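Two of the options above, and the effect of applying them in either order, can be sketched as follows. The slides do not give the formula for "square root weighted by peak m/z", so multiplying the square-root intensity by the m/z is an assumption, as are the function names.

```python
import math

def top_n_peaks(peaks, n):
    """Noise removal: keep the n most intense (m/z, intensity) peaks."""
    kept = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(kept)  # back to ascending m/z order

def mz_weighted_sqrt(peaks):
    """Intensity normalization: sqrt(intensity) times m/z (assumed weighting)."""
    return [(mz, mz * math.sqrt(intensity)) for mz, intensity in peaks]

def process(peaks, n, noise_first=True):
    """Apply noise removal and intensity normalization in the chosen order."""
    if noise_first:
        return mz_weighted_sqrt(top_n_peaks(peaks, n))
    return top_n_peaks(mz_weighted_sqrt(peaks), n)
```

The order matters whenever the normalization changes the intensity ranking: weighting by m/z can promote a weak high-m/z peak above a strong low-m/z one, so a different set of peaks survives the top-n cut.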
33Uses of Spectrum Libraries
- A basis for spectrum identification via spectrum-spectrum searches
- A reference for designing SRM experiments
- Skyline
- A repository for spectrum identifications
- A unified format for consolidating results, sharing with other labs
34Spectrum shuffling techniques
- Blindly shuffle peaks
- Shuffle blocks of peaks
- Shift peaks circularly
- Identify fragment ions from peptides, shuffle
sequence and move peaks accordingly
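As one illustration of the techniques listed, block shuffling can be sketched as cutting the m/z range into equal-width blocks and permuting the blocks, so peaks that were close together stay together. The block count, range bounds, and function name are assumptions; the slides do not describe how blocks were defined.

```python
import random

def shuffle_blocks(peaks, min_mz, max_mz, n_blocks, seed=None):
    """Permute equal-width m/z blocks, keeping each peak's offset in its block.

    `peaks` is a list of (m/z, intensity) pairs with m/z in [min_mz, max_mz).
    A fixed `seed` makes the decoy reproducible.
    """
    if n_blocks < 1 or max_mz <= min_mz:
        raise ValueError("need n_blocks >= 1 and max_mz > min_mz")
    width = (max_mz - min_mz) / n_blocks
    order = list(range(n_blocks))
    random.Random(seed).shuffle(order)
    # order[new] = old block index; invert to look up each block's new slot.
    new_position = {old: new for new, old in enumerate(order)}
    decoy = []
    for mz, intensity in peaks:
        old = min(int((mz - min_mz) // width), n_blocks - 1)
        offset = (mz - min_mz) - old * width
        decoy.append((min_mz + new_position[old] * width + offset, intensity))
    return sorted(decoy)
```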
35Parameter Test Results
Intensity Noise Order Score
MZ TOPN 50 I 0.9918
MZ TOPN 100 N 0.9915
MZ HALF I 0.9887
MZ TOPN 200 N 0.9882
BIN TOPN 100 N 0.9881
MZ TOPN 100 I 0.9873
MZ TOPN 200 I 0.9861
MZ TOPN 50 N 0.9859
MZ TOPN 300 N 0.9856
BIN TOPN 200 N 0.9853
MZ TOPN 300 I 0.9838
BIN TOPN 50 I 0.9825
BIN HALF I 0.9811
SQ TOPN 50 N 0.9807
BIN TOPN 100 I 0.9803
BIN TOPN 300 I 0.9788
SQ TOPN 100 N 0.9787
BIN TOPN 200 I 0.9777
BIN TOPN 50 N 0.9769
BIN TOPN 300 N 0.9766
SQ TOPN 300 N 0.9761
SQ HALF I 0.9756
SQ TOPN 200 N 0.9751
BIN HALF N 0.9635
MZ HALF N 0.9465
SQ HALF N 0.9442
Processing Order: N = noise first; I = intensity first
Intensity Adjustments: BIN = bin peaks, divide by max per bin; MZ = weight peak intensity by m/z; SQ = square root of intensity
Noise Reduction: T = top n peaks used; C = top 50% of peak intensity