Statistical calibration of MS/MS spectrum library search scores

Slides: 36
Provided by: BarbaraF154

Transcript and Presenter's Notes



1
Statistical calibration of MS/MS spectrum library
search scores
  • Barbara Frewen
  • January 10, 2011
  • University of Washington

2
Protein identification
Protein Mixture
Proteins: B0205.7 casein kinase; C29A12.3a lig-1 DNA ligase; C29E6.1a mucin like protein
Digestion to Peptides
Peptides: EYWDYEAHMIEWGQIDDYQLVR GGTNIITLLDVVK VVVF
LFDLLYFNGEPLV YQTTGQVQYSCLVR LIVVNSEDQLR HPLISLLLL
IAFYSTSSEAFVPK
3
Acquiring MS/MS spectra
Cell lysis -> Isolate proteins -> Digest to peptides -> Load onto column -> µLC/µLC -> MS -> MS/MS
4
Which proteins are in my sample?
Protein Mixture
Proteins: B0205.7 casein kinase; C29A12.3a lig-1 DNA ligase; C29E6.1a mucin like protein
Digestion to Peptides
Peptides: EYWDYEAHMIEWGQIDDYQLVR GGTNIITLLDVVK VVVF
LFDLLYFNGEPLV YQTTGQVQYSCLVR LIVVNSEDQLR HPLISLLLL
IAFYSTSSEAFVPK
5
Matching a spectrum to a peptide sequence
  • De novo
  • Infer peptide sequence from m/z of observed peaks
  • Database search
  • Compare observed peaks to predicted peaks for each
    peptide from a list of candidate sequences
  • Library search
  • Compare observed peaks to known spectra
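
The library-search idea of comparing observed peaks to a known spectrum can be sketched as below. This is a minimal illustration, assuming spectra are lists of (m/z, intensity) pairs, binned at a 1.0 m/z width and compared with a normalized dot product. The bin width, the function names, and the choice of cosine similarity are assumptions for illustration and are not necessarily the score BiblioSpec computes.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumption)."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz // bin_width)] += intensity
    return bins

def cosine_similarity(query_peaks, library_peaks, bin_width=1.0):
    """Normalized dot product between two binned spectra, in the range 0 to 1."""
    q = bin_spectrum(query_peaks, bin_width)
    lib = bin_spectrum(library_peaks, bin_width)
    dot = sum(value * lib.get(key, 0.0) for key, value in q.items())
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    lib_norm = math.sqrt(sum(v * v for v in lib.values()))
    if q_norm == 0.0 or lib_norm == 0.0:
        return 0.0
    return dot / (q_norm * lib_norm)
```

A query would be scored against every library spectrum with a matching precursor m/z, and the candidates ranked by this score.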

6
Building a spectrum library
  • Ideally, infuse synthesized peptides
  • ISB has gold standard spectra from five peptides
    per protein in human
  • University of Washington (MacCoss) will have
    spectra from 790 transcription factors and 350
    kinases
  • Alternatively, use high-quality peptide-spectrum
    matches from shotgun proteomics experiments
  • BiblioSpec now parses search results from
    SEQUEST, Mascot, X! Tandem, ProteinPilot, Scaffold

7
Library file formats
BiblioSpec binary SQLite
compact ? ?
fast ?
flexible/extensible ?
accessible ?
8
Using a spectrum library
Spectrum identification via library searching
Resource for designing SRM directed experiments
Compact, unified format for compiling results and
sharing between labs
9
Searching a spectrum library
SEQUEST
BiblioSpec
Peptide ID list
Scan1 0.7 EGSSDEEVP
Scan1 0.3 TFAEILNPI
Scan1 0.2 ARFDLNNHD
-------------------
Scan2 0.5 EDEESIRAV
Scan2 0.2 WLGDDCFMV
Scan2 0.1 IDRAAWKAV
-------------------
Scan3 0.2 EITTRDMGN
Scan3 0.1 GRNMCTAKL
MS/MS query spectra: m/z 594.2; score 0.2
Library of identified spectra:
2 GDTIENFK 300.4
1 CGCCLYNT 522.3
2 FMACSDEK 593.9
3 QWDKEPPR 765.1
3 NGISLTIVR 940.4
10
Comparing library and database search
  • Created a large library of spectra from worm
    peptides
  • Identified a different set of spectra using both
    library and database search
  • Compared BiblioSpec results with SEQUEST results
    to evaluate performance
  • Columns: score, library match, SEQUEST match, agree?
  • 0.17, AFEQWK, LVVAMK, NO: false positive
  • 0.83, DLAVER, DLAVER, YES: true positive

11
Similarity score discriminates between correct
and incorrect matches
[Figure: histogram of search scores for matches that agree vs. disagree; ROC curve, AUC 0.978]
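
An area under the ROC curve such as the 0.978 reported on this slide can be computed directly from two lists of scores. The sketch below is a minimal illustration, assuming matches have already been labeled correct or incorrect (for example by agreement with SEQUEST). The function name is hypothetical, and the pairwise formulation is chosen for clarity rather than speed.

```python
def roc_auc(correct_scores, incorrect_scores):
    """Probability that a random correct match outscores a random incorrect one.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    wins = 0.0
    for c in correct_scores:
        for i in incorrect_scores:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(correct_scores) * len(incorrect_scores))
```

An AUC of 1.0 means every correct match scores above every incorrect one; 0.5 means the score carries no information.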
12
BiblioSpec and SEQUEST results agree
  • BiblioSpec found 91% of SEQUEST IDs
  • Two reasons BiblioSpec and SEQUEST disagree
  • Query ion not in library
  • BiblioSpec found a different peptide to be more
    similar
  • Only 7% of query spectra not correctly identified
    were in library. Most disagreed because the
    correct match was not in library.

13
Compute p-values to evaluate results
  • The BiblioSpec search score provides good
    discrimination
  • But it's unclear where to place a threshold
    between correct and incorrect matches
  • Use statistical methods to estimate the
    probability that a match is incorrect and to
    estimate the fraction of incorrect matches above
    a score threshold.
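
The fraction of incorrect matches above a score threshold can be estimated as sketched below. This is a minimal illustration, assuming a set of scores known to be incorrect is available (for instance from decoy spectra, discussed later in the talk) and that each query was searched against equally many real and decoy candidates. The function name and that equal-size assumption are not taken from the slides.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimated fraction of incorrect matches among targets scoring >= threshold.

    Assumes decoy matches above the threshold approximate the number of
    incorrect target matches above it (equal target and decoy search space).
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)
```

Raising the threshold usually lowers the estimate at the cost of accepting fewer matches.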

14
How likely is the match incorrect?
  • distribution of scores for a spectrum vs all
    possible incorrect matches

Low score: large area to the right, p-value 0.4
High score: small area to the right, p-value 0.01
x-axis: score
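
The "area to the right" on this slide can be approximated from a finite sample of null scores. The sketch below is a minimal illustration, assuming null_scores holds scores from known-incorrect matches. The function name and the add-one correction (which avoids reporting a p-value of exactly zero) are assumptions, not details from the slides.

```python
def empirical_p_value(score, null_scores):
    """Fraction of null scores at least as high as the observed score.

    Uses the (k + 1) / (n + 1) form so the estimate is never exactly 0.
    """
    n_at_least = sum(1 for s in null_scores if s >= score)
    return (n_at_least + 1) / (len(null_scores) + 1)
```

A low-scoring match has many null scores to its right and so a large p-value; a high-scoring match has few and so a small one.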
15
Estimating the null distribution
  • Representative sample of scores from incorrect
    matches
  • Guarantee they are incorrect by using decoys
  • In database searching, scores from decoy
    peptides are used to estimate the null
    distribution
  • How can we create decoy spectra?

16
Generate decoy spectra by shifting the m/z of
the peaks
  • Requirements
  • fast to generate
  • sequence agnostic
  • representative scores
  • Evaluation
  • score distributions mimic real spectra
  • generate a data set of incorrect matches to real
    spectra

real spectrum
decoy spectrum
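
A circular shift of peak m/z values, as named on the following slides, can be sketched as below. This is a minimal illustration, assuming peaks are (m/z, intensity) pairs that all lie within a known scan range. The shift size, the range bounds, and the function name are hypothetical parameters, not values given in the talk.

```python
def circular_shift(peaks, shift, min_mz, max_mz):
    """Shift every peak's m/z by `shift`, wrapping around within [min_mz, max_mz).

    Intensities are unchanged, so the decoy keeps the real spectrum's
    intensity distribution. No peptide sequence is needed.
    """
    span = max_mz - min_mz
    shifted = [
        (min_mz + (mz - min_mz + shift) % span, intensity)
        for mz, intensity in peaks
    ]
    return sorted(shifted)
```

Because the operation is a single pass over the peaks and ignores the sequence, it fits the "fast to generate" and "sequence agnostic" requirements listed above.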
17
Circularly shifted peaks are similar to real
spectra
18
Circularly shifted peaks are similar to real
spectra
19
Percolator computes p-values
  • Semisupervised machine learning to classify
    correct versus incorrect matches
  • Trains with high-scoring real matches vs decoy
    matches
  • Classifies all real matches using that model

http://per-colator.com
Käll et al. 2007, Nature Methods; Käll et al. 2008, Bioinformatics
20
Evaluate p-values
  • Compute p-values for incorrect matches to real
    spectra
  • Percolator p-values should correspond with
    rank-based p-values

ID            Percolator    rank   rank/n
745AF_8518    0.000230787   1      1/n
691AF_10025   0.000461467   2      2/n
691AF_10107   0.000692201   3      3/n
691AF_10301   0.000922934   4      4/n
...           ...           ...    ...
691AF_5048    0.001153669   12     12/n
...           ...           ...    ...
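
The comparison on this slide pairs each computed p-value with its rank-based expectation. For matches known to be incorrect, well-calibrated p-values should be roughly uniform, so the k-th smallest of n should be close to k/n. The sketch below is a minimal illustration; the function name is hypothetical and ties are ranked in sort order.

```python
def rank_p_values(p_values):
    """Pair each computed p-value with its rank-based p-value, rank / n.

    Returns (computed, rank_based) tuples ordered from smallest to largest.
    """
    n = len(p_values)
    ordered = sorted(p_values)
    return [(p, rank / n) for rank, p in enumerate(ordered, start=1)]
```

Plotting the first element of each pair against the second gives a calibration plot: points near the diagonal indicate calibrated p-values.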
21
Calibrating p-values
[Plot: calculated p-value vs. rank p-value]
22
Better discrimination with p-values
  • Percolator combines
  • search score
  • delta m/z
  • delta search score
  • charge
  • peptide length
  • number of candidates
  • number of copies in library

precision (tp / (tp + fp))
recall (tp / (tp + fn))
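
The two axis quantities are precision, tp / (tp + fp), and recall, tp / (tp + fn), where tp, fp, and fn are counts of true positives, false positives, and false negatives at a given threshold. A minimal sketch follows; the function name and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from counts at one score threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Sweeping the threshold over all observed scores or p-values and recomputing the counts traces out a precision-recall curve.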
23
Better discrimination with p-values
24
p-values distinguish between correct and
incorrect matches
precision (tp / (tp + fp))
recall (tp / (tp + fn))
25
p-values distinguish between correct and
incorrect matches
26
p-values provide a universal metric for comparing
to other search results
Spectra -> library search -> high scoring matches -> Compiled results
low scoring spectra -> database search -> high scoring matches -> Compiled results
27
Acknowledgements
  • MacCoss lab
  • Jesse Canterbury
  • Michael Bereman
  • Jarrett Egertson
  • Greg Finney
  • Eileen Heimer
  • Edward Hsieh
  • Alana Killeen
  • Brendan MacLean
  • Gennifer Merrihew
  • Daniela Tomazela
  • Mike MacCoss
  • Bill Noble

29
Percolator distinguishes between correct and
incorrect matches
30
Spectrum-sequence assignments
  • Columns: score, library match, SEQUEST match, agree?
  • 0.17, AFEQWK, LVVAMK, NO: false positive
  • 0.83, DLAVER, DLAVER, YES: true positive

31
Test procedure
Query spectra: unfractionated worm, one MuDPIT, 220,845 spectra; similar DTASelect criteria: 14,926 spectra, 5,358 ions
MS/MS spectra: whole worm lysate, 4 fractionation methods, 31 MuDPITs, 6,634,874 spectra
Peptide ID list:
Scan1 0.7 EGSSDEEVP
Scan1 0.3 TFAEILNPI
Scan1 0.2 ARFDLNNHD
-------------------
Scan2 0.5 EDEESIRAV
Scan2 0.2 WLGDDCFMV
Scan2 0.1 IDRAAWKAV
-------------------
Scan3 0.2 EITTRDMGN
Scan3 0.1 GRNMCTAKL
BlibSearch
SEQUEST, DTASelect
Library
List of spectrum-sequence pairs: 366,400 spectra, estimated 51 false positives
BlibFilter
Library: multiple spectra per peptide
Filtered library statistics: 26,708 spectra, 21,264 sequences, 3,573 proteins
file      scan  seq
run1.ms2  404   DALLQW
run1.ms2  651   PJAMVM
run5.ms2  924   SAITTY
BlibBuild
32
Optimize processing parameters
  • Noise removal
  • a fixed number of peaks
  • a fixed fraction of the total intensity
  • all peaks above a defined noise level
  • Intensity normalization
  • log transform
  • bin peaks, divide by base peak in each bin
  • square root of intensity
  • square root weighted by peak m/z
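
Three of the processing options listed above can be sketched as follows. This is a minimal illustration, assuming peaks are (m/z, intensity) pairs. The function names are hypothetical, and the exact rules the real tools apply (tie handling, how the intensity fraction is accumulated) are not given on the slide, so the choices here are assumptions.

```python
import math

def top_n_peaks(peaks, n):
    """Noise removal: keep the n most intense peaks, returned in m/z order."""
    kept = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(kept)

def keep_intensity_fraction(peaks, fraction):
    """Noise removal: keep the most intense peaks until they account for
    at least `fraction` of the total intensity."""
    target = fraction * sum(intensity for _, intensity in peaks)
    kept, running = [], 0.0
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if running >= target:
            break
        kept.append((mz, intensity))
        running += intensity
    return sorted(kept)

def sqrt_intensity(peaks):
    """Intensity normalization: replace each intensity by its square root."""
    return [(mz, math.sqrt(intensity)) for mz, intensity in peaks]
```

The order in which the steps are applied can matter: taking the square root first changes what fraction of the total intensity each peak contributes, which is presumably why processing order is treated as a parameter.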

33
Uses of Spectrum Libraries
  • A basis for spectrum identification via
    spectrum-spectrum searches
  • A reference for designing SRM experiments
  • Skyline
  • A repository for spectrum identifications
  • A unified format for consolidating results,
    sharing with other labs

34
Spectrum shuffling techniques
  • Blindly shuffle peaks
  • Shuffle blocks of peaks
  • Shift peaks circularly
  • Identify fragment ions from peptides, shuffle
    sequence and move peaks accordingly

35
Parameter Test Results
Intensity  Noise     Order  Score
MZ         TOPN 50   I      0.9918
MZ         TOPN 100  N      0.9915
MZ         HALF      I      0.9887
MZ         TOPN 200  N      0.9882
BIN        TOPN 100  N      0.9881
MZ         TOPN 100  I      0.9873
MZ         TOPN 200  I      0.9861
MZ         TOPN 50   N      0.9859
MZ         TOPN 300  N      0.9856
BIN        TOPN 200  N      0.9853
MZ         TOPN 300  I      0.9838
BIN        TOPN 50   I      0.9825
BIN        HALF      I      0.9811
SQ         TOPN 50   N      0.9807
BIN        TOPN 100  I      0.9803
BIN        TOPN 300  I      0.9788
SQ         TOPN 100  N      0.9787
BIN        TOPN 200  I      0.9777
BIN        TOPN 50   N      0.9769
BIN        TOPN 300  N      0.9766
SQ         TOPN 300  N      0.9761
SQ         HALF      I      0.9756
SQ         TOPN 200  N      0.9751
BIN        HALF      N      0.9635
MZ         HALF      N      0.9465
SQ         HALF      N      0.9442

Processing order: N = noise first; I = intensity first
Intensity adjustments: BIN = bin peaks, divide by max per bin; MZ = weight peak intensity by m/z; SQ = square root of intensity
Noise reduction: T = top n peaks used; C = top 50% of peak intensity