Interpreting MS/MS Proteomics Results

The first thing I should say is that none of the material presented is original research done at Proteome Software, but we do strive to make the tools presented here available in our software product, Scaffold. With that caveat aside...

- Brian C. Searle
- Proteome Software Inc.
- Portland, Oregon USA
- Brian.Searle_at_ProteomeSoftware.com
- NPC Progress Meeting
- (February 2nd, 2006)

Illustrated by Toni Boudreault

Organization

- Identify
- SEQUEST
- X! Tandem/Mascot
- Differ / Combine

This is first and foremost an introduction, so we're first going to talk about how you go about identifying proteins with tandem mass spectrometry in the first place.

Then we're going to talk about the motivations behind the development of the first really useful bioinformatics technique in our field, SEQUEST. This technique has been extended by two other tools called X! Tandem and Mascot.

We're also going to talk about how these programs differ and how we can use that to our advantage by considering them simultaneously using probabilities.

Start with a protein

[Figure: a protein drawn as a chain of single-letter amino acid residues]

So, this is proteomics, so we're going to use tandem mass spectrometry to identify proteins, hopefully many of them, and hopefully very quickly.

Cut with an enzyme

[Figure: the protein chain cut into peptides]

And to use this technique you generally have to digest the protein into peptides about 8 to 20 amino acids in length.

Select a peptide

[Figure: one peptide picked out of the digested protein]

Look at each peptide individually. We select the peptide by mass using the first half of the tandem mass spectrometer.

Impart energy in collision cell

[Figure: the peptide A-E-P-T-I-R plus H2O]

The mass spectrometer imparts energy into the peptide, causing it to fragment at the peptide bonds between amino acids.

Measure mass of daughter ions

The masses of these fragment ions are recorded using the second mass spectrometer.

[Figure: spectrum of intensity against m/z with peaks for the fragments A (72.0), AE (201.1), AEP (298.1), and AEPT (399.2)]

B-type Ions

These ions are commonly called B ions, based on nomenclature you don't really want to know about.

[Figure: the peptide A-E-P-T-I-R plus H2O above a spectrum of intensity against m/z; the segment masses are A 72.0, E 129.0, P 97.0, T 101.0, I 113.1, and R plus H2O 174.1]

But the mass difference between the peaks corresponds directly to the amino acid sequence.

B-type Ions

[Figure: the same spectrum annotated with peak differences: A - 0 = 72.0, AE - A = 129.0, AEP - AE = 97.0, AEPT - AEP = 101.0, AEPTI - AEPT = 113.1, AEPTIR - AEPTI = 174.1]

For example, the A-E peak minus the A peak should produce the mass of E. You can build these mass differences up and derive a sequence for the original peptide.

This is pretty neat and it makes tandem mass spectrometry one of the best tools out there for sequencing novel peptides.
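The mass-difference idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the peak list holds singly charged B ions and nothing else (the four peak values are the ones on the slide), the table uses standard monoisotopic residue masses, and the function name `read_b_ladder` and the 0.1 Da tolerance are made up for the example.

```python
# Standard monoisotopic residue masses in Da (I and L are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I": 113.08406,
    "L": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # a singly charged B ion carries one extra proton


def read_b_ladder(b_peaks, tol=0.1):
    """Turn a clean ladder of B-ion m/z values into a residue string.

    Each step takes the difference to the previous peak and looks up the
    closest residue mass; '?' marks a gap that matches nothing within tol.
    """
    sequence = []
    previous = PROTON  # the "A - 0" step: the first peak minus the charge
    for mz in sorted(b_peaks):
        delta = mz - previous
        residue = min(RESIDUE_MASS, key=lambda r: abs(RESIDUE_MASS[r] - delta))
        if abs(RESIDUE_MASS[residue] - delta) > tol:
            residue = "?"
        sequence.append(residue)
        previous = mz
    return "".join(sequence)


# Peak values taken from the slide: A, AE, AEP, AEPT.
print(read_b_ladder([72.0, 201.1, 298.1, 399.2]))  # prints AEPT
```

The sketch only works because the input is a complete, noise-free ladder of one ion type.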

So, it seems pretty easy, doesn't it? But there are a couple of confounding factors.

A-type Ions

For example, B ions have a tendency to degrade and lose carbon monoxide (CO), producing A ions.

[Figure: each B-type fragment of A-E-P-T-I-R losing CO, above a spectrum of intensity against m/z]

Y-type Ions

Furthermore, the second halves of the fragments are represented as Y ions, which sequence backwards.

[Figure: the peptide read in reverse, R-I-T-P-E-A plus H2O, above a spectrum of intensity against m/z]

And, unfortunately, this is the real world, so all the peaks have different measured heights and many peaks can often be missing.

B-type, A-type, Y-type Ions

All these peaks are seen together simultaneously, and we don't even know what type of ion they are, making the mass differences approach even more difficult.

[Figure: B-, A-, and Y-type peaks overlaid in one spectrum of intensity against m/z]

Finally, as with all analytical techniques, there's noise, producing a final spectrum that looks like this, on a good day.

[Figure: noisy spectrum of intensity against m/z]

And so it's actually fairly difficult to compute the mass differences to sequence the peptide, certainly in a computer automated way. So the community needed a new technique.

Now, it wasn't all without hope.

Known Ion Types

We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but we also know a couple of other variations on those ions that come up. We even know something about the likelihood of seeing each type of ion, where generally B and Y ions are most prominent.

- B-type ions: 100
- A-type ions: 20
- Y-type ions: 100
- B- or Y-type +2H ions: 50
- B- or Y-type -NH3 ions: 20
- B- or Y-type -H2O ions: 20

If we know the amino acid sequence of a peptide, we can guess what the spectra should look like!

So it's actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.

ELVISLIVESK

Model Spectrum

So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle.

Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/

We can create a hypothetical spectrum based on our rules, where B and Y ions are estimated at 100, plus 2 ions are estimated at 50, and other stragglers are at 20.

- B/Y type ions (100)
- B/Y +2H type ions (50)
- A type ions, B/Y -NH3/-H2O (20)
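As a rough sketch of how those rules could be turned into a hypothetical spectrum, the Python below builds the ion series for ELVISLIVESK. The intensities (100, 50, 20) come from the slide; the function name `model_spectrum`, the use of monoisotopic masses, and the choice to generate every variant for every cleavage site are assumptions made for the example, not a description of any particular program.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I": 113.08406,
    "L": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056
AMMONIA = 17.02655
CO = 27.99491


def model_spectrum(peptide):
    """Return sorted (m/z, intensity) pairs for a hypothetical spectrum."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(peptide)):              # every cleavage site
        b = sum(masses[:i]) + PROTON              # singly charged B ion
        y = sum(masses[i:]) + WATER + PROTON      # singly charged Y ion
        peaks.append((b - CO, 20))                # A ion: B ion minus CO
        for ion in (b, y):
            peaks.append((ion, 100))                  # B/Y ions
            peaks.append(((ion + PROTON) / 2, 50))    # +2H (doubly charged)
            peaks.append((ion - AMMONIA, 20))         # -NH3
            peaks.append((ion - WATER, 20))           # -H2O
    return sorted(peaks)


spectrum = model_spectrum("ELVISLIVESK")
for mz, intensity in spectrum[:5]:
    print(f"{mz:8.3f}  {intensity}")
```

Comparing a list like this against the measured peaks, within some mass tolerance, is the overlap test the talk goes on to describe.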

Model Spectrum

So if we consider the spectrum that was derived from the ELVIS LIVES K peptide, we can find where the overlap is between the hypothetical and the actual spectra, and say conclusively based on the evidence that the spectrum does belong to the ELVIS LIVES K peptide.

But who cares? The more important question is: what about situations where we don't know the sequence?

We guess!

PepSeq

And so this was an approach followed by a program called PepSeq, which would guess every combination of amino acids possible, build a hypothetical spectrum, and find the best matching hypothetical.

- AAAAAAAAAA
- AAAAAAAAAC
- AAAAAAAACC
- AAAAAAACCC
- ELVISLIVESK
- WYYYYYYYYY
- YYYYYYYYYY

J. Rozenski et al., Org. Mass Spectrom., 29 (1994) 654-658.

This was a start, but it's clearly impossibly hard with larger peptides, and there's a lot of room to overfit the data.

- Impossibly hard after 7 or 8 amino acids!
- High false positive rate because you consider so many options

So obviously this isn't going to work in the long run. Another strategy is needed!

Sequencing Explosion

We needed a new invention to come around, and that was shotgun Sanger sequencing.

- 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced
- 1989 Yeast Genome project announced
- 1990 Human Genome project announced
- 1992 First chromosome (Yeast) sequenced
- 1995 H. influenzae sequenced
- 1996 Yeast Genome sequenced
- 2000 Human Genome draft

In '89 and '90 the Yeast and Human Genome projects were announced, followed by the first chromosome in '92, et cetera, et cetera.

In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry.

Eng, J. K.; McCormack, A. L.; Yates, J. R. III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.

SEQUEST

And the idea was: instead of searching all possible peptide sequences, search only those in genome databases.

Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we'd actually have a complete Human genome in a reasonable amount of time.

So, in terms of 11 amino acid peptides:

- 2×10^14: All possible 11mers (ELVISLIVESK)
- 2×10^10: All possible peptides in NR
- 1×10^8: All tryptic peptides in NR
- 4×10^6: All Human tryptic peptides in NR

So that was huge. We're talking about a 10 thousand fold difference between searching every possible 11mer and those in the current non-redundant protein database from the NCBI, and a roughly 50 million fold difference for searching human tryptic peptides. It made hypothetical spectrum matching feasible.
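The size of that reduction is easy to check with a little arithmetic. The snippet below is a sketch that uses only the rounded counts from the slide, plus the assumption that "all possible 11mers" means 20 amino acids at each of 11 positions; the variable names are made up for the example.

```python
# 20 choices at each of 11 positions.
all_11mers = 20 ** 11
print(f"{all_11mers:.3e}")   # 2.048e+14, i.e. roughly 2 x 10^14

# Rounded search-space sizes quoted on the slide.
every_11mer = 2e14
nr_peptides = 2e10
nr_tryptic = 1e8
human_tryptic = 4e6

fold_nr = every_11mer / nr_peptides          # all 11mers vs. peptides in NR
fold_tryptic = every_11mer / nr_tryptic      # vs. tryptic peptides in NR
fold_human = every_11mer / human_tryptic     # vs. human tryptic peptides

print(f"NR: {fold_nr:.0e}, tryptic: {fold_tryptic:.0e}, human tryptic: {fold_human:.0e}")
```

With these rounded inputs the first ratio is 10^4 and the last is 5×10^7, so the fold differences are best read as orders of magnitude rather than exact figures.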

SEQUEST Model Spectrum

SEQUEST made a couple of other interesting improvements as well. Jimmy and John noted that there was a discontinuity between the intensities of the hypothetical spectrum and the actual spectrum. Instead of trying to make a better model, they decided just to make the actual spectrum look like the model with normalization. Like so.

For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra.

And then they shifted the spectra back and forth so that the peaks shouldn't align. They used this number, also called the Auto-Correlation, as their background.

SEQUEST XCorr

This is another representation of the Cross Correlation and the Auto Correlation.

[Figure: correlation score against offset (AMU), with the Cross Correlation (direct comparison) peak standing above the Auto Correlation (background); the XCorr is marked]

The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range. The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification.

Gentzel M. et al., Proteomics 3 (2003) 1597-1610
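A minimal sketch of that calculation, assuming spectra already binned at 1 AMU and normalized, might look like the Python below. It follows the wording of the talk (direct comparison divided by the mean background over offsets of -75 to +75 AMU); some published descriptions of SEQUEST state the score as the direct value minus the mean background instead, so treat the exact form here as an assumption. The function names and the toy peak positions are invented for the example.

```python
def correlation(model, observed, offset):
    """Sum of products of overlapping bins, with `observed` shifted by `offset`."""
    total = 0.0
    for i, intensity in enumerate(model):
        j = i + offset
        if 0 <= j < len(observed):
            total += intensity * observed[j]
    return total


def xcorr_ratio(model, observed, window=75):
    """Direct comparison divided by the mean shifted (background) correlation."""
    direct = correlation(model, observed, 0)
    offsets = [t for t in range(-window, window + 1) if t != 0]
    background = sum(correlation(model, observed, t) for t in offsets) / len(offsets)
    if background == 0:
        return float("inf") if direct > 0 else 0.0
    return direct / background


def binned(peak_bins, size=200):
    """Toy spectrum: unit intensity at the given 1 AMU bins, zero elsewhere."""
    spectrum = [0.0] * size
    for b in peak_bins:
        spectrum[b] = 1.0
    return spectrum


model = binned([10, 30, 60, 100])
matching = binned([10, 30, 45, 60, 100])   # all model peaks plus one extra
unrelated = binned([20, 45, 75, 120])      # no peaks in common with the model

good_score = xcorr_ratio(model, matching)
bad_score = xcorr_ratio(model, unrelated)
print(good_score > bad_score)
```

The matching spectrum scores far above the unrelated one because its peaks line up at zero offset but only occasionally at other offsets.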

SEQUEST DeltaCn

And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven't been any significant improvements on it.

The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.

Here's another representation of that sentiment. The XCorr is a strong measure of accuracy, whereas the DeltaCn is a weak measure of relative goodness.

- SEQUEST: strong accuracy score (XCorr), weak relative score (DeltaCn)
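For reference, the DeltaCn is commonly reported as the gap between the best and second-best XCorr, normalized by the best. The sketch below assumes that definition; the function name and the example XCorr values are invented.

```python
def delta_cn(xcorrs):
    """Normalized gap between the best and second-best XCorr for one spectrum."""
    ranked = sorted(xcorrs, reverse=True)
    best, second = ranked[0], ranked[1]
    return (best - second) / best


# Made-up XCorr values for the candidate peptides of a single spectrum.
candidates = [2.4, 3.2, 1.1]
print(round(delta_cn(candidates), 2))  # 0.25
```

Only two candidates enter the calculation, which is why the talk calls it crude: everything below the runner-up is ignored.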

Obviously, there could be an alternative method that focuses more on the success of the relative score. Mascot and X! Tandem fit that bill.

- SEQUEST: strong accuracy score (XCorr), weak relative score (DeltaCn)
- Alternate method: weak accuracy score, strong relative score

X! Tandem Scoring

- by-Score: Sum of intensities of peaks matching B-type or Y-type ions
- HyperScore

Now the X! Tandem accuracy score is rather crude. It only considers B and Y ions and attaches these factorial terms with an admittedly hand waving argument.

Fenyo, D.; Beavis, R. C. Anal. Chem., 75 (2003) 768-774
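The factorial terms can be illustrated with a short sketch. It assumes the commonly cited form of the score, in which the by-Score is multiplied by the factorials of the number of matched B ions and matched Y ions; the function name and the intensities are made up for the example.

```python
from math import factorial


def hyperscore(matched_b, matched_y):
    """by-Score times N_b! times N_y!, given matched peak intensities per series."""
    by_score = sum(matched_b) + sum(matched_y)
    return by_score * factorial(len(matched_b)) * factorial(len(matched_y))


# Invented intensities: three matched B ions and two matched Y ions.
print(hyperscore([10, 20, 30], [40, 50]))  # 150 * 3! * 2! = 1800
```

Because of the factorials, each additional matched ion multiplies the score rather than adding to it, so matches with many assigned peaks pull away from the rest quickly.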

Distribution of Incorrect Hits

But instead of just comparing the best match to the second best, it looks at the distribution of lower scoring hits, assuming that they are all wrong. This is somewhat based on ideas pioneered with the BLAST algorithm.

[Figure: histogram of # of Matches against Hyper Score, with the Second Best and the Best Hit marked]

Here, every bar represents the number of matches at a given score. The X! Tandem creators found that the distribution decays (or slopes down) exponentially, and the log of the distribution is relatively linear because of the exponential decay.

Estimate Likelihood (E-Value)

[Figure: Log(# of Matches) against Hyper Score, with a linear fit labelled Expected Number Of Random Matches and the Best Hit marked]

If the distribution represents the number of random matches at any given score, the linear fit should correspond to the expected number of random matches.

And from this, you can calculate the likelihood that the best match is random. This is called an E-Value, or Expected-Value. In this case, a score of 60 corresponds with a log number of matches being -1, which means the estimated number of random matches for that score is 0.1. A score of 60 has a 1/10 chance of occurring at random.
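The extrapolation can be sketched as a straight-line fit to the log of the histogram. The Python below is an illustration, not X! Tandem's actual code: the histogram counts are invented so that the fit reproduces the slide's example (a score of 60 giving a log of -1), and the function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_random_matches(histogram, best_score):
    """Extrapolate log10(# of matches) linearly out to the best hit's score."""
    scores = sorted(histogram)
    logs = [math.log10(histogram[s]) for s in scores]
    slope, intercept = fit_line(scores, logs)
    return 10 ** (slope * best_score + intercept)


# Invented counts of lower-scoring (assumed wrong) hits per hyperscore bin.
histogram = {10: 10000, 20: 1000, 30: 100, 40: 10}
e_value = expected_random_matches(histogram, best_score=60)
print(round(e_value, 3))  # 0.1
```

The best hit itself is left out of the fit; only the lower-scoring hits, assumed to be wrong, define the line.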

X! Tandem and Mascot

Now, X! Tandem calculates this E-Value empirically. Another search engine, Mascot, tries to get at the same kind of number using theoretical calculations, most likely based on the number of identified peaks and the likelihood of finding certain amino acids in the genome database. They've never explicitly published their algorithm, so we'll never really know, but I suspect it's something smart.

- E-Value: Likelihood that match is incorrect relative to N guesses. Empirical (X! Tandem)
- P-Value: Likelihood that match is incorrect (E = P × N). Theoretical (Mascot)
- Probability: Likelihood that match is correct. Note (Probability ≠ 1 - P)!

I just want to bring up a point that we'll touch on a little later: the E-Value that X! Tandem calculates and the P-Value that Mascot calculates are probabilistically based, but they can only estimate the likelihood that the match is wrong. This is realistically not nearly as useful as knowing the probability that a peptide identification is right, which is NOT 1 minus the P-Value.

Now, let's go back and fill in the X! Tandem part of our accuracy/relativity scoring grid. To reiterate, the XCorr is an excellent measure of accuracy, whereas the E-Value is an excellent measure of how good the best score is relative to the rest.

If we assume that accuracy and relativity scores are independent measures of goodness, could we use both SEQUEST's XCorr and X! Tandem's E-Value together?

10 Protein Control Sample

And the answer is a resounding yes.

[Figure: scatter plot of X! Tandem -log(E-Value) against SEQUEST Discriminant Score]

Each point on this graph is a spectrum, where correct identifications are marked in red, while incorrect identifications are marked in blue. We know what's correct and incorrect because this is a control sample.

Although in general the spectra SEQUEST scores well are spectra X! Tandem also scores well, there is considerable scatter between the search engines.

One might wonder whether X! Tandem and Mascot, which use similar scoring approaches, would benefit as much, but the answer is surprisingly still yes!

[Figure: scatter plot of X! Tandem -log(E-Value) against Mascot Ion-Identity Score]

Now, why are the scores so different?

Why So Different?

Well, here are a couple of possible reasons.

- SEQUEST
  - Considers relative intensities
- X! Tandem
  - Considers semi-tryptic peptides
  - Considers only B/Y-type Ions
- Mascot
  - Considers theoretical P-Value relative to search space

SEQUEST is the only method to consider relative intensities.

X! Tandem is the only method to consider peptides outside the standard search space by default, such as semi-tryptic peptides. However, it's the only score that considers only B and Y ions, as opposed to a complete model.

And Mascot is the only search engine to compute a completely theoretical P-Value.

Consider Multiple Algorithms?

So we clearly want to consider multiple search engines simultaneously, but how?

How To Compare Search Engines?

- SEQUEST: XCorr > 2.5, DeltaCn > 0.1
- Mascot: Ion Score - Identity Score > 0
- X! Tandem: E-Value < 0.01

You can't use a thresholding system because it's impossible to find corresponding thresholds. For example, a SEQUEST match with an XCorr of 2.5 doesn't mean the same thing as an X! Tandem match with an E-Value of 0.01.

The simplest way would be to convert the scores into probabilities and compare those. We advocate for Andrew Keller and Alexey Nesvizhskii's Peptide Prophet approach because it actually calculates a true probability, not just a p-value.

- Need to convert scores to probabilities!

10 Protein Control Sample (Q-ToF): X! Tandem approach

So if you remember, X! Tandem considers the best peptide match for a spectrum against a distribution of incorrect matches.

[Figure: histogram of # of Matches against Mascot Ion-Identity Score, showing the Other Incorrect IDs for Spectrum and one match marked Possibly Correct?]

10 Protein Control Sample (Q-ToF): Peptide Prophet approach

Well, Peptide Prophet looks across the entire sample, and not at just one spectrum at a time. It compares the best match against all of the other best matches in the sample, which is clearly bimodal.

[Figure: histogram of # of Matches against Mascot Ion-Identity Score for ALL Other Best Matches, with one match marked Possibly Correct?]

Keller, A. et al., Anal. Chem. 74, 5383-5392

The low mode represents matches that are most likely wrong while the high mode represents matches that are probably right.

Peptide Prophet curve fits two distributions to the modes, following the assumption that the low scoring distribution is incorrect and that the higher scoring distribution is correct.

[Figure: the same histogram with the fitted Incorrect and Correct distributions]

These two distributions can be analyzed using Bayesian statistics with this formula. Now that formula looks pretty complex, but it just calculates the height of the correct distribution at a particular score, divided by the summed height of both distributions. This is essentially the probability of having that score and being correct divided by the probability of just having that score.
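That ratio of heights can be written out as a short sketch. The Python below assumes both modes are normal curves and that the fraction of correct matches is known; those are simplifications for illustration (the published method may fit different distribution shapes and estimates the parameters from the data), and every name and number here is hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a normal curve at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def probability_correct(score, correct, incorrect, fraction_correct):
    """Height of the correct curve divided by the summed height of both curves.

    `correct` and `incorrect` are (mean, sd) pairs; `fraction_correct` scales
    each curve by how much of the sample it accounts for.
    """
    height_correct = fraction_correct * normal_pdf(score, *correct)
    height_incorrect = (1 - fraction_correct) * normal_pdf(score, *incorrect)
    return height_correct / (height_correct + height_incorrect)


# Invented curve parameters on an ion-minus-identity style score axis.
INCORRECT = (-15.0, 8.0)
CORRECT = (20.0, 10.0)

# Same score, same curves; only the share of correct matches differs.
many_correct = probability_correct(5.0, CORRECT, INCORRECT, fraction_correct=0.5)
few_correct = probability_correct(5.0, CORRECT, INCORRECT, fraction_correct=0.1)
print(round(many_correct, 2), round(few_correct, 2))
```

The second call lowers only the fraction of correct matches, and the same score then earns a noticeably lower probability, which is information a score-only threshold cannot capture.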

This is a neat method because it actually considers the likelihood of being correct, rather than X! Tandem and Mascot, which only calculate the probability of being incorrect. It's because of this that Peptide Prophet can produce a true probability, which is important when the sample characteristics change.

Q-ToF and Ion Trap

For example, the control sample we've been looking at was derived from Q-ToF data, which produces pretty high quality results. If you compare that to the same sample run on an Ion Trap, the probability of being correct is greatly diminished.

[Figure: Incorrect and Correct distributions of # of Matches against Mascot Ion-Identity Score, for the Q-ToF run and the Ion Trap run, with one match marked Possibly Correct?]

If you'll note, the Incorrect distribution doesn't change very much between the two analyses; however, the likelihood that the identification is right changes dramatically!

As Peptide Prophet considers the correct distribution, it is immune to fluctuations between samples. P-Values and E-Values don't consider this information, so they can't be compared across multiple samples, or different examinations of the same sample, hence the reason why we need to use Peptide Prophet for comparing two different search engines.

Consider Multiple Algorithms?

So going back to the scatter plot between X! Tandem and Mascot, we can use Peptide Prophet to compute the score threshold that represents a 95% cut-off. Like so.

[Figure: scatter plot of X! Tandem -log(E-Value) against Mascot Ion-Identity Score, with the 95% thresholds drawn]

This allows you to fairly consider the answers from both search engines simultaneously.

The important thing to note is that if you looked at a different sample, these thresholds should change depending on the height of the correct distributions.
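To see how such a threshold moves with the sample, the sketch below scans for the lowest score whose probability of being correct reaches 95%. It assumes two normal curves of equal width so that the probability rises steadily with score; the curve parameters, the function names, and the simple grid scan are all invented for illustration and are not how Peptide Prophet itself is implemented.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a normal curve at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def probability_correct(score, correct, incorrect, fraction_correct):
    """Correct-curve height over the summed height of both curves."""
    good = fraction_correct * normal_pdf(score, *correct)
    bad = (1 - fraction_correct) * normal_pdf(score, *incorrect)
    return good / (good + bad)


def score_threshold(correct, incorrect, fraction_correct, target=0.95, step=0.1):
    """Lowest score on a grid, scanning up from the incorrect mean, that reaches target."""
    start = incorrect[0]
    stop = correct[0] + 10 * correct[1]
    steps = int(round((stop - start) / step))
    for k in range(steps + 1):
        score = start + k * step
        if probability_correct(score, correct, incorrect, fraction_correct) >= target:
            return score
    return None


# Invented (mean, sd) pairs; equal widths keep the probability monotonic.
INCORRECT = (-15.0, 10.0)
CORRECT = (20.0, 10.0)

rich_sample = score_threshold(CORRECT, INCORRECT, fraction_correct=0.5)
poor_sample = score_threshold(CORRECT, INCORRECT, fraction_correct=0.1)
print(round(rich_sample, 1), round(poor_sample, 1))
```

With fewer correct identifications in the sample, the score needed to reach the same 95% probability goes up, which is the sample dependence described in the talk.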

Conclusion

So in conclusion, all of the search engines look at different criteria, and we can leverage this to identify more peptides. Peptide Prophet is a great mechanism for doing that because it calculates true probabilities, instead of p-values.

- All search engines use different criteria, producing different scores
- Using multiple search engines simultaneously yields better results
- Peptide Prophet can normalize search engine results

The End