Title: Efficient and accurate algorithms for peptide mass spectrometry
1Efficient and accurate algorithms for peptide
mass spectrometry
- Dissertation presentation
- Stephen Tanner
- May 30, 2007
- Lab page: http://peptide.ucsd.edu
2Overview
- Introduction: What is mass spectrometry? How
does it fit into the broader context of biology
and bioinformatics? (Chapter 1)
- Spectrum annotation (Chapters 2, 3, 4)
- Discovering post-translational modifications
(Chapters 5, 6)
- Genome annotation (Chapter 7)
- Gene set analysis of microarrays (Chapter 8)
3From genomics to proteomics
DNA
Transcription
mRNA
Translation
Protein
4Key technologies
- Genomics: Capillary sequencers are a central
technology for studying DNA.
- Transcriptomics: Microarrays are a central
technology for studying RNA.
- Proteomics: Mass spectrometry is a central
technology for studying the proteome.
52002 Chemistry Nobel Prize
- Given for MS and NMR applied to proteins
- The citation highlights several current and
potential applications
Some five years ago, mass spectrometry
definitively crossed the border to biochemistry.
The general ways that it provides structural
determination, identification and trace level
analysis have many applications in the
biochemical field. It has become an attractive
alternative to Edman sequencing, earlier
dominant, and has an unsurpassed ability to
identify posttranscriptional modifications and
non-covalent interactions in for example
antigen-antibody binding studies for identifying
ligands to orphan receptors.
6Peptide Mass Spectrometry
A protein sample is digested (typically with
trypsin) to generate peptides. The peptides are
then separated by liquid chromatography.
7Mass spectrometry
The mass spectrometer separates the eluting
peptides by mass-to-charge ratio (m/z), and
records a mass spectrum.
Intensity
m/z
8Above: Diagram of a mass spectrometer (courtesy
of ChemGuide.com). Molecules are accelerated by
a series of charged plates, their time of flight
determined by their mass-to-charge ratio.
9Left: An LTQ mass spectrometer (image from
University of Vermont). Right: A high-end Fourier
Transform mass spectrometer (image from Pacific
Northwest National Labs)
10Tandem MS
Secondary Fragmentation
Ionized parent peptide
11Peptide fragmentation
- Peptides are fragmented, typically through
collision with inert atoms.
- Peptides break at peptide bonds, generating an
N-terminal b ion and a C-terminal y ion.
Parent peptide: H-...-HN-CH(Rn-1)-CO-NH-CH(Rn+1)-CO-...-OH
b ion (includes N-terminus): H-...-HN-CH(Rn-1)-CO
y ion (includes C-terminus): H3N-CH(Rn+1)-CO-...-OH
Spectrum: One peak for each fragment type
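The fragmentation rule above can be illustrated with a small calculation. This is a minimal sketch, not InsPecT code: the function name `fragment_ions` is hypothetical, the residue masses are standard monoisotopic values (not taken from the slides), and only singly charged b and y ions are considered.

```python
# Standard monoisotopic residue masses in Da (assumed values, not from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values.

    Entry i of each list corresponds to a break after residue i + 1,
    so b_ions[i] and y_ions[i] are complementary fragments.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b ion: N-terminal residues plus one proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ion: C-terminal residues plus water plus one proton.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions
```

Each break point contributes one b peak and one y peak, which is why a peptide of length n yields up to n - 1 peaks per ion series.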
12Above A sample peptide tandem mass spectrum,
identified and labeled by the InsPecT software
toolkit.
13The Need for Bioinformatics
- High-throughput technologies like MS generate
huge volumes of data much faster than the data
can be analyzed and integrated by legacy methods.
- Analysis becomes the bottleneck, and algorithms
address this bottleneck.
- Bioinformatics also helps improve accuracy, and
provides accurate measurements of accuracy.
14Known problem
- Suppose it takes 1 second to locate one word in a
large text. How long would it take to locate 1
million words?
- (The naive answer: one million seconds!)
- The Aho-Corasick algorithm takes roughly the same
time to find one million words as for one word.
Bioinformatics application
- Suppose it takes 1 second to interpret one
spectrum using a database. How long would it
take to search 1 million spectra?
- Early tools, like Sequest, have runtimes that
grow linearly with the number of scans.
- InsPecT uses the Aho-Corasick algorithm to search
efficiently (up to 100 times faster than Sequest).
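To make the multi-word search concrete, here is a minimal sketch of the Aho-Corasick automaton. It is a textbook version, not the InsPecT implementation; the names `build_automaton` and `find_all` are hypothetical. After the automaton is built, the scan touches each text character once, so scan time depends on the text length and the number of matches rather than on the number of query words.

```python
from collections import deque


def build_automaton(words):
    """Build goto, failure and output tables for a list of words."""
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every word into a trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Phase 2: breadth-first pass to set failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, words):
    """Return (start_index, word) for every occurrence of any word."""
    goto, fail, out = build_automaton(words)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits
```

For example, `find_all("ushers", ["he", "she", "his", "hers"])` reports "she" at index 1 and both "he" and "hers" at index 2 in a single pass over the text.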
15Key algorithms
- Genomics: Genome assembly and gene finding are
two important problems.
- Transcriptomics: Finding up- and down-regulated
genes and gene sets is a key problem.
- Proteomics: Peptide identification (InsPecT) and
modification site identification (MS-Alignment)
are two important problems.
16Peptide identification
- Given a peptide tandem spectrum, we wish to
identify the peptide which produced it.
- Identifying peptides with modified residues (or
point mutations) is important as well.
- Many interesting applications of mass
spectrometry (e.g. quantitation) rely upon
accurate peptide annotations.
17InsPecT: Fast and Accurate Spectrum Annotation
Tanner, S., Shu, H., Frank, A., Wang, L., Zandi,
E., Mumby, M., Pevzner, P., and Bafna, V., 2005.
InsPecT: Fast and accurate identification of
post-translationally modified peptides from
tandem mass spectra. Anal. Chem.,
77(14):4626-4639.
Frank, A., Tanner, S., Bafna, V., and Pevzner,
P., 2005. Peptide sequence tags for fast database
search in mass-spectrometry. J. of Proteome
Research, 4(4):1287-1295.
18Database search
- One way to identify peptides (first implemented
by tools like Sequest and Mascot) is to enumerate
and score all possibilities from the sequence
database.
- Theoretical spectra are compared against the
mass fingerprint of the spectrum.
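The enumerate-and-score idea can be sketched with the simplest possible match score, a shared peak count. This is an illustrative stand-in, not the scoring used by Sequest, Mascot, or InsPecT; the function names and the 0.5 m/z tolerance are assumptions.

```python
import bisect


def shared_peak_count(theoretical_mz, observed_mz, tolerance=0.5):
    """Count theoretical peaks that have an observed peak within tolerance."""
    observed = sorted(observed_mz)
    matches = 0
    for mz in theoretical_mz:
        # First observed peak at or above the lower edge of the window.
        idx = bisect.bisect_left(observed, mz - tolerance)
        if idx < len(observed) and observed[idx] <= mz + tolerance:
            matches += 1
    return matches


def best_candidate(candidates, observed_mz, tolerance=0.5):
    """Pick the candidate whose theoretical spectrum shares the most peaks.

    `candidates` maps a peptide name to its list of theoretical m/z values.
    """
    return max(
        candidates,
        key=lambda name: shared_peak_count(candidates[name], observed_mz, tolerance),
    )
```

Every candidate is scored against the input spectrum, so the cost grows with the number of candidates; this is the step that the tag filter described later avoids for most of the database.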
Theoretical spectrum (1 of 10,000)
Input spectrum
Match score
19Drawbacks of database search
- Enumerating all candidates is too slow,
particularly when modifications and non-tryptic
peptides must be considered.
- A modern instrument produces a million spectra
per day!
- Early tools used an over-simplified match scoring
model.
20De novo interpretation
- What if we have no sequence database?
- A de novo algorithm such as PEAKS or PepNovo
attempts to recover the entire peptide sequence
from the spectrum.
- However, due to incomplete fragmentation and
noise peaks, we can only generate partial peptide
reconstructions in most cases.
NG? GN? AT?
V
G
P
??
21Filtering via tags
- If we identify a part of the sequence (tag) from
the spectrum itself, we can efficiently filter
for regions containing that string.
- Recall: Exact matching of strings is very fast.
- Search time does not grow with the number of
query strings.
- Computational problem: identify a collection of
tags from a spectrum, such that at least one
matches the true peptide.
- We identify tags via a graph-theoretic formulation.
22Peptide mass graphs
- We obtain candidate prefix residue masses by
treating spectrum peaks as b or y fragments.
- Masses which differ by the mass of an amino acid
are linked by an edge.
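A minimal sketch of this graph construction follows. It assumes the candidate prefix masses have already been derived from the peaks, uses standard monoisotopic residue masses (I and L share one entry), and picks a 0.01 Da tolerance for illustration; `build_mass_graph` and `tripeptide_tags` are hypothetical names, and no scoring of the tags is attempted.

```python
# Standard monoisotopic residue masses in Da; "L" stands for both I and L.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_mass_graph(masses, tolerance=0.01):
    """Link every pair of masses whose difference matches a residue mass."""
    masses = sorted(masses)
    edges = {i: [] for i in range(len(masses))}
    for i, low in enumerate(masses):
        for j in range(i + 1, len(masses)):
            gap = masses[j] - low
            for aa, aa_mass in RESIDUE_MASS.items():
                if abs(gap - aa_mass) <= tolerance:
                    edges[i].append((j, aa))
    return masses, edges


def tripeptide_tags(masses, tolerance=0.01):
    """Return (prefix_mass, tag) for every path of three edges."""
    masses, edges = build_mass_graph(masses, tolerance)
    tags = []
    for start in edges:
        for mid1, aa1 in edges[start]:
            for mid2, aa2 in edges[mid1]:
                for _end, aa3 in edges[mid2]:
                    tags.append((masses[start], aa1 + aa2 + aa3))
    return tags
```

For instance, prefix masses 0.0, 71.03711, 170.10552 and 227.12698 yield the single tag AVG with prefix mass 0.0. Near-isobaric combinations (G+A versus Q, for example) produce ambiguous edges in real spectra, which is one reason several tags are kept per spectrum.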
[Figure: a peptide mass graph whose edges are labeled with amino acid residues (W, R, V, A, L, G, T, E, P, L, K, C, W, D, T).]
23Tag-based search
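[Figure: the peptide mass graph (edge labels W, R, V, A, L, T, G, E, P, L, K, C, W, D, T), with the extracted tags listed below.]
TAG   Prefix Mass
AVG   0.0
WTD   120.2
PET   211.4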
- InsPecT generates short peptide sequence tags
from the spectrum, and uses these tags to filter
the database.
- Tag-based search is a hybrid of de novo and
traditional database search.
- Tags make database search much faster, analogous
to the way that BLAST's filter speeds up sequence
search.
24Tag-based filtering
MDHPEDESHSEK QDDEEALARLEEIK SIEAKLTLR QNNLNPERPDSA
YLR LKQINEEQREGLR FVSEAVTAICEAK SSDIQAAVQICSLLHQR
EFSASLTQGLLK SAEDLEADK
MDHPEDESHSEK QDDEEALARLEEIK SIEAKLTLR QNNLNPERPDSA
YLR LKQINEEQREGLR FVSEAVTAICEAK SSDIQAAVQICSLLHQR
EFSASLTQGLLK SAEDLEADK
Tools like Sequest must score every peptide from
the database with approximately correct mass
(left). Using InsPecT, the expensive scoring
step need only be run on those candidates
matching a sequence tag (right).
25(root)
A
D
F
...
H
V
...
I/L
M
Prefix 250.1Da Suffix 1000.5Da Spectrum 1
Prefix 762.8Da Suffix 626.0Da Spectrum 23
Prefix 334.5Da Suffix 220.5Da Spectrum 3
Tags from all spectra are loaded into a trie.
The trie lets us scan the protein database for
any number of strings in linear time. When a
tripeptide tag is matched and the flanking masses
are matched, we obtain a candidate peptide.
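The flanking-mass check can be sketched as follows. For brevity this sketch locates the tag with plain `str.find` instead of the trie scan described above; the helper names, the 0.5 Da tolerance, and the standard monoisotopic residue masses are all assumptions for illustration.

```python
# Standard monoisotopic residue masses in Da (assumed values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extend(sequence, start, step, target, tolerance):
    """Walk from `start` in direction `step` until residue masses sum to `target`.

    Returns the index reached after the last consumed residue, or None if
    the target is overshot or the sequence ends first.
    """
    total, pos = 0.0, start
    while abs(total - target) > tolerance:
        if total > target or not (0 <= pos < len(sequence)):
            return None
        total += RESIDUE_MASS[sequence[pos]]
        pos += step
    return pos


def candidates_for_tag(sequence, tag, prefix_mass, suffix_mass, tolerance=0.5):
    """Return peptides that contain `tag` and match both flanking masses."""
    found = []
    at = sequence.find(tag)
    while at != -1:
        left = extend(sequence, at - 1, -1, prefix_mass, tolerance)
        right = extend(sequence, at + len(tag), 1, suffix_mass, tolerance)
        if left is not None and right is not None:
            found.append(sequence[left + 1:right])
        at = sequence.find(tag, at + 1)
    return found
```

Only the peptides returned here need to go through the expensive match scoring step.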
26Scoring tag masses
Figure 3.2: Bayesian network for scoring masses.
In nodes corresponding to peaks, the odds that a
peak is present (in a charge-2 or a charge-3
spectrum) are indicated.
27Scoring tag masses
- We use a Bayesian network to score each mass,
using binned intensity levels.
- Masses receive high scores if they have peak
patterns typical of valid break points.
y b Probability
High High 0.202
High Low 0.183
High Absent 0.112
Absent High 0.053
Absent Low 0.032
Absent Absent 0.010
Left: Simplified portion of the conditional
probability table for one node of the Bayesian
network. In ion trap spectra, most break points
produce a relatively strong y fragment, and a
weak (but present) b fragment.
28Scoring tags
- Each tag is scored using the Bayesian network
(for masses), including flanking amino acid
effects.
- Edge skew is penalized.
- The top 25 tags are retained for searching.
- InsPecT can easily be extended to new
instruments. For instance, it can be retrained
to handle c and z ion series (from ETD
instruments) without recompiling the code.
29Scoring candidate peptides
- Filtering results in a list of candidate peptides
which must be scored to obtain the best match.
- A match scoring function assigns a match quality
score (MQScore), given a spectrum and a peptide.
- The MQScore is computed using a support vector
machine (SVM) on a total of seven features
measuring match quality.
- The MQScore distinguishes the correct candidate
from incorrect candidates.
30Identifying correct annotations
- In a typical experiment, only 10-30% of spectra
are successfully interpreted.
- We wish to focus on those spectra whose
top-ranking candidate is correct.
- To help do this, we consider the gap between the
top candidate's MQScore and the nearest runner-up
(delta-score).
31False discovery rates
- In any high-throughput experiment, quantifying
false discovery rates is crucial.
- We include decoy (shuffled) proteins in the
database as a negative control.
- We quantify the empirical false discovery rate by
counting the number of matches to these bogus
records.
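The decoy-counting estimate can be written in a few lines. This is a sketch under a stated assumption: the decoy portion of the database is the same size as the real portion, so the number of decoy hits above a cutoff approximates the number of false hits among the real matches. The function names and the (score, is_decoy) input layout are hypothetical.

```python
def empirical_fdr(hits, cutoff):
    """Estimate the false discovery rate at a score cutoff.

    `hits` is a list of (score, is_decoy) pairs, one per top-scoring match.
    Assumes equally sized target and decoy databases.
    """
    targets = sum(1 for score, is_decoy in hits if score >= cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= cutoff and is_decoy)
    return decoys / targets if targets else 0.0


def cutoff_for_fdr(hits, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is within `max_fdr`."""
    for cutoff in sorted({score for score, _ in hits}):
        if empirical_fdr(hits, cutoff) <= max_fdr:
            return cutoff
    return None
```

Lowering the cutoff accepts more spectra but lets in more decoy matches, so the cutoff is chosen to meet a target rate rather than fixed in advance.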
32Above: Histogram showing false discovery rate (y
axis) versus weighted score (x axis) for results
of a large search.
33The sequence of human crystallin beta B1 is
shown above, annotated with post-translational
modifications discovered by InsPecT in a study of
cataractous lens. Some modifications are
produced by chemical damage; others are
deliberate modifications carried out in a
carefully regulated manner. Comparisons of
modification rates suggest that deamidation (net
mass shift +1) plays a role in cataract formation.
34MS-Alignment and PTMFinder: Unrestrictive
Modification Search
Tsur, D., Tanner, S., Zandi, E., Bafna, V., and
Pevzner, P., 2005. Identification of
post-translational modifications via blind search
of mass-spectra. Nature Biotechnology,
23:1562-1567.
Tanner, S., Pevzner, P., and Bafna, V., 2006.
Unrestrictive identification of
post-translational modifications through peptide
mass spectrometry. Nat Protocols, 1(1):67-72.
Wilmarth, P. A., Tanner, S., Dasari, S., Nagalla,
S. R., Riviere, M. A., Bafna, V., Pevzner, P. A.,
and David, L. L., 2006. Age-related changes in
human crystallins determined from comparative
analysis of post-translational modifications in
young and aged lens: Does deamidation contribute
to crystallin insolubility? Journal of Proteome
Research.
Tanner, S., Payne, S. H., Dasari, S., Shen, Z.,
Wilmarth, P., David, L., Loomis, W. F., Briggs,
S. P., and Bafna, V., 2007. Accurate annotation
of peptide modifications through unrestrictive
database search. In preparation.
35Post-translational modifications
- After assembly, proteins are often modified to
control their structure, to regulate enzyme
activity, or by chemical damage.
- Hundreds of different modification types are
known. Databases such as UniMod, RESID, and ABRF
catalog them.
36Restrictive vs. unrestrictive search
- InsPecT can handle several modification types at
once, but the user must still guess a list of
allowed modification types.
- In unrestrictive search, the virtual database of
modified peptides is thousands of times larger
than the sequence database itself.
- Identifying all peptide candidates becomes
infeasible. However, an alignment procedure can
find the best modified peptide.
37Simplified diagram of MS-Alignment algorithm. We
construct dots for each database position
(horizontal axis) and for each spectrum peak
(vertical axis). Paths are diagonal lines, with
one or two modifications (horizontal / vertical
segments) permitted. An annotation is a path from
the top to the bottom of the graph. The
highest-scoring paths are retained and re-scored.
38Analysis of unrestrictive results
- We obtained interesting results in the Nature
Biotechnology paper, but did not report a false
discovery rate for sites.
- As peptide datasets grow, there will be less
emphasis on individual spectral correctness.
- Instead we use the high redundancy of large
datasets to focus on identification of modified
peptides, and modified sites.
39PTMFinder
- The PTMFinder procedure attaches a false
discovery rate to modification sites (analogous
to PeptideProphet and unmodified search).
- A site may be supported by several peptides, and
by hundreds of spectra.
- High spectrum-level accuracy is not sufficient
(or necessary) to give high site-level accuracy.
- Combining features across spectra produces a very
accurate model.
40Handling delta-correct annotations
- In unrestrictive search, each peptide has dozens
of neighbors with similar fragmentation.
- Examples:
- Q-17GEAMLAPK vs. QG-17EAMLAPK
- Q-16GEAMLAPK vs. G+111EAMLAPK
- PTMFinder merges and reconciles redundant
peptides, and attempts to annotate peptides using
known chemical modifications (unrestrictive, but
not blind).
41Figure 6.3: ROC curve for categorization of
modified lens peptides using the PTMFinder
support vector machine (SVM). The accuracy of
the PTMFinder model is significantly higher than
a simple spectrum-level score cutoff. In
addition, PTMFinder is more effective than
selecting those sites which correspond to the
most common modification types (amino acid and
mass) by spectrum count.
42PTMFinder analysis
- Studied a small, heavily modified data set from
human lens, and a large data set from HEK293 cell
extract.
- Also studied 1.4 million spectra from the protist
Dictyostelium discoideum.
43Ten different peptide species witness histidine
methylation of actin. Combining evidence from
multiple peptide species gives a site p-value of
6.6 x 10^-12. Fully tryptic peptides are most
common, but missed cleavages and post-digest
decay produce several other peptide species. We
found this modification site to be conserved
between Homo sapiens and the protist
Dictyostelium discoideum.
44Figure 6.5: Venn diagram summarizing sites of
N-terminal acetylation (left) and phosphorylation
(right) in human proteins. Known sites
from two databases (Uniprot and HPRD) are shown,
along with sites identified from a corpus of 20
million spectra derived from the HEK293 cell line
analyzed by MS-Alignment and PTMFinder.
45Genome Annotation
Tanner, S., Shen, Z., Ng, J., Florea, L., Guigo,
R., Briggs, S. P., and Bafna, V., 2007. Improving
gene annotation with mass spectrometry. Genome
Research, 17(2):231-239.
Gupta, N., Tanner, S., Jaitly, N., Adkins, J.,
Lipton, M., Edwards, R., Romine, M., Osterman,
A., Bafna, V., Smith, R. D., and Pevzner, P.,
2007. Whole proteome analysis of
post-translational modifications: applications
of mass-spectrometry for proteogenomic
annotation. In preparation.
46Genome annotation
Genomics
Genome annotation is generally seen as something
done before transcriptomics and proteomics.
The direction of information flow mirrors the
central dogma.
Transcriptomics
Proteomics
47Genome annotation
Genomics
Mass spectrometry is an attractive method for
discovering genes and improving gene
annotations. Roughly 25% of tryptic peptides
span a splice junction, so intron boundaries (and
alternative splicing) can be confirmed at the
translational level. MS/MS has different sources
of error than ESTs, providing a novel line of
evidence for gene finding.
ESTs
Transcriptomics
Peptide IDs
Proteomics
48Genomic Search
- Proteins have many isoforms and sequence
variants. Storing and searching every feasible
sequence is inefficient!
- Storing the proteome as an exon graph is more
efficient, and results are trivially mapped back
onto the genome.
- (Valuable even in cases where the genome is
perfectly annotated!)
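A toy version of searching an exon graph is sketched below. It is not the modified InsPecT search: the graph layout (node name mapped to a translated sequence and a list of successor nodes), the function names, and the example exons are all hypothetical, and codons split across splice junctions are ignored. The graph is assumed to be acyclic with non-empty exon sequences.

```python
def matches_from(graph, node, offset, peptide):
    """True if `peptide` can be read starting at `offset` inside `node`,
    following outgoing edges whenever the exon sequence runs out."""
    seq, successors = graph[node]
    remaining = seq[offset:]
    if len(remaining) >= len(peptide):
        return remaining.startswith(peptide)
    if not peptide.startswith(remaining):
        return False
    rest = peptide[len(remaining):]
    return any(matches_from(graph, nxt, 0, rest) for nxt in successors)


def find_peptide(graph, peptide):
    """Return (node, offset) for every place the peptide can start."""
    locations = []
    for node, (seq, _successors) in graph.items():
        for offset in range(len(seq)):
            if matches_from(graph, node, offset, peptide):
                locations.append((node, offset))
    return locations


# Hypothetical graph: one exon with two alternative downstream exons.
EXAMPLE_GRAPH = {
    "exon1": ("MDHPEDESHSEK", ["exon2a", "exon2b"]),
    "exon2a": ("QDDEEALAR", []),
    "exon2b": ("SIEAKLTLR", []),
}
```

Here `find_peptide(EXAMPLE_GRAPH, "SEKQDD")` and `find_peptide(EXAMPLE_GRAPH, "SEKSIE")` both start at offset 9 of `exon1`, one per outgoing edge. Each exon is stored once, yet both splice forms are searchable, and every hit maps directly back to a node of the graph.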
49Figure 7.3: A portion of the exon graph for
heterogeneous nuclear ribonucleoprotein K. The
labeled edge represents a codon split across a
splice junction. The dotted edge is an adjacent
edge corresponding to a longer form of an exon.
Searching the exon graph reveals peptides
spanning both outgoing edges from the central
node, confirming alternative splicing at the
level of translation.
50Exon Graph
- We used gene predictions (GeneID) and EST
mappings (dbEST, ESTMapper) to build a graph of
putative exons and introns in the human genome.
- The graph incorporates coding SNPs from dbSNP.
- A modified version of InsPecT was then used to
search the graph.
51Above: Evidence for novel exons upstream of the
annotated start site of retinoblastoma-associated
protein RAP140 (gi5881256). Matched peptides are
shown below their corresponding genomic location,
with spectrum counts indicated in parentheses.
Those peptides which match the reference protein
sequence are also shown.
52Figure 7.6: Novel exons are supported by peptide
identifications and by sequence homology. Above
is a multiple alignment for hypothetical protein
sequences from chimp (gi55639283), rat
(gi62531299), and human (genome translation,
similar to gi20070384). Introns are indicated by
colons. The peptides identified from mass spectra
are indicated below the protein sequence. The
novel 3' exon is supported by three peptide
identifications, as well as >95% amino acid
sequence conservation across species.
53Genome annotation results
- Discovery of novel exons in a dozen human genes
- Confirmation of many genes predicted de novo or
from ESTs
- Detection of alternative splicing, and coding
SNPs, at the translational level
54Gene Set Analysis
Generalized gene set queries for microarray
analysis. Stephen Tanner, Pankaj Agarwal, 2007.
In preparation.
55Overview
- Microarray experiments compare mRNA between
tissue types or treatment conditions. They
measure up- and down-regulation of genes.
- The data is very noisy (particularly for
low-abundance transcripts) and can be difficult
to interpret.
- Collecting readings corresponding to gene sets
(e.g. a set of all genes annotated with a GO
term) is one way to address this.
56Motivating Example
- A microarray experiment compared muscle RNA from
young and from aged males (GEO data-set GDS287).
- I computed the cyber-t statistic for each gene.
- After correcting for multiple hypothesis
testing, no genes were significantly up- or
down-regulated.
- But perhaps we can find a set of genes that is
significantly up- or down-regulated.
57GQuery algorithm
- Input:
- a vector of readings for 20,000 genes
- a binary vector indicating which genes are
members of the gene set
- Output:
- An enrichment score measuring the degree to which
the set is enriched for up- or down-regulated
genes. Computed using Pearson correlation.
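The enrichment score described above can be sketched as the Pearson correlation between the readings and the binary membership vector. This is a plain restatement of that one sentence, not the GQuery code; the function names are hypothetical, and both vectors are assumed to have non-zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def enrichment_score(readings, membership):
    """Correlate per-gene readings with 0/1 gene set membership.

    Positive scores mean members tend to be up-regulated, negative scores
    mean they tend to be down-regulated.
    """
    return pearson(readings, membership)
```

With readings `[2.0, 1.5, 1.8, -0.1, 0.2, -0.3]` and membership `[1, 1, 1, 0, 0, 0]`, the score is strongly positive; negating the readings flips its sign.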
58GQuery algorithm notes
- Several other enrichment statistics were tried,
with similar results.
- The same model handles queries against a database
of other microarray experiments (e.g. matching
diseases to compounds with opposing effects).
- The computations are simple, but careful
statistics are required.
59Challenge: False positives
- Because gene sets represent co-regulated genes,
the expression levels of their members are
tightly correlated.
- Our null model must correct for this, or we will
obtain many false positives.
- We calibrated our p-value readings using a
diverse corpus of microarray experiments.
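One simple way to picture this calibration is an empirical p-value: compare a gene set's score in the current experiment with the scores the same gene set received across the corpus. The sketch below is an assumption about the general approach, not the procedure used in the dissertation; the function name and the add-one correction are choices made here for illustration.

```python
def calibrated_p_value(score, corpus_scores):
    """Empirical p-value of `score` against one gene set's corpus scores.

    Counts corpus scores at least as extreme (in absolute value) as the
    observed score; the +1 terms keep the estimate away from zero.
    """
    extreme = sum(1 for s in corpus_scores if abs(s) >= abs(score))
    return (extreme + 1) / (len(corpus_scores) + 1)
```

A gene set whose members are tightly co-regulated routinely produces large scores across unrelated experiments, so the same raw score earns it a much weaker p-value than it would for a gene set whose corpus scores stay near zero.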
60Above: Empirical cumulative distribution function
(CDF) for gene set scores across a corpus of
experiments. Calibration filters false positives:
e.g. a score of 0.05 is highly significant for
one gene set, but not for others.
61Validation experiment
- Given related experiments, we expect to see the
same enriched gene sets.
- Example: Comparing muscle from young and aged
males/females, we see downregulation of the TCA
cycle.
- To a first approximation, sets shared across
unrelated experiments are false positives.
62Accuracy (computed using the false discovery
rate) of queries for five pairs of related
microarray experiments. Calibration of p-values
using a large corpus (either GEO or the
Connectivity Map) was significantly more
effective than computing a p-value using
permutation of class labels. (Permutation also
requires many replicates to be reliable)
63Experiment Rank p-value Name
Muscle 1 7.89E-10 Glycolysis_and_Gluconeogenesis
Muscle 2 6.93E-09 Costamere CC
Muscle 3 4.37E-07 superpathway of glycolysis, pyruvate dehydrogenase, TCA, and glyoxylate bypass
AD 1 1.13E-11 Proton-Transporting Two-Sector ATPase Complex CC
AD 2 1.13E-11 Hydrogen-Translocating V-Type ATPase Complex CC
AD 3 9.31E-11 Long-Term Memory BP
Above: Examples of gene sets obtained by the
GQuery algorithm. Statistical power is greatly
increased when measuring up- or down-regulation
of many genes at once. Gene sets such as
Long-term memory are easier to interpret than
lists of gene identifiers.
64Summary
- Analysis of high-throughput mass spectrometry
requires efficient algorithms.
- Reporting accuracy (e.g. using a negative control
to compute false discovery rates) is vital for
high-throughput analysis.
- The software I developed will be used and
extended by our lab and by collaborators.
65Acknowledgments
- My committee: Vineet Bafna, Julian Schroeder,
Pavel Pevzner, Steve Briggs, Trey Ideker
- Fellow students: Ari Frank, Nuno Bandeira,
Samuel Payne, Nitin Gupta, Julio Ng, Natalie
Castellana, Vagisha Sharma, Qian Peng
- Industry contacts: Helge Weissig, Pankaj Agarwal
- Collaborating labs (Ingolf Krueger, Larry David,
Ebrahim Zandi, Marc Mumby, Elizabeth Komives, Guy
Salvesen, Richard Smith, and many more!)