Efficient and accurate algorithms for peptide mass spectrometry - PowerPoint PPT Presentation

1
Efficient and accurate algorithms for peptide
mass spectrometry
  • Dissertation presentation
  • Stephen Tanner
  • May 30, 2007
  • Lab page: http://peptide.ucsd.edu

2
Overview
  • Introduction: What is mass spectrometry? How
    does it fit into the broader context of biology
    and bioinformatics? (Chapter 1)
  • Spectrum annotation (Chapters 2, 3, 4)
  • Discovering post-translational modifications
    (Chapters 5, 6)
  • Genome annotation (Chapter 7)
  • Gene set analysis of microarrays (Chapter 8)

3
From genomics to proteomics
DNA
Transcription
mRNA
Translation
Protein
4
Key technologies
Genomics
Capillary sequencers are a central technology for
studying DNA. Microarrays are a central
technology for studying RNA. Mass spectrometry
is a central technology for studying the proteome.
Transcriptomics
Proteomics
5
2002 Chemistry Nobel Prize
  • Given for MS and NMR applied to proteins
  • The citation highlights several current and
    potential applications

Some five years ago, mass spectrometry
definitively crossed the border to biochemistry.
The general ways that it provides structural
determination, identification and trace level
analysis have many applications in the
biochemical field. It has become an attractive
alternative to Edman sequencing, earlier
dominant, and has an unsurpassed ability to
identify posttranscriptional modifications and
non-covalent interactions in for example
antigen-antibody binding studies for identifying
ligands to orphan receptors.
6
Peptide Mass Spectrometry
A protein sample is digested (typically with
trypsin) to generate peptides. The peptides are
then separated by liquid chromatography.
7
Mass spectrometry
The mass spectrometer separates the eluting
peptides by mass-to-charge ratio (m/z), and
records a mass spectrum.
Intensity
m/z
8
Above: Diagram of a mass spectrometer (courtesy
of ChemGuide.com). Molecules are accelerated by
a series of charged plates, their time of flight
determined by their mass-to-charge ratio.
9
Left: An LTQ mass spectrometer (image from
University of Vermont). Right: A high-end Fourier
Transform mass spectrometer (image from Pacific
Northwest National Labs)
10
Tandem MS
Secondary Fragmentation
Ionized parent peptide
11
Peptide fragmentation
  • Peptides are fragmented, typically through
    collision with inert atoms.
  • Peptides break at peptide bonds, generating an
    N-terminal b ion and a C-terminal y ion.

[Diagram: the peptide backbone
H...-HN-CH(Rn-1)-CO-NH-CH(Rn+1)-CO-OH
breaks at the peptide bond into two fragments:
H...-HN-CH(Rn-1)-CO   b ion (includes N-terminus)
H3N-CH(Rn+1)-CO-OH    y ion (includes C-terminus)]
Spectrum: One peak for each fragment type
12
Above: A sample peptide tandem mass spectrum,
identified and labeled by the InsPecT software
toolkit.
13
The Need for Bioinformatics
  • High-throughput technologies like MS generate
    huge volumes of data much faster than the data
    can be analyzed and integrated by legacy methods.
  • Analysis becomes the bottleneck, and algorithms
    address this bottleneck
  • Bioinformatics also helps improve accuracy - and
    provide accurate measurements of accuracy.

14
Known problem
  • Suppose it takes 1 second to locate one word in a
    large text. How long would it take to locate 1
    million words?
  • (The naive answer: one million seconds!)
  • The Aho-Corasick algorithm takes roughly the same
    time to find one million words as for one word.
Bioinformatics application
  • Suppose it takes 1 second to interpret one
    spectrum using a database. How long would it
    take to search 1 million spectra?
  • Early tools, like Sequest, have runtimes that
    grow linearly with the number of scans
  • InsPecT uses the Aho-Corasick algorithm to search
    efficiently (up to 100 times faster than Sequest)
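To illustrate why search time need not grow with the number of query words, here is a compact sketch of the Aho-Corasick automaton. It is a textbook-style version written for this transcript, not InsPecT's code; the function names and the example strings are assumptions.

```python
from collections import deque


def build_automaton(words):
    """Build goto, failure, and output tables for a set of words."""
    goto, fail, out = [{}], [0], [[]]
    for w in words:
        state = 0
        for ch in w:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(w)
    # Breadth-first pass: each failure link points to the longest proper
    # suffix of the current path that is also a prefix of some word.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(text, automaton):
    """Scan text once; return (start_index, word) for every occurrence."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for w in out[state]:
            hits.append((i - len(w) + 1, w))
    return hits


# Hypothetical example: four query strings, one pass over the text.
hits = search("MAVGPETK", build_automaton(["AVG", "WTD", "PET", "VGP"]))
print(hits)
```

The text is read once, left to right, regardless of how many words were loaded into the automaton; only the number of reported matches adds to the work.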

15
Key algorithms
Genome assembly and gene finding are two
important problems in genomics. Finding up-
and down-regulated genes and gene sets is a key
problem in transcriptomics. Peptide
identification (InsPecT) and modification site
identification (MS-Alignment) are two important
problems in proteomics.
Genomics
Transcriptomics
Proteomics
16
Peptide identification
  • Given a peptide tandem spectrum, we wish to
    identify the peptide which produced it.
  • Identifying peptides with modified residues (or
    point mutation) is important as well
  • Many interesting applications of mass
    spectrometry (e.g. quantitation) rely upon
    accurate peptide annotations.

17
InsPecT: Fast and Accurate Spectrum Annotation
Tanner, S., Shu, H., Frank, A., Wang, L., Zandi,
E., Mumby, M., Pevzner, P., and Bafna, V., 2005.
InsPecT: Fast and accurate identification of
post-translationally modified peptides from
tandem mass spectra. Anal. Chem.,
77(14):4626-4639.
Frank, A., Tanner, S., Bafna,
V., and Pevzner, P., 2005. Peptide sequence tags
for fast database search in mass-spectrometry. J.
of Proteome Research, 4(4):1287-1295.
18
Database search
  • One way to identify peptides (first implemented
    by tools like Sequest and Mascot) is to enumerate
    and score all possibilities from the sequence
    database.
  • Theoretical spectra are compared against the
    mass fingerprint of the spectrum

Theoretical spectrum (1 of 10,000)
Input spectrum
Match score
19
Drawbacks of database search
  • Enumerating all candidates is too slow,
    particularly when modifications and non-tryptic
    peptides must be considered.
  • A modern instrument produces a million spectra
    per day!
  • Early tools used an over-simplified match scoring
    model

20
De novo interpretation
  • What if we have no sequence database?
  • A de novo algorithm such as PEAKS or PepNovo
    attempts to recover the entire peptide sequence
    from the spectrum.
  • However, due to incomplete fragmentation and
    noise peaks, we can only generate partial peptide
    reconstructions in most cases.

[Figure: partial de novo reconstruction of a peptide: (NG? GN? AT?) V G P ??]
21
Filtering via tags
  • If we identify a part of the sequence (tag) from
    the spectrum itself, we can efficiently filter
    for regions containing that string.
  • Recall: Exact matching of strings is very fast.
  • Search time does not grow with number of query
    strings.
  • Computational problem: identify a collection of
    tags from a spectrum, such that at least one
    matches the true peptide.
  • We identify tags via a graph theoretic formulation

22
Peptide mass graphs
  • We obtain candidate prefix residue masses by
    treating spectrum peaks as b or y fragments.
  • Masses which differ by the mass of an amino acid
    are linked by an edge.

[Figure: peptide mass graph; edges are labeled with amino acids W, R, V, A, L, G, T, E, P, L, K, C, W, D, T]
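The graph construction on this slide can be sketched as follows: candidate prefix masses become nodes, two nodes are linked when their difference matches an amino acid mass, and a tag is read off any path of three edges. This is a minimal sketch; the tolerance of 0.5 Da, the function names, and the example masses are assumptions, and only a handful of residues are included.

```python
# Standard monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "E": 129.04259,
}


def build_mass_graph(prefix_masses, tol=0.5):
    """Link masses whose difference matches a residue mass within tol."""
    nodes = sorted(prefix_masses)
    edges = {m: [] for m in nodes}
    for i, low in enumerate(nodes):
        for high in nodes[i + 1:]:
            for aa, mass in RESIDUE_MASS.items():
                if abs((high - low) - mass) <= tol:
                    edges[low].append((aa, high))
    return edges


def enumerate_tags(edges, length=3):
    """Return (tag, prefix_mass) for every path with `length` edges."""
    tags = []

    def extend(node, start, letters):
        if len(letters) == length:
            tags.append(("".join(letters), start))
            return
        for aa, nxt in edges[node]:
            extend(nxt, start, letters + [aa])

    for node in edges:
        extend(node, node, [])
    return tags


# Hypothetical prefix masses: A, AV, AVG plus one noise peak at 150.0.
graph = build_mass_graph([0.0, 71.04, 150.0, 170.11, 227.13])
tags = enumerate_tags(graph)
print(tags)
```

The noise peak at 150.0 gains no edges, so it cannot contribute to any tag. A real tag generator would also score each path rather than report every one.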
23
Tag-based search
[Figure: peptide mass graph; edges are labeled with amino acids W, R, V, A, L, T, G, E, P, L, K, C, W, D, T]
Tag   Prefix mass
AVG   0.0
WTD   120.2
PET   211.4
  • InsPecT generates short peptide sequence tags
    from the spectrum, and uses these tags to filter
    the database.
  • Tag-based search is a hybrid of de novo and
    traditional database search.
  • Tags make database search much faster, analogous
    to the way that BLAST's filter speeds up sequence
    search.

24
Tag-based filtering
[Figure: the same list of database peptides is shown
in a left and a right panel]
MDHPEDESHSEK QDDEEALARLEEIK SIEAKLTLR QNNLNPERPDSA
YLR LKQINEEQREGLR FVSEAVTAICEAK SSDIQAAVQICSLLHQR
EFSASLTQGLLK SAEDLEADK
Tools like Sequest must score every peptide from
the database with approximately correct mass
(left). Using InsPecT, the expensive scoring
step need only be run on those candidates
matching a sequence tag (right).
25
[Figure: trie of tags. Nodes: (root), A, D, F, ..., H, V, ..., I/L, M. Leaf records:]
Prefix 250.1 Da, Suffix 1000.5 Da, Spectrum 1
Prefix 762.8 Da, Suffix 626.0 Da, Spectrum 23
Prefix 334.5 Da, Suffix 220.5 Da, Spectrum 3
Tags from all spectra are loaded into a trie.
The trie lets us scan the protein database for
any number of strings in linear time. When a
tripeptide tag is matched and the flanking masses
are matched, we obtain a candidate peptide.
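The matching step described above (tag hit, then flanking mass check) can be sketched as below. As a simplification, a dictionary keyed by tripeptide stands in for the trie, and the prefix and suffix masses are taken to be plain sums of residue masses; both choices, along with the names extend and candidates, are assumptions for illustration.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extend(seq, start, step, target, tol):
    """Sum residue masses from `start` in direction `step` until `target`.

    Returns the index just past the last residue used, or None if the
    target mass cannot be matched within tol.
    """
    mass, i = 0.0, start
    while mass < target - tol:
        if not 0 <= i < len(seq):
            return None
        mass += RESIDUE_MASS[seq[i]]
        i += step
    return i if abs(mass - target) <= tol else None


def candidates(protein, tags, tol=0.5):
    """tags maps a tripeptide to a list of (prefix, suffix, spectrum_id)."""
    found = []
    for pos in range(len(protein) - 2):
        for prefix, suffix, spec in tags.get(protein[pos:pos + 3], []):
            left = extend(protein, pos - 1, -1, prefix, tol)
            right = extend(protein, pos + 3, +1, suffix, tol)
            if left is not None and right is not None:
                found.append((spec, protein[left + 1:right]))
    return found


# Hypothetical tag GPE with flanking masses for A+V and T+L+R.
result = candidates("MKAVGPETLR", {"GPE": [(170.11, 370.23, 1)]})
print(result)
```

Only candidates that survive both flanking checks would go on to the expensive scoring step.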
26
Scoring tag masses
Figure 3.2: Bayesian network for scoring masses.
In nodes corresponding to peaks, the odds that a
peak is present (in a charge-2 or a charge-3
spectrum) are indicated.
27
Scoring tag masses
  • We use a Bayesian network to score each mass,
    using binned intensity levels
  • Masses receive high scores if they have peak
    patterns typical of valid break points

y fragment   b fragment   Probability
High         High         0.202
High         Low          0.183
High         Absent       0.112

Absent       High         0.053
Absent       Low          0.032
Absent       Absent       0.010
Left: Simplified portion of the conditional
probability table for one node of the Bayesian
network. In ion trap spectra, most break points
produce a relatively strong y fragment, and a
weak (but present) b fragment.
28
Scoring tags
  • Each tag is scored using the Bayesian network
    (for masses), including flanking amino acid
    effects.
  • Edge skew is penalized.
  • The top 25 tags are retained for searching.
  • InsPecT can easily be extended to new
    instruments. For instance, it can be retrained
    to handle c and z ion series (from ETD
    instruments) without recompiling the code.

29
Scoring candidate peptides
  • Filtering results in a list of candidate peptides
    which must be scored to obtain the best match.
  • A match scoring function assigns a match quality
    score (MQScore), given a spectrum and a peptide.
  • The MQScore is computed using a support vector
    machine (SVM) on a total of seven features
    measuring match quality.
  • The MQScore distinguishes the correct candidate
    from incorrect candidates.

30
Identifying correct annotations
  • In a typical experiment, only 10-30% of spectra
    are successfully interpreted.
  • We wish to focus on those spectra whose
    top-ranking candidate is correct.
  • To help do this, we consider the gap between the
    top candidate's MQScore and the nearest runner-up
    (delta-score).
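The delta-score is simple arithmetic on the ranked candidate list. A minimal sketch, with hypothetical peptides and MQScore values:

```python
def delta_score(candidates):
    """candidates: list of (peptide, mq_score).

    Returns (best_peptide, best_score, delta), where delta is the gap to
    the runner-up, or None when there is only one candidate.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_peptide, best_score = ranked[0]
    delta = best_score - ranked[1][1] if len(ranked) > 1 else None
    return best_peptide, best_score, delta


# Hypothetical candidates for one spectrum.
top = delta_score([("AVGPTELR", 0.5), ("AVGPETLR", 2.5), ("GAVPETLR", -1.0)])
print(top)
```

A large gap suggests the top candidate stands out from the alternatives; a small gap suggests the ranking could easily have gone another way.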

31
False discovery rates
  • In any high-throughput experiment, quantifying
    false discovery rates is crucial
  • We include decoy (shuffled) proteins in the
    database as a negative control.
  • We quantify the empirical false discovery rate by
    counting the number of matches to these bogus
    records.
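The decoy counting described above can be sketched as follows. The sketch assumes the decoy portion of the database is the same size as the real portion, so that the number of decoy matches estimates the number of false matches among the real proteins; the function name and the example scores are hypothetical.

```python
def empirical_fdr(matches, threshold):
    """matches: list of (score, is_decoy) for each spectrum's top hit.

    Estimates the false discovery rate among target hits scoring at or
    above threshold as (decoy hits) / (target hits).
    """
    targets = sum(1 for s, d in matches if s >= threshold and not d)
    decoys = sum(1 for s, d in matches if s >= threshold and d)
    return decoys / targets if targets else 0.0


# Hypothetical search results: (score, matched a shuffled protein?).
matches = [(3.0, False), (2.5, False), (2.0, True),
           (1.5, False), (1.0, True), (0.5, False)]
for cutoff in (2.2, 1.5):
    print(cutoff, empirical_fdr(matches, cutoff))
```

Raising the score cutoff lowers the estimated false discovery rate at the cost of accepting fewer spectra.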

32
Above: Histogram showing false discovery rate (y
axis) versus weighted score (x axis) for results
of a large search.
33
The sequence of human crystallin beta B1 is
shown above, annotated with post-translational
modifications discovered by InsPecT in a study of
cataractous lens. Some modifications are
produced by chemical damage; others are
deliberate modifications carried out in a
carefully regulated manner. Comparisons of
modification rates suggest that deamidation (net
mass shift +1) plays a role in cataract formation.
34
MS-Alignment and PTMFinder: Unrestrictive
Modification Search
Tsur, D., Tanner, S., Zandi, E., Bafna, V., and
Pevzner, P., 2005. Identification of
post-translational modifications via blind search
of mass-spectra. Nature Biotechnology,
23:1562-1567.
Tanner, S., Pevzner, P., and Bafna,
V., 2006. Unrestrictive identification of
post-translational modifications through peptide
mass spectrometry. Nat Protocols,
1(1):67-72.
Wilmarth, P. A., Tanner, S.,
Dasari, S., Nagalla, S. R., Riviere, M. A.,
Bafna, V., Pevzner, P. A., and David, L. L.,
2006. Age-related changes in human crystallins
determined from comparative analysis of
post-translational modifications in young and
aged lens: Does deamidation contribute to
crystallin insolubility? Journal of Proteome
Research, 2006.
Tanner, S., Payne, S. H.,
Dasari, S., Shen, Z., Wilmarth, P., David, L.,
Loomis, W. F., Briggs, S. P., and Bafna, V.,
2007. Accurate annotation of peptide
modifications through unrestrictive database
search. In preparation.
35
Post-translational modifications
  • After assembly, proteins are often modified to
    control their structure, to regulate enzyme
    activity, or by chemical damage.
  • Hundreds of different modification types are
    known. Databases such as UniMod, RESID, and ABRF
    catalog them.

36
Restrictive vs. unrestrictive search
  • InsPecT can handle several modification types at
    once, but the user must still guess a list of
    allowed modification types
  • In unrestrictive search, the virtual database of
    modified peptides is thousands of times larger
    than the sequence database itself.
  • Identifying all peptide candidates becomes
    unfeasible. However, an alignment procedure can
    find the best modified peptide

37
Simplified diagram of MS-Alignment algorithm. We
construct dots for each database position
(horizontal axis) and for each spectrum peak
(vertical axis). Paths are diagonal lines, with
one or two modifications (horizontal / vertical
segments) permitted. An annotation is a path from
the top to the bottom of the graph. The
highest-scoring paths are retained and re-scored.
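The idea of placing one modification of unknown mass can be illustrated with a brute-force sketch. This is not the MS-Alignment dynamic program: it simply tries every site for a single mass shift on one fixed peptide and counts matching prefix masses. The function name, the tolerance, and the example masses are assumptions.

```python
# Standard monoisotopic residue masses (Da) for the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "L": 113.08406, "E": 129.04259, "R": 156.10111,
}


def best_single_modification(peptide, observed, parent_mass, tol=0.5):
    """Try one mass shift at each site; return (site, delta, matches).

    observed: prefix masses read from the spectrum.
    parent_mass: sum of residue masses implied by the precursor.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    delta = parent_mass - sum(masses)
    best_score, best_site = -1, None
    for site in range(len(peptide)):
        prefix, score = 0.0, 0
        for i, m in enumerate(masses[:-1]):
            prefix += m
            # Prefixes that include the modified residue are shifted.
            shifted = prefix + (delta if i >= site else 0.0)
            if any(abs(shifted - obs) <= tol for obs in observed):
                score += 1
        if score > best_score:
            best_score, best_site = score, site
    return best_site, delta, best_score


# Hypothetical spectrum: AVGPETLR with about +79.97 Da on the T.
observed = [71.04, 170.11, 227.13, 324.18, 453.22, 634.24, 747.32]
site, delta, score = best_single_modification("AVGPETLR", observed, 903.43)
print(site, round(delta, 2), score)
```

The alignment formulation on the slide avoids this enumeration by treating the shift as a horizontal or vertical segment in the path, which also lets it search the whole database rather than one peptide at a time.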
38
Analysis of unrestrictive results
  • We obtained interesting results in the Nature
    Biotechnology paper, but did not report a false
    discovery rate for sites.
  • As peptide datasets grow, there will be less
    emphasis on individual spectral correctness.
  • Instead we use the high redundancy of large
    datasets to focus on identification of modified
    peptides, and modified sites.

39
PTMFinder
  • The PTMFinder procedure attaches a false
    discovery rate to modification sites (analogous
    to PeptideProphet and unmodified search)
  • A site may be supported by several peptides, and
    by hundreds of spectra.
  • High spectrum-level accuracy is not sufficient
    (or necessary) to give high site-level accuracy
  • Combining features across spectra produces a very
    accurate model.

40
Handling delta-correct annotations
  • In unrestrictive search, each peptide has dozens
    of neighbors with similar fragmentation
  • Examples:
  • Q-17GEAMLAPK QG-17EAMLAPK
  • Q-16GEAMLAPK G+111EAMLAPK
  • PTMFinder merges and reconciles redundant
    peptides, and attempts to annotate peptides using
    known chemical modifications (Unrestrictive, but
    not blind)

41
Figure 6.3: ROC curve for categorization of
modified lens peptides using the PTMFinder
support vector machine (SVM). The accuracy of
the PTMFinder model is significantly higher than
a simple spectrum-level score cutoff. In
addition, PTMFinder is more effective than
selecting those sites which correspond to the
most common modification types (amino acid and
mass) by spectrum count.
42
PTMFinder analysis
  • Studied a small, heavily-modified data set from
    human lens, and a large data set from HEK293 cell
    extract
  • Also studied 1.4 million spectra from the protist
    Dictyostelium discoideum

43
Ten different peptide species witness histidine
methylation of actin. Combining evidence from
multiple peptide species gives a site p-value of
6.6 x 10^-12. Fully tryptic peptides are most
common, but missed cleavages and post-digest
decay produce several other peptide species. We
found this modification site to be conserved
between Homo sapiens and the protist
Dictyostelium discoideum.
44
Figure 6.5: Venn diagram summarizing sites of
N-terminal acetylation (left) and phosphorylation
(right) in human proteins. Known sites
from two databases (Uniprot and HPRD) are shown,
along with sites identified from a corpus of 20
million spectra derived from the HEK293 cell line
analyzed by MS-Alignment and PTMFinder.
45
Genome Annotation
Tanner, S., Shen, Z., Ng, J., Florea, L., Guigo,
R., Briggs, S. P., and Bafna, V., 2007. Improving
gene annotation with mass spectrometry. Genome
Research, 17(2):231-239.
Gupta, N., Tanner, S., Jaitly, N., Adkins, J.,
Lipton, M., Edwards, R., Romine, M., Osterman,
A., Bafna, V., Smith, R. D., and Pevzner, P.,
2007. Whole proteome analysis of
post-translational modifications: applications
of mass-spectrometry for proteogenomic
annotation. In preparation.
46
Genome annotation
Genomics
Genome annotation is generally seen as something
done before transcriptomics and proteomics.
The direction of information flow mirrors the
central dogma.
Transcriptomics
Proteomics
47
Genome annotation
Genomics
Mass spectrometry is an attractive method for
discovering genes and improving gene
annotations. Roughly 25% of tryptic peptides
span a splice junction, so intron boundaries (and
alternative splicing) can be confirmed at the
translational level. MS/MS has different sources
of error than ESTs, providing a novel line of
evidence for gene finding.
ESTs
Transcriptomics
Peptide IDs
Proteomics
48
Genomic Search
  • Proteins have many isoforms and sequence
    variants. Storing and searching every feasible
    sequence is inefficient!
  • Storing the proteome as an exon graph is more
    efficient, and results are trivially mapped back
    onto the genome
  • (Valuable even in cases where the genome is
    perfectly annotated!)
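A toy version of searching an exon graph: instead of storing every isoform as a separate sequence, store each exon once and read candidate peptides along the edges. The sketch below assumes exons are already translated in frame (it ignores codons split across a splice junction), and the exon names, sequences, and function name are hypothetical.

```python
def peptides_from_exon_graph(exons, edges, k):
    """Enumerate every length-k string readable along a path in the graph.

    exons: dict mapping exon id to its translated sequence.
    edges: dict mapping exon id to a list of successor exon ids.
    Returns a set of (peptide, path) pairs, where path is a tuple of
    the exon ids the peptide passes through.
    """
    results = set()

    def walk(node, offset, prefix, path):
        needed = k - len(prefix)
        prefix = prefix + exons[node][offset:offset + needed]
        if len(prefix) == k:
            results.add((prefix, tuple(path)))
            return
        # Ran off the end of this exon: continue into each successor.
        for nxt in edges.get(node, []):
            walk(nxt, 0, prefix, path + [nxt])

    for node, seq in exons.items():
        for offset in range(len(seq)):
            walk(node, offset, "", [node])
    return results


# Hypothetical gene: exon e2 is either included or skipped.
exons = {"e1": "MKAV", "e2": "GPET", "e3": "LR"}
edges = {"e1": ["e2", "e3"], "e2": ["e3"]}
found = peptides_from_exon_graph(exons, edges, 4)
for peptide, path in sorted(found):
    print(peptide, "-".join(path))
```

Because each result carries its path through the graph, a matched peptide maps straight back to the exons, and therefore to genomic coordinates, that produced it.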

49
Figure 7.3: A portion of the exon graph for
heterogeneous nuclear ribonucleoprotein K. The
labeled edge represents a codon split across a
splice junction. The dotted edge is an adjacent
edge corresponding to a longer form of an exon.
Searching the exon graph reveals peptides
spanning both outgoing edges from the central
node, confirming alternative splicing at the
level of translation.
50
Exon Graph
  • We used gene predictions (GeneID) and EST
    mappings (dbEST, ESTMapper) to build a graph of
    putative exons and introns in the human genome
  • The graph incorporates coding SNPs from dbSNP
  • A modified version of InsPecT was then used to
    search the graph

51
Above: Evidence for novel exons upstream of the
annotated start site of retinoblastoma-associated
protein RAP140 (gi5881256). Matched peptides are
shown below their corresponding genomic location,
with spectrum counts indicated in parentheses.
Those peptides which match the reference protein
sequence are also shown.
52
Figure 7.6: Novel exons are supported by peptide
identifications and by sequence homology. Above
is a multiple alignment for hypothetical protein
sequences from chimp (gi55639283), rat
(gi62531299), and human (genome translation,
similar to gi20070384). Introns are indicated by
colons. The peptides identified from mass spectra
are indicated below the protein sequence. The
novel 3' exon is supported by three peptide
identifications, as well as >95% amino acid
sequence conservation across species.
53
Genome annotation results
  • Discovery of novel exons in a dozen human genes
  • Confirmation of many genes predicted de novo or
    from ESTs
  • Detection of alternative splicing, and coding
    SNPs, at the translational level

54
Gene Set Analysis
Generalized gene set queries for microarray
analysis. Stephen Tanner, Pankaj Agarwal, 2007.
In preparation.
55
Overview
  • Microarray experiments compare mRNA between
    tissue types or treatment conditions. They
    measure up- and down-regulation of genes.
  • The data is very noisy (particularly for
    low-abundance transcripts) and can be difficult
    to interpret.
  • Collecting readings corresponding to gene sets
    (e.g. a set of all genes annotated with a GO
    term) is one way to address this.

56
Motivating Example
  • A microarray experiment compared muscle RNA from
    young and from aged males (GEO data-set GDS287).
  • I computed the cyber-t statistic for each gene.
  • After correcting for multiple hypothesis
    testing, no genes were significantly up- or
    down-regulated.
  • But, perhaps we can find a set of genes that is
    significantly up- or down-regulated.

57
GQuery algorithm
  • Input
  • a vector of readings for 20,000 genes
  • a binary vector indicating which genes are
    members of the gene set
  • Output
  • An enrichment score measuring the degree to which
    the set is enriched for up- or down-regulated
    genes. Computed using Pearson correlation.
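The enrichment score described above can be sketched as a Pearson correlation between the per-gene readings and the binary membership vector. This is a minimal sketch using six genes instead of 20,000; the function name and the numbers are hypothetical, and any preprocessing of the readings is not shown.

```python
import math


def enrichment_score(readings, membership):
    """Pearson correlation between gene readings and set membership.

    readings: one number per gene (e.g. a t-statistic).
    membership: 1 if the gene is in the gene set, else 0.
    Assumes both vectors have nonzero variance.
    """
    n = len(readings)
    mean_x = sum(readings) / n
    mean_y = sum(membership) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(readings, membership))
    var_x = sum((x - mean_x) ** 2 for x in readings)
    var_y = sum((y - mean_y) ** 2 for y in membership)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings: the first three genes belong to the set.
readings = [2.0, 1.5, 1.8, -0.2, 0.1, -0.3]
membership = [1, 1, 1, 0, 0, 0]
score = enrichment_score(readings, membership)
print(round(score, 3))
```

A positive score means the set members tend to be up-regulated relative to the other genes; a negative score means they tend to be down-regulated.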

58
GQuery algorithm notes
  • Several other enrichment statistics were tried,
    with similar results
  • The same model handles queries against a database
    of other microarray experiments (e.g. matching
    diseases to compounds with opposing effects)
  • The computations are simple, but careful
    statistics are required.

59
Challenge: False positives
  • Because gene sets represent co-regulated genes,
    the expression levels of their members are
    tightly correlated.
  • Our null model must correct for this, or we will
    obtain many false positives.
  • We calibrated our p-value readings using a
    diverse corpus of microarray experiments.
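One way to calibrate against a corpus is to compare each gene set's score with the scores that same gene set received across many unrelated experiments. The sketch below is an assumption about how such a calibration could look, not the presentation's exact procedure: it is one-sided, and it adds a pseudo-count so that the p-value is never zero.

```python
from bisect import bisect_left


def calibrated_p_value(score, corpus_scores):
    """Empirical p-value of `score` against one gene set's corpus scores.

    Counts how many corpus scores are at least as large as the observed
    score, with a pseudo-count of one in numerator and denominator.
    """
    ranked = sorted(corpus_scores)
    at_least = len(ranked) - bisect_left(ranked, score)
    return (at_least + 1) / (len(ranked) + 1)


# Hypothetical corpora: set A often scores high, set B rarely does.
corpus_a = [0.01, 0.02, 0.03, 0.05, 0.08]
corpus_b = [0.001, 0.002, 0.003, 0.004, 0.005]
p_a = calibrated_p_value(0.05, corpus_a)
p_b = calibrated_p_value(0.05, corpus_b)
print(p_a, p_b)
```

The same raw score of 0.05 is unremarkable for a gene set whose members are tightly co-regulated in most experiments, but unusual for a set that normally scores near zero.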

60
Above: Empirical cumulative distribution function
(CDF) for gene set scores across a corpus of
experiments. Calibration filters false positives
- e.g. a score of 0.05 is highly significant for
one gene set, but not for others.
61
Validation experiment
  • Given related experiments, we expect to see the
    same enriched gene sets
  • Example: Comparing muscle from young and aged
    males/females, we see downregulation of the TCA
    cycle
  • To a first approximation, sets shared across
    unrelated experiments are false positives.

62
Accuracy (computed using the false discovery
rate) of queries for five pairs of related
microarray experiments. Calibration of p-values
using a large corpus (either GEO or the
Connectivity Map) was significantly more
effective than computing a p-value using
permutation of class labels. (Permutation also
requires many replicates to be reliable)
63
Experiment Rank p-value Name
Muscle 1 7.89E-10 Glycolysis_and_Gluconeogenesis
Muscle 2 6.93E-09 Costamere CC
Muscle 3 4.37E-07 superpathway of glycolysis, pyruvate dehydrogenase, TCA, and glyoxylate bypass

AD 1 1.13E-11 Proton-Transporting Two-Sector ATPase Complex CC
AD 2 1.13E-11 Hydrogen-Translocating V-Type ATPase Complex CC
AD 3 9.31E-11 Long-Term Memory BP
Above: Examples of gene sets obtained by the
GQuery algorithm. Statistical power is greatly
increased when measuring up- or down-regulation
of many genes at once. Gene sets such as
Long-term memory are easier to interpret than
lists of gene identifiers.
64
Summary
  • Analysis of high-throughput mass spectrometry
    requires efficient algorithms
  • Reporting accuracy (e.g. using a negative control
    to compute false discovery rates) is vital for
    high-throughput analysis
  • The software I developed will be used and
    extended by our lab and by collaborators.

65
Acknowledgments
  • My committee - Vineet Bafna, Julian Schroeder,
    Pavel Pevzner, Steve Briggs, Trey Ideker
  • Fellow students - Ari Frank, Nuno Bandeira,
    Samuel Payne, Nitin Gupta, Julio Ng, Natalie
    Castellana, Vagisha Sharma, Qian Peng
  • Industry contacts - Helge Weissig, Pankaj Agarwal
  • Collaborating labs (Ingolf Krueger, Larry David,
    Ebrahim Zandi, Marc Mumby, Elizabeth Komives, Guy
    Salvesen, Richard Smith, and many more!)