Title: Direct Experimental Observation of Functional Protein Isoforms by Tandem Mass Spectrometry
1 Direct Experimental Observation of Functional Protein Isoforms by Tandem Mass Spectrometry
- Nathan Edwards
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park
2 Synopsis
- MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
- Key concepts
- Spectrum acquisition is unbiased
- Direct observation of amino-acid sequence
- Sensitive to small sequence variations
3 Synopsis
- MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
- Applications
- Cancer biomarkers
- Genome annotation
4 Mass Spectrometry for Proteomics
- Measure mass of many (bio)molecules simultaneously
- High bandwidth
- Mass is an intrinsic property of all (bio)molecules
- No prior knowledge required
5 Mass Spectrometer
- Time-Of-Flight (TOF)
- Quadrupole
- Ion-Trap
- MALDI
- Electrospray Ionization (ESI)
6 High Bandwidth
7 Mass is fundamental!
8 Mass Spectrometry for Proteomics
- Measure mass of many molecules simultaneously
- ...but not too many, abundance bias
- Mass is an intrinsic property of all (bio)molecules
- ...but need a reference to compare to
9 Mass Spectrometry for Proteomics
- Mass spectrometry has been around since the turn of the 20th century...
- ...why is MS-based proteomics so new?
- Ionization methods
- MALDI, Electrospray
- Protein chemistry automation
- Chromatography, Gels, Computers
- Protein / genome sequences
- A reference for comparison
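10 Sample Preparation for Peptide Identification
11 Single Stage MS
[Figure: MS spectrum, plotted against m/z]
12 Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection from the MS spectrum, plotted against m/z]
13 Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection and collision-induced dissociation (CID) turn the MS spectrum into the MS/MS spectrum, both plotted against m/z]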
14 Peptide Identification
- For each (likely) peptide sequence
- 1. Compute fragment masses
- 2. Compare with spectrum
- 3. Retain those that match well
- Peptide sequences from (any) sequence database
- Swiss-Prot, IPI, NCBI's nr, ESTs, genomes, ...
- Automated, high-throughput peptide identification in complex mixtures
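The three-step loop on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the scoring used by any particular search engine: it considers singly charged b- and y-ions only, uses monoisotopic residue masses, a fixed m/z tolerance, and a plain shared-peak count as the score. All function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mzs(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions + y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of predicted fragments with a peak within tol of them."""
    return sum(
        any(abs(frag - peak) <= tol for peak in peaks)
        for frag in fragment_mzs(peptide)
    )


def rank_candidates(candidates, peaks, tol=0.5):
    """Step 3: candidates sorted by shared-peak count, best first."""
    scored = [(count_matches(pep, peaks, tol), pep) for pep in candidates]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Simulated spectrum: the ideal fragments of MARNR, with no noise.
    spectrum = fragment_mzs("MARNR")
    print(rank_candidates(["MVMAR", "MARNR"], spectrum))
```

Real engines replace the shared-peak count with a statistical score and restrict candidates to peptides whose mass matches the selected precursor.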
15 Peptide Identification
- ...can provide direct experimental evidence for the amino-acid sequence of functional proteins.
- Evidence for
- Functional protein isoforms
- Translation start and frame
- Proteins with short open-reading-frames
16 Why is this useful for... genome annotation?
- Evidence for SNPs and alternative splicing stops with transcription
- No genomic or transcript evidence for translation start-site.
- Conservation doesn't stop at coding bases!
- Statistical gene-finders struggle with micro-exons, translation start-site, and short ORFs.
17 Why is this useful for... cancer biomarkers?
- Alternative splicing is the norm!
- Only 20-25K human genes
- Each gene makes many proteins
- Some splicing is believed to be silencing
- Lots of splicing in cancer
- Proteins have clinical implications
- Statistical biomarker discovery
- Putative malfunctioning proteins
18 What can be observed?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Microexons (non-canonical splice-sites)
- Alternative translation start-sites (codons)
- Alternative translation frames
- Dark open-reading-frames
19 Splice Isoform
- Human Jurkat leukemia cell-line
- Lipid-raft extraction protocol, targeting T cells
- von Haller, et al. MCP 2003.
- LIME1 gene
- LCK interacting transmembrane adaptor 1
- LCK gene
- Leukocyte-specific protein tyrosine kinase
- Proto-oncogene
- Chromosomal aberration involving LCK in leukemias.
- Multiple significant peptide identifications
20 Splice Isoform
21 Novel Splice Isoform
22 Novel Mutation
- HUPO Plasma Proteome Project
- Pooled samples from 10 male and 10 female healthy Chinese subjects
- Plasma/EDTA sample protocol
- Li, et al. Proteomics 2005. (Lab 29)
- TTR gene
- Transthyretin (pre-albumin)
- Defects in TTR are a cause of amyloidosis.
- Familial amyloidotic polyneuropathy
- late-onset, dominant inheritance
23 Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
24 Novel Mutation
25 Translation Start-Site
- Human erythroleukemia K562 cell-line
- Depth of coverage study
- Resing et al. Anal. Chem. 2004.
- THOC2 gene
- Part of the heteromultimeric THO/TREX complex.
- Initially believed to be a novel ORF
- RefSeq mRNA in Jun 2007, no RefSeq protein
- TrEMBL entry Feb 2005, no SwissProt entry
- GenBank mRNA in May 2002 (complete CDS)
- Plenty of EST support
- 100,000 bases upstream of other isoforms
26 Translation Start-Site
27 Translation Start-Site
28 Translation Start-Site
29 Translation Start-Site
30 Easily distinguish minor sequence variations
- Two B. anthracis Sterne α/β SASP annotations
- RefSeq/Gb MVMARN... (7441 Da)
- CMR MARN... (7211 Da)
- Intact proteins differ by 230 Da
- 7441 Da vs 7211 Da
- N-terminal tryptic peptides
- MVMAR (606.3 Da), MVMARNR (876.4 Da), vs MARNR (646.3 Da)
- Very different MS/MS spectra
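The peptide masses on this slide can be checked by summing monoisotopic residue masses and adding one water. A small sketch, with an illustrative function name and only the residues needed here tabulated:

```python
# Monoisotopic residue masses (Da) for the residues in this example only.
RESIDUE_MASS = {
    "M": 131.04049, "V": 99.06841, "A": 71.03711,
    "R": 156.10111, "N": 114.04293,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in ("MVMAR", "MVMARNR", "MARNR"):
        print(pep, round(peptide_mass(pep), 1))  # 606.3, 876.4, 646.3
    # The two annotations differ by the leading Met-Val residues.
    print(round(RESIDUE_MASS["M"] + RESIDUE_MASS["V"]))  # 230
```

The 230 Da gap between the intact proteins is the mass of the two extra N-terminal residues (Met + Val), which is why the two annotations predict different N-terminal peptides.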
31 Bacterial Gene-Finding
- Find all the open-reading-frames...
TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stop codon
Stop codon
...courtesy of Art Delcher
32 Bacterial Gene-Finding
- Find all the open-reading-frames... but they overlap: which ones are correct?
Reverse strand
Stop codon
ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT
TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA
Stop codon
Stop codon
Shifted stop
...courtesy of Art Delcher
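As a rough illustration of the overlap problem, the sketch below scans the slide's 42-base example on both strands. It is a simplification, not Glimmer's method: an ORF is taken here to run from the first ATG in a frame to the next in-frame stop codon, alternative start codons are ignored, and the function names are made up for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq):
    """Return (frame, start, end) for each ATG...stop stretch on one strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # first in-frame ATG since the last stop
            elif codon in STOPS and start is not None:
                orfs.append((frame, start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    forward = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
    reverse = reverse_complement(forward)
    for strand, seq in (("+", forward), ("-", reverse)):
        for frame, start, end in find_orfs(seq):
            print(strand, frame, start, end, seq[start:end])
```

On this sequence each strand yields one short ORF, and the two occupy overlapping positions on the double helix, which is the ambiguity a coding-sequence score has to resolve.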
33 Coding-Sequence Score
...courtesy of Art Delcher
34 Glimmer3 Performance
- Glimmer3 trained and compared to RefSeq genes with annotated function
- Correct STOP
- 99.6%
- Correct START
- 84.3%
- Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.
35 N-terminal peptides
- (Protein) N-terminal peptides establish
- start-site of known and unexpected ORFs
- Use
- Directly to annotate genomes
- Evaluate and improve algorithms
- Map cross-species
36 N-terminal peptide workflows
- Typical proteomics workflows sample peptides from the proteome randomly
- Caulobacter crescentus (70)
- 3733 Proteins (RefSeq Genome annot.)
- 66K tryptic peptides (600 Da to 3000 Da)
- 2085 N-terminal tryptic peptides (3%)
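Counts like these come from an in-silico digest. The sketch below shows one way to compute them, assuming the common trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and monoisotopic masses; the function names and the toy input are illustrative and not taken from the talk.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def peptide_mass(peptide):
    """Neutral monoisotopic peptide mass."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_peptides(protein):
    """Fully tryptic peptides, in order from the N-terminus."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1:i + 2]  # empty string at the C-terminus
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def n_terminal_counts(proteins, lo=600.0, hi=3000.0):
    """(N-terminal peptides, all peptides) within the mass window."""
    n_term = total = 0
    for protein in proteins:
        for idx, pep in enumerate(tryptic_peptides(protein)):
            if lo <= peptide_mass(pep) <= hi:
                total += 1
                n_term += int(idx == 0)
    return n_term, total


if __name__ == "__main__":
    print(tryptic_peptides("MARNRPKAA"))
    print(n_terminal_counts(["MVMARNRAAK"]))  # toy input, not real data
```

With the slide's figures, 2085 N-terminal peptides out of roughly 66K in the mass window is about 3%, so random sampling rarely lands on a protein N-terminus.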
37 N-terminal peptide workflow
- Protect protein N-terminus
- Digest to peptides
- Chemically modify free peptide N-term
- Use chem. mod. to capture unwanted peptides
Nat Biotech, Vol. 21, pp. 566-569, 2003.
38 Increasing N-terminal peptide coverage
- Multiple (digest) enzymes
- trypsin-R 60% (80%)
- acid lys-C trypsin 85% (94%)
- Repeated LC-MS/MS
- Precursor Exclusion / Inclusion lists
- MALDI / ESI
- Protein separation and/or orthogonal fractionation
Anal Chem, Vol. 76, pp. 4193-4201, 2004.
39 Proteomics Informatics
- Search spectra against
- Entire bacterial genome
- All Met initiated peptides or
- Statistically likely Met initiated peptides.
- Easily consider initial Met loss PTM, too
- Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)
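One way to build the "all Met initiated peptides" search space is sketched below. It assumes the ORF has already been translated to amino acids, treats every Met as a possible start, takes the peptide up to the first tryptic cleavage site, and adds the variant with the initial Met removed. Missed cleavages and alternative bacterial start codons are ignored, and the function names are hypothetical.

```python
def first_tryptic_peptide(protein, start):
    """Peptide from `start` to the first tryptic site (K/R not before P)."""
    for i in range(start, len(protein)):
        next_aa = protein[i + 1:i + 2]  # empty string at the C-terminus
        if protein[i] in "KR" and next_aa != "P":
            return protein[start:i + 1]
    return protein[start:]


def met_initiated_candidates(orf_protein):
    """Candidate N-terminal peptides for every Met in a translated ORF."""
    candidates = set()
    for i, aa in enumerate(orf_protein):
        if aa != "M":
            continue
        peptide = first_tryptic_peptide(orf_protein, i)
        candidates.add(peptide)          # initial Met retained
        if len(peptide) > 1:
            candidates.add(peptide[1:])  # initial Met lost
    return sorted(candidates)


if __name__ == "__main__":
    print(met_initiated_candidates("MVMARNR"))
```

The resulting peptide list can be written out as a sequence database and handed to an off-the-shelf search engine unchanged.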
40 Other Practical Issues
- Suitable for commonly available instrumentation
- Only the sample prep. is (somewhat) novel.
- Need living organism
- Stage of life-cycle?
- Bang for buck?
- N-terminal peptides /
- In discussions with JCVI (ex TIGR)
- Possible pilot project?
41 Other Research Projects
- Improving peptide identification by MS/MS
- Spectral matching using HMMs
- Combining search engine results
- Spectral matching for detection and quantitation
- Microorganism identification using MS
- Live public web-site and database
- (Inexact) uniqueness guarantees
- Primer/Probe oligo design
- Pathogen detection (DNA and peptide)
- Significant false-positive peptide identifications
42 Spectral Matching
- Detection vs. identification
- Increased sensitivity
- No novel peptides
- NIST GC/MS Spectral Library
- Identifies small molecules
- 100,000s of (consensus) spectra
- Bundled/Sold with many instruments
- Dot-product spectral comparison
- Current project: peptide MS/MS
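A dot-product spectral comparison can be sketched as a normalized dot product (cosine similarity) of binned peak intensities. The bin width and the absence of any intensity weighting are assumptions of this example; library search tools typically apply their own weighting and scaling.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product of two spectra, between 0 and 1."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up peak lists, for illustration only.
    library = [(175.1, 100.0), (288.2, 50.0)]
    query = [(175.1, 100.0)]
    print(round(dot_product_score(library, query), 3))
```

Because the score compares a query against a stored spectrum, it can only detect peptides already in the library, which is the "no novel peptides" trade-off noted on the slide.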
43 Peptide DLATVYVDVLK
44 Peptide DLATVYVDVLK
45 Hidden Markov Models for Spectral Matching
- Capture statistical variation and consensus in peak intensity
- Capture semantics of peaks
- Extrapolate model to other peptides
- Good specificity with superior sensitivity for peptide detection
- Assign 1000s of additional spectra (with p-value < 10^-5)
46 www.RMIDb.org
47 www.RMIDb.org
- Statistics
- 16.7 x 10^6 (6.4 x 10^6) protein sequences
- 40,000 organisms, 19,700 species
- 557 (415) complete genomes
- Sources
- TIGR's CMR, SwissProt, TrEMBL, Genbank Proteins, RefSeq Proteins and Genomes
- Inclusive Glimmer3 predictions on Genomes
- Pfam and GO assignments using BOINC grid
48 www.RMIDb.org
Accessed from all over the world...
49 Uniqueness guarantees
- 20-mer oligo signatures for B. anthracis
- In all available strains as exact match
- No (inexact) match to other Bacillus species
50 Uniqueness guarantees
- Human genome primer design problem
- 4-unique DNA 20-mers
- Edit-distance of at least 5 to any non-specific hybridization site
- No such valid loci on Chr. 22!
- Currently analyzing entire genome
- 3-unique DNA 20-mers
- Initial experiments suggest 0.01% valid
- Approx. 1 valid oligo every 10,000 bases
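The k-unique condition can be illustrated with a brute-force check. The definition used here is an assumption inferred from the slide (4-unique corresponds to edit-distance 5): an oligo is k-unique if every substring of every off-target sequence is more than k edits away. The sketch uses a standard dynamic program with a free start and end in the target text; it ignores the reverse strand and is far too slow for a whole genome, and the function names are hypothetical.

```python
def min_edit_distance_to_substring(oligo, text):
    """Smallest edit distance between `oligo` and any substring of `text`."""
    # Row 0 is all zeros: a match may start anywhere in the text for free.
    prev = [0] * (len(text) + 1)
    for i, a in enumerate(oligo, 1):
        cur = [i]  # aligning i oligo bases against nothing costs i
        for j, b in enumerate(text, 1):
            cur.append(min(
                prev[j] + 1,             # oligo base unmatched
                cur[j - 1] + 1,          # text base unmatched
                prev[j - 1] + (a != b),  # match or substitution
            ))
        prev = cur
    # Minimum over the last row: a match may end anywhere in the text.
    return min(prev)


def is_k_unique(oligo, off_targets, k):
    """True if no off-target sequence has a site within k edits."""
    return all(
        min_edit_distance_to_substring(oligo, target) > k
        for target in off_targets
    )


if __name__ == "__main__":
    # Toy strings, far shorter than the 20-mers on the slide.
    print(min_edit_distance_to_substring("ACGT", "TTACCTTT"))
    print(is_k_unique("ACGT", ["TTTTTTTT"], k=2))
```

Scaling this guarantee to the human genome is the hard part; the quadratic check above only pins down what is being guaranteed.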
51 Future Research Plans
- Cancer biomarkers
- Optimize proteomics workflow for protein sequence coverage
- Improve informatics infrastructure to make interpretation easier
- Identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples
52 Future Research Plans
- Genome Annotation
- Collect evidence for functional alternative splicing in public datasets into dbPEP.
- Conduct pilot project for bacterial genome annotation with JCVI.
- Improve informatics infrastructure to make interpretation easier.
53 Future Research Plans
- Peptide Identification
- Expand library of HMM models for high-confidence spectral matching
- Spectral matching for biomarkers and quantitation (with Calibrant).
- Specificity metric for peptides identified using MS/MS
54 Future Research Plans
- Microorganism identification by mass spectrometry
- Specificity of tandem mass spectra
- Revamp RMIDb prototype
- Incorporate spectral matching, top-down.
55 Future Research Plans
- Oligonucleotide Design
- Uniqueness oracle for inexact match in human
- Integration with Primer3
- Tiling, multiplexing, pooling, tag arrays
56 Acknowledgements
- Catherine Fenselau, Steve Swatkoski
- UMCP Biochemistry
- Chau-Wen Tseng, Xue Wu
- UMCP Computer Science
- Cheng Lee, Brian Balgley
- Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding: NIH/NCI, USDA/ARS