Novel Peptide Identification using ESTs and Genomic Sequence presentation

1
Novel Peptide Identification using ESTs and Genomic Sequence

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

2
Sample Preparation for Peptide Identification
3
Mass Spectrometer

Electron Multiplier (EM)

Time-Of-Flight (TOF)
Quadrupole
Ion-Trap

MALDI
Electrospray Ionization (ESI)

4
Single Stage MS
[Figure: MS spectrum, m/z axis]
5
Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum (m/z axis) with precursor selection]
6
Tandem Mass Spectrometry (MS/MS)
Precursor selection and collision-induced dissociation (CID)
[Figure: MS spectrum (m/z axis) and resulting MS/MS spectrum (m/z axis)]
7
Peptide Identification

For each (likely) peptide sequence
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
Peptide sequences from protein sequence databases
Swiss-Prot, IPI, NCBI's nr, ...
Automated, high-throughput peptide identification in complex mixtures
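
The three steps above can be illustrated with a minimal sketch. It assumes singly charged b- and y-ions, standard monoisotopic residue masses, a fixed m/z tolerance, and a plain shared-peak count as the score. The function names are hypothetical, and search engines such as X!Tandem use more elaborate scoring than this.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    b_ions, y_ions, prefix = [], [], 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_ions.append(prefix + PROTON)                  # N-terminal fragment
        y_ions.append(total - prefix + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of theoretical fragments within tol of some peak."""
    b_ions, y_ions = fragment_masses(peptide)
    return sum(any(abs(m - p) <= tol for p in peaks) for m in b_ions + y_ions)


def identify(candidates, peaks, min_matches, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(p, peaks, tol), p) for p in candidates]
    return sorted((s, p) for s, p in scored if s >= min_matches)[::-1]
```

For example, `identify(["GASGASG", "PEPTIDE"], peaks, min_matches=6)` keeps only the candidate whose fragments explain at least six peaks of the spectrum `peaks`.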

8
What goes missing?

Known coding SNPs
Novel coding mutations
Alternative splicing isoforms
Alternative translation start-sites
Microexons
Alternative translation frames

9
Why should we care?

Alternative splicing is the norm!
Only 20-25K human genes
Each gene makes many proteins
Proteins have clinical implications
Biomarker discovery
Evidence for SNPs and alternative splicing stops with transcription
Genomic assays, ESTs, mRNA sequence.
Little hard evidence for translation start site

10
Novel Splice Isoform
11
Novel Splice Isoform
12
Novel Frame
13
Novel Frame
14
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
15
Novel Mutation
16
Genomic Peptide Sequences

Genomic DNA
Exons and introns, 6 frames, large (3 Gb → 6 Gb)
ESTs
No introns, 6 frames, large (4 Gb → 8 Gb)
Used by gene, protein, and alternative splicing annotation pipelines
Highly redundant, nucleotide error rate ~1%

17
Compressed EST Database

Six-frame translation of all ESTs
Optionally, ESTs that map to a gene
Eliminate ORFs < 30 amino acids
Amino-acid 30-mers
Observed in at least two ESTs
Represent AA 30-mers in C3 FASTA database
Complete, Correct, Compact
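
The filtering steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the talk: it uses the standard genetic code, treats an ORF as a stop-free stretch of a frame translation, translates unrecognized codons to "X" and skips k-mers containing them, and uses hypothetical function names. The defaults `k=30` and `min_ests=2` come from the slide.

```python
from collections import defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + m]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for m, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frames(dna):
    """Yield the six translations (3 forward, 3 reverse-complement)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            yield "".join(CODON.get(codon, "X") for codon in codons)


def orfs(translation, min_len):
    """Stop-free stretches of at least min_len amino acids."""
    return [orf for orf in translation.split("*") if len(orf) >= min_len]


def shared_kmers(ests, k=30, min_ests=2):
    """Amino-acid k-mers observed in at least min_ests different ESTs."""
    seen_in = defaultdict(set)
    for index, est in enumerate(ests):
        for translation in six_frames(est):
            for orf in orfs(translation, k):
                for i in range(len(orf) - k + 1):
                    kmer = orf[i:i + k]
                    if "X" not in kmer:
                        seen_in[kmer].add(index)
    return {kmer for kmer, ids in seen_in.items() if len(ids) >= min_ests}
```

With two toy ESTs and a small `k`, for instance `shared_kmers(["ATGGCTTGTGATGAA", "GCTTGTGATGAATTT"], k=4)`, the result contains the 4-mer `ACDE`, which both ESTs encode in their first forward frame.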

18
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
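
As the slide's figure is not part of this transcript, here is a minimal sketch of how such a graph could be built from the three example sequences. It assumes the usual sequencing-by-hybridization construction, in which every k-mer is an edge between its (k-1)-mer prefix and suffix. The value k = 4 and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict


def sbh_graph(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
branching = sorted(node for node, succ in graph.items() if len(succ) > 1)
print(branching)  # nodes with more than one outgoing edge
```

Under these assumptions the graph has ten distinct edges, and the nodes `DEF` and `EFG` are the only ones with more than one successor.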
19
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
20
Sequence Databases and CSBH-graphs

Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
21
Sequence Databases and CSBH-graphs

All k-mers represented by an edge have the same count

[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
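
The edge counts can be reproduced with a short sketch. It assumes k = 4, counts each k-mer once per sequence, and merges k-mers along paths whose inner nodes have exactly one incoming and one outgoing edge with equal counts. The equal-count condition is an assumption added here so that every merged edge carries a single count. The function names are hypothetical, and isolated cycles are not handled.

```python
from collections import Counter, defaultdict


def kmer_counts(sequences, k):
    """Number of sequences in which each k-mer occurs."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def compressed_edges(counts):
    """Merge k-mers along non-branching, equal-count paths.

    Returns {edge_sequence: count}.
    """
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for kmer in counts:
        outgoing[kmer[:-1]].append(kmer)
        incoming[kmer[1:]].append(kmer)

    def internal(node):
        ins, outs = incoming.get(node, []), outgoing.get(node, [])
        return (len(ins) == 1 and len(outs) == 1
                and counts[ins[0]] == counts[outs[0]])

    edges = {}
    for node in list(outgoing):
        if internal(node):
            continue  # compressed edges start only at non-internal nodes
        for kmer in outgoing[node]:
            label, node_len = kmer, len(kmer) - 1
            while internal(label[-node_len:]):
                label += outgoing[label[-node_len:]][0][-1]
            edges[label] = counts[kmer]
    return edges


counts = kmer_counts(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
edges = compressed_edges(counts)
# edges == {"ACDEF": 2, "DEFG": 2, "DEFACG": 1, "EFGI": 2, "EFGEFG": 1}
```

With these assumptions the five compressed edges carry the counts 2, 2, 1, 2, and 1, and every 4-mer inside an edge has that edge's count.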
22
cSBH-graphs

Quickly determine those that occur twice

[Figure: CSBH-graph with edge counts 2, 2, 1, 2]
23
Compressed-SBH-graph
[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2]
ACDEFGI
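
A small check, again assuming k = 4, shows in what sense the single sequence ACDEFGI can stand in for the k-mers seen in at least two of the example sequences: its k-mers are exactly that set, with none missing and none extra. The function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def repeated_kmers(sequences, k, min_count=2):
    """k-mers that occur in at least min_count of the sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def is_complete_and_correct(fasta_seqs, target, k):
    """True if the sequences contain every target k-mer and no others."""
    represented = set().union(*(kmers(s, k) for s in fasta_seqs))
    return represented == target


target = repeated_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
print(sorted(target))                                     # ['ACDE', 'CDEF', 'DEFG', 'EFGI']
print(is_complete_and_correct(["ACDEFGI"], target, k=4))  # True
```

Compactness is not tested here. It would require showing that no shorter set of sequences covers the same k-mers.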
24
Compressed EST Database

Gene-centric compressed EST peptide sequence database
20,774 sequence entries
8 Gb vs. 223 Mb
35-fold compression
22 hours becomes 15 minutes
E-values improve by similar factor!
Makes routine EST searching feasible
Search ESTs instead of IPI?
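
The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes 1 Gb = 1000 Mb, and it assumes an E-value that scales linearly with the number of candidate sequences searched, which is a common model but is not stated on the slide. The function name is hypothetical.

```python
# Size ratio: 8 Gb of translated ESTs versus a 223 Mb compressed database.
size_ratio = 8_000 / 223

# Search time ratio: 22 hours versus 15 minutes.
time_ratio = (22 * 60) / 15


def rescaled_e_value(e_value, shrink_factor):
    """E-value after the database shrinks by shrink_factor.

    Assumes the E-value is proportional to database size.
    """
    return e_value / shrink_factor


print(round(size_ratio, 1))  # about 36, in line with the quoted 35-fold compression
print(time_ratio)            # 88.0, a larger speed-up than the size ratio alone
```

Under the linear assumption, a match with an E-value of 0.35 against the full translation would score about 0.01 against a database 35 times smaller.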

25
Conclusions

Peptides identify more than just proteins
Compressed peptide sequence databases make routine EST searching feasible
cSBH-graphs with edge counts support C2/C3 enumeration algorithms
Minimal FASTA representation of k-mer sets

26
Collaborators

Chau-Wen Tseng, Xue Wu
Computer Science
Catherine Fenselau, Crystal Harvey
Biochemistry
Calibrant Biosystems
Thanks to PeptideAtlas, X!Tandem
