Title: Proteomics: A Challenge for Technology and Information Science
Slide 1: Proteomics: A Challenge for Technology and Information Science
CBCB Seminar, November 21, 2005
Tim Griffin, Dept. of Biochemistry, Molecular Biology and Biophysics
tgriffin_at_umn.edu
Slide 2: What is proteomics?
"Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function."
- Stan Fields in Science, 2001
Slide 3: Genomics vs. Proteomics
Similarities
- Large datasets; tools needed for annotation and interpretation of results
Differences
- Genomics: generally mature technologies and data processing methods; questions asked usually involve quantitative changes in RNA transcripts (microarrays)
- Proteomics: still evolving; complexity of protein biochemical properties (expression changes, modifications, interactions, activities); many questions to ask and data to interpret; methods changing; different approaches (mass spec, arrays, etc.)
Slide 4: Genomics, Proteomics, and Systems Biology
Slide 5: Shotgun identification of proteins in mixtures by LC-MS/MS
Liquid chromatography coupled to tandem mass spectrometry (MS/MS)
µLC separation (50-100 µm)
Tandem mass spectra (thousands in a matter of hours)
Slide 6: Peptide sequence determination from MS/MS spectra
Collision-induced dissociation (CID) creates two prominent ion series:
- b-series (b1 to b14), labeled from the N-terminus
- y-series (y1 to y14), labeled from the C-terminus
H2N-N--S--G--D--I--V--N--L--G--S--I--A--G--R-COOH
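The b- and y-series for the peptide on this slide can be computed from standard monoisotopic residue masses. The following is a minimal sketch, assuming singly charged fragments and no modifications; the function name `fragment_ions` is illustrative and not taken from the talk.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # added to y ions (they keep the C-terminal OH and an extra H)
PROTON = 1.00728   # charge carrier for a singly charged ion


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_i covers the first i residues (N-terminal side),
    y_i covers the last i residues (C-terminal side).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


PEPTIDE = "NSGDIVNLGSIAGR"
b_ions, y_ions = fragment_ions(PEPTIDE)
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i:<2} {b:10.4f}    y{i:<2} {y:10.4f}")
```

A useful sanity check is that each complementary pair b_i and y_(n-i) sums to the neutral peptide mass plus two protons.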
Slide 7: Peptide sequence identifies the protein
[Figure: MS/MS spectrum of H2N-NSGDIVNLGSIAGR-COOH, relative abundance vs. m/z (200 to 1200). Peaks are labeled with the fragment sequences GDIVNLGSIAGR, DIVNLGSIAGR, IVNLGSIAGR, VNLGSIAGR, NLGSIAGR, LGSIAGR, GSIAGR, SIAGR, IAGR, AGR, GR, and R.]
YMR134W, yeast protein involved in iron metabolism
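The labeled ladder in the spectrum works because neighboring fragments differ by exactly one residue mass. As a rough sketch, assuming an ideal, complete, singly charged y-ion ladder (real spectra have gaps and noise), the sequence can be read back from the mass differences. The helper names are hypothetical, and the ladder is generated from theoretical masses rather than taken from the figure.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER, PROTON = 18.01056, 1.00728


def y_ladder(peptide):
    """Theoretical singly charged y-ion m/z values, y1 first."""
    total, ladder = WATER + PROTON, []
    for aa in reversed(peptide):
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def residue_for(delta, tol=0.01):
    """Map a mass difference to a residue; isobaric residues are joined with '/'."""
    hits = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)
    return "/".join(hits) if hits else "?"


def read_ladder(y_mz):
    """Infer residues (N- to C-terminal order) from ascending y-ion m/z values."""
    residues = [residue_for(y_mz[0] - WATER - PROTON)]      # y1 is the C-terminal residue
    for low, high in zip(y_mz, y_mz[1:]):
        residues.append(residue_for(high - low))            # each step adds one residue
    return residues[::-1]


sequence = read_ladder(y_ladder("GDIVNLGSIAGR"))
print("-".join(sequence))
```

Leucine and isoleucine have identical masses, so the sketch reports them as `I/L`; mass differences alone cannot tell them apart.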
Slide 8: High-throughput protein identification by LC-MS/MS and automated sequence database searching
Raw MS/MS spectrum -> protein sequence and/or DNA sequence database search -> peptide sequence match -> protein identification
Direct identification of 1000 proteins from complex mixtures
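The last two steps of this flow (peptide sequence match, then protein identification) can be illustrated with a toy lookup. This is only a sketch: real search engines score each spectrum against theoretical spectra, whereas this example assumes the peptide sequence is already known and simply finds which database entries yield it after an in-silico tryptic digest. The two database entries are made-up sequences with hypothetical names, not real proteins.

```python
from collections import defaultdict


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])   # C-terminal peptide without K/R
    return peptides


def build_index(database):
    """Map each tryptic peptide to the set of proteins that contain it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            index[peptide].add(name)
    return index


# Hypothetical toy database (invented sequences for illustration only).
TOY_DATABASE = {
    "toy_protein_A": "MKTAYIAKNSGDIVNLGSIAGRLLEK",
    "toy_protein_B": "MSEQRPLVKGDAAK",
}

index = build_index(TOY_DATABASE)
matches = sorted(index.get("NSGDIVNLGSIAGR", set()))
print(matches)
```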
Slide 9: Dealing with the data
1. Data acquisition
- Experimental information, metadata capture
2. Peak analysis
- Sequence database searching
- Quantitative analysis
3. Knowledge annotation and interpretation
- Database mining
- Assignment of function, pathway, localization, etc.
- Output for database archiving, publication
Integrated workflow?
Slide 10: 1. Data acquisition: capturing experimental information
Proteomics Experimental Data Repository (PEDRo): proposed schema
- Similar to genomic needs, but the experimental information is a bit different
Slide 11: 2. Peak Analysis
Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, and DNA sequences
- ProFound
- Mascot
- PepSea
- MS-Fit
- MOWSE
- Peptident
- Multident
- Sequest
- PepFrag
- MS-Tag
Protein identification
- Needs CPU horsepower (parallel computing)
Slide 12: 2. Peak Analysis: data formats
[Diagram: Format 1, Format 2, and Format 3 paired with Output 1, Output 2, and Output 3, with question marks between them]
- Lack of flexibility
- Slow to evolve
- Lack of incorporation of competing products and methods
Slide 13: 2. Peak Analysis: need general, flexible, in-house solutions
Format 1, Format 2, Format 3 -> reverse engineering of data formats -> general tools for analysis of multiple data formats
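One way to picture "general tools for multiple data formats" is a set of small readers that all produce the same in-memory representation, so downstream analysis does not care where the data came from. The sketch below uses two open text formats as stand-ins for the vendor formats on the slide: Mascot generic format (MGF) and Sequest-style .dta. Both readers are simplified, the `Spectrum` class and `load` function are hypothetical names, and the example peak values are invented.

```python
from dataclasses import dataclass, field

PROTON = 1.00728


@dataclass
class Spectrum:
    precursor_mz: float
    charge: int
    peaks: list = field(default_factory=list)   # list of (m/z, intensity)


def parse_mgf(text):
    """Simplified MGF reader: handles PEPMASS, CHARGE, and peak lines only."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = Spectrum(0.0, 0)
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif current is None or not line:
            continue
        elif line.startswith("PEPMASS="):
            current.precursor_mz = float(line.split("=", 1)[1].split()[0])
        elif line.startswith("CHARGE="):
            current.charge = int(line.split("=", 1)[1].rstrip("+"))
        elif "=" in line:
            continue                              # other headers, e.g. TITLE
        else:
            mz, intensity = line.split()[:2]
            current.peaks.append((float(mz), float(intensity)))
    return spectra


def parse_dta(text):
    """Simplified .dta reader: first line is (M+H)+ mass and charge, then peaks."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    mh, charge = float(rows[0][0]), int(rows[0][1])
    precursor_mz = (mh + (charge - 1) * PROTON) / charge
    peaks = [(float(row[0]), float(row[1])) for row in rows[1:]]
    return [Spectrum(precursor_mz, charge, peaks)]


READERS = {"mgf": parse_mgf, "dta": parse_dta}


def load(text, fmt):
    """Dispatch to the reader registered for the given format name."""
    return READERS[fmt](text)


MGF_TEXT = """BEGIN IONS
TITLE=example scan
PEPMASS=672.36
CHARGE=2+
175.119 1200.0
232.140 800.0
END IONS
"""
DTA_TEXT = """1343.71272 2
175.119 1200.0
232.140 800.0
"""

mgf_spectra = load(MGF_TEXT, "mgf")
dta_spectra = load(DTA_TEXT, "dta")
print(mgf_spectra[0])
print(dta_spectra[0])
```

Supporting a new format then means registering one more reader in `READERS`; the analysis code stays unchanged.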
Slide 14: 2. Peak Analysis: reverse engineering data formats
http://sashimi.sourceforge.net/software_glossolalia.html
Slide 15: 2. Peak analysis: quality control of protein matches
Unfiltered: 10^5 matches (lots of noise and junk) -> filtering -> Filtered: thousands of true matches
- Statistical analysis of database results (tools are available)
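The slide does not say which statistical method is used for filtering. As one common illustration, and not necessarily the approach meant here, the sketch below applies a target-decoy cutoff: matches against a decoy (for example reversed) database estimate how many target matches at a given score are likely wrong. The scores and the 25% threshold are invented purely to keep the example small.

```python
def filter_by_fdr(matches, max_fdr=0.01):
    """Keep target matches above the lowest score cutoff with estimated FDR <= max_fdr.

    matches: list of (score, is_decoy) tuples; higher score means better match.
    FDR at a cutoff is estimated as (decoy count) / (target count) above it.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = keep = 0
    for position, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            keep = position            # deepest rank that still meets the threshold
    return [m for m in ranked[:keep] if not m[1]]


# Invented scores: True marks a decoy match.
example = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, False),
    (6.5, True), (6.1, False), (5.0, True), (4.8, False),
]
accepted = filter_by_fdr(example, max_fdr=0.25)
print(accepted)
```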
Slide 16: 2. Peak Analysis: Quantitative analysis
- External chemical labeling
- Metabolic labeling (SILAC)
- Enzymatic incorporation (O16/O18)
- Flexibility is key: need tools to handle different quantitative methods
Slide 17: 2. Peak Analysis: Quantitative analysis
[Figure: paired peaks from Sample 1 and Sample 2]
Relative intensity reflects relative protein abundance
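The idea that relative intensity tracks relative abundance can be made concrete with a small calculation. This is a minimal sketch, assuming each peptide of a protein contributes one pair of intensities (Sample 1, Sample 2) and that the median peptide ratio is used as the protein-level summary; the median is an illustrative choice, and the intensities are invented.

```python
from math import log2
from statistics import median


def protein_ratio(peptide_pairs):
    """Median Sample 2 / Sample 1 intensity ratio across a protein's peptides."""
    ratios = [s2 / s1 for s1, s2 in peptide_pairs if s1 > 0]
    return median(ratios)


# Invented (Sample 1, Sample 2) peak intensities for three peptides of one protein.
pairs = [(1000.0, 2000.0), (500.0, 1100.0), (800.0, 1500.0)]

ratio = protein_ratio(pairs)
print(f"ratio = {ratio:.2f}, log2 ratio = {log2(ratio):.2f}")
```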
Slide 18: Evolving methodologies: iTRAQ
Samples 1, 2, 3, 4 -> each digested to peptides -> iTRAQ labels 114, 115, 116, 117 -> multidimensional separation -> MS/MS spectrum
[Figure: MS/MS spectrum, intensity vs. m/z, with peaks at 114, 115, 116, and 117 for samples 1, 2, 3, and 4]
- Diagnostic ions used for quantitative analysis
- Peptide fragments used for sequence identification
- 4-way multiplexing: simultaneous comparison of multiple states, replicates
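Reading the diagnostic ions out of an MS/MS spectrum amounts to summing the signal in a narrow window around each reporter mass. The sketch below assumes nominal reporter positions of 114.1, 115.1, 116.1, and 117.1 with a 0.05 m/z tolerance; both values are illustrative assumptions and should be tuned to the instrument. The peak list is invented.

```python
# Nominal iTRAQ 4-plex reporter positions (assumed, approximate m/z).
REPORTERS = {"114": 114.1, "115": 115.1, "116": 116.1, "117": 117.1}


def reporter_intensities(peaks, tol=0.05):
    """Sum the intensity of all peaks within tol of each reporter m/z."""
    return {
        label: sum(inten for mz, inten in peaks if abs(mz - target) <= tol)
        for label, target in REPORTERS.items()
    }


def relative_to(intensities, reference="114"):
    """Express each channel as a ratio to the reference channel."""
    ref = intensities[reference]
    return {label: value / ref for label, value in intensities.items()}


# Invented MS/MS peak list: four reporter peaks plus one sequence fragment.
peaks = [
    (114.11, 1000.0), (115.11, 2000.0), (116.11, 500.0), (117.11, 1000.0),
    (175.12, 3500.0),
]

channels = reporter_intensities(peaks)
ratios = relative_to(channels)
print(ratios)
```

Peaks outside the reporter windows, such as the fragment at 175.12, are ignored here; they are the ones used for sequence identification.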
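Slide 19: Need for changeable tools
[Figure: reporter-ion region of an MS/MS spectrum (intensity vs. m/z), annotated "old" and "new", with peaks at 114.1005 (1), 115.0963 (2), 116.0972 (3), and 117.1025 (4)]
Automated analysis tools?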
Slide 20: 3. Knowledge annotation: making sense of lists of data
Slide 21: 3. Knowledge annotation: mining proteomic/genomic databases
Slide 22: 3. Knowledge annotation: needs
- Annotation: accession numbers and protein names
- Functional assignments (functional degeneracy?)
- Pathway assignments
- Subcellular localization
- Disease implications
- Comparison of different proteomic datasets (e.g. expression profiles compared to modification state profiles, other protein properties)
- Automated and streamlined??
- Publication and deposit in databases
- Visualization of complex phenomena, interpretation of biological relevance
- Modeling, integration with genomics data: computational and systems biology