Title: Predicting Protein Structures and Structural Features on a Genomic Scale Pierre Baldi School of Information and Computer Sciences Institute for Genomics and Bioinformatics University of California, Irvine
1Predicting Protein Structures and Structural Features on a Genomic Scale
Pierre Baldi
School of Information and Computer Sciences
Institute for Genomics and Bioinformatics
University of California, Irvine
2UNDERSTANDING INTELLIGENCE
- Human intelligence (inverse problem)
- AI (direct problem)
- Choice of specific problems is key
- Protein structure prediction is a good problem
3PROTEINS
[Figure: polypeptide backbone diagram with N, Cα, and Cβ atoms and side chains R1, R2, R3]
5Utility of Structural Information
(Baker and Sali, 2001)
6CAVEAT
7REMARKS
- Structure/Folding
- Backbone/Full Atom
- Homology Modeling
- Fold Recognition (Threading)
- Ab Initio (Physical Potentials/Molecular Dynamics, Statistical Mechanics/Lattice Models)
- Statistical/Machine Learning (Training Sets, SS prediction)
- Mixtures: ab initio with statistical potentials, machine learning with profiles, etc.
8PROTEIN STRUCTURE PREDICTION (ab initio)
10Helices
- 1GRJ (GreA transcript cleavage factor from Escherichia coli)
11Antiparallel ß-sheets
- 1MSC (Bacteriophage MS2 unassembled coat protein dimer)
12Parallel ß-sheets
13Contact map
14Secondary structure prediction
15GRAPHICAL MODELS: BAYESIAN NETWORKS
- $X_1, \dots, X_n$: random variables associated with the vertices of a DAG (Directed Acyclic Graph)
- The local conditional distributions $P(X_i \mid X_j : j \text{ parent of } i)$ are the parameters of the model. They can be represented by look-up tables (costly) or other more compact parameterizations (Sigmoidal Belief Networks, XOR, etc.).
- The global distribution is the product of the local characteristics: $P(X_1, \dots, X_n) = \prod_i P(X_i \mid X_j : j \text{ parent of } i)$
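The factorization on this slide can be illustrated with a minimal sketch. The three-node DAG and the table values below are hypothetical, chosen only to show the global distribution being computed as a product of look-up-table entries.

```python
from itertools import product

# Hypothetical DAG A -> C <- B over binary variables (not from the slides).
parents = {"A": (), "B": (), "C": ("A", "B")}

# Look-up tables: cpt[node][parent_values] = P(node = 1 | parent values).
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
}

def joint(assignment):
    """Global distribution as the product of the local conditionals."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p_one = cpt[node][pa_values]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

# Summing the joint over all 2^3 assignments should give 1.
total = sum(joint(dict(zip("ABC", v))) for v in product((0, 1), repeat=3))
```

The look-up table for C already needs four entries for two binary parents, which is the cost the slide alludes to when it mentions more compact parameterizations.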
21DATA PREPARATION
- Starting point: PDB database.
- Remove sequences not determined by X-ray diffraction.
- Remove sequences where DSSP crashes.
- Remove proteins with physical chain breaks (neighboring AA with distances exceeding 4 Angstroms).
- Remove sequences with resolution worse than 2.5 Angstroms.
- Remove chains with fewer than 30 AA.
- Remove redundancy (Hobohm's algorithm, Smith-Waterman, PAM 120, etc.).
- Build multiple alignments (BLAST, PSI-BLAST, etc.).
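The per-chain filters above could be sketched as a single predicate. The `Chain` record and its field names are hypothetical; parsing PDB files, running DSSP, redundancy removal, and alignment building are outside this sketch.

```python
from dataclasses import dataclass

@dataclass
class Chain:
    # Hypothetical record; field names are illustrative assumptions.
    pdb_id: str
    method: str               # experimental method, e.g. "X-RAY DIFFRACTION"
    resolution: float         # Angstroms (smaller is better)
    length: int               # number of amino acids
    max_neighbor_dist: float  # largest distance between neighboring AA, Angstroms
    dssp_ok: bool             # True if DSSP ran without crashing

def keep(chain: Chain) -> bool:
    """Apply the slide's thresholds: X-ray only, no breaks > 4 A, <= 2.5 A, >= 30 AA."""
    return (
        chain.method == "X-RAY DIFFRACTION"
        and chain.dssp_ok
        and chain.max_neighbor_dist <= 4.0
        and chain.resolution <= 2.5
        and chain.length >= 30
    )

chains = [
    Chain("1abc", "X-RAY DIFFRACTION", 1.8, 120, 3.8, True),
    Chain("2xyz", "SOLUTION NMR", 0.0, 80, 3.8, True),        # not X-ray
    Chain("3def", "X-RAY DIFFRACTION", 3.1, 200, 3.8, True),  # poor resolution
    Chain("4ghi", "X-RAY DIFFRACTION", 2.0, 25, 3.8, True),   # too short
    Chain("5jkl", "X-RAY DIFFRACTION", 2.0, 90, 7.5, True),   # chain break
]
kept = [c.pdb_id for c in chains if keep(c)]
```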
22SECONDARY STRUCTURE PROGRAMS
- DSSP (Kabsch and Sander, 1983) works by assigning potential backbone hydrogen bonds (based on the 3D coordinates of the backbone atoms) and subsequently identifying repetitive bonding patterns.
- STRIDE (Frishman and Argos, 1995) uses dihedral angles in addition to hydrogen bonds.
- DEFINE (Richards and Kundrot, 1988) uses difference distance matrices to evaluate the match of interatomic distances in the protein to those from idealized SS.
23SECONDARY STRUCTURE ASSIGNMENTS
- DSSP classes
  - H: alpha helix
  - E: sheet
  - G: 3-10 helix
  - S: bend (a kind of turn)
  - T: beta turn
  - B: beta bridge
  - I: pi-helix (very rare)
  - C: the rest
- CASP (harder) assignment
  - α: H and G
  - β: E and B
  - γ: the rest
- Alternative assignment
  - α: H
  - β: B
  - γ: the rest
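As a small illustration, the CASP assignment can be written as a character mapping from the eight DSSP classes to three. The output letters H, E, and C are an assumption, chosen to match the "True SS" strings used later in this deck.

```python
# CASP (harder) assignment: H and G -> helix, E and B -> strand, the rest -> coil.
CASP_MAP = {"H": "H", "G": "H", "E": "E", "B": "E"}

def reduce_dssp(ss8: str) -> str:
    """Collapse an 8-class DSSP string to 3 classes (H, E, C)."""
    return "".join(CASP_MAP.get(c, "C") for c in ss8)

example = reduce_dssp("HGIEBTSC")
```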
24ENSEMBLES
28FUNDAMENTAL LIMITATIONS
- 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE FOR SEVERAL REASONS
- SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY NEED CHAPERONES
- QUATERNARY STRUCTURE: BETA-STRAND PARTNERS MAY BE ON A DIFFERENT CHAIN
- STRUCTURE MAY DEPEND ON OTHER VARIABLES: ENVIRONMENT, PH
- DYNAMICAL ASPECTS
- FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES
31BB-RNNs
322D RNNs
332D INPUTS
- AA at positions i and j
- Profiles at positions i and j
- Correlated profiles at positions i and j
- Secondary Structure, Accessibility, etc.
35PERFORMANCE (%)
Cutoff        6Å    8Å    10Å   12Å
non-contacts  99.9  99.8  99.2  98.9
contacts      71.2  65.3  52.2  46.6
all           98.5  97.1  93.2  88.5
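The three rows of the table can be read as per-class percentages of correctly predicted residue pairs. A minimal sketch follows, assuming a contact means two residues whose representative atoms lie closer than the cutoff; the transcript does not say which atoms or which sequence-separation filter were used, so both are left out here.

```python
import math

def contact_map(coords, cutoff):
    """Binary map: 1 where two residues are closer than cutoff (Angstroms)."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) < cutoff else 0
             for j in range(n)] for i in range(n)]

def class_accuracy(true_map, pred_map):
    """Percent correct over pairs i < j: (non-contacts, contacts, all)."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    n = len(true_map)
    for i in range(n):
        for j in range(i + 1, n):
            t = true_map[i][j]
            totals[t] += 1
            hits[t] += int(pred_map[i][j] == t)
    pct = lambda h, t: 100.0 * h / t if t else float("nan")
    return (pct(hits[0], totals[0]), pct(hits[1], totals[1]),
            pct(hits[0] + hits[1], totals[0] + totals[1]))

# Toy example: four points on a line, 3.8 Angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
true_map = contact_map(coords, 8.0)
all_zero = [[0] * 4 for _ in range(4)]
non_contact_acc, contact_acc, overall_acc = class_accuracy(true_map, all_zero)
```

In the toy example a predictor that outputs no contacts at all is perfect on non-contacts and useless on contacts, which is why the table reports the classes separately.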
36Protein Reconstruction
Using predicted secondary structure and predicted
contact map
PDB ID: 1HCR, chain A (52 residues)
Sequence: GRPRAINKHEQEQISRLLEKGHPRQQLAIIFGIGVSTLYRYFPASSIKKRMN
True SS:  CCCCCCCCHHHHHHHHHHHCCCCHHHHHHHCECCHHHHHHHCCCCCCCCCCC
Pred SS:  CCCCCCCHHHHHHHHHHHHCCCCHHHHEEHECHHHHHHHHCCCHHHHHHHCC
Model 147, RMSD 3.47 Å
37Protein Reconstruction
Using predicted secondary structure and predicted
contact map
PDB ID: 1BC8, chain C (93 residues)
Sequence: MDSAITLWQFLLQLLQKPQNKHMICWTSNDGQFKLLQAEEVARLWGIRKNKPNMNYDKLSRALRYYYVKNIIKKVNGQKFVYKFVSYPEILNM
True SS:  CCCCCCHHHHHHHHCCCHHHCCCCEECCCCCEEECCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHCCEEECCCCCCEEEECCCCHHHCC
Pred SS:  CCCHHHHHHHHHHHHHCCCCCCEEEEECCCEEEEECCHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHCCCEEECCCCEEEEEEECCHHHHCC
Model 1714, RMSD 4.21 Å
38CASP6 Self Assessment
Evaluation based on GDT_TS of first submitted model
GDT_TS: Global Distance Test Total Score
GDT_TS = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8) / 4
Pn: percentage of residues under distance cutoff n
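The GDT_TS formula can be sketched directly from per-residue deviations. This simplified version assumes the model has already been superimposed on the native structure once; the real GDT procedure searches over superpositions for each cutoff, which is not reproduced here. Cutoffs are taken to be in Angstroms and inclusive, both assumptions.

```python
def gdt_ts(distances):
    """Average of the percentages of residues within 1, 2, 4 and 8 Angstroms.

    distances: per-residue model-to-native distances after one fixed
    superposition (a simplifying assumption of this sketch).
    """
    n = len(distances)

    def pct_within(cutoff):
        return 100.0 * sum(d <= cutoff for d in distances) / n

    return sum(pct_within(c) for c in (1.0, 2.0, 4.0, 8.0)) / 4.0

# P1 = 20, P2 = 40, P4 = 60, P8 = 80, so the score is their mean.
score = gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0])
```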
39Hard Target Summary
- Top 10 groups displayed, of 65 registered servers
- Assessment on 25 new fold and fold recognition analogous target domains
N: number of targets predicted; Av.R.: average rank; sumZ: sum of Z scores on all targets in set; sumZpos: sum of Z scores for predictions with positive Z score
Group               N   Av.R.  sumZ   sumZpos
BAKER-ROBETTA       25   9.12  27.81  27.94
baldi-group-server  24  10.04  20.47  22.33
Rokky               25  11.60  17.56  18.46
Pmodeller5          20  12.55  14.35  16.11
ZHOUSPARKS2         25  15.00  12.67  15.91
ACE                 25  13.08  11.63  14.89
Pcomb2              24  16.04  10.82  13.58
RAPTOR              24  16.33   9.81  13.21
zhousp3             25  15.72   9.30  12.90
PROTINFO-AB         19  16.74   8.62  12.59
40Hard Target Summary
- Top 10 groups displayed, of 65 registered servers
- Assessment on 19 new fold and fold recognition analogous target domains of less than 120 residues
N: number of targets predicted; Av.R.: average rank; sumZ: sum of Z scores on all targets in set; sumZpos: sum of Z scores for predictions with positive Z score
Group               N   Av.R.  sumZ   sumZpos
baldi-group-server  19   6.74  20.61  20.61
BAKER-ROBETTA       19   9.11  20.44  20.57
Rokky               19  12.11  12.30  13.20
PROTINFO-AB         16  12.63  11.53  12.59
ZHOUSPARKS2         19  15.32   8.91  11.87
Pcomb2              18  15.39   9.54  11.48
Pmodeller5          15  14.47   9.25  11.00
PROTINFO            18  16.22   8.66  10.56
ACE                 19  14.21   6.98  10.24
RAPTOR              18  17.00   7.22   9.64
41Target T0281: Detailed Target Analysis
- Target Information
  - Length: 70 amino acids
  - Resolution: 1.52 Å
  - PDB code: 1WHZ
  - Description: Hypothetical protein from Thermus thermophilus HB8
  - Domains: single domain
- Assessment
  - GDT_TS server rank of our 1st model: 2
  - GDT_TS: 51.07
  - RMSD to native: 6.15
42Target T0281: Contact Map Comparison
Note: true map is lower left
True Map vs. Predicted Map
True Map vs. Recovered Map
43Target T0281: Structure Comparison
true structure
predicted structure
44Target T0281: Structure Comparison, Superposition
True structure: thick trace
Predicted structure: thin trace
45Target T0280_2: Detailed Target Analysis
- Target Information
  - Length: 51 amino acids
  - Resolution: 2.00 Å
  - PDB code: 1WD5
  - Description: Putative phosphoribosyl transferase, T. thermophilus
  - Domains: 2nd domain, residues 53-103 of 208 AA sequence
- Assessment
  - GDT_TS server rank of our 1st model: 1 (also 1st among human groups)
  - GDT_TS: 54.41
  - RMSD to native: 5.81
46Target T0280_2: Contact Map Comparison
Note: true map is lower left
True Map vs. Predicted Map
True Map vs. Recovered Map
47Target T0281: Structure Comparison
true structure
predicted structure
48Target T0281: Structure Comparison, Superposition
True structure: thick trace
Predicted structure: thin trace
49THE SCRATCH SUITE
- www.igb.uci.edu
- DOMpro: domains
- DISpro: disordered regions
- SSpro: secondary structure
- SSpro8: secondary structure
- ACCpro: accessibility
- CONpro: contact number
- DI-pro: disulphide bridges
- BETA-pro: beta partners
- CMAP-pro: contact map
- CCMAP-pro: coarse contact map
- CON23D-pro: contact map to 3D
- 3D-pro: 3D structure (homology, fold recognition, ab initio)
51
SISQQTVWNQMATVRTPLNFDSSKQSFCQFSVDLLGGGISVDKTGDWITLVQNSPISNLL
CCCECCCCCCEEEECCCCCCCCCCCCEEEEEEECCCCEEEECCCCCCEEEEECCHHHHHH
CCCEEEEECEEEEECCCCCCCTCCCCEEEEEEEETCSEEEECTTTTEEEEEECCHHHHHH
---------------------------------------
--------------------------
-------------------------------
--------------------------
eeeeee---e--e-e-eee-ee-eee---------e-e--eeeeee--------------

RVAAWKKGCLMVKVVMSGNAAVKRSDWASLVQVFLTNSNSTEHFDACRWTKSEPHSWELI
HHHHHHCCCEEEEEEEEEECCEEECCCCCEEEEEEEECCCCCCCCCEEEEEECCCCCCCC
HHHHHHTTCEEEEEEEEEEEEEEECCCCCEEEEEEEECCCTTCCCEEEEEEECCTCCEEE
---------------------------------
------------------------
---------------------
----------------------
-----ee---e-------e-e-ee-e-e-e-----e--eeee--e-------e-e-ee-e
52Advantage of Machine Learning
- Pitfalls of traditional ab initio approaches
- Machine learning systems take time to train (weeks).
- Once trained, however, they can predict structures almost faster than proteins can fold.
- Predict or search protein structures on a genomic or bioengineering scale.
53DAG-RNNs APPROACH
- Two steps
  - 1. Build the relevant DAG to connect inputs, outputs, and hidden variables
  - 2. Use a deterministic (neural network) parameterization together with appropriate stationarity assumptions/weight sharing; the overall model remains probabilistic
- Process structured data of variable size, topology, and dimensions efficiently
- Sequences, trees, d-lattices, graphs, etc.
- Convergence theorems
- Other applications
55Convergence Theorems
- Posterior Marginals
  - sBN → dBN in distribution
  - sBN → dBN in probability (uniformly)
- Belief Propagation
  - sBN → dBN in distribution
  - sBN → dBN in probability (uniformly)
56Structural Databases
- PPDB: Poxvirus Proteomic Database
- ICBS: Inter Chain Beta Sheet Database
60Strategies for drug design
- Block, modulate, mediate ß-sheet interactions
- Covalent modification of a chain to prevent
ß-sheet formation
63Three-Stage Prediction of Protein Beta-Sheets
Using Neural Networks, Alignments, and Graph
Algorithms
- Jianlin Cheng and Pierre Baldi
- School of Info. and Computer Sci.
- University of California Irvine
64Beta-Sheet Architecture
65Importance of Predicting Beta-Sheet Structure
- AB-Initio Structure Prediction
- Fold Recognition
- Model Refinement
- Protein Design
- Protein Stability
66Previous Work
- Methods
  - Statistical potential approach for strand alignment (Hubbard, 1994; Zhu and Braun, 1999)
  - Statistical potentials to improve beta-sheet secondary structure prediction (Asogawa, 1997)
  - Information theory approach for strand alignment (Steward and Thornton, 2000)
  - Neural networks for beta-residue contacts (Baldi et al., 2000)
- Shortcomings
  - Focus on a single aspect; do not use structural context and evolutionary information; do not exploit constraints enough; not publicly available.
67Three-Stage Prediction of Beta-Sheets
- Stage 1
  - Predict beta-residue pairings using 2D-Recursive Neural Networks (2D-RNN).
- Stage 2
  - Align beta-strands using alignment algorithms.
- Stage 3
  - Predict beta-strand pairs and beta-sheet architecture using graph algorithms.
68Dataset and Statistics
Num
Chains 916
Beta residues 48,996
Residue Pairs 31,638
Beta Strand 10,745
Strand Pairs 8,172
Beta Sheet 2,533
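69Stage 1: Prediction of Beta-Residue Pairings Using 2D-RNN
- Input matrix I (m×m) and target/output matrix (m×m), both indexed by residue pair (i, j)
- 2D-RNN: O = f(I)
- T_ij = 0/1 (target); O_ij = pairing probability
- Input I_ij covers positions i-2, i-1, i, i+1, i+2 and j-2, j-1, j, j+1, j+2, plus |i-j|
- Each position contributes 20 profile, 3 SS, and 2 SA values
- Total: 251 inputs (10 positions × 25 values + 1 for |i-j|)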
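The 251-input count can be checked with a small sketch that assembles the input vector for one residue pair. Zero-padding at the sequence ends and feeding the raw separation |i-j| as the last input are assumptions of this sketch; the slide does not say how either is handled.

```python
WINDOW = (-2, -1, 0, 1, 2)
PER_POSITION = 20 + 3 + 2  # profile + secondary structure + solvent accessibility

def pair_input(features, i, j):
    """Build the input vector I_ij from per-residue feature vectors.

    features: list of length-25 vectors, one per residue.
    Positions falling outside the sequence are zero-padded (assumption).
    """
    pad = [0.0] * PER_POSITION
    vec = []
    for center in (i, j):
        for offset in WINDOW:
            k = center + offset
            vec.extend(features[k] if 0 <= k < len(features) else pad)
    vec.append(float(abs(i - j)))  # sequence separation, assumed unscaled
    return vec

# Toy sequence of 8 residues where every feature of residue k equals k.
toy = [[float(k)] * PER_POSITION for k in range(8)]
x = pair_input(toy, 1, 6)
```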
70An Example Target
Protein 1VJG
Beta-Residue Pairing Map (Target Matrix)
71An Example Output
72Stage 2: Beta-Strand Alignment
- Use output probability matrix as scoring matrix
- Dynamic programming
- Disallow gaps and use simplified searching algorithms
- Two strands of lengths m and n can be aligned anti-parallel or parallel
- Total number of alignments: 2(m+n-1)
73Strand Alignment and Pairing Matrix
- The alignment score (Pseudo Binding Energy) is the sum of the probabilities of paired residues.
- The best alignment is the alignment with maximum score.
- Strand Pairing Matrix.
Strand Pairing Matrix of 1VJG
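Because gaps are disallowed, the best alignment of two strands can be found by exhaustive enumeration. The sketch below is one possible reading of the slides, not the authors' implementation: it slides one strand along the other in both orientations and scores each placement by summing pairing probabilities. The dictionary `prob` keyed by residue-index pairs is a hypothetical stand-in for the Stage 1 output matrix.

```python
def gapless_alignments(a, b):
    """Yield (direction, residue_pairs) for every gapless alignment.

    a, b: lists of residue indices of two strands (lengths m and n).
    There are m + n - 1 shifts per orientation, 2(m + n - 1) in total.
    """
    m, n = len(a), len(b)
    for direction, b_seq in (("parallel", b), ("antiparallel", b[::-1])):
        for shift in range(-(n - 1), m):
            pairs = [(a[k + shift], b_seq[k])
                     for k in range(n) if 0 <= k + shift < m]
            yield direction, pairs

def best_alignment(a, b, prob):
    """Return (score, direction, pairs) with the maximum pseudo-energy."""
    scored = ((sum(prob.get(p, 0.0) for p in pairs), d, pairs)
              for d, pairs in gapless_alignments(a, b))
    return max(scored, key=lambda t: t[0])

# Toy strands and probabilities (hypothetical values).
strand_a, strand_b = [0, 1, 2], [10, 11]
prob = {(0, 11): 0.9, (1, 10): 0.8}
n_alignments = len(list(gapless_alignments(strand_a, strand_b)))
score, direction, pairs = best_alignment(strand_a, strand_b, prob)
```

The best score for each pair of strands is what would populate one cell of the strand pairing matrix.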
74Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Architecture
Strand Pairing Constraints
75Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
Goal: Find a set of connected subgraphs that maximize the sum of pseudo-energy and satisfy the constraints.
Algorithm: Minimum Spanning Tree Like Algorithm.
76Example of MST Like Algorithm
Assembly of beta-strands 1 to 7
Step 1: Pair strands 4 and 5
Strand Pairing Matrix of 1VJG (lower triangle, strands 1 to 7):
     1    2    3    4    5    6    7
1    0
2  1.3    0
3  .94  .37    0
4  .02  .02  .04    0
5  .02  .02  .03  1.9    0
6  .10  .05  .74  .04  .04    0
7  .02  .02  .03  .02  .02  .20    0
[Figure: strands 4 and 5 paired]
77Example of MST Like Algorithm
Assembly of beta-strands 1 to 7
Step 2: Pair strands 1 and 2
Strand Pairing Matrix of 1VJGA: unchanged from Step 1
[Figure: sheets so far 4-5 and 2-1; N terminus marked]
78Example of MST Like Algorithm
Assembly of beta-strands 1 to 7
Step 3: Pair strands 1 and 3
Strand Pairing Matrix of 1VJGA: unchanged from Step 1
[Figure: sheets so far 4-5 and 2-1-3; N terminus marked]
79Example of MST Like Algorithm
Assembly of beta-strands 1 to 7
Step 4: Pair strands 3 and 6
Strand Pairing Matrix of 1VJGA: unchanged from Step 1
[Figure: sheets so far 4-5 and 2-1-3-6; N terminus marked]
80Example of MST Like Algorithm
Assembly of beta-strands 1 to 7
Step 5: Pair strands 6 and 7
Strand Pairing Matrix of 1VJGA: unchanged from Step 1
[Figure: final sheets 4-5 and 2-1-3-6-7; N and C termini marked]
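The five steps of the 1VJG example can be reproduced with a greedy, Kruskal-style sketch. The strand pairing constraints are not listed in this transcript, so the ones used here are assumptions made for illustration only: at most two partners per strand, no cycles, and a minimum pseudo-energy of 0.1. The actual method may use different constraints.

```python
# Lower-triangular strand pairing matrix of 1VJG from the slides, as a dict.
SCORES = {
    (1, 2): 1.3, (1, 3): 0.94, (2, 3): 0.37,
    (1, 4): 0.02, (2, 4): 0.02, (3, 4): 0.04,
    (1, 5): 0.02, (2, 5): 0.02, (3, 5): 0.03, (4, 5): 1.9,
    (1, 6): 0.10, (2, 6): 0.05, (3, 6): 0.74, (4, 6): 0.04, (5, 6): 0.04,
    (1, 7): 0.02, (2, 7): 0.02, (3, 7): 0.03, (4, 7): 0.02, (5, 7): 0.02,
    (6, 7): 0.20,
}

def greedy_pairing(scores, max_partners=2, min_score=0.1):
    """Pick strand pairs in decreasing score order under assumed constraints."""
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = {}
    chosen = []
    for (a, b), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s < min_score:
            break  # remaining pairs are all weaker (assumed threshold)
        if degree.get(a, 0) >= max_partners or degree.get(b, 0) >= max_partners:
            continue  # a strand already has two partners (assumed limit)
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # pairing would close a cycle (assumed forbidden)
        parent[root_a] = root_b
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        chosen.append((a, b))
    return chosen

steps = greedy_pairing(SCORES)
```

Under these assumptions the pairs come out in the same order as Steps 1 to 5, and strands 4-5 stay in a separate sheet from 2-1-3-6-7.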
81Beta-Residue Pairing Results
- Sensitivity = Specificity = 41%
- Baseline: 2.3%. Ratio of improvement: 17.8.
- ROC area: 0.86
- At 5% FPR, TPR is 58%
- CMAPpro
  - Spec. and Sens. are 27%. ROC area: 0.8.
  - TPR is 42% at 5% FPR.
82Strand Pairing Results
- Naïve algorithm of pairing all adjacent strands
  - Specificity: 42%
  - Sensitivity: 50%
- MST-like algorithm
  - Specificity: 53%
  - Sensitivity: 59%
  - More than 20% of correctly predicted strand pairs are non-adjacent strand pairs
83Strand Alignment Results
On the correctly predicted pairs:
          Pairing Direction  Align. All  Align. Anti-P  Align. Para.  Align. Bridge
Acc. (%)  93                 72          69             71            88
On all native pairs:
          Pairing Direction  Align. All  Align. Anti-P  Align. Para.  Align. Bridge
Acc. (%)  84                 66          63             66            73
- Pairing direction accuracy is 15% higher than that of a random algorithm.
- Alignment accuracy is improved by more than 15%.
84Application and Future Work
- New methods for beta-residue pairings (e.g. Linear Programming, SVM), and for strand alignment and pairings. More inputs (Punta and Rost, 2005).
- Applications
  - Ab Initio Structure Sampling (beta-sheet)
  - Fold Recognition (conservation of beta-sheets)
  - Contact Map
  - Model Refinement (pairing direction/alignment)
- Web server and dataset
  - http://www.ics.uci.edu/baldig/betasheet.html
85A New Fold Example (CASP6)
- 1S12 (T0201, 94 residues)
True SS:
CEEEEECCCEEEEECCCCCHHHHHHHHHHHHHHHHHHHHCCCEEEEEECCEEEEEECCCCHHHHHHHHHHHHHHHHHHHHCCCCEEEEECCCCCC
Predicted SS:
CEEEEEECCEEEECCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHEHHCCCCEEEEHHHHHHHHHHHHHHHHHHHHHHHHHCCCCEEEEEEECCC
True strand pairs: 1-2, 2-4, 3-4, 1-5
Predicted strand pairs: 1-2, 2-4, 3-4, 4-5
Strand Pairing Matrix (upper triangle):
     1     2     3     4     5
1    0  1.71   .05   .29   .33
2          0   .06   .41   .12
3                0   .22   .04
4                      0   .53
5                            0
[Figure: structure rendered in RasMol, with strands labeled 5, 1, 2, 4, 3]
86ACKNOWLEDGMENTS
- UCI
  - Gianluca Pollastri, Pierre-Francois Baisnee, Michal Rosen-Zvi
  - Arlo Randall, S. Joshua Swamidass, Jianlin Cheng, Yimeng Dou, Yann Pecout, Mike Sweredoski, Alessandro Vullo, Lin Wu
  - James Nowick, Luis Villareal
- DTU: Soren Brunak
- Columbia: Burkhard Rost
- U of Florence: Paolo Frasconi
- U of Bologna: Rita Casadio, Piero Fariselli
- www.igb.uci.edu/
- www.ics.uci.edu/pfbaldi
87 1DFN Defensin
88A Perfectly Predicted Example
Sequence with cysteine positions identified:
MSNHTHHLKFKTLKRAWKASKYFIVGLSC29LYKFNLKSLVQTALST
LAMITLTSLVITAIIYISVGNAKAKPTSKPTIQQTQQPQNHTSPFFTEHN
YKSTHTSIQSTTLSQLLNIDTTRGITYGHSTNETQNRKIKGQSTLPATRK
PPINPSGSIPPENHQDHNNFQTLPYVPC173STC176EGNLAC182
LSLC186HIETERAPSRAPTITLKKTPKPKTTKKPTKTTIHHRT
SPETKLQPKNNTATPQQGILSSTEHHTNQSTTQI
Length: 257. Total number of cysteines: 5.
Four bonded cysteines form two disulfide bonds: 173-186 (red cysteine pair), 176-182 (blue cysteine pair)
Prediction results from DIpro (http://contact.ics.uci.edu/bridge.html):
Predicted bonded cysteines: 173, 176, 182, 186
Predicted disulfide bonds:
Bond_Index  Cys1_Position  Cys2_Position
1           173            186
2           176            182
Prediction accuracy for both bond state and bond pair is 100%.
89A Hard Example with Many Non-Bonded Cysteines
Sequence with cysteine positions identified:
MTLGRRLAC9LFLAC14VLPALLLGGTALASEIVGGRRARPHAWP
FMVSLQLRGGHFC55GATLIAPNFVMSAAHC71VANVNVRAVRVVL
GAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVILQLNGSATINANVQ
VAQLPAQGRRLGNGVQC151LAMGWGLLGRNRGIASVLQELNVTVVTS
LC181RRSNVC187TLVRGRQAGVC198FGDSGSPLVC208N
GLIHGIASFVRGGC223ASGLYPDAFAPVAQFVNWIDSIIQRSEDNPC254
PHPRDPDPASRTH
Length: 267. Total number of cysteines: 11.
Eight bonded cysteines form four disulfide bonds: 55-71 (Red), 151-208 (Blue), 181-187 (Green), 198-223 (Purple)
Prediction results from DIpro (http://contact.ics.uci.edu/bridge.html):
Predicted bonded cysteines: 9, 14, 55, 71, 181, 187, 223, 254
Predicted disulfide bonds:
Bond_Index  Cys1_Position  Cys2_Position
1           55             71             (correct)
2           9              14             (wrong)
3           223            254            (wrong)
4           181            187            (correct)
Bond State Recall: 5 / 8 = 0.625. Bond State Precision: 5 / 8 = 0.625.
Pair Recall: 2 / 4 = 0.5. Pair Precision: 2 / 4 = 0.5.
Bond number is predicted correctly.
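The four figures quoted for this example follow from set intersections over the true and predicted bonds. A short sketch using the cysteine positions from the slide (the helper names are illustrative):

```python
# True and predicted disulfide bonds for the hard example, from the slide.
true_pairs = {(55, 71), (151, 208), (181, 187), (198, 223)}
pred_pairs = {(55, 71), (9, 14), (223, 254), (181, 187)}

def bonded(pairs):
    """Set of cysteine positions taking part in any bond."""
    return {c for pair in pairs for c in pair}

def recall_precision(true_set, pred_set):
    """Recall = hits / true items; precision = hits / predicted items."""
    hits = len(true_set & pred_set)
    return hits / len(true_set), hits / len(pred_set)

state_recall, state_precision = recall_precision(bonded(true_pairs), bonded(pred_pairs))
pair_recall, pair_precision = recall_precision(true_pairs, pred_pairs)
```

Cysteine 223 is counted as a bond-state hit even though it is paired with the wrong partner, which is why the bond-state figures exceed the pair figures.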
90Prediction Accuracy on SP51 Dataset on All Cysteines
Bond Num  Bond State Recall (%)  Bond State Precision (%)  Pair Recall (%)  Pair Precision (%)
1 91 46 74 39
2 93 77 61 51
3 90 74 54 45
4 77 87 52 59
5 71 86 33 42
6 65 84 27 34
7 63 85 36 55
8 66 89 27 41
9 60 83 23 35
10 55 86 30 45
11 62 86 34 47
12 67 97 17 23
15 50 94 27 50
16 82 99 11 13
17 61 96 22 33
18 50 82 6 9
19 47 90 11 20
Overall bond state recall: 78%. Overall bond state precision: 74%. Bond number prediction accuracy: 53%. Average difference between true bond number and predicted bond number: 1.1.
91CURRENT WORK
- Feedback
  - Ex: SS → Contacts → SS → Contacts
- Homology, homology, homology
- SSpro 4.0 performs at 88%